Dealing with a multi-index pandas Series and DataFrame

1. Introduction

Whether we like it or not, in pandas we will come across to Series or DataFrame with multi-index. A multi-index often be generated from method .groupby() or .set_index(). We will tend to use reset_index() to set it back to normal Series/ DataFrame. But in some situation knowing how to deal with a multi-index will be a benifit. The method we used in multi-index will give us a deeper understanding of DataFrame and Series.

In this article we will talk about:

  • What is a multi-index Series/ DataFrame?
  • How to select a multi-index Series/ DataFrame?
  • How to concat two multi-index DataFrame?

 

2. What is a multi-index Series/ DataFrame?

Visually we can know which Series/ DataFrame is a multi-index. But also we can use attribute ".index" to check if it is a multi-index. Pandas will show it in the return values.

We have a simple dataset as below for example:

import pandas as pd

stocks = pd.read_csv(r'stocks.csv')

 

2.1 Visually check a multi-index:

s = stocks.groupby(['Symbol', 'Date'])['Close'].mean()

2.2 Using ".index" attribute to check a multi-index:

s.index

2.3 On a deeper level:

A multi-index Series/ DataFrame is no more than a Series/ DataFrame as normal ones, but has an added-dimention.It makes:

  • A multi-index Series acts more like an normal DataFrame (we will talk about this again in the next section).
  • A multi-index Series/ DataFrame can be .unstack().
s.unstack()

This unstack() data table is by no means a tidy-data (useful information Date series is in the columns name), which makes it not so helpful.

 

3. How to select a multi-index?

3.1 Select multi-index Series

We can use .loc[] from outer level to inner level, it will return the value.

Also we can skip the outer level with .loc[:, ...], it will return a normal Series.

With this multi-index Series:

s.loc['AAPL', '2016-10-03']

s.loc[:, '2016-10-03']

  

If we look carefully, this .loc[] operation is exactly the same as we are choosing a DataFrame. Frist element is rows name, comma, second element is columns name.

Actually we are choosing the DataFrame as we .unstack() from original multi-index Series.

 

3.2 Select multi-index DataFrame

As exmaple, we use a dataframe as below:

df = stocks.set_index(['Symbol', 'Date'])

We can use .loc[] from outer level to inner level. But different with multi-index Series, as it is already a DataFrame, so we can not just use two commas to seperate.

(This operation is unvaild)

df.loc["AAPL", "2016-10-03", "Close"]

 

We can use a tuple to tell Python which is for row selection. After that we can use a comma to choose columns.

df.loc[('AAPL', '2016-10-03'), :]

  

df.loc[('AAPL', '2016-10-03'), 'Close']

 3.3 Select multi-index DataFrame more intelligence: both outer level and inner level can be a list, for multi-selection.

df.loc[(['AAPL', 'MSFT'], '2016-10-03'), 'Close']

df.loc[('AAPL', ['2016-10-03', '2016-10-04']), 'Close']

Sometime we may want to skip the outer level. We may think of:

(This operation is unvaild)

df.loc[(:, '2016-10-03'), 'Close']

The correct way is using built-in function slice().

df.loc[(slice(None), '2016-10-03'), :]

3.4 Reindex a multi-index DataFrame

Another common use of .loc[] is to re-arrange index order of a DataFrame. But a mulit-index DataFrame will ignore this operation, without warning or error.

For example, we have a multi-index DataFrame as below:

example = stocks.groupby(["Symbol", "Date"])["Close"].mean().to_frame()

What if I want the output of this table is in order CSCO, APPL, MSFT?

We may try to do as below, it works with normal DataFrame, and it is most common used when I want to reindex a table.

(This operation is unvaild)

example.loc[["CSCO", "AAPL", "MSFT"]]

The correct way is not so straight-forward. It uses reindex() with pd.MultiIndex.from_product() as below:

example.reindex(pd.MultiIndex.from_product([["CSCO", "AAPL", "MSFT"],
example.index.get_level_values(1).unique()], names=['Symbol', 'Date']),fill_value=0)

  

 More complex use case of multi-index DataFrame reindex can be found here.

 3.5 Adding rows to multi-index DataFrame

With the same .loc[] operation logic mentioned above, we can add a new row to multi-index DataFrame as we may do in normal ones.

example.loc[("合计", "-"), :] = example["Close"].sum()

  

 

4. How to concat two multi-index DataFrame?

 If we have new columns to add, we can use pd.merge(). But we have to use arguments left_index=True and right_index=True.

df1 = df.loc[(slice(None), '2016-10-04'), 'Close'].to_frame()
df2 = df.loc[(slice(None), '2016-10-04'), 'Volume'].to_frame()

 

pd.merge(df1, df2, left_index=True, right_index=True)

  

 

 

 

posted @ 2019-12-12 19:46  DrVonGoosewing  阅读(745)  评论(0)    收藏  举报