Re-index to insert missing rows in a multi-indexed dataframe - pandas

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
       val
A B C
1 1 1    3
2 2 1    4
3 3 2    5
If the required values for C are [1,2,3], the desired output would be:
       val
A B C
1 1 1    3
    2    0
    3    0
2 2 1    4
    2    0
    3    0
3 3 1    0
    2    5
    3    0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)

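You can build the full list of index tuples explicitly, create a zero-filled DataFrame on that index, and then update it with the existing values: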
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
       val
A B C
1 1 1  3.0
    2  0.0
    3  0.0
2 2 1  4.0
    2  0.0
    3  0.0
3 3 1  0.0
    2  5.0
    3  0.0
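Since the question asks about reindex, here is a minimal sketch of that route. It assumes the same df and C_reqd as above; the names pairs, full_index and result are only illustrative. Taking the unique (A, B) pairs first avoids repeated rows when a pair already has several C values, and fill_value=0 keeps the integer dtype instead of converting to float.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 3], [2, 2, 1, 4], [3, 3, 2, 5]],
                  columns=['A', 'B', 'C', 'val']).set_index(['A', 'B', 'C'])
C_reqd = [1, 2, 3]

# Unique (A, B) pairs that already exist in the index
pairs = df.index.droplevel('C').unique()

# Every existing (A, B) pair combined with every required C value
full_index = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in C_reqd],
    names=['A', 'B', 'C'],
)

# Rows missing from df are created and filled with 0
result = df.reindex(full_index, fill_value=0)
print(result)
```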

Related

How to subtract 2 data frames with the same size?

How can I subtract the values of 2 different data frames with the same size and columns?
For example, df1-df2 in the following:
df1:
A B
4 5
0 6
df2:
A B
6 0
7 1
output:
diff:
A B
-2 5
-7 5
Note: I have too many columns and rows, so please don't suggest manual methods. No for loops, please.
I guess this is what you want.
df1 = pd.DataFrame({"A": [4,0], "B": [5,6]})
df2 = pd.DataFrame({"A": [6,7], "B": [0,1]})
df = df1 - df2
df
Out[4]:
A B
0 -2 5
1 -7 5
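One thing worth knowing is that the subtraction aligns on row and column labels, not on position. The sketch below uses the same df1 and df2; df2_shifted is a hypothetical frame with different row labels, added only to show what happens when the labels do not match and how to fall back to a purely positional subtraction.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [4, 0], "B": [5, 6]})
df2 = pd.DataFrame({"A": [6, 7], "B": [0, 1]})

# Method form of the same label-aligned subtraction
diff = df1.sub(df2)

# Hypothetical frame: same values as df2, but different row labels
df2_shifted = df2.copy()
df2_shifted.index = [10, 11]

# Labels do not overlap, so every cell of the aligned result is NaN
misaligned = df1 - df2_shifted

# Dropping the labels subtracts by position instead
positional = df1 - df2_shifted.to_numpy()
print(positional)
```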

Get group counts of level 1 after doing a group by on two columns

I am doing a group by on two columns and need the count of the number of values in level-1
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
A B C
0 one 1 3
1 one 2 3
2 two 0 3
3 three 4 3
4 three 3 4
5 one 4 8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
         C
        sC cC
A     B
one   1  3  1
      2  3  1
      4  8  1
three 3  4  1
      4  3  1
two   0  3  1
I want an output something like this, where the last column tC gives me the count corresponding to groups one, two and three.
         C
        sC cC tC
A     B
one   1  3  1  3
      2  3  1
      4  8  1
three 3  4  1  2
      4  3  1
two   0  3  1  1
If there is only one column to aggregate, pass a list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get the counts with GroupBy.transform and GroupBy.size, and use numpy.where to keep the count only on the first row of each group:
s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
         sC  cC   tC
A     B
one   1   3   1  3.0
      2   3   1  NaN
      4   8   1  NaN
three 3   4   1  2.0
      4   3   1  NaN
two   0   3   1  1.0
You can also set the duplicated values to an empty string in the tC column, but then all later numeric operations on this column will fail, because it mixes numeric values with strings:
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
         sC  cC tC
A     B
one   1   3   1  3
      2   3   1
      4   8   1
three 3   4   1  2
      4   3   1
two   0   3   1  1
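As an alternative to the deprecated dict-with-renaming form, the same table can be sketched with named aggregation, assuming pandas 0.25 or later. The variable names out and first_level are illustrative. The result has flat columns sC, cC and tC rather than a top-level C.

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'],
                   'B': [1, 2, 0, 4, 3, 4],
                   'C': [3, 3, 3, 3, 4, 8]})

# Named aggregation: output_name=(input_column, function)
out = df.groupby(['A', 'B']).agg(sC=('C', 'sum'), cC=('C', 'count'))

# Number of rows per first-level group, repeated on every row of that group
out['tC'] = out.groupby(level='A')['cC'].transform('size')

# Keep the count only on the first row of each first-level group
first_level = out.index.get_level_values('A')
out['tC'] = out['tC'].where(~first_level.duplicated())
print(out)
```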

Why do I get a different group size using pandas groupby() with or without column selection?

I am trying to use numpy.size() to count the group sizes for the groups from a pandas DataFrame groupby(), and I get a strange result.
>>> df=pd.DataFrame({'A':[1,1,2,2], 'B':[1,2,3,4],'C':[0.11,0.32,0.93,0.65],'D':["This","That","How","What"]})
>>> df
A B C D
0 1 1 0.11 This
1 1 2 0.32 That
2 2 3 0.93 How
3 2 4 0.65 What
>>> df.groupby('A',as_index=False).agg(np.size)
A B C D
0 1 2 2.0 2
1 2 2 2.0 2
>>> df.groupby('A',as_index=False)['C'].agg(np.size)
A C
0 1 8
1 2 8
>>> df.groupby('A',as_index=False)[['C']].agg(np.size)
A C
0 1 2.0
1 2 2.0
>>> grouped = df.groupby('A',as_index=False)
>>> grouped['C','D'].agg(np.size)
A C D
0 1 2.0 2
1 2 2.0 2
In the code, when ['C'] follows groupby(), the reported group size is 8, which equals the correct group size times the number of columns, that is 2 * 4. When [['C']] or ['C','D'] follows groupby(), the group size is right.
Why?
It seems that pandas tries to execute the aggregation first and then does the actual column selection.
If you want to know the group size, use one of these:
grouped.size()
grouped.agg("size")
len(grouped)  # number of groups, not the size of each group
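To make the difference between these calls concrete, here is a small sketch. The frame is illustrative (not the one from the question) and is chosen so that the number of groups and the group sizes differ.

```python
import pandas as pd

# Illustrative data: group 1 has three rows, group 2 has one row
df = pd.DataFrame({'A': [1, 1, 1, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.11, 0.32, 0.93, 0.65]})
grouped = df.groupby('A')

sizes = grouped.size()           # rows per group, indexed by A
n_groups = len(grouped)          # how many groups there are
n_groups_attr = grouped.ngroups  # same number, as an attribute
counts = grouped['C'].count()    # non-null values of C per group

print(sizes)
print(n_groups)
```

size() counts rows including missing values, while count() skips NaN, so the two only agree when the selected column has no missing values.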

Assigning to a slice from another DataFrame requires matching column names?

If I want to set (replace) part of a DataFrame with values from another, I should be able to assign to a slice (as in this question) like this:
df.loc[rows, cols] = df2
Not so in this case, it nulls out the slice instead:
In [32]: df
Out[32]:
A B
0 1 -0.240180
1 2 -0.012547
2 3 -0.301475
In [33]: df2
Out[33]:
C
0 x
1 y
2 z
In [34]: df.loc[:,'B']=df2
In [35]: df
Out[35]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
But it does work with just a column (Series) from df2, which is not an option if I want multiple columns:
In [36]: df.loc[:,'B']=df2['C']
In [37]: df
Out[37]:
A B
0 1 x
1 2 y
2 3 z
Or if the column names match:
In [47]: df3
Out[47]:
B
0 w
1 a
2 t
In [48]: df.loc[:,'B']=df3
In [49]: df
Out[49]:
A B
0 1 w
1 2 a
2 3 t
Is this expected? I don't see any explanation for it in the docs or on Stack Overflow.
Yes, this is expected. Label alignment is one of the core features of pandas. When you use df.loc[:,'B'] = df2 it needs to align two DataFrames:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN NaN x
1 NaN NaN y
2 NaN NaN z)
The above shows how each DataFrame looks when aligned as a tuple (the first one is df and the second one is df2). If your df2 also had a column named B with values [1, 2, 3], it would become:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN 1 x
1 NaN 2 y
2 NaN 3 z)
Since B's are aligned, your assignment would result in
df.loc[:,'B'] = df2
df
Out:
A B
0 1 1
1 2 2
2 3 3
When you use a Series, the alignment will be on a single axis (on index in your example). Since they exactly match, there will be no problem and it will assign the values from df2['C'] to df['B'].
You can either rename the labels before the alignment or use a data structure that doesn't have labels (a numpy array, a list, a tuple...).
You can use the underlying NumPy array:
df.loc[:,'B'] = df2.values
df
A B
0 1 x
1 2 y
2 3 z
Pandas indexing is always sensitive to labeling of both rows and columns. In this case, your rows check out, but your columns do not. (B != C).
Using the underlying NumPy array makes the operation index-insensitive.
The reason that this does work when df2 is a Series is because Series have no concept of columns. The only alignment is on the rows, which are aligned.
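For the multi-column case raised in the question, both suggestions can be sketched as follows. The frames and column names (X, Y) are made up for illustration, and the values are kept numeric so that the assignment does not change the column dtype.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [0.0, 0.0, 0.0],
                   'C': [0.0, 0.0, 0.0]})
src = pd.DataFrame({'X': [1.5, 2.5, 3.5],
                    'Y': [4.5, 5.5, 6.5]})

# Option 1: drop the labels, so values are assigned by position
by_position = df.copy()
by_position.loc[:, ['B', 'C']] = src.to_numpy()

# Option 2: rename the source columns so that label alignment succeeds
by_label = df.copy()
by_label.loc[:, ['B', 'C']] = src.rename(columns={'X': 'B', 'Y': 'C'})

print(by_position)
print(by_label)
```

The positional variant requires the array shape to match the selected slice. The rename variant still aligns on the row index, so rows missing from the source would end up as NaN.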

Pandas: Delete duplicated items in a specific column

I have a pandas DataFrame (here represented using Excel):
Now I would like to delete all duplicates (1) of a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaNs with loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows based on column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
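Both duplicated and drop_duplicates accept a keep parameter that controls which occurrence counts as the original. A short sketch on the same sample data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9],
                   'B': [1, 2, 1, 3]})

# Default keep='first': later occurrences are flagged
mask_first = df['B'].duplicated()

# keep='last': earlier occurrences are flagged instead
mask_last = df['B'].duplicated(keep='last')

# keep=False: every member of a duplicated group is flagged
mask_all = df['B'].duplicated(keep=False)

# The same option applies when dropping whole rows
kept_last = df.drop_duplicates(subset=['B'], keep='last')
no_dupes_at_all = df.drop_duplicates(subset=['B'], keep=False)
print(kept_last)
```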