Get group counts of level 1 after doing a group by on two columns - pandas

I am doing a groupby on two columns and need the count of the number of values in the first level.
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
       A  B  C
0    one  1  3
1    one  2  3
2    two  0  3
3  three  4  3
4  three  3  4
5    one  4  8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
          C
         sC cC
A     B
one   1   3  1
      2   3  1
      4   8  1
three 3   4  1
      4   3  1
two   0   3  1
I want an output something like this, where the last column tC gives me the count corresponding to the groups one, two and three.
          C
         sC cC tC
A     B
one   1   3  1  3
      2   3  1
      4   8  1
three 3   4  1  2
      4   3  1
two   0   3  1  1

If there is only one column to aggregate, pass a list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get per-group counts with GroupBy.transform('size'), and keep the count only on the first row of each group with numpy.where and Series.duplicated:
import numpy as np

s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
         sC  cC   tC
A     B
one   1   3   1  3.0
      2   3   1  NaN
      4   8   1  NaN
three 3   4   1  2.0
      4   3   1  NaN
two   0   3   1  1.0
You can also set the duplicated values in the tC column to an empty string, but then all later numeric operations on this column will fail, because it mixes numbers with strings:
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
         sC  cC tC
A     B
one   1   3   1  3
      2   3   1
      4   8   1
three 3   4   1  2
      4   3   1
two   0   3   1  1
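Side note: as the FutureWarning above says, the dict-with-renaming syntax was deprecated and later removed. On pandas 0.25 or newer, the same renamed aggregations can be written with named aggregation; a minimal sketch:
# named aggregation: output column name = (source column, aggregation function)
out = df.groupby(['A', 'B']).agg(sC=('C', 'sum'), cC=('C', 'count'))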

Add new columns using a pandas Series

I have a pandas DataFrame and a pandas Series. I want to add new constant columns to the DataFrame whose names and values come from the Series. An example:
In [1]: import pandas as pd
   ...: df1 = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,2,3,2,5]})
In [2]: df1
Out[2]:
   a  b
0  1  2
1  2  2
2  3  3
3  4  2
4  5  5
In [3]: s1 = pd.Series({'c':2, 'd':3})
In [4]: s1
Out[4]:
c 2
d 3
dtype: int64
In [5]: for key, value in s1.to_dict().items():
   ...:     df1[key] = value
My ugly loop does what I want, but surely there must be a better solution, maybe using some merge or group operation.
In [6]: df1
Out[6]:
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
Any suggestions?
Use assign, unpacking the Series with **:
df1 = df1.assign(**s1)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
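Note that ** unpacking requires the Series labels to be valid keyword names, i.e. strings; if the labels are not strings, converting them first still works. A sketch of that edge case:
df1 = df1.assign(**{str(k): v for k, v in s1.items()})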
A NumPy solution: build a new DataFrame with numpy.broadcast_to, then join:
import numpy as np

df = pd.DataFrame(np.broadcast_to(s1.values, (len(df1), len(s1))),
                  index=df1.index,
                  columns=s1.index)
df1 = df1.join(df)
print (df1)
   a  b  c  d
0  1  2  2  3
1  2  2  2  3
2  3  3  2  3
3  4  2  2  3
4  5  5  2  3
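If neither of the above appeals, the same constant block can also be built explicitly and concatenated; a minimal sketch:
const = pd.DataFrame([s1] * len(df1), index=df1.index)  # one identical row per row of df1
df1 = pd.concat([df1, const], axis=1)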

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
       val
A B C
1 1 1    3
2 2 1    4
3 3 2    5
If the required values for C are [1, 2, 3], the desired output would be:
       val
A B C
1 1 1    3
    2    0
    3    0
2 2 1    4
    2    0
    3    0
3 3 1    0
    2    5
    3    0
I know how to achieve this using groupby and applying a function to each group, but I was wondering if there is a cleaner way of doing it with reindex (I couldn't make that work for the MultiIndex case, but perhaps I'm missing something).
Use:
partial_indices = [i[0:2] for i in df.index.values]
C_reqd = [1, 2, 3]
final_indices = [j + (i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.Series(0, index=index).to_frame('val')
df2.update(df)
Output
df2
       val
A B C
1 1 1  3.0
    2  0.0
    3  0.0
2 2 1  4.0
    2  0.0
    3  0.0
3 3 1  0.0
    2  5.0
    3  0.0
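Since the asker specifically wondered about reindex: it does work for the MultiIndex case once the full target index is built from the existing (A, B) pairs; a minimal sketch:
C_reqd = [1, 2, 3]
pairs = df.index.droplevel('C').unique()  # existing (A, B) combinations
full = pd.MultiIndex.from_tuples([ab + (c,) for ab in pairs for c in C_reqd],
                                 names=['A', 'B', 'C'])
df2 = df.reindex(full, fill_value=0)      # missing rows filled with 0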

Create a column of counts in a pandas dataframe

I want to create a column of counts in a pandas dataframe. Here is the input:
data = {'id': [1, 2, 3, 4, 5, 6], 'cat': ['A', 'A', 'A', 'A', 'A', 'B'], 'status': [1, 1, 1, 1, 2, 1]}
df = pd.DataFrame(data)
   id cat  status
0   1   A       1
1   2   A       1
2   3   A       1
3   4   A       1
4   5   A       2
5   6   B       1
Preferred output:
   id cat  status  status_1_for_cat_count  status_2_for_category_count
0   1   A       1                       4                            1
1   2   A       1                       4                            1
2   3   A       1                       4                            1
3   4   A       1                       4                            1
4   5   A       2                       4                            1
5   6   B       1                       1                            0
As can hopefully be seen, I'm trying to add the full counts for each row to two columns (one for each status). I have tried several approaches, mostly groupby in combination with value_counts, transform, apply, filter, merges and what not, but have not been able to get it to work. I can do this easily on a single column (I want to create a column of value_counts in my pandas dataframe), but not with two different statuses combined with the category.
Another option: use pd.crosstab to create a two-way table with cat as the index, then join back with the original data frame on the cat column:
df.join(pd.crosstab(df.cat, 'status_' + df.status.astype(str)), on='cat')
#   cat  id  status  status_1  status_2
# 0   A   1       1         4         1
# 1   A   2       1         4         1
# 2   A   3       1         4         1
# 3   A   4       1         4         1
# 4   A   5       2         4         1
# 5   B   6       1         1         0
You can use get_dummies first, then groupby with transform, i.e.:
one = pd.get_dummies(df.set_index(['id','cat']).astype(str))
two = one.groupby(['cat']).transform('sum').reset_index()
   id cat  status_1  status_2
0   1   A         4         1
1   2   A         4         1
2   3   A         4         1
3   4   A         4         1
4   5   A         4         1
5   6   B         1         0
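Since the question mentions trying transform: a per-status transform also works directly on the original frame; a minimal sketch (the column names are shortened here for illustration):
for s in (1, 2):
    # boolean mask per status, summed within each cat group and broadcast back to every row
    df['status_%d_count' % s] = df['status'].eq(s).groupby(df['cat']).transform('sum')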

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and store those positions in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
  Num
0   1
1   2
2   3
3   4
4   5
dataframe : df2
    0   1  2   3   4
0   9  12  8   6   7
1  11   1  4  10  13
2   5  14  2   0   3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
  Num  X  Y
0   1  1  1
1   2  2  2
2   3  2  4
3   4  1  2
4   5  2  0
any hints or ideas would be appreciated
Instead of trying to find each value in df2, why not just make df2 a flat dataframe? Keep the original row index as a column so it survives the melt:
df2 = df2.reset_index().melt(id_vars='index')
df2.columns = ['X', 'Y', 'Num']
so now your df2 looks like this (first rows shown), with the original row in X and the column in Y:
   X  Y  Num
0  0  0    9
1  1  0   11
2  2  0    5
3  0  1   12
4  1  1    1
5  2  1   14
You can of course sort by Num, and if you just want the values from your list you can filter df2 further (note my_list holds strings while the values are integers, so cast before comparing):
df2 = df2[df2.Num.astype(str).isin(my_list)]
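Alternatively, staying closer to the asker's np.where attempt and working on the original (unmelted) df2, the positions can be looked up value by value; a minimal sketch that assumes each value occurs exactly once and casts the strings in my_list to int to match df2:
import numpy as np

xs, ys = [], []
for x in my_list:
    i, j = np.where(df2.values == int(x))  # row and column positions of the match
    xs.append(i[0])
    ys.append(j[0])
df1['X'] = xs
df1['Y'] = ys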

How to remove levels from a multi-indexed dataframe?

For example, I have:
In [1]: df = pd.DataFrame([8, 9],
   ...:                   index=pd.MultiIndex.from_tuples([(1, 1, 1),
   ...:                                                    (1, 3, 2)]),
   ...:                   columns=['A'])
In [2]: df
Out[2]:
       A
1 1 1  8
  3 2  9
Is there a better way to remove the last level from the index than this:
In [3]: pd.DataFrame(df.values,
   ...:              index=df.index.droplevel(2),
   ...:              columns=df.columns)
Out[3]:
     A
1 1  8
  3  9
df.reset_index(level=2, drop=True)
Out[29]:
     A
1 1  8
  3  9
You don't need to create a new DataFrame instance! You can modify the index:
df.index = df.index.droplevel(2)
df
     A
1 1  8
  3  9
You can also specify negative indices, for selection from the end:
df.index = df.index.droplevel(-1)
If your index has names like
       A
X Y Z
1 1 1  8
  3 2  9
Then you can also remove by specifying the index name
df.index = df.index.droplevel('Z')
Since pandas 0.24, we can call droplevel directly on the DataFrame. So, to drop the last level of the index:
>>> df
         col
1 5 1 4  foo
  3 2 8  bar
2 4 3 7  saz
>>> df.droplevel(-1)
       col
1 5 1  foo
  3 2  bar
2 4 3  saz
The axis whose levels are dropped can be controlled with the axis argument; it defaults to 0, i.e., the index. Multiple levels can be dropped at once by passing a list, and if the levels are named, the names can be used too (see the sketch below, and the docs for more examples).
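A minimal sketch of both options (the frames here are made up for illustration):
idx = pd.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'y', 2)],
                                names=['l0', 'l1', 'l2'])
dfm = pd.DataFrame({'col': [10, 20]}, index=idx)
dfm.droplevel(['l0', 'l1'])       # drop several index levels at once, by name

cols = pd.MultiIndex.from_tuples([('c', 'u'), ('d', 'v')],
                                 names=['top', 'bottom'])
dfc = pd.DataFrame([[1, 2]], columns=cols)
dfc.droplevel('bottom', axis=1)   # drop a column level via axis=1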
Note: the argument to droplevel is first interpreted as a label; so if any of the levels happens to have an integer name, that label is dropped, i.e., the drop is not positional:
>>> df
                 col
this -1 other 0
1    5  1     4  foo
     3  2     8  bar
2    4  3     7  saz
# literally drops the level named `-1`
>>> df.droplevel(-1)
              col
this other 0
1    1     4  foo
     2     8  bar
2    3     7  saz
# literally the level named `0` is dropped
>>> df.droplevel(0)
              col
this -1 other
1    5  1     foo
     3  2     bar
2    4  3     saz
To make sure the drop happens positionally, go through the names attribute and select the name positionally there:
>>> df
                 col
this -1 other 0
1    5  1     4  foo
     3  2     8  bar
2    4  3     7  saz
# go get the name of the last level, drop whatever it is
>>> df.droplevel(df.index.names[-1])
              col
this -1 other
1    5  1     foo
     3  2     bar
2    4  3     saz
# similarly...
>>> df.droplevel(df.index.names[0])
           col
-1 other 0
5  1     4  foo
3  2     8  bar
4  3     7  saz
Lastly, droplevel returns a new dataframe, so df = df.droplevel(...) is needed to see the change in df.