How to indicate the count of values in a categorical column in Pandas (Python)?

I have the following Pandas DataFrame:
ID CAT
1 A
1 B
1 A
2 A
2 B
2 A
1 B
1 A
I'd like to have a table that indicates the number of occurrences of each CAT value for each ID, in separate columns like this:
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
I tried in many ways, like this one with pivot table, but unsuccessfully:
df.pivot_table(values='CAT', index='ID', columns='CAT', aggfunc='count')

You can use crosstab():
df=pd.DataFrame(data={'ID':[1,1,1,2,2,2,1,1],'CAT':['A','B','A','A','B','A','B','A']})
final = pd.crosstab(df['ID'], df['CAT'])
final.columns=['CAT_A_NUM','CAT_B_NUM']
final
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
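If the set of CAT values isn't fixed, hardcoding the new column names can break. A small sketch (same df as above, not from the original answer) that derives the CAT_*_NUM names from whatever categories crosstab returns:
final = pd.crosstab(df['ID'], df['CAT'])
final.columns = [f'CAT_{c}_NUM' for c in final.columns]  # e.g. CAT_A_NUM, CAT_B_NUM
final = final.reset_index()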

You can probably use groupby + unstack:
df.groupby(["ID","CAT"]).size().unstack()
which gives
CAT  A  B
ID
1    3  2
2    2  1
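For completeness, the pivot_table route from the question can also work if you count rows with aggfunc='size' instead of trying to count the CAT column itself. A minimal sketch, assuming the same df as above:
out = (df.pivot_table(index='ID', columns='CAT', aggfunc='size', fill_value=0)
         .add_prefix('CAT_')
         .add_suffix('_NUM')   # columns become CAT_A_NUM, CAT_B_NUM
         .reset_index())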

Related

How to merge two datasets on incomplete columns?

I want to merge two datasets on 'key1' and 'key2' columns so that in case of missing value, for example, in the 'key2' column, it would take all combinations of the second key that belong to the first key. Here is an example:
def merge_nan_as_any(mask, data, on, how):
    ...
mask = pd.DataFrame({'key1': [1, 1, 2, 2],
                     'key2': [None, 3, 1, 2],
                     'value2': [1, 2, 3, 4]})
data = pd.DataFrame({'key1': [1, 1, 1, 2, 2, 2],
                     'key2': [1, 2, 3, 1, 2, 3],
                     'value1': [1, 2, 3, 4, 5, 6]})
result = merge_nan_as_any(mask, data, on=['key1', 'key2'], how='left')
result = pd.DataFrame({'key1': [1, 1, 1, 1, 2, 2],
                       'key2': [1, 2, 3, 3, 1, 2],
                       'value2': [1, 1, 1, 2, 3, 4],
                       'value1': [1, 2, 3, 3, 4, 5]})
There is a missing value for the second key, so the merge should take all rows from the second dataset that satisfy the condition: key1 must equal 1, and key2 can be any key2 value from the second dataset. How can I do that?
The first obvious solution that came to my mind is to iterate over the first dataset and filter out the combinations that satisfy the condition; the second is to split the first dataset into several pieces that have NaNs in the same columns and merge each of them on the columns that do have values.
But I don't like these solutions and suspect there is a more elegant way to do what I want.
I would appreciate any help!
Simple approach: merge on key1/key2 for the non-NaN rows, merge on key1 only for the NaN rows, and concat:
m = mask['key2'].notna()
result = pd.concat([data.merge(mask[~m].drop(columns='key2'), on='key1'),
                    data.merge(mask[m], on=['key1', 'key2']),
                    ], ignore_index=True)
Output:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4
I would begin by filling the null values with a list of all unique values from the other dataframe. Then, explode it to get all possible combinations and transform back to numeric. Finally, merge them both achieving the expected output:
mask['key2'] = (mask['key2']
                .fillna(' '.join([str(x) for x in data['key2'].unique()]))
                .astype(str)
                .str.split(' '))
mask = mask.explode('key2')
mask['key2'] = pd.to_numeric(mask['key2'])
pd.merge(mask, data, on=['key1', 'key2'], how='left')
Outputting:
key1 key2 value2 value1
0 1 1 1 1
1 1 2 1 2
2 1 3 1 3
3 1 3 2 3
4 2 1 3 4
5 2 2 4 5
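A variant of the same idea that skips the string round-trip is to fill the missing key2 with a Python list of candidate keys and explode that directly. A sketch, assuming the same mask/data as above:
# fill NaN key2 with every candidate key2 from data; other rows become one-element lists
mask['key2'] = mask['key2'].apply(
    lambda v: list(data['key2'].unique()) if pd.isna(v) else [v])
mask = mask.explode('key2')
mask['key2'] = pd.to_numeric(mask['key2'])  # explode leaves an object column
result = pd.merge(mask, data, on=['key1', 'key2'], how='left')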
Use pandasql; it will be easy:
mask.sql("""
select data.*,self.value2
from self left join data
on self.key1=data.key1 and (self.key2=data.key2 or self.key2 is null)
""",**globals())
out:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4
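Note that a .sql method on a DataFrame is not part of stock pandas. With the pandasql package as usually documented, the entry point is sqldf; a rough sketch of the same query (the mask and data names are looked up in globals()):
from pandasql import sqldf

query = """
select data.*, mask.value2
from mask left join data
  on mask.key1 = data.key1
 and (mask.key2 = data.key2 or mask.key2 is null)
"""
result = sqldf(query, globals())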

Pandas: How to aggregate a column with multiple functions and add the results as other columns?

Suppose I have a dataframe like:
A B
0 1 1
1 1 2
2 2 3
3 2 4
I want to add min of B and max of B as new columns named minB and maxB.
Expected
A minB maxB
0 1 1 2
1 2 3 4
Use Named Aggregation:
df.groupby("A", as_index=False).agg(
minB=("B", "min"),
maxB=("B", "max")
)
Use numpy.min & numpy.max:
In [472]: import numpy as np
In [473]: df.groupby('A').agg({'B':[np.min, np.max]}).reset_index()
Out[473]:
   A     B
      amin  amax
0  1     1     2
1  2     3     4
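In recent pandas versions, passing the numpy functions directly to agg is discouraged (and produces the amin/amax labels above); string aggregator names sidestep both issues. A small sketch that also flattens the columns to the names the question asks for:
out = df.groupby('A')['B'].agg(['min', 'max']).reset_index()
out.columns = ['A', 'minB', 'maxB']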

Convert subset of rows to column pyspark dataframe

Suppose we have the following df
Id PlaceCod Val
1 1 0
1 2 3
2 2 4
2 1 5
3 1 6
How can I convert this DF to this one:
Id Store Warehouse
1 0 3
2 5 4
3 6 null
I've tried to use df.pivot(f.col("PlaceCod")) but got the error message 'DataFrame has no pivot attribute'.
As posted by @Emma in the comments:
df.groupby('Id').pivot('PlaceCod').agg(F.first('Val'))
Using the above solution, my problem was solved!
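For reference, a fuller sketch with the import and the Store/Warehouse labels from the expected output. The mapping PlaceCod 1 -> Store, 2 -> Warehouse is only implied by the example, so treat it as an assumption:
from pyspark.sql import functions as F

out = (df.groupBy('Id')
         .pivot('PlaceCod')
         .agg(F.first('Val'))
         .withColumnRenamed('1', 'Store')      # pivoted columns are named after the PlaceCod values
         .withColumnRenamed('2', 'Warehouse'))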

Dataframe merge by row

I have two pandas DataFrames and I want to merge df2 onto each row of df1, based on the ID in df1. The final DataFrame should look like df3.
How do I do it? I tried merge, join and concat and didn't get what I wanted.
df1
ID Division
1 10
2 2
3 4
... ...
df2
Product type Level
1 0
1 1
1 2
2 0
2 1
2 2
2 3
df3
ID Product type Level Division
1 1 0 10
1 1 1 10
1 1 2 10
1 2 0 10
1 2 1 10
1 2 2 10
1 2 3 10
and repeat for ID 2 and ......
Looks like you are looking for a Cartesian product of two dataframes. The following approach should achieve what you want:
(df1.assign(key=1)
    .merge(df2.assign(key=1))
    .drop('key', axis=1))
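On pandas 1.2 or newer, the helper key column is not needed at all, since merge supports a cross join directly:
result = df1.merge(df2, how='cross')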
Consider this option: set the index in both DataFrames to 0, perform an outer join (on the indices, so the result is just the Cartesian product), then reset the index.
The code to do it is:
df1.index = [0] * df1.index.size
df2.index = [0] * df2.index.size
result = df1.join(df2, how='outer').reset_index(drop=True)

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there also a more elegant way to number the records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
        id  value
id
1  0     1      1
   1     1      2
2  3     2      1
   4     2      2
3  7     3      1
4  8     4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1  2    3
   1    2
2  6    4
   5    3
3  7    1
4  8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
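For example, both steps in one chain (a small sketch building on the same groupby):
top2 = (df.groupby('id')['value']
          .nlargest(2)
          .reset_index(level=1, drop=True))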
Sometimes sorting the whole dataset ahead of time is very time consuming. We can group first and then take the top k rows of each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head() is the same as the value we would pass to nlargest: the number of rows to keep for each group.
reset_index is optional and not strictly necessary.
This works for duplicated values
If you have duplicates among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k, and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
   .apply(lambda ser: ser.drop_duplicates().nlargest(3))
   .droplevel(level=1)
   .sort_index()
   .reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])