Pandas groupby and max of string column

Sample DF:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(6, 2)), columns=list("AB"))
df["A"] = ["1111","2222","1111","1111","2222","1111"]
df["B"] = ["20010101","20010101","20010101","20010101","20010201","20010201"]
df
OP:
A B
0 1111 20010101
1 2222 20010101
2 1111 20010101
3 1111 20010101
4 2222 20010201
5 1111 20010201
I am trying to find the maximum number of transactions done by each user ID in a single day.
For example, ID "1111" has done 3 transactions on "20010101" and 1 transaction on "20010201", so the maximum here should be 3, while ID "2222" has done 1 transaction on "20010101" and 1 transaction on "20010201", so the output is 1.
Expected OP:
MAX TRANS IN SINGLE DAY
1111 3
2222 1
Is there a pandas way to achieve this instead of creating groups and iterating through them?

To find the max you can use groupby, unstack, and then take the max across the columns:
In [1832]: df.groupby(['A', 'B'])['A'].count().unstack().max(axis=1)
Out[1832]:
A
1111 3
2222 1
dtype: int64
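For reference, a minimal sketch of an equivalent approach that avoids the unstack, assuming the same df: size() counts each (A, B) pair, and a second groupby over the A index level takes the max.
df.groupby(['A', 'B']).size().groupby(level='A').max()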

We can do groupby twice. First we get the count of each occurrence in column B for each ID in column A. Then we groupby again and aggregate (here both the max and, for comparison, the min):
df2 = pd.DataFrame(df.groupby(['A', 'B'])['B'].count())\
        .rename({'B': 'MAX TRANS SINGLE DAY'}, axis=1)\
        .reset_index()
df = df2.groupby('A', as_index=False).agg({'MAX TRANS SINGLE DAY':['max', 'min']})
print(df)
      A  MAX TRANS SINGLE DAY
                     max  min
0  1111                3    1
1  2222                1    1

Leave the first TWO dates for each id

I have a dataframe of id number and dates:
import pandas as pd
df = pd.DataFrame([['1', '01/01/2000'], ['1', '01/07/2002'], ['1', '04/05/2003'],
                   ['2', '01/05/2010'], ['2', '08/08/2009'],
                   ['3', '12/11/2008']], columns=['id', 'start_date'])
df
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 1 04/05/2003
3 2 01/05/2010
4 2 08/08/2009
5 3 12/11/2008
I am looking for a way to keep, for each id, only the first TWO dates (i.e. the two earliest dates).
For the example above the output would be:
id start_date
0 1 01/01/2000
1 1 01/07/2002
2 2 08/08/2009
3 2 01/05/2010
4 3 12/11/2008
Thanks!
ensure timestamp:
df['start_date'] = pd.to_datetime(df['start_date'])
sort values:
df = df.sort_values(by=['id', 'start_date'])
group and select the first 2 rows only:
df_ = df.groupby('id')[['id', 'start_date']].head(2)
Just group by id and then you can call head. Be sure to sort your values first.
df = df.sort_values(['id', 'start_date'])
df.groupby('id').head(2)
full code:
df = pd.DataFrame([['1', '01/01/2000'], ['1', '01/07/2002'], ['1', '04/05/2003'],
                   ['2', '01/05/2010'], ['2', '08/08/2009'],
                   ['3', '12/11/2008']], columns=['id', 'start_date'])
# 1. convert the 'start_date' column to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
# 2. sort the dataframe ascending by 'start_date'
df.sort_values(by='start_date', ascending=True, inplace=True)
# 3. select only the first two occurrences of each id
df.groupby('id').head(2)
output:
id start_date
0 1 2000-01-01
1 1 2002-01-07
5 3 2008-12-11
4 2 2009-08-08
3 2 2010-01-05
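For reference, a minimal sketch of an alternative that skips the explicit pre-sort, assuming start_date has already been converted with pd.to_datetime and a reasonably recent pandas: nsmallest picks the two earliest dates within each group directly.
df.groupby('id', group_keys=False).apply(lambda g: g.nsmallest(2, 'start_date'))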

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
|fk ID| value | valId |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
and I would like to transpose and pivot it in such a way that I get the following table and the same order of column names:
| fk ID | value | valID | fk ID | value | valID |
|   1   |   3   |   1   |   1   |   3   |   2   |
|   2   |   4   |   1   |   2   |   5   |   2   |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())

res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal, and it can be very slow if the size of each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
>>> df2 = df.append({'fk ID': 3, 'value': 10, 'valID': 4},
...                 ignore_index=True)
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0  fk ID     1
   value     3
   valID     1
   fk ID     1
   value     3
   valID     2
1  fk ID     2
   value     4
   valID     1
   fk ID     2
   value     5
   valID     2
2  fk ID     3
   value    10
   valID     4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np

def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)

# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)
# get the maximum group size
max_group_size = groups.size().max()
# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)
# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)
# update the columns
res.columns = new_cols
Output:
# df
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1 3 2
1 2 4 1 2 5 2
# df2
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1.0 3.0 2.0
1 2 4 1 2.0 5.0 2.0
2 3 10 4 NaN NaN NaN
You can convert df to a numpy array, reshape it, and convert it back to a dataframe, then rename the columns (0..5).
This also works if the values are strings rather than numbers.
import pandas as pd
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
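A small note on the hard-coded nrows: assuming every fk ID group has the same size, the number of output rows can be derived from the data instead of being fixed by hand, e.g.:
nrows = df['fk ID'].nunique()  # one output row per group; here 2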
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
fk ID value valID fk ID_2 value_2 valID_2
0 1 3 2 1 3 1
1 2 5 2 2 4 1
I'd expect this to be pretty performant, and it could be generalized for any number of rows in each group (vs. assuming odd/even for two rows per group), but it requires that you have the same number of rows per fk ID.
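For reference, a minimal sketch of one way to generalize this to unequal group sizes, using cumcount to number the rows within each fk ID and then pivoting; note that the result keys the columns by position as a MultiIndex and moves fk ID into the index instead of repeating it per block:
out = (df.assign(pos=df.groupby('fk ID').cumcount())
         .pivot(index='fk ID', columns='pos'))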

Pandas pivot rows into columns with count of occurence per row

I have the following dataframe with one column representing IDs (the same ID can appear several times in the column) and another representing an occurrence of a category for this ID. Each category can have several occurrences per ID.
id category
1234 happy
4567 sad
8910 medium
...............
1234 happy
4567 medium
I would like to pivot this table to get the following
id happy sad medium
1234 2 0 0
4567 0 1 1
8910 0 0 1
I've tried the following
df.pivot_table(index= "id", columns = "category", aggfunc= 'count', fill_value = 0)
But it only returns the IDs as the index, with no count columns, since there is no remaining column for aggfunc='count' to count.
Could anyone help?
You can use pd.crosstab:
print (pd.crosstab(df["id"], df["category"]))
If you want to stick with pivot_table, you need to add an extra column as value:
print (df.assign(value=0).pivot_table(index="id", columns="category", values="value", aggfunc='count', fill_value=0))
category happy medium sad
id
1234 2 0 0
4567 0 1 1
8910 0 1 0
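For completeness, a sketch of a third option, assuming the same df: count the (id, category) pairs with size and unstack the category level.
print (df.groupby(["id", "category"]).size().unstack(fill_value=0))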

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name: if a name matches, add the count and time values to the first data frame.
In [2]: df1                        In [3]: df2
Out[2]:                            Out[3]:
     Name  count     time               Name  count     time
0     Bob    123  4:12:10          0    Rick      9  0:13:00
1   Alice     99  1:01:12          1    Jone      7  0:24:21
2  Sergei     78  0:18:01          2     Bob     10  0:15:13
[85 rows x 3 columns]              [105 rows x 3 columns]
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
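A sketch of an alternative, assuming the time columns in both frames have already been converted with pd.to_timedelta: concatenate the two frames, sum per Name, and keep only the names that appear in df1 (the variable names below are illustrative).
combined = pd.concat([df1, df2], ignore_index=True)
summed = combined.groupby('Name', as_index=False).sum()
result = summed[summed['Name'].isin(df1['Name'])]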

generating matrix with pandas

I want to generate a matrix using pandas for the data df with the following logic:
Group by Id.
Row (Low: Mid, Top: End):
For day 1: count if the Id's levels include both Mid and End and day == 1
For day 2: count if the Id's levels include both Mid and End and day == 2
….
Row (Low: Mid, Top: New):
For day 1: count if the Id's levels include both Mid and New and day == 1
For day 2: count if the Id's levels include both Mid and New and day == 2
….
df = pd.DataFrame({'Id': [111, 111, 222, 333, 333, 444, 555, 555, 555, 666, 666],
                   'Level': ['End', 'Mid', 'End', 'End', 'Mid', 'New', 'End', 'New', 'Mid', 'New', 'Mid'],
                   'day': ['', 3, '', '', 2, 3, '', 3, 4, '', 2]})
Id  | Level | day
111 | End   |
111 | Mid   | 3
222 | End   |
333 | End   |
333 | Mid   | 2
444 | New   | 3
555 | End   |
555 | New   | 3
555 | Mid   | 4
666 | New   |
666 | Mid   | 2
The matrix would look like this:
Low Top day1 day2 day3 day4
Mid End 0 1 1 0
Mid New 0 1 0 1
New End 0 0 1 0
New Mid 0 0 0 1
Thank you!
Starting from your dataframe:
import itertools

# all the combinations of Levels (sorted, so every pair has a canonical order)
level_combos = [c for c in itertools.combinations(sorted(df['Level'].unique()), 2)]
# create the output and fill with zeros; columns 1..4 will become day1..day4
df_output = pd.DataFrame(0, index=level_combos, columns=range(1, 5))
It is probably not very efficient, but it should work:
for _, g in df.groupby('Id'):  # group by Id
    # combinations of levels for this Id (sorted the same way as above)
    level_combos_this_id = [c for c in itertools.combinations(sorted(g['Level'].unique()), 2)]
    # set to 1 the days present ('' coerces to NaN and is dropped)
    days = pd.to_numeric(g['day'], errors='coerce').dropna().astype(int).values
    df_output.loc[level_combos_this_id, days] = 1
Finally, rename the columns to get to the desired output:
df_output.columns = ['day' + str(c) for c in df_output.columns]