generating matrix with pandas - pandas

I want to generate a matrix using pandas for the data df with the following logic:
Group by id
Low: Mid Top: End
For day 1: Count if (If level has Mid and End and if day == 1)
For day 2: Count if (If level has Mid and End and if day == 2)
….
Begin: Mid to New
For day 1: Count if (If level has Mid and New and if day == 1)
For day 2: Count if (If level has Mid and New and if day == 2)
….
df = pd.DataFrame({'Id':[111,111,222,333,333,444,555,555,555,666,666],'Level':['End','Mid','End','End','Mid','New','End','New','Mid','New','Mid'],'day' : ['',3,'','',2,3,'',3,4,'',2]})
Id |Level | day
111 |End|
111 |Mid| 3
222 |End|
333 |End|
333 |Mid| 2
444 |New| 3
555 |End|
555 |New| 3
555 |Mid| 4
666 |New|
666 |Mid| 2
The matrix would look like this:
Low Top day1 day2 day3 day4
Mid End 0 1 1 0
Mid New 0 1 0 1
New End 0 0 1 0
New Mid 0 0 0 1
Thank you!

Starting from your dataframe:
import itertools
# all combinations of levels (sorted so each pair is ordered consistently)
level_combos = list(itertools.combinations(sorted(df['Level'].unique()), 2))
# create the output and fill it with zeros; the days run from 1 to 4
df_output = pd.DataFrame(0, index=pd.MultiIndex.from_tuples(level_combos), columns=range(1, 5))
This is probably not very efficient, but it should work:
for _, g in df.groupby('Id'):  # group by Id
    # combinations of levels for this Id
    level_combos_this_id = list(itertools.combinations(sorted(g['Level'].unique()), 2))
    # days present for this Id; dropna(inplace=True) returns None, so it must not be chained
    days = pd.to_numeric(g['day'], errors='coerce').dropna().astype(int).tolist()
    # set the days present to 1 (skip Ids with a single level or no day)
    if level_combos_this_id and days:
        df_output.loc[level_combos_this_id, days] = 1
Finally, rename the columns to get the desired layout:
df_output.columns = ['day' + str(i) for i in df_output.columns]

Related

pandas: pivot - group by multiple columns

df = pd.DataFrame({'id': ['id1', 'id1','id1', 'id2','id1','id1','id1'],
'activity':['swimming','running','jogging','walking','walking','walking','walking'],
'month':[2,3,4,3,4,4,3]})
pd.crosstab(df['id'], df['activity'])
I'd like to add another column for month in the output to get counts per user within each month for the respective activity.
df.set_index(['id','month'])['activity'].unstack().reset_index()
I get an error.
edit: Expected output in the image. I do not know how to create a table.
You can pass a list of columns to pd.crosstab:
x = pd.crosstab([df["id"], df["month"]], df["activity"]).reset_index()
x.columns.name = None
print(x)
Prints:
id month jogging running swimming walking
0 id1 2 0 0 1 0
1 id1 3 0 1 0 1
2 id1 4 1 0 0 2
3 id2 3 0 0 0 1
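For comparison, the same counts can be sketched with groupby instead of crosstab. This is only an alternative under the assumption that the data looks like the df in the question; the name counts is illustrative.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'id': ['id1', 'id1', 'id1', 'id2', 'id1', 'id1', 'id1'],
    'activity': ['swimming', 'running', 'jogging', 'walking',
                 'walking', 'walking', 'walking'],
    'month': [2, 3, 4, 3, 4, 4, 3],
})

# Count rows per (id, month, activity), then spread activities into columns.
counts = (
    df.groupby(['id', 'month', 'activity'])
      .size()
      .unstack('activity', fill_value=0)
      .reset_index()
)
counts.columns.name = None  # drop the leftover 'activity' label on the columns
print(counts)
```

Both approaches produce one row per (id, month) pair; crosstab is shorter, while the groupby form is easier to extend with other aggregations.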

How to achieve this in pandas dataframe

I have two dataframes, df1 and df2.
df1:
Date | ID | total calls
24-02-2021 | 1 | 15
22-02-2021 | 1 | 25
20-02-2021 | 3 | 100
21-02-2021 | 4 | 30
df2:
Date | ID | total calls | match_flag
24-02-2021 | 1 | 16 | 1
22-02-2021 | 1 | 25 | 1
20-02-2021 | 3 | 99 | 1
24-02-2021 | 2 | 80 | not_found
21-02-2021 | 4 | 25 | 0
I want to first match on ID and Date. If both match, I want to check an additional condition on total calls: if the difference between total calls in df1 and df2 is within ±1, I want to treat that row as a match and set the flag to 1; if it does not satisfy the ±1 condition, I want to set the flag to 0. If that date for the ID is not found in df1, the flag should be set to not_found.
Update: df1 and df2 are matched on ID and DateId.
df1:
ID | Call_Date | TId | StartTime | EndTime | total calls | Type | Indicator | DateId
562124 | 18-10-2021 | 480271 | 18-10-2021 | 18-10-2021 | 1 | Regular Call | SA | 20211018
df2:
ID | total calls | DateId | Start_Time | End_Time | Indicator | Type | match_flag
562124 | 0 | 20211018 | 2021-10-18T13:06:00.000+0000 | 2021-10-18T13:07:00.000+0000 | AD | R | not_found
You can use a merge:
s = df2.merge(df1, on=['Date', 'ID'], how='left')['total calls_y']
df2['match_flag'] = (df2['total calls']
                     .sub(s).abs().le(1)           # is absolute diff ≤ 1?
                     .astype(int)                  # convert to int
                     .mask(s.isna(), 'not_found')  # mask missing
                    )
output:
Date ID total calls match_flag
0 24-02-2021 1 16 1
1 22-02-2021 1 25 1
2 20-02-2021 3 99 1
3 24-02-2021 2 80 not_found
4 21-02-2021 4 25 0
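A self-contained sketch of the same idea, using the sample data from the question. It assumes each (Date, ID) pair appears at most once in df1, so the left merge keeps df2's row count and order; the _df1 suffix and the names merged, close and flag are illustrative. The indicator column is used here so that a genuinely missing row is told apart from a missing value.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['24-02-2021', '22-02-2021', '20-02-2021', '21-02-2021'],
    'ID': [1, 1, 3, 4],
    'total calls': [15, 25, 100, 30],
})
df2 = pd.DataFrame({
    'Date': ['24-02-2021', '22-02-2021', '20-02-2021', '24-02-2021', '21-02-2021'],
    'ID': [1, 1, 3, 2, 4],
    'total calls': [16, 25, 99, 80, 25],
})

# Left merge keeps every df2 row; '_merge' records whether df1 had a partner.
merged = df2.merge(df1, on=['Date', 'ID'], how='left',
                   suffixes=('', '_df1'), indicator=True)

# True where the two call counts differ by at most 1.
close = (merged['total calls'] - merged['total calls_df1']).abs() <= 1

# 1/0 for matched rows, 'not_found' where df1 has no such (Date, ID).
flag = close.astype(int).astype(object).mask(merged['_merge'] == 'left_only', 'not_found')
df2['match_flag'] = flag.to_numpy()
print(df2)
```

For the updated data, the same sketch would merge on ['ID', 'DateId'] instead of ['Date', 'ID'].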

Pandas groupby and max of string column

Sample DF:
df = pd.DataFrame(np.random.randint(1,10,size=(6,2)),columns = list("AB"))
df["A"] = ["1111","2222","1111","1111","2222","1111"]
df["B"] = ["20010101","20010101","20010101","20010101","20010201","20010201"]
df
OP:
A B
0 1111 20010101
1 2222 20010101
2 1111 20010101
3 1111 20010101
4 2222 20010201
5 1111 20010201
I am trying to find the max transactions done by the user_id in a single day.
For example, ID "1111" has done 3 transactions on "20010101" and 1 transaction on "20010201", so the maximum here should be 3, while ID "2222" has done 1 transaction on "20010101" and 1 transaction on "20010201", so the output is 1.
Expected OP:
MAX TRANS IN SINGLE DAY
1111 3
2222 1
Any pandas way to achieve this instead of creating groups and iterating through it.
To find the max, you need groupby, then unstack, then max along each row (axis=1):
In [1832]: df.groupby(['A', 'B'])['A'].count().unstack().max(axis=1)
Out[1832]:
A
1111 3
2222 1
dtype: int64
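A variant that avoids the wide intermediate table is sketched below, assuming the same columns A (user) and B (day) as in the question. It counts rows per (A, B) and then takes the maximum within each A; the names per_day and max_per_user are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['1111', '2222', '1111', '1111', '2222', '1111'],
    'B': ['20010101', '20010101', '20010101', '20010101', '20010201', '20010201'],
})

# Number of transactions per user per day.
per_day = df.groupby(['A', 'B']).size()

# Largest daily count for each user.
max_per_user = per_day.groupby(level='A').max()
print(max_per_user)
```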
We can do groupby twice. First we get the count of each occurrence in column B for each ID in column A. Then we groupby again and get the max value:
df2 = pd.DataFrame(df.groupby(['A', 'B'])['B'].count())\
    .rename({'B':'MAX TRANS SINGLE DAY'}, axis=1)\
    .reset_index()
df = df2.groupby('A', as_index=False).agg({'MAX TRANS SINGLE DAY':['max', 'min']})
print(df)
A MAX TRANS SINGLE DAY
max min
0 1111 3 1
1 2222 1 1

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach to do this? And is there also a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
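Output generated (older pandas versions added the group key as an extra index level; current versions keep the original flat index):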
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
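On the second part of the question (numbering records within each group, like row_number() in SQL), one option is groupby().cumcount(). The sketch below assumes the df from the question; row_number and top2 are illustrative names, and the numbering is 0-based.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# 0-based position of each row within its id group (add 1 for SQL-style numbering).
row_number = df.groupby('id').cumcount()

# Keep the first two rows of each group; the original index is preserved.
top2 = df[row_number < 2]
print(top2)
```

The row order within each group is the order of the rows in df, so sort first if "first" should mean something other than the current order.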
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
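As a small illustration of dropping that extra index level, here is a sketch on the df from the question (the names top and flat are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Two largest values per id; the result is indexed by (id, original row label).
top = df.groupby('id')['value'].nlargest(2)

# Drop the original row labels so only the id remains in the index.
flat = top.reset_index(level=1, drop=True)
print(flat)
```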
Sometimes sorting the whole dataset up front is very time-consuming.
We can group first and take the top k rows within each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The argument to head plays the same role as the argument to nlargest: it is the number of rows to keep for each group.
The reset_index call is optional.
This works for duplicated values
If the top-n values contain duplicates and you want only unique values, you can do this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want distinct salaries within each department, we can do this:
(df.groupby('department')['salary']
   .apply(lambda ser: ser.drop_duplicates().nlargest(3))
   .droplevel(level=1)
   .sort_index()
   .reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

pandas pivoting start and end

I need help with pivoting my df to get the start and end day.
Id Day Value
111 6 a
111 5 a
111 4 a
111 2 a
111 1 a
222 3 a
222 2 a
222 1 a
333 1 a
The desired result would be:
Id StartDay EndDay
111 4 6
111 1 2 (since 111 skips day 3)
222 1 3
333 1 1
Thanks a bunch!
So, my first thought was just :
df.groupby('Id').Day.agg(['min','max'])
But then I noticed your stipulation "(since 111 skips day 3)", which means we have to make an identifier which tells us if the current row is in the same 'block' as the previous (same Id, contiguous Day). So, we sort:
df.sort_values(['Id','Day'], inplace=True)
Then define the block:
df['block'] = ((df.Day!=(df.shift(1).Day+1).fillna(0).astype(int))).astype(int).cumsum()
(adapted from top answer to this question: Finding consecutive segments in a pandas data frame)
then group by Id and block:
df.groupby(['Id','block']).Day.agg(['min','max'])
Giving:
Id block min max
111 1 1 2
111 2 4 6
222 3 1 3
333 4 1 1
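Put together as one self-contained sketch on the data from the question. Unlike the line above, the block flag here also checks explicitly for a change of Id; that extra check is an addition for safety rather than part of the original answer, and the names new_block and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [111, 111, 111, 111, 111, 222, 222, 222, 333],
    'Day': [6, 5, 4, 2, 1, 3, 2, 1, 1],
    'Value': ['a'] * 9,
})

df = df.sort_values(['Id', 'Day'])

# A new block starts when the Id changes or the Day is not the previous Day + 1.
new_block = (df['Id'] != df['Id'].shift()) | (df['Day'] != df['Day'].shift() + 1)
df['block'] = new_block.cumsum()

# First and last day of each contiguous block.
result = (
    df.groupby(['Id', 'block'])['Day']
      .agg(StartDay='min', EndDay='max')
      .reset_index()
      .drop(columns='block')
)
print(result)
```

The rows come out sorted by Id and StartDay, so the two blocks for 111 appear as 1-2 followed by 4-6.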