Pandas data manipulation from column to row elements [duplicate] - pandas

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
I have a dataset with millions of rows; here is an example of what it looks like and what I intend to output:
import pandas as pd

data = [[1, 100, 8], [1, 100, 4],
        [1, 100, 6], [2, 100, 0],
        [2, 200, 1], [3, 300, 7],
        [4, 400, 2], [5, 100, 6],
        [5, 100, 3], [5, 600, 1]]
df = pd.DataFrame(data, columns=['user', 'time', 'item'])
print(df)
   user  time  item
0     1   100     8
1     1   100     4
2     1   100     6
3     2   100     0
4     2   200     1
5     3   300     7
6     4   400     2
7     5   100     6
8     5   100     3
9     5   600     1
The desired output should have all items consumed by a user at the same time appear together in the item column, as follows:
user  time  item
   1   100  8,4,6
   2   100      0
   5   100    6,3
   2   200      1
   3   300      7
   4   400      2
   5   600      1
For example, user 1 consumed items 8, 4 and 6 at time 100.
How could this be achieved?

Use df.astype with GroupBy.agg and df.sort_values:
In [489]: out = df.astype(str).groupby(['user', 'time'])['item'].agg(','.join).reset_index().sort_values('time')
In [490]: out
Out[490]:
   user  time   item
0     1   100  8,4,6
1     2   100      0
5     5   100    6,3
2     2   200      1
3     3   300      7
4     4   400      2
6     5   600      1
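Since the frame has millions of rows, a possible tweak (a sketch, not benchmarked) is to cast only the item column to str instead of the whole frame, so user and time keep their numeric dtypes. Using the df defined in the question:

# Convert only 'item' to str; 'user' and 'time' stay numeric
out = (df.assign(item=df['item'].astype(str))
         .groupby(['user', 'time'])['item']
         .agg(','.join)
         .reset_index()
         .sort_values('time'))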

Related

Cut continuous data by outliers

For example, I have this DataFrame:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})
I need to cut by outliers and get these intervals over column a: 1-4, 5-6, 7, 8, 9.
Cutting with pd.cut and pd.qcut does not give these results.
You can group them into consecutive runs based on an above/below-threshold mask:
# Mask of outliers: values of 'b' above the threshold
m = df['b'].gt(100)
# Start a new group each time the mask flips between True and False
df['group'] = m.ne(m.shift()).cumsum()
Output:
   a     b  group
0  1     2      1
1  2     2      1
2  3     4      1
3  4     3      1
4  5  1000      2
5  6  2000      2
6  7     1      3
7  8   500      4
8  9     3      5
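To turn those groups into the intervals the question asks for, one option (a sketch, assuming the same threshold of 100) is to aggregate the bounds of a within each run:

# Min/max of 'a' within each consecutive run gives the interval bounds
intervals = df.groupby('group')['a'].agg(['min', 'max'])
print(intervals)
#        min  max
# group
# 1        1    4
# 2        5    6
# 3        7    7
# 4        8    8
# 5        9    9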

Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
# Add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
    count=pd.NamedAgg('amount', 'count'),
    min=pd.NamedAgg('amount', 'min'),
    max=pd.NamedAgg('amount', 'max'),
)
Error: "TypeError: incompatible index of inserted column with frame index"
Use merge:
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
df_out = d.groupby(['card_id', 'day']).agg(
    count=pd.NamedAgg('amount', 'count'),
    min=pd.NamedAgg('amount', 'min'),
    max=pd.NamedAgg('amount', 'max'),
)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
   card_id  day  amount  count  min  max
0        1    1       1      2    1    2
1        1    1       2      2    1    2
2        2    1      10      2   10   20
3        2    1      20      2   10   20
4        1    2       3      2    3    4
5        1    2       4      2    3    4
6        2    2      30      2   30   40
7        2    2      40      2   30   40
The output of your groupby has a MultiIndex of (card_id, day), which doesn't match the index of d, hence the error. However, we can join the columns in d to that index by calling merge with the column names on the left and right_index=True.
You could use the assign function to get the results in one go (note the frame is d, not df, and we select the amount column before transforming):
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
   card_id  day  amount  count  min  max
0        1    1       1      2    1    2
1        1    1       2      2    1    2
2        2    1      10      2   10   20
3        2    1      20      2   10   20
4        1    2       3      2    3    4
5        1    2       4      2    3    4
6        2    2      30      2   30   40
7        2    2      40      2   30   40
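If a literal one-liner is the goal, one more sketch (not tested against pandas 1.0.5 specifically): aggregate once, then join the result back on the group keys, which adds all three columns in a single statement.

# Aggregate once per (card_id, day), then broadcast back via join
d = d.join(
    d.groupby(['card_id', 'day'])['amount'].agg(['count', 'min', 'max']),
    on=['card_id', 'day'],
)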

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
   day  hour  sales
0    1    10      0
1    1    10     40
2    1    10     30
3    2    11     10
4    2    11     80
5    2    11     70
6    3    12      0
7    3    12      0
8    3    12     20
I would like to find the first entry of each day that has sales greater than 0, and then change the 'hour' column of those rows to 9.
So to get something like this:
   day  hour  sales
0    1    10      0
1    1     9     40
2    1    10     30
3    2     9     10
4    2    11     80
5    2    11     70
6    3    12      0
7    3    12      0
8    3     9     20
I only came up with this iterative solution. Is there a way to do this in a more functional, vectorized way?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of the first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can group the boolean mask df['sales'].gt(0) by df['day'], take idxmax to get each day's first True, filter out days that don't have any value greater than 0 using any, and then assign with loc[]:
# Boolean mask of positive sales, grouped per day
g = df['sales'].gt(0).groupby(df['day'])
# idxmax returns the index of the first True in each day
idx = g.idxmax()
# g.any() drops days with no positive sales before assigning
df.loc[idx[g.any()], 'hour'] = 9
print(df)
   day  hour  sales
0    1    10      0
1    1     9     40
2    1    10     30
3    2     9     10
4    2    11     80
5    2    11     70
6    3    12      0
7    3    12      0
8    3     9     20
Create a mask m that, within each group of day and of whether sales is non-zero, carries the group's first sales value.
Then, use this mask together with df['sales'] > 0 to change those specific rows to 9 with np.where():
import numpy as np

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
   day  hour  sales
0    1    10      0
1    1     9     40
2    1    10     30
3    2     9     10
4    2    11     80
5    2    11     70
6    3    12      0
7    3    12      0
8    3     9     20
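For reference, a quick look at the intermediate mask (a sketch; the values follow from the sample frame above):

# Each row carries the first 'sales' value of its (day, sales != 0) group
print(m.tolist())
# [0, 40, 40, 10, 10, 10, 0, 0, 20]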

Subtract columns from DataFrames with different shapes by looking up based on another column

df2 has more columns and rows than df1. For each row in df2, I want to look up a corresponding row in df1 based on matching values in one of their columns. From that matching row in df1, I want to subtract a column between df2 and df1. I tried set_index and directly subtracting the DataFrames, but that resulted in a lot of NaN.
df1 = pd.DataFrame([[1, 10], [2, 20], [3, 30]],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[1, 100, 15], [1, 200, 20],
                    [2, 100, 30], [2, 200, 35],
                    [3, 100, 50], [3, 200, 55]],
                   columns=['A', 'X', 'B'])
# For each row in df2, look up in df1 based on column A, and produce
# the difference of values in column B.
expected = pd.DataFrame([[1, 100, 5], [1, 200, 10],
                         [2, 100, 10], [2, 200, 15],
                         [3, 100, 20], [3, 200, 25]],
                        columns=['A', 'X', 'B'])
DataFrames:
df1
   A   B
0  1  10
1  2  20
2  3  30

df2
   A    X   B
0  1  100  15
1  1  200  20
2  2  100  30
3  2  200  35
4  3  100  50
5  3  200  55

expected
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
Set df1's index to 'A' and map it onto df2.A; after that, do the subtraction:
df2['B'] -= df2.A.map(df1.set_index('A').B)
Out[216]:
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
Note: in case df2.A has values that don't exist in df1.A, the result will be NaN on those rows. I leave it that way because your sample data doesn't specify how to handle it. If you want to keep the value of B unchanged in that case, you just need to chain .fillna(0) to the end of the map, or call the subtract method with the fill_value=0 option:
df2['B'] -= df2.A.map(df1.set_index('A').B).fillna(0)
You can use merge also:
df2.merge(df1, on='A').eval('B = B_x - B_y').drop(['B_x','B_y'], axis=1)
Output:
   A    X   B
0  1  100   5
1  1  200  10
2  2  100  10
3  2  200  15
4  3  100  20
5  3  200  25
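For completeness, a sketch of the subtract/fill_value variant mentioned in the note above (starting from the original, unmodified df2):

# Keep B unchanged when df2.A has no match in df1.A
lookup = df1.set_index('A')['B']
df2['B'] = df2['B'].subtract(df2['A'].map(lookup), fill_value=0)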

pandas convert lists in multiple columns within DataFrame to separate columns

I am trying to convert a list within multiple columns of a pandas DataFrame into separate columns.
Say, I have a dataframe like this:
           0          1
0  [1, 2, 3]  [4, 5, 6]
1  [1, 2, 3]  [4, 5, 6]
2  [1, 2, 3]  [4, 5, 6]
And would like to convert it to something like this:
   0  1  2  0  1  2
0  1  2  3  4  5  6
1  1  2  3  4  5  6
2  1  2  3  4  5  6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import pandas as pd
df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)
If you don't care about the column names you could do:
>>> import numpy as np
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
   0  1  2  3  4  5
0  1  2  3  4  5  6
1  1  2  3  4  5  6
2  1  2  3  4  5  6
Using sum:
pd.DataFrame(df.sum(1).tolist())
   0  1  2  3  4  5
0  1  2  3  4  5  6
1  1  2  3  4  5  6
2  1  2  3  4  5  6
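Another sketch along the same lines: expanding each column via tolist() is usually much faster than .apply(pd.Series) on large frames, and concatenating the pieces preserves the repeated 0/1/2 headers from the desired output:

import pandas as pd

df = pd.DataFrame([[[1, 2, 3], [4, 5, 6]]] * 3)

# Build one expanded frame per original column, then concatenate side by side
out = pd.concat(
    [pd.DataFrame(df[c].tolist()) for c in df.columns],
    axis=1,
)
print(out)
#    0  1  2  0  1  2
# 0  1  2  3  4  5  6
# 1  1  2  3  4  5  6
# 2  1  2  3  4  5  6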