pandas pivoting start and end

I need help with pivoting my df to get the start and end day.
Id   Day  Value
111  6    a
111  5    a
111  4    a
111  2    a
111  1    a
222  3    a
222  2    a
222  1    a
333  1    a
The desired result would be:
Id   StartDay  EndDay
111  4         6
111  1         2    (since 111 skips day 3)
222  1         3
333  1         1
Thanks a bunch!

So, my first thought was just:
df.groupby('Id').Day.agg(['min','max'])
But then I noticed your stipulation "(since 111 skips day 3)", which means we need an identifier that tells us whether the current row is in the same 'block' as the previous one (same Id, contiguous Day). So, we sort:
df.sort_values(['Id','Day'], inplace=True)
Then define the block:
df['block'] = (df.Day != (df.shift(1).Day + 1).fillna(0).astype(int)).astype(int).cumsum()
(adapted from the top answer to this question: Finding consecutive segments in a pandas data frame)
Then group by Id and block:
df.groupby(['Id','block']).Day.agg(['min','max'])
Giving:
Id   block  min  max
111  1      1    2
111  2      4    6
222  3      1    3
333  4      1    1
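Putting it together as a self-contained sketch (with a slightly simplified block test, and min/max renamed to the asker's StartDay/EndDay):
import pandas as pd

df = pd.DataFrame({'Id': [111, 111, 111, 111, 111, 222, 222, 222, 333],
                   'Day': [6, 5, 4, 2, 1, 3, 2, 1, 1],
                   'Value': ['a'] * 9})

df = df.sort_values(['Id', 'Day'])
# a new block starts whenever Day is not exactly one more than the previous Day
df['block'] = (df.Day != df.Day.shift(1) + 1).cumsum()

out = (df.groupby(['Id', 'block']).Day.agg(['min', 'max'])
         .rename(columns={'min': 'StartDay', 'max': 'EndDay'})
         .reset_index('block', drop=True)
         .reset_index())
print(out)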

Related

Pandas split dataframe by sessions

I've got the following DataFrame:
id  sec
1   45
2   1
3   176
1   19
1   876
3   123
I want to split it into per-id groups of sessions, or create multiple dataframes for these sessions. That is, I want the sessions of each id (a new session starts when more than 30 seconds have passed between user actions).
For example:
sessions for id 1: [45, 19], [876]
I tried groupby and cat, but I have no idea how to implement this.
To identify the session you can use:
df['session'] = (df.sort_values(by=['id', 'sec'])
.groupby('id')['sec']
.apply(lambda s: s.diff().gt(30).cumsum().add(1))
)
Output:
id sec session
0 1 45 1
1 2 1 1
2 3 176 2
3 1 19 1
4 1 876 2
5 3 123 1
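If you then want the actual per-session value lists (e.g. sessions for id 1: [45, 19], [876]), a possible follow-up, sketched on top of the session column above:
# collect the sec values of each (id, session) pair into a list
sessions = df.groupby(['id', 'session'])['sec'].apply(list)
print(sessions.loc[1])  # the two sessions of id 1: [45, 19] and [876]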

Duplicated IDs pandas

I have the following dataframes (df1,df2):
df1:
ID   Q1
111  2
111  3
112  1

df2:
ID   Q2
111  1
111  5
112  7
Since the IDs are duplicated, I want to reinitialize them, using the following code:
df1.sort_values('ID',inplace=True)
df1['ID_new'] = range(len(df1))
df2.sort_values('ID',inplace=True)
df2['ID_new'] = range(len(df2))
in order to have something like this:

df1:
ID_new  ID   Q1
0       111  2
1       111  3
2       112  1

df2:
ID_new  ID   Q2
0       111  1
1       111  5
2       112  7
The question is: are we sure that ID_new will be the same for df1 and df2?
For example: is it possible that ID_new = 1 corresponds to the first ID = 111 in df1 but to the second ID = 111 in df2?
If so, is there a more robust way to reinitialize them?
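For what it's worth: no, that is not guaranteed. sort_values uses quicksort by default, which is not a stable sort, so rows with equal ID can come out in either order, and df1 and df2 need not agree. A sketch of a more robust variant (assuming the rows of df1 and df2 correspond in their original order) is to force a stable sort:
# mergesort is stable: rows with equal ID keep their original relative order,
# so ID_new is assigned consistently in both dataframes
for d in (df1, df2):
    d.sort_values('ID', kind='mergesort', inplace=True)
    d['ID_new'] = range(len(d))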

PySpark : Merge two dataframes

I have two dataframes, DF1 and DF2, and they have the same column names.
Let's say DF1 has the following format:
Item Id  item    model  price
1        item 1  22     100
2        item 2  33     300
3        item 3  44     400
4        item 4  55     500
DF2 has the following format:
Item Id  item    model  price
1        item 1  222    1000
1        item 1  2222   10000
2        item 2  333    3000
3        item 3  444    4000
4        item 4  555    5000
I need to combine the two dataframes such that the result looks like:
Item Id  item    model  price
1        item 1  22     100
1        item 1  222    1000
1        item 1  2222   10000
2        item 2  33     300
2        item 2  333    3000
3        item 3  44     400
3        item 3  444    4000
4        item 4  55     500
4        item 4  555    5000
I need to use only PySpark, not pandas. Thanks for the help.
You can use a union here:
df1.union(df2)
or, more explicitly:
df1.select("Item Id","item","model","price").union(df2.select("Item Id","item","model","price"))
Optionally, you can order the result:
df1.union(df2).orderBy("Item Id","item","model","price")
Let me know if this works for you.
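As a side note, union resolves columns by position. If the column order might differ between the two dataframes, Spark 2.3+ also offers unionByName, which matches columns by name; a brief sketch:
# unionByName matches columns by name rather than by position (Spark >= 2.3)
df1.unionByName(df2).orderBy("Item Id", "item", "model", "price")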

Calculate percentiles using SQL for a group/partition

I want to calculate the percentiles for a given partition/group in SQL. For example, the input data looks like:
CustID  Product ID  quantity_purchased
1       111         2
2       111         3
3       111         2
4       111         5
1       222         2
2       222         6
4       222         7
6       222         2
I want to get percentiles for each Product ID group. The output should be:
Product ID  min  25%  50%   75%   max
111         2    2    2.5   3.5   5
222         2    2    4     6.25  7
How to achieve this using SQL?
You can use percentile_cont():
select product_id,
       min(quantity_purchased) as min_qty,
       max(quantity_purchased) as max_qty,
       percentile_cont(0.25) within group (order by quantity_purchased) as pct_25,
       percentile_cont(0.50) within group (order by quantity_purchased) as pct_50,
       percentile_cont(0.75) within group (order by quantity_purchased) as pct_75
from t
group by product_id;
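For comparison, the same per-group percentiles in pandas (whose default linear interpolation in quantile matches percentile_cont); a sketch assuming the data sits in a dataframe df with columns product_id and quantity_purchased:
# quantile(0) and quantile(1) give the min and max, respectively
out = (df.groupby('product_id')['quantity_purchased']
         .quantile([0, 0.25, 0.5, 0.75, 1])
         .unstack())
out.columns = ['min', '25%', '50%', '75%', 'max']
print(out)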

generating matrix with pandas

I want to generate a matrix using pandas for the data df with the following logic:
Group by Id.
Row Low = Mid, Top = End:
For day 1: count if the Id's levels include both Mid and End and day == 1
For day 2: count if the Id's levels include both Mid and End and day == 2
…
Row Low = Mid, Top = New:
For day 1: count if the Id's levels include both Mid and New and day == 1
For day 2: count if the Id's levels include both Mid and New and day == 2
…
df = pd.DataFrame({'Id':[111,111,222,333,333,444,555,555,555,666,666],'Level':['End','Mid','End','End','Mid','New','End','New','Mid','New','Mid'],'day' : ['',3,'','',2,3,'',3,4,'',2]})
Id  | Level | day
111 | End   |
111 | Mid   | 3
222 | End   |
333 | End   |
333 | Mid   | 2
444 | New   | 3
555 | End   |
555 | New   | 3
555 | Mid   | 4
666 | New   |
666 | Mid   | 2
The matrix would look like this:
Low  Top  day1  day2  day3  day4
Mid  End  0     1     1     0
Mid  New  0     1     0     1
New  End  0     0     1     0
New  Mid  0     0     0     1
Thank you!
Starting from your dataframe:
import itertools

# all pairwise combinations of Levels (sorted so pairs line up across Ids)
level_combos = list(itertools.combinations(sorted(df['Level'].unique()), 2))
# create the output and fill with zeros; one column per day 1-4
df_output = pd.DataFrame(0, index=level_combos, columns=range(1, 5))
This is probably not very efficient, but it should work:
for _, g in df.groupby('Id'):  # group by Id
    # combinations of the levels present for this Id (sorted to match the index above)
    level_combos_this_id = list(itertools.combinations(sorted(g['Level'].unique()), 2))
    # days present for this Id ('' becomes NaN and is dropped)
    days = pd.to_numeric(g['day'], errors='coerce').dropna().astype(int).values
    # set to 1 the days present
    df_output.loc[level_combos_this_id, days] = 1
Finally, rename the columns to get the desired output:
df_output.columns = ['day' + str(i) for i in range(1, 5)]
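To present the matrix with the asker's Low/Top columns, a small follow-up sketch:
# give the (Low, Top) pairs named index levels, then move them into columns
df_output.index = pd.MultiIndex.from_tuples(df_output.index, names=['Low', 'Top'])
print(df_output.reset_index())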