PySpark: Merge two dataframes

I have two dataframes, DF1 and DF2, and they have the same column names.
Let's say DF1 has the following format:
Item Id  item    model  price
1        item 1  22     100
2        item 2  33     300
3        item 3  44     400
4        item 4  55     500
DF2 has the following format:
Item Id  item    model  price
1        item 1  222    1000
1        item 1  2222   10000
2        item 2  333    3000
3        item 3  444    4000
4        item 4  555    5000
I need to combine the two dataframes so that the result looks like this:
Item Id  item    model  price
1        item 1  22     100
1        item 1  222    1000
1        item 1  2222   10000
2        item 2  33     300
2        item 2  333    3000
3        item 3  44     400
3        item 3  444    4000
4        item 4  55     500
4        item 4  555    5000
I need to use only PySpark, not pandas. Thanks for the help.

You may use a union here:
df1.union(df2)
or, to be explicit about the column order:
df1.select("Item Id", "item", "model", "price").union(df2.select("Item Id", "item", "model", "price"))
Optionally, you may order your results:
df1.union(df2).orderBy("Item Id", "item", "model", "price")
Let me know if this works for you.
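For reference, here is a self-contained sketch of the whole thing (the sample rows are typed in from the tables above, and an active SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["Item Id", "item", "model", "price"]
df1 = spark.createDataFrame(
    [(1, "item 1", 22, 100), (2, "item 2", 33, 300),
     (3, "item 3", 44, 400), (4, "item 4", 55, 500)], cols)
df2 = spark.createDataFrame(
    [(1, "item 1", 222, 1000), (1, "item 1", 2222, 10000),
     (2, "item 2", 333, 3000), (3, "item 3", 444, 4000),
     (4, "item 4", 555, 5000)], cols)

# union() resolves columns by position, not by name, so selecting the
# columns in a fixed order first guards against differing schemas
result = df1.select(cols).union(df2.select(cols)).orderBy("Item Id", "model")
result.show()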

Related

Pandas split dataframe by sessions

I've got the following DataFrame:
id  sec
1   45
2   1
3   176
1   19
1   876
3   123
I want to split it into groups by id and by session, or create multiple dataframes of these sessions. That is, I want the sessions of each id, where a new session starts when more than 30 seconds have passed between user actions.
For example, the sessions for id 1: [45, 19], [876]
I tried groupby and cat, but I have no idea how to implement this.
To identify the session you can use:
df['session'] = (df.sort_values(by=['id', 'sec'])
                   .groupby('id')['sec']
                   .apply(lambda s: s.diff().gt(30).cumsum().add(1))
                 )
Output:
   id  sec  session
0   1   45        1
1   2    1        1
2   3  176        2
3   1   19        1
4   1  876        2
5   3  123        1
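If you then want the sessions laid out the way the question sketches them (lists of sec values per id), one possible follow-up, assuming the session column from above has been assigned, is:

sessions = df.groupby(['id', 'session'])['sec'].apply(list)
print(sessions.loc[1].tolist())
# [[45, 19], [876]]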

How to calculate leftovers of each balance top-up using first in first out technique?

Imagine we have user balances. There's a table with top-ups and withdrawals; let's call it balance_updates.
transaction_id  user_id  current_balance  amount  created_at
1               1        100              100     ...
2               1        0                -100
3               2        400              400
4               2        300              -100
5               2        200              -200
6               2        300              100
7               2        50               -50
What I want to get out of this is a list of top-ups and their leftovers, using the first-in-first-out technique for each user.
So the result could be this:
top_up  user_id  leftover
1       1        0
3       2        50
6       2        100
Honestly, I struggle to turn this into SQL, though I know how to do it on paper. Got any ideas?
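The FIFO rule itself is easy to state procedurally, which helps before translating it to a query. A minimal Python sketch of the logic (rows typed in from the table above; this illustrates the algorithm only, not the SQL the question ultimately asks for):

from collections import defaultdict

# (transaction_id, user_id, amount) in created_at order
rows = [
    (1, 1, 100), (2, 1, -100),
    (3, 2, 400), (4, 2, -100), (5, 2, -200),
    (6, 2, 100), (7, 2, -50),
]

# per-user FIFO queue of [top_up_id, remaining_amount]
queues = defaultdict(list)
for tx_id, user_id, amount in rows:
    if amount > 0:
        queues[user_id].append([tx_id, amount])  # a top-up joins the queue
    else:
        to_take = -amount  # a withdrawal consumes the oldest top-ups first
        for entry in queues[user_id]:
            used = min(entry[1], to_take)
            entry[1] -= used
            to_take -= used
            if to_take == 0:
                break

for user_id, entries in queues.items():
    for top_up_id, leftover in entries:
        print(top_up_id, user_id, leftover)
# 1 1 0
# 3 2 50
# 6 2 100

In SQL this is usually approached with window functions (comparing a running sum of withdrawals against the running sum of top-ups per user); the queue formulation above is a handy specification to test any such query against.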

Group by and provide groups only if unique in group

I have the following dataset:
   Amount  Document Number
0     200            12345
1      90             2222
2     200           456789
3      90             4444
4     300             4789
5     300             4789
So basically I want to get group numbers for the above data (using ngroup, maybe), grouping the data on the basis of Amount. A group number should be assigned to a group only if the Document Numbers in that group are all distinct.
This is what I would like the outcome to be:
   Amount  Document Number  Group
0     200            12345      1
1      90             2222      2
2     200           456789      1
3      90             4444      2
4     300             4789
5     300             4789
To restate: group the data on the basis of Amount, and assign the rows a group number only if every Document Number in the group is distinct.
I think you want rank():
select t.*, rank() over (order by amount, document_number) as grouping
from t;
In pandas, you could first build a mask with groupby.transform and duplicated, so that any Amount group containing a duplicated row is flagged as False, then use this mask with groupby.ngroup:
mask_dup = ~(df.duplicated().groupby(df['Amount']).transform(any))
df.loc[mask_dup, 'Group'] = df[mask_dup].groupby('Amount').ngroup()+1
print(df)
   Amount  Document Number  Group
0     200            12345    2.0
1      90             2222    1.0
2     200           456789    2.0
3      90             4444    1.0
4     300             4789    NaN
5     300             4789    NaN
If you have more than these two columns, you need to specify the subset in duplicated.
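A self-contained run of the pandas approach, with the question's data typed in, in case you want to verify it:

import pandas as pd

df = pd.DataFrame({
    'Amount': [200, 90, 200, 90, 300, 300],
    'Document Number': [12345, 2222, 456789, 4444, 4789, 4789],
})

# flag every Amount group that contains a fully duplicated row
mask_dup = ~df.duplicated().groupby(df['Amount']).transform('any')
df.loc[mask_dup, 'Group'] = df[mask_dup].groupby('Amount').ngroup() + 1
print(df)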

pandas pivoting start and end

I need help with pivoting my df to get the start and end day.
Id Day Value
111 6 a
111 5 a
111 4 a
111 2 a
111 1 a
222 3 a
222 2 a
222 1 a
333 1 a
The desired result would be:
Id StartDay EndDay
111 4 6
111 1 2 (since 111 skips day 3)
222 1 3
333 1 1
Thanks a bunch!
So, my first thought was just:
df.groupby('Id').Day.agg(['min','max'])
But then I noticed your stipulation "(since 111 skips day 3)", which means we have to make an identifier which tells us if the current row is in the same 'block' as the previous (same Id, contiguous Day). So, we sort:
df.sort_values(['Id','Day'], inplace=True)
Then define the block:
df['block'] = (df.Day != (df.shift(1).Day + 1).fillna(0).astype(int)).astype(int).cumsum()
(adapted from top answer to this question: Finding consecutive segments in a pandas data frame)
then group by Id and block:
df.groupby(['Id','block']).Day.agg(['min','max'])
Giving:
Id block min max
111 1 1 2
111 2 4 6
222 3 1 3
333 4 1 1
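Putting the pieces together into one runnable sketch (data typed in from the question; the block expression is a slightly simplified form of the one above, and the final rename/reset_index is just one way to get the StartDay/EndDay labels):

import pandas as pd

df = pd.DataFrame({
    'Id': [111, 111, 111, 111, 111, 222, 222, 222, 333],
    'Day': [6, 5, 4, 2, 1, 3, 2, 1, 1],
})
df.sort_values(['Id', 'Day'], inplace=True)

# a new block starts whenever Day is not exactly the previous Day + 1
df['block'] = (df.Day != df.shift(1).Day + 1).cumsum()

out = (df.groupby(['Id', 'block']).Day.agg(['min', 'max'])
         .reset_index(level='block', drop=True)
         .rename(columns={'min': 'StartDay', 'max': 'EndDay'})
         .reset_index())
print(out)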

Possible to group by counts?

I am trying to change something like this:
Index Record Time
1 10 100
1 10 200
1 10 300
1 10 400
1 3 500
1 10 600
1 10 700
2 10 800
2 10 900
2 10 1000
3 5 1100
3 5 1200
3 5 1300
into this:
Index CountSeq Record LastTime
1 4 10 400
1 1 3 500
1 2 10 700
2 3 10 1000
3 3 5 1300
I am trying to apply this logic per unique index -- I just included three indexes to show the outcome.
So for a given index I want to combine them by streaks of the same Record. So notice that the first four entries for Index 1 have Records 10, but it is more succinct to say that there were 4 entries with record 10, ending at time 400. Then I repeat the process going forward, in sequence.
In short I am trying to perform a count-grouping over sequential chunks of the same Record, within each index. In other words I am NOT looking for this:
select index, count(*) as countseq, record, max(time) as lasttime
from Table1
group by index,record
Which combines everything by the same record whereas I want them to be separated by sequence breaks.
Is there a way to do this in SQL?
It's hard to solve your problem without a single primary key, so I'll assume you have a primary key column that increases with each row (primkey). This query would return the same table with a diff column whose value is 1 if the previous primkey row has the same index and record as the current one, and 0 otherwise:
SELECT *,
       IF((SELECT index, record FROM yourTable p2 WHERE p1.primkey = p2.primkey)
        = (SELECT index, record FROM yourTable p2 WHERE p1.primkey - 1 = p2.primkey), 1, 0) AS diff
FROM yourTable p1
If you use a temporary variable that increases each time the IF expression is false, you would get a result like this :
primkey Index Record Time diff
1 1 10 100 1
2 1 10 200 1
3 1 10 300 1
4 1 10 400 1
5 1 3 500 2
6 1 10 600 3
7 1 10 700 3
8 2 10 800 4
9 2 10 900 4
10 2 10 1000 4
11 3 5 1100 5
12 3 5 1200 5
13 3 5 1300 5
That would solve your problem: you would just add diff to the GROUP BY clause.
Unfortunately I can't test it on SQLite, but you should be able to use variables like this.
It's probably a dirty workaround, but I couldn't find any better way; hope it helps.
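For comparison with the pandas questions above, the same streak identifier is just a cumulative sum over a row-change flag, which also makes the expected output easy to verify. A sketch with the question's data typed in:

import pandas as pd

df = pd.DataFrame({
    'Index':  [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Record': [10, 10, 10, 10, 3, 10, 10, 10, 10, 10, 5, 5, 5],
    'Time':   [100, 200, 300, 400, 500, 600, 700,
               800, 900, 1000, 1100, 1200, 1300],
})

# a new streak begins whenever Index or Record changes from the previous row
streak = (df[['Index', 'Record']] != df[['Index', 'Record']].shift()).any(axis=1).cumsum()

out = (df.groupby(streak)
         .agg(Index=('Index', 'first'), CountSeq=('Record', 'size'),
              Record=('Record', 'first'), LastTime=('Time', 'max'))
         .reset_index(drop=True))
print(out)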