HIVE - increment value on column change

I'm just basically trying to add a column with a unique identifier for a journey. I have a table that looks similar to this:
Time id station newtrip
2017-11-15 16:45 100 St.George TRUE
2017-11-15 16:46 100 Bloor FALSE
2017-11-15 16:47 110 Wellesley TRUE
2017-11-15 16:48 110 Wellesley FALSE
2017-11-15 16:49 200 Dundas TRUE
2017-11-15 16:55 200 College FALSE
2017-11-15 16:56 200 Union FALSE
2017-11-15 17:51 200 Union TRUE
2017-11-15 17:52 200 St.Andrew FALSE
All I am trying to do is increment a counter every time that last column shows TRUE. So the result should look like:
Time id station newtrip journeyID
2017-11-15 16:45 100 St.George TRUE 1
2017-11-15 16:46 100 Bloor FALSE 1
2017-11-15 16:47 110 Wellesley TRUE 2
2017-11-15 16:48 110 Wellesley FALSE 2
2017-11-15 16:49 200 Dundas TRUE 3
2017-11-15 16:55 200 College FALSE 3
2017-11-15 16:56 200 Union FALSE 3
2017-11-15 17:51 200 Union TRUE 4
2017-11-15 17:52 200 St.Andrew FALSE 4
A couple of things to note:
If using window/analytic functions, I am not partitioning on anything; I want this to cover the entire table (about 30 million rows).
I was able to get this to work on my Hortonworks Ambari VM by adding a row_counter (rowid) for the whole table and then doing something like:
SUM(IF(newtrip, 1, 0)) OVER(order by rowid) as journeyID
But when running the EXACT same code on a cluster in AWS EMR, adding a row ID seems to scramble the order of the other columns and I get incorrect results.
Surely there is an easy way to do this?
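For what it's worth, the same running-count idea is easy to check in pandas before porting it back to HiveQL; a minimal sketch (sample rows abbreviated, and a deterministic sort on Time is assumed):
import pandas as pd

df = pd.DataFrame({
    'Time': ['2017-11-15 16:45', '2017-11-15 16:46', '2017-11-15 16:47',
             '2017-11-15 16:48', '2017-11-15 16:49'],
    'id': [100, 100, 110, 110, 200],
    'station': ['St.George', 'Bloor', 'Wellesley', 'Wellesley', 'Dundas'],
    'newtrip': [True, False, True, False, True],
})
# sort deterministically, then take a running sum of the TRUE flags
df = df.sort_values('Time')
df['journeyID'] = df['newtrip'].cumsum()
print(df)  # journeyID: 1, 1, 2, 2, 3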

Related

Pandas - Filtering alternate Monday

I have a Dataframe that has sales data by day. I would like to be able to filter the sales data down to every alternate Monday. For example, if I select June 27, the next date I would like to keep would be July 11, then July 25, and so on.
I have my Dataframe as below
sale_date, count
2022-06-27, 100
2022-07-01, 150
2022-07-07, 100
2022-07-11, 150
2022-06-20, 100
2022-07-25, 150
I would expect the output to be
sale_date, count
2022-06-27, 100
2022-07-11, 150
2022-07-25, 150
You can use:
# convert to datetime
date = pd.to_datetime(df['sale_date'])
# is the day a Monday (0 = Monday)?
m1 = date.dt.weekday.eq(0)
# is the ISO week number even?
m2 = date.dt.isocalendar().week.mod(2).eq(0)
# keep the row if both conditions are True
out = df[m1 & m2]
output:
sale_date count
0 2022-06-27 100
3 2022-07-11 150
5 2022-07-25 150
intermediates:
sale_date count weekday weekday.eq(0) week week.mod(2) week.mod(2).eq(0)
0 2022-06-27 100 0 True 26 0 True
1 2022-07-01 150 4 False 26 0 True
2 2022-07-07 100 3 False 27 1 False
3 2022-07-11 150 0 True 28 0 True
4 2022-06-20 100 0 True 25 1 False
5 2022-07-25 150 0 True 30 0 True
Alternatively, resample on a two-week, Monday-anchored frequency (this assumes sale_date is already datetime):
df11 = df.resample("2W-MON", closed="left", on="sale_date")["count"].first().reset_index()
# the bin labels fall on the closing Mondays, so shift them back one week
df11.assign(sale_date=df11.sale_date - pd.Timedelta(days=7))
out:
sale_date count
0 2022-06-27 100
1 2022-07-11 100
2 2022-07-25 150
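Note that the ISO-week parity test only works because the chosen start Monday (2022-06-27) happens to fall in an even ISO week (week 26). A sketch that anchors to an arbitrary start date instead (the anchor value here is an assumption; set it to the first Monday you want to keep):
import pandas as pd

df = pd.DataFrame({'sale_date': ['2022-06-27', '2022-07-01', '2022-07-07',
                                 '2022-07-11', '2022-06-20', '2022-07-25'],
                   'count': [100, 150, 100, 150, 100, 150]})
date = pd.to_datetime(df['sale_date'])
anchor = pd.Timestamp('2022-06-27')  # first Monday to keep (assumed)
# keep rows a whole number of 14-day periods on or after the anchor
out = df[date.ge(anchor) & (date - anchor).dt.days.mod(14).eq(0)]
print(out)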

Pandas extract hierarchical info?

I have a dataframe which describes serial numbers of items arranged in boxes:
df = pd.DataFrame({'barcode': ['1000']*3 + ['2000']*4 + ['3000']*3,
                   'box_number': ['10']*2 + ['11'] + ['12']*4 + ['13', '14', '15'],
                   'serials': list(map(str, range(800, 810)))})
barcode box_number serials
0 1000 10 800
1 1000 10 801
2 1000 11 802
3 2000 12 803
4 2000 12 804
5 2000 12 805
6 2000 12 806
7 3000 13 807
8 3000 14 808
9 3000 15 809
I want to group them hierarchically to output hierarchical XML, so that every barcode has a list of box numbers, each of which has a list of serials in it.
So I did a groupby which seems to do exactly what I want:
df.groupby(['barcode','box_number'])['serials'].apply(' '.join)
barcode box_number
1000 10 800 801
11 802
2000 12 803 804 805 806
3000 13 807
14 808
15 809
Name: serials, dtype: object
Now, I want to extract this info practically the way it is displayed, so that I get a row for each barcode with the data grouped similarly to this:
row['1000']== {'10': '800 801','11':'802'}
row['2000']== {'12': '803 804 805 806'}
row['3000']== {'13': '807','14':'808','15':'809' }
But I can't seem to figure out how to get this done. I tried reset_index(), another groupby() -- but this doesn't work on existing result as it is a Series, but I can't seem to be able to understand the right way.
How should I this most concisely? I looked over questions here, but didn't seem to find similar issue.
Use dictionary comrehension for get nested dictonary with Series.xs and Series.to_dict:
s = df.groupby(['barcode','box_number'])['serials'].apply(' '.join)
d = {lev: s.xs(lev).to_dict() for lev in s.index.levels[0]}
print(d)
{'1000': {'10': '800 801', '11': '802'},
'2000': {'12': '803 804 805 806'},
'3000': {'13': '807', '14': '808', '15': '809'}}
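From the nested dict, building the hierarchical XML is straightforward with the standard library; a sketch (the element names 'boxes', 'barcode', and 'box' are made up here, adjust them to your schema):
import xml.etree.ElementTree as ET

root = ET.Element('boxes')
for barcode, boxes in d.items():
    item = ET.SubElement(root, 'barcode', value=barcode)
    for box_number, serials in boxes.items():
        # one element per box, with the space-joined serials as text
        ET.SubElement(item, 'box', number=box_number).text = serials
print(ET.tostring(root, encoding='unicode'))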

How to sum certain values using pandas datetime operations

The headline is not clear, so let me explain.
I have a dataframe like this:
Order Quantity Date Accepted Date Delivered
20 01-05-2010 01-02-2011
10 01-11-2010 01-03-2011
300 01-12-2010 01-04-2011
5 01-03-2011 01-03-2012
20 01-04-2012 01-11-2013
10 01-07-2013 01-12-2014
Basically, I want to create another column that contains the total undelivered items as of each row.
Expected output:
Order Quantity Date Accepted Date Delivered Pending Order
20 01-05-2010 01-02-2011 20
10 01-11-2010 01-03-2011 30
300 01-12-2010 01-04-2011 330
5 01-03-2011 01-03-2012 305
20 01-04-2012 01-11-2013 20
10 01-07-2013 01-12-2014 30
Here I have taken part of your dataframe and tried to get the result.
df = pd.DataFrame({'order': [20, 10, 300, 200],
                   'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
                   'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I convert Date_aceepted and Date_delever to datetime using pandas' to_datetime:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I make a new dataframe keeping only the rows where the two dates differ; I assume that is all you need in your final result (the .copy() avoids a SettingWithCopyWarning when assigning a new column later).
dff = df[df['date1'] != df['date2']].copy()
You can see that the last row, in which the accepted and delivered dates are the same, has been removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I used pandas cumsum to compute the pending orders:
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final data frame has two extra columns that can be dropped if you don't want them in your result.
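Note the plain cumsum only reproduces the first rows of the expected output; if, as the Pending Order column suggests, an order stops counting once its delivery date has passed, each row has to be compared against all delivery dates. A sketch under that assumption (day-first dates and a strictly-later delivery counting as pending are both assumptions):
import pandas as pd

df = pd.DataFrame({
    'Order Quantity': [20, 10, 300, 5, 20, 10],
    'Date Accepted': ['01-05-2010', '01-11-2010', '01-12-2010',
                      '01-03-2011', '01-04-2012', '01-07-2013'],
    'Date Delivered': ['01-02-2011', '01-03-2011', '01-04-2011',
                       '01-03-2012', '01-11-2013', '01-12-2014'],
})
acc = pd.to_datetime(df['Date Accepted'], dayfirst=True)
dlv = pd.to_datetime(df['Date Delivered'], dayfirst=True)
# for each row, add up quantities accepted on or before that row's date
# whose delivery date is still in the future at that point
df['Pending Order'] = [df['Order Quantity'][(acc <= a) & (dlv > a)].sum() for a in acc]
print(df)  # Pending Order: 20, 30, 330, 305, 20, 30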

Create a new column which shows if a customer has booked before

I am trying to create a column (BookedBefore?) which identifies whether a new enquiry is made by a customer who has booked before.
Enquirydate Booked CustomerID BookedBefore?
5/19/2018 TRUE 598 NO
8/2/2018 FALSE 598 Yes
9/20/2018 FALSE 598 Yes
1/13/2019 FALSE 598 NO
7/26/2018 FALSE 611 NO
9/30/2017 FALSE 640 NO
5/2/2017 FALSE 732 NO
10/4/2017 FALSE 732 NO
8/25/2017 FALSE 766 NO
2/3/2018 FALSE 773 NO
5/2/2018 TRUE 773 YES
1/27/2019 FALSE 773 YES
5/26/2019 FALSE 972 NO
6/22/2019 FALSE 1022 NO
4/27/2019 FALSE 1024 NO
5/5/2017 FALSE 1148 NO
4/25/2019 FALSE 1323 NO
3/24/2019 FALSE 1354 NO
10/31/2018 TRUE 1596 NO
8/6/2017 FALSE 1623 NO
8/8/2018 FALSE 1623 NO
3/12/2019 TRUE 1623 NO
3/13/2019 FALSE 1623 YES
CustomerID 598 booked (TRUE) on 5/19/2018. Every future enquiry this customer makes should be labelled YES for "BookedBefore?", as shown: CustomerID 598 made an enquiry on 8/2/2018, so that row is labelled Yes for "BookedBefore?".
Some help will be appreciated. Thank you.
I am using Google BigQuery to carry out this task.
The following query finds each customer's first booking, then left joins it back to the original table to check whether a booking exists strictly before each enquiry.
with first_booking as (
  select CustomerID, min(Enquirydate) as first_booking_date
  from <dataset>.<table>
  where Booked = TRUE
  group by 1
)
select
  a.Enquirydate,
  a.Booked,
  a.CustomerID,
  case when b.first_booking_date is not null then 'Yes' else 'No' end as has_booked_before
from <dataset>.<table> a
left join first_booking b
  on a.CustomerID = b.CustomerID and b.first_booking_date < a.Enquirydate
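For comparison, the same logic reads naturally in pandas as well; a sketch mirroring the query above (sample rows abbreviated from the question's table):
import pandas as pd

df = pd.DataFrame({
    'Enquirydate': pd.to_datetime(['5/19/2018', '8/2/2018', '9/20/2018', '1/13/2019']),
    'Booked': [True, False, False, False],
    'CustomerID': [598, 598, 598, 598],
})
# first booking date per customer, broadcast back onto each row
first = df.loc[df['Booked']].groupby('CustomerID')['Enquirydate'].min()
fb = df['CustomerID'].map(first)
# a prior booking exists if it is strictly before the enquiry
df['BookedBefore?'] = (fb < df['Enquirydate']).map({True: 'Yes', False: 'No'})
print(df)  # No, Yes, Yes, Yes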

Pandas - group by to return first occurrence and thereafter every third occurrence of a value

I am trying to filter records from a Dataframe based on their occurrence: keep the first occurrence and then every third occurrence after that, based on emp_id. Given below is how my Dataframe looks.
emp_id,date,value
101,2018-12-01,10001
101,2018-12-03,10002
101,2018-12-05,10003
101,2018-12-13,10004
In the above sample, the expected output is:
emp_id,date,value
101,2018-12-01,10001
101,2018-12-13,10004
Given below is the code I have built this far:
df['emp_id'] = df.groupby('emp_id').cumcount()+1
df['emp_id'] = np.where((df['emp_id']%3)==0,1,0)
This, however, returns the 2nd occurrence and every third occurrence after that. How could I modify it so that it returns the first occurrence and then every third occurrence based on emp_id?
I think you need boolean indexing with a check against 0 (or 1); assigning back to a column is not necessary, as it is possible to create a helper Series s:
print (df)
emp_id date value
0 101 2018-12-01 10001
1 101 2018-12-03 10002
2 101 2018-12-05 10003
3 101 2018-12-13 10004
4 101 2018-12-01 10005
5 101 2018-12-03 10006
6 101 2018-12-05 10007
7 101 2018-12-13 10008
s = df.groupby('emp_id').cumcount()
df['check'] = (s % 3) == 0
Alternative:
s = df.groupby('emp_id').cumcount() + 1
df['check'] = (s % 3) == 1
print (df)
emp_id date value check
0 101 2018-12-01 10001 True
1 101 2018-12-03 10002 False
2 101 2018-12-05 10003 False
3 101 2018-12-13 10004 True
4 101 2018-12-01 10005 False
5 101 2018-12-03 10006 False
6 101 2018-12-05 10007 True
7 101 2018-12-13 10008 False
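To actually filter the rows rather than just flag them, the helper mask can be used for boolean indexing directly:
s = df.groupby('emp_id').cumcount()
# keeps rows 0, 3, 6, ... within each emp_id group
out = df[(s % 3) == 0]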