Hive rank over grouped values

Gurus, I'm stuck on a Hive ranking problem. I would like to rank the transactions within each day (with no repeated rank value for the same trx value):
date hour trx rnk
18/03/2018 0 1 24
18/03/2018 1 2 23
18/03/2018 2 3 22
18/03/2018 3 4 21
18/03/2018 4 5 20
18/03/2018 5 6 19
18/03/2018 6 7 18
18/03/2018 7 8 17
18/03/2018 8 9 16
18/03/2018 9 10 15
18/03/2018 10 11 14
18/03/2018 11 12 13
18/03/2018 12 13 12
18/03/2018 13 14 11
18/03/2018 14 15 10
18/03/2018 15 16 9
18/03/2018 16 17 8
18/03/2018 17 18 7
18/03/2018 18 19 6
18/03/2018 19 20 5
18/03/2018 20 21 4
18/03/2018 21 22 3
18/03/2018 22 23 2
18/03/2018 23 24 1
17/03/2018 0 1 24
17/03/2018 1 2 23
17/03/2018 2 3 22
17/03/2018 3 4 21
17/03/2018 4 5 20
17/03/2018 5 6 19
17/03/2018 6 7 18
17/03/2018 7 8 17
17/03/2018 8 9 16
17/03/2018 9 10 15
17/03/2018 10 11 14
17/03/2018 11 12 13
17/03/2018 12 13 12
17/03/2018 13 14 11
17/03/2018 14 15 10
17/03/2018 15 16 9
17/03/2018 16 17 8
17/03/2018 17 18 7
17/03/2018 18 19 6
17/03/2018 19 20 5
17/03/2018 20 21 4
17/03/2018 21 22 3
17/03/2018 22 23 2
17/03/2018 23 24 1
Here is my code:
select a.date, a.hour, trx, rank() over (order by a.trx) as rnk
from (
    select date, hour, count(*) as trx
    from smy_tb
    group by date, hour
) a
limit 100;
The problems are:
1. The rank value is repeated for rows with the same trx value.
2. The rank continues across dates (it should be partitioned by date, so each date returns only 24 rank values).
Any advice is appreciated,
thank you

You should partition by the date column and use a specific ordering:
rank() over (partition by a.date order by a.hour desc)

As explained by #BKS, here is the resolved code:
select a.date, a.hour, trx, row_number() over (partition by a.date order by a.trx desc) as rnk
from (
    select date, hour, count(*) as trx
    from smy_tb
    group by date, hour
) a
limit 100;
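If you want to prototype the same partitioned ranking outside Hive, here is a minimal pandas sketch (hypothetical data; `rank(method='first')` within each date group plays the role of `row_number() over (partition by date order by trx desc)`, guaranteeing unique ranks even for tied trx values):

```python
import pandas as pd

# Hypothetical hourly transaction counts for two days
df = pd.DataFrame({
    'date': ['18/03/2018'] * 3 + ['17/03/2018'] * 3,
    'hour': [0, 1, 2, 0, 1, 2],
    'trx':  [1, 2, 3, 1, 2, 3],
})

# row_number() over (partition by date order by trx desc):
# method='first' assigns distinct ranks even when trx values tie
df['rnk'] = (df.groupby('date')['trx']
               .rank(method='first', ascending=False)
               .astype(int))
print(df)
```

Within each date the highest trx gets rank 1 and the ranking restarts on the next date, which is exactly what the partition clause buys you in Hive.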

Related

Is there an easy way to handle wide timeseries data with pandas

I have a wide dataframe that looks like this.
Hour Wed Aug 10 2022 Thu Aug 11 2022 Fri Aug 12 2022 Sat Aug 13 2022
0 1 52602 49281 52805 53069
1 2 49970 46938 50135 50591
2 3 48188 45494 48156 48837
3 4 47046 44611 47162 47220
4 5 46746 44375 46742 46300
5 6 47493 45325 47259 46115
6 7 48923 47073 48598 46100
7 8 49568 47857 49208 46406
8 9 52147 49854 51482 49274
9 10 55879 53215 55066 53303
10 11 60309 57576 59480 57773
11 12 64943 62024 63799 61670
12 13 68988 66202 67331 64791
13 14 72274 69657 69557 67590
14 15 74249 71855 71525 69363
15 16 75062 73585 72573 70173
16 17 74197 74163 72692 70607
17 18 71764 73506 71726 70353
18 19 68248 71413 69588 69105
19 20 63552 68774 67319 66704
20 21 61337 66328 64784 64501
21 22 59275 63760 62836 62415
22 23 55960 60090 59766 59115
23 24 52384 56233 56341 55681
I would like it to look like this:
Date Hour Values
8/10/2022 1 52602
8/10/2022 2 49970
8/10/2022 3 48188
8/10/2022 4 47046
8/10/2022 5 46746
8/10/2022 6 47493
8/10/2022 7 48923
8/10/2022 8 49568
8/10/2022 9 52147
8/10/2022 10 55879
8/10/2022 11 60309
8/10/2022 12 64943
8/10/2022 13 68988
8/10/2022 14 72274
8/10/2022 15 74249
8/10/2022 16 75062
8/10/2022 17 74197
8/10/2022 18 71764
8/10/2022 19 68248
8/10/2022 20 63552
8/10/2022 21 61337
8/10/2022 22 59275
8/10/2022 23 55960
8/10/2022 24 52384
8/11/2022 1 49281
8/11/2022 2 46938
8/11/2022 3 45494
8/11/2022 4 44611
8/11/2022 5 44375
8/11/2022 6 45325
8/11/2022 7 47073
8/11/2022 8 47857
8/11/2022 9 49854
8/11/2022 10 53215
8/11/2022 11 57576
8/11/2022 12 62024
8/11/2022 13 66202
8/11/2022 14 69657
8/11/2022 15 71855
8/11/2022 16 73585
8/11/2022 17 74163
8/11/2022 18 73506
8/11/2022 19 71413
8/11/2022 20 68774
8/11/2022 21 66328
8/11/2022 22 63760
8/11/2022 23 60090
8/11/2022 24 56233
I have tried melt and wide_to_long, but can't get this exact output. Can someone please point me in the right direction?
I'm sorry I can't even figure out how to demonstrate the output I want correctly.
# Melt your data, keeping the `Hour` column.
df = df.melt('Hour', var_name='Date', value_name='Values')
# Convert to Datetime.
df['Date'] = pd.to_datetime(df['Date'])
# Reorder columns as desired.
df = df[['Date', 'Hour', 'Values']]
print(df)
Output:
Date Hour Values
0 2022-08-10 1 52602
1 2022-08-10 2 49970
2 2022-08-10 3 48188
3 2022-08-10 4 47046
4 2022-08-10 5 46746
.. ... ... ...
91 2022-08-13 20 66704
92 2022-08-13 21 64501
93 2022-08-13 22 62415
94 2022-08-13 23 59115
95 2022-08-13 24 55681
[96 rows x 3 columns]
Additionally, I'd consider merging your Date and Hour columns into a proper timestamp:
# it's unclear if Hour 1 is 1am or midnight; this assumes midnight
df['timestamp'] = df.Date.add(pd.to_timedelta(df.Hour.sub(1), unit='h'))
print(df[['timestamp', 'Values']])
# Output:
timestamp Values
0 2022-08-10 00:00:00 52602
1 2022-08-10 01:00:00 49970
2 2022-08-10 02:00:00 48188
3 2022-08-10 03:00:00 47046
4 2022-08-10 04:00:00 46746
.. ... ...
91 2022-08-13 19:00:00 66704
92 2022-08-13 20:00:00 64501
93 2022-08-13 21:00:00 62415
94 2022-08-13 22:00:00 59115
95 2022-08-13 23:00:00 55681
[96 rows x 2 columns]

Remove duplicates appearing within a certain time period using SQL or Teradata

I have a dataset containing an ID variable, a date, and several agents (see example below). The agents have been tested several times per patient. For every ID, I want to keep the first test to appear and remove all other tests appearing within 4 weeks after it. After that, I again want to keep the next surviving test and remove all others within 4 weeks of it, and so on through the whole dataset. I also generated variables showing the week, month and year.
ID Date Week Month Year Agent
1 10 2010-12-09 49 12 2010 Agent1
2 12 2010-12-09 49 12 2010 Agent2
3 13 2010-12-09 49 12 2010 Agent3
4 14 2010-12-09 49 12 2010 Agent4
5 10 2010-12-09 49 12 2010 Agent1
6 12 2010-12-09 49 12 2010 Agent2
7 13 2010-12-09 49 12 2010 Agent3
8 14 2010-12-09 49 12 2010 Agent4
9 10 2010-12-27 52 12 2010 Agent1
10 12 2010-12-27 52 12 2010 Agent2
11 13 2010-12-27 52 12 2010 Agent3
12 14 2010-12-27 52 12 2010 Agent4
13 10 2011-01-14 2 1 2011 Agent1
14 12 2011-01-14 2 1 2011 Agent2
15 13 2011-01-14 2 1 2011 Agent3
16 14 2011-01-14 2 1 2011 Agent4
17 10 2011-01-14 2 1 2011 Agent1
18 12 2011-01-14 2 1 2011 Agent2
19 13 2011-01-14 2 1 2011 Agent3
20 14 2011-01-14 2 1 2011 Agent4
and what I need is this:
ID Date Week Month Year Agent
1 10 2010-12-09 49 12 2010 Agent1
2 12 2010-12-09 49 12 2010 Agent2
3 13 2010-12-09 49 12 2010 Agent3
4 14 2010-12-09 49 12 2010 Agent4
13 10 2011-01-14 2 1 2011 Agent1
14 12 2011-01-14 2 1 2011 Agent2
15 13 2011-01-14 2 1 2011 Agent3
16 14 2011-01-14 2 1 2011 Agent4
I'm happy about any help!
Assuming your true duplicates are intentional, you can use a couple of window functions (row_number and min over) combined with qualify to filter on those window functions. I named your first column with the row numbers rn just to include it in the output.
select
    t.*,
    min(theDate) over (partition by id order by theDate) as minDate,
    theDate - minDate as daysBetween
from
    <your table> t
qualify
    (daysBetween > 28   -- exclude rows within 28 days from the first date
     or daysBetween = 0) -- keep the first row for each id
    -- get rid of true duplicates
    and row_number() over (partition by id, theDate, agent order by rn) = 1
order by rn
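Note that the qualify approach above measures every row against the first date per id. If you want the re-anchoring behavior the question describes (each kept row starts a brand-new 4-week window), that is inherently sequential. Here is a minimal pandas sketch of that logic, using hypothetical single-patient data:

```python
import pandas as pd

# Hypothetical test dates for one patient ID, including a true duplicate
dates = pd.to_datetime(['2010-12-09', '2010-12-09', '2010-12-27', '2011-01-14'])
df = pd.DataFrame({'ID': 10, 'Date': dates})

# Walk each patient's dates in order; keep a row only if it falls
# more than 28 days after the last kept row (the moving anchor)
kept_rows = []
for _, group in df.sort_values('Date').groupby('ID'):
    anchor = None
    for idx, d in group['Date'].items():
        if anchor is None or (d - anchor).days > 28:
            kept_rows.append(idx)
            anchor = d

result = df.loc[kept_rows]
print(result)
```

For the sample above this keeps 2010-12-09 and 2011-01-14 and drops the duplicate and the 2010-12-27 row, matching the desired output in the question.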

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter a TableA, taking into account only those rows whose "TotalInvoice" field is within the minimum and maximum values expressed in a ViewB, based on month and year values and RepairShopId (the sample data only has one RepairShopId, but all the data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId  RepairShopId  LastUpdated              TotalInvoice
1                  10            2017-06-01 07:00:00.000  765
1                  10            2017-06-05 12:15:00.000  765
2                  10            2017-02-25 13:00:00.000  400
3                  10            2017-10-19 12:15:00.000  295679
4                  10            2016-11-29 11:00:00.000  133409.41
5                  10            2016-10-28 12:30:00.000  127769
6                  10            2016-11-25 16:15:00.000  122400
7                  10            2016-10-18 11:15:00.000  1950
8                  10            2016-11-07 16:45:00.000  79342.7
9                  10            2016-11-25 19:15:00.000  1950
10                 10            2016-12-09 14:00:00.000  111559
11                 10            2016-11-28 10:30:00.000  106333
12                 10            2016-12-13 18:00:00.000  23847.4
13                 10            2016-11-01 17:00:00.000  22782.9
14                 10            2016-10-07 15:30:00.000  NULL
15                 10            2017-01-06 15:30:00.000  138958
16                 10            2017-01-31 13:00:00.000  244484
17                 10            2016-12-05 09:30:00.000  180236
18                 10            2017-02-14 18:30:00.000  92752.6
19                 10            2016-10-05 08:30:00.000  161952
20                 10            2016-10-05 08:30:00.000  8713.08
ViewB
RepairShopId  Orders  Average       MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343        370343            370343            2015  7      2015-7
10            1       109645        109645            109645            2015  10     2015-10
10            1       148487        148487            148487            2015  12     2015-12
10            1       133409.41     133409.41         133409.41         2016  3      2016-3
10            1       19261         19261             19261             2016  8      2016-8
10            4       10477.3575    2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565  10                90942.6052417394  2016  10     2016-10
10            98      22312.077244  10                147265.581935242  2016  11     2016-11
10            96      20068.147395  10                99974.1750708773  2016  12     2016-12
10            86      25334.053372  10                184186.985160105  2017  1      2017-1
10            69      21410.63855   10                153417.00126689   2017  2      2017-2
10            100     13009.797     10                59002.3589332934  2017  3      2017-3
10            101     11746.191287  10                71405.3391452842  2017  4      2017-4
10            123     11143.49756   10                55306.8202091131  2017  5      2017-5
10            197     15980.55406   10                204538.144334771  2017  6      2017-6
10            99      10852.496969  10                63283.9899761938  2017  7      2017-7
10            131     52601.981526  10                1314998.61355187  2017  8      2017-8
10            124     10983.221854  10                59444.0535811233  2017  9      2017-9
10            115     12467.148434  10                72996.6054527277  2017  10     2017-10
10            123     14843.379593  10                129673.931373139  2017  11     2017-11
10            111     8535.455945   10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field "LastUpdated".
Any help is very appreciated!
Here is how you can do it. I assumed the LastUpdated column in TableA indicates the date of the invoice:
SELECT *
FROM TableA A
INNER JOIN ViewB B
    ON A.RepairShopId = B.RepairShopId
    AND A.TotalInvoice > B.MinimumValue
    AND A.TotalInvoice < B.MaximumValue
    AND YEAR(A.LastUpdated) = B.year
    AND MONTH(A.LastUpdated) = B.month
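The same join logic can be prototyped in pandas: derive year/month keys from the datetime (mirroring YEAR()/MONTH() in the SQL), merge, then apply the range filter. This is a sketch over miniature hypothetical tables loosely based on the sample data:

```python
import pandas as pd

# Hypothetical miniature versions of TableA and ViewB
table_a = pd.DataFrame({
    'RepairShopId': [10, 10],
    'LastUpdated': pd.to_datetime(['2017-06-01 07:00', '2017-10-19 12:15']),
    'TotalInvoice': [765.0, 295679.0],
})
view_b = pd.DataFrame({
    'RepairShopId': [10, 10],
    'MinimumValue': [10.0, 10.0],
    'MaximumValue': [204538.14, 72996.61],
    'year': [2017, 2017],
    'month': [6, 10],
})

# Derive join keys from the datetime, like YEAR()/MONTH() in SQL
table_a['year'] = table_a['LastUpdated'].dt.year
table_a['month'] = table_a['LastUpdated'].dt.month

merged = table_a.merge(view_b, on=['RepairShopId', 'year', 'month'])
result = merged[(merged['TotalInvoice'] > merged['MinimumValue'])
                & (merged['TotalInvoice'] < merged['MaximumValue'])]
print(result)
```

Here the 765 invoice survives (it sits inside June 2017's min/max band) while the 295679 invoice is filtered out because it exceeds October 2017's maximum.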

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way where I can assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
For a one-line solution, use DataFrame.join, renaming the columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
Or change the order of the list and assign directly:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
If you need a different order of the columns, select them explicitly:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print (data)
SALESDATE a Day Week Year
0 2017-04-03 0 1 14 2017
1 2017-04-04 1 2 14 2017
2 2017-04-05 2 3 14 2017
3 2017-04-06 3 4 14 2017
4 2017-04-07 4 5 14 2017
5 2017-04-08 5 6 14 2017
6 2017-04-09 6 7 14 2017
7 2017-04-10 7 1 15 2017
8 2017-04-11 8 2 15 2017
9 2017-04-12 9 3 15 2017

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need to take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
You can use the window function sum() over with a conditional flag.
Example
Select *,
       Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - 2008 Version
Select A.*, B.*
From YourTable A
Cross Apply (
    Select Series = 1 + sum(case when [DaysBetween] > 60 then 1 else 0 end)
    From YourTable
    Where RowNo <= A.RowNo
) B
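The running-sum-of-flags idea translates directly to pandas: a cumulative sum over the "gap greater than 60 days" condition reproduces the Series column from the question's data:

```python
import pandas as pd

# The DaysBetween values from the question's sample data
df = pd.DataFrame({
    'RowNo': range(1, 11),
    'DaysBetween': [14, 14, 14, 66, 12, 14, 70, 27, 7, 14],
})

# Each gap > 60 days starts a new series; cumsum of the boolean flag
# counts how many breaks have occurred so far, offset by 1 so the
# first series is numbered 1
df['Series'] = 1 + (df['DaysBetween'] > 60).cumsum()
print(df)
```

After assigning Series, `df.groupby('Series')` gives you the min(StartDate)/max(StopDate) aggregation the question mentions as the follow-up step.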