Is there an easy way to handle wide timeseries data with pandas

I have a wide dataframe that looks like this.
Hour Wed Aug 10 2022 Thu Aug 11 2022 Fri Aug 12 2022 Sat Aug 13 2022
0 1 52602 49281 52805 53069
1 2 49970 46938 50135 50591
2 3 48188 45494 48156 48837
3 4 47046 44611 47162 47220
4 5 46746 44375 46742 46300
5 6 47493 45325 47259 46115
6 7 48923 47073 48598 46100
7 8 49568 47857 49208 46406
8 9 52147 49854 51482 49274
9 10 55879 53215 55066 53303
10 11 60309 57576 59480 57773
11 12 64943 62024 63799 61670
12 13 68988 66202 67331 64791
13 14 72274 69657 69557 67590
14 15 74249 71855 71525 69363
15 16 75062 73585 72573 70173
16 17 74197 74163 72692 70607
17 18 71764 73506 71726 70353
18 19 68248 71413 69588 69105
19 20 63552 68774 67319 66704
20 21 61337 66328 64784 64501
21 22 59275 63760 62836 62415
22 23 55960 60090 59766 59115
23 24 52384 56233 56341 55681
I would like it to look like this:
Date Hour Values
8/10/2022 1 52602
8/10/2022 2 49970
8/10/2022 3 48188
8/10/2022 4 47046
8/10/2022 5 46746
8/10/2022 6 47493
8/10/2022 7 48923
8/10/2022 8 49568
8/10/2022 9 52147
8/10/2022 10 55879
8/10/2022 11 60309
8/10/2022 12 64943
8/10/2022 13 68988
8/10/2022 14 72274
8/10/2022 15 74249
8/10/2022 16 75062
8/10/2022 17 74197
8/10/2022 18 71764
8/10/2022 19 68248
8/10/2022 20 63552
8/10/2022 21 61337
8/10/2022 22 59275
8/10/2022 23 55960
8/10/2022 24 52384
8/11/2022 1 49281
8/11/2022 2 46938
8/11/2022 3 45494
8/11/2022 4 44611
8/11/2022 5 44375
8/11/2022 6 45325
8/11/2022 7 47073
8/11/2022 8 47857
8/11/2022 9 49854
8/11/2022 10 53215
8/11/2022 11 57576
8/11/2022 12 62024
8/11/2022 13 66202
8/11/2022 14 69657
8/11/2022 15 71855
8/11/2022 16 73585
8/11/2022 17 74163
8/11/2022 18 73506
8/11/2022 19 71413
8/11/2022 20 68774
8/11/2022 21 66328
8/11/2022 22 63760
8/11/2022 23 60090
8/11/2022 24 56233
I have tried melt and wide_to_long, but can't get this exact output. Can someone please point me in the right direction?
I'm sorry I can't even figure out how to demonstrate the output I want correctly.

import pandas as pd

# Melt your data, keeping the `Hour` column.
df = df.melt('Hour', var_name='Date', value_name='Values')
# Convert to Datetime.
df['Date'] = pd.to_datetime(df['Date'])
# Reorder columns as desired.
df = df[['Date', 'Hour', 'Values']]
print(df)
Output:
Date Hour Values
0 2022-08-10 1 52602
1 2022-08-10 2 49970
2 2022-08-10 3 48188
3 2022-08-10 4 47046
4 2022-08-10 5 46746
.. ... ... ...
91 2022-08-13 20 66704
92 2022-08-13 21 64501
93 2022-08-13 22 62415
94 2022-08-13 23 59115
95 2022-08-13 24 55681
[96 rows x 3 columns]
Additionally, I'd consider merging your Date and Hour columns into a proper timestamp:
# it's unclear whether Hour 1 means 1 AM or midnight; here Hour 1 is treated as midnight
df['timestamp'] = df.Date.add(pd.to_timedelta(df.Hour.sub(1), unit='h'))
print(df[['timestamp', 'Values']])
# Output:
timestamp Values
0 2022-08-10 00:00:00 52602
1 2022-08-10 01:00:00 49970
2 2022-08-10 02:00:00 48188
3 2022-08-10 03:00:00 47046
4 2022-08-10 04:00:00 46746
.. ... ...
91 2022-08-13 19:00:00 66704
92 2022-08-13 20:00:00 64501
93 2022-08-13 21:00:00 62415
94 2022-08-13 22:00:00 59115
95 2022-08-13 23:00:00 55681
[96 rows x 2 columns]
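For what it's worth, wide_to_long is an awkward fit here because the date columns share no common stub prefix. If you prefer to avoid melt, a minimal sketch of the same reshape with stack (assuming the same wide df as in the question) would be:
out = (df.set_index('Hour')
         .stack()                        # one row per (Hour, date column) pair
         .rename_axis(['Hour', 'Date'])
         .reset_index(name='Values'))
out['Date'] = pd.to_datetime(out['Date'])
# order by date first, then hour, to match the desired output
out = out[['Date', 'Hour', 'Values']].sort_values(['Date', 'Hour'], ignore_index=True)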

Related

Unable to find date in pandas

I have a dataset in this form:
company_name date
0 global_infotech 2019-06-15
1 global_infotech 2020-03-22
2 global_infotech 2020-08-30
3 global_infotech 2018-06-19
4 global_infotech 2018-06-15
5 global_infotech 2018-02-15
6 global_infotech 2018-11-22
7 global_infotech 2019-01-15
8 global_infotech 2018-12-15
9 global_infotech 2019-06-15
10 global_infotech 2018-12-19
11 global_infotech 2019-12-31
12 global_infotech 2019-02-18
13 global_infotech 2018-06-16
14 global_infotech 2019-02-10
15 global_infotech 2019-03-15
16 Qualcom 2019-07-11
17 Qualcom 2018-01-11
18 Qualcom 2018-05-29
19 Qualcom 2018-10-06
20 Qualcom 2018-11-11
21 Qualcom 2019-08-17
22 Qualcom 2019-02-22
23 Qualcom 2019-10-16
24 Qualcom 2018-06-22
25 Qualcom 2018-06-14
26 Qualcom 2018-06-16
27 Syscin 2018-02-10
28 Syscin 2019-02-16
29 Syscin 2018-04-12
30 Syscin 2018-08-22
31 Syscin 2018-09-16
32 Syscin 2019-04-20
33 Syscin 2018-02-28
34 Syscin 2018-01-19
Considering today's date as 1st January 2020, I want to write code to find the number of times each company name occurs in the last 3 months. For example, suppose that from 1st Oct 2019 to 1st Jan 2020 global_infotech's name appears 5 times; then 5 should appear in front of every global_infotech row, like:
company_name date appearance_count_last_3_months
0 global_infotech 2019-06-15 5
1 global_infotech 2020-03-22 5
2 global_infotech 2020-08-30 5
3 global_infotech 2018-06-19 5
4 global_infotech 2018-06-15 5
5 global_infotech 2018-02-15 5
6 global_infotech 2018-11-22 5
7 global_infotech 2019-01-15 5
8 global_infotech 2018-12-15 5
9 global_infotech 2019-06-15 5
10 global_infotech 2018-12-19 5
11 global_infotech 2019-12-31 5
12 global_infotech 2019-02-18 5
13 global_infotech 2018-06-16 5
14 global_infotech 2019-02-10 5
15 global_infotech 2019-03-15 5
IIUC, you can create a custom function:
def getcount(company, month=3, df=df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
    df = df[df['company_name'].eq(company)]
    # max count over month-sized bins for this company
    val = df.groupby(pd.Grouper(key='date', freq=f'{month}M')).count().max().iloc[0]
    df['appearance_count_last_3_months'] = val
    return df

getcount('global_infotech')
# or equivalently:
getcount('global_infotech', 3)
Update:
Since you have 92 different companies, you can use a for loop:
lst = []
for x in df['company_name'].unique():
    lst.append(getcount(x))
out = pd.concat(lst)
If you print `out`, you will get your desired output.
You can first filter the data for the last 3 months, then group by company name and merge the counts back into the original dataframe.
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
# sample data
df = pd.DataFrame({
    'company_name': ['global_infotech', 'global_infotech', 'Qualcom', 'another_company'],
    'date': ['2019-02-18', '2021-07-02', '2021-07-01', '2019-02-18']
})
df['date'] = pd.to_datetime(df['date'])
# filter for last 3 months
summary = df[df['date']>=datetime.now()-relativedelta(months=3)]
# groupby then aggregate with desired column name
summary = summary.rename(columns={'date':'appearance_count_last_3_months'})
summary = summary.groupby('company_name')
summary = summary.agg('count')
# merge summary back into original df, filling missing values with 0
df = df.merge(summary, left_on='company_name', right_index=True, how='left')
df['appearance_count_last_3_months'] = df['appearance_count_last_3_months'].fillna(0).astype('int')
# result:
df
company_name date appearance_count_last_3_months
0 global_infotech 2019-02-18 1
1 global_infotech 2021-07-02 1
2 Qualcom 2021-07-01 1
3 another_company 2019-02-18 0
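A vectorized sketch of the same idea is also possible, without a per-company loop. This assumes the fixed cutoff of 1st January 2020 stated in the question; the intermediate names below are illustrative:
import pandas as pd

cutoff = pd.Timestamp('2020-01-01')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# True for rows that fall within the 3 months before the cutoff
recent = df['date'].between(cutoff - pd.DateOffset(months=3), cutoff)
# per-company count of recent rows, broadcast back onto every row
counts = recent.groupby(df['company_name']).sum()
df['appearance_count_last_3_months'] = df['company_name'].map(counts).astype(int)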

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter TableA, taking into account only those rows whose "TotalInvoice" field is within the minimum and maximum values expressed in ViewB, based on month and year values and RepairShopId (the sample data only has one RepairShopId, but the full data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA
RepairOrderDataId  RepairShopId  LastUpdated              TotalInvoice
1                  10            2017-06-01 07:00:00.000  765
1                  10            2017-06-05 12:15:00.000  765
2                  10            2017-02-25 13:00:00.000  400
3                  10            2017-10-19 12:15:00.000  295679
4                  10            2016-11-29 11:00:00.000  133409.41
5                  10            2016-10-28 12:30:00.000  127769
6                  10            2016-11-25 16:15:00.000  122400
7                  10            2016-10-18 11:15:00.000  1950
8                  10            2016-11-07 16:45:00.000  79342.7
9                  10            2016-11-25 19:15:00.000  1950
10                 10            2016-12-09 14:00:00.000  111559
11                 10            2016-11-28 10:30:00.000  106333
12                 10            2016-12-13 18:00:00.000  23847.4
13                 10            2016-11-01 17:00:00.000  22782.9
14                 10            2016-10-07 15:30:00.000  NULL
15                 10            2017-01-06 15:30:00.000  138958
16                 10            2017-01-31 13:00:00.000  244484
17                 10            2016-12-05 09:30:00.000  180236
18                 10            2017-02-14 18:30:00.000  92752.6
19                 10            2016-10-05 08:30:00.000  161952
20                 10            2016-10-05 08:30:00.000  8713.08
ViewB
RepairShopId  Orders  Average        MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343         370343            370343            2015  7      2015-7
10            1       109645         109645            109645            2015  10     2015-10
10            1       148487         148487            148487            2015  12     2015-12
10            1       133409.41      133409.41         133409.41         2016  3      2016-3
10            1       19261          19261             19261             2016  8      2016-8
10            4       10477.3575     2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565   10                90942.6052417394  2016  10     2016-10
10            98      22312.077244   10                147265.581935242  2016  11     2016-11
10            96      20068.147395   10                99974.1750708773  2016  12     2016-12
10            86      25334.053372   10                184186.985160105  2017  1      2017-1
10            69      21410.63855    10                153417.00126689   2017  2      2017-2
10            100     13009.797      10                59002.3589332934  2017  3      2017-3
10            101     11746.191287   10                71405.3391452842  2017  4      2017-4
10            123     11143.49756    10                55306.8202091131  2017  5      2017-5
10            197     15980.55406    10                204538.144334771  2017  6      2017-6
10            99      10852.496969   10                63283.9899761938  2017  7      2017-7
10            131     52601.981526   10                1314998.61355187  2017  8      2017-8
10            124     10983.221854   10                59444.0535811233  2017  9      2017-9
10            115     12467.148434   10                72996.6054527277  2017  10     2017-10
10            123     14843.379593   10                129673.931373139  2017  11     2017-11
10            111     8535.455945    10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field "LastUpdated".
Any help is much appreciated!
Here is how you can do it. I assumed LastUpdated is the column in TableA that indicates the date of each record:
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
AND YEAR(LastUpdated) = B.year
AND MONTH(LastUpdated) = B.month

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way to assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
For a one-line solution, use DataFrame.join and rename the columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
Or match the order of isocalendar's output (year, week, day) and assign directly:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
If you need a different column order, select the columns explicitly:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print (data)
SALESDATE a Day Week Year
0 2017-04-03 0 1 14 2017
1 2017-04-04 1 2 14 2017
2 2017-04-05 2 3 14 2017
3 2017-04-06 3 4 14 2017
4 2017-04-07 4 5 14 2017
5 2017-04-08 5 6 14 2017
6 2017-04-09 6 7 14 2017
7 2017-04-10 7 1 15 2017
8 2017-04-11 8 2 15 2017
9 2017-04-12 9 3 15 2017
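Another small variation, assuming pandas 1.1+ where .dt.isocalendar() returns a DataFrame, is to compute it once and assign the columns explicitly:
iso = data.SALESDATE.dt.isocalendar()
data = data.assign(Day=iso.day, Week=iso.week, Year=iso.year)
This keeps SALESDATE untouched and makes the new column names explicit.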

pandas: rolling mean on time interval plus grouping on index

I am trying to find the 7-day rolling average for the hour of day for a category. The data frame is indexed on the category id and there is a time stamp plus other columns:
id name ds time x y z
6 red 2020-02-14 00:00:00 10 20 30
6 red 2020-02-14 01:00:00 20 40 50
6 red 2020-02-14 02:00:00 20 20 60
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30
7 green 2020-02-14 01:00:00 20 40 50
7 green 2020-02-14 02:00:00 20 20 60
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
What I would like as output (with the rolling columns filled by the rolling mean where not NaN):
id name ds time x y z rolling_x rolling_y rolling_z
6 red 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
6 red 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
6 red 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
7 green 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
7 green 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
My approach:
df = df.assign(day=df['ds time'].dt.normalize(),
               hour=df['ds time'].dt.hour)
ret_df = df.merge(df.drop('ds time', axis=1)
                    .set_index('day')
                    .groupby(['id', 'hour']).rolling('7D').mean()
                    .drop(['hour', 'id'], axis=1),
                  on=['id', 'hour', 'day'],
                  how='left',
                  suffixes=['', '_roll']
                  ).drop(['day', 'hour'], axis=1)
Sample data:
import numpy as np
import pandas as pd

dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')
np.random.seed(1)
df = pd.DataFrame({
'id': np.repeat([6,7], len(dates)),
'ds time': np.tile(dates,2),
'X': np.arange(len(dates)*2),
'Y': np.random.randint(0,10, len(dates)*2)
})
df.head()
Output ret_df.head():
id ds time X Y X_roll Y_roll
0 6 2020-02-21 00:00:00 0 5 0.0 5.0
1 6 2020-02-21 01:00:00 1 8 1.0 8.0
2 6 2020-02-21 02:00:00 2 9 2.0 9.0
3 6 2020-02-21 03:00:00 3 5 3.0 5.0
4 6 2020-02-21 04:00:00 4 0 4.0 0.0
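If the hour-of-day grouping is not actually needed and you only want a plain 7-day rolling mean per id over the timestamp, a simpler sketch (assuming the sample df above) would be:
rolled = (df.set_index('ds time')
            .groupby('id')[['X', 'Y']]
            .rolling('7D').mean()
            .add_suffix('_roll')
            .reset_index())
ret_df = df.merge(rolled, on=['id', 'ds time'], how='left')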

hive rank over grouped value

Gurus, I stumbled on a Hive ranking problem. I would like to rank the transactions in each day (with no repeating rank value for the same trx value):
date hour trx rnk
18/03/2018 0 1 24
18/03/2018 1 2 23
18/03/2018 2 3 22
18/03/2018 3 4 21
18/03/2018 4 5 20
18/03/2018 5 6 19
18/03/2018 6 7 18
18/03/2018 7 8 17
18/03/2018 8 9 16
18/03/2018 9 10 15
18/03/2018 10 11 14
18/03/2018 11 12 13
18/03/2018 12 13 12
18/03/2018 13 14 11
18/03/2018 14 15 10
18/03/2018 15 16 9
18/03/2018 16 17 8
18/03/2018 17 18 7
18/03/2018 18 19 6
18/03/2018 19 20 5
18/03/2018 20 21 4
18/03/2018 21 22 3
18/03/2018 22 23 2
18/03/2018 23 24 1
17/03/2018 0 1 24
17/03/2018 1 2 23
17/03/2018 2 3 22
17/03/2018 3 4 21
17/03/2018 4 5 20
17/03/2018 5 6 19
17/03/2018 6 7 18
17/03/2018 7 8 17
17/03/2018 8 9 16
17/03/2018 9 10 15
17/03/2018 10 11 14
17/03/2018 11 12 13
17/03/2018 12 13 12
17/03/2018 13 14 11
17/03/2018 14 15 10
17/03/2018 15 16 9
17/03/2018 16 17 8
17/03/2018 17 18 7
17/03/2018 18 19 6
17/03/2018 19 20 5
17/03/2018 20 21 4
17/03/2018 21 22 3
17/03/2018 22 23 2
17/03/2018 23 24 1
Here is my code:
select a.date, a.hour, trx, rank() over (order by a.trx) as rnk from(
select date,hour, count(*) as trx from smy_tb
group by date, hour
)a
limit 100;
The problems are:
1. the rank value is repeated for the same trx value
2. the rank value continues into the next date (it should be grouped by date, so each date should only return 24 rank values)
Need advice, thank you.
You should partition by the date column and use a specific ordering:
rank() over (partition by a.date order by a.hour desc)
As explained by #BKS, this is the resolved code:
select a.date, a.hour, trx, row_number() over (partition by a.date order by a.trx desc) as rnk from(
select date,hour, count(*) as trx from smy_tb
group by date, hour
)a
limit 100;