expand year values to month in pandas

I have sales by year:
pd.DataFrame({'year': [2015, 2016, 2017], 'value': [12, 24, 36]})
   year  value
0  2015     12
1  2016     24
2  2017     36
I want to extrapolate to months:
yyyymm  value
201501      1    (i.e. 12/12, etc.)
201502      1
...
201512      1
201601      2
...
201712      3
Any suggestions?

One idea is to use a cross join with a helper DataFrame, convert the columns to strings, and pad with a leading 0 via Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a': 1})
df = df.assign(a=1).merge(df1).drop(columns='a')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year':'yyyymm'})
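On pandas 1.2+, the helper column can be skipped entirely with merge(..., how='cross'); a minimal equivalent sketch:
# cross join without the helper column (requires pandas >= 1.2)
df = df.merge(pd.DataFrame({'m': range(1, 13)}), how='cross')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year': 'yyyymm'})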
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm','m'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print(df.head(15))
    yyyymm  value
0   201501     12
1   201502     12
2   201503     12
3   201504     12
4   201505     12
5   201506     12
6   201507     12
7   201508     12
8   201509     12
9   201510     12
10  201511     12
11  201512     12
12  201601     24
13  201602     24
14  201603     24
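The question's expected output spreads each yearly figure evenly across the months; if that is what's needed, divide the value column by 12 afterwards (a one-line sketch, assuming value is numeric as above):
df['value'] = df['value'] // 12   # 12 -> 1, 24 -> 2, 36 -> 3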

Related

Filter rows of a table based on a condition that implies: 1) value of a field within a range 2) id of the business and 3) date?

I want to filter TableA, keeping only those rows whose TotalInvoice field is within the minimum and maximum values given in ViewB, matched on month, year, and RepairShopId (the sample data only has one RepairShopId, but the full data has multiple IDs).
In the view I have minimum and maximum values for each business and each month and year.
TableA

RepairOrderDataId  RepairShopId  LastUpdated              TotalInvoice
1                  10            2017-06-01 07:00:00.000  765
1                  10            2017-06-05 12:15:00.000  765
2                  10            2017-02-25 13:00:00.000  400
3                  10            2017-10-19 12:15:00.000  295679
4                  10            2016-11-29 11:00:00.000  133409.41
5                  10            2016-10-28 12:30:00.000  127769
6                  10            2016-11-25 16:15:00.000  122400
7                  10            2016-10-18 11:15:00.000  1950
8                  10            2016-11-07 16:45:00.000  79342.7
9                  10            2016-11-25 19:15:00.000  1950
10                 10            2016-12-09 14:00:00.000  111559
11                 10            2016-11-28 10:30:00.000  106333
12                 10            2016-12-13 18:00:00.000  23847.4
13                 10            2016-11-01 17:00:00.000  22782.9
14                 10            2016-10-07 15:30:00.000  NULL
15                 10            2017-01-06 15:30:00.000  138958
16                 10            2017-01-31 13:00:00.000  244484
17                 10            2016-12-05 09:30:00.000  180236
18                 10            2017-02-14 18:30:00.000  92752.6
19                 10            2016-10-05 08:30:00.000  161952
20                 10            2016-10-05 08:30:00.000  8713.08
ViewB

RepairShopId  Orders  Average       MinimumValue      MaximumValue      year  month  yearMonth
10            1       370343        370343            370343            2015  7      2015-7
10            1       109645        109645            109645            2015  10     2015-10
10            1       148487        148487            148487            2015  12     2015-12
10            1       133409.41     133409.41         133409.41         2016  3      2016-3
10            1       19261         19261             19261             2016  8      2016-8
10            4       10477.3575    2656.65644879821  18298.0585512018  2016  9      2016-9
10            69      15047.709565  10                90942.6052417394  2016  10     2016-10
10            98      22312.077244  10                147265.581935242  2016  11     2016-11
10            96      20068.147395  10                99974.1750708773  2016  12     2016-12
10            86      25334.053372  10                184186.985160105  2017  1      2017-1
10            69      21410.63855   10                153417.00126689   2017  2      2017-2
10            100     13009.797     10                59002.3589332934  2017  3      2017-3
10            101     11746.191287  10                71405.3391452842  2017  4      2017-4
10            123     11143.49756   10                55306.8202091131  2017  5      2017-5
10            197     15980.55406   10                204538.144334771  2017  6      2017-6
10            99      10852.496969  10                63283.9899761938  2017  7      2017-7
10            131     52601.981526  10                1314998.61355187  2017  8      2017-8
10            124     10983.221854  10                59444.0535811233  2017  9      2017-9
10            115     12467.148434  10                72996.6054527277  2017  10     2017-10
10            123     14843.379593  10                129673.931373139  2017  11     2017-11
10            111     8535.455945   10                50328.1495501884  2017  12     2017-12
I've tried:
SELECT *
FROM TableA
INNER JOIN ViewB ON TableA.RepairShopId = ViewB.RepairShopId
WHERE TotalInvoice > MinimumValue AND TotalInvoice < MaximumValue
AND TableA.RepairShopId = ViewB.RepairShopId
But I'm not sure how to compare the yearMonth field with the datetime field LastUpdated.
Any help is much appreciated!
Here is how you can do it. I assumed LastUpdated is the column from TableA that indicates the date of each repair order:
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
AND YEAR(LastUpdated) = B.year
AND MONTH(LastUpdated) = B.month
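If LastUpdated is indexed, a sketch of an alternative join that compares against a month's date range instead of wrapping the column in YEAR()/MONTH(), so the index stays usable (assumes SQL Server 2012+ for DATEFROMPARTS):
SELECT *
FROM TableA A
INNER JOIN ViewB B
ON A.RepairShopId = B.RepairShopId
AND A.TotalInvoice > B.MinimumValue
AND A.TotalInvoice < B.MaximumValue
-- half-open range on the bare column keeps the predicate sargable
AND A.LastUpdated >= DATEFROMPARTS(B.year, B.month, 1)
AND A.LastUpdated <  DATEADD(MONTH, 1, DATEFROMPARTS(B.year, B.month, 1))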

How to group data weekly in column and hourly in row

I have data like the following:
ID  SalesTime            Qty  Unit Price  Item
1   01/01/2021 08:10:00  10   10          A
2   01/01/2021 11:30:00  2    9           B
3   01/01/2021 11:59:50  1    8           C
4   01/02/2021 13:00:00  5    15          D
5   01/03/2021 10:00:00  4    10          A
6   01/03/2021 12:00:00  5    9           B
7   01/03/2021 12:50:00  6    15          D
8   01/04/2021 10:50:00  5    8           C
9   01/04/2021 11:10:00  2    10          A
10  ............
I want to summarize the totals into the following form, for example:
             Mon  Tue  Wed  Thu  Fri  Sat  Sun
08:00~09:59   20   21   50  100   60   70  210
10:00~11:59   60   25   60   90   75   80  200
12:00~13:59  100   10   50   60   70   50  150
How can I do that in MS SQL? Thanks a lot.
You can extract the hour and divide by two for the rows, then use conditional aggregation for the columns. Assuming you want the total of the price times quantity:
select convert(time, dateadd(hour, 2 * (datepart(hour, salestime) / 2), 0)) as hh,
       sum(case when datename(weekday, salestime) = 'Monday' then qty * unit_price end) as mon,
       sum(case when datename(weekday, salestime) = 'Tuesday' then qty * unit_price end) as tue,
       . . .
from t
group by datepart(hour, salestime) / 2
order by min(salestime);
Note: This just returns the beginning of the time period, rather than the full range.
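If the full range label is wanted instead, it can be built from the bucket start; a sketch assuming SQL Server 2012+ (for FORMAT) and the same table t:
-- build a '08:00~09:59' style label for each two-hour bucket
select concat(
       format(dateadd(hour, 2 * (datepart(hour, salestime) / 2), 0), 'HH:mm'),
       '~',
       format(dateadd(minute, 119, dateadd(hour, 2 * (datepart(hour, salestime) / 2), 0)), 'HH:mm')
       ) as hh_range
from t
group by datepart(hour, salestime) / 2;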

Pandas groupby time and ID and aggregate

I am trying to calculate the sum of payments made in the 2nd half of the year minus the sum of payments made in the 1st half.
This is how the data may look:
ID date payment
1 1/1/2020 10
1 1/2/2020 11
1 1/3/2020 10
1 1/4/2020 10
1 1/5/2020 11
1 1/6/2020 10
1 1/7/2020 10
1 1/8/2020 11
1 1/9/2020 10
1 1/10/2020 32
1 1/11/2020 10
1 1/12/2020 12
2 1/1/2020 10
2 1/2/2020 10
2 1/3/2020 41
2 1/4/2020 10
2 1/5/2020 53
2 1/6/2020 10
2 1/7/2020 10
2 1/8/2020 44
2 1/9/2020 10
2 1/10/2020 2
2 1/11/2020 9
2 1/12/2020 5
I convert the date column to pandas datetime:
df.date = df.date.astype(str).str.slice(0, 10)
df.date = pd.to_datetime(df.date)
print(df.date.min(), df.date.max())
output: 2020-01-01 00:00:00 2020-12-01 00:00:00
Then I create time points and separate DataFrames for the 1st and 2nd halves of the year:
from datetime import datetime
from dateutil.relativedelta import relativedelta

observation_date = '2020-12-31'
observation_date = datetime.strptime(observation_date, '%Y-%m-%d')
observation_date = observation_date.date()
observation_date = pd.Timestamp(observation_date)
print(observation_date)
mo6_ago = observation_date - relativedelta(months=6)
mo6_ago = pd.Timestamp(mo6_ago)
print(mo6_ago)
mo6_ago_plus1 = observation_date - relativedelta(months=6) + relativedelta(days=1)
mo6_ago_plus1 = pd.Timestamp(mo6_ago_plus1)
print(mo6_ago_plus1)
mo12_ago = observation_date - relativedelta(months=12) + relativedelta(days=1)
mo12_ago = pd.Timestamp(mo12_ago)
print(mo12_ago)
output:
2020-12-31 00:00:00
2020-06-30 00:00:00
2020-07-01 00:00:00
2020-01-01 00:00:00
mask = (df['date'] >= mo12_ago) & (df['date'] <= mo6_ago)
first_half = df.loc[mask]
first_half = first_half[['ID','date','payment']]
print(first_half.date.min(),first_half.date.max())
output: 2020-01-01 00:00:00 2020-06-01 00:00:00
mask = (df['date'] >= mo6_ago_plus1) & (df['date'] <= observation_date)
sec_half = df.loc[mask]
sec_half = sec_half[['ID','date','payment']]
print(sec_half.date.min(),sec_half.date.max())
output: 2020-07-01 00:00:00 2020-12-01 00:00:00
Then I group and sum each half of the year and merge the results into one df like this:
sum_first_half = first_half.groupby(['ID'])['payment'].sum().reset_index()
sum_first_half = sum_first_half.rename(columns = {'payment':'payment_first_half'})
sum_sec_half = sec_half.groupby(['ID'])['payment'].sum().reset_index()
sum_sec_half = sum_sec_half.rename(columns = {'payment':'payment_sec_half'})
df_new = pd.merge(sum_first_half, sum_sec_half, how='outer', on='ID')
Finally I subtract the two columns this way:
df_new['sec_minus_first'] = df_new['payment_sec_half'] - df_new['payment_first_half']
ID  payment_first_half  payment_sec_half  sec_minus_first
1   62                  85                23
2   134                 80                -54
Is there a faster and more memory efficient way of doing this?
Using datetime:
from datetime import datetime as dt
Convert date column to datetime:
df["date"] = pd.to_datetime(df["date"])
Split on a date of your choice, group by ID, sum each half, then subtract the halves:
df.loc[df['date'] >= dt(2020, 7, 1)].groupby("ID")["payment"].sum() - df.loc[df['date'] < dt(2020, 7, 1)].groupby("ID")["payment"].sum()
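A minimal alternative sketch that avoids the boolean masks and intermediate frames, assuming df has the parsed date column as above (the 'first_half'/'sec_half' labels are illustrative):
import numpy as np

# label each row by half-year, then pivot the per-ID sums side by side
half = np.where(df['date'].dt.month <= 6, 'first_half', 'sec_half')
out = df.groupby(['ID', half])['payment'].sum().unstack()
out['sec_minus_first'] = out['sec_half'] - out['first_half']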

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way to assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
For a one-line solution, use DataFrame.join, renaming the columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print(data)
   SALESDATE  a  Year  Week  Day
0 2017-04-03  0  2017    14    1
1 2017-04-04  1  2017    14    2
2 2017-04-05  2  2017    14    3
3 2017-04-06  3  2017    14    4
4 2017-04-07  4  2017    14    5
5 2017-04-08  5  2017    14    6
6 2017-04-09  6  2017    14    7
7 2017-04-10  7  2017    15    1
8 2017-04-11  8  2017    15    2
9 2017-04-12  9  2017    15    3
Or change the order of the list and assign:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print(data)
   SALESDATE  a  Year  Week  Day
0 2017-04-03  0  2017    14    1
1 2017-04-04  1  2017    14    2
2 2017-04-05  2  2017    14    3
3 2017-04-06  3  2017    14    4
4 2017-04-07  4  2017    14    5
5 2017-04-08  5  2017    14    6
6 2017-04-09  6  2017    14    7
7 2017-04-10  7  2017    15    1
8 2017-04-11  8  2017    15    2
9 2017-04-12  9  2017    15    3
If you need a different order of the values in the list:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print(data)
   SALESDATE  a  Day  Week  Year
0 2017-04-03  0    1    14  2017
1 2017-04-04  1    2    14  2017
2 2017-04-05  2    3    14  2017
3 2017-04-06  3    4    14  2017
4 2017-04-07  4    5    14  2017
5 2017-04-08  5    6    14  2017
6 2017-04-09  6    7    14  2017
7 2017-04-10  7    1    15  2017
8 2017-04-11  8    2    15  2017
9 2017-04-12  9    3    15  2017
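Note that isocalendar() requires pandas 1.1.0+ and returns UInt32 extension-dtype columns; if plain integers are preferred downstream, a cast can be chained (a sketch of the same assignment):
# cast the UInt32 extension dtypes to plain int64
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day', 'week', 'year']].astype('int64')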

From 10 years of data, I want to select only calendar days with max or min value

Ok, so I have a dataset of temperatures for each day of the year, over a period of ten years. Index is date converted to datetime.
I want to get a dataset with only the min and max value for each calendar day throughout the 10-year period.
I can convert the index to a string, remove the year and get the dataset that way, but I'm guessing there is a smarter way to do it.
Use strftime on the DatetimeIndex and aggregate by GroupBy.agg with min and max:
import numpy as np
import pandas as pd

np.random.seed(2020)
d = pd.date_range('2000-01-01', '2010-12-31')
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)
print(df)
            temp
2000-01-01     0
2000-01-02     8
2000-01-03     3
2000-01-04    22
2000-01-05     3
...          ...
2010-12-27    16
2010-12-28    10
2010-12-29    28
2010-12-30     1
2010-12-31    28

[4018 rows x 1 columns]
df = df.groupby(df.index.strftime('%m-%d'))['temp'].agg(['min','max'])
print(df)
       min  max
01-01    0   28
01-02    0   29
01-03    3   21
01-04    1   28
01-05    0   26
...    ...  ...
12-27    3   29
12-28    4   27
12-29    0   29
12-30    1   29
12-31    2   28

[366 rows x 2 columns]
Last, for datetimes it is possible to add a year back (be careful with leap years; 2000 is used here because it is a leap year, so 02-29 parses correctly):
df.index = pd.to_datetime('2000-' + df.index, format='%Y-%m-%d')
print(df)
            min  max
2000-01-01    0   28
2000-01-02    0   29
2000-01-03    3   21
2000-01-04    1   28
2000-01-05    0   26
...         ...  ...
2000-12-27    3   29
2000-12-28    4   27
2000-12-29    0   29
2000-12-30    1   29
2000-12-31    2   28

[366 rows x 2 columns]
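An alternative sketch that skips the string formatting and groups directly on the index's month and day attributes, starting from the original df built above (the result is keyed by a (month, day) MultiIndex instead of 'MM-DD' strings):
# group on integer (month, day) pairs instead of formatted strings
out = df.groupby([df.index.month, df.index.day])['temp'].agg(['min', 'max'])
out.index.names = ['month', 'day']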