Pandas adding row to categorical index - pandas

I have a scenario where I would like to group my datasets by personally defined week indexes, which are then averaged, and to aggregate those averages into a "Total" row. I am able to achieve the first half of this, but when I try to append/insert a new "Total" row that sums these rows I receive error messages.
I attempted to create this row via two different methods:
Method 1:
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()
TypeError: cannot append a non-category item to a CategoricalIndex
Method 2:
week_index_avg_unit.index.insert(['Total'], week_index_avg_unit.sum())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have used the first approach in this scenario multiple times, but this is the first time I'm cutting the data into multiple categories, and I can clearly see that the CategoricalIndex type is the problem.
Here is the format of my data:
        date  organic  ppc   oa  other  content_partnership  total year_month  day  weekday weekday_name week_index
0 2018-01-01      379  251  197     51                    0    878    2018-01    1        0       Monday     Week 1
1 2018-01-02      880  527  405    217                    0   2029    2018-01    2        1      Tuesday     Week 1
2 2018-01-03      859  589  403    323                    0   2174    2018-01    3        2    Wednesday     Week 1
3 2018-01-04      835  533  409    335                    0   2112    2018-01    4        3     Thursday     Week 1
4 2018-01-05      760  449  355    272                    0   1836    2018-01    5        4       Friday     Week 1
Here is the code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
historicals = pd.read_csv("2018-2019_plants.csv")
# Capture dates for additional date columns
date_col = pd.to_datetime(historicals['date'])
historicals['year_month'] = date_col.dt.strftime("%Y-%m")
historicals['day'] = date_col.dt.day
historicals['weekday'] = date_col.dt.dayofweek
historicals['weekday_name'] = date_col.dt.day_name()
# create week ranges segment (7 day range)
historicals['week_index'] = pd.cut(historicals['day'],[0,7,14,21,28,32], labels=['Week 1','Week 2','Week 3','Week 4','Week 5'])
# Week Index Average (Units)
# df_monthly_average is a list of the numeric columns to average (defined elsewhere, not shown)
week_index_avg_unit = historicals[df_monthly_average].groupby(['week_index']).mean().astype(int)
type(week_index_avg_unit.index)
pandas.core.indexes.category.CategoricalIndex
Here is the week_index_avg_unit table:
organic ppc oa other content_partnership total day weekday
week_index
Week 1 755 361 505 405 22 2027 4 3
Week 2 787 360 473 337 19 1959 11 3
Week 3 781 382 490 352 18 2006 18 3
...

pd.CategoricalIndex is a special animal. It is immutable, so you cannot assign a row label that is not already one of its categories; you first need to register the new category, e.g. with pd.CategoricalIndex.add_categories or pd.CategoricalIndex.set_categories.
See pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.CategoricalIndex.html
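A minimal sketch of that fix on the frame from the question; add_categories returns a new index (the original is immutable), so reassign it before inserting the row:

week_index_avg_unit.index = week_index_avg_unit.index.add_categories(['Total'])
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()

If you no longer need the categorical behaviour at this point, converting with week_index_avg_unit.index = week_index_avg_unit.index.astype(str) is an even simpler way to make the assignment work.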

Related

how to apply a function to each row in a pandas data frame to generate a new column

I have a pandas data frame with a date index and a function (prevRate, below) to work out if a value has increased or decreased from the previous day.
def prevRate(currentDate, df):
    prevDate = currentDate.date() - timedelta(days=1)
    prevRate = df.loc[prevDate, 'Total per Million'].item()
    currentRate = df.loc[currentDate.date(), 'Total per Million'].item()
    if prevRate > currentRate:
        rateChange = 'inc'
    else:
        rateChange = 'dec'
    return rateChange
The data frame has the date as an index, and I want to apply the prevRate function to each of the index values to create a new field with each value as either 'inc' or 'dec'.
I tried applying the function as below:
df['rateChange']=df.apply(prevRate(df.index.get_level_values(0).item().to_pydatetime(),df))
and just using the function as in
df['rateChange']=prevRate(df.index.item().to_pydatetime(),df)
These approaches seem to extract the whole index rather than one item at a time.
The current error message is:
ValueError Traceback (most recent call last)
<ipython-input-69-0d45be10dffa> in <module>
----> 1 df['rateChange']=df.apply(prevRate(df.index.item().to_pydatetime(),df))
329 if len(self) == 1:
330 return next(iter(self))
--> 331 raise ValueError("can only convert an array of size 1 to a Python scalar")
332
333 @property
ValueError: can only convert an array of size 1 to a Python scalar
Any suggestions as to how to apply the function to each row and generate a new column?
IIUC you just need to compare each row with the previous one. One way to do it is with np.where.
Since you didn't provide any sample data, here is the input I used for demonstration:
np.random.seed(20)
df = pd.DataFrame({
    'Date': pd.date_range('2022-01-01', '2022-01-10'),
    'num': np.random.randint(0, 100, size=10)
})
print(df)
Date num
0 2022-01-01 99
1 2022-01-02 90
2 2022-01-03 15
3 2022-01-04 95
4 2022-01-05 28
5 2022-01-06 90
6 2022-01-07 9
7 2022-01-08 20
8 2022-01-09 75
9 2022-01-10 22
df['rateChange'] = np.where(df.num > df.num.shift(), 'inc', 'dec')  # the first row compares against NaN, so it falls into 'dec'
print(df)
Output:
Date num rateChange
0 2022-01-01 99 dec
1 2022-01-02 90 dec
2 2022-01-03 15 dec
3 2022-01-04 95 inc
4 2022-01-05 28 dec
5 2022-01-06 90 inc
6 2022-01-07 9 dec
7 2022-01-08 20 inc
8 2022-01-09 75 inc
9 2022-01-10 22 dec
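The same shift-and-compare idea carries over to the frame in the question; a sketch, assuming the date index is sorted and the column is named 'Total per Million' as shown there:

df = df.sort_index()
df['rateChange'] = np.where(df['Total per Million'] > df['Total per Million'].shift(), 'inc', 'dec')

This avoids the row-wise apply entirely, so the "size 1" restriction of .item() never comes up.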

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, ideally with an explanation of what I am doing wrong, either in the code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
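Running that against the toy dataset from the question reproduces the expected means (4, 1 and 2.5); a self-contained sketch using the column names given there:

import pandas as pd

df = pd.DataFrame({
    'Date':  ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014'] + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id':    [1234] * 14,
    'Dow':   [0] * 9 + [1] * 5,
    'Hour':  [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})
# daily totals per Id/Dow/Hour, then the mean of those totals over dates
daily = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
print(daily.groupby(['Id', 'Dow', 'Hour'])['Count'].mean())
# Id    Dow  Hour
# 1234  0    9       4.0
#            10      1.0
#       1    11      2.5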
You can also use the groupby function on the 'Id' column and then resample, taking the sum; in modern pandas that is .resample('H').sum() rather than the deprecated how='sum' keyword.
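A rough sketch of that resample route, assuming Start_date is the DatetimeIndex as in the question. One caveat: resampling fills hours with no tickets as 0, which pulls the means down compared with averaging only over observed dates, so pick whichever definition of "mean per hour" you actually want.

hourly = df.groupby('Id').resample('H')['Count'].sum().reset_index()
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
print(hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean())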

difference in two date column in Pandas

I am trying to get the difference between two date columns; the script and the data used in it are below, but I am getting the same result for all three rows.
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results like those in the DifferenceinDays column, which was calculated in Excel, but in Python I get the same value for all three rows. Please refer to the code below. Can anyone tell me how to calculate the difference between two date columns? I am trying to get the number of hours between them.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, divide the timedelta by np.timedelta64(1, 'h').
Additionally, it looks like Excel calculates the hours differently; I'm unsure why.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35
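An equivalent spelling of the same conversion via the timedelta accessor, assuming Start and End are already datetime64 as above:

delta = df['End'] - df['Start']
df['hrs'] = delta.dt.total_seconds() / 3600  # same 9947.35 values as dividing by np.timedelta64(1, 'h')
df['days'] = delta.dt.days                   # the whole-day component of each difference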

Pandas doesn't split EIA API data into two different columns for easy access

I am importing EIA data which contains weekly storage data. The first column is the reported week and the second is storage.
When I import the data it shows two columns. First column has no title and second one as following title "Weekly Lower 48 States Natural Gas Working Underground Storage, Weekly (Billion Cubic Feet)".
I would like to plot the data using matplotlib, but I need to separate the columns first. I used df.iloc[100:,:0], which gives the first column (the week), but I somehow cannot separate the second column.
import eia
import pandas as pd
import os
api_key = "mykey"
api = eia.API(api_key)
series_search = api.data_by_series(series='NG.NW2_EPG0_SWO_R48_BCF.W')
df = pd.DataFrame(series_search)
df1 = df.iloc[100:,:0]
Code output (this is a sample of all 486 rows). When I use the df.shape command it shows (486, 1) when it should show (486, 2):
2010 0101 01 3117
2010 0108 08 2850
2010 0115 15 2607
2010 0122 22 2521
2019 0322 22 1107
2019 0329 29 1130
2019 0405 05 1155
2019 0412 12 1247
2019 0419 19 1339
You can first cut the last 3 characters of the string and then convert it to datetime:
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')
print(df)
Date Value
0 2010-01-01 3117
1 2010-01-08 2850
2 2010-01-15 2607
3 2010-01-22 2521
4 2019-03-22 1107
5 2019-03-29 1130
6 2019-04-05 1155
7 2019-04-12 1247
8 2019-04-19 1339
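If the remaining trouble is that the dates sit in the index rather than in their own column, here is a sketch of one way to get two addressable columns before the conversion above (the short names are my own; the long EIA series title can be renamed to anything):

df = df.reset_index()
df.columns = ['Date', 'Value']  # replace the index name and the long EIA title
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')
df.plot(x='Date', y='Value')    # both columns are now directly accessible for matplotlib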

groupby pandas dataframe, take difference between value of latest and earliest date

I have a Cumulative column, and I want to group by index and take the values corresponding to the latest date minus the values corresponding to the earliest date.
Very similar to this: group by pandas dataframe and select latest in each group.
But taking the difference between the latest and earliest in each group.
I'm a python rookie, and here is my solution:
import pandas as pd
from io import StringIO
csv = StringIO("""index id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01""")
df = pd.read_table(csv, sep='\s+',index_col='index')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_sort = df.sort_values('date')
df_sort.drop(['product'], axis=1, inplace=True)
df_sort.groupby('id').tail(1).set_index('id') - df_sort.groupby('id').head(1).set_index('id')
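A more compact spelling of the same idea, assuming the goal is last minus first per id after sorting by date:

g = df.sort_values('date').groupby('id')['date']
print(g.last() - g.first())

groupby preserves the sorted row order within each group, so .first() and .last() pick out the earliest and latest dates respectively.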