I am using pandas groupby to group rows with duplicate dates and average their pm25 values into a single value per date. However, when I use the groupby function, the structure of my dataframe changes and I can no longer call the 'Date' column.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
Here is my current code:
Before using df.groupby my df looks like:
[screenshot: df before groupby]
I use groupby:
df.groupby('Date').mean('pm25')
print(df)
[screenshot: df after groupby]
And after, I cannot call the 'Date' column anymore or sort the column
print(df['Date'])
Returns just
KeyError: 'Date'
Please help, or please let me know what else I can provide.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
This is because your Date column's dtype is string, not datetime. In string comparison, the third character 1 of 1/10/19 is smaller than the third character 2 of 1/2/19. If you want to keep the original sequence, you can do the following:
df['Date'] = pd.to_datetime(df['Date']) # Convert Date column to datetime type
df['Date'] = df['Date'].dt.strftime('%m/%d/%y') # Convert datetime to other formats (but the dtype of column will be string)
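For illustration, a minimal sketch with made-up dates showing the difference between the two sort orders:
import pandas as pd

s = pd.Series(['1/1/19', '1/2/19', '1/10/19', '1/11/19'])
print(s.sort_values().tolist())
# ['1/1/19', '1/10/19', '1/11/19', '1/2/19']   <- lexicographic (string) order
print(pd.to_datetime(s, format='%m/%d/%y').sort_values().dt.strftime('%m/%d/%y').tolist())
# ['01/01/19', '01/02/19', '01/10/19', '01/11/19']   <- chronological order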
And after, I cannot call the 'Date' column anymore or sort the column
This is because after grouping by the Date column, the returned dataframe uses the grouped Date values as its index to represent each group:
pm25
Date
01/01/19 8.50
01/02/19 9.20
01/03/19 7.90
01/04/19 8.90
01/05/19 6.00
After doing df.groupby('Date').mean('pm25'), the returned dataframe above means that the mean pm25 value of the 01/01/19 group is 8.50, and so on.
If you want to retrieve the Date column back from the index, you can call reset_index() after the groupby,
df.groupby('Date').mean('pm25').reset_index()
which gives
Date pm25
0 01/01/19 8.50
1 01/02/19 9.20
2 01/03/19 7.90
3 01/04/19 8.90
4 01/05/19 6.00
5 01/06/19 6.75
6 01/11/19 8.50
7 01/12/19 9.20
8 01/21/19 9.20
Or set the as_index argument of pandas.DataFrame.groupby() to False:
df.groupby('Date', as_index=False).mean('pm25')
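Putting the two fixes together, a minimal sketch (assuming only the Date and pm25 columns shown above, with Date stored as m/d/yy strings):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')   # real dates, so sorting is chronological
daily = df.groupby('Date', as_index=False)['pm25'].mean()    # Date stays a column, so no KeyError
daily = daily.sort_values('Date')
print(daily['Date'])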
Related
I have a DataFrame containing customer purchase information. I would like to groupby the date column to determine the total sales each day. My problem is that there are some days with no purchases whatsoever. I would like the groupby object to include those missing days as groups with the total sales equal to 0. Is there a way to pass a list of the values of the date column to the groupby function? MWE below
>>> df = pd.DataFrame({
... 'purchase_id': [1, 2, 3, 4],
... 'date' : ['1900-01-01', '1900-01-01', '1900-01-03', '1900-01-04'],
... 'cost' : [1.00, 0.25, 0.50, 0.75]
... })
This groupby gives the following result:
>>> df.groupby('date').agg({'cost':'sum'})
date cost
'1900-01-01' 1.25
'1900-01-03' 0.50
'1900-01-04' 0.75
What command can I execute to obtain the following result instead? (Obviously I will have to pass the dates I am expecting to see as an argument, which is fine.)
date cost
'1900-01-01' 1.25
'1900-01-02' 0.00
'1900-01-03' 0.50
'1900-01-04' 0.75
You can use reindex:
df.groupby('date').agg({'cost':'sum'}).reindex(your_new_date_list, fill_value=0)
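For the example above, a concrete sketch (expected_dates is just an illustrative stand-in for your_new_date_list; fill_value=0 turns the missing days into 0.00 rather than NaN):
expected_dates = ['1900-01-01', '1900-01-02', '1900-01-03', '1900-01-04']
df.groupby('date').agg({'cost': 'sum'}).reindex(expected_dates, fill_value=0)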
However, I'd recommend you convert your data to datetime type, then you can use resample:
df['date'] = pd.to_datetime(df['date'])
df.resample('d', on='date')['cost'].sum().reset_index()
Output:
date cost
0 1900-01-01 1.25
1 1900-01-02 0.00
2 1900-01-03 0.50
3 1900-01-04 0.75
I have a dataframe indexed on datetime, with the following output:
2022-04-08 21:59:49 7651.8 7655.8
2022-04-08 21:59:50 7651.7 7655.7
2022-04-08 21:59:54 7651.7 7655.7
2022-04-08 21:59:55 7651.8 7655.8
2022-04-08 09:47:00 7544.9 7545.9
A row is valid if its datetime value is greater than or equal to that of the previous row (and the first row is always valid).
Therefore, in the extract above, the only invalid row is the last one, as its datetime does not meet this condition.
I have managed to remove the offending row by:
df.drop(df.loc[df.index.to_series().diff() < pd.to_timedelta('0 seconds')].index, inplace=True)
But this looks a little convoluted. Is there a simpler way to achieve this?
df.index.to_series().diff() < pd.to_timedelta('0 seconds') returns a boolean Series, so you can use boolean indexing to select the rows to keep:
df = df.loc[~(df.index.to_series().diff() < pd.to_timedelta('0 seconds'))]
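As a quick sanity check on the five example rows above (a sketch assuming the index holds exactly those timestamps): diff() returns NaT for the first row, and NaT compared with a timedelta is False, so the first row is always kept; only the backwards jump is flagged.
mask = df.index.to_series().diff() < pd.to_timedelta('0 seconds')
print(mask.tolist())   # [False, False, False, False, True]
df = df.loc[~mask]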
I am trying to calculate the sum of the rows with the maximum date per country; if a country has more than one province, it should add up the confirmed cases at that maximum date. For example:
This is the input that I have [screenshot: input table] and the output should be:
[screenshot: expected output]
So the output for China is 90, which is the sum of Tianjin and Xinjiang for the maximum date, 02-03-2020.
And since Argentina does not have any provinces, its output is 20 for the highest date, which again is the same date as above.
The strategy is to sort the values so that the most recent date is the first row within each Country/Region-Province/State pair, then roll the dataset up twice, keeping only the max-date row between roll-ups.
First, sorting to put most recent dates at the top of each group:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False))
Date Country/Region Province/State Confirmed
3 02-03-2020 China Xinjiang 70
2 01-03-2020 China Xinjiang 30
1 02-03-2020 China Tianjin 20
0 01-03-2020 China Tianjin 10
Then rolling up to Country/Region-Province/State and taking the most recent date:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first())
Date Confirmed
Country/Region Province/State
China Tianjin 02-03-2020 20
Xinjiang 02-03-2020 70
Finally, rolling up again to just Country/Region:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first()
.groupby('Country/Region').sum())
Confirmed
Country/Region
China 90
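One caveat, hedged because it depends on your pandas version: newer releases no longer silently drop the non-numeric Date column inside .sum(), so it can be safer to select the Confirmed column explicitly before summing:
(df
 .sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
 .groupby(['Country/Region', 'Province/State'])
 .first()
 .groupby('Country/Region')['Confirmed']
 .sum())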
If you fill in your empty province values, you can use groupby with idxmax to pull out the rows with the latest Date, and then another groupby to sum the Confirmed values.
input['Date'] = pd.to_datetime(input['Date'], format="%d-%m-%Y")
input = input.fillna("dummy")
(input
 .loc[input.groupby(["Country/Region", "Province/State"]).Date.idxmax()]
 .groupby(["Country/Region"])["Confirmed"]
 .sum())
A basic solution is to chain a few pandas methods that jointly solve the problem.
First convert the Date column from string to datetime.
Then sort by the Date column.
Then groupby Country/Region and Date, aggregating the Confirmed column with sum. Finally drop_duplicates() keeping the last row per country, which gives you the latest date for any particular country.
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by='Date', inplace=True)
df.groupby(['Country/Region', 'Date']).agg({'Confirmed': 'sum'}).reset_index().drop_duplicates(subset='Country/Region', keep='last')
df.groupby(['Date', 'Country/Region'], as_index=False).sum().groupby(['Country/Region']).agg('last').reset_index()
Country/Region Date Confirmed
0 Argentina 2020-03-02 20
1 China 2020-03-02 90
2 France 2020-03-02 70
I want to calculate age from the DOB field, but in my code I am hard-coding the end date. I need to do it dynamically, i.e. today - DOB. Similarly, I also want to calculate duration from start_date. My data frame looks like:
id dob start_date
77 30/09/1990 2019-04-13 15:27:22
65 15/12/1988 2018-12-26 23:28:12
3 08/12/2000 2018-12-26 23:28:17
What I have so far - for the age calculation:
df= df.withColumn('dob',to_date(unix_timestamp(F.col('dob'),'dd/MM/yyyy').cast("timestamp")))
end_date = '3/09/2019'
end_date = pd.to_datetime(end_date, format="%d/%m/%Y")
df= df.withColumn('end_date',F.unix_timestamp(F.lit(end_date),'dd/mm/yyyy').cast("timestamp"))
df = df.withColumn('age', (F.datediff(F.col('end_date'), F.col('dob')))/365)
df= df.withColumn("age", func.round(df["age"], 0))
For duration calculation -
end_date_1 = '2019-09-30'
end_date_1 = pd.to_datetime(end_date_1, format="%Y-%m-%d")
df= df.withColumn('end_date_1',F.unix_timestamp(F.lit(end_date_1),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
df= df.withColumn('duration', (F.datediff(F.col('end_date_1'), F.col('created_at'))))
In the above two snippets I have hard-coded two values, end_date and end_date_1, but I want them to be based on today's date. How do I do that in pyspark?
You can use date.today() to get today's date, since you are using Python with Spark.
More information about date formatting can be found in the official Python documentation.
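A minimal sketch of what that could look like (this assumes dob has already been parsed to a date/timestamp as in the first snippet above, and uses the start_date column from the sample data rather than created_at):
from datetime import date
import pyspark.sql.functions as F

today_str = date.today().isoformat()                      # 'yyyy-MM-dd', e.g. '2019-09-30'
df = df.withColumn('end_date', F.to_date(F.lit(today_str)))
df = df.withColumn('age', F.round(F.datediff(F.col('end_date'), F.col('dob')) / 365, 0))
df = df.withColumn('duration', F.datediff(F.col('end_date'), F.col('start_date')))
pyspark also has F.current_date(), which gives today's date directly on the Spark side and avoids the Python-side string altogether.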
I have a pandas dataframe df with the following format
date value team
12/8/2015 1.2 'A'
12/8/2015 1.3 'A'
12/7/2015 1.2 'A'
12/6/2015 1.3 'B'
12/6/2015 1.1 'B'
12/7/2015 1.3 'B'
...............................
What I want is a figure with two curves, one per group, with date as the x value and the average value for that date as the y value. What bothers me is that the date format does not seem to be handled correctly, as Python complains
"Could not convert 12/8/2015... to numeric"
for label, group in df.groupby('team'):
    group.plot(x=group['date'], y=group['date'].mean(), label=label)
You first need to convert your date to a timestamp.
df['date'] = pd.to_datetime(df.date)
Then you can group and unstack to get your desired data:
>>> df.groupby(['date', 'team']).sum().unstack('team')
value
team 'A' 'B'
date
2015-12-06 NaN 2.4
2015-12-07 1.2 1.3
2015-12-08 2.5 NaN
Add .plot() and you should get your desired result.
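For completeness, a sketch of that last step (using .mean() rather than the .sum() shown above, since the question asks for the average per date; matplotlib is assumed to be available):
import matplotlib.pyplot as plt

ax = df.groupby(['date', 'team'])['value'].mean().unstack('team').plot()
ax.set_ylabel('average value')
plt.show()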