Thank you for taking the time to read my likely silly question.
I have time series data in a pandas dataframe and would like to plot two separate financial years as two separate lines, with the month as theta and the number of queries received each month as r.
df['FY'] = np.where(df['call_DT'] < "01/04/2020", "Financial Year 1", "Financial Year 2")
df['Month'] = df['call_DT'].dt.month
df = df.sort_values('Month')
df = df.groupby('Month')
df = df['Count of queries'].sum().reset_index()
df = df.set_index('Month')
fig = px.line_polar(df,r="Count of queries",theta=df.index)
plot(fig)
I understand that I am dropping every column other than 'Count of queries' (though I am not sure where in the code this happens), which would explain why a color grouping cannot be shown.
However, with r="Count of queries" and theta="Month", no graph is displayed at all. I have clearly butchered this by not properly understanding the code. Any help would be appreciated.
Edit:
A snippet of the used columns for this task. I group the data by month and sum the count of queries column. I want to differentiate the two lines in the radial chart by financial year, rather than year, so I included the 'B/A' column to differentiate between them.
       call_DT               B/A  Count of queries
2   2021-05-17  Financial Year 2                 1
5   2021-05-17  Financial Year 2                 1
16  2021-05-14  Financial Year 2                 1
18  2021-05-14  Financial Year 2                 1
26  2021-05-14  Financial Year 2                 1
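For what it's worth, a minimal sketch of how the grouping could keep the financial-year column so that it can drive the line color (column names are taken from the snippet above; the sample values are made up):

```python
import pandas as pd
import numpy as np

# toy data shaped like the snippet above (values are made up)
df = pd.DataFrame({
    'call_DT': pd.to_datetime(['2019-05-14', '2019-05-17',
                               '2021-05-14', '2021-05-17']),
    'Count of queries': [1, 1, 1, 1],
})
df['FY'] = np.where(df['call_DT'] < '2020-04-01',
                    'Financial Year 1', 'Financial Year 2')
df['Month'] = df['call_DT'].dt.month

# group by BOTH financial year and month so the FY column survives
monthly = df.groupby(['FY', 'Month'], as_index=False)['Count of queries'].sum()

# with plotly installed, color='FY' then draws one line per financial year:
# fig = px.line_polar(monthly, r='Count of queries', theta='Month', color='FY')
# fig.show()
```

The key point is that grouping by 'Month' alone discards the FY column, so there is nothing left for plotly to color by.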
So I have a dataframe with all the boroughs in London with their average house price from the years 1995-2021.
What I am trying to do is compile a new dataframe that takes the most expensive borough for each year.
The column names for the original df are: [London_Borough, ID, Average_price, Year]
At first I figured I could loop over each year, create a temporary df holding each borough and its price for that particular year, and from there extract the max value for the average price.
For example:
for i in range(1995, 2022):
    temp = df[df['Year'] == i]
    yr_max = temp['Average_price'].max()
The problem with this is that while I get the highest average price for each year, all I have is the number, without the corresponding borough.
Is there any way I can extract the entire row? Or at least just the borough and the price?
This honestly might just be a simple syntax problem, but I have scoured my notes and online resources and cannot find a way to locate a row given a particular value in one column.
The only solution I could think of is to first reset the index of the temporary df, build a list of that year's average prices, loop through the list until it matches the max price, and then use that list index to locate the row in the temporary df. But that is not an acceptable solution: it is over-complicated (hardly in the spirit of Occam's razor), and the course I am taking is for data science, where efficiency is a guiding principle.
This will be a complete change to my original answer.
You can use groupby to create a DataFrame containing 'Year' and 'Average_price' and use indexes to merge it with the original DataFrame:
df = pd.DataFrame([
    [1999, 1252, "Barnet"],
    [1999, 1525, "Enfield"],
    [2001, 1524, "Camden"]],
    columns=['year', 'price', 'london_borough'])

idx = df.groupby('year').agg({'price': 'max'}).reset_index().set_index(['year', 'price'])
df.set_index(['year', 'price'], inplace=True)
And merge the two DataFrames on index from idx:
df = df.merge(idx, left_index = True, right_index = True, how = 'right')
You can also avoid setting indexes and use column names.
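A sketch of that column-based variant, using the same toy data (no index juggling needed):

```python
import pandas as pd

df = pd.DataFrame([
    [1999, 1252, "Barnet"],
    [1999, 1525, "Enfield"],
    [2001, 1524, "Camden"]],
    columns=['year', 'price', 'london_borough'])

# per-year maximum price, kept as ordinary columns
idx = df.groupby('year', as_index=False).agg({'price': 'max'})

# an inner merge on the columns keeps only the rows that hit a per-year maximum
out = df.merge(idx, on=['year', 'price'])
```

Merging on columns rather than indexes tends to be easier to read, and you keep the original frame intact.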
If I understand what you want correctly, you can use one of these two approaches:
Approach 1: keeping your loop (not recommended):
for i in range(1995, 2022):
    temp = df[df['Year'] == i]
    yr_max = temp[temp['Average_price'] == temp['Average_price'].max()]
Approach 2: using pandas built-in methods:
df.loc[df.groupby(['Year'])['Average_price'].idxmax()]
For example, using the following input:
Year Average_price london_borough
0 1999 1320 Barnet
1 1999 810 Enfield
2 1999 2250 Ealing
3 2000 1524 Bexley
4 2000 810 Camden
5 2000 1524 Brent
6 2001 1524 Barnet
7 2001 2540 Barnet
8 2001 810 Ealing
9 2002 1524 Camden
10 2002 3000 Ealing
11 2002 1524 Brent
you'll get the output:
>>> print(df.loc[df.groupby(['Year'])['Average_price'].idxmax()])
Year Average_price london_borough
2 1999 2250 Ealing
3 2000 1524 Bexley
7 2001 2540 Barnet
10 2002 3000 Ealing
And if you want to access a specific year you can do:
>>> yr_max = df.loc[df.groupby(['Year'])['Average_price'].idxmax()]
>>> yr_max[yr_max['Year'] == 1999]
Year Average_price london_borough
2 1999 2250 Ealing
Hi, I have a time-series data set and would like to make a new column for each month.
data:
creationDate fre skill
2019-02-15T20:43:29Z 14 A
2019-02-15T21:10:32Z 15 B
2019-03-22T07:14:50Z 41 A
2019-03-22T06:47:41Z 64 B
2019-04-11T09:49:46Z 25 A
2019-04-11T09:49:46Z 29 B
output:
skill 2019-02 2019-03 2019-04
A 14 41 25
B 15 64 29
I know I can do this manually, as below, making one column at a time (given date1_start and date1_end):
dfdate1=data[(data['creationDate'] >= date1_start) & (data['creationDate']<= date1_end)]
But since I have many, many months, it is not feasible to do this for each month.
Use DataFrame.pivot after converting the datetimes to month periods with Series.dt.to_period:
df['dates'] = pd.to_datetime(df['creationDate']).dt.to_period('m')
df = df.pivot(index='skill', columns='dates', values='fre')
Or convert to custom YYYY-MM strings with Series.dt.strftime:
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')
df = df.pivot(index='skill', columns='dates', values='fre')
EDIT:
ValueError: Index contains duplicate entries, cannot reshape
If you get this error, it means there are duplicate skill/date pairs; use DataFrame.pivot_table with an aggregation function, e.g. sum or mean:
df = df.pivot_table(index='skill',columns='dates',values='fre', aggfunc='sum')
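Putting it together on the sample data from the question (a sketch; pivot_table is used here, so the duplicate case is covered as well):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'creationDate': ['2019-02-15T20:43:29Z', '2019-02-15T21:10:32Z',
                     '2019-03-22T07:14:50Z', '2019-03-22T06:47:41Z',
                     '2019-04-11T09:49:46Z', '2019-04-11T09:49:46Z'],
    'fre': [14, 15, 41, 64, 25, 29],
    'skill': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# month labels as YYYY-MM strings
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')

# one column per month, one row per skill; duplicate pairs are summed
out = df.pivot_table(index='skill', columns='dates', values='fre', aggfunc='sum')
```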
I have monthly data of 6 variables from 2014 until 2018 in one dataset.
I'm trying to draw 6 subplots (one for each variable) with monthly X axis (Jan, Feb....) and 5 series (one for each year) with their legend.
This is part of the data:
I created 5 series (one for each year) per variable (30 in total) and I'm getting the expected output but using MANY lines of code.
What is the best way to achieve this using less lines of code?
This is an example how I created the series:
CL2014 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2014)[0:12]
CL2015 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2015)[12:24]
This is an example of how I'm plotting the series:
axCL.plot(xvals, CL2014)
axCL.plot(xvals, CL2015)
axCL.plot(xvals, CL2016)
axCL.plot(xvals, CL2017)
axCL.plot(xvals, CL2018)
There's no need to litter your namespace with 30 variables. Seaborn makes the job very easy but you need to normalize your dataframe first. This is what "normalized" or "unpivoted" looks like (Seaborn calls this "long form"):
Date variable value
2014-01-01 Charity Lottery ...
2014-01-01 Racecourse ...
2014-04-01 Bingo Halls ...
2014-04-01 Casino ...
Your screenshot is a "pivoted" or "wide form" dataframe.
import seaborn as sns

df_plot = pd.melt(df, id_vars='Date')
df_plot['Year'] = df_plot['Date'].dt.year
df_plot['Month'] = df_plot['Date'].dt.strftime('%b')

plot = sns.catplot(data=df_plot, x='Month', y='value',
                   row='Year', col='variable', kind='bar',
                   sharex=False)
plot.savefig('figure.png', dpi=300)
Result (all numbers are randomly generated):
I would try using .groupby(); it is really powerful for paring down data like this:
for _, group in data_total.groupby(['year', 'month'])[[x_variable, y_variable]]:
    plt.plot(group[x_variable], group[y_variable])
Here the groupby separates your data_total DataFrame into year/month subsets, and the [[...]] on the end pares each subset down to x_variable (assuming it is a column of data_total) and y_variable, which can be whichever feature you are interested in.
I would decompose your datetime column into separate year and month columns, then use those new columns inside the groupby as ['year', 'month']. You might be able to pass dt.year and dt.month directly like you had before... not sure, try it both ways!
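One way to realize that suggestion, sketched on a toy frame (the column names and the single 'Charity Lottery' variable are assumptions; grouping by the new year column yields one series per year, each of which would feed one plt.plot call):

```python
import pandas as pd

# toy stand-in for data_total (names are assumptions)
data_total = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-01', '2014-02-01',
                            '2015-01-01', '2015-02-01']),
    'Charity Lottery': [10, 12, 9, 14],
})

# decompose the datetime column into year and month first
data_total['year'] = data_total['Date'].dt.year
data_total['month'] = data_total['Date'].dt.month

# one (x, y) pair per year; each would become plt.plot(x, y, label=year)
lines = {year: (group['month'].tolist(), group['Charity Lottery'].tolist())
         for year, group in data_total.groupby('year')}
```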
I have a data frame with perfectly organised timestamps, like below:
It's a web log, and the timestamps run through the whole year. I want to split them by day, count the visits within each hour, and plot every day in the same figure, stacked together, just like the picture shown below:
I am doing well at cutting them into days and plotting a single day's visits individually, but I am having trouble plotting all the days and stacking them together. The primary tools I am using are Pandas and Matplotlib.
Any advice or suggestions? Much appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density through the whole period using the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!
I think you need pivot:
#load the timestamps from https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5 into a list L
df = pd.DataFrame({'date':L})
print (df.head())
date
0 2014-08-01 00:05:46
1 2014-08-01 00:14:47
2 2014-08-01 00:16:05
3 2014-08-01 00:20:46
4 2014-08-01 00:23:22
#convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'] )
#resample by Hours, get count and create df
df = df.resample('H', on='date').size().to_frame('count')
#extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
#pivot and plot
#maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
#one line per day, hours on the x-axis
df.pivot(index='hours', columns='days', values='count').plot(rot=90)
I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have the following table of food consumption data:
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must then process my data to get a DataFrame like:
Suppose the first table is already in a Dataframe called food, how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists and then swap keys with values.
Then group by the food column mapped through the dict and by year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24