My dataframe is structured like this:
ACTIVITY 2014 2015 2016 2017 2018
WALK 198 501 485 394 461
RUN 187 446 413 371 495
JUMP 45 97 88 103 78
JOG 1125 2150 2482 2140 2734
SLIDE 1156 2357 2530 2044 1956
My visualization goal: a FacetGrid of bar charts showing percentage change over time, with each bar positive or negative depending on that year's change. Each facet is an ACTIVITY type, if that makes sense; for example, one facet would be a bar plot for WALK, another for RUN, and so on. The x-axis would be time (2014, 2015, 2016, etc.) and the y-axis would be the %change value for each year.
In my analysis, I added %change columns for every year except the baseline 2014, using a simple pct_change() function that takes two columns from the df and returns a new calculated column:
df['%change_2015'] = pct_change(df['2014'],df['2015'])
df['%change_2016'] = pct_change(df['2015'],df['2016'])
... etc.
So with these new columns, I think I have the elements I need for my visualization goal. How can I do it with seaborn FacetGrid, specifically with bar plots?
Augmented dataframe (slice view):
ACTIVITY 2014 2015 2016 2017 2018 %change_2015 %change_2016
WALK 198 501 485 394 461 153.03 -3.19
RUN 187 446 413 371 495 xyz xyz
JUMP 45 97 88 103 78 xyz xyz
I tried reading through the seaborn documentation but had trouble understanding the configuration options: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
Is the problem the way my dataframe is ordered and structured? I hope all of that made sense; I appreciate any help with this.
Use:
import pandas as pd
import seaborn as sns

# Small example in the same shape as the augmented dataframe
cols = ['ACTIVITY', '%change_2015', '%change_2016']
data = [['Jump', '10.1', '-3.19'], ['Run', '9.35', '-3.19'], ['Run', '4.35', '-1.19']]
df = pd.DataFrame(data, columns=cols)

# Reshape to long form: one row per (ACTIVITY, %change column, value)
dfm = pd.melt(df, id_vars=['ACTIVITY'], value_vars=['%change_2015', '%change_2016'])
dfm['value'] = dfm['value'].astype(float)

# One facet per %change column, with the activities on the x-axis
g = sns.FacetGrid(dfm, col='variable')
g.map(sns.barplot, 'ACTIVITY', 'value')
Output: (one bar-chart facet per %change column, with the activities on the x-axis)
Based on your comment:
# One facet per ACTIVITY instead, with the %change columns on the x-axis
g = sns.FacetGrid(dfm, col='ACTIVITY')
g.map(sns.barplot, 'variable', 'value')
Output: (one bar-chart facet per ACTIVITY, with the %change columns on the x-axis)
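On recent seaborn versions the same faceted bar chart can also be drawn in a single call with catplot, which handles the facetting for you (a sketch using the melted dfm from above):
sns.catplot(data=dfm, x='variable', y='value', col='ACTIVITY', kind='bar')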
I want to plot three lines for Turkey, the UK and the OECD through the years, but those countries are not columns, so I am struggling to find a way to plot them.
I get this df via
df = df.loc[df["Variable"].eq("Relative advantage")
& df["Country"].isin(["United Kingdom", "Türkiye", "OECD - Total"])]
Year   Country   Value
1990   Turkiye      20
1980   UK           34
1992   UK           32
1980   OECD         29
1992   OECD         23
You can use the pivot_table() method to do this. An example:
import pandas as pd
# Set up example dataframe
df = pd.DataFrame([
    [1990, 'Turkiye', 20],
    [1992, 'Turkiye', 22],
    [1990, 'UK', 34],
    [1992, 'UK', 32],
    [1990, 'OECD', 29],
    [1992, 'OECD', 23],
], columns=["year", "country", "value"])
# Pivot so countries are now columns
table = df.pivot_table(values='value', columns='country', index='year')
This creates a dataframe where the countries are columns:
country OECD Turkiye UK
year
1990 29 20 34
1992 23 22 32
(I changed some of the dates to make it work out a bit more nicely.)
Then I plot it:
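A minimal plotting call that draws one line per country column would be something like:
import matplotlib.pyplot as plt

table.plot(marker='o')  # one line per country, since each country is now a column
plt.ylabel('value')
plt.show()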
UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row (of the large_df) if it is somewhat close to a significant flood.
I have 2 pandas dataframes (1 large and 1 small).
In every iteration I want to create a subset of the small dataframe based on a few conditions that are dependent on each row (of the large df):
import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5
NUMBER_OF_ROWS = 2e4

large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]

large_df['similar_past_flood_tag'] = None
count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        past_index = df_tmp.loc[df_tmp.index.max()]['a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'
print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')
The line that creates the subset takes a long time to compute:
The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.
This seems far too long to me. I have found similar questions, but none of the answers seemed to work (e.g. query and numpy conditions).
Is there a way to optimize this?
Note: the code does what is expected - just very slow.
While your code is logically correct, building the many boolean arrays and slicing the DataFrame on every iteration adds up.
Here are some stats with %timeit:
(small_df.index<ind): ~30μs
(small_df['a']>(row['a']-SOME_THRESHOLD)): ~100μs
(small_df['a']<(row['a']+SOME_THRESHOLD)): ~100μs
After '&'-ing all three: ~500μs
Including the DataFrame slice: ~700μs
That, multiplied by 20K iterations, is indeed about 14 seconds. :)
What you could do is take advantage of numpy's broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:
# Pull the indexes and values out as numpy arrays
l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

# Broadcast to a (len(large_df), len(small_df)) boolean matrix:
# arr1 compares the indexes, arr2 checks that the 'a' values are within the threshold
arr1 = (l_ind[:, None] < s_ind[None, :])
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))
arr = arr1 & arr2

# Row/column positions of the True cells, mapped back to the original index labels
large_valid_inds, small_valid_inds = np.where(arr)
pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
              'small_ind': np.take(s_ind, small_valid_inds)})
That gives you the following DF, which if I understood the question properly, is the expected solution:
    large_ind  small_ind
0        1621       1631
1        1622       1631
2        1623       1631
3        1624       1631
4        1625       1631
5        1626       1631
6        1627       1631
7        1628       1631
8        1629       1631
9        1630       1631
10       1992       2002
11       1993       2002
12       1994       2002
13       1995       2002
14       1996       2002
15       1997       2002
16       1998       2002
17       1999       2002
18       2000       2002
19       2001       2002
20       8751       8761
21       8752       8761
22       8753       8761
23       8754       8761
24       8755       8761
25       8756       8761
26       8757       8761
27       8758       8761
28       8759       8761
29       8760       8761
30      10516      10526
31      10517      10526
32      10518      10526
33      10519      10526
34      10520      10526
35      10521      10526
36      10522      10526
37      10523      10526
38      10524      10526
39      10525      10526
40      18448      18458
41      18449      18458
42      18450      18458
43      18451      18458
44      18452      18458
45      18453      18458
46      18454      18458
47      18455      18458
48      18456      18458
49      18457      18458
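If you also want the tag column from the question rather than just the index pairs, the pairs can be mapped back onto large_df. A rough sketch of that extra step (my addition, not part of the answer above; it keeps the highest matching small index per large row, mirroring df_tmp.index.max() in the original loop):
pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})
# For each large-flood row keep one matching small-flood index (here: the largest)
match = pairs.groupby('large_ind')['small_ind'].max()
large_df.loc[match.index, 'similar_past_flood_tag'] = [
    f"Similar to the large flood of {small_df.loc[s, 'a']}" for s in match
]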
In pandas, for loops are much slower than column operations, so changing the calculation to loop over small_df instead of large_df will already give a big improvement:
for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing
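For instance, a concrete version of that reversed loop might look like this (my own sketch, reusing the column and tag names from the question; where a large row matches several small floods, the last one iterated wins rather than the one with the highest index):
for ind, row in small_df.iterrows():
    # All large rows that come after this small flood and have a similar 'a' value
    mask = (large_df.index > ind) & ((large_df['a'] - row['a']).abs() < SOME_THRESHOLD)
    large_df.loc[mask, 'similar_past_flood_tag'] = f"Similar to the large flood of {row['a']}"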
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on equal values but on approximately equal ones. To use this approach, you should truncate your column and use that for the merge. Here's a hacky example.
# Bucket 'a' into bins of width 2 * SOME_THRESHOLD and merge on the bin
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')

# Repeat with the bins shifted by half a bucket to catch pairs that straddle a bin edge
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')

total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose additional filters.
You can impose the additional filters on the result later.
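For example, a rough post-processing step could look like the sketch below. It is not from the original answer: it resets the indexes into ordinary columns (small_idx / large_idx are names introduced here) so the "earlier flood" condition from the question can be re-imposed along with the exact threshold check:
small_r = small_df.reset_index().rename(columns={'index': 'small_idx'})
large_r = large_df.reset_index().rename(columns={'index': 'large_idx'})

candidates = pd.concat([
    small_r.merge(large_r, on='a_rounded', suffixes=('_small', '_large')),
    small_r.merge(large_r, on='a_rounded2', suffixes=('_small', '_large')),
]).drop_duplicates(subset=['small_idx', 'large_idx'])

# Re-impose the exact conditions from the question
result = candidates[(candidates['small_idx'] < candidates['large_idx']) &
                    ((candidates['a_small'] - candidates['a_large']).abs() < SOME_THRESHOLD)]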
I am trying to assess the impact of a promotional campaign on our customers. The goal is to assess revenue from the point the promotion was offered. However, the promotion was offered to different customers at different points in time. How do I rearrange the data into Month 0, Month 1, Month 2, Month 3, where Month 0 is the month the customer first got the promotion?
With the code below you can get your desired output:
# Create DataFrame
import pandas as pd

df = pd.DataFrame({"Account": [1, 2, 3, 4, 5, 6],
                   "May-18": [181, 166, 221, 158, 210, 159],
                   "Jun-18": [178, 222, 230, 189, 219, 200],
                   "Jul-18": [184, 207, 175, 167, 201, 204],
                   "Aug-18": [161, 174, 178, 233, 223, 204],
                   "Sep-18": [218, 209, 165, 165, 204, 225],
                   "Oct-18": [199, 206, 205, 196, 212, 205],
                   "Nov-18": [231, 196, 189, 218, 234, 235],
                   "Dec-18": [173, 178, 189, 218, 234, 205],
                   "Promotion Month": ["Sep-18", "Aug-18", "Jul-18", "May-18", "Aug-18", "Jun-18"]})
df = df.set_index("Account")
cols = ["May-18", "Jun-18", "Jul-18", "Aug-18", "Sep-18", "Oct-18", "Nov-18", "Dec-18", "Promotion Month"]
df = df[cols]

# Define function to select the promotion month and the three months that follow it
def selectMonths(row):
    cols = df.columns.to_list()
    colMonth0 = cols.index(row["Promotion Month"])
    colsOut = cols[colMonth0:colMonth0 + 4]
    out = pd.Series(row[colsOut].to_list())
    return out

# Apply the function and set the index and columns of the output DataFrame
out = df.apply(selectMonths, axis=1)
out.index = df.index
out.columns = ["Month 0", "Month 1", "Month 2", "Month 3"]
Then the output you get is:
>>> out
Month 0 Month 1 Month 2 Month 3
Account
1 218 199 231 173
2 174 209 206 196
3 175 178 165 205
4 158 189 167 233
5 223 204 212 234
6 200 204 204 225
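From there, a hypothetical next step (not part of the original answer) to assess the campaign would be to average each month offset across accounts:
# Average revenue per month offset across all accounts
print(out.mean())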
I have a dataframe that looks like the one below; the date is the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)) but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using year.
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
In case you want to use the years as separate series and compare day to day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a month-day column
df["month_day"] = (df["date_column"].dt.month).astype(str).str.zfill(2) + \
"-" + df["date_column"].dt.day.astype(str).str.zfill(2)
# Plot. Pivot will create for each year a column and these columns will be used as series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
I have 3 separate dataframes of the same shape with the following data.
# for 2015
Grave Crimes     Cases Recorded   Mistake of Law fact
Abduction                   725                     3
Kidnapping                  246                     6
Arson                       466                     1
Mischief                    436                     1
House Breaking            12707                    21
Grievous Hurt              1299                     3
# for 2016
Grave Crimes     Cases Recorded   Mistake of Law fact
Abduction                   738                     4
Kidnapping                  297                     9
Arson                       486                     4
Mischief                    394                     1
House Breaking            10287                    14
Grievous Hurt              1205                     0
# for 2017
Grave Crimes     Cases Recorded   Mistake of Law fact
Abduction                   647                     2
Kidnapping                  251                    10
Arson                       418                     3
Mischief                    424                     0
House Breaking             8913                    12
Grievous Hurt              1075                     1
I want to plot each column (say 'Cases Recorded', for example) against the 'Grave Crimes' type, grouped by year. My current panel is as follows. I did not set any indexes when creating the panel, and as shown above, none of the dataframes has a column indicating the year.
pnl = pd.Panel({2015: df15, 2016: df16, 2017: df17})
My expected output is shown below. Can someone help me on this?
Panels have been deprecated in pandas as of 0.20.1 and have since been removed entirely.
We can use pd.concat with keys to combine those dataframes into a single dataframe, then use reshaping and pandas plot.
# Assumes the three dataframes are named df2015, df2016 and df2017
# (with the question's df15/df16/df17, adjust the names and the rename lambda accordingly)
l = ['df2015', 'df2016', 'df2017']
df_out = pd.concat([eval(i) for i in l], keys=l)\
    .set_index('Grave Crimes', append=True)['Cases Recorded'].unstack(0)\
    .rename(columns=lambda x: x[2:]).reset_index(0, drop=True)
df_out.plot.bar()
Output: (a grouped bar chart of 'Cases Recorded' per crime type, with one bar per year)
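If you keep the names from the question (df15, df16, df17), the same reshape can also be written without eval by concatenating a dict keyed by year, a sketch along the same lines as the answer above:
frames = {'2015': df15, '2016': df16, '2017': df17}
df_out = (pd.concat(frames)
            .set_index('Grave Crimes', append=True)['Cases Recorded']
            .unstack(0)
            .reset_index(level=0, drop=True))
df_out.plot.bar()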