Remove column name from legend in pandas barplot

I have a dataframe as below
batsman non_striker partnershipRuns
0 SK Raina A Flintoff 23
1 SK Raina DR Smith 90
2 SK Raina F du Plessis 36
3 SK Raina JA Morkel 14
10 MS Dhoni CK Kapugedera 18
11 MS Dhoni DJ Bravo 51
12 MS Dhoni F du Plessis 27
13 MS Dhoni JA Morkel 12
14 MS Dhoni JDP Oram 6
I have been able to create a stacked bar plot using
df1=df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)
df1.plot(kind='bar',stacked=True,legend=True)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
This results in the column name being included in the legend as part of a tuple, as shown in the figure.
How can I not have the column name and just have the value for the legend?
Please help

To prevent a MultiIndex in the columns, specify the column to aggregate after the groupby. The fillna is also not necessary here; add the parameter fill_value=0 to unstack instead:
df1=df.groupby(['batsman','non_striker'])['partnershipRuns'].sum().unstack(fill_value=0)
Another solution with pivot_table:
df1=df.pivot_table(index='batsman',
columns='non_striker',
values='partnershipRuns',
aggfunc='sum',
fill_value=0)
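As a sketch, either approach can be checked on a small sample built from the data above (the frame here is an illustrative subset, not the full data):

```python
import pandas as pd

# small sample of the question's data
df = pd.DataFrame({"batsman": ["SK Raina", "SK Raina", "MS Dhoni"],
                   "non_striker": ["A Flintoff", "DR Smith", "DJ Bravo"],
                   "partnershipRuns": [23, 90, 51]})

df1 = df.groupby(["batsman", "non_striker"])["partnershipRuns"].sum().unstack(fill_value=0)
# columns are now plain non_striker names rather than tuples,
# so df1.plot(kind='bar', stacked=True) gives a clean legend
print(list(df1.columns))
```

Missing combinations (e.g. MS Dhoni with A Flintoff) come out as 0 thanks to fill_value, which matches what fillna(0) did before.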

An easy fix would be to do:
df1=df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)['partnershipRuns']
instead of:
df1=df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)
Why? Because your aggregation creates a MultiIndex in the columns. When you plot, the legend then shows the full tuples (here, "partnershipRuns" paired with each non_striker value) instead of just the values.
If you want something else because I didn't understand your question correctly, just ask.
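A quick sketch of why the selection helps, on a hypothetical two-row sample:

```python
import pandas as pd

df = pd.DataFrame({"batsman": ["SK Raina", "MS Dhoni"],
                   "non_striker": ["DR Smith", "DJ Bravo"],
                   "partnershipRuns": [90, 51]})

wide = df.groupby(["batsman", "non_striker"]).sum().unstack().fillna(0)
# without selecting a column, the columns form a MultiIndex of tuples
assert isinstance(wide.columns, pd.MultiIndex)

flat = wide["partnershipRuns"]  # selecting the top level flattens the columns
print(list(flat.columns))
```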

create seaborn facetgrid based on crosstab table

my dataframe is structured:
ACTIVITY 2014 2015 2016 2017 2018
WALK 198 501 485 394 461
RUN 187 446 413 371 495
JUMP 45 97 88 103 78
JOG 1125 2150 2482 2140 2734
SLIDE 1156 2357 2530 2044 1956
my visualization goal: a facetgrid of bar charts showing the percentage change over time, with each bar positive or negative depending on that year's change. Each facet is an ACTIVITY type, if that makes sense: for example, one facet would be a barplot of WALK, another would be RUN, and so on. The x-axis would be time (2014, 2015, 2016, etc.) and the y-axis would be the %change value for each year.
in my analysis, i added pct_change columns for every year except the baseline 2014, using a simple pct_change() helper that takes two columns from the df and returns a new calculated column:
df['%change_2015'] = pct_change(df['2014'],df['2015'])
df['%change_2016'] = pct_change(df['2015'],df['2016'])
... etc.
so with these new columns, i think i have the elements i need for my visualization goal. how can i do it with seaborn FacetGrids, specifically with bar plots?
augmented dataframe (slice view):
ACTIVITY 2014 2015 2016 2017 2018 %change_2015 %change_2016
WALK 198 501 485 394 461 153.03 -3.19
RUN 187 446 413 371 495 xyz xyz
JUMP 45 97 88 103 78 xyz xyz
i tried reading through the seaborn documentation but i was having trouble understanding the configurations: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
is the problem the way my dataframe is ordered and structured? i hope all of that made sense. i appreciate any help with this.
Use:
import pandas as pd
import seaborn as sns

cols = ['ACTIVITY', '%change_2015', '%change_2016']
data = [['Jump', '10.1', '-3.19'], ['Run', '9.35', '-3.19'], ['Run', '4.35', '-1.19']]
df = pd.DataFrame(data, columns=cols)
dfm = pd.melt(df, id_vars=['ACTIVITY'], value_vars=['%change_2015', '%change_2016'])
dfm['value'] = dfm['value'].astype(float)  # the sample values are strings, so cast before plotting
g = sns.FacetGrid(dfm, col='variable')
g.map(sns.barplot, 'ACTIVITY', 'value')
Output:
Based on your comment:
g = sns.FacetGrid(dfm, col='ACTIVITY')
g.map(sns.barplot, 'variable', "value")
Output:

Create Repeating N Rows at Interval N Pandas DF [duplicate]

This question already has an answer here:
Repeat Rows in Data Frame n Times [duplicate]
(1 answer)
Closed 1 year ago.
i have a df1 with shape (15, 1), but I need to create a new df2 of shape (270, 1) by repeating each of the 15 rows of df1 18 times in sequence (18 * 15 = 270). df1 looks like this:
Sites
0 TULE
1 DRY LAKE I
2 PENASCAL I
3 EL CABO
4 BARTON CHAPEL
5 RUGBY
6 BARTON I
7 BLUE CREEK
8 NEW HARVEST
9 COLORADO GREEN
10 CAYUGA RIDGE
11 BUFFALO RIDGE I
12 DESERT WIND
13 BIG HORN I
14 GROTON
My df2 should look like df1 with each row repeated 18 times in sequence. Thank you.
I FINALLY found the answer: convert the dataframe to a series, use repeat in the form my_series.repeat(N), and then convert the series back to a df.
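A minimal sketch of that approach, on a shortened version of df1 (three sites instead of 15, each repeated 18 times):

```python
import pandas as pd

df1 = pd.DataFrame({"Sites": ["TULE", "DRY LAKE I", "PENASCAL I"]})

# repeat each row 18 times, then restore a clean RangeIndex and DataFrame shape
df2 = df1["Sites"].repeat(18).reset_index(drop=True).to_frame()
print(df2.shape)
```

With the full 15-row df1 the same line yields the desired (270, 1) shape.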

Pandas plot data column on x axis

I am trying to plot data using pandas. The data is as follows
Name 1999 2000 2001
stud1 11 22 33
stud2 33 44 55
stud3 55 66 77
......
I need to plot student marks year-wise (year on the x-axis).
You can do it this way:
stud = pd.read_csv(r"C:/users/k_sego/students.csv", sep=";")
df = stud.pivot_table(columns=['Name'])
df.plot(kind='bar', legend=True)
you could try this:
df.pivot_table(columns=['Name']).plot()
It'll pivot your dataframe so that the year is the index and each student is a column.
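A sketch with made-up marks matching the question's layout (the year columns here are strings, which pivot_table turns into the index):

```python
import pandas as pd

stud = pd.DataFrame({"Name": ["stud1", "stud2", "stud3"],
                     "1999": [11, 33, 55],
                     "2000": [22, 44, 66],
                     "2001": [33, 55, 77]})

df = stud.pivot_table(columns=["Name"])
# the former year columns become the index and each student becomes a column,
# so df.plot() puts the years on the x-axis with one line per student
print(df)
```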

Is there a way to keep only rows in a DataFrame, when a column of that dataframe contains a substring of another column in that dataframe?

I have a dataset:
id key value
24 Apple Inc_Desktops revenue_rgs_category_-_pc_monitors nan
2 Apple Inc_Desktops revenue_rgs_category_-_mobile_phones 142381000000.000
46 Apple Inc_Desktops revenue_rgs_category_-_smart_tech 24482000000.000
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
35 Apple Inc_Desktops revenue_rgs_category_-_tablets 21280000000.000
1 Apple Inc_Laptops revenue_rgs_category_-_mobile_phones 142381000000.000
45 Apple Inc_Laptops revenue_rgs_category_-_smart_tech 24482000000.000
23 Apple Inc_Laptops revenue_rgs_category_-_pc_monitors nan
34 Apple Inc_Laptops revenue_rgs_category_-_tablets 21280000000.000
12 Apple Inc_Laptops revenue_rgs_category_-_desktop_pcs 12870000000.000
25 Apple Inc_MobilePhones revenue_rgs_category_-_pc_monitors nan
14 Apple Inc_MobilePhones revenue_rgs_category_-_desktop_pcs 12870000000.000
36 Apple Inc_MobilePhones revenue_rgs_category_-_tablets 21280000000.000
47 Apple Inc_MobilePhones revenue_rgs_category_-_smart_tech 24482000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
And I only want to keep the rows where the 'key' column contains a substring derived from the 'id' column. For example, I want to keep only the rows with index 13 and 3, because for those rows the 'key' column contains part of the 'id' column - e.g., for the row with index 3, 'Mobile' from the id appears in the key column.
So my desired output would be:
id key value
13 Apple Inc_Desktops revenue_rgs_category_-_desktop_pcs 12870000000.000
3 Apple Inc_MobilePhones revenue_rgs_category_-_mobile_phones 142381000000.000
I tried to create a new column indicating whether the 'key' column contains a substring of the 'id' column, but with no luck:
comp_rev_long['check'] = comp_rev_long['key'].str.contains('|'.join(comp_rev_long['id']),case=False)
Any ideas on an efficient way to do this? Thanking you in advance.
Here is some code that should help you get started:
import numpy as np
import pandas as pd
np.random.seed(1)
# I create a simple DataFrame
df = pd.DataFrame({"id": np.random.choice(["apple", "banana", "cherry"], 15),
"key": np.random.choice(["apple pie", "banana pie", "cherry pie"], 15),
"value": np.random.randint(0,20, 15)})
df looks like this:
id key value
0 banana cherry pie 13
1 apple banana pie 9
2 apple cherry pie 9
3 banana apple pie 7
4 banana apple pie 1
5 apple cherry pie 0
6 apple apple pie 17
7 banana banana pie 8
8 apple cherry pie 13
9 banana cherry pie 19
10 apple apple pie 15
11 cherry banana pie 10
12 banana banana pie 8
13 cherry cherry pie 7
14 apple apple pie 3
Here is a simple option to select only the rows that satisfy a certain condition.
# create a function that checks if a row satisfies your condition
check_condition = lambda row: row["id"] in row["key"]
# create a new column that determines whether you keep the row
# by applying the check_condition function row wise (-> axis=1)
df["keep_row"] = df.apply(check_condition, axis=1)
# finally select and keep only the desired rows
df = df[df["keep_row"]]
Now df looks like this:
id key value keep_row
6 apple apple pie 17 True
7 banana banana pie 8 True
10 apple apple pie 15 True
12 banana banana pie 8 True
13 cherry cherry pie 7 True
14 apple apple pie 3 True
One final issue is how to check if a substring is contained in another string. There are a few ways to go about this:
1. Replace the values such that the operation becomes trivial, e.g. row["id"] in row["key"].
2. Make new columns with the crucial information of the string; if you only need to know whether it is a mobile or a pc, make a new 'device' column.
3. Just code it anyway, though this is a bit cumbersome.
This check_condition might work, from seeing your data, but I cannot be sure of course.
def check_condition(row):
    for i in row["id"].lower().split('_'):
        if i in row["key"].lower():
            return True
        elif i[:-1] in row["key"].lower():  # account for the final 's'
            return True
    return False
Two notes:
This isn't a lambda function, but in this case it is equivalent to one, so you can replace the lambda check_condition with this function.
Also note that in the "id" and "key" columns some words end with an 's' and some don't, so that needs to be accounted for as well.
A solution to your question is to check if a string sliced from one column is present in the key column. In the example below, I construct a df (since you didn't provide one) with start/end positions that slice out a substring of the key:
import pandas as pd

# fn was not defined in the original answer; presumably it slices the key
# string between the start and end positions
def fn(row):
    return row['key'][row['start']:row['end']]

a1, b2, c3 = 'ANDGFEEHsdsdSHSHS', 'FKDsdsdKSDKSDKS', 'DSLDJSLffsfsKDdSLDJS'
s1, s2, s3 = 1, 3, 1
e1, e2, e3 = 3, 6, 6
df = pd.DataFrame({'key': [a1, b2, c3], 'start': [s1, s2, s3], 'end': [e1, e2, e3]})
df = df[['key', 'start', 'end']]
df['sliced'] = df.apply(fn, axis=1)
aa1, bb2, cc3 = 'ANDGFEEHsdsdSHSHS', 'FKDsdsdKSDKSDKS', 'DSLDJSLffsfsKDdSLDJS'
ss1, ss2, ss3 = 2, 2, 2
ee1, ee2, ee3 = 1, 1, 1
df2 = pd.DataFrame({'key': [aa1, bb2, cc3], 'start': [ss1, ss2, ss3], 'end': [ee1, ee2, ee3]})
df2 = df2[['key', 'start', 'end']]
dff = pd.concat([df, df2])  # df.append was removed in pandas 2.0
You then apply this to determine whether the sliced string exists in key:
df['Check'] = df.apply(lambda x: x.sliced in x.key, axis=1)
and filter for True.
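For the actual data in the question, the matching token has to be normalized first, since the two columns differ in underscores and a trailing 's'. One hypothetical sketch, assuming the relevant product word is the last '_'-separated piece of id:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["Apple Inc_Desktops", "Apple Inc_MobilePhones"],
    "key": ["revenue_rgs_category_-_desktop_pcs",
            "revenue_rgs_category_-_pc_monitors"],
})

# take the product part of id, lowercase it, and drop the plural 's'
token = df["id"].str.split("_").str[-1].str.lower().str.rstrip("s")
# compare against key with its underscores removed, so "desktop_pcs" matches "desktop"
mask = [t in k.replace("_", "") for t, k in zip(token, df["key"])]
print(mask)
```

df[mask] would then keep only the first row, the desktop/desktop match.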

Plotting a pandas panel column across the third dimension

I have 3 separate dataframes of the same shape with following data.
# for 2015
Grave Crimes Cases Recorded Mistake of Law fact
Abduction 725 3
Kidnapping 246 6
Arson 466 1
Mischief 436 1
House Breaking 12707 21
Grievous Hurt 1299 3
# for 2016
Grave Crimes Cases Recorded Mistake of Law fact
Abduction 738 4
Kidnapping 297 9
Arson 486 4
Mischief 394 1
House Breaking 10287 14
Grievous Hurt 1205 0
# for 2017
Grave Crimes Cases Recorded Mistake of Law fact
Abduction 647 2
Kidnapping 251 10
Arson 418 3
Mischief 424 0
House Breaking 8913 12
Grievous Hurt 1075 1
I want to plot each column (say 'Cases Recorded', for example) against the 'Grave Crimes' type, grouped by year. My current panel is as follows. I did not set any indexes when creating the panel, and each dataframe doesn't have any column indicating the year, as shown above.
pnl = pd.Panel({2015: df15, 2016: df16, 2017: df17})
My expected output is shown below. Can someone help me on this?
Panels have been deprecated in pandas as of 0.20.1. We can use pd.concat with keys to combine those dataframes into a single dataframe, then reshape it and use the pandas plot API.
# assumes the three dataframes are named df2015, df2016 and df2017
l = ['df2015', 'df2016', 'df2017']
df_out = (pd.concat([eval(i) for i in l], keys=l)
            .set_index('Grave Crimes', append=True)['Cases Recorded']
            .unstack(0)
            .rename(columns=lambda x: x[2:])
            .reset_index(0, drop=True))
df_out.plot.bar()
Output:
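An alternative sketch that avoids eval by passing the frames as a dict to pd.concat, shown here on abbreviated sample data from the question:

```python
import pandas as pd

df15 = pd.DataFrame({"Grave Crimes": ["Abduction", "Kidnapping"], "Cases Recorded": [725, 246]})
df16 = pd.DataFrame({"Grave Crimes": ["Abduction", "Kidnapping"], "Cases Recorded": [738, 297]})
df17 = pd.DataFrame({"Grave Crimes": ["Abduction", "Kidnapping"], "Cases Recorded": [647, 251]})

# dict keys become the outer index level, playing the role the Panel keys did
combined = pd.concat({2015: df15, 2016: df16, 2017: df17})
wide = (combined.set_index("Grave Crimes", append=True)["Cases Recorded"]
        .unstack(0)           # move the year level to the columns
        .droplevel(0))        # drop the leftover positional index level
# wide.plot.bar() now draws one group per crime type, one bar per year
print(wide)
```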