Plotting a pandas Panel column across the third dimension - pandas

I have 3 separate dataframes of the same shape with following data.
# for 2015
Grave Crimes    Cases Recorded  Mistake of Law fact
Abduction       725             3
Kidnapping      246             6
Arson           466             1
Mischief        436             1
House Breaking  12707           21
Grievous Hurt   1299            3
# for 2016
Grave Crimes    Cases Recorded  Mistake of Law fact
Abduction       738             4
Kidnapping      297             9
Arson           486             4
Mischief        394             1
House Breaking  10287           14
Grievous Hurt   1205            0
# for 2017
Grave Crimes    Cases Recorded  Mistake of Law fact
Abduction       647             2
Kidnapping      251             10
Arson           418             3
Mischief        424             0
House Breaking  8913            12
Grievous Hurt   1075            1
I want to plot each column (say 'Cases Recorded', for example) against the 'Grave Crimes' type, grouped by year. My current panel is as follows. I did not set any indexes when creating the panel, and none of the dataframes has a column indicating the year, as shown above.
pnl = pd.Panel({2015: df15, 2016: df16, 2017: df17})
My expected output is shown below. Can someone help me with this?

Panels have been deprecated in pandas since version 0.20.1.
We can use pd.concat with keys to combine those dataframes into a single dataframe, then reshape it and use the pandas plotting API.
l = ['df2015', 'df2016', 'df2017']
# Stack the three dataframes; `keys` adds an outer index level naming the source.
df_out = pd.concat([eval(i) for i in l], keys=l)\
    .set_index('Grave Crimes', append=True)['Cases Recorded'].unstack(0)\
    .rename(columns=lambda x: x[2:])\
    .reset_index(0, drop=True)  # strip the 'df' prefix from the columns, drop the leftover row numbers
df_out.plot.bar()
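If the dataframes are bound to variables as in the question (df15, df16, df17), a dict passed to pd.concat avoids the eval call entirely; a minimal sketch under that assumption:
# The dict keys (years) become the outer index level, so no renaming is needed.
df_out = (pd.concat({2015: df15, 2016: df16, 2017: df17})
            .set_index('Grave Crimes', append=True)['Cases Recorded']
            .unstack(0)                    # years move to the columns
            .reset_index(0, drop=True))    # drop the leftover row numbers
df_out.plot.bar()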
Output:

Related

create seaborn facetgrid based on crosstab table

My dataframe is structured like this:
ACTIVITY 2014 2015 2016 2017 2018
WALK 198 501 485 394 461
RUN 187 446 413 371 495
JUMP 45 97 88 103 78
JOG 1125 2150 2482 2140 2734
SLIDE 1156 2357 2530 2044 1956
My visualization goal: a FacetGrid of bar charts showing the percentage change over time, with each bar positive or negative depending on that year's percentage change. Each facet is one ACTIVITY type: one facet would be a bar plot for WALK, another for RUN, and so on. The x-axis would be time (2014, 2015, 2016, etc.) and the y-axis the value (% change) for each year.
In my analysis, I added pct_change columns for every year except the baseline 2014, using a simple pct_change() helper that takes two columns from the df and returns a new calculated column:
# Helper assumed from the question: percent change between two year columns.
pct_change = lambda old, new: (new - old) / old * 100
df['%change_2015'] = pct_change(df['2014'], df['2015'])
df['%change_2016'] = pct_change(df['2015'], df['2016'])
# ... etc.
So with these new columns, I think I have the elements I need for my visualization goal. How can I do it with seaborn FacetGrids, specifically bar plots?
Augmented dataframe (slice view):
ACTIVITY 2014 2015 2016 2017 2018 %change_2015 %change_2016
WALK 198 501 485 394 461 153.03 -3.19
RUN 187 446 413 371 495 xyz xyz
JUMP 45 97 88 103 78 xyz xyz
I tried reading through the seaborn documentation but had trouble understanding the configuration options: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
Is the problem the way my dataframe is ordered and structured? I hope all of that made sense; I appreciate any help with this.
Use:
import pandas as pd

cols = ['ACTIVITY', '%change_2015', '%change_2016']
data = [['Jump', '10.1', '-3.19'], ['Run', '9.35', '-3.19'], ['Run', '4.35', '-1.19']]
df = pd.DataFrame(data, columns=cols)

# Melt to long format: one row per (activity, %change column) pair.
dfm = pd.melt(df, id_vars=['ACTIVITY'], value_vars=['%change_2015', '%change_2016'])
dfm['value'] = dfm['value'].astype(float)  # the sample values were strings
import seaborn as sns
g = sns.FacetGrid(dfm, col='variable')
g.map(sns.barplot, 'ACTIVITY', "value")
Output:
Based on your comment:
g = sns.FacetGrid(dfm, col='ACTIVITY')
g.map(sns.barplot, 'variable', "value")
Output:
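As a side note, on recent seaborn versions the whole grid can be built in one call with catplot (which wraps FacetGrid); a minimal sketch assuming the dfm frame from above:
import seaborn as sns

# One bar facet per activity: x is the %change column, y its value.
g = sns.catplot(data=dfm, x='variable', y='value', col='ACTIVITY', kind='bar')
g.set_axis_labels('', '% change')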

using groupby for datetime values in pandas

I'm using this code in order to group my data by year:
df = pd.read_csv('../input/companies-info-wikipedia-2021/sparql_2021-11-03_22-25-45Z.csv')
df_duplicate_name = df[df.duplicated(['name'])]
df = df.drop_duplicates(subset='name').reset_index()
df = df.drop(['a','type','index'],axis=1).reset_index()
df = df[~df['foundation'].str.contains('[A-Za-z]', na=False)]
df = df.drop([140,214,220])
df['foundation'] = df['foundation'].fillna(0)
df['foundation'] = pd.to_datetime(df['foundation'])
df['foundation'] = df['foundation'].dt.year
df = df.groupby('foundation')
But the result is not grouped by the foundation values:
0 0 Deutsche EuroShop AG 1999 http://dbpedia.org/resource/Germany Investment in shopping centers http://dbpedia.org/resource/Real_property 4 2.964E9 1.25E9 2.241E8 8.04E7
1 1 Industry of Machinery and Tractors 1996 http://dbpedia.org/resource/Belgrade http://dbpedia.org/resource/Tractors http://dbpedia.org/resource/Agribusiness 4 4.648E7 0.0 30000.0 -€0.47 million
2 2 TelexFree Inc. 2012 http://dbpedia.org/resource/Massachusetts 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
3 3 (prev. Common Cents Communications Inc.) 2012 http://dbpedia.org/resource/United_States 99 http://dbpedia.org/resource/Multi-level_marketing 7 did not disclose did not disclose did not disclose did not disclose
4 4 Bionor Holding AS 1993 http://dbpedia.org/resource/Oslo http://dbpedia.org/resource/Health_care http://dbpedia.org/resource/Biotechnology 18 NOK 253 395 million NOK 203 320 million 1.09499E8 NOK 49 020 million
... ... ... ... ... ... ... ... ... ... ... ...
255 255 Ageas SA/NV 1990 http://dbpedia.org/resource/Belgium http://dbpedia.org/resource/Insurance http://dbpedia.org/resource/Financial_services 45000 1.0872E11 1.348E10 1.112E10 9.792E8
256 256 Sharp Corporation 1912 http://dbpedia.org/resource/Japan Televisions, audiovisual, home appliances, inf... http://dbpedia.org/resource/Consumer_electronics 52876 NaN NaN NaN NaN
257 257 Erste Group Bank AG 2008 Vienna, Austria Retail and commercial banking, investment and ... http://dbpedia.org/resource/Financial_services 47230 2.71983E11 1.96E10 6.772E9 1187000.0
258 258 Manulife Financial Corporation 1887 200 Asset management, Commercial banking, Commerci... http://dbpedia.org/resource/Financial_services 34000 750300000000 47200000000 39000000000 4800000000
259 259 BP plc 1909 London, England, UK http://dbpedia.org/resource/Natural_gas http://dbpedia.org/resource/Petroleum_industry
I also tried converting it again with pd.to_datetime and sorting by dt.year, but still without success.
Column names:
Index(['index', 'name', 'foundation', 'location', 'products', 'sector',
'employee', 'assets', 'equity', 'revenue', 'profit'],
dtype='object')
@Ruslan you simply need to use a "sorting" command, not a "groupby". You can achieve this generally in two ways:
myDF.sort_values(by='column_name', ascending=True, inplace=True)
or, in case you need to set your column as the index first:
myDF = myDF.set_index('column_name')
myDF.sort_index(ascending=True)
GroupBy is a totally different command: it is used to act on rows after grouping them by some criterion, such as finding the sum, average, min, or max of the values within each group.
pandas.DataFrame.sort_values
pandas.DataFrame.groupby
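A minimal sketch of the sorting approach, using the 'foundation' column from the question:
# Keeps every row, just reordered by foundation year; no grouping involved.
df_sorted = df.sort_values(by='foundation', ascending=True)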
I think you're misunderstanding how groupby() works.
You can't do df = df.groupby('foundation'): groupby() does not return a new DataFrame. Instead, it returns a GroupBy object, which is essentially a mapping from each grouped-by value to a dataframe containing the rows that share that value in the specified column.
You can, for example, print how many rows are in each group with the following code:
groups = df.groupby('foundation')
for val, sub_df in groups:
    print(f'{val}: {sub_df.shape[0]} rows')
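To get an ordinary dataframe back out, aggregate the groups rather than assigning the GroupBy object itself; a small sketch counting companies per foundation year:
# size() counts the rows in each group; reset_index turns it back into a DataFrame.
counts = df.groupby('foundation').size().reset_index(name='count')
print(counts)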

Inconsistent pandas axis labels

I have a pandas dataframe (df) that includes a label column ('Specimens' here).
Specimens Sample Min_read_lg Avr_read_lg Max_read_lg
0 B.pleb_sili 1 32 249.741 488
1 B.pleb_sili 2 30 276.959 489
2 B.conc_sili 3 25 256.294 489
3 B.conc_sili 4 27 277.923 489
4 F1_1_sili 5 34 303.328 489
...
I have tried to plot it as follows, but the labels on the x-axis do not match the actual values in the table. Would anyone know why that is the case?
plot = df.plot.area()
plot.set_xlabel("Specimens")
plot.set_ylabel("Read length")
plot.set_xticklabels(df['Specimens'], rotation=90)
I think the plot.set_xticklabels call is not right, but I would like to understand why the labels on the x-axis are mismatched, with most of them missing.
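For what it's worth, a likely cause: set_xticklabels only renames whatever ticks matplotlib has already placed (often a sparse, auto-chosen subset of positions), so the labels land on arbitrary tick locations instead of one per row. A hedged sketch of two common fixes:
import matplotlib.pyplot as plt

# Fix 1: pin one tick per row before assigning the labels.
ax = df.plot.area()
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['Specimens'], rotation=90)
ax.set_ylabel("Read length")

# Fix 2: make 'Specimens' the index so pandas labels the x-axis itself.
df.set_index('Specimens').plot.area(rot=90)
plt.show()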

Pandas: MultiIndex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex like this:-
import pandas as pd
import numpy as np
# Build a two-level index from the paired lists.
arrays = [['one','one','one','two','two','two'], [1,2,3,1,2,3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.
For example.
In [44]: df.loc["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.loc["one"].loc[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more index levels (besides the two shown in the example above), we can essentially drill down and select the data set we are really interested in without needing groupby.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
Great post by @Calvin Cheng, but I thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
Dollars Units
Date Store Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7
Store 2 Alcohol Liqour 80480280024 82.08 4
Store 3 Alcohol Liqour 80480280024 259.38 9
Store 1 Alcohol Liquor 80432400630 477.68 14
674545000001 139.68 4
Store 2 Alcohol Liquor 80432400630 203.88 6
674545000001 377.13 13
Store 3 Alcohol Liquor 80432400630 239.19 7
674545000001 432.32 14
Store 1 Beer Ales 94922755711 65.17 7
702770082018 174.44 14
736920111112 50.70 5
Store 2 Beer Ales 94922755711 129.60 12
702770082018 107.40 10
736920111112 59.65 5
Store 3 Beer Ales 94922755711 154.00 14
702770082018 137.40 10
736920111112 107.88 12
Store 1 Beer Lagers 702770081011 156.24 12
Store 2 Beer Lagers 702770081011 137.06 11
Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9
Liquor 80432400630 477.68 203.88 239.19 14 6 7
674545000001 139.68 377.13 432.32 4 13 14
Beer Ales 94922755711 65.17 129.60 154.00 7 12 14
702770082018 174.44 107.40 137.40 14 10 10
736920111112 50.70 59.65 107.88 5 5 12
Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:
Store Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82
Liquor 80432400630 34.12 33.98 34.17
674545000001 34.92 29.01 30.88
Beer Ales 94922755711 9.31 10.80 11.00
702770082018 12.46 10.74 13.74
736920111112 10.14 11.93 8.99
Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can instead use .xs or .query (yes, these work for regular dfs, but they're not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
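For instance, against the store dataframe above, that pattern might look like this (a sketch, not from the original answer):
# All 'Store 1' Alcohol rows, selected by naming the index levels directly.
df.xs(('Store 1', 'Alcohol'), level=('Store', 'Category'))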
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timings
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits of using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack(), as mentioned in @ZaxR's answer; performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this answer it seems the experiment was flawed. Here is my attempt at a correct experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
             number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional
dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.
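As a closing note, on current pandas the same labelled lookups can also be written with pd.IndexSlice, which extends naturally to partial slices; a small sketch against the dfm frame above:
idx = pd.IndexSlice

# Same single-value lookup as dfm.loc[("B", 4), 3].
value = dfm.loc[idx["B", 4], 3]

# Partial slice: every row whose first level is "B", column 3 only.
b_rows = dfm.loc[idx["B", :], 3]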

Remove column name from legend in pandas barplot

I have a dataframe as below
batsman non_striker partnershipRuns
0 SK Raina A Flintoff 23
1 SK Raina DR Smith 90
2 SK Raina F du Plessis 36
3 SK Raina JA Morkel 14
10 MS Dhoni CK Kapugedera 18
11 MS Dhoni DJ Bravo 51
12 MS Dhoni F du Plessis 27
13 MS Dhoni JA Morkel 12
14 MS Dhoni JDP Oram 6
I have been able to create a stacked bar plot using
import matplotlib.pyplot as plt

df1 = df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)
df1.plot(kind='bar', stacked=True, legend=True)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
This results in the column name being included in the legend as part of a tuple, as shown in the figure.
How can I leave out the column name and have just the value in the legend?
Please help.
To prevent a MultiIndex in the columns, specify the column to sum after the groupby; fillna is also unnecessary here once you pass fill_value=0 to unstack:
df1 = df.groupby(['batsman','non_striker'])['partnershipRuns'].sum().unstack(fill_value=0)
Another solution with pivot_table:
df1 = df.pivot_table(index='batsman',
                     columns='non_striker',
                     values='partnershipRuns',
                     aggfunc='sum',
                     fill_value=0)
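Either way, df1 now has plain non_striker names as its columns, so the original plotting call shows clean legend entries:
# Legend entries are now just non-striker names, not ('partnershipRuns', name) tuples.
df1.plot(kind='bar', stacked=True, legend=True)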
An easy fix for you would be to do:
df1 = df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)['partnershipRuns']
instead of:
df1 = df.groupby(['batsman','non_striker']).sum().unstack().fillna(0)
Why? Because your aggregation creates a MultiIndex on the columns, and each legend entry then carries every level of that index (here, 'partnershipRuns' paired with each non_striker value). Selecting ['partnershipRuns'] drops that outer level.
If you want something else because I didn't understand your question correctly, just ask.