Inconsistent pandas axis labels - pandas

I have a pandas DataFrame (df) with a column of labels (column 'Specimens' here).
Specimens Sample Min_read_lg Avr_read_lg Max_read_lg
0 B.pleb_sili 1 32 249.741 488
1 B.pleb_sili 2 30 276.959 489
2 B.conc_sili 3 25 256.294 489
3 B.conc_sili 4 27 277.923 489
4 F1_1_sili 5 34 303.328 489
...
I have tried to plot it as follows, but the labels on the x axis do not match the actual values in the table. Would anyone know why that is the case?
plot = df.plot.area()
plot.set_xlabel("Specimens")
plot.set_ylabel("Read length")
plot.set_xticklabels(df['Specimens'], rotation=90)
I think the "plot.set_xticklabels" method is not right, but I would like to understand why the labels on the x axis are mismatched, and why most of them are missing.
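For what it's worth, set_xticklabels only relabels whatever tick positions matplotlib has already chosen, so the labels land on arbitrary ticks rather than one per row. A minimal sketch of one likely fix, assuming you want one tick per row of df, is to pin the tick locations first:
import matplotlib.pyplot as plt

ax = df.plot.area()
ax.set_xlabel("Specimens")
ax.set_ylabel("Read length")
# Pin one tick per row, then attach the row labels to those ticks
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['Specimens'], rotation=90)
plt.show()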


Kronecker product over the rows of a pandas dataframe

So I have these two dataframes and I would like to get a new dataframe which consists of the Kronecker product of the rows of the two dataframes. What is the correct way to do this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw this question was posted here: kronecker product pandas dataframes. However, the answer given there is not correct for my question (and, I believe, not for the original question either). The answer there gives a Kronecker product of both dataframes as a whole, but I only want it over the rows.
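For a runnable setup, the two example frames above can be built as:
import pandas as pd

df1 = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df2 = pd.DataFrame({'a1': [5, 1, 2], 'a2': [7, 10, 4]})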
Create a MultiIndex with MultiIndex.from_product, align both DataFrames to it with DataFrame.reindex, multiply the aligned DataFrames, and finally flatten the MultiIndex:
# Iterating a DataFrame yields its column names, so this is the
# product of df1.columns and df2.columns
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
# Flatten ('c1', 'a1') -> 'c1a1'
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print(df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
Alternatively, use numpy's einsum for efficiency; it computes all the row-wise outer products in a single vectorized call:
import numpy as np

# n = rows, k = df1 columns, l = df2 columns: one outer product per row
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
             columns=pd.MultiIndex.from_product([df1, df2]).map(''.join))
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480

Find minimum values of df column based on another column

I have following dataframe
index X Coordinate Y Coordinate Z Coordinate indices distances
0 0 650355.148 4766450.315 39.086 537 0.348036
1 1 650355.148 4766450.314 39.086 537 0.347131
2 2 650372.398 4766676.602 -18.388 461 0.398005
3 3 650372.979 4766676.880 -18.087 461 0.591304
4 4 650373.776 4766677.397 -18.172 461 1.432126
and I want to find, for each value of indices, the row with the minimum distance.
I tried df['distances'].min(), but that returns a single global minimum rather than the result I want.
Use groupby:
>>> df.loc[df.groupby('indices')['distances'].idxmin()]
index X Coordinate Y Coordinate Z Coordinate indices distances
2 2 650372.398 4766676.602 -18.388 461 0.398005
1 1 650355.148 4766450.314 39.086 537 0.347131
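If only the minimum distance per indices value is needed (rather than the whole rows), a plain groupby aggregation would also do; the output below follows from the sample data shown:
>>> df.groupby('indices')['distances'].min()
indices
461    0.398005
537    0.347131
Name: distances, dtype: float64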

Pandas: multiindex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex like this:-
import pandas as pd
import numpy as np
arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.
For example.
In [44]: df.loc["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.loc["one"].loc[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
Great post by @Calvin Cheng, but I thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
Dollars Units
Date Store Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7
Store 2 Alcohol Liqour 80480280024 82.08 4
Store 3 Alcohol Liqour 80480280024 259.38 9
Store 1 Alcohol Liquor 80432400630 477.68 14
674545000001 139.68 4
Store 2 Alcohol Liquor 80432400630 203.88 6
674545000001 377.13 13
Store 3 Alcohol Liquor 80432400630 239.19 7
674545000001 432.32 14
Store 1 Beer Ales 94922755711 65.17 7
702770082018 174.44 14
736920111112 50.70 5
Store 2 Beer Ales 94922755711 129.60 12
702770082018 107.40 10
736920111112 59.65 5
Store 3 Beer Ales 94922755711 154.00 14
702770082018 137.40 10
736920111112 107.88 12
Store 1 Beer Lagers 702770081011 156.24 12
Store 2 Beer Lagers 702770081011 137.06 11
Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9
Liquor 80432400630 477.68 203.88 239.19 14 6 7
674545000001 139.68 377.13 432.32 4 13 14
Beer Ales 94922755711 65.17 129.60 154.00 7 12 14
702770082018 174.44 107.40 137.40 14 10 10
736920111112 50.70 59.65 107.88 5 5 12
Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math across multiple column levels. For example, df['Dollars'] / df['Units'] divides each store's dollars by its units, for every store at once, without repeating the operation per store:
Store Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82
Liquor 80432400630 34.12 33.98 34.17
674545000001 34.92 29.01 30.88
Beer Ales 94922755711 9.31 10.80 11.00
702770082018 12.46 10.74 13.74
736920111112 10.14 11.93 8.99
Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can instead use .xs or .query (yes, these work for regular dfs too, but they're not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
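For instance, with the sales frame above, pulling one store's beer rows might look like this (a sketch; xs takes a tuple key with matching level names):
df.xs(('Store 1', 'Beer'), level=('Store', 'Category'))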
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timings
import numpy as np
import pandas as pd

np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits from using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack() as mentioned in @ZaxR's answer; performance is not a benefit, which seems like a real missed opportunity.
Based on a comment on this answer, it seems that the experiment above was flawed. Here is my attempt at a corrected experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
timeit(
lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional dataframe single-value retrieval, and the syntax for MultiIndex retrieval is much cleaner. A plausible reason: a single .loc lookup can use the MultiIndex's lookup engine directly, while the boolean-mask version has to scan every row and build intermediate objects on each query.

How to modify the length of elements of a pandas dataframe?

I want to convert each element of a pandas dataframe to a specified length and number of decimal digits. Length means the number of characters: for example, the element -23.5556
is 8 characters long (counting the minus sign and the decimal point). I want to convert it to a total length of 6 characters with 2 decimal digits, such as -23.56. If an element is shorter than its target length, pad it with spaces. There should be no separation between the elements of the new df at the end.
name x y elev m1 m2
136 5210580.00000 5846400.000000 43.3 -28.2 -24.2
246 5373860.00000 5809680.000000 36.19 -25 -22.3
349 5361120.00000 5735330.000000 49.46 -24.7 -21.2
353 5521370.00000 5770740.000000 17.74 -26 -20.5
425 5095630.00000 5528200.000000 58.14 -30.3 -26.1
434 5198630.00000 5570740.000000 73.26 -30.2 -26
442 5373170.00000 5593290.000000 37.17 -22.9 -18.3
the format requested for each column:
      characters  decimal digits
name  3           0
x     14          2
y     14          2
elev  4           1
m1    6           2
m2    6           2
the new df format I wanted:
1365210580.00 5846400.00 43.3-28.2 -24.2
2465373860.00 5809680.00 36.1-25.0 -22.3
3495361120.00 5735330.00 49.4-24.7 -21.2
3535521370.00 5770740.00 17.7-26.0 -20.5
4255095630.00 5528200.00 58.1-30.3 -26.1
4345198630.00 5570740.00 73.2-30.2 -26.0
4425373170.00 5593290.00 37.1-22.9 -18.3
Lastly, save the new df in fixed-width ASCII format as a .dat file.
Which tool could do this in pandas?
You can use string formatting:
# one format spec per column, with the widths and precisions from the question
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
df.apply(lambda r: sf(**r), axis=1)
0 1365210580.00 5846400.00 43.3-28.2 -24.2
1 2465373860.00 5809680.00 36.2-25.0 -22.3
2 3495361120.00 5735330.00 49.5-24.7 -21.2
3 3535521370.00 5770740.00 17.7-26.0 -20.5
4 4255095630.00 5528200.00 58.1-30.3 -26.1
5 4345198630.00 5570740.00 73.3-30.2 -26.0
6 4425373170.00 5593290.00 37.2-22.9 -18.3
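To then save the result in fixed-width ASCII format as asked, one option is a plain file write ('output.dat' is a hypothetical filename):
# each formatted row becomes one line of the .dat file
lines = df.apply(lambda r: sf(**r), axis=1)
with open('output.dat', 'w') as f:
    f.write('\n'.join(lines))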
You need
df.round(2)
The resulting df
name x y elev m1 m2
0 136 5210580 5846400 43.30 -28.2 -24.2
1 246 5373860 5809680 36.19 -25.0 -22.3
2 349 5361120 5735330 49.46 -24.7 -21.2
3 353 5521370 5770740 17.74 -26.0 -20.5
4 425 5095630 5528200 58.14 -30.3 -26.1
5 434 5198630 5570740 73.26 -30.2 -26.0
6 442 5373170 5593290 37.17 -22.9 -18.3

What type of graph can best show the correlation between 'Fare' (price) and 'Survival' (Titanic)?

I'm playing around with Seaborn and Matplotlib and I'm trying to find the best type of graph to show the correlation between fare values and the chance of survival in the Titanic dataset.
The Titanic fare column has a lot of different values ranging from 1 to 500 and some of the values are repeated often.
Here is a sample of value_counts:
titanic.fare.value_counts()
8.0500 43
13.0000 42
7.8958 38
7.7500 34
26.0000 31
10.5000 24
7.9250 18
7.7750 16
0.0000 15
7.2292 15
26.5500 15
8.6625 13
7.8542 13
7.2500 13
7.2250 12
16.1000 9
9.5000 9
15.5000 8
24.1500 8
14.5000 7
7.0500 7
52.0000 7
31.2750 7
56.4958 7
69.5500 7
14.4542 7
30.0000 6
39.6875 6
46.9000 6
21.0000 6
.....
91.0792 2
106.4250 2
164.8667 2
The survived column, on the other hand, has only two values:
>>> titanic.survived.head(10)
271 1
597 0
302 0
633 0
277 0
413 0
674 0
263 0
466 0
A histogram would only show the frequency of fares in certain ranges.
For a scatter plot I would need two variables; having "survived" which has only two values would make for a strange variable.
Is there a way to show the rise of survivability as fare increases clearly through a line graph?
I know there is a correlation: if I sort fare values in ascending order (0 to 500) and then compare survival counts at the two ends:
>>> titanic.head(50).survived.sum()
5
>>> titanic.tail(50).survived.sum()
37
the difference is clear.
Thanks.
This is what I did to show the correlation between the fare values and the chance of survival:
First, I created a new column Fare Groups, converting fare values to groups of fare ranges, using cut().
df['Fare Groups'] = pd.cut(df.Fare, [0,50,100,150,200,550])
Next, I created a pivot_table().
piv_fare = df.pivot_table(index='Fare Groups', columns='Survived', values = 'Fare', aggfunc='count')
Output:
Survived 0 1
Fare Groups
(0, 50] 484 232
(50, 100] 37 70
(100, 150] 5 19
(150, 200] 3 6
(200, 550] 6 14
Plot:
piv_fare.plot(kind='bar')
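To read off the chance of survival directly rather than raw counts, the same pivot table could also be normalised row-wise first (a sketch):
# divide each fare group's counts by that group's total
rates = piv_fare.div(piv_fare.sum(axis=1), axis=0)
rates.plot(kind='bar')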
It seems those who had the cheapest tickets, in the (0, 50] range, had the lowest chance of survival. In fact, (0, 50] is the only fare range where the chance of dying is higher than the chance of surviving; not just higher, but significantly higher.