Find minimum values of df column based on another column - pandas

I have following dataframe
index X Coordinate Y Coordinate Z Coordinate indices distances
0 0 650355.148 4766450.315 39.086 537 0.348036
1 1 650355.148 4766450.314 39.086 537 0.347131
2 2 650372.398 4766676.602 -18.388 461 0.398005
3 3 650372.979 4766676.880 -18.087 461 0.591304
4 4 650373.776 4766677.397 -18.172 461 1.432126
and I want to find the minimum distance for each value of indices, so that I am left with the rows below.
I tried df['distances'].min() but could not quite get the result I want.

Use groupby:
>>> df.loc[df.groupby('indices')['distances'].idxmin()]
index X Coordinate Y Coordinate Z Coordinate indices distances
2 2 650372.398 4766676.602 -18.388 461 0.398005
1 1 650355.148 4766450.314 39.086 537 0.347131

Related

Inconsistent Pandas axis labels

I have a pandas data-frame (df) including a column as labels (column 'Specimens' here).
Specimens Sample Min_read_lg Avr_read_lg Max_read_lg
0 B.pleb_sili 1 32 249.741 488
1 B.pleb_sili 2 30 276.959 489
2 B.conc_sili 3 25 256.294 489
3 B.conc_sili 4 27 277.923 489
4 F1_1_sili 5 34 303.328 489
...
I have tried to plot it as follows, but the labels on the x axis do not match the actual values in the table. Would anyone know why this is the case?
plot=df.plot.area()
plot.set_xlabel("Specimens")
plot.set_ylabel("Read length")
plot.set_xticklabels(df['Specimens'], rotation=90)
I think the "plot.set_xticklabels" method is not right, but I would like to understand why the labels on the x axis are mismatched, and why most of them are missing.
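set_xticklabels only renames whatever tick positions matplotlib has already chosen, and those positions rarely correspond one-to-one with the dataframe rows, which is why labels appear mismatched and mostly missing. A minimal sketch of the usual fix (setting the tick positions explicitly first), with made-up sample data:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for the sketch
import pandas as pd

df = pd.DataFrame({'Specimens': ['B.pleb_sili', 'B.pleb_sili', 'B.conc_sili'],
                   'Min_read_lg': [32, 30, 25],
                   'Max_read_lg': [488, 489, 489]})

ax = df[['Min_read_lg', 'Max_read_lg']].plot.area()
# set the tick *positions* first, then the labels, so they stay in sync
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df['Specimens'], rotation=90)
```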

Sort Column of Dataframe by similarity, first row should be fixed Python

I want to order the frame based on the first row of B. The first row of B is always fixed, and the second, third, ... rows are sorted by similarity to B's first row. It should also be flexible: B could contain 2-20 or even more rows.
I expect a result like this
Any idea how to do this?
If you sort the values by the difference from the first value in b, you can just use that index into the original DataFrame:
In [35]: df = pd.DataFrame({'a': range(6), 'b': [483, 479, 503, 479, 485, 495]})
In [36]: df
Out[36]:
a b
0 0 483
1 1 479
2 2 503
3 3 479
4 4 485
5 5 495
In [37]: idx = df['b'].sub(df.loc[0, 'b']).abs().sort_values().index
In [38]: df.loc[idx]
Out[38]:
a b
0 0 483
4 4 485
1 1 479
3 3 479
5 5 495
2 2 503
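On pandas 1.1 or newer, the same idea fits into a single sort_values call with a key function (a sketch; kind='stable' keeps ties in their original order, matching the output above):

```python
import pandas as pd

df = pd.DataFrame({'a': range(6), 'b': [483, 479, 503, 479, 485, 495]})

# sort by absolute distance from the first value of b
out = df.sort_values('b', key=lambda s: (s - s.iloc[0]).abs(), kind='stable')
print(out)
```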

shift specific rows from one column to next column

I have a column X, and I want to split specific rows in other columns.
x
76.25
'87.12'
1
345.65
'96.45'
2
78.12
'85.23'
3
35.1
'65.21'
1
I want to shift all quoted values to a new column y and all integers to a new column sequence. Note that all values are stored as text.
desired output is
x y sequence
76.25 '87.12' 1
345.65 '96.45' 2
78.12 '85.23' 3
35.1 '65.21' 1
I have hundreds of rows. I read about shift() for shifting values to the next column, but in this case I don't know the row positions, as there are hundreds of rows. Is it possible to shift specific values with this criteria? Any help will be appreciated.
If the data are regular and every triple is complete, you can convert the values to a numpy array, reshape it, and pass it to the DataFrame constructor:
df1 = pd.DataFrame(df['x'].to_numpy().reshape(-1,3), columns=['x','y','seq'])
# older pandas versions
#df1 = pd.DataFrame(df['x'].values.reshape(-1,3), columns=['x','y','seq'])
print (df1)
x y seq
0 76.25 '87.12' 1
1 345.65 '96.45' 2
2 78.12 '85.23' 3
3 35.1 '65.21' 1
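An equivalent sketch without the numpy reshape, using each row's position within its triple as a pivot key (same regularity assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['76.25', "'87.12'", '1',
                         '345.65', "'96.45'", '2']})

# group number and position within each triple become the two index levels
pos = np.arange(len(df))
out = df.set_index([pos // 3, pos % 3])['x'].unstack()
out.columns = ['x', 'y', 'seq']
print(out)
```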

Pandas: multiindex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using a MultiIndex like this:
import pandas as pd
import numpy as np
arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data, since we now have a hierarchical index.
For example.
In [44]: df.loc["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.loc["one"].loc[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more index levels (besides the 2 levels shown in the example above), we can essentially drill down and select the data set we are really interested in without needing groupby.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
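On current pandas, pd.IndexSlice offers the same kind of cross-section through .loc, keeping the selected level rather than dropping it as xs does by default. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['one', 'two'], [1, 2, 3]])
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

# all rows whose second level equals 2, across both groups;
# unlike xs, the selected level stays in the result's index
sl = df.loc[pd.IndexSlice[:, 2], :]
print(sl)
```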
Great post by @Calvin Cheng, but I thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
Dollars Units
Date Store Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7
Store 2 Alcohol Liqour 80480280024 82.08 4
Store 3 Alcohol Liqour 80480280024 259.38 9
Store 1 Alcohol Liquor 80432400630 477.68 14
674545000001 139.68 4
Store 2 Alcohol Liquor 80432400630 203.88 6
674545000001 377.13 13
Store 3 Alcohol Liquor 80432400630 239.19 7
674545000001 432.32 14
Store 1 Beer Ales 94922755711 65.17 7
702770082018 174.44 14
736920111112 50.70 5
Store 2 Beer Ales 94922755711 129.60 12
702770082018 107.40 10
736920111112 59.65 5
Store 3 Beer Ales 94922755711 154.00 14
702770082018 137.40 10
736920111112 107.88 12
Store 1 Beer Lagers 702770081011 156.24 12
Store 2 Beer Lagers 702770081011 137.06 11
Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9
Liquor 80432400630 477.68 203.88 239.19 14 6 7
674545000001 139.68 377.13 432.32 4 13 14
Beer Ales 94922755711 65.17 129.60 154.00 7 12 14
702770082018 174.44 107.40 137.40 14 10 10
736920111112 50.70 59.65 107.88 5 5 12
Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:
Store Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82
Liquor 80432400630 34.12 33.98 34.17
674545000001 34.92 29.01 30.88
Beer Ales 94922755711 9.31 10.80 11.00
702770082018 12.46 10.74 13.74
736920111112 10.14 11.93 8.99
Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can instead use .xs or .query (yes, these work for regular dfs too, but they're not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
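A runnable sketch of that pattern with a tiny made-up slice of the data (level names follow the example above; two of the three levels are selected so xs keeps the remaining one):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Store 1', 'Beer', 'Ales'), ('Store 1', 'Beer', 'Lagers'),
     ('Store 2', 'Beer', 'Ales')],
    names=['Store', 'Category', 'Subcategory'])
df = pd.DataFrame({'Dollars': [65.17, 156.24, 129.60]}, index=idx)

# one .xs call replaces chained boolean masks
row = df.xs(('Store 1', 'Ales'), level=('Store', 'Subcategory'))

# .query can reference index level names directly
same = df.query("Store == 'Store 1' and Subcategory == 'Ales'")
```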
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timings
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits of using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack(), as mentioned in @ZaxR's answer. Performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this answer, it seems the experiment was flawed. Here is my attempt at a correct experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
      timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
      timeit(
          lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
          number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.

How to modify element lengths of a pandas dataframe?

I want to change each element of a pandas dataframe to a specified length and number of decimal digits. Length means the number of characters; for example, the element -23.5556
is 8 characters long (counting the minus sign and the decimal point). I want to modify it to 6 characters total with 2 decimal digits, such as -23.56. If it is shorter than 6 characters, pad with spaces. There is no separation between the elements of the new df.
name x y elev m1 m2
136 5210580.00000 5846400.000000 43.3 -28.2 -24.2
246 5373860.00000 5809680.000000 36.19 -25 -22.3
349 5361120.00000 5735330.000000 49.46 -24.7 -21.2
353 5521370.00000 5770740.000000 17.74 -26 -20.5
425 5095630.00000 5528200.000000 58.14 -30.3 -26.1
434 5198630.00000 5570740.000000 73.26 -30.2 -26
442 5373170.00000 5593290.000000 37.17 -22.9 -18.3
the format requested for each column:
characters decimal digits
name 3 0
x 14 2
y 14 2
elev 4 1
m1 6 2
m2 6 2
the new df format I wanted:
1365210580.00 5846400.00 43.3-28.2 -24.2
2465373860.00 5809680.00 36.1-25.0 -22.3
3495361120.00 5735330.00 49.4-24.7 -21.2
3535521370.00 5770740.00 17.7-26.0 -20.5
4255095630.00 5528200.00 58.1-30.3 -26.1
4345198630.00 5570740.00 73.2-30.2 -26.0
4425373170.00 5593290.00 37.1-22.9 -18.3
Lastly, save the new df in fixed-width ASCII format as a .dat file.
Which tool in pandas could do this?
You can use string formatting
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
df.apply(lambda r: sf(**r), axis=1)
0 1365210580.00 5846400.00 43.3-28.2 -24.2
1 2465373860.00 5809680.00 36.2-25.0 -22.3
2 3495361120.00 5735330.00 49.5-24.7 -21.2
3 3535521370.00 5770740.00 17.7-26.0 -20.5
4 4255095630.00 5528200.00 58.1-30.3 -26.1
5 4345198630.00 5570740.00 73.3-30.2 -26.0
6 4425373170.00 5593290.00 37.2-22.9 -18.3
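To finish the question's last step (saving as a fixed-width ASCII .dat file): pandas has no dedicated fixed-width writer, so one sketch is to build the formatted lines as above and write them out with plain file I/O (sample data abbreviated to two rows):

```python
import pandas as pd

df = pd.DataFrame({'name': [136.0, 246.0],
                   'x': [5210580.0, 5373860.0],
                   'y': [5846400.0, 5809680.0],
                   'elev': [43.3, 36.19],
                   'm1': [-28.2, -25.0],
                   'm2': [-24.2, -22.3]})

# same format string as the answer above
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
lines = df.apply(lambda r: sf(**r), axis=1)

with open('out.dat', 'w') as fh:
    fh.write('\n'.join(lines) + '\n')
```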
Alternatively, you can use
df.round(2)
The resulting df
name x y elev m1 m2
0 136 5210580 5846400 43.30 -28.2 -24.2
1 246 5373860 5809680 36.19 -25.0 -22.3
2 349 5361120 5735330 49.46 -24.7 -21.2
3 353 5521370 5770740 17.74 -26.0 -20.5
4 425 5095630 5528200 58.14 -30.3 -26.1
5 434 5198630 5570740 73.26 -30.2 -26.0
6 442 5373170 5593290 37.17 -22.9 -18.3