How to read a file in Python with pandas?

I have a file in the .lammpstrj format containing about 15 million rows of data points with the columns id mol type x y z, following this pattern:
ITEM: TIMESTEP
100000
ITEM: NUMBER OF ATOMS
3000
ITEM: BOX BOUNDS pp pp pp
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
ITEM: ATOMS id mol type x y z
2325 775 1 4.5602 4.35401 4.16348
718 240 2 4.91829 3.19545 7.30041
2065 689 2 6.51189 1.25778 5.11324
639 213 1 6.84357 5.10011 0.530398
720 240 1 5.46433 3.36715 6.48044
694 232 2 0.107046 3.3119 7.42581
1855 619 2 6.17236 4.57208 5.02607
1856 619 1 6.65988 5.13298 5.69518
I want to store all this data in a pandas DataFrame. How can I do this?
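One way to parse this: each frame starts with a fixed 9-line header (ITEM: TIMESTEP through the ITEM: ATOMS column list), so you can read the headers manually and hand each atom block to pandas. A minimal sketch, assuming every frame follows the standard 9-line layout shown above; "dump.lammpstrj" is a placeholder filename:
import io
import pandas as pd

frames = []
with open("dump.lammpstrj") as fh:
    while True:
        header = [fh.readline() for _ in range(9)]
        if not header[0]:            # readline() returns '' at EOF
            break
        timestep = int(header[1])    # line after "ITEM: TIMESTEP"
        n_atoms = int(header[3])     # line after "ITEM: NUMBER OF ATOMS"
        block = "".join(fh.readline() for _ in range(n_atoms))
        frame = pd.read_csv(io.StringIO(block), sep=r"\s+",
                            names=["id", "mol", "type", "x", "y", "z"])
        frame["timestep"] = timestep
        frames.append(frame)

df = pd.concat(frames, ignore_index=True)
For 15 million rows this stays memory-friendly, since each frame is parsed separately before the final concat.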

Related

Remove related row from pandas dataframe

I have the following dataframe:
id   relatedId  coordinate
123  125        55
125  123        45
128  130        60
132  135        50
130  128        40
135  132        50
So I have 6 rows in this dataframe, but I would like to get rid of the related rows, leaving 3 rows. The coordinate values of two related rows always sum to 100, and I would like to keep the one with the lower value (so the one less than 50; if both are 50, simply one of them). The resulting dataframe would thus be:
id   relatedId  coordinate
125  123        45
132  135        50
130  128        40
Hopefully someone has a good solution for this problem.
Thanks
You can sort the values and keep the first row per group, using a frozenset of the two ids as the grouper:
(df
 .sort_values(by='coordinate')
 .groupby(df[['id', 'relatedId']].agg(frozenset, axis=1), as_index=False)
 .first()
)
output:
id relatedId coordinate
0 130 128 40
1 125 123 45
2 132 135 50
Alternatively, to keep the original order, and original indices, use idxmin per group:
group = df[['id', 'relatedId']].agg(frozenset, axis=1)
idx = df['coordinate'].groupby(group).idxmin()
df.loc[sorted(idx)]
output:
id relatedId coordinate
1 125 123 45
3 132 135 50
4 130 128 40
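If the frame is large, the row-wise frozenset aggregation can be slow. A vectorized sketch (assuming the ids are numeric): sort each id pair with NumPy so related rows share the same key, then take idxmin per group as above:
import numpy as np

key = np.sort(df[['id', 'relatedId']].to_numpy(), axis=1)
idx = df['coordinate'].groupby([key[:, 0], key[:, 1]]).idxmin()
df.loc[sorted(idx)]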

ValueError: grouper for xxx not 1-dimensional with pandas pivot_table()

I am working on an Olympics dataset and want to create another dataframe that has the total number of athletes and the total number of medals won, by type, for each country.
Using the following pivot_table gives me the error "ValueError: Grouper for 'ID' not 1-dimensional":
pd.pivot_table(olymp, index='NOC', columns=['ID', 'Medal'], values=['ID', 'Medal'],
               aggfunc={'ID': pd.Series.nunique, 'Medal': 'count'}).sort_values(by='Medal')
The result should have one row per country, with columns for totalAthletes, gold, silver, and bronze. I'm not sure how to go about it using pivot_table. I can do this using a merge of crosstabs, but I would like to use just one pivot_table statement.
Update
I would like to get the medal breakdown as well, e.g. gold, silver, bronze. Also, I need a unique count of athlete ids, so I use nunique, since one athlete may participate in multiple events. Same with medals, ignoring NA values.
IIUC:
out = df.pivot_table('ID', 'NOC', 'Medal', aggfunc='count', fill_value=0)
out['ID'] = df[df['Medal'].notna()].groupby('NOC')['ID'].nunique()
Output:
>>> out
Medal Bronze Gold Silver ID
NOC
AFG 2 0 0 1
AHO 0 0 1 1
ALG 8 5 4 14
ANZ 5 20 4 25
ARG 91 91 92 231
.. ... ... ... ...
VIE 0 1 3 3
WIF 5 0 0 4
YUG 93 130 167 317
ZAM 1 0 1 2
ZIM 1 17 4 16
[149 rows x 4 columns]
Old answer
You can't have the same column for columns and values:
out = olymp.pivot_table(index='NOC', values=['ID', 'Medal'],
                        aggfunc={'ID': pd.Series.nunique, 'Medal': 'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
Another way to get the result above:
out = olymp.groupby('NOC').agg({'ID': pd.Series.nunique, 'Medal': 'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
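For reference, the crosstab-plus-merge route mentioned in the question could look like this; a sketch assuming olymp has the columns NOC, ID and Medal:
medals = pd.crosstab(olymp['NOC'], olymp['Medal'])   # Bronze/Gold/Silver counts per country
athletes = olymp.groupby('NOC')['ID'].nunique().rename('totalAthletes')
result = medals.join(athletes)
Both pieces are indexed by NOC, so a plain join lines them up.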

Expand/create pandas columns from grouped rows by concatenating subid value and other column names

I would like to create new columns in a pandas dataframe by grouping on a column and concatenating a subindex (in another column) with two other column names. This is best illustrated with an example. Say this is my input dataframe:
filename sub_id x y
0 2019-07-29T16-01-33.jpg 0 731 343
1 2019-07-29T16-01-33.jpg 1 741 283
2 2019-07-29T16-01-34.jpg 0 734 407
3 2019-07-29T16-01-34.jpg 1 757 348
4 2019-07-29T16-01-35.jpg 0 741 293
5 2019-07-29T16-01-35.jpg 1 760 380
And I want to obtain this:
filename x0 y0 x1 y1
0 2019-07-29T16-01-33.jpg 731 343 741 283
1 2019-07-29T16-01-34.jpg 734 407 757 348
2 2019-07-29T16-01-35.jpg 741 293 760 380
The sub_id value (0 or 1) is appended to the x and y column names to create new columns and the respective coordinate values transferred accordingly.
I'm assuming I have to use groupby in some way, or joins, but I'm not sure how.
Yet another method:
# create the columns x0, x1, y0, y1
df_unstacked = df.set_index(['filename', 'sub_id']).unstack(-1)
# flatten the MultiIndex column names, e.g. ('x', 0) -> 'x0'
df_unstacked.columns = [''.join(map(str, c_tup)) for c_tup in df_unstacked.columns]
The result is
x0 x1 y0 y1
filename
2019-07-29T16-01-33.jpg 731 741 343 283
2019-07-29T16-01-34.jpg 734 757 407 348
2019-07-29T16-01-35.jpg 741 760 293 380
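To match the desired layout exactly (filename as a regular column and the pairs ordered x0 y0 x1 y1), reorder the columns and reset the index:
df_out = df_unstacked[['x0', 'y0', 'x1', 'y1']].reset_index()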

Select nearest values around a certain value in column

I have the following df and I need to find all values in the column value that are equal or nearest to 660.
In detail, this means I must somehow iterate through the column value to find all these 660-or-nearest values. The values in the column value run from 1 up to a varying end, and when the end is reached they start again from 1. Finally, I must select the entire row (all other columns) wherever value == 660 (or nearest). I have a 'helper' column helper which keeps the same value throughout each run from 1 to the end; it could be helpful for getting the result (helper is always 0 or 1). Here is the df example:
helper value
0 1
.
.
.
0 647
0 649
0 652
0 654
0 656
0 659
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 678
0 681
.
.
.
0 1000
1 1
.
.
.
1 647
1 649
1 652
1 654
1 656
1 659
1 661
1 663
1 665
1 667
1 669
1 672
1 674
1 676
1 678
1 681
.
.
1 1500
0 1
.
.
.
0 645
0 647
0 650
0 652
0 654
0 656
0 658
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 679
.
.
.
0 980
Thanks for any help or hints!
There is not enough info for a complete answer. What is the size of the dataframe? Also, what do you mean by nearest? The exact-match case "where value == 660" can be done by applying a mask to the pandas series:
Something like df = df[df['value'] == 660] should do.
If you want a range of numbers instead, you can combine boolean masks; note that & binds tighter than the comparisons, so the parentheses are required:
mask = (df.value > 650) & (df.value < 670)
df = df[mask]
Let me know if this is what you are looking for.
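If "nearest" means the single value closest to 660 within each run from 1 to the end, here is a sketch that labels each run by changes in the helper column and keeps the row minimizing the absolute distance:
# a new group starts whenever `helper` changes value
group = df['helper'].ne(df['helper'].shift()).cumsum()
# index of the value closest to 660 within each run
idx = (df['value'] - 660).abs().groupby(group).idxmin()
nearest_rows = df.loc[idx]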

How to modify element lengths of a pandas dataframe?

I want to change each element of a pandas dataframe to a specified length and number of decimal digits. Length means the number of characters; for example, the element -23.5556 is 8 characters long (counting the minus sign and the decimal point). I want to modify it to a total of 6 characters with 2 decimal digits, such as -23.56. If an element is shorter than the specified length, pad it with spaces. There is no separation between the elements of the new df.
name x y elev m1 m2
136 5210580.00000 5846400.000000 43.3 -28.2 -24.2
246 5373860.00000 5809680.000000 36.19 -25 -22.3
349 5361120.00000 5735330.000000 49.46 -24.7 -21.2
353 5521370.00000 5770740.000000 17.74 -26 -20.5
425 5095630.00000 5528200.000000 58.14 -30.3 -26.1
434 5198630.00000 5570740.000000 73.26 -30.2 -26
442 5373170.00000 5593290.000000 37.17 -22.9 -18.3
Each column's requested format:
      characters  decimal digits
name           3               0
x             14               2
y             14               2
elev           4               1
m1             6               2
m2             6               2
The new df format I want:
1365210580.00 5846400.00 43.3-28.2 -24.2
2465373860.00 5809680.00 36.1-25.0 -22.3
3495361120.00 5735330.00 49.4-24.7 -21.2
3535521370.00 5770740.00 17.7-26.0 -20.5
4255095630.00 5528200.00 58.1-30.3 -26.1
4345198630.00 5570740.00 73.2-30.2 -26.0
4425373170.00 5593290.00 37.1-22.9 -18.3
Lastly, I want to save the new df in fixed-width ASCII .dat format.
Which pandas tool could do this?
You can use string formatting:
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
df.apply(lambda r: sf(**r), axis=1)
0 1365210580.00 5846400.00 43.3-28.2 -24.2
1 2465373860.00 5809680.00 36.2-25.0 -22.3
2 3495361120.00 5735330.00 49.5-24.7 -21.2
3 3535521370.00 5770740.00 17.7-26.0 -20.5
4 4255095630.00 5528200.00 58.1-30.3 -26.1
5 4345198630.00 5570740.00 73.3-30.2 -26.0
6 4425373170.00 5593290.00 37.2-22.9 -18.3
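To cover the last part of the question, the formatted lines can be written straight to a fixed-width ASCII file; "output.dat" is a placeholder name:
lines = df.apply(lambda r: sf(**r), axis=1)
with open('output.dat', 'w') as fh:
    fh.write('\n'.join(lines) + '\n')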
You need:
df.round(2)
The resulting df:
name x y elev m1 m2
0 136 5210580 5846400 43.30 -28.2 -24.2
1 246 5373860 5809680 36.19 -25.0 -22.3
2 349 5361120 5735330 49.46 -24.7 -21.2
3 353 5521370 5770740 17.74 -26.0 -20.5
4 425 5095630 5528200 58.14 -30.3 -26.1
5 434 5198630 5570740 73.26 -30.2 -26.0
6 442 5373170 5593290 37.17 -22.9 -18.3