I have a data frame exex:
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
Is there a more efficient pandas approach to update the values of an existing dataframe df of zeros with the values of exex.EXEX, where exex.I gives the index and exex.J gives the column? And is there a way to update the data by specifying the label name instead of the row position? If the labels change, the positional index would differ and could lead to an erroneous result.
I currently get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex.iterrows():
    df.at[row['I'], row['J']] = row['EXEX']
Assign directly to df.values; this works here because the index and columns are exactly 1..907, so label - 1 is the position:
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
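If the labels may change and you want to stay label-safe, you can keep the bulk assignment but translate labels into positions with get_indexer first; a minimal sketch on a toy 5x5 frame standing in for the 907x907 one:

```python
import pandas as pd

# Toy 5x5 version of the 907x907 frame; the labels need not be 1..N here.
exex = pd.DataFrame({'EXEX': [702, 3112, 1360],
                     'I':    [2, 2, 2],
                     'J':    [3, 4, 5]})
df = pd.DataFrame(0, index=range(1, 6), columns=range(1, 6))

# Translate the labels in I and J into positions, then assign in bulk.
rows = df.index.get_indexer(exex.I)
cols = df.columns.get_indexer(exex.J)
vals = df.to_numpy()
vals[rows, cols] = exex.EXEX.to_numpy()
df = pd.DataFrame(vals, index=df.index, columns=df.columns)
```

Because the lookup goes through the labels, the assignment stays correct even if the index or columns are reordered or renamed.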
The following code:
import pandas as pd

df_original = pd.DataFrame({
    'race_num': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'race_position': [2, 3, 0, 1, 0, 0, 2, 3, 0],
    'percentage_place': [77, 55, 88, 50, 34, 56, 99, 12, 75]
})
Gives an output of:
race_num  race_position  percentage_place
1         2              77
1         3              55
1         0              88
2         1              50
2         0              34
2         0              56
2         2              99
3         3              12
3         0              75
I need to manipulate this dataframe to keep race_num grouped but sort percentage_place in descending order, with race_position staying aligned to its original percentage_place.
The desired output is:
race_num  race_position  percentage_place
1         0              88
1         2              77
1         3              55
2         2              99
2         0              56
2         1              50
2         0              34
3         0              75
3         3              12
My attempt is:
df_new = df_original.groupby(['race_num','race_position'])['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12
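If clean consecutive row numbers are wanted afterwards, the sort can be followed by reset_index(drop=True); a minimal sketch on the frame from the question:

```python
import pandas as pd

df_original = pd.DataFrame({
    'race_num': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'race_position': [2, 3, 0, 1, 0, 0, 2, 3, 0],
    'percentage_place': [77, 55, 88, 50, 34, 56, 99, 12, 75],
})

# Sort within each race_num by percentage_place (descending),
# then drop the old row labels.
df_new = (df_original
          .sort_values(['race_num', 'percentage_place'],
                       ascending=[True, False])
          .reset_index(drop=True))
```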
I have a pandas (v0.23.4) dataframe with a MultiIndex ('date', 'class').
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
In '2019-06-30', class 3 is missing because there is no data.
What I want is to add class 3 to the MultiIndex, with a zero in the Col_values column, automatically.
Use DataFrame.unstack with fill_value=0, then DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
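Both solutions can be exercised end to end; a self-contained sketch of the first one, on a trimmed-down toy frame (two dates, classes 0 and 3 only, with class 3 absent on the second date):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('2019-04-30', 0), ('2019-04-30', 3), ('2019-06-30', 0)],
    names=['date', 'class'])
df = pd.DataFrame({'Col_values': [324, 5, 325]}, index=idx)

# Pivot 'class' into columns (filling the hole with 0), then pivot back.
filled = df.unstack(fill_value=0).stack()
```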
I have a large list with a shape in excess of (1000000, 200). I would like to count the occurrences of the items in the last column (:, -1). I can do this in pandas with a smaller list;
distribution = mylist.groupby('var1').count()
However, I do not have labels on any of my dimensions, so I am unsure how to reference them.
Edit:
print of pandas sample data;
0 1 2 3 4 ... 204 205 206 207 208
0 1 1 Random 1 4 12 ... 8 -14860 0 -5.0000 43.065233
1 1 1 Random 2 3 2 ... 8 -92993 -1 -1.0000 43.057945
2 1 1 Random 3 13 3 ... 8 -62907 1 -2.0000 43.070335
3 1 1 Random 3 13 3 ... 8 -62907 -1 -2.0000 43.070335
4 1 1 Random 4 4 2 ... 8 -38673 -1 0.0000 43.057945
5 1 1 Book 1 3 9 ... 8 -82339 -1 0.0000 43.059402
... ... ... ... .. .. ... .. ... .. ... ...
11795132 292 1 Random 5 12 2 ... 8 -69229 -1 0.0000 12.839051
11795133 292 1 Book 2 7 10 ... 8 -60664 -1 0.0000 46.823615
11795134 292 1 Random 2 9 4 ... 8 -78754 1 -2.0000 11.762521
11795135 292 1 Random 2 9 4 ... 8 -78754 -1 -2.0000 11.762521
11795136 292 1 Random 1 7 5 ... 8 -76275 -1 0.0000 41.839286
I want a few different counts and summaries, so I plan to do them one at a time with:
mylist = input_list.values
mylist = mylist[:, -1]
mylist = mylist.astype(int)
Expected output;
11 2
12 1
41 1
43 6
46 1
iloc enables you to reference a column by position instead of by label:
distribution = input_list.groupby(input_list.iloc[:, -1]).count()
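If the counts themselves are the goal, value_counts gives the expected output directly; a sketch on a small unlabeled frame standing in for the real (11795137, 209) one:

```python
import pandas as pd
import numpy as np

# Toy stand-in: only the last column matters here.
input_list = pd.DataFrame(np.array([
    [1, 4, 43.065233],
    [1, 2, 43.057945],
    [2, 3, 43.070335],
    [2, 3, 11.762521],
]))

# Take the last column by position, truncate to int, count occurrences.
distribution = input_list.iloc[:, -1].astype(int).value_counts().sort_index()
```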
I have a text file that, imported as a pandas dataframe, looks like:
a b c d e
index
0 18 1 1 -30.47 0.746
1 19 1 1 -30.47 0.751
2 20 1 1 -30.47 0.801
3 33 2 1 -30.47 1.451
4 34 2 1 -30.47 1.534
5 35 2 1 -30.47 1.551
6 49 3 1 -30.47 2.297
7 50 3 1 -30.47 2.301
8 51 3 1 -30.47 2.351
9 64 4 1 -30.47 3.001
10 65 4 1 -30.47 3.085
11 66 4 1 -30.47 3.101
12 346 1 2 -28.47 0.601
13 347 1 2 -20 0.682
14 348 1 2 -28.47 0.701
15 362 2 2 -28.47 1.445
16 363 2 2 -28.47 1.451
17 364 2 2 -28.47 1.501
18 377 3 2 -28.47 2.151
19 378 3 2 -28.47 2.233
20 379 3 2 -28.47 2.251
21 392 4 2 -28.47 2.901
22 393 4 2 -28.47 2.996
23 394 4 2 -28.47 3.001
24 675 1 3 -25 0.596
25 676 1 3 -26 0.601
26 677 1 3 -22 0.651
27 690 2 3 -26.47 1.301
28 691 2 3 -26.47 1.384
29 692 2 3 -26.47 1.401
30 705 3 3 -26.47 2.051
31 706 3 3 -26.47 2.147
32 707 3 3 -26.47 2.151
33 721 4 3 -26.47 2.851
34 722 4 3 -26.47 2.935
35 723 4 3 -26.47 2.951
I have been trying to reorganize the dataframe as follows: for each value in column two (for example, value 1) there are multiple corresponding values in columns three and four. For example, value 1 (column two) corresponds to: value 1 (column three), -30.47 (column four); value 1 (column three), -30.47 (column four); value 1 (column three), -30.47 (column four); ... value 3 (column three), -25 (column four); value 3 (column three), -26 (column four); value 3 (column three), -22 (column four); and so on. I would like to create a new dataframe where for value 1 there are three other corresponding columns, 1, 2 and 3, each containing the mean of the three values of the original column four. The output should look like:
col 1  col 2                       col 3                    col 4
1      mean(-30.47,-30.47,-30.47)  mean(-28.47,-20,-28.47)  mean(-25,-26,-22)
The output should contain all the values of column two, in this case 1, 2, 3 and 4 (a 4x3 table). I am not an expert in Python and am unsure how to approach this task besides matching values in pairs. Any help is more than welcome!
IIUC:
df.groupby([2,3])[4].mean().reset_index(name='Mean').pivot(columns=3,index=2,values='Mean')
Output:
3 1 2 3
2
1 -30.47 -25.646667 -24.333333
2 -30.47 -28.470000 -26.470000
3 -30.47 -28.470000 -26.470000
4 -30.47 -28.470000 -26.470000
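The groupby/pivot chain can also be collapsed into a single pivot_table call; a sketch on a toy frame covering just the first row of the output (integer column labels 2, 3 and 4, as in the answer):

```python
import pandas as pd

# Column 2 = group (values 1-4 in the real data), column 3 = subgroup,
# column 4 = the value to average.
df = pd.DataFrame({
    2: [1] * 9,
    3: [1, 1, 1, 2, 2, 2, 3, 3, 3],
    4: [-30.47, -30.47, -30.47, -28.47, -20.0, -28.47, -25.0, -26.0, -22.0],
})

# pivot_table groups, averages and reshapes in one step.
out = df.pivot_table(index=2, columns=3, values=4, aggfunc='mean')
```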
I have the following, using a DF that has two columns that I would like to aggregate by:
df2.groupby(['airline_clean','sentiment']).size()
airline_clean sentiment
americanair -1 14
0 36
1 1804
2 722
3 171
4 1
jetblue -1 2
0 7
1 1074
2 868
3 250
4 11
southwestair -1 4
0 20
1 1320
2 829
3 237
4 4
united -1 7
0 74
1 2467
2 1026
3 221
4 5
usairways -1 5
0 62
1 1962
2 716
3 155
4 2
virginamerica -1 2
0 2
1 250
2 180
3 69
dtype: int64
Plotting the aggregated view:
dfc=df2.groupby(['airline_clean','sentiment']).size()
dfc.plot(kind='bar', stacked=True,figsize=(18,6))
Results in:
I would like to change two things:
plot the data in a stacked chart by airline
using % instead of raw numbers (by airline as well)
I am not sure how to achieve that. Any direction is appreciated.
The best way to plot this dataset is to convert to % values first and use unstack() for plotting:
airline_sentiment = df3.groupby(['airline_clean', 'sentiment']).agg({'tweet_count': 'sum'})
airline = df3.groupby(['airline_clean']).agg({'tweet_count': 'sum'})
p = airline_sentiment.div(airline, level='airline_clean') * 100
p.unstack().plot(kind='bar',stacked=True,figsize=(9, 6),title='Sentiment % distribution by airline')
This results in a nice chart:
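The percentage step can be sketched in a self-contained way; the counts below are a hypothetical two-airline subset of the groupby output above, and the plotting line is commented out so the snippet runs headless:

```python
import pandas as pd

counts = pd.Series(
    [14, 36, 1804, 2, 7, 1074],
    index=pd.MultiIndex.from_tuples(
        [('americanair', -1), ('americanair', 0), ('americanair', 1),
         ('jetblue', -1), ('jetblue', 0), ('jetblue', 1)],
        names=['airline_clean', 'sentiment']))

# Divide each count by its airline's total so every airline sums to 100%.
pct = counts / counts.groupby(level='airline_clean').transform('sum') * 100

# pct.unstack().plot(kind='bar', stacked=True)  # one stacked bar per airline
```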