How to add new columns that rank the values in each column? - pandas

I have a DataFrame:
teamId     pts  xpts
Liverpool   82    59
Man City    57    63
Leicester   53    47
Chelsea     48    55
I'm trying to add new columns that give each team's position in each column.
I want to get this:
teamId     pts  xpts  №pts  №xpts
Liverpool   82    59     1      2
Man City    57    63     2      1
Leicester   53    47     3      4
Chelsea     48    55     4      3
I tried something similar with the following code, but to no avail; the result is a list of DataFrames:
df = [df.sort_values(by=i, ascending=False).assign(new_col=lambda x: range(1, len(df) + 1)) for i in df.columns]

You can use np.argsort:
df[["no_pts", "no_xpts"]] = df.apply(lambda x: np.argsort(-x)) + 1
We pass each column but negated so that it gives the ascending order sort's indices. Since it starts from 0, we add 1 at the end.
to get
teamId     pts  xpts  no_pts  no_xpts
Liverpool   82    59       1        2
Man City    57    63       2        1
Leicester   53    47       3        4
Chelsea     48    55       4        3
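For reference, a minimal runnable version of the fix above, assuming teamId is the DataFrame's index (the question's tables suggest it is):
import numpy as np
import pandas as pd

# Rebuild the sample data with teamId as the index (an assumption).
df = pd.DataFrame({"pts": [82, 57, 53, 48], "xpts": [59, 63, 47, 55]},
                  index=["Liverpool", "Man City", "Leicester", "Chelsea"])
df.index.name = "teamId"

# Double argsort: sort positions -> ranks; negation makes the order descending.
df[["no_pts", "no_xpts"]] = df.apply(lambda x: np.argsort(np.argsort(-x))) + 1
print(df)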

Solutions if teamId is the index:
Use DataFrame.rank on all columns, add a prefix, and join back to the original DataFrame:
df = df.join(df.rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
print (df)
           pts  xpts  no_pts  no_xpts
teamId
Liverpool   82    59       1        2
Man City    57    63       2        1
Leicester   53    47       3        4
Chelsea     48    55       4        3
If you need to specify columns for processing, use a list:
cols = ['pts','xpts']
df = df.join(df[cols].rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
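One caveat worth knowing (illustrated on made-up numbers, not the question's data): rank's method argument controls how ties are numbered:
import pandas as pd

s = pd.Series([82, 57, 57, 48])
print(s.rank(method='dense', ascending=False).astype(int).tolist())  # [1, 2, 2, 3]
print(s.rank(method='min', ascending=False).astype(int).tolist())    # [1, 2, 2, 4]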

ValueError: grouper for xxx not 1-dimensional with pandas pivot_table()

I am working on an Olympics dataset and want to create another dataframe that has the total number of athletes and the total number of medals won, by type, for each country.
Using the following pivot_table gives me the error "ValueError: Grouper for 'ID' not 1-dimensional":
pd.pivot_table(olymp, index='NOC', columns=['ID','Medal'], values=['ID','Medal'], aggfunc={'ID':pd.Series.nunique,'Medal':'count'}).sort_values(by='Medal')
The result should have one row for each country, with columns for totalAthletes, gold, silver, and bronze. I am not sure how to go about it using pivot_table. I can do this by merging crosstabs, but I would like to use just one pivot_table statement.
Update
I would like to get the medal breakdown as well, e.g. gold, silver, bronze. I also need a unique count of athlete IDs, so I use nunique, since one athlete may participate in multiple events. Same with medals, ignoring NA values.
IIUC:
out = df.pivot_table(values='ID', index='NOC', columns='Medal', aggfunc='count', fill_value=0)
out['ID'] = df[df['Medal'].notna()].groupby('NOC')['ID'].nunique()
Output:
>>> out
Medal  Bronze  Gold  Silver   ID
NOC
AFG         2     0       0    1
AHO         0     0       1    1
ALG         8     5       4   14
ANZ         5    20       4   25
ARG        91    91      92  231
..        ...   ...     ...  ...
VIE         0     1       3    3
WIF         5     0       0    4
YUG        93   130     167  317
ZAM         1     0       1    2
ZIM         1    17       4   16
[149 rows x 4 columns]
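To match the column names the question asks for (totalAthletes, gold, silver, bronze), a plain rename can follow; this extra step is an addition to the answer above:
out = out.rename(columns={'ID': 'totalAthletes', 'Gold': 'gold',
                          'Silver': 'silver', 'Bronze': 'bronze'})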
Old answer
You can't use the same column for both columns and values:
out = olymp.pivot_table(index='NOC', values=['ID', 'Medal'],
                        aggfunc={'ID': pd.Series.nunique, 'Medal': 'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
       ID  Medal
NOC
USA  9653   5637
URS  2948   2503
GER  4872   2165
GBR  6281   2068
FRA  6170   1777
..    ...    ...
GAM    33      0
GBS    15      0
GEQ    26      0
PNG    61      0
LBA    68      0
[230 rows x 2 columns]
Another way to get the result above:
out = olymp.groupby('NOC').agg({'ID': pd.Series.nunique, 'Medal': 'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
       ID  Medal
NOC
USA  9653   5637
URS  2948   2503
GER  4872   2165
GBR  6281   2068
FRA  6170   1777
..    ...    ...
GAM    33      0
GBS    15      0
GEQ    26      0
PNG    61      0
LBA    68      0
[230 rows x 2 columns]
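For completeness, the crosstab-plus-merge route the question mentions could be sketched like this (using the question's variable names, and assuming pandas is imported as pd):
# Medal counts per country; pd.crosstab ignores NaN medals by default.
medals = pd.crosstab(olymp['NOC'], olymp['Medal'])
# Unique athletes per country.
athletes = olymp.groupby('NOC')['ID'].nunique().rename('totalAthletes')
out = medals.join(athletes)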

How to reshape, group by and rename Julia dataframe?

I have the following DataFrame:
   Police Product  PV1  PV2  PV3   PM1   PM2  PM3
0       1      AA   10    8   14   150   145  140
1       2      AB   25    4    7   700   650  620
2       3      AA   13   22    5   120    80   60
3       4      AA   12    6   12   250   170  120
4       5      AB   10   13    5   500   430  350
5       6      BC    7   21   12  1200  1000  900
PV1 is the item PV for year 1, PV2 for year 2, and so on.
I would like to combine reshaping and group-by operations, plus some renaming, to obtain the DataFrame below:
  Product Item  Year1  Year2  Year3
0      AA   PV     35     36     31
1      AA   PM    520    395    320
2      AB   PV     35     17     12
3      AB   PM   1200   1080    970
4      BC   PV      7     21     12
5      BC   PM   1200   1000    900
It performs a group-by on the product name and reshapes the DataFrame so that the item becomes a column, with the sum for each year in new year columns.
I found a way to do it in Python, and I am now looking for a solution to port my code to Julia.
No problem with the groupby operation, but I have more issues with the reshaping/renaming part.
If you have any idea, I would be very grateful.
Thanks for any help.
Edit:
As you recommended, I have installed Julia 1.5 and updated the DataFrames package to version 0.22. As a result, the code runs well. The only remaining issue relates to the non-constant length of the column names in my real DataFrame, which makes the transform part of the code not completely suitable. I will search for a way to split char/num with a regular expression.
Thanks a lot for your time, and sorry for the mistakes in editing.
There are probably several ways you can do it. Here is an example using built-in functions (it also takes advantage of several advanced features at once, so if you have any questions regarding the code, please comment and I can explain):
julia> using CSV, DataFrames, Chain
julia> str = """
Police Product PV1 PV2 PV3 PM1 PM2 PM3
1 AA 10 8 14 150 145 140
2 AB 25 4 7 700 650 620
3 AA 13 22 5 120 80 60
4 AA 12 6 12 250 170 120
5 AB 10 13 5 500 430 350
6 BC 7 21 12 1200 1000 900""";
julia> @chain str begin
           IOBuffer
           CSV.read(DataFrame, ignorerepeated=true, delim=" ")
           groupby(:Product)
           combine(names(df, r"\d") .=> sum, renamecols=false)
           stack(Not(:Product))
           transform!(:variable => ByRow(x -> (first(x, 2), last(x, 1))) => [:Item, :Year])
           unstack([:Product, :Item], :Year, :value, renamecols = x -> Symbol("Year", x))
           sort!(:Product)
       end
6×5 DataFrame
 Row │ Product  Item    Year1   Year2   Year3
     │ String   String  Int64?  Int64?  Int64?
─────┼─────────────────────────────────────────
   1 │ AA       PV          35      36      31
   2 │ AA       PM         520     395     320
   3 │ AB       PV          35      17      12
   4 │ AB       PM        1200    1080     970
   5 │ BC       PV           7      21      12
   6 │ BC       PM        1200    1000     900
I used Chain.jl just to show how it can be employed in practice (of course it is not required).
You can add an @aside show(_) annotation after any stage of the pipeline to see the result of that processing step.
Edit:
Is this the regex you need (split non-digit characters followed by digit characters)?
julia> match(r"([^\d]+)(\d+)", "fsdfds123").captures
2-element Array{Union{Nothing, SubString{String}},1}:
 "fsdfds"
 "123"
Then just write:
ByRow(x -> match(r"([^\d]+)(\d+)", x).captures)
as your transformation.

How to combine two groupby into one

I have two GroupBy results:
The first one:
ser2 = ser.groupby(pd.cut(ser, 10)).sum()
(-2620.137, 476638.7] 12393813
(476638.7, 951152.4] 9479666
(951152.4, 1425666.1] 14381033
(1425666.1, 1900179.8] 5113056
(1900179.8, 2374693.5] 4114429
(2374693.5, 2849207.2] 4929537
(2849207.2, 3323720.9] 0
(3323720.9, 3798234.6] 0
(3798234.6, 4272748.3] 3978230
(4272748.3, 4747262.0] 4747262
And the second:
ser1 = pd.cut(ser, 10)
print(ser1.value_counts())
(-2620.137, 476638.7] 110
(476638.7, 951152.4] 15
(951152.4, 1425666.1] 12
(1425666.1, 1900179.8] 3
(2374693.5, 2849207.2] 2
(1900179.8, 2374693.5] 2
(4272748.3, 4747262.0] 1
(3798234.6, 4272748.3] 1
(3323720.9, 3798234.6] 0
(2849207.2, 3323720.9] 0
Question: Is there a way to combine these operations into one statement, so that both calculations end up in the same table?
Use GroupBy.agg; instead of value_counts, use GroupBy.size:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(['sum','size'])
print (df)
               sum  size
(-0.039, 3.9]   27    14
(3.9, 7.8]      49     9
(7.8, 11.7]    142    15
(11.7, 15.6]   151    11
(15.6, 19.5]   159     9
(19.5, 23.4]   187     9
(23.4, 27.3]   253    10
(27.3, 31.2]   176     6
(31.2, 35.1]   231     7
(35.1, 39.0]   375    10
If you need custom column names:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg([('col1','sum'),('col2','size')])
print (df)
               col1  col2
(-0.039, 3.9]    27    14
(3.9, 7.8]       49     9
(7.8, 11.7]     142    15
(11.7, 15.6]    151    11
(15.6, 19.5]    159     9
(19.5, 23.4]    187     9
(23.4, 27.3]    253    10
(27.3, 31.2]    176     6
(31.2, 35.1]    231     7
(35.1, 39.0]    375    10
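On pandas 0.25 or newer, the same renaming can also be written with named aggregation; a sketch of the same computation:
df = ser.groupby(pd.cut(ser, 10)).agg(col1='sum', col2='size')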

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions but couldn't find any applicable leads.
I have a dataframe, called "df", which is roughly structured as follows:
    Income  Income_Quantile  Score_1  Score_2  Score_3
0   100000                5       75       75      100
1    97500                5       80       76       94
2    80000                5       79       99       83
3    79000                5       88       78       91
4    70000                4       55       77       80
5    66348                4       65       63       57
6    67931                4       60       65       57
7    69232                4       65       59       62
8    67948                4       64       64       60
9    50000                3       66       50       60
10   49593                3       58       51       50
11   49588                3       58       54       50
12   48995                3       59       59       60
13   35000                2       61       50       53
14   30000                2       66       35       77
15   12000                1       22       60       30
16   10000                1       15       45       12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to the dataframes (FYI: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    # cols is assumed to be defined beforehand, e.g. cols = ['Score_1', 'Score_2', 'Score_3']
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # note: spearmanr(...)[1] is the test's p-value; the coefficient itself is at index [0]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the columns "Score_1, Score_2, and Score_3" repeated five times, once per quantile.
I would like to:
Have just three columns: "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first-level index, and the "Income_Quantile" as the second-level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1', 'Score_2', 'Score_3']

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
                            Score_1       Score_2       Score_3
            Income_Quantile
correlation 1                   NaN           NaN           NaN
            2                   NaN           NaN           NaN
            3          6.837722e-01  0.000000e+00  1.000000e+00
            4          4.337662e-01  6.238377e-01  4.818230e-03
            5          2.000000e-01  2.000000e-01  2.000000e-01
p-value     1          8.190692e-03  8.241377e-03  8.194933e-03
            2          5.887943e-03  5.880440e-03  5.888611e-03
            3          3.606128e-13  3.603267e-13  3.604996e-13
            4          5.584822e-14  5.587619e-14  5.586583e-14
            5          3.861801e-06  3.862192e-06  3.864736e-06
t-statistic 1          1.098143e+01  1.094719e+01  1.097856e+01
            2          1.297459e+01  1.298294e+01  1.297385e+01
            3          2.391611e+02  2.391927e+02  2.391736e+02
            4          1.090548e+02  1.090479e+02  1.090505e+02
            5          1.594605e+01  1.594577e+01  1.594399e+01
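A usage note: with the statistic on the first index level, a single statistic across all quantiles can be pulled out with a cross-section:
print(df.xs('p-value'))  # one row per Income_Quantile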

Pandas pivot_table -- index values extended through rows

I'm trying to tidy some data, specifically by taking two columns, "measure" and "value", and making more columns for each unique value of measure.
So far I have some Python 3 code that reads in the data and pivots it into roughly the form that I want. The code looks like this:
import pandas as pd

# Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")

# Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name', 'Week Ending',
                                 'Item Name'], columns='Measure', values='Value')
print(df_pivot.head())
This outputs:
Measure                                             X   Y   Z
Geography Type Geography Name Week Ending Item Name
Type 1         Total US       1/1/2018    Item A    57  51  16
                                          Item B    95  37  17
                              1/8/2018    Item A    92   8  32
                                          Item B    36  49  54
Type 2         Region 1       1/1/2018    Item A    78  46  88
This is almost perfect, but for my work I need to load this file into other software, and for that software to read the data correctly it needs values in every row. So I need the index values to be repeated down the rows, like so:
Measure                                             X   Y   Z
Geography Type Geography Name Week Ending Item Name
Type 1         Total US       1/1/2018    Item A    57  51  16
Type 1         Total US       1/1/2018    Item B    95  37  17
Type 1         Total US       1/8/2018    Item A    92   8  32
Type 1         Total US       1/8/2018    Item B    36  49  54
Type 2         Region 1       1/1/2018    Item A    78  46  88
and so on.
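Worth noting (a sketch, not part of the original question): the labels are only hidden by pandas' sparsified MultiIndex display; the underlying values are all present. Two ways to make them explicit:
# Print every index label instead of the sparsified display
pd.set_option('display.multi_sparse', False)

# Or turn the index levels back into ordinary columns before export
df_flat = df_pivot.reset_index()
Exporting with df_pivot.to_csv(...) also writes the full labels on every row, since to_csv does not sparsify the index.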