How to combine two groupby into one - pandas

I have two GroupBy results.
The first one:
ser2 = ser.groupby(pd.cut(ser, 10)).sum()
(-2620.137, 476638.7] 12393813
(476638.7, 951152.4] 9479666
(951152.4, 1425666.1] 14381033
(1425666.1, 1900179.8] 5113056
(1900179.8, 2374693.5] 4114429
(2374693.5, 2849207.2] 4929537
(2849207.2, 3323720.9] 0
(3323720.9, 3798234.6] 0
(3798234.6, 4272748.3] 3978230
(4272748.3, 4747262.0] 4747262
And the second:
ser1 = pd.cut(ser, 10)
print(ser1.value_counts())
(-2620.137, 476638.7] 110
(476638.7, 951152.4] 15
(951152.4, 1425666.1] 12
(1425666.1, 1900179.8] 3
(2374693.5, 2849207.2] 2
(1900179.8, 2374693.5] 2
(4272748.3, 4747262.0] 1
(3798234.6, 4272748.3] 1
(3323720.9, 3798234.6] 0
(2849207.2, 3323720.9] 0
Question: is there a way to combine these operations into one piece of code so that both calculations end up in the same pivot table?

Use GroupBy.agg; instead of value_counts, use GroupBy.size:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(['sum','size'])
print (df)
sum size
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
If you need custom column names:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg([('col1','sum'),('col2','size')])
print (df)
col1 col2
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
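If you prefer keyword-based named aggregation (available since pandas 0.25), the same result can be produced without the list-of-tuples syntax; a minimal sketch:
import numpy as np
import pandas as pd

np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
# the keyword is the output column name, the value is the aggregation to apply
df = ser.groupby(pd.cut(ser, 10)).agg(col1='sum', col2='size')
print (df)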

Related

Pandas: replace values of a column with a variable (negative) if they are less than that variable, else keep the values as-is

say:
m = 170000 , v = -(m/100)
{'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6, '09-09-2021': 1528, '13-09-2021': 2042, '14-09-2021': 1098, '15-09-2021': -2092, '16-09-2021': -6718, '20-09-2021': -595, '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611, '29-09-2021': -1700, '30-09-2021': 4392}
I want to replace values in column 'Final' with v if the value is less than v, else keep the original value. I tried numpy.where, df.loc, etc., but it didn't work.
You can use clip:
df['Final'] = df['Final'].clip(-1700)
print(df)
# Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
Or the classical np.where:
df['Final'] = np.where(df['Final'] < -1700, -1700, df['Final'])
Setup (d is the dictionary shown above):
df = pd.DataFrame({'Date': list(d.keys()), 'Final': list(d.values())})
You can try:
df.loc[df['Final']<v, 'Final'] = v
Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
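A minimal end-to-end sketch combining the setup with the clip approach, taking the threshold from v instead of hard-coding -1700:
import pandas as pd

m = 170000
v = -(m / 100)  # -1700.0

d = {'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6, '09-09-2021': 1528,
     '13-09-2021': 2042, '14-09-2021': 1098, '15-09-2021': -2092, '16-09-2021': -6718,
     '20-09-2021': -595, '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611,
     '29-09-2021': -1700, '30-09-2021': 4392}
df = pd.DataFrame({'Date': list(d.keys()), 'Final': list(d.values())})

# values below v are raised to v, everything else is left untouched
df['Final'] = df['Final'].clip(lower=v)
print(df)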

How to add new columns ranking the data from each column?

I have Dataframe:
teamId pts xpts
Liverpool 82 59
Man City 57 63
Leicester 53 47
Chelsea 48 55
I'm trying to add new columns that give each team's position (rank) in each column.
I want to get this:
teamId pts xpts №pts №xpts
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
I tried to do something similar with the following code, but to no avail; the result is a list:
df = [df.sort_values(by=i, ascending=False).assign(new_col=lambda x: range(1, len(df) + 1)) for i in df.columns]
You can use np.argsort:
df[["no_pts", "no_xpts"]] = df.apply(lambda x: np.argsort(-x)) + 1
We pass each column negated so that argsort gives the indices of a descending sort; since these start from 0, we add 1 at the end. Note that np.argsort returns sort indices rather than ranks; they happen to coincide for this sample, but DataFrame.rank, used in the solutions below, is the more robust tool.
The result:
teamId pts xpts no_pts no_xpts
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
Solutions if teamId is index:
Use DataFrame.rank with all columns, add prefix and append to original DataFrame:
df = df.join(df.rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
print (df)
pts xpts no_pts no_xpts
teamId
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
If you need to specify which columns to process, use a list:
cols = ['pts','xpts']
df = df.join(df[cols].rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
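A minimal reproducible sketch of the rank-based solution, assuming teamId is the index as noted above:
import pandas as pd

df = pd.DataFrame({'pts': [82, 57, 53, 48],
                   'xpts': [59, 63, 47, 55]},
                  index=['Liverpool', 'Man City', 'Leicester', 'Chelsea'])
df.index.name = 'teamId'

# rank every column in descending order and append the result with a prefix
df = df.join(df.rank(method='dense', ascending=False).astype(int).add_prefix('no_'))
print (df)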

How to reshape, group by and rename Julia dataframe?

I have the following DataFrame :
Police Product PV1 PV2 PV3 PM1 PM2 PM3
0 1 AA 10 8 14 150 145 140
1 2 AB 25 4 7 700 650 620
2 3 AA 13 22 5 120 80 60
3 4 AA 12 6 12 250 170 120
4 5 AB 10 13 5 500 430 350
5 6 BC 7 21 12 1200 1000 900
PV1 is the item PV for year 1, PV2 for year 2, ....
I would like to combine reshaping and groupby operations, plus some renaming, to obtain the DataFrame below:
Product Item Year1 Year2 Year3
0 AA PV 35 36 31
1 AA PM 520 395 320
2 AB PV 35 17 12
3 AB PM 1200 1080 970
4 BC PV 7 21 12
5 BC PM 1200 1000 900
That is, group by product name and reshape the DataFrame so that the item becomes a column and the sums go into new year columns.
I found a way to do it in Python but I am now looking for a solution passing my code in Julia.
No problem for the groupby operation, but I have more issues with the reshaping / renaming part.
If you have any idea, I would be very grateful.
Thanks for any help
Edit :
As you recommended, I have installed Julia 1.5 and updated the DataFrames package to version 0.22. As a result, the code runs well. The only remaining issue is the non-constant length of the column names in my real DataFrame, which makes the transform part of the code not completely suitable. I will search for a way to split the character/number parts with a regular expression.
Thanks a lot for your time and sorry for the mistakes on editing.
There are probably several ways to do it. Here is an example using built-in functions (it also takes advantage of several advanced features at once, so if you have any questions about the code, please comment and I can explain):
julia> using CSV, DataFrames, Chain
julia> str = """
Police Product PV1 PV2 PV3 PM1 PM2 PM3
1 AA 10 8 14 150 145 140
2 AB 25 4 7 700 650 620
3 AA 13 22 5 120 80 60
4 AA 12 6 12 250 170 120
5 AB 10 13 5 500 430 350
6 BC 7 21 12 1200 1000 900""";
julia> @chain str begin
IOBuffer
CSV.read(DataFrame, ignorerepeated=true, delim=" ")
groupby(:Product)
combine(names(df, r"\d") .=> sum, renamecols=false)
stack(Not(:Product))
transform!(:variable => ByRow(x -> (first(x, 2), last(x, 1))) => [:Item, :Year])
unstack([:Product, :Item], :Year, :value, renamecols = x -> Symbol("Year", x))
sort!(:Product)
end
6×5 DataFrame
Row │ Product Item Year1 Year2 Year3
│ String String Int64? Int64? Int64?
─────┼─────────────────────────────────────────
1 │ AA PV 35 36 31
2 │ AA PM 520 395 320
3 │ AB PV 35 17 12
4 │ AB PM 1200 1080 970
5 │ BC PV 7 21 12
6 │ BC PM 1200 1000 900
I used Chain.jl just to show how it can be employed in practice (but of course it is not needed).
You can add the @aside show(_) annotation after any stage of the processing to see the result of that step.
Edit:
Is this the regex you need (split non-digit characters followed by digit characters)?
julia> match(r"([^\d]+)(\d+)", "fsdfds123").captures
2-element Array{Union{Nothing, SubString{String}},1}:
"fsdfds"
"123"
Then just write:
ByRow(x -> match(r"([^\d]+)(\d+)", x).captures)
as your transformation
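For reference (the questioner mentions already having a Python solution), here is a minimal pandas sketch of the same group-by/reshape, assuming df is a pandas DataFrame holding the sample table from the question:
import pandas as pd

# sum the year columns per product, go long, split 'PV1' into ('PV', '1'), then pivot back wide
summed = df.drop(columns='Police').groupby('Product').sum()
long = summed.stack().reset_index()
long.columns = ['Product', 'var', 'value']
long[['Item', 'Year']] = long['var'].str.extract(r'([^\d]+)(\d+)')
out = (long.pivot_table(index=['Product', 'Item'], columns='Year', values='value', aggfunc='sum')
           .add_prefix('Year')
           .reset_index())
print (out)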

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
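One caveat: scipy.stats.spearmanr returns a (correlation, pvalue) pair, so indexing with [1] in the 'correlation' row above actually stores the Spearman p-value. If the coefficient itself is wanted, only that line needs to change; a sketch:
# index [0] is the Spearman correlation coefficient, [1] is its p-value
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[0] for x in cols]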

How to substitute a column in a pandas dataframe with a series?

Let's have a dataframe df and a series s1 in pandas
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000,1000))
s1 = pd.Series(range(0,10000))
How can I modify df so that column 42 becomes equal to s1?
How can I modify df so that the columns between 42 and 442 become equal to s1?
I would like to know the simplest way to do that but also a way to do that in place.
First, you need a Series with the same length as the DataFrame; here 20:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(20,10))
#print (df)
s1 = pd.Series(range(0,20))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 3:5].columns
df[cols] = pd.concat([s1] * len(cols), axis=1)
print (df)
0 1 2 3 4 5 6 7 8 9
0 -0.668129 -0.498210 0.618576 0 0 0 0.301966 0.449483 0 -0.315231
1 -2.015971 -1.130231 -1.111846 1 1 1 1.915676 0.920348 1 1.157552
2 -0.106208 -0.088752 -0.971485 2 2 2 -0.366948 -0.301085 2 1.141635
3 -1.309529 -0.274381 0.864837 3 3 3 0.670294 0.086347 3 -1.212503
4 0.120359 -0.358880 1.199936 4 4 4 0.389167 1.201631 4 0.445432
5 -1.031109 0.067133 -1.213451 5 5 5 -0.636896 0.013802 5 1.726135
6 -0.491877 0.254206 -0.268168 6 6 6 0.671070 -0.633645 6 1.813671
7 0.080433 -0.882443 1.152671 7 7 7 0.249225 1.385407 7 1.010374
8 0.307274 0.806150 0.071719 8 8 8 1.133853 -0.789922 8 -0.286098
9 -0.767206 1.094445 1.603907 9 9 9 0.083149 2.322640 9 0.396845
10 -0.740018 -0.853377 -2.039522 10 10 10 0.764962 -0.472048 10 -0.071255
11 -0.238565 1.077573 2.143252 11 11 11 1.542892 2.572560 11 -0.803516
12 -0.139521 -0.992107 -0.892619 12 12 12 0.259612 -0.661760 12 -1.508976
13 -1.077001 0.381962 0.205388 13 13 13 -0.023986 -1.293080 13 1.846402
14 -0.714792 -0.728496 -0.127079 14 14 14 0.606065 -2.320500 14 -0.992798
15 -0.127113 -0.563313 -0.101387 15 15 15 0.647325 -0.816023 15 -0.309938
16 -1.151304 -1.673719 0.074930 16 16 16 -0.392157 0.736714 16 1.142983
17 -1.247396 -0.471524 1.173713 17 17 17 -0.005391 0.426134 17 0.781832
18 -0.325111 0.579248 0.040363 18 18 18 0.361926 0.036871 18 0.581314
19 -1.057501 -1.814500 0.109628 19 19 19 -1.738658 -0.061883 19 0.989456
Timings
Here are some other solutions, but the concat solution seems to be the fastest:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(1000,1000))
#print (df)
s1 = pd.Series(range(0,1000))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 42:442].columns
print (df)
In [310]: %timeit df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df),len(cols)))
1 loop, best of 3: 202 ms per loop
In [311]: %timeit df[cols] = np.repeat(s1.values[:, np.newaxis], len(cols), axis=1)
1 loop, best of 3: 208 ms per loop
In [312]: %timeit df[cols] = np.array([s1.values]*len(cols)).transpose()
10 loops, best of 3: 175 ms per loop
In [313]: %timeit df[cols] = pd.concat([s1] * len(cols), axis=1)
10 loops, best of 3: 53.8 ms per loop
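For completeness, a minimal sketch at the scale of the original question (10000 rows, 1000 columns, replacing columns 42 through 442), using the ndarray-based variant from the timings above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 1000))
s1 = pd.Series(range(0, 10000))

# replace a single column
df[42] = s1

# replace the whole label range 42..442 (inclusive, because .loc slices by label)
cols = df.loc[:, 42:442].columns
df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df), len(cols)))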