How to modify the length of elements in a pandas dataframe?

I want to change each element of a pandas dataframe to a specified length and number of decimal digits. Length means the number of characters; for example, the element -23.5556
is 8 characters long (counting the minus sign and the decimal point). I want to modify it to a total length of 6 characters with 2 decimal digits, such as -23.56. If an element is shorter than 6 characters, pad it with spaces. There is no separation between elements in the final new df.
name x y elev m1 m2
136 5210580.00000 5846400.000000 43.3 -28.2 -24.2
246 5373860.00000 5809680.000000 36.19 -25 -22.3
349 5361120.00000 5735330.000000 49.46 -24.7 -21.2
353 5521370.00000 5770740.000000 17.74 -26 -20.5
425 5095630.00000 5528200.000000 58.14 -30.3 -26.1
434 5198630.00000 5570740.000000 73.26 -30.2 -26
442 5373170.00000 5593290.000000 37.17 -22.9 -18.3
The format requested for each column:
      characters  decimal digits
name           3               0
x             14               2
y             14               2
elev           4               1
m1             6               2
m2             6               2
The new df format I want:
1365210580.00 5846400.00 43.3-28.2 -24.2
2465373860.00 5809680.00 36.1-25.0 -22.3
3495361120.00 5735330.00 49.4-24.7 -21.2
3535521370.00 5770740.00 17.7-26.0 -20.5
4255095630.00 5528200.00 58.1-30.3 -26.1
4345198630.00 5570740.00 73.2-30.2 -26.0
4425373170.00 5593290.00 37.1-22.9 -18.3
Lastly, I want to save the new df as a .dat file in fixed-width ASCII format.
Which pandas tool could do this?

You can use string formatting:
# build a row formatter from one fixed-width format spec per column
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
# apply it row-wise (axis=1), unpacking each row by column name
df.apply(lambda r: sf(**r), 1)
0 1365210580.00 5846400.00 43.3-28.2 -24.2
1 2465373860.00 5809680.00 36.2-25.0 -22.3
2 3495361120.00 5735330.00 49.5-24.7 -21.2
3 3535521370.00 5770740.00 17.7-26.0 -20.5
4 4255095630.00 5528200.00 58.1-30.3 -26.1
5 4345198630.00 5570740.00 73.3-30.2 -26.0
6 4425373170.00 5593290.00 37.2-22.9 -18.3
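To write this to disk as fixed-width ASCII, a minimal sketch (assuming the sf formatter and df from above; the file name out.dat is arbitrary):
# render every row to one fixed-width string, then write one line per row
lines = df.apply(lambda r: sf(**r), 1)
with open('out.dat', 'w') as f:
    f.write('\n'.join(lines) + '\n')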

You need:
df.round(2)
The resulting df:
name x y elev m1 m2
0 136 5210580 5846400 43.30 -28.2 -24.2
1 246 5373860 5809680 36.19 -25.0 -22.3
2 349 5361120 5735330 49.46 -24.7 -21.2
3 353 5521370 5770740 17.74 -26.0 -20.5
4 425 5095630 5528200 58.14 -30.3 -26.1
5 434 5198630 5570740 73.26 -30.2 -26.0
6 442 5373170 5593290 37.17 -22.9 -18.3
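Note that round(2) only changes the stored values; producing the fixed-width file still needs per-column formats. One hedged way to do that from the rounded frame is numpy.savetxt, which accepts one printf-style format per column (widths taken from the question's table; out.dat is an assumed file name):
import numpy as np
# delimiter='' concatenates the columns with no separation, as requested
np.savetxt('out.dat', df.round(2).values,
           fmt=['%3.0f', '%-14.2f', '%-14.2f', '%-4.1f', '%-6.1f', '%6.1f'],
           delimiter='')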

Related

How to reshape, group by, and rename a Julia dataframe?

I have the following DataFrame:
Police Product PV1 PV2 PV3 PM1 PM2 PM3
0 1 AA 10 8 14 150 145 140
1 2 AB 25 4 7 700 650 620
2 3 AA 13 22 5 120 80 60
3 4 AA 12 6 12 250 170 120
4 5 AB 10 13 5 500 430 350
5 6 BC 7 21 12 1200 1000 900
PV1 is the item PV for year 1, PV2 for year 2, ....
I would like to combine reshaping and group-by operations, plus some renaming, to obtain the DF below:
Product Item Year1 Year2 Year3
0 AA PV 35 36 31
1 AA PM 520 395 320
2 AB PV 35 17 12
3 AB PM 1200 1080 970
4 BC PV 7 21 12
5 BC PM 1200 1000 900
It performs a group-by on the product name and reshapes the DF so that the item becomes a column, with the sum for each year going into new Year columns.
I found a way to do it in Python, but I am now looking for a way to port my code to Julia.
No problem with the groupby operation, but I have more trouble with the reshaping / renaming part.
If you have any idea, I would be very grateful.
Thanks for any help.
Edit :
As you recommended, I have installed Julia 1.5 and updated the DataFrames pkg to version 0.22. As a result, the code runs well. The only remaining issue is the non-constant length of the column names in my real DF, which makes the transform part of the code not completely suitable. I will search for a way to split the character/numeric parts with a regular expression.
Thanks a lot for your time, and sorry for the mistakes in editing.
There are probably several ways you can do it. Here is an example using built-in functions (it also takes advantage of several advanced features at once, so if you have any questions about the code, please comment and I can explain):
julia> using CSV, DataFrames, Chain
julia> str = """
Police Product PV1 PV2 PV3 PM1 PM2 PM3
1 AA 10 8 14 150 145 140
2 AB 25 4 7 700 650 620
3 AA 13 22 5 120 80 60
4 AA 12 6 12 250 170 120
5 AB 10 13 5 500 430 350
6 BC 7 21 12 1200 1000 900""";
julia> @chain str begin
IOBuffer
CSV.read(DataFrame, ignorerepeated=true, delim=" ")
groupby(:Product)
combine(names(df, r"\d") .=> sum, renamecols=false)
stack(Not(:Product))
transform!(:variable => ByRow(x -> (first(x, 2), last(x, 1))) => [:Item, :Year])
unstack([:Product, :Item], :Year, :value, renamecols = x -> Symbol("Year", x))
sort!(:Product)
end
6×5 DataFrame
Row │ Product Item Year1 Year2 Year3
│ String String Int64? Int64? Int64?
─────┼─────────────────────────────────────────
1 │ AA PV 35 36 31
2 │ AA PM 520 395 320
3 │ AB PV 35 17 12
4 │ AB PM 1200 1080 970
5 │ BC PV 7 21 12
6 │ BC PM 1200 1000 900
I used Chain.jl just to show how it can be employed in practice (but of course it is not needed).
You can add an @aside show(_) annotation after any stage of the processing to see the result of that step.
Edit:
Is this the regex you need (split non-digit characters followed by digit characters)?
julia> match(r"([^\d]+)(\d+)", "fsdfds123").captures
2-element Array{Union{Nothing, SubString{String}},1}:
"fsdfds"
"123"
Then just write:
ByRow(x -> match(r"([^\d]+)(\d+)", x).captures)
as your transformation

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to each dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when applying the function to the entire list of dataframes, "dfs", issues arise. If I apply it as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the columns "Score_1", "Score_2" and "Score_3" repeated 5 times in the output.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first-level index, and the "Income_Quantile" as the second-level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
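From here, pulling out one statistic across all quantiles is a one-liner; a small follow-up sketch, assuming the result above (stored back into df), whose first index level is the statistic name:
df.xs('p-value')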

Iterate over every row and compare a column value of a dataframe

I have the following dataframe. I want to iterate over every row and check whether the score column value is >= the value present in the cut_off list.
seq score status
7 TTGTTCTCTGTGTATTTCAGGCT 10.42 positive
56 CAGGTGAGA 9.22 positive
64 AATTCCTGTGGACTTTCAAGTAT 1.23 positive
116 AAGGTATAT 7.84 positive
145 AAGGTAATA 8.49 positive
172 TGGGTAGGT 6.86 positive
204 CAGGTAGAG 7.10 positive
214 GCGTTTCTTGAATCCAGCAGGGA 3.58 positive
269 GAGGTAATG 8.73 positive
274 CACCCATTCCTGTACCTTAGGTA 8.96 positive
325 GCCGTAAGG 5.46 positive
356 GAGGTGAGG 8.41 positive
cut_off = range(0, 11)
The code I tried so far is:
cutoff_list_pos = []
number_list_pos = []
cut_off = range(0, int(new_df['score'].max())+1)
for co in cut_off:
    for df in df_elements:
        val = (df['score']>=co).value_counts()
        cutoff_list_pos.append(co)
        number_list_pos.append(val)
The desired output is:
cutoff true false
0 0 12.0 0
1 1 12.0 0
and so on..
If the score is >= the cut_off value, the row should be counted as true, otherwise as false.
You can use the keys parameter of concat with the values of cutoff_list_pos, then transpose and convert the index to a column with DataFrame.reset_index:
df = (pd.concat(number_list_pos, axis=1, keys=cutoff_list_pos, sort=False)
.T
.rename_axis('cutoff')
.reset_index())
Another pandas implementation:
res_df = pd.DataFrame(columns=['cutoff', 'true'])
for i in range(1, int(df['score'].max()+1)):
    temp_df = pd.DataFrame(data={'cutoff': i, 'true': (df['score']>=i).sum()}, index=[i])
    res_df = pd.concat([res_df, temp_df])
res_df
cutoff true
1 1 12
2 2 11
3 3 11
4 4 10
5 5 10
6 6 9
7 7 8
8 8 6
9 9 2
10 10 1
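Neither snippet above fills in the false column from the desired output; a minimal sketch that counts both sides of every cutoff directly (assuming a df with a numeric score column):
cut_off = range(0, int(df['score'].max()) + 1)
# count the rows on each side of the cutoff in a single pass
counts = pd.DataFrame([
    {'cutoff': co,
     'true': (df['score'] >= co).sum(),
     'false': (df['score'] < co).sum()}
    for co in cut_off
])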

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP', 'PK'], ['PRICE', 'QTY']]))
to which I want to add a column called "Value" under each level-0 column, computed by the following function:
def my_func(df, scale):
    return df['QTY']*df['PRICE']*scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
As if that wasn't hard enough, I want to apply one "scale" value to the "OP" level-0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    # select the second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create a MultiIndex in the columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to the original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
To multiply by a different scale value for each level, you can pass a list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
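If you prefer keying the scales by level-0 name rather than by position, a hedged variant (the scales dict and the 'value' label are assumptions, not part of the answers above):
# one scale per level-0 group, looked up by name
scales = {'OP': 10, 'PK': 20}
v = pd.concat({name: df[name]['PRICE'] * df[name]['QTY'] * s
               for name, s in scales.items()}, axis=1)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)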

Divide a dataframe into different bins based on a condition

I have a pandas dataframe:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows,
such that every bin has approximately the same total number of rows,
and assign the result as a new column bin.
One approach I thought of:
df.no_of_rows.sum() = 20247
div_factor = 20247 // 5 = 4049
If we add the 1st and 2nd rows, their sum = 2689 + 1515 = 4204 > div_factor,
therefore assign bin = 1 where id = 1.
Now look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong.
Is there a way to make 5 bins such that every bin has an approximately equal number of rows?
You can use an approach based on percentiles.
n_bins = 5
# cumulative row counts, taken in ascending order of no_of_rows
dfa = df.sort_values(by='no_of_rows').cumsum()
# map each cumulative share onto one of the n_bins buckets
df['bin'] = dfa.no_of_rows.apply(lambda x: int(n_bins*x/dfa.no_of_rows.max()))
And then you can check with
df.groupby('bin').sum()
The more records you have, the fairer it will be in terms of dispersion.
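For reference, a self-contained sketch of the same percentile idea on the question's data (variable names are assumptions):
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 13),
    'no_of_rows': [2689, 1515, 3826, 814, 1650, 2292,
                   1867, 2096, 1618, 923, 766, 191],
})
n_bins = 5
# cumulative sum in ascending order of size, aligned back to df by index
cum = df.sort_values(by='no_of_rows')['no_of_rows'].cumsum()
df['bin'] = (n_bins * cum / cum.max()).astype(int)
# check how balanced the bins turned out
print(df.groupby('bin')['no_of_rows'].sum())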