Change column values into rating and sum - pandas

Change the column values and sum the row according to conditions.
d = {'col1': [20, 40], 'col2': [30, 40], 'col3': [200, 300]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 20 30 200
1 40 40 300
Col4 should give back the sum of the row after the values have been converted to a rating.
Col1 Value between 0-20 ->2 Points, 20-40 -> 3 Points
Col2 Value between 40-50 ->2 Points, 70-80 -> 3 Points
Col3 Value between 0-100 ->2 Points, 100-300 -> 2 Points
col 4 (Points)
0 2
1 6

Use pd.cut as follows. The values didn't add up to your expected output, though; happy to assist further if you clarify.
Use pd.cut to bin each column and save the result in new columns suffixed with Points. Then select only the columns whose names contain the string Points and sum them.
df['col1Points'],df['col2Points'],df['col3Points']=\
pd.cut(df.col1, [0,20,40],labels=[2,3])\
,pd.cut(df.col2, [40,70,80],labels=[2,3])\
,pd.cut(df.col3, [-0,100,300],labels=[2,3])
df['col4']=df.filter(like='Points').sum(axis=1)
   col1  col2  col3 col1Points col2Points col3Points  col4
0    20    30   200          2        NaN          3   5.0
1    40    40   300          3        NaN          3   6.0
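A self-contained variant of the approach above, as a sketch only. The `bins` dictionary and the `points` frame are names introduced here for illustration, and the bin edges are copied from the answer rather than confirmed by the asker. Converting each binned column to float is an assumption made so that values falling outside every bin become NaN and are skipped by the row sum.

```python
import pandas as pd

df = pd.DataFrame({"col1": [20, 40], "col2": [30, 40], "col3": [200, 300]})

# Bin edges per column (taken from the answer above); the labels are the points.
bins = {"col1": [0, 20, 40], "col2": [40, 70, 80], "col3": [0, 100, 300]}

# pd.cut returns a categorical; casting to float lets sum() skip unmatched values (NaN).
points = pd.DataFrame({
    col: pd.cut(df[col], edges, labels=[2, 3]).astype(float)
    for col, edges in bins.items()
})

df["col4"] = points.sum(axis=1)
print(df)
```

Note that pd.cut uses right-closed intervals by default, so 40 in col2 does not fall into the bin starting at 40, which is why col2 contributes nothing here.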

Related

Properly map values between two dataframes

I have dataframe df
d = {'Col1': [10,67], 'Col2': [30,10],'Col3': [70,40]}
df = pd.DataFrame(data=d)
which results in
Col1 Col2 Col3
0 10 30 70
1 67 10 40
and df2
df2=pd.DataFrame(data=([25,36,47,(0,20)],[70,85,95,(20,40)],
[12,35,49,(40,60)],[50,49,21,(60,80)],[60,75,38,(80,100)]),
columns=["Col1","Col2","Col3","Range"])
which results in:
Col1 Col2 Col3 Range
0 25 36 47 (0, 20)
1 70 85 95 (20, 40)
2 12 35 49 (40, 60)
3 50 49 21 (60, 80)
4 60 75 38 (80, 100)
Both frames are just for example purposes and might be much bigger in reality. Both frames have the same columns except one (Range, which exists only in df2).
I want to apply some function (x/y) between each value from df and a value in df2 from the same column. The value from df2, however, may be in a different row depending on the Range column.
For example 10 from df (Col1) falls in range (0,20) in df2 therefore I want to use 25 from Col1 (df2) and do 10/25.
30 from df (Col2) falls in range (20,40) in df2 therefore I want to take 85 from Col2 (df2) and do 30/85.
70 from df (Col3) falls in range (60,80) in df2 therefore I want to take 21 from Col3 (df2) and do 70/21.
I want to do this for each row in df.
Don't really know how to do the proper mapping; I always tend to start with some for loops which are not very pretty especially if both dataframes are of bigger shape. Expected output can be any array, dataframe or the like composed of the resulting numbers.
Here is one way to do it by defining a helper function:
def find_denominator_for(v):
    """Helper function.
    >>> find_denominator_for(10)
    {'Col1': 25, 'Col2': 36, 'Col3': 47}
    """
    for tup, sub_dict in df2.set_index("Range").to_dict(orient="index").items():
        if min(tup) <= v <= max(tup):
            return sub_dict

for col in df.columns:
    df[col] = df[col] / df[col].apply(lambda x: find_denominator_for(x)[col])
Then:
print(df)
# Output
Col1 Col2 Col3
0 0.40 0.352941 3.333333
1 1.34 0.277778 0.421053
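For larger frames, the per-value helper call can be replaced by an interval lookup. The following is a sketch under one assumption that differs from the helper above: the ranges are treated as right-closed, (low, high], so that neighbouring ranges do not overlap. A value of exactly 0 would then match no range. The names `intervals`, `pos` and `result` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [10, 67], "Col2": [30, 10], "Col3": [70, 40]})
df2 = pd.DataFrame(
    [[25, 36, 47, (0, 20)], [70, 85, 95, (20, 40)], [12, 35, 49, (40, 60)],
     [50, 49, 21, (60, 80)], [60, 75, 38, (80, 100)]],
    columns=["Col1", "Col2", "Col3", "Range"],
)

# Assumption: ranges are (low, high], which keeps them non-overlapping.
intervals = pd.IntervalIndex.from_tuples(df2["Range"].tolist(), closed="right")

result = pd.DataFrame(index=df.index)
for col in df.columns:
    # Row position in df2 of the range containing each value (-1 if none matches).
    pos = intervals.get_indexer(df[col].to_numpy())
    if (pos < 0).any():
        raise ValueError(f"{col}: some values fall outside every range")
    result[col] = df[col].to_numpy() / df2[col].to_numpy()[pos]

print(result)
```

Unlike the loop above, this writes to a new frame instead of overwriting df, so the original values stay available.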

Fix the length of some columns using Pandas

I am trying to add some columns to a pandas DataFrame, but I cannot set the character length of the columns.
I want to add the new fields as strings with a null value and a field length of two characters.
Any idea is welcome.
import pandas as pd
df[["Assess", "Operator","x", "y","z", "g"]]=None
If you need to fix the number of columns in the new DataFrame, use:
from itertools import product
import string
# single-character candidate names for the new columns
letters = string.ascii_letters
# print(len(letters))  # 52
# if two-character names are needed instead:
# letters = [''.join(x) for x in product(letters, letters)]
# print(len(letters))  # 2704
df = pd.DataFrame({'col1':[4,5], 'col':[8,2]})
# threshold: total number of columns wanted
N = 5
# number of new column names = N minus the current number of columns
# max() turns a negative difference into 0
cols = list(letters[:max(0, N - len(df.columns))])
# add the new columns filled with None,
# then keep only the first N columns (in case the original already has more than N)
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
col1 col a b c
0 4 8 None None None
1 5 2 None None None
Test with more columns than the threshold N:
df = pd.DataFrame({'col1':[4,5], 'col2':[8,2],'col3':[4,5],
'col4':[8,2], 'col5':[7,3],'col6':[9,0], 'col7':[5,1]})
print (df)
col1 col2 col3 col4 col5 col6 col7
0 4 8 4 8 7 9 5
1 5 2 5 2 3 0 1
N = 5
cols = list(letters[:max(0, N - len(df.columns))])
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
col1 col2 col3 col4 col5
0 4 8 4 8 7
1 5 2 5 2 3
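The two cases above can be folded into one small function. This is only a sketch: `pad_columns` and its parameters are hypothetical names introduced here, not part of pandas. It also assumes that none of the generated names already exists in the frame, since assign would then overwrite that column instead of adding a new one.

```python
import string

import pandas as pd


def pad_columns(df, n, names=string.ascii_letters):
    """Return a copy of df with exactly n columns (hypothetical helper).

    Missing columns are added, filled with None; surplus columns are cut off.
    """
    # max() keeps the slice length at 0 when df already has n or more columns.
    extra = list(names[:max(0, n - len(df.columns))])
    return df.assign(**dict.fromkeys(extra, None)).iloc[:, :n]


narrow = pd.DataFrame({"col1": [4, 5], "col": [8, 2]})
print(pad_columns(narrow, 5))
```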

How to join two dataframes in Python and replace the null values of one dataframe's column with the other's column in PySpark?

Suppose I have a df with 5 columns and a second df with 6 columns. I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a not null value in df2. How do I do this in python?
I don't want to specify the name of the columns, hard code them. I want to make a robust logic that works even if in the future we need to replace rows for 7 cols instead of 6.
Sample Data:
df1=
col1 col2 col3 col5
1 null null 5
2 null 5 9
4 4 8 6
null 0 9 1
df2=
col1 col2 col3 col4
1 2 -3 5
null null 7 5
4 4 8 1
1 null 9 3
Final df=
col1 col2 col3 col5 col4
1 2 -3 5 5
2 null 5 9 5
4 4 8 6 1
1 0 9 1 3
Conditions:
The null rows of a column in df1 get replaced by a not null value in df2
if both data frames have different not null values on the same index, take the first one or second one. Doesn't matter.
if both of them are null, the final df will have null values on that very same index.
I don't want to specify the column names, just want to have a robust script that works for other data as well with different column names.
Quoting the question: "I want to join df1 with df2 such that the null rows of a column in df1 get replaced by a not null value in df2. How do I do this in python?"
Just join the two frames, then use coalesce to get the first non-null value for each shared column.
Quoting the question: "I don't want to specify the name of the columns, hard code them."
You can access the column names via df.columns, and the column data types via df.dtypes.
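For illustration, here is the same idea in plain pandas rather than PySpark. This is not the join-plus-coalesce approach described above. It is the pandas analogue, combine_first, and it assumes that the data fits in memory and that rows of the two frames correspond by index. The sample values are the ones from the question, with null written as NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "col1": [1, 2, 4, np.nan],
    "col2": [np.nan, np.nan, 4, 0],
    "col3": [np.nan, 5, 8, 9],
    "col5": [5, 9, 6, 1],
})
df2 = pd.DataFrame({
    "col1": [1, np.nan, 4, 1],
    "col2": [2, np.nan, 4, np.nan],
    "col3": [-3, 7, 8, 9],
    "col4": [5, 5, 1, 3],
})

# Keep df1's value where it is not null, otherwise take df2's value.
# Columns present in only one frame are carried over; no column names are hard-coded.
final = df1.combine_first(df2)
print(final)
```

Where both frames hold a non-null value at the same position, df1 wins; where both are null, the result stays null.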

Pandas new column that is sum of last N columns

Using Python 3.7 & Pandas, how can I create a new column that is the sum of the last N columns?
There are several questions with this title (example here), but they all seem to refer to rolling through the last N rows, which is not what I am after.
col1 = [0,1,1,0,0,0,1,1,1]
col2 = [1,5,9,2,4,2,5,6,1]
col3 = [25,14,2,15,18,98,65,4,77]
col4 = [1,1,1,1,1,1,1,1,1]
df = pd.DataFrame(list(zip(col1, col2, col3, col4)), columns =['col1', 'col2', 'col3', 'col4'])
Desired Result
Let us try
c = df.columns
df['last_2'] = df.loc[:,c[-2:]].sum(1)
#df['last_3'] = df.loc[:,c[-3:]].sum(1)
0 26
1 15
2 3
3 16
4 19
5 99
6 66
7 5
8 78
dtype: int64
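A position-based variant of the same idea, as a sketch: `n` and `last_n` are names introduced here. The sum is computed before the new column is assigned, because once it is added it would itself count as one of the last N columns.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [0, 1, 1, 0, 0, 0, 1, 1, 1],
    "col2": [1, 5, 9, 2, 4, 2, 5, 6, 1],
    "col3": [25, 14, 2, 15, 18, 98, 65, 4, 77],
    "col4": [1, 1, 1, 1, 1, 1, 1, 1, 1],
})

n = 2
# iloc selects by position, so this works whatever the columns are called.
last_n = df.iloc[:, -n:].sum(axis=1)
df[f"last_{n}"] = last_n
print(df)
```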

split string for a range of columns Pandas

How can I split the string to list for each column for the following Pandas dataframe with many columns?
col1 col2
0/1:9,12:21:99 0/1:9,12:22:99
0/1:9,12:23:99 0/1:9,15:24:99
Desired output:
col1 col2
[0/1,[9,12],21,99] [0/1,[9,12],22,99]
[0/1,[9,12],23,99] [0/1,[9,15],24,99]
I could do:
df['col1'].str.split(":", n = -1, expand = True)
df['col2'].str.split(":", n = -1, expand = True)
but I have many columns, I was wondering if I could do it in a more automated way?
I would then like to calculate the mean of the element at index 2 of each list for every row: for the first row, the mean of 21 and 22, and for the second row, the mean of 23 and 24.
If the data is like your sample, you can make use of stack:
new_df = (df.iloc[:,0:2]
.stack()
.str.split(':',expand=True)
)
Then new_df is double indexed:
          0     1   2   3
0 col1  0/1  9,12  21  99
  col2  0/1  9,12  22  99
1 col1  0/1  9,12  23  99
  col2  0/1  9,15  24  99
And if you want the mean of the values at index 2 (21 and 22 in the first row):
new_df[2].unstack(level=-1).astype(float).mean(axis=1)
gives:
0 21.5
1 23.5
dtype: float64
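Put together as a runnable sketch that covers every column without listing them: the sample frame is rebuilt from the question, and `fields` and `row_means` are names introduced here. Grouping on the first index level is used in place of unstack; for this data it gives the same per-row means.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["0/1:9,12:21:99", "0/1:9,12:23:99"],
    "col2": ["0/1:9,12:22:99", "0/1:9,15:24:99"],
})

# stack() turns all columns into one long Series indexed by (row, column),
# so a single str.split call handles every column.
fields = df.stack().str.split(":", expand=True)

# Field at index 2 holds 21, 22, 23, 24; average it per original row.
row_means = fields[2].astype(float).groupby(level=0).mean()
print(row_means)
```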