Pandas new column that is sum of last N columns - pandas

Using Python 3.7 & Pandas, how can I create a new column that is the sum of the last N columns?
There are several questions with this title (example here), but they all seem to refer to rolling through the last N rows, which is not what I am after.
import pandas as pd

col1 = [0,1,1,0,0,0,1,1,1]
col2 = [1,5,9,2,4,2,5,6,1]
col3 = [25,14,2,15,18,98,65,4,77]
col4 = [1,1,1,1,1,1,1,1,1]
df = pd.DataFrame(list(zip(col1, col2, col3, col4)), columns =['col1', 'col2', 'col3', 'col4'])
Desired Result

Let us try
c = df.columns
df['last_2'] = df.loc[:, c[-2:]].sum(axis=1)
#df['last_3'] = df.loc[:, c[-3:]].sum(axis=1)
0 26
1 15
2 3
3 16
4 19
5 99
6 66
7 5
8 78
dtype: int64
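More generally, for any N you can reuse the column snapshot c taken above. This is only a sketch assuming the df and c defined above; c was captured before last_2 was added, so it still lists only the original columns:
# sum of the last N original columns, for an arbitrary N
N = 3
df[f'last_{N}'] = df.loc[:, c[-N:]].sum(axis=1)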

Related

Fix the length of some columns using Pandas

I am trying to add some columns to a pandas DataFrame, but I cannot set the character length of the columns.
I want to add the new fields as strings with a value of null and a fixed length of two characters.
Any idea is welcome.
import pandas as pd
df[["Assess", "Operator","x", "y","z", "g"]]=None
If you need a fixed number of columns in the new DataFrame, use:
from itertools import product
import string

# single-character column names
letters = string.ascii_letters
#print(len(letters)) #52

# if you need two-character names instead, build all pairs
#letters = [''.join(x) for x in product(letters, letters)]
#print(len(letters)) #2704
df = pd.DataFrame({'col1':[4,5], 'col':[8,2]})
# threshold
N = 5
# new column names: take as many letters as the difference to the original column count
# max is used so the count cannot go negative when the frame already has N or more columns
cols = list(letters[:max(0, N - len(df.columns))])
# add the new columns filled with None,
# then keep only the first N columns (in case the original already has more than N)
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
   col1  col     a     b     c
0     4    8  None  None  None
1     5    2  None  None  None
Test with more columns than the N threshold:
df = pd.DataFrame({'col1':[4,5], 'col2':[8,2],'col3':[4,5],
'col4':[8,2], 'col5':[7,3],'col6':[9,0], 'col7':[5,1]})
print (df)
   col1  col2  col3  col4  col5  col6  col7
0     4     8     4     8     7     9     5
1     5     2     5     2     3     0     1
N = 5
cols = list(letters[:max(0, N - len(df.columns))])
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
   col1  col2  col3  col4  col5
0     4     8     4     8     7
1     5     2     5     2     3
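If this is needed repeatedly, the same steps can be wrapped in a small helper. This is only a sketch under the same assumptions (single-character names from string.ascii_letters and a None fill value), and pad_columns is a made-up name:
import string
import pandas as pd

def pad_columns(df, n, fill=None, names=string.ascii_letters):
    # add columns (or drop trailing ones) so the frame ends up with exactly n columns
    extra = list(names[:max(0, n - len(df.columns))])
    return df.assign(**dict.fromkeys(extra, fill)).iloc[:, :n]

df = pd.DataFrame({'col1': [4, 5], 'col': [8, 2]})
print(pad_columns(df, 5))
#    col1  col     a     b     c
# 0     4    8  None  None  None
# 1     5    2  None  None  None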

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like to produce a new dataframe by multiplying each row of the second dataframe against each row of the first dataframe, where the Name1 value is multiplied by the values in the dataframe1 column that matches that row's Name4 value.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First, I changed the data in df1 at this point so this new example turns out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factor from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row] * len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
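A vectorized sketch of the same idea, assuming df1 and df2 as defined in the question's edit (it selects df1 columns by the labels in df2['col4'], so every label in col4 must exist as a column of df1):
# repeat every df2 row once per df1 row, then overwrite col1 with
# (matching df1 column) * (original col1 value)
out = df2.loc[df2.index.repeat(len(df1))].reset_index(drop=True)
out['col1'] = (df1[list(df2['col4'])].to_numpy().T.ravel()
               * df2['col1'].to_numpy().repeat(len(df1)))
This produces the same rows as the stretch approach above.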

split string for a range of columns Pandas

How can I split the string into a list for each column of the following Pandas dataframe with many columns?
col1 col2
0/1:9,12:21:99 0/1:9,12:22:99
0/1:9,12:23:99 0/1:9,15:24:99
Desired output:
col1 col2
[0/1,[9,12],21,99] [0/1,[9,12],22,99]
[0/1,[9,12],23,99] [0/1,[9,15],24,99]
I could do:
df['col1'].str.split(":", n = -1, expand = True)
df['col2'].str.split(":", n = -1, expand = True)
but I have many columns, so I was wondering if I could do it in a more automated way.
I would then like to calculate the mean of the 2nd element of each list for every row: for the first row, the mean of 21 and 22, and for the second row, the mean of 23 and 24.
If the data is like your sample, you can make use of stack:
new_df = (df.iloc[:, 0:2]
            .stack()
            .str.split(':', expand=True)
          )
Then new_df is double indexed:
          0     1   2   3
0 col1  0/1  9,12  21  99
  col2  0/1  9,12  22  99
1 col1  0/1  9,12  23  99
  col2  0/1  9,15  24  99
And say you want the mean of the 2nd numbers:
new_df[2].unstack(level=-1).astype(float).mean(axis=1)
gives:
0 21.5
1 23.5
dtype: float64
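For a fully reproducible version, the split and the mean can be chained; the frame below is a hypothetical reconstruction of the question's sample data:
import pandas as pd

df = pd.DataFrame({'col1': ['0/1:9,12:21:99', '0/1:9,12:23:99'],
                   'col2': ['0/1:9,12:22:99', '0/1:9,15:24:99']})

# split every selected column at once via stack, then average field 2 per row
new_df = df.iloc[:, 0:2].stack().str.split(':', expand=True)
print(new_df[2].unstack(level=-1).astype(float).mean(axis=1))
# 0    21.5
# 1    23.5
# dtype: float64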

Python Pandas: LabelEncoding fitting unknown variables

Hi, I have a dataframe full of strings, and I want to encode these strings and store their corresponding codes.
I want to produce these codes on one column and fit them onto another column.
When I fit these codes on some other column that contains a string I haven't seen in my training column, I want to create another unique value for it.
I have tried the LabelEncoder, but it raises an error on previously unseen strings.
For example a have dataframe:
col1 col2
a a
b b
c e
d f
After training LabelEncoding on first column I get something like this:
col1 col2
1 a
2 b
3 e
4 f
After fitting the created codes onto the second column, I want to have something like this:
col1 col2
1 1
2 2
3 5
4 6
What is the easiest way to do this? Thank you.
I created the df DataFrame by copying the sample from the OP's post:
df = pd.read_clipboard()
When we print it, it looks like this:
col1 col2
0 a a
1 b b
2 c e
3 d f
Could you please try the following. I have included only the first six letters here; you could list all of them in case they appear in your actual Input_file.
dict1 = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
# map known letters to their codes, leave anything else unchanged
df.applymap(lambda s: dict1.get(s, s))
Output will be as follows.
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
You could do the encoding yourself using pd.factorize:
# factorize the sorted union of values from both columns, then build a 1-based mapping
v, k = pd.factorize(sorted(df.stack().unique()))
m = dict(zip(k.tolist(), (v + 1).tolist()))
df.replace(m)
Output:
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
I think the real trick is to stack col1 and col2, then encode the values of both columns as one.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(df.stack())
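For completeness, here is a minimal runnable sketch of that idea, assuming scikit-learn is available and rebuilding the sample frame by hand; note that LabelEncoder assigns 0-based codes, so the values are shifted by one compared to the question's example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col1': list('abcd'), 'col2': list('abef')})

le = LabelEncoder()
le.fit(df.stack())               # learn one mapping from the values of both columns
print(df.apply(le.transform))    # apply that same mapping column by column
#    col1  col2
# 0     0     0
# 1     1     1
# 2     2     4
# 3     3     5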

Map column names if data is same in two dataframes

I have two pandas dataframes
df1 = A B C
1 2 3
2 3 4
3 4 5
df2 = X Y Z
1 2 3
2 3 4
3 4 5
I need to map the columns based on the data: if the data is the same, map the column names.
Output = col1 col2
A X
B Y
C Z
I cannot find any built-in function to support this, hence simply loop over all columns:
import pandas

pairs = []
for col1 in df1.columns:
    for col2 in df2.columns:
        # record the pair when the two columns hold identical data
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))

output = pandas.DataFrame(pairs, columns=['col1', 'col2'])
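If the frames are wide, a hashing variant of the same idea avoids comparing every column pair. This is only a sketch assuming df1 and df2 as above, exact value matches, and hashable column values; if several df2 columns hold identical data, the last one wins:
import pandas as pd

# index df2's columns by their values once, then look each df1 column up
lookup = {tuple(df2[c2]): c2 for c2 in df2.columns}
mapping = {c1: lookup.get(tuple(df1[c1])) for c1 in df1.columns}
output = pd.DataFrame(list(mapping.items()), columns=['col1', 'col2'])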