python - how do I perform the specific df operation? - pandas

this is my df
a = [1,3,4,5,6]
b = [5,3,7,8,9]
c = [0,7,34,6,87]
dd = pd.DataFrame({"a":a,"b":b,"c":c})
I need the output such that the first row of the df stays the same, and for every subsequent row the value in column b = value in column a + value in column b from the row just above + value in column c.
i.e. dd.iloc[1,1] will be 15 (i.e. 3 + 5 + 7)
dd.iloc[2,1] will be 53 (i.e. 4 + 15 + 34). Please note that it used the new value of [1,1], i.e. 15 (instead of the old value, which was 3)
dd.iloc[3,1] will be 64 (5 + 53 + 6). Again it used the updated value of [2,1] (i.e. 53 instead of 7)
expected output

Use:
from numba import jit
@jit(nopython=True)
def f(a, b, c):
    # each b[i] depends on the already-updated b[i-1], so loop in order
    for i in range(1, a.shape[0]):
        b[i] = b[i-1] + a[i] + c[i]
    return b
dd['b'] = f(dd.a.to_numpy(), dd.b.to_numpy(), dd.c.to_numpy())
print (dd)
   a    b   c
0  1    5   0
1  3   15   7
2  4   53  34
3  5   64   6
4  6  157  87
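Because each new b is just the previous new b plus a + c, the same result can probably be had without numba as a running total. This is a minimal sketch of that idea, not part of the answer above; the name step is only illustrative.

```python
import pandas as pd

dd = pd.DataFrame({"a": [1, 3, 4, 5, 6],
                   "b": [5, 3, 7, 8, 9],
                   "c": [0, 7, 34, 6, 87]})

# Increment added at each row: a + c, except row 0, which keeps its original b.
step = dd["a"] + dd["c"]
step.iloc[0] = dd["b"].iloc[0]

# The running total of the increments is the updated column b.
dd["b"] = step.cumsum()
print(dd)
```

This only works because the recurrence is a plain sum; a rule that is not additive would still need the explicit loop.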

Related

Pandas with a condition select a value from a column and multiply by scalar in new column, row by row

A value in 'Target_Labels' is either 0.0,1.0,2.0 in float64.
Based on this value, I would like to look up a value in one of three columns 'B365A','B365D','B365H' and multiply this value by 10 in a new column. This operation needs to be row wise throughout the entire DataFrame.
I have tried many combinations but nothing seems to work...
final['amount'] = final['Target_Labels'].apply((lambda x: 'B365A' * 10 if x==0.0 else ('B365D' * 10 if x ==1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4])==0:
        x[var3]*10
    elif (x[var4])==1:
        x[var1]*10
    else:
        x[var2]*10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...
Use NumPy integer-array indexing to pick one cell per row:
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1
>>> final
   B365A  B365D  B365H  Target_Labels
0     11     12     13              1
1     11     12     13              2
2     11     12     13              3
>>> final['amount'] = final.values[(range(len(final)),
...                                 final['Target_Labels'] - 1)] * 10
>>> final
   B365A  B365D  B365H  Target_Labels  amount
0     11     12     13              1     110
1     11     12     13              2     120
2     11     12     13              3     130
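The sample above uses integer labels 1, 2, 3, while the question has float labels 0.0, 1.0, 2.0. A sketch of the same indexing idea for that case, assuming label 0 means B365A, 1 means B365D and 2 means B365H (as in the asker's lambda); the odds values and the names odds_cols, row_idx and col_idx are made up for illustration:

```python
import numpy as np
import pandas as pd

final = pd.DataFrame({"B365A": [1.5, 2.0, 3.1],
                      "B365D": [3.2, 3.0, 3.3],
                      "B365H": [4.0, 2.5, 1.9],
                      "Target_Labels": [0.0, 1.0, 2.0]})

# Column order must match the label meaning: 0 -> B365A, 1 -> B365D, 2 -> B365H.
odds_cols = ["B365A", "B365D", "B365H"]

# Float labels cannot be used as positions, so cast them to int first.
col_idx = final["Target_Labels"].astype(int).to_numpy()
row_idx = np.arange(len(final))

# Pick one cell per row, then scale it.
final["amount"] = final[odds_cols].to_numpy()[row_idx, col_idx] * 10
print(final)
```

Selecting only the three odds columns before indexing keeps the label column itself out of the array, so no offset is needed.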

'Series' objects are mutable, thus they cannot be hashed trying to sum columns and datatype is float

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
The dataframe is formatted as:
ID  G1  Z1  Z2  ...ZN
0   50  13  12  ...62
1   51  62  23  ...19
Dataframe with the summed column:
ID  G1  Z1  Z2  ...ZN  D3counts
0   50  13  12  ...62  sum(Z1:ZN in row 0)
1   51  62  23  ...19  sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
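The outer day3prep.sum(...) receives the Series of row sums as its axis argument, which is what raises the error. You only need the inner part: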
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5), 'G1': random.sample(range(10), 5),
                         'Z1': random.sample(range(10), 5), 'Z2': random.sample(range(10), 5), 'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
   ID  G1  Z1  Z2  Z3  D3counts
0   1   2   0   8   8        16
1   0   1   9   0   6        15
2   4   8   1   3   3         7
3   9   4   7   5   7        19
4   6   3   6   6   4        16
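A small deterministic sketch of the same row-wise sum, with made-up values. It also shows that sum skips NaN by default, so filling NaNs with 0 first should not be necessary, and that the columns can be selected by label instead of position (the label range Z1 to Z3 is specific to this toy frame):

```python
import numpy as np
import pandas as pd

day3prep = pd.DataFrame({"ID": [0, 1],
                         "G1": [50, 51],
                         "Z1": [13.0, 62.0],
                         "Z2": [12.0, np.nan],
                         "Z3": [62.0, 19.0]})

# Sum every column from the third one on, row by row; NaN is skipped by default.
day3prep["D3counts"] = day3prep.iloc[:, 2:].sum(axis=1)

# Same selection by label, which does not depend on column positions.
by_label = day3prep.loc[:, "Z1":"Z3"].sum(axis=1)
print(day3prep)
```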

how to apply a function that uses multiple columns as input in pandas?

I have a function relative_humidity(temperature, humidity_index) which takes two variables.
I also have a DataFrame with one column being temperature and the other humidity_index, and I am trying to use this function to create a new humidity column which is calculated using these rows.
I have tried using the df.apply() function but it hasn't worked for me since I am trying to use more than one column. I have also tried looping through every row and applying the function to each row, but this appears too slow. Any help appreciated.
EDIT: my function looks like this:
def relative_humidity_calculator(T, HI):
    a = c_6 + c_8*T + c_9*T**2
    b = c_3 + c_4*T + c_7*T**2
    c = c_1 + c_2*T + c_5*T**2 -HI
    solutions = []
    #adding both solutions of quadratic to list
    if b**2-4*a*c>=0:
        solutions.append((-b+np.sqrt(b**2-4*a*c))/(2*a))
        solutions.append((-b-np.sqrt(b**2-4*a*c))/(2*a))
        #solution is the correct one if it is between 0 and 100
        if solutions[0]>0 and solutions[0]<100:
            return solutions[0]
        else:
            return solutions[1]
    else:
        return print('imaginary roots', T, HI, a, b, c)
Based on your updated question, you can do this with vectorized Series arithmetic instead of calling the function row by row:
import numpy as np
import pandas as pd
# sample data (placeholder coefficients):
c1,c2,c3,c4,c5,c6,c7,c8,c9 = range(9)
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100,(10,2)), columns=['T','HI'])
# shorthand for Temp and Humidity-Index (as floats, so the products cannot overflow)
T = df['T'].astype(float)
HI = df['HI'].astype(float)
# series arithmetic operations are allowed
a = c6 + c8*T + c9*T**2
b = c3 + c4*T + c7*T**2
c = c1 + c2*T + c5*T**2 - HI
# discriminant too
deltas = b**2-4*a*c
# NaN where the discriminant is negative, i.e. no real roots
delta_roots = np.sqrt(deltas.where(deltas >= 0))
# two solutions of quadratic
s0 = (- b + delta_roots)/(2*a)
s1 = (- b - delta_roots)/(2*a)
df['rel_hum'] = np.select(((s0>0) & (s0<100), # condition on first solution
                           deltas>=0),         # quadratic has solutions
                          (s0, s1), np.nan)
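Output as originally posted (the non-NaN values only appear when the products wrap around in 32-bit integers; with 64-bit or float inputs these placeholder coefficients give a negative discriminant, hence NaN, in every row):
    T  HI   rel_hum
0  37  12       NaN
1  72   9  0.129917
2  75   5  0.028714
3  79  64 -0.629721
4  16   1       NaN
5  76  71 -0.742304
6   6  25       NaN
7  50  20       NaN
8  18  84       NaN
9  11  28       NaN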
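If the function cannot easily be rewritten with Series arithmetic, the literal question (calling a two-argument function on two columns) can be sketched as below. toy_humidity is a hypothetical stand-in for the real relative_humidity function, and the column names and values are made up. Both options still call Python once per row, so they are typically much slower than the vectorized version above.

```python
import pandas as pd

def toy_humidity(temperature, humidity_index):
    # Hypothetical stand-in for the real relative_humidity(temperature, humidity_index).
    return humidity_index - temperature / 2

df = pd.DataFrame({"temperature": [20, 30], "humidity_index": [50, 65]})

# Option 1: apply with axis=1 hands each row to the lambda as a Series.
df["humidity"] = df.apply(
    lambda row: toy_humidity(row["temperature"], row["humidity_index"]), axis=1)

# Option 2: zip the two columns; usually faster than apply(axis=1).
df["humidity_zip"] = [toy_humidity(t, hi)
                      for t, hi in zip(df["temperature"], df["humidity_index"])]
print(df)
```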

Combine two columns of numbers in dataframe into single column using pandas/python

I'm very new to Pandas and Python.
I have a 3226 x 61 dataframe and I would like to combine two columns into a single one.
The two columns I would like to combine are both integers - one has either one or two digits (1 through 52) while the other has three digits (e.g., 1 or 001, 23 or 023). I need the output to be a five digit integer (e.g., 01001 or 52023). There will be no mathematical operations with the resulting integers - I will need them only for look-up purposes.
Based on some of the other posts on this fantastic site, I tried the following:
df['YZ'] = df['Y'].map(str) + df['Z'].map(str)
But that returns "1.00001" for a first column of "1" and a second column of "001", I believe because converting "1" to a str turns it into "1.0", and "001" is then added to the end.
I've also tried:
df['YZ'] = df['Y'].join(df['Z'])
Getting the following error:
AttributeError: 'Series' object has no attribute 'join'
I've also tried:
df['Y'] = df['Y'].astype(int)
df['Z'] = df['Z'].astype(int)
df['YZ'] = df[['Y','Z']].apply(lambda x: ''.join(x), axis=1)
Getting the following error:
TypeError: ('sequence item 0: expected str instance, numpy.int32 found', 'occurred at index 0')
A copy of the columns is below:
1 1
1 3
1 5
1 7
1 9
1 11
1 13
I understand there are two issues here:
1. Combining the two columns
2. Getting the correct format (five digits)
Frankly, I need help with both but would most appreciate help with the column combining problem.
I think you need to convert the columns to strings, pad them with zeros using zfill, and simply concatenate them with +:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
Sample:
df=pd.DataFrame({'Y':[1,3,5,7], 'Z':[10,30,51,74]})
print (df)
   Y   Z
0  1  10
1  3  30
2  5  51
3  7  74
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
print (df)
   Y   Z     YZ
0  1  10  01010
1  3  30  03030
2  5  51  05051
3  7  74  07074
If you also need to change the original columns:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df['Y'] + df['Z']
print (df)
    Y    Z     YZ
0  01  010  01010
1  03  030  03030
2  05  051  05051
3  07  074  07074
Solution with join:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df[['Y','Z']].apply('-'.join, axis=1)
print (df)
    Y    Z      YZ
0  01  010  01-010
1  03  030  03-030
2  05  051  05-051
3  07  074  07-074
And without changing the original columns:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + '-' + df['Z'].astype(str).str.zfill(3)
print (df)
   Y   Z      YZ
0  1  10  01-010
1  3  30  03-030
2  5  51  05-051
3  7  74  07-074
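The "1.0" in the question suggests the columns may be stored as floats, in which case astype(str) alone would keep the ".0". A sketch under that assumption (the float dtype is a guess, and the values are taken from the question's examples): cast to int first, then pad.

```python
import pandas as pd

# Assumed: Y and Z arrive as floats, e.g. after reading a file with missing values.
df = pd.DataFrame({"Y": [1.0, 52.0], "Z": [1.0, 23.0]})

# Cast to int to drop the ".0", then to str, then left-pad with zeros.
df["YZ"] = (df["Y"].astype(int).astype(str).str.zfill(2)
            + df["Z"].astype(int).astype(str).str.zfill(3))
print(df)
```

The result stays a string column on purpose: an integer column would drop the leading zero of 01001.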

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
    K  L         M   N
0   0  a -1.374201  35
1   0  b  1.415697  29
2   0  a  0.233841  18
3   0  b  1.550599  30
4   0  a -0.178370  63
5   0  b -1.235956  42
6   0  a  0.088046   2
7   0  b  0.074238  84
8   1  a  0.469924  44
9   1  b  1.231064  68
10  2  a -0.979462  73
11  2  b  0.322454  97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
(   K  L         M   N
 0  0  a -1.374201  35
 1  0  b  1.415697  29
 2  0  a  0.233841  18
 3  0  b  1.550599  30
 4  0  a -0.178370  63
 5  0  b -1.235956  42
 6  0  a  0.088046   2
 7  0  b  0.074238  84,
     K  L         M   N
 8   1  a  0.469924  44
 9   1  b  1.231064  68
 10  2  a -0.979462  73
 11  2  b  0.322454  97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a   -0.307671
b    0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of klmn1 to column L, pandas will automatically align that index with the index of any Series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a    0.777595
a   -0.671791
b    0.779920
b   -0.128690
Name: M
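The aligned result above is indexed by L and no longer in the original row order. A sketch of an alternative that keeps klmn1's own index, covering both approaches the question asks about; it looks up each row's mean with Series.map, which is a swapped-in technique rather than the answer's set_index approach. The data is copied from the question.

```python
import pandas as pd

klmn1 = pd.DataFrame({"K": [1, 1, 2, 2],
                      "L": ["a", "b", "a", "b"],
                      "M": [0.469924, 1.231064, -0.979462, 0.322454],
                      "N": [44, 68, 73, 97]},
                     index=[8, 9, 10, 11])
m0 = pd.Series({"a": -0.307671, "b": 0.451144}, name="M")

# Look up the group mean for each row's L value; the result shares klmn1's index.
row_means = klmn1["L"].map(m0)

# Approach 1: leave klmn1 unchanged and build a new dataframe.
klmn11 = klmn1.assign(M=klmn1["M"] - row_means)

# Approach 2: overwrite the M column of klmn1 itself.
klmn1["M"] = klmn1["M"] - row_means
print(klmn11)
```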
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
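A toy sketch of how the two options differ when one frame lacks a row the other has; df1 and df2 here are made-up frames, not the ones from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, 2.0]}, index=["r0", "r1"])
df2 = pd.DataFrame({"x": [10.0]}, index=["r0"])

# fill_value=0: the row missing from df2 is treated as 0, so r1 keeps its value.
filled = df1.subtract(df2, fill_value=0)

# fill_value=None (the default): a row without a partner becomes NaN.
unfilled = df1.subtract(df2, fill_value=None)

print(filled)
print(unfilled)
```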