pandas function, use previously computed value - pandas

I had a simple question that I have not found a simple answer to. As an example, this data frame can be used:
import pandas as pd

A = pd.Series([0.1, -0.2, 0.14, 0.12, -0.11])
B = pd.Series([1.0, 3.0, 2.0, 6.0, 9.0])
df = pd.DataFrame({'A': A, 'B': B})
I now would like to create a column C as follows:
C_i = A_i*(B_i+C_{i-1})
i.e. to compute the value of C I need the previously computed value of C. This can be done with a simple for loop, but I would like to use map, apply, or some other pandas functionality. Can this be done in a simple manner?
I tested it in a spreadsheet and this is what I am looking for:
    A  B         C
 0.10  1   0.10000
-0.20  3  -0.62000
 0.14  2   0.19320
 0.12  6   0.74318
-0.11  9  -1.07175

One way to get this:
df['C'] = df.A * df.B
df['C'] = df.C + (df.A * df.C.shift().fillna(0))
df
Which yields:
A B C
0 0.10 1 0.100
1 -0.20 3 -0.620
2 0.14 2 0.196
3 0.12 6 0.754
4 -0.11 9 -1.069
Which looks like what you wanted.
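Note that these values differ slightly from the spreadsheet (0.196 vs 0.19320, 0.754 vs 0.74318): shift() folds in only one previous row, while the recurrence C_i = A_i*(B_i + C_{i-1}) depends on every earlier value. If the exact spreadsheet values are needed, here is a minimal sketch using itertools.accumulate (the initial argument requires Python 3.8+):
from itertools import accumulate

# Fold the recurrence over the rows, seeded with C_{-1} = 0,
# then drop the seed value.
vals = accumulate(zip(df.A, df.B), lambda c, ab: ab[0] * (ab[1] + c), initial=0.0)
df['C'] = list(vals)[1:]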

Related

Generate dataframe with same key but different value

How can I generate a dataframe with a common key but different values?
import pandas as pd
A = {"ID":["A", "B","C"], "Weight":[500,300,200]}
B = {"ID":["A", "B","D"], "Weight":[500,100,100]}
dfA = pd.DataFrame(data=A)
dfB = pd.DataFrame(data=B)
dfC = dfA.merge(dfB, how='outer', left_on=['ID'], right_on=['ID'])
dfC
Current output is:
ID Weight_x Weight_y
0 A 500.0 500.0
1 B 300.0 100.0
2 C 200.0 NaN
3 D NaN 100.0
But my expected output is the following (ID is the common key: A has identical values in both frames, while C and D are not common elements):
ID Weight_x Weight_y
0 B 300.0 100.0
Use a simple merge with the default parameter how='inner', as suggested by @ALollz, plus query to keep only the rows with different weight values:
>>> pd.merge(dfA, dfB, on='ID').query("Weight_x != Weight_y")
ID Weight_x Weight_y
1 B 300 100
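Equivalently, the same inner-join-then-filter can be written with a boolean mask instead of query; a small sketch:
merged = pd.merge(dfA, dfB, on='ID')  # inner join keeps only the shared IDs
merged[merged.Weight_x != merged.Weight_y]
  ID  Weight_x  Weight_y
1  B       300       100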

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and unsure how to approach this problem: I have a dataframe where the information I need is mostly grouped in blocks of 2, 3, and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filtering for each ID and unstacking the rows into a new dataframe. I could not get far with the unstack or groupby functions. Is there an easy function or combination that can handle this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in 4 rows, but not always. The final dataframe should have only one row per ID, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value; GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID  col1  col2
A   1     2
A   3     4
B   5     nan
B   7     8
B   9     10
B   nan   12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
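Applied back to the question's semicolon-separated file, the same recipe might look like the sketch below. The filename is a placeholder and encoding='latin-1' is only a guess at the non-ASCII issue mentioned above; the file is assumed to have no header row, with the ID in the first column.
import pandas as pd

df = pd.read_csv('measurements.csv', sep=';', header=None, encoding='latin-1')
g = df.groupby(0).cumcount()  # row position within each ID group
wide = df.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b + 1}' for a, b in wide.columns]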

How to convert my column into row using Pandas

I have my resulting data frame as below. How can I convert the columns into rows? My column headers are numbers, which have to be avoided.
https://i.stack.imgur.com/RvwVD.png
IIUC, try this:
import pandas as pd

row = [*'ABCDE']
row2 = [1, 2, 3, 4, 5]
df = pd.DataFrame([row, row2])
print(df)
Input dataframe:
0 1 2 3 4
0 A B C D E
1 1 2 3 4 5
Use this code:
df_out = df.T.set_index(0)
print(df_out)
Output:
1
0
A 1
B 2
C 3
D 4
E 5
You can use a simple reshape to do this.
There are many other ways to do it; here I show one way.
I think this is how your data frame looks:
df = pd.DataFrame({0:['neg',0.015],1:['neu',0.006],2:['pos',0.014]})
You can do a simple reshape using the lines below:
import pandas as pd
import numpy as np
pd.DataFrame(np.array(df.iloc[1:]).reshape(-1,1))
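For reference, on that frame the reshape keeps only the numeric row and turns it into a single column; a quick check:
print(pd.DataFrame(np.array(df.iloc[1:]).reshape(-1, 1)))
       0
0  0.015
1  0.006
2  0.014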
If I understand you correctly, you want to turn your first row into your header?
This is the way I would do it:
df.T.set_index(0).T
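On the sample frame from the first answer, a quick check of what that produces:
df = pd.DataFrame([[*'ABCDE'], [1, 2, 3, 4, 5]])
print(df.T.set_index(0).T)
0  A  B  C  D  E
1  1  2  3  4  5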

Pandas Creating Normal Dist series

I'm trying to convert an Excel "normal distribution" formula into Python:
(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
For example: Here's my given df
Id a b c
ijk 4 3.5 12.53
xyz 12 3 10.74
My goal:
Id a b c 0 1 2 3
ijk 4 3.5 12.53 1 .93 .87 .81
xyz 12 3 10.74 1 .87 .76 .66
Here's the math behind it:
column 0: always 1
column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
This is what I have so far:
df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1
I'm not sure what to do after this.
This is how I use the normal distribution function:
https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58
Many many thanks!
NORM.DIST(..., TRUE) is the cumulative distribution function, so 1 - NORM.DIST(..., TRUE) is the survival function. Both are available under scipy's stats module (scipy.stats.norm). For example,
import scipy.stats as ss
ss.norm.cdf(4, 3.5, 12.53)
Out:
0.51591526057026538
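The formula actually needs the complement, the survival function, which scipy exposes directly; a quick check against the cdf value above:
ss.norm.sf(4, 3.5, 12.53)
Out:
0.48408473942973462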
For your case, you can first define a function:
def normalize(a, b, c, col):
    return ss.norm.sf(a + col, b, c) / ss.norm.sf(a, b, c)
and call that function with apply:
for col in range(4):
    df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
df
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
This is not the most efficient approach, as it calculates the survival function for the same values again and involves two loops. One level of looping can be avoided by passing an array of values to ss.norm.sf:
import numpy as np

out = df.apply(
    lambda x: pd.Series(
        ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
    ), axis=1
)
Out:
0 1 2 3
0 1.0 0.934455 0.869533 0.805636
1 1.0 0.875050 0.760469 0.656303
And you can use join to add this to your original DataFrame:
df.join(out)
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
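If performance matters further, the apply can be dropped entirely by broadcasting everything through a single ss.norm.sf call; a sketch, assuming the same df with columns a, b and c:
import numpy as np
import pandas as pd
import scipy.stats as ss

grid = df['a'].to_numpy()[:, None] + np.arange(4)  # shape (n_rows, 4): a, a+1, a+2, a+3
sf = ss.norm.sf(grid, df['b'].to_numpy()[:, None], df['c'].to_numpy()[:, None])
baseline = ss.norm.sf(df['a'], df['b'], df['c'])[:, None]  # per-row sf(a, b, c)
result = df.join(pd.DataFrame(sf / baseline, index=df.index))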

Pandas: Imputing Missing Values to Data Frame

Suppose I have a data frame with some missing values, as below:
import pandas as pd
df = pd.DataFrame([[1,3,'NA',2], [0,1,1,3], [1,2,'NA',1]], columns=['W', 'X', 'Y', 'Z'])
print(df)
The variable Y is missing two values. Say I run some imputation model and come up with an estimate of what the two values should be:
to_impute = [2,1]
What is the best way of replacing the two NA's with those two values? I know of ways that are fairly roundabout, e.g. looping over to_impute and using df.iloc to add each value. But I'm hoping there is a concise and non-iterative way.
(This is something that is easy in R, and I'm hoping it can be easy in Pandas.)
In pandas, missing values should be NaN rather than the string 'NA', so first replace it, then use fillna (np.nan comes from import numpy as np):
df.Y = df.Y.replace('NA', np.nan)
df.Y = df.Y.fillna(pd.Series([1, 2], index=df.index[df.Y.isnull()]))
df
Out[1375]:
W X Y Z
0 1 3 1.0 2
1 0 1 1.0 3
2 1 2 2.0 1
Alternatively, treat your 'NA' as a string and assign directly:
df.loc[df.Y=='NA','Y']=[1,2]
df
Out[1380]:
W X Y Z
0 1 3 1 2
1 0 1 1 3
2 1 2 2 1
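If you want the fill order tied to the question's to_impute list (note that both answers above hard-code [1, 2]), a self-contained sketch:
import pandas as pd

df = pd.DataFrame([[1, 3, 'NA', 2], [0, 1, 1, 3], [1, 2, 'NA', 1]],
                  columns=['W', 'X', 'Y', 'Z'])
to_impute = [2, 1]
df.loc[df.Y == 'NA', 'Y'] = to_impute  # fills in row order: row 0 gets 2, row 2 gets 1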