Pandas Creating Normal Dist series - pandas

I'm trying to convert an excel "normal distribution" formula into python.
(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
For example: Here's my given df
Id a b c
ijk 4 3.5 12.53
xyz 12 3 10.74
My goal:
Id a b c 0 1 2 3
ijk 4 3.5 12.53 1 .93 .87 .81
xyz 12 3 10.74 1 .87 .76 .66
Here's the math behind it:
column 0: always 1
column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
This is what I have so far:
df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1
I'm not sure what to do after this.
This is how I use the normal distribution function:
https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58
Many many thanks!

NORM.DIST(..., TRUE) means the cumulative distribution function and 1 - NORM.DIST(..., TRUE) means the survival function. These are available under scipy's stats module (see ss.norm). For example,
import scipy.stats as ss
ss.norm.cdf(4, 3.5, 12.53)
Out:
0.51591526057026538
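The survival function is just the complement of the CDF, so for the same arguments:
ss.norm.sf(4, 3.5, 12.53)
Out:
0.48408473942973462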
For your case, you can first define a function:
def normalize(a, b, c, col):
    return ss.norm.sf(a + col, b, c) / ss.norm.sf(a, b, c)
and call that function with apply:
for col in range(4):
    df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
df
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
This is not the most efficient approach, as it calculates the survival function for the same values again and involves two loops. One level of looping can be avoided by passing an array of values to ss.norm.sf:
out = df.apply(
    lambda x: pd.Series(
        ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
    ), axis=1
)
Out:
0 1 2 3
0 1.0 0.934455 0.869533 0.805636
1 1.0 0.875050 0.760469 0.656303
And you can use join to add this to your original DataFrame:
df.join(out)
Out:
Id a b c 0 1 2 3
0 ijk 4 3.5 12.53 1.0 0.934455 0.869533 0.805636
1 xyz 12 3.0 10.74 1.0 0.875050 0.760469 0.656303
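If you want to skip apply entirely, scipy's functions broadcast over NumPy arrays, so the whole table can be produced in one vectorized call. Here is a minimal sketch of that idea; the broadcasting layout is my own illustration, not part of the original answer:

import numpy as np
import pandas as pd
import scipy.stats as ss

df = pd.DataFrame({'Id': ['ijk', 'xyz'],
                   'a': [4, 12], 'b': [3.5, 3.0], 'c': [12.53, 10.74]})

offsets = np.arange(4)
# survival function at a + 0..3, broadcast over rows -> shape (len(df), 4)
sf = ss.norm.sf(df['a'].to_numpy()[:, None] + offsets,
                loc=df['b'].to_numpy()[:, None],
                scale=df['c'].to_numpy()[:, None])

# column 0 is sf(a) itself, so dividing by it gives the 1, 0.93..., 0.87..., ... ratios
out = pd.DataFrame(sf / sf[:, :1], index=df.index, columns=offsets)
result = df.join(out)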

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and re-sort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier when it lies more than a given number of standard deviations from the column mean) and then build a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
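Putting the z-score mask together with a per-column fill, a sketch without any loop over columns could look like this; the threshold value is only illustrative (on a tiny sample a single huge value inflates the std, so 5 would not trigger here) and should be tuned to your data:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191], 'B': [2, 6, 8, 115, 1]})

threshold = 1.5  # illustrative only; decide what "extreme" means for your data
zscores = (df - df.mean()) / df.std()
outlier_mask = zscores.abs() > threshold

# mask the outliers, then fill each with the max of the remaining values per column
cleaned = df.mask(outlier_mask)
cleaned = cleaned.fillna(cleaned.max())
print(cleaned)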

How to merge headers of multiple columns of a pandas dataframe in a new column if values of the multiple columns meet certain criteria

I have a data frame of size 122400×92, in which 8 columns represent flow, possibly in different combinations. For each row, I want to collect into a new column the headers of all columns whose flow is > 20.
For an example:
A Flow: 52
B Flow: 46
C Flow: 0
D Flow: 54
E Flow: 34
F Flow: 0
G Flow: 12
H Flow: 0
The new column will give: 'A,B,D,E'
I have used the below code, which seems to work for a small dataset but fails on a large one.
reqcol = ['A FLOW', 'B FLOW', 'C FLOW', 'D FLOW', 'E FLOW', 'F FLOW', 'G FLOW', 'H FLOW']
arr = []
for i in range(len(df)):
    arr1 = []
    for j in reqcol:
        if df[j][i] > 20:
            arr1.append(j[0])
    arr.append(arr1)
df['Combination'] = arr
Request your help
Let's define some test data:
import pandas as pd
import numpy as np
reqcol = ['A FLOW', 'B FLOW']
df = pd.DataFrame({'NAME': ['N1', 'N2', 'N3', 'N4'], 'A FLOW': [5, 80,50, 40], 'B FLOW' : [10, 0, 40, 10]})
This gives you the following data frame:
>>> df
NAME A FLOW B FLOW
0 N1 5 10
1 N2 80 0
2 N3 50 40
3 N4 40 10
First let's only look at the relevant columns:
>>> df[reqcol]
A FLOW B FLOW
0 5 10
1 80 0
2 50 40
3 40 10
Now check where the flow is bigger than 20 (only in the relevant columns):
>>> df[reqcol].where(lambda x: x > 20)
A FLOW B FLOW
0 NaN NaN
1 80.0 NaN
2 50.0 40.0
3 40.0 NaN
We can now apply a function onto every row. The function will receive a pandas.Series object. First we eliminate all NaNs from the Series by calling .dropna(), and then we get the remaining keys by using .keys() (the original column names, like 'A FLOW').
>>> dfBigger20 = df[reqcol].where(lambda x: x > 20)
>>> dfBigger20.apply(lambda row: row.dropna().keys(), axis = 1)
0 Index([], dtype='object')
1 Index(['A FLOW'], dtype='object')
2 Index(['A FLOW', 'B FLOW'], dtype='object')
3 Index(['A FLOW'], dtype='object')
dtype: object
To make this prettier, we can convert the Index object returned by .keys() to a regular list.
>>> dfBigger20.apply(lambda row: list(row.dropna().keys()), axis = 1)
0 []
1 [A FLOW]
2 [A FLOW, B FLOW]
3 [A FLOW]
dtype: object
Now all in one line and saved into the original df:
>>> df['Combinations'] = df[reqcol].where(lambda x: x>20).apply(lambda row: list(row.dropna().keys()), axis=1)
>>> df
NAME A FLOW B FLOW Combinations
0 N1 5 10 []
1 N2 80 0 [A FLOW]
2 N3 50 40 [A FLOW, B FLOW]
3 N4 40 10 [A FLOW]
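If you want the comma-separated string from the question ('A,B,D,E') rather than a list of full column names, one option (assuming the single-letter prefix convention from the question) is to join the first character of each surviving column name:

df['Combinations'] = (
    df[reqcol]
    .where(lambda x: x > 20)
    .apply(lambda row: ','.join(c[0] for c in row.dropna().keys()), axis=1)
)
# N1 -> '', N2 -> 'A', N3 -> 'A,B', N4 -> 'A'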

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost on how to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filtering for each ID and unstacking the rows into a new dataframe. I cannot get much out of the unstack or groupby functions. Is there an easy function or combination that can accomplish this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in 4 rows, but not always. The resulting dataframe should have only one row per ID, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value; GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power in handling the data. In addition, these open() calls have trouble with non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With unequal number of rows per line
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
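Applied to the semicolon-separated sample from the question, the same pattern could look roughly like the sketch below; the header=None assumption, the latin-1 encoding, and keeping every column are guesses you would adjust to your actual file:

import pandas as pd

# encoding is a guess for the non-ASCII characters mentioned in the question
df = pd.read_csv(path, sep=';', header=None, encoding='latin-1')

g = df.groupby(0).cumcount()                     # running counter within each ID group
wide = df.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b + 1}' for a, b in wide.columns]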

How to *multiply* (for lack of a better term) two dataframes [duplicate]

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames;
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is #senderle's first implementation.
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
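To see what cartesian_product produces before wiring it into pandas, here is a quick check on two small position ranges (my own illustration):

cartesian_product(np.arange(2), np.arange(3))
# array([[0, 0],
#        [0, 1],
#        [0, 2],
#        [1, 0],
#        [1, 1],
#        [1, 2]])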
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and stack the reindexed values side by side:
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 Y 30 A 1
4 A 1 Y 30 B 2
Further Simplification
A simpler solution not involving #senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices shows how they scale relative to one another; the timing script is given below.
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *
* you are here
Since pandas 1.2.0, merge has the how='cross' option:
left.merge(right, how='cross')
Using itertools.product and rebuilding the values in a DataFrame:
import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Here's an approach with triple concat
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
One option is with expand_grid from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'left':left, 'right':right}
jn.expand_grid(others = others)
left right
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:
left['dummy'] = 'a'
right['dummy'] = 'a'
cartesian = left.merge(right, how='inner', on='dummy')
del cartesian['dummy']

Partition pandas .diff() in multi-index level

My question relates to calling .diff() within each partition of a MultiIndex level.
In the following sample, the output of df.diff() is
values
Greek English
alpha a NaN
b 2
c 2
d 2
beta e 11
f 1
g 1
h 1
But I want it to be:
values
Greek English
alpha a NaN
b 2
c 2
d 2
beta e NaN
f 1
g 1
h 1
Here is a solution using a loop, but I am thinking I can avoid that loop!
import pandas as pd
import numpy as np
df = pd.DataFrame({'values': [1., 3., 5., 7., 18., 19., 20., 21.],
                   'Greek': ['alpha', 'alpha', 'alpha', 'alpha', 'beta', 'beta', 'beta', 'beta'],
                   'English': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
df.set_index(['Greek', 'English'], inplace=True)
print(df)

# (1.) This is not the type of .diff() I want.
# I need it to respect the level='Greek' and restart.
print(df.diff())

# This is one way to achieve my desired result, but I have to think
# there is a way that does not involve the need to loop.
idx = pd.IndexSlice
for greek_letter in df.index.get_level_values('Greek').unique():
    df.loc[idx[greek_letter, :]]['values'] = df.loc[idx[greek_letter, :]].diff()
print(df)
Just groupby level=0 (or 'Greek' if you prefer) and then you can call diff on 'values':
In [179]:
df.groupby(level=0)['values'].diff()
Out[179]:
Greek English
alpha a NaN
b 2
c 2
d 2
beta e NaN
f 1
g 1
h 1
dtype: float64
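To write the grouped differences back into the frame instead of only displaying them, you can assign the result to the column, for example:

df['values'] = df.groupby(level=0)['values'].diff()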