I have a pandas dataframe with two columns, say x and y. For each row, x is the mean of a random variable following a Poisson distribution. I want to add a third column, z, such that z is the probability that a random draw will be less than or equal to y.
For a given row, say x = 15, I want to know the probability that a random draw will be less than or equal to y = 10. I know I can use:
from scipy.stats import poisson
x = 15
y = 10
z = poisson.cdf(y, x)
z
which returns 0.118
How do I do this for each row in a pandas dataframe, creating a third column?
You can use the apply method:
import pandas as pd
from scipy.stats import poisson

df = pd.DataFrame({"x": [15, 15, 15], "y": [10, 15, 20]})
df["z"] = df.apply(lambda r: poisson.cdf(r.y, r.x), axis=1)
print(df)
Result:
    x   y         z
0  15  10  0.118464
1  15  15  0.568090
2  15  20  0.917029
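Since scipy.stats.poisson.cdf is vectorized, you can also skip apply and pass the columns directly, which is much faster on large frames:

```python
import pandas as pd
from scipy.stats import poisson

df = pd.DataFrame({"x": [15, 15, 15], "y": [10, 15, 20]})

# poisson.cdf broadcasts over array-like inputs,
# so no row-wise apply is needed
df["z"] = poisson.cdf(df["y"], df["x"])
print(df)
```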
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
groupby.filter can be very slow for larger datasets or a large number of groups. A faster approach is to use groupby.transform:
Here's an example, first create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
             y    A
0     1.375980    1
1    -0.023861    1
2    -0.474707    1
3    -0.151859    2
4    -1.696823    2
...        ...  ...
2424  0.276737  998
2425 -0.142171  999
2426 -0.718891  999
2427 -0.621315  999
2428  1.335450  999

[2429 rows x 2 columns]
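The transform-based filter itself can be sketched like this (min_size is a hypothetical threshold; transform('size') broadcasts each group's row count back onto every row, so a plain boolean mask replaces the slower groupby.filter call):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'c', 'c'],
                   'y': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
min_size = 2  # hypothetical threshold

# for every row, transform('size') returns the size of that row's group
mask = df.groupby('A')['y'].transform('size') >= min_size
filtered = df[mask]
print(filtered)
```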
I have two DF (df1 and df2):
Sample Date Value_df1
1992-11-04 1
1992-11-12 2
1992-11-18 3
... ...
1992-12-02 4
1992-12-09 5
1992-12-21 6
1992-12-28 7
1993-01-07 8
Sample Date Value_df2
1992-11-04 9
1992-11-12 10
1992-11-18 11
... ...
1992-12-02 12
1992-12-09 13
1992-12-21 14
1992-12-28 15
1993-01-07 16
by establishing:
y = df1['Value_df1']
x = df2['Value_df2']
I want to fit the values of df1 and df2 to the equation log10(y) = log10(a) + b * log10(x) and then get the constants a and b.
This is what I've done:
import numpy as np
# take log10 of both series
log_y = np.log10(y)
log_x = np.log10(x)
# fit a line: the slope is b, the intercept is log10(a)
curve = np.polyfit(log_x, log_y, 1)
b = curve[0]
a = curve[1]
# undo the log10 to recover the real a value
a = 10 ** a
This way one should get the a and b values.
Is this a good approach? Is there a better way, or am I doing something wrong?
Thanks for your feedback.
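One way to sanity-check this approach is to generate noise-free synthetic data from known constants and confirm the fit recovers them (the values a = 2.5 and b = 1.7 below are arbitrary, chosen just for the check):

```python
import numpy as np

# synthetic data following y = a * x**b exactly
rng = np.random.default_rng(0)
a_true, b_true = 2.5, 1.7
x = rng.uniform(1, 100, size=200)
y = a_true * x ** b_true

# fit log10(y) = log10(a) + b * log10(x); slope is b, intercept is log10(a)
b_fit, log_a_fit = np.polyfit(np.log10(x), np.log10(y), 1)
a_fit = 10 ** log_a_fit
print(a_fit, b_fit)
```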
I do not really understand why the following code makes pandas return a Series rather than a DataFrame.
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 1) # Applied to each row
df_row
While if I change to axis=0 it produces a DataFrame as expected:
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 0) # Applied to each column
df_row
Here is the output:
In the first example, with axis=1, the function is applied at the row level.
For each row, plus_2 returns y, a list of two elements; since the list as a whole is a single object, the overall result is a pd.Series of lists.
With your example, three lists of two elements each are returned: one list per row.
You could expand this result into two columns (each list element becomes a new column) by adding result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach, axis=0 applies the function at the column level.
For each column, plus_2 returns y, so plus_2 is applied twice: once for column A and once for column B. Because each returned list has the same length as the column, the results are reassembled into a DataFrame with columns A and B.
With your example, two lists of three elements each are returned: one list per column.
So the main difference between axis=1 and axis=0 is that:
if applied at the row level, apply returns:
[6, 11]
[6, 11]
[6, 11]
if applied at the column level, apply returns:
[6, 6, 6]
[11, 11, 11]
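The return-type difference described above can be checked directly:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    return [v + 2 for v in x]

# axis=1: each row yields one 2-element list -> a Series of lists
row_result = df.apply(plus_2, axis=1)

# axis=0: each column yields a list the same length as the column
# -> reassembled into a DataFrame
col_result = df.apply(plus_2, axis=0)

print(type(row_result))
print(type(col_result))
```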
I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
check if the i-th and (i-1)-th values are in the range (between x and y);
if #1 is satisfied, compute log(i-th value / (i-1)-th value) ** 2 for the column;
if #1 is not satisfied, assume 0;
find the total of #2 for each column.
Here is a dataframe with a single column:
import numpy as np
import pandas as pd

d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = 0  # numeric default, so the multiplication below works
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) and (x < df.col1[i - 1] < y):
        df.loc[i, "IsIn"] = 1  # .loc avoids chained-assignment warnings
    else:
        df.loc[i, "IsIn"] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
def func(s, x=10, y=20):
    '''
    compute the value given a series
    '''
    # mask where values are between x and y (inclusive)
    valid = s.between(x, y)
    # require the previous value to be in range as well
    valid = valid & valid.shift(fill_value=False)
    # squared log return, masked with `valid`, then summed
    return (np.log(s / s.shift()) ** 2 * valid).sum()
# apply `func` on the columns
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
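A fully vectorized alternative avoids apply altogether by computing the mask and the squared log returns for all columns at once (this sketch keeps between's inclusive bounds, like the function above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [10, 15, 23, 16, 5, 14, 11, 4]})
x, y = 10, 20

# True where the value is in [x, y]
in_range = (df >= x) & (df <= y)
# require the previous value to be in range as well
valid = in_range & in_range.shift(fill_value=False)

# squared log returns, masked and summed per column
totals = (np.log(df / df.shift()) ** 2 * valid).sum()
print(totals)
```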
I have two columns in a df; each column has multiple comma-separated values in one row. I want to split each value into a new row in another table and generate a sequence number. The given data is
x                   y
76.25,345.65        87.12,96.45
78.12,35.1,98.27    85.23,65.2,56.63
new df should be like this
x 76.25
y 87.12
sequence number 1
x 345.65
y 96.45
sequence number 2
x 78.12
y 85.23
sequence number 1
x 35.1
y 65.2
sequence number 2
x 98.27
y 56.63
sequence number 3
All values are strings. I have no idea how I should do it. Should I write a function, or is there a built-in DataFrame method? Any help is appreciated.
You can do it using iterrows() + concat():
df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

def get_parts():
    for _, row in df.iterrows():
        x = row['x'].split(',')
        y = row['y'].split(',')
        for i, _ in enumerate(x):
            # len(x) must equal len(y)...
            yield 'x', x[i]
            yield 'y', y[i]
            # generate the sequence number after each split item
            yield 'sequence number', i + 1

# build a Series from each part and concatenate into a new dataframe
new_df = pd.concat([
    pd.Series([p[1]], [p[0]])
    for p in get_parts()
])
Hope this helps.
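On pandas 1.3+, a shorter route is str.split plus a multi-column explode, with groupby.cumcount generating the sequence numbers. Note this produces a regular three-column frame rather than the stacked Series above:

```python
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63'),
})

# split each cell into a list, then explode both columns in parallel
# (the per-row lists in x and y must have equal lengths)
out = df.apply(lambda col: col.str.split(',')).explode(['x', 'y'])

# number the exploded rows within each original row, starting at 1
out['sequence number'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)
```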