I have a pandas dataframe with two columns, say x and y. For each row, x is the mean of a random variable following a Poisson distribution. I want to add a third column, z, such that z is the probability that a random draw will be less than or equal to y.
For a given row, say x = 15, I want to know the probability that a random draw will be less than or equal to y = 10. I know I can use:
from scipy.stats import poisson
x = 15
y = 10
z = poisson.cdf(y, x)
z
which returns 0.118
How do I do this for each row in a pandas dataframe, creating a third column?
You can use the apply method:
import pandas as pd
from scipy.stats import poisson

df = pd.DataFrame({"x": [15, 15, 15], "y": [10, 15, 20]})
df["z"] = df.apply(lambda r: poisson.cdf(r.y, r.x), axis=1)
print(df)
Result:
    x   y         z
0  15  10  0.118464
1  15  15  0.568090
2  15  20  0.917029
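Since scipy.stats.poisson.cdf is vectorized, you can also skip apply and pass the columns directly, which is much faster on large frames:

```python
import pandas as pd
from scipy.stats import poisson

df = pd.DataFrame({"x": [15, 15, 15], "y": [10, 15, 20]})

# poisson.cdf broadcasts over array-like inputs,
# so no row-wise apply is needed
df["z"] = poisson.cdf(df["y"], df["x"])
print(df)
```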
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
groupby.filter can be very slow for larger datasets or a large number of groups. A faster approach is to use groupby.transform:
Here's an example, first create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
             y    A
0     1.375980    1
1    -0.023861    1
2    -0.474707    1
3    -0.151859    2
4    -1.696823    2
...        ...  ...
2424  0.276737  998
2425 -0.142171  999
2426 -0.718891  999
2427 -0.621315  999
2428  1.335450  999

[2429 rows x 2 columns]
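The transform-based filter itself can be sketched like this (min_size is a hypothetical threshold; transform('size') broadcasts each group's row count back onto every row, so a plain boolean mask replaces the slower groupby.filter call):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'c', 'c'],
                   'y': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
min_size = 2  # hypothetical threshold

# for every row, transform('size') returns the size of that row's group
mask = df.groupby('A')['y'].transform('size') >= min_size
filtered = df[mask]
print(filtered)
```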
I have two DF (df1 and df2):
Sample Date Value_df1
1992-11-04 1
1992-11-12 2
1992-11-18 3
... ...
1992-12-02 4
1992-12-09 5
1992-12-21 6
1992-12-28 7
1993-01-07 8
Sample Date Value_df2
1992-11-04 9
1992-11-12 10
1992-11-18 11
... ...
1992-12-02 12
1992-12-09 13
1992-12-21 14
1992-12-28 15
1993-01-07 16
by establishing:
y = df1['Value_df1']
x = df2['Value_df2']
I want to fit the values of df1 and df2 to the equation log10(y) = log10(a) + b * log10(x) and then get the constants a and b.
This is what I've done:
import numpy as np
# take log10 of both series
log_y = np.log10(y)
log_x = np.log10(x)
# fit a line: the slope is b, the intercept is log10(a)
curve = np.polyfit(log_x, log_y, 1)
b = curve[0]
a = curve[1]
# undo the log10 to recover the real a value
a = 10 ** a
This way one should get the a and b values.
Is this a good approach? Is there a better way, or am I doing something wrong?
Thanks for your feedback.
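One way to sanity-check this approach is to generate noise-free synthetic data from known constants and confirm the fit recovers them (the values a = 2.5 and b = 1.7 below are arbitrary, chosen just for the check):

```python
import numpy as np

# synthetic data following y = a * x**b exactly
rng = np.random.default_rng(0)
a_true, b_true = 2.5, 1.7
x = rng.uniform(1, 100, size=200)
y = a_true * x ** b_true

# fit log10(y) = log10(a) + b * log10(x); slope is b, intercept is log10(a)
b_fit, log_a_fit = np.polyfit(np.log10(x), np.log10(y), 1)
a_fit = 10 ** log_a_fit
print(a_fit, b_fit)
```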
I do not really understand why the following code makes pandas return a Series rather than a DataFrame.
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 1) # Applied to each row
df_row
While if I change to axis=0 it produces a DataFrame as expected:
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i] + 2)
    return y
df_row = df.apply(plus_2, axis = 0) # Applied to each column
df_row
Here is the output:
In the first example, with axis=1, the function is applied at the row level.
For each row, plus_2 returns y, a list of two elements; since the list as a whole is a single object, the overall result is a pd.Series of lists.
With your example, three lists of two elements each are returned: one list per row.
You could expand this result into two columns (each list element becomes a new column) by adding result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach, axis=0 applies the function at the column level.
For each column, plus_2 returns y, so plus_2 is applied twice: once for column A and once for column B. Because each returned list has the same length as the column, the results are reassembled into a DataFrame with columns A and B.
With your example, two lists of three elements each are returned: one list per column.
So the main difference between axis=1 and axis=0 is that:
if applied at the row level, apply returns:
[6, 11]
[6, 11]
[6, 11]
if applied at the column level, apply returns:
[6, 6, 6]
[11, 11, 11]
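The return-type difference described above can be checked directly:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    return [v + 2 for v in x]

# axis=1: each row yields one 2-element list -> a Series of lists
row_result = df.apply(plus_2, axis=1)

# axis=0: each column yields a list the same length as the column
# -> reassembled into a DataFrame
col_result = df.apply(plus_2, axis=0)

print(type(row_result))
print(type(col_result))
```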
I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
check if the i-th and (i-1)-th values are in the range (between x and y);
if #1 is satisfied, compute log(i-th value / (i-1)-th value) ** 2 for the column;
if #1 is not satisfied, assume 0;
find the total of #2 for each column.
Here is a dataframe with a single column:
import numpy as np
import pandas as pd

d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = 0  # numeric default, so the multiplication below works
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) and (x < df.col1[i - 1] < y):
        df.loc[i, "IsIn"] = 1  # .loc avoids chained-assignment warnings
    else:
        df.loc[i, "IsIn"] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
def func(s, x=10, y=20):
    '''
    compute the value given a series
    '''
    # mask where values are between x and y (inclusive)
    valid = s.between(x, y)
    # require the previous value to be in range as well
    valid = valid & valid.shift(fill_value=False)
    # squared log return, masked with `valid`, then summed
    return (np.log(s / s.shift()) ** 2 * valid).sum()
# apply `func` on the columns
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
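A fully vectorized alternative avoids apply altogether by computing the mask and the squared log returns for all columns at once (this sketch keeps between's inclusive bounds, like the function above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [10, 15, 23, 16, 5, 14, 11, 4]})
x, y = 10, 20

# True where the value is in [x, y]
in_range = (df >= x) & (df <= y)
# require the previous value to be in range as well
valid = in_range & in_range.shift(fill_value=False)

# squared log returns, masked and summed per column
totals = (np.log(df / df.shift()) ** 2 * valid).sum()
print(totals)
```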
I have two columns in a df; each column has multiple comma-separated values in one row. I want to split each value into a new row in another table and generate a sequence number. The given data is
x                   y
76.25,345.65        87.12,96.45
78.12,35.1,98.27    85.23,65.2,56.63
new df should be like this
x 76.25
y 87.12
sequence number 1
x 345.65
y 96.45
sequence number 2
x 78.12
y 85.23
sequence number 1
x 35.1
y 65.2
sequence number 2
x 98.27
y 56.63
sequence number 3
All values are strings. I have no idea how I should do it. Should I write a function, or is there a built-in DataFrame method? Any help is appreciated.
You can do it using iterrows() + concat():
df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

def get_parts():
    for _, row in df.iterrows():
        x = row['x'].split(',')
        y = row['y'].split(',')
        for i, _ in enumerate(x):
            # len(x) must equal len(y)...
            yield 'x', x[i]
            yield 'y', y[i]
            # generate the sequence number after each split item
            yield 'sequence number', i + 1

# build a Series from each part and concatenate into a new dataframe
new_df = pd.concat([
    pd.Series([p[1]], [p[0]])
    for p in get_parts()
])
Hope this helps.
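On pandas 1.3+, a shorter route is str.split plus a multi-column explode, with groupby.cumcount generating the sequence numbers. Note this produces a regular three-column frame rather than the stacked Series above:

```python
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63'),
})

# split each cell into a list, then explode both columns in parallel
# (the per-row lists in x and y must have equal lengths)
out = df.apply(lambda col: col.str.split(',')).explode(['x', 'y'])

# number the exploded rows within each original row, starting at 1
out['sequence number'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)
```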