Multiple operations iterating over dataframe columns (apply function?) - pandas

I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
1. check whether the i-th and (i-1)-th values are both in the range (between x and y);
2. if #1 is satisfied, compute log(value[i] / value[i-1]) ** 2 for the column;
3. if #1 is not satisfied, assume 0;
4. find the total of #2 for each column.
Here is a dataframe with a single column:
d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
        df.IsIn[i] = 1
    else:
        df.IsIn[i] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.

Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
def func(s, x=10, y=20):
    '''
    compute the value given a series
    '''
    # mask where values are between x and y
    valid = s.between(x,y)
    # shift `valid` and double check
    valid = valid & valid.shift(fill_value=False)
    # squared log, mask with `valid`, and sum
    return (np.log(s/s.shift())**2 * valid).sum()
# apply `func` on the columns
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
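One detail worth checking: Series.between includes both endpoints by default, while the loop in the question uses strict inequalities (x < value < y). The following is a minimal sketch of a strict variant that also returns the totals as a (1 by n-cols) dataframe. The name func_strict and the second column col2 are made up for illustration.

```python
import numpy as np
import pandas as pd

def func_strict(s, x=10, y=20):
    # strict bounds, x < value < y, as in the question's loop
    valid = (s > x) & (s < y)
    # both the current and the previous value must be in range
    valid = valid & valid.shift(fill_value=False)
    # squared log return; rows that fail the check contribute 0
    log_ret_sq = np.log(s / s.shift()) ** 2
    return log_ret_sq.where(valid, 0.0).sum()

# col2 is a hypothetical second column, added only to show several columns
df = pd.DataFrame({"col1": [10, 15, 23, 16, 5, 14, 11, 4],
                   "col2": [12, 13, 14, 30, 11, 12, 19, 18]})

# apply column by column, then turn the resulting Series into a 1 x n frame
totals = df.apply(func_strict).to_frame().T
print(totals)
```

With strict bounds the value 10 no longer counts as in range, so for col1 only the pair (14, 11) contributes and the total is smaller than the 0.222561 obtained with the inclusive check.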

Related

How to do an advanced multiplication with panda dataframe

I have a dataframe1 of 1802 rows and 29 columns (in code as df) - each row is a person and each column is a number representing their answer to 29 different questions.
I have another dataframe2 of 29 different coefficients (in code as seg_1).
Each column needs to be multiplied by the corresponding coefficient and this needs to be repeated for each participant.
For example - 1802 iterations of q1 * coeff1, 1802 iterations of q2 * coeff2 etc
So I should end up with 1802 * 29 = 52,258
but the answer doesn't seem to be this length and also the answers aren't what I expect - I think the loop is multiplying q1-29 by coeff1, then repeating this for coeff2 but that's not what I need.
questions = range(0, 28)
co = range(0, 28)
segment_1 = []
for a in questions:
    for b in co:
        answer = df.iloc[:,a] * seg_1[b]
        segment_1.append([answer])
Proper encoding of the coefficients as a Pandas frame makes this a one-liner
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
and circumvents slow for-loops. In addition, you don't need to hard-code the number of rows in the table (1802), and you can use the code without changes even if your data grows larger.
For a minimum viable example, see:
# answer frame
df_person = pd.DataFrame({'Question_1': [10, 20, 15], 'Question_2' : [4, 4, 2], 'Question_3' : [2, -2, 1]})
# coefficient frame
seg_1 = [2, 4, -1]
N = len(df_person)
df_coeffs = pd.DataFrame({'C_1': [seg_1[0]] * N, 'C_2' : [seg_1[1]] * N, 'C_3' : [seg_1[2]] * N})
# elementwise multiplication & row-wise summation
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
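giving, for the coefficient table df_coeffs,
   C_1  C_2  C_3
0    2    4   -1
1    2    4   -1
2    2    4   -1
and, for the answer table df_person,
   Question_1  Question_2  Question_3  Answer
0          10           4           2      34
1          20           4          -2      58
2          15           4           1      37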
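If the goal is the full table of per-question products (one value per person and question, as the question describes), a repeated coefficient frame may not be needed at all. Here is a small sketch that relies on pandas broadcasting a plain list of coefficients across the columns. The names answers, weighted and row_totals are illustrative only.

```python
import pandas as pd

# same toy data as above: 3 people, 3 questions
answers = pd.DataFrame({"Question_1": [10, 20, 15],
                        "Question_2": [4, 4, 2],
                        "Question_3": [2, -2, 1]})
seg_1 = [2, 4, -1]  # one coefficient per question column

# multiply every column by its own coefficient (shape stays rows x questions)
weighted = answers.mul(seg_1, axis="columns")

# optional: sum across questions to get one score per person
row_totals = weighted.sum(axis=1)

print(weighted)
print(row_totals)
```

The weighted frame keeps the same shape as the input, so with the real data it would hold 1802 * 29 values.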

In Pandas how to replace dataframe column values with new values that are a function of old values

I need your guidance on an issue I am facing. I want to do something like this:
df[x], where x is a variable that holds one of the column names.
I want the entire column to go through this manipulation:
each value of column x should be processed with the equation below:
(1 / (x * 22)); note that x here is an individual value of the column and it can be a huge number.
Since the huge number is reciprocated (1 over x), the result may be shown in exponential notation (e+/-01).
The resulting number should replace the original number in the dataframe.
If the number is 100, the new number to be put into the dataframe is (1/((100)*22)) = 4.55e-4
Please let me know how to do it rounded to 2 decimal points.
Thanks.
df[x] = df[x].apply(lambda x: (1/(x * 22)))
This is resulting in 0 in the dataframe
df[x] = df[x].apply(lambda x: (1/(x * 22)))
looks okay, but will probably round to whole numbers. This may depend on what datatype your column x has and/or what x is.
This may work:
df[x] = df[x].apply(lambda x: (1.0/(x * 22.0)))
If your lambda function gets more complicated, for example if you want to use if-else clauses, you should write a helper function and call that inside of your apply:
def helper(x):
    if x == 100:
        return (1.0/((100)*22.0))
    else:
        return (1.0/((x)*22.0))
df[x] = df[x].apply(lambda x: helper(x))
Use round() to round the result to two decimals:
df[x] = df[x].apply(lambda x: round(1.0/(x * 22.0), 2))
Your formula is fine, but the numbers are too small to be shown with two decimals (rounding 0.000455 to 2 decimals gives 0.00).
xd = pd.DataFrame({'x':[100,101,102],'y':[1,2,3]})
xd['x'] = xd['x'].apply(lambda x: (1/(x * 22)))
>>> xd
x y
0 0.000455 1
1 0.000450 2
2 0.000446 3
Try this to format the numbers to exponential format with 2 decimals.
>>> pd.set_option('display.float_format', lambda x: '%.2E' % x)
>>> xd
x y
0 4.55E-04 1
1 4.50E-04 2
2 4.46E-04 3
Source for formatting: https://stackoverflow.com/a/6913576/14593183
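As a side note, the same transformation can probably be written without apply, since arithmetic on a pandas column is already element-wise. This is a minimal sketch on the toy frame from above. The names rounded and as_text are illustrative, and the formatting step is just one way to keep two significant decimals without changing the global display option.

```python
import pandas as pd

xd = pd.DataFrame({"x": [100, 101, 102], "y": [1, 2, 3]})

# vectorized: no apply/lambda needed, float literals force float division
xd["x"] = 1.0 / (xd["x"] * 22.0)

# rounding to 2 decimals collapses such small values to 0.0
rounded = xd["x"].round(2)

# formatting as text keeps 2 decimals in exponential notation instead
as_text = xd["x"].map("{:.2e}".format)

print(rounded.tolist())
print(as_text.tolist())
```

Note that as_text holds strings, so it is suited to display only. Keep the float column for further calculations.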

how to find missing number between minimum and maximum

I want to make a NumPy array with the following properties:
Random numbers: 0~9 (0 <= value <= 9); random 1D size: 5~9 (5 <= size <= 9)
I want to find the missing numbers between the min and max, so I wrote code like this:
import numpy as np
min_val = 0
max_val = 10
min_val_len = 5
max_val_len = 10
arr1 = [4,3,2,7,8,2,3]
a = list(arr1)
print(a)
diff = np.setdiff1d(range(min_val, max_val), arr1)
arr = np.arange(min_val_len, max_val_len)
if diff in arr:
    print(diff)
else:
    print("no missing")
For my purpose, the output should be [5,6].
And if the input is [1, 2, 3, 4, 5], the result should be 'no missing'.
But the code doesn't work as I expect.
I think you expect `in` to work in a way it does not. You want to check every single element, so try:
b = [d in arr for d in diff]
Now b contains a boolean value for each value d of diff. If you want to find the actual numbers that are missing, you can do it using a condition:
b = np.intersect1d(np.setdiff1d(range(min_val, max_val), arr1), arr)
Now b contains all numbers of diff that are in arr. Also note that Python has built-in set types, so you do not actually need to use NumPy.
But you can do it in an even simpler way, as you're already using the notion of sets:
print(np.setdiff1d(rang
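
The remark about Python's built-in set types could be sketched roughly as follows. This assumes "between min and max" means between the smallest and largest value actually present in the input, which matches the expected outputs given in the question. The function name missing_between is made up for illustration.

```python
def missing_between(values):
    """Return the integers between min(values) and max(values) that are absent."""
    present = set(values)
    # every integer from the smallest to the largest value, inclusive
    full = set(range(min(present), max(present) + 1))
    # set difference gives the gaps; sort for a stable, readable result
    return sorted(full - present)

for sample in ([4, 3, 2, 7, 8, 2, 3], [1, 2, 3, 4, 5]):
    gaps = missing_between(sample)
    # an empty list is falsy, so fall back to the message from the question
    print(gaps if gaps else "no missing")
```

For the two inputs from the question this prints [5, 6] and then no missing.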

In pandas, how to apply a function to every column that returns two columns

I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function my_function_array() that takes an array (all n rows within one group) as input and returns an n x 2 array as output.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data. There are other groups (types) but I only show one
type   frame  x           y
F1675  1      77.369027   108.013249
       2      107.784096  22.177883
       3      22.385162   65.024619
       4      65.152003   77.74970
def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D-1)*T # length of embedded signal
    embed_data = np.zeros([N,D])
    for di in range(-D//2,D//2):
        embed_data[:,di] = data_vec[ np.arange((D//2+di)*T, N+(D//2+di)*T) ]
    return embed_data
Applying the function to the second column y
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I have
array([[ 65.024619, 108.013249],
       [ 77.7497  ,  22.177883]])
So, the expected output is
type   frame  x_1        x_2         y_1        y_2
F1675  1      22.385162  77.369027   65.024619  108.013249
       2      65.152003  107.784096  77.7497    22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.
I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print (data_vec.name)
    N = len(data_vec) - (D-1)*T # length of embedded signal
    embed_data = np.zeros([N,D])
    for di in range(-D//2,D//2):
        embed_data[:,di] = data_vec[ np.arange((D//2+di)*T, N+(D//2+di)*T) ]
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)
f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print (data_transf)
                x0          x1         y0          y1
type
F1675 0  22.385162   77.369027  65.024619  108.013249
      1  65.152003  107.784096  77.749700   22.177883

pandas dataframe multiplication with missing values

I have a dataframe with 2 columns (floating types), but one of them has missing data represented by the string "..".
When performing a multiplication, an exception is raised and the whole operation is aborted.
What I am trying to achieve is to perform the multiplication for the float values and leave ".." for the missing ones.
2 * 6
.. * 4
should give [12, ..]
I found a naive solution that consists of replacing .. by 0, performing the multiplication, and then replacing the 0 with .. again.
It doesn't seem very efficient. Is there any other solution?
df['x'] = pd.to_numeric(df['x'], errors='coerce').fillna(0)
mg['x'] = df['x'] * df["Value"]
for col in mg.columns:
    mg[col] = mg[col].apply(update)
def update(v):
    if (v == 0):
        return ".."
    return v
You can use np.where and Series.isna:
import numpy as np
mg['x'] = np.where(df['X'].isna(), df['X'], df['X']*df['Value'])
If you want to replace the null with '..' and multiply others:
mg['x'] = np.where(df['X'].isna(), '..', df['X']*df['Value'])
So wherever the value in column X is null, it remains the same; otherwise it is multiplied by the value in the corresponding row of the Value column.
In your solution you can also do a fillna(1):
df['x'] = pd.to_numeric(df['x'], errors='coerce').fillna(1)
mg['x'] = df['x'] * df["Value"]
This is how I tried:
df = pd.DataFrame({'X': [ 2, np.nan],
'Value': [6, 4]})
df
X Value
0 2.0 6
1 NaN 4
np.where(df['X'].isna(), df['X'], df['X']*df['Value'])
array([12., nan])
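Putting the pieces together for the original ".." input, one possible end-to-end sketch is shown below. It assumes the column names x and Value from the question, keeps a numeric result for further calculations, and builds a separate display column (the names product and display are made up here) in which missing results are shown as "..".

```python
import pandas as pd

# toy input: 'x' arrives as text, with ".." marking missing data
df = pd.DataFrame({"x": ["2", ".."], "Value": [6, 4]})

# ".." cannot be parsed, so errors="coerce" turns it into NaN
x_num = pd.to_numeric(df["x"], errors="coerce")

# NaN propagates through the multiplication, no exception is raised
product = x_num * df["Value"]

# for display only: put ".." back where the result is missing
display = product.astype(object).where(product.notna(), "..")

print(product.tolist())
print(display.tolist())
```

Because NaN propagates on its own, there is no need to substitute 0 or 1 first, and genuine zero results are not mistaken for missing values.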