In pandas, how to apply a function to every column that returns two columns - pandas

I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function, my_function_array(), that takes an array (all n rows within one group) as input and returns an n x 2 array as output.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data. There are other groups (types) but I only show one
type   frame           x           y
F1675  1       77.369027  108.013249
       2      107.784096   22.177883
       3       22.385162   65.024619
       4       65.152003   77.74970
import numpy as np

def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    return embed_data
Applying the function to the second column, y:
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I get
array([[ 65.024619, 108.013249],
[ 77.7497 , 22.177883]])
So, the expected output is
type   frame        x_1         x_2        y_1         y_2
F1675  1      22.385162   77.369027  65.024619  108.013249
       2      65.152003  107.784096  77.7497     22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.

I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print (data_vec.name)
    N = len(data_vec) - (D-1)*T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D//2, D//2):
        embed_data[:, di] = data_vec[np.arange((D//2+di)*T, N+(D//2+di)*T)]
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)

f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print (data_transf)
                x0          x1         y0          y1
type
F1675 0  22.385162   77.369027  65.024619  108.013249
      1  65.152003  107.784096  77.749700   22.177883
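For reference, here is a self-contained sketch of the same idea that can be run as-is. The helper names embed_column and embed_group are made up for this illustration, the second group (F1676) is invented sample data, and the column is converted with to_numpy() so the slicing is positional rather than label-based; that conversion is an assumption about what the question intends, not something stated in it.

```python
import numpy as np
import pandas as pd

def embed_column(series, D=2, T=2):
    """Delay-embed one column into an N x D frame, N = len(series) - (D-1)*T."""
    values = series.to_numpy()          # positional access, ignores index labels
    N = len(values) - (D - 1) * T
    out = np.zeros((N, D))
    for di in range(-D // 2, D // 2):
        start = (D // 2 + di) * T
        out[:, di] = values[start:start + N]
    # Prefix the integer column labels with the source column name: x0, x1, ...
    return pd.DataFrame(out).add_prefix(series.name)

def embed_group(group):
    """Embed every column of one group and place the results side by side."""
    return pd.concat([embed_column(group[col]) for col in group.columns], axis=1)

# Sample data: F1675 comes from the question, F1676 is hypothetical.
index = pd.MultiIndex.from_product(
    [["F1675", "F1676"], [1, 2, 3, 4]], names=["type", "frame"]
)
data = pd.DataFrame(
    {
        "x": [77.369027, 107.784096, 22.385162, 65.152003, 1.0, 2.0, 3.0, 4.0],
        "y": [108.013249, 22.177883, 65.024619, 77.74970, 5.0, 6.0, 7.0, 8.0],
    },
    index=index,
)

data_transf = data.groupby(level="type").apply(embed_group)
print(data_transf)
```

Each group of four rows comes back as two rows and four columns, so the result is shorter and wider, as the question asks.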

Related

In Pandas how to replace dataframe column values with new values that are a function of old values

I need guidance on an issue I am facing. I want to do something like this:
df[x], where x is a variable that holds one of the column names.
I want the entire column to go through the following manipulation.
Each value of column x should be processed with this equation:
(1 / (x * 22)), where x is an individual value of the column and can be a huge number.
Since a huge number is being reciprocated (1 over x), the result may be displayed in exponential notation (e+/-01).
The resulting number should replace the original number in the dataframe.
If the number is 100, the new number to put into the dataframe is (1/(100*22)), approximately 4.55e-4.
Please let me know how to do this, rounded to 2 decimal places.
Thanks.
df[x] = df[x].apply(lambda x: (1/(x * 22)))
This results in 0 in the dataframe.
df[x] = df[x].apply(lambda x: (1/(x * 22)))
looks okay, but may round down to whole numbers (for example, under Python 2, where dividing two integers is integer division). This depends on the datatype of your column and on what x is.
This may work:
df[x] = df[x].apply(lambda x: (1.0/(x * 22.0)))
If your lambda function gets more complicated, for example if you want to use if-else clauses, you should write a helper function and call that inside of your apply:
def helper(x):
    if x == 100:
        return (1.0/((100)*22.0))
    else:
        return (1.0/((x)*22.0))

df[x] = df[x].apply(lambda x: helper(x))
Use round() to round the result to two decimals:
df[x] = df[x].apply(lambda x: round(1.0/(x * 22.0), 2))
Your formula is good, but the numbers are too small to be shown after rounding (rounding 0.000455 to 2 decimals gives 0.00).
xd = pd.DataFrame({'x':[100,101,102],'y':[1,2,3]})
xd['x'] = xd['x'].apply(lambda x: (1/(x * 22)))
>>> xd
          x  y
0  0.000455  1
1  0.000450  2
2  0.000446  3
Try this to display the numbers in exponential format with 2 decimals:
>>> pd.set_option('display.float_format', lambda x: '%.2E' % x)
>>> xd
         x  y
0 4.55E-04  1
1 4.50E-04  2
2 4.46E-04  3
Source for formatting: https://stackoverflow.com/a/6913576/14593183
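As a small sketch of the points made in the answers above, the same calculation can be written without apply by operating on the whole column at once. The variable names rounded and sci are only for illustration. It shows that rounding to 2 decimal places really does produce 0.0 for values this small, and that formatting as scientific notation keeps the significant digits (at the cost of turning the values into strings).

```python
import pandas as pd

xd = pd.DataFrame({'x': [100, 101, 102], 'y': [1, 2, 3]})

# Vectorised arithmetic: the float literals force true division.
xd['x'] = 1.0 / (xd['x'] * 22.0)

# Rounding to 2 decimal places wipes out values of order 1e-4.
rounded = xd['x'].round(2)

# Keep 2 decimals of the mantissa instead by formatting in scientific notation.
# Note: this produces strings, so it is for display, not further arithmetic.
sci = xd['x'].map('{:.2e}'.format)

print(rounded.tolist())
print(sci.tolist())
```

If the values are needed for later calculations, it is probably better to leave the column as floats and only change how it is displayed.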

Multiple operations iterating over dataframe columns (apply function?)

I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
1. check whether the i-th and (i-1)-th values are in the range (between x and y);
2. if #1 is satisfied, find log(value[i] / value[i-1]) ** 2 for the column;
3. if #1 is not satisfied, assume 0;
4. find the total of #2 for each column.
Here is a dataframe with a single column:
d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
        df.IsIn[i] = 1
    else:
        df.IsIn[i] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
def func(s, x=10, y=20):
    '''
    compute the value given a series
    '''
    # mask where values are between x and y
    valid = s.between(x,y)
    # shift `valid` and double check
    valid = valid & valid.shift(fill_value=False)
    # squared log, mask with `valid`, and sum
    return (np.log(s/s.shift())**2 * valid).sum()

# apply `func` on the columns
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
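One detail worth checking: Series.between includes both endpoints by default, whereas the loop in the question uses strict comparisons (x < value < y). The sketch below makes that choice explicit and returns the totals as a one-row dataframe. The function name total_sq_log_ret, the strict flag and the second column col2 are made up for this illustration.

```python
import numpy as np
import pandas as pd

def total_sq_log_ret(s, x=10, y=20, strict=False):
    """Sum of squared log returns where both the current and the previous
    value lie in the range [x, y] (or (x, y) when strict=True)."""
    inside = (s > x) & (s < y) if strict else s.between(x, y)
    # Both the current and the previous value must be inside the range.
    valid = inside & inside.shift(fill_value=False)
    sq_log_ret = np.log(s / s.shift()) ** 2
    # Replace everything outside the mask (including the leading NaN) with 0.
    return sq_log_ret.where(valid, 0.0).sum()

df = pd.DataFrame({
    'col1': [10, 15, 23, 16, 5, 14, 11, 4],   # from the question
    'col2': [12, 13, 14, 30, 11, 12, 19, 18], # hypothetical second column
})

totals_inclusive = df.apply(total_sq_log_ret).to_frame().T
totals_strict = df.apply(total_sq_log_ret, strict=True).to_frame().T
print(totals_inclusive)
print(totals_strict)
```

For col1 the two variants differ because the first value, 10, sits exactly on the lower bound: it counts as inside with the inclusive test but not with the strict one.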

Python exponential curve fitting in pandas: Define function parameters per row

My dataframe is [11 x 300]: the column headers are the 'x' values ([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25]), and each row holds the corresponding 'y' values. Each row can be described by an exponential function in the following format: a * x ^k + b.
The goal is to add three additional columns, describing a, k and b for that specific row, just like in: Python curve fitting on pandas dataframe then add coef to new columns
Instead of a polynomial function, my data needs to be described in the following format: a * x **k + b.
As I cannot find any solution to derive the coefficients by using np.polyfit, I split my dataframe into different lists.
x = np.array([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25])
y1 = np.array([288.79,238.32,199.42,181.22,165.50,154.74,152.25,152.26,144.81,144.81,144.81])
y2 = np.array([309.92,255.75,214.02,194.48,177.61,166.06,163.40,163.40,155.41,155.41,155.41])
...
y300 = np.array([352.18,290.63,243.20,221.00,201.83,188.71,185.68,185.68,176.60,176.60,176.60])
def func(x,a,k,b):
    return a * (x**k) + b
popt1, pcov = curve_fit(func,x,y1, p0 = (300,-0.5,0))
...
popt300, pcov = curve_fit(func,x,y300, p0 = (300,-0.5,0))
output:
popt1
[107.73727907 -1.545475 123.48621504]
...
popt300
[131.38411712 -1.5454452 150.59522147]
This works when I split all dataframe rows into lists and define popt for every list/row.
Rather than splitting the dataframe into 300 separate arrays, I would prefer to apply the same methodology as in Python curve fitting on pandas dataframe then add coef to new columns:
my_coep_array = pd.DataFrame(np.polyfit(x, df.values,1)).T
But how do I express a * x **k + b with np.polyfit?
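np.polyfit only fits polynomials, so one possible approach is to keep scipy's curve_fit and call it once per row with DataFrame.apply(axis=1). The following is a minimal sketch under some assumptions: only the two rows y1 and y2 quoted above are used, the column labels are the x values, and the helper names power_law and fit_row are invented for this example. The starting guess p0 is the one from the question.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])
y1 = [288.79, 238.32, 199.42, 181.22, 165.50, 154.74, 152.25, 152.26, 144.81, 144.81, 144.81]
y2 = [309.92, 255.75, 214.02, 194.48, 177.61, 166.06, 163.40, 163.40, 155.41, 155.41, 155.41]

# One row per curve, column labels are the x values.
df = pd.DataFrame([y1, y2], columns=x)

def power_law(x, a, k, b):
    return a * x**k + b

def fit_row(row):
    """Fit a * x**k + b to one row and return the three coefficients."""
    popt, _ = curve_fit(power_law, x, row.to_numpy(dtype=float), p0=(300, -0.5, 0))
    return pd.Series(popt, index=['a', 'k', 'b'])

# apply(axis=1) calls fit_row once per row and stacks the returned Series.
coefs = df.apply(fit_row, axis=1)
result = df.join(coefs)
print(coefs)
```

This still runs one fit per row, so it is a convenience rather than a speed-up; rows where the fit fails to converge would raise and may need their own handling.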

split values from columns and generate sequence number

I have two columns in a df. Each column holds multiple values in one row. I want to split each value into a new row in another table and generate a sequence number. The given data is:
x                    y
76.25, 345.65        87.12,96.45
78.12,35.1,98.27     85.23,65.2,56.63
new df should be like this
x 76.25
y 87.12
sequence number 1
x 345.65
y 96.45
sequence number 2
x 78.12
y 85.23
sequence number 1
x 35.1
y 65.21
sequence number 2
x 98.27
y 56.63
sequence number 3
All values are strings. I have no idea how I should do it. Should I write a function, or is there a dataframe command for this? Any help is appreciated.
You can do it using iterrows() + concat():
df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

def get_parts():
    for index, row in df.iterrows():
        x = row['x'].split(',')
        y = row['y'].split(',')
        for index, _ in enumerate(x):
            # len(x) must be equal len(y)...
            yield 'x', x[index]
            yield 'y', y[index]
            # generate number after each splitted item
            yield 'sequence number', index + 1

# generate Series from parts and union into new dataframe
new_df = pd.concat([
    pd.Series([p[1]], [p[0]])
    for p in get_parts()
])
Hope this helps.
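If a table with one row per value pair is more convenient than one long Series, a variation on the same idea is to collect the pieces as records and build the dataframe once at the end. This is only a sketch: the name long_df is made up, it assumes that x and y always contain the same number of comma-separated values, and it converts the strings to floats, which the question does not explicitly ask for.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25, 345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63'),
})

records = []
for xs, ys in zip(df['x'], df['y']):
    pairs = zip(xs.split(','), ys.split(','))
    # The sequence number restarts at 1 for every original row.
    for seq, (x_val, y_val) in enumerate(pairs, start=1):
        # float() tolerates the stray space in '76.25, 345.65'.
        records.append({'x': float(x_val), 'y': float(y_val), 'sequence number': seq})

long_df = pd.DataFrame(records)
print(long_df)
```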

speeding up numpy code involving array slicing and broadcasting

I have the following code:
x = sp.linspace(-2,2,1000)
z = sp.linspace(-1,3,2000)
X,Z = sp.meshgrid(x,z)
X = X[:,:,sp.newaxis]
Z = Z[:,:,sp.newaxis]
E = sp.zeros((len(z),len(x),3), dtype=complex)
# e_uvect.shape = (2,N,2,3)
# En.shape = (2,N,2)
# d_cum.shape = (N,)
# pol is either 0 or 1
for n in range(N):
    idx = sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1])
    E += e_uvect[pol,n,0,:]*En[pol,n,0]*sp.exp(+1j*self.kz[n]*(Z-d_cum[n-1])+1j*self.kx*X)*idx
Basically, the above is part of a code that calculates the electric field of an N-layer structure. In each iteration of the for loop, I find the index of the array elements that lie within the n-th layer; then, after calculating the electric field, I multiply the whole thing by idx to 'filter' out the part that satisfies sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1]).
It works fine, but I wonder if there is a more efficient way of doing this using numpy array slicing or other methods, because each multiplication involves a large proportion of array elements which are not accepted in each iteration. I tried something like the following to only work on the relevant part of the coordinates array Z and X
idx = sp.logical_and(Z<d_cum[n], Z>=d_cum[n-1])
Z2 = Z[idx]
X2 = X[idx]
E[???] += e_uvect[pol,n,0,:]*En[pol,n,0]*sp.exp(+1j*self.kz[n]*(Z2-d_cum[n-1])+1j*self.kx*X2)
But then Z2 and X2 become 1-D arrays, and I'm not sure about the indexing of E or how to reshape the arrays appropriately.
So are there any ways to speed up the original code?
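One possible direction, sketched here with plain numpy under stated assumptions: because Z comes from meshgrid, the layer condition depends only on the z coordinate, so the mask selects whole rows of E. That means each layer can be written into a row slice, and the exponential can be split into a z-dependent and an x-dependent factor. Everything below is hypothetical stand-in data: edges plays the role of the layer boundaries, amp stands in for e_uvect[pol,n,0,:]*En[pol,n,0], and kx and kz replace self.kx and self.kz. The masked full-grid version is included only to check that both give the same field.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3                                          # number of layers (assumed)
x = np.linspace(-2, 2, 50)
z = np.linspace(-1, 3, 80)
edges = np.array([-1.0, 0.0, 1.5, 3.0 + 1e-9])  # hypothetical layer boundaries
kx = 1.3                                       # stand-in for self.kx
kz = rng.normal(size=N) + 1j * rng.normal(size=N)              # stand-in for self.kz
amp = rng.normal(size=(N, 3)) + 1j * rng.normal(size=(N, 3))   # per-layer vector amplitude

# Reference: evaluate on the full grid and multiply by a boolean mask.
X, Z = np.meshgrid(x, z)
X = X[:, :, np.newaxis]
Z = Z[:, :, np.newaxis]
E_ref = np.zeros((len(z), len(x), 3), dtype=complex)
for n in range(N):
    idx = np.logical_and(Z < edges[n + 1], Z >= edges[n])
    E_ref += amp[n] * np.exp(1j * kz[n] * (Z - edges[n]) + 1j * kx * X) * idx

# Row-sliced version: only touch the rows of E that belong to layer n.
E = np.zeros_like(E_ref)
phase_x = np.exp(1j * kx * x)[np.newaxis, :, np.newaxis]        # shape (1, len(x), 1)
for n in range(N):
    rows = np.flatnonzero((z >= edges[n]) & (z < edges[n + 1]))
    phase_z = np.exp(1j * kz[n] * (z[rows] - edges[n]))[:, np.newaxis, np.newaxis]
    # Broadcasts (rows,1,1) * (1,len(x),1) * (3,) -> (rows, len(x), 3)
    E[rows] += amp[n] * phase_z * phase_x

print(np.allclose(E, E_ref))
```

Whether this is faster in practice depends on the grid size and the number of layers, so it would be worth timing against the original loop.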