I have a dataframe1 of 1802 rows and 29 columns (in code as df) - each row is a person and each column is a number representing their answer to 29 different questions.
I have another dataframe2 of 29 different coefficients (in code as seg_1).
Each column needs to be multiplied by the corresponding coefficient and this needs to be repeated for each participant.
For example - 1802 iterations of q1 * coeff1, 1802 iterations of q2 * coeff2 etc
So I should end up with 1802 * 29 = 52,258 values,
but the result isn't this length, and the values aren't what I expect - I think the loop is multiplying q1-q29 by coeff1, then repeating all of that for coeff2, and so on, which is not what I need.
questions = range(0, 28)
co = range(0, 28)
segment_1 = []

for a in questions:
    for b in co:
        answer = df.iloc[:, a] * seg_1[b]
        segment_1.append([answer])
Proper encoding of the coefficients as a Pandas frame makes this a one-liner
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
and circumvents slow for-loops. In addition, you don't need to hard-code the number of rows in the table (1802), so the code keeps working without changes even if your data grows larger.
For a minimum viable example, see:
import pandas as pd

# answer frame
df_person = pd.DataFrame({'Question_1': [10, 20, 15], 'Question_2' : [4, 4, 2], 'Question_3' : [2, -2, 1]})
# coefficient frame
seg_1 = [2, 4, -1]
N = len(df_person)
df_coeffs = pd.DataFrame({'C_1': [seg_1[0]] * N, 'C_2' : [seg_1[1]] * N, 'C_3' : [seg_1[2]] * N})
# elementwise multiplication & row-wise summation
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
This gives the coefficient table df_coeffs and the answer table df_person with the new Answer column.
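For reference (not shown in the original answer, but it follows directly from the example values above), df_person ends up as:

   Question_1  Question_2  Question_3  Answer
0          10           4           2      34
1          20           4          -2      58
2          15           2           1      37

A shorter variant, not used in the original answer, relies on the fact that multiplying a DataFrame by a list broadcasts the list across the columns, so the repeated coefficient frame isn't strictly needed:

df_person['Answer'] = (df_person * seg_1).sum(axis=1)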
Related question:
Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).
Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.
Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?
First, for gene G, I created a pandas dataframe for the control and experimental conditions, where the proportion of 1s is 10% and 20%, respectively. I extrapolated the conditions to 1000 genes and then performed a cross-tabulation analysis.
import pandas as pd
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower

n = 5000    # no. of records (cells)
nog = 1000  # no. of genes

gene_list = ["gene_" + str(i) for i in range(0, nog)]

def generate_gene_df(gene, n):
    df = pd.DataFrame.from_dict(
        {"Gene": gene,
         "Cells": (f'Cell{x}' for x in range(1, n+1)),
         "Control": np.random.choice([1, 0], p=[0.1, 0.9], size=n),
         "Experimental": np.random.choice([1, 0], p=[0.1+0.1, 0.9-0.1], size=n)},
        orient='columns'
    )
    df = df.set_index(["Gene", "Cells"])
    return df
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
table = pd.crosstab([df["Gene"], df["Cells"]],
                    [df["Control"], df["Experimental"]]).to_numpy()
Table:
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       ...,
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0]])
Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition we need to measure to achieve a power of 95%.
My attempt:
McNemar's test
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')
Output:
statistic=0.000, p-value=1.000
Same proportions of errors (fail to reject H0)
I calculated the power analysis using:
nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)
analysis = GofChisquarePower()
power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
print('Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: %.3f' % power_chisquare)
Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: 0.050
Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?
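For reference on the setup above: with 1000 genes tested, the Bonferroni correction lowers the per-test threshold to 0.05 / 1000 = 5e-05. A minimal sketch (assuming the same analysis object and effect_size as in the attempt above) of passing the corrected threshold, and a 95% power target, to solve_power:

alpha_bonferroni = 0.05 / 1000   # Bonferroni-corrected per-test threshold

# power at a fixed sample size
power = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha_bonferroni)

# sample size needed for 95% power (leave nobs unset so solve_power solves for it)
n_required = analysis.solve_power(effect_size=effect_size, nobs=None,
                                  alpha=alpha_bonferroni, power=0.95)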
I am looking to calculate summary statistics on subsets of a dataframe, where each subset is defined relative to specific values in a given row.
For example, I have a dataframe that has latitude and longitude and number of people.
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})
I want to know the total number of people within 0.05 miles of each row. This is easy to compute with a loop, but it becomes unusable as the dataset grows.
Current/Sample:
from geopy.distance import distance

def distance_calc(row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0

for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    new_df = df.copy()
    new_df['distance'] = new_df.apply(lambda row: distance_calc(row, focus_lat, focus_long), axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance <= .05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance <= .05].shape[0]
Is there any pythonic way to do this?
The approach:
- Take the Cartesian product of the dataframe with itself to get all combinations. This is expensive on larger datasets: it generates N^2 rows, so in this case 25.
- Calculate the distance for each of these combinations.
- Filter with query() down to the distances required.
- groupby() to get the total number of people. Also generate a list of the indexes included in each total, for transparency.
- Finally, join() this back together and you have what you want.
import pandas as pd
import geopy.distance as gd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
              .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x, r.longitude_x),
                                                                           (r.latitude_y, r.longitude_y)).miles, axis=1))
              .query("distance<=0.05")
              .rename(columns={"people_y": "nearby"})
              .groupby("index_x").agg({"nearby": "sum", "index_y": lambda x: list(x)})
              ))
print(df.to_markdown())
|    |   latitude |   longitude |   people |   nearby | index_y   |
|---:|-----------:|------------:|---------:|---------:|:----------|
|  0 |    40.9919 |    -106.049 |        1 |        6 | [0, 1, 2] |
|  1 |    40.992  |    -106.049 |        2 |        6 | [0, 1, 2] |
|  2 |    40.9916 |    -106.049 |        3 |        6 | [0, 1, 2] |
|  3 |    40.9899 |    -106.05  |        4 |        4 | [3]       |
|  4 |    40.9878 |    -106.049 |        5 |        5 | [4]       |
Update - use combinations instead of Cartesian product
It's been bugging me that a Cartesian product is a huge overhead when all that is really required is to calculate distances between the valid combinations:
- use itertools.combinations() to build the list of valid combinations of indexes
- calculate distances for this minimal set
- filter down to only the distances we're interested in
- build permutations of this smaller set to allow a simple join back to the actual data
- join and aggregate
import itertools

# get distances between all valid combinations
dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
       .merge(df, left_on=0, right_index=True)
       .merge(df, left_on=1, right_index=True, suffixes=("_0", "_1"))
       .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0, r.longitude_0),
                                                                    (r.latitude_1, r.longitude_1)).miles, axis=1))
       .loc[:, [0, 1, "distance"]]
       # filter down to close proximities
       .query("distance <= 0.05")
       )

# build all valid permutations of close-by combinations
dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0], dfd[1]]).unique(), 2))
          .merge(df.loc[:, "people"], left_on=1, right_index=True)
          )

# bring it all together
df = (df.reset_index().rename(columns={"index": 0}).merge(dfnppl, on=0, suffixes=("", "_near"), how="left")
      .groupby(0).agg({**{c: "first" for c in df.columns}, **{"people_near": "sum"}})
      )
|   0 |   latitude |   longitude |   people |   people_near |
|----:|-----------:|------------:|---------:|--------------:|
|   0 |    40.9919 |    -106.049 |        1 |             5 |
|   1 |    40.992  |    -106.049 |        2 |             4 |
|   2 |    40.9916 |    -106.049 |        3 |             3 |
|   3 |    40.9899 |    -106.05  |        4 |             0 |
|   4 |    40.9878 |    -106.049 |        5 |             0 |
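For much larger datasets even the combinations approach grows quadratically. One possible alternative, not part of the original answer, is a spatial index such as scikit-learn's BallTree with the haversine metric, which looks up neighbours within a radius without materialising pairs. A rough sketch, starting again from the original df; note that haversine great-circle distances differ slightly from geopy's geodesic distances:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius
coords = np.radians(df[["latitude", "longitude"]].to_numpy())
tree = BallTree(coords, metric="haversine")

# query_radius expects the search radius in radians: miles / Earth radius in miles
idx_lists = tree.query_radius(coords, r=0.05 / EARTH_RADIUS_MILES)

people = df["people"].to_numpy()
# each point's neighbour list includes the point itself, matching the loop in the question
df["total_people_within_05"] = [people[idx].sum() for idx in idx_lists]
df["total_rows_within_05"] = [len(idx) for idx in idx_lists]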
I have a pandas dataframe with thousands of columns, and for each column I would like to:
1. check whether the i-th and (i-1)-th values are in the range (between x and y);
2. if #1 is satisfied, compute log(value_i / value_(i-1)) ** 2 for that position;
3. if #1 is not satisfied, use 0;
4. sum the results of #2 over the column.
Here is a dataframe with a single column:
import numpy as np
import pandas as pd

d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data=d)
df

x = 10
y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
df.IsIn[i] = 1
else:
df.IsIn[i] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would end up with a (1 by n-cols) dataframe of Totals, one per column. How can I best achieve this? I would also appreciate it if you could supplement your answer with a detailed explanation.
Yes, this is an instance where apply works: you only need to wrap your logic in a function. Also, use between and shift for the condition, which eliminates the explicit loop:
def func(s, x=10, y=20):
    '''
    Compute the total for a single column, given as a Series.
    '''
    # mask where values are between x and y
    valid = s.between(x, y)
    # require the previous value to be valid as well
    valid = valid & valid.shift(fill_value=False)
    # squared log return, masked with `valid`, then summed
    return (np.log(s / s.shift())**2 * valid).sum()

# apply `func` to each column
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
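One subtle difference from the loop in the question is worth noting: Series.between is inclusive at both ends by default, whereas the original condition used strict inequalities (x < value < y), so a boundary value such as 10 is treated differently (with strict bounds the example total drops to roughly 0.058). If the strict behaviour is wanted, recent pandas versions accept an inclusive argument:

# strict bounds, mirroring the original x < value < y condition
valid = s.between(x, y, inclusive='neither')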
It seems that cumsum, cumprod, and the other cumulative operations cannot express this transformation; at the moment it looks as though the calculation can only be done in a row-by-row loop. The data has about 10 million rows that need this cross-row calculation, and my computer cannot get through a plain loop at all. I would appreciate a solution, thank you.
The calculations needed are as follows:
for i in range(1, 10000000):
    df.iloc[i, 3] = df.iloc[i-1, 3] * df.iloc[i, 1] + df.iloc[i, 2]
There is probably no pythonic way to do it without looping in C/Java style.
Added: so just do a loop, or hack it with a global variable as follows:
import pandas as pd

prev_result = 0

def my_func(x):
    global prev_result
    prev_result = x.a * prev_result + x.b
    return prev_result

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
df["c"] = df.apply(my_func, axis=1)
# df["c"] is now [1, 4, 15]
# 0 * 1 + 1 = 1; 1 * 2 + 2 = 4; 4 * 3 + 3 = 15
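If the real data has ~10 million rows, apply with a global will still be slow, since it runs Python code for every row. A hedged alternative, not from the original answer and assuming numba is installed and the columns are numeric, is to compile the same recurrence:

import numba
import numpy as np
import pandas as pd

@numba.njit
def recur(a, b):
    # same recurrence as my_func above: out[i] = a[i] * out[i-1] + b[i], starting from 0
    out = np.empty(a.size)
    prev = 0.0
    for i in range(a.size):
        prev = a[i] * prev + b[i]
        out[i] = prev
    return out

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
df["c"] = recur(df["a"].to_numpy(np.float64), df["b"].to_numpy(np.float64))
# df["c"] is again [1.0, 4.0, 15.0]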
Edit: the following approaches are not cumulative and hence do not answer the question.
That being said, @pythonic833's solution:
df.shift(-1).iloc[:,3]*df.iloc[:,1]+df.iloc[:,2]
is quite a decent one.
If I were you, I'd just assign df["temp_column"] as df["third_column"].shift(-1)
df["temp_column"] = df["third_column"].shift(-1)
df["third_column"] = df["temp_column"] * df["first_column"] + df["second_column"]
My proposed solution is a bit easier to read, at the cost of the memory for an extra column.
I have a flat array a:
a = numpy.array([0, 1, 1, 2, 3, 1, 2])
And an array b of indices marking the start of each "chunk":
b = numpy.array([0, 4])
I know I can find the maximum in each "chunk" using a reduction:
m = numpy.maximum.reduceat(a,b)
>>> array([2, 3], dtype=int32)
But... is there a way to find the index of the maximum within each chunk (like numpy.argmax), using vectorized operations (no lists, no loops)?
Borrowing the idea from this post.
Steps involved:
1. Offset all elements of each group by a group-specific multiple of a limit-offset, then sort globally. This keeps every group in its original block of positions while sorting the elements within each group.
2. In the sorted order, the last element of each block is the group maximum; its position in the sort, minus the group's starting offset, gives the argmax within that group.
Thus, a vectorized implementation would be -
import numpy as np

def numpy_argmax_reduceat(a, b):
    n = a.max() + 1  # limit-offset
    grp_count = np.append(b[1:] - b[:-1], a.size - b[-1])
    shift = n * np.repeat(np.arange(grp_count.size), grp_count)
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
As a minor and possibly faster tweak, we could alternatively create shift with cumsum, giving a variation of the earlier approach, like so -
def numpy_argmax_reduceat_v2(a, b):
    n = a.max() + 1  # limit-offset
    id_arr = np.zeros(a.size, dtype=int)
    id_arr[b[1:]] = 1
    shift = n * id_arr.cumsum()
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
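For the example arrays above, both versions return the within-chunk positions of the maxima:

a = np.array([0, 1, 1, 2, 3, 1, 2])
b = np.array([0, 4])

numpy_argmax_reduceat(a, b)     # array([3, 0]): max 2 at position 3 of the first chunk, max 3 at position 0 of the second
numpy_argmax_reduceat_v2(a, b)  # array([3, 0])

One caveat: argsort is not guaranteed to be stable by default, so when a group contains several equal maxima the returned index may be any of the tied positions, whereas numpy.argmax always returns the first.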