concatenate multiple dfs with same dimensions and apply functions to cell values of all dfs and store result in the cell - pandas

df1 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
Let's say I concatenate df1 and df2 (in the real case I have many dfs, each 700x200) in a way that I get something like the table below (I don't need to see this table, it is just for explanation):
       col a   col b
row a  [1,4]   [7,8]
row b  [9,2]   [2,0]
Then I want to pass each cell's values to the compute function below and store the result back into the cell.
def compute(row, column, cell_values):
    baseline_df = [2, 4, 6, 7, 8]
    result = baseline_df
    for values in cell_values:
        if (column - row) != dict[values]:  # dict contains specific values
            result = baseline_df
        else:
            result = result.apply(func, value=values)
    return result.loc[column - row]
def func(df, value):
    # operation
    result_df = df * value
    return result_df
What I want is to take df1 and df2, concatenate them, apply the function above and get the results, in a really fast way.
In the actual use case the dfs are quite big, and running this for every cell would take a significant amount of time, so I need a faster way to do it.
Note:
This is my idea of doing this. I hope you understand what my requirements are. Please let me know if that is not clear.
Currently I am using something like the line below; it just takes the max value of each cell and does the calculation (func) later.
This only gives the max value of all the cells combined:
dfs = pd.concat(grid).max(level=0)
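As a side note, recent pandas versions removed the level argument of max; assuming grid is the list of same-shaped DataFrames, an equivalent spelling is:

dfs = pd.concat(grid).groupby(level=0).max()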
The final result should be something like this after the calculation (the same 2D array with new cell data):
       col a   col b
row a  0.1     0.7
row b  0.9     0.6
Different approaches are also welcome
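Not the compute logic above (which depends on the external dict), but a minimal sketch of the stacking pattern: same-shaped dfs go into one 3D NumPy array so a vectorized per-cell operation runs over all of them at once, assuming the per-cell logic can be expressed with NumPy:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 9, size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(2, 2)))

# shape (n_dfs, n_rows, n_cols): cell (i, j) of every df sits along axis 0
stacked = np.stack([df.to_numpy() for df in (df1, df2)])

# example vectorized per-cell reduction (placeholder for the real logic)
result = pd.DataFrame(stacked.prod(axis=0) / 100,
                      index=df1.index, columns=df1.columns)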

Related

iterating through each value in the column and comparing them with the other values in other columns

I am working on going through each element of each column and comparing it with every other element of the other columns in the dataframe.
To do this I made 4 nested loops and tried to test it, but apparently it is very, very slow. Is there any other way I can do it?
Here is my code:
num = 0
for i in df:
    for k in df:
        for val in df[i]:
            for val2 in df[k]:
                if val == val2:
                    num += 1
                else:
                    break
It just counts the common elements, but that is not my main purpose; I just want to know whether there is an efficient way to do it.
For Example:
I want to find the edit distance between each value in a column and the value at the same index in every other column, but all that I could find is computing distances between all values in one column and all values in the other columns, which is quite slow.
The picture shows this better; I want the variant with the 'tick sign' on it. And I want the average distance of the newly made columns.
Output:
Average Distance between column1 and column2 is (some num) ,
Average Distance between column1 and column3 is (some num)
Thanks a ton!
This might be what you are looking for. I had to create some of my own data, but I believe this is what you are trying to accomplish:
df = pd.DataFrame({
    'Column1': ['Yes'],
    'Column2': ['No'],
    'Column3': ['Yes'],
    'Column4': ['Yes']
})
df['Count'] = df.apply(lambda x: ' '.join(x), axis=1).apply(lambda x: x.split()).apply(lambda x: x.count('Yes'))
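That snippet counts 'Yes' values per row; for the same-index, column-pair comparison described in the question, a sketch along these lines may be closer (difflib's similarity ratio is used here as a stand-in for a real edit-distance function, and the data is made up):

from itertools import combinations
import difflib
import pandas as pd

df = pd.DataFrame({
    'Column1': ['cat', 'dog', 'bird'],
    'Column2': ['cap', 'dig', 'bird'],
    'Column3': ['car', 'dog', 'bard'],
})

# compare each pair of columns element-wise at the same index, then average
for c1, c2 in combinations(df.columns, 2):
    scores = [difflib.SequenceMatcher(None, a, b).ratio()
              for a, b in zip(df[c1], df[c2])]
    print(f"Average distance between {c1} and {c2} is {sum(scores) / len(scores):.3f}")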

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain the cells given a Series of column labels, taking the value from every row using the corresponding column label from the joined Series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to get the values that sit under the biggest-in-their-row values (I'm doing a poor man's ML :) ).
However, I cannot find any interface for selecting cells by a Series of column labels. Any ideas, folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need the equivalent of a DataFrame lookup:
v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
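A small usage sketch with made-up data (the frame and its index here are assumptions, not from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))
s = df.idxmax(axis=1).shift(1)          # previous row's argmax column label
v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index),
                     df.columns.get_indexer_for(v)]
print(v)  # each row's value in the column that was the previous row's max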

Rolling apply lambda function based on condition

I have a dataframe with normalised (to 100) returns for 18 products (columns). I want to apply a lambda function which multiplies the next row by the previous row.
I can do :
df = df.rolling(2).apply(lambda x: (x[0] * x[1]), raw=True)
But some of my columns don't have values on row 1 (they go live on row 4). So I need to either:
1. Have a lambda function that starts only on row 4 yet applies to the entire df (I can create the first 4 rows manually), or
2. As my values are 100 until "live", have the lambda function apply only when the value does not equal 100.
I have tried both:
1.
df.iloc[3:, :] = df.iloc[3:, :].rolling(2).apply(lambda x: (x[0] * x[1]), raw=True)
2.
df = df.rolling(2).apply(lambda x: (x[0] * x[1]) if x[0] != 100 else x, raw=True)
But both meet with total failure.
Any advice welcomed - I've spent hours looking through the site and have yet to find any outcome that works for this situation.
So, given the lack of responses, I came up with a solution where I split my df in 2 parts and appended it back together. My lambda function was also garbage; I needed something like:
df2 = df.copy()
for i in range(df2.index.size):
    if not i:
        continue
    df2.iloc[i] = df2.iloc[i - 1] * df.iloc[i]
df2
to actually achieve what I was after.
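Since the loop above is just the recurrence df2[i] = df2[i-1] * df[i] starting from the first row, a vectorized equivalent (assuming that reading of the loop is exactly what is wanted) is a cumulative product, which avoids the Python-level loop on a large frame:

df2 = df.cumprod()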

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])

   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df

df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be related to creating the new column D; if I create column D outside the function, the code runs fine. Can someone help explain what caused the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group. The other problem is that this function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:

   A  B  C
A
a  a  1  2
b  b  1  1
Edit following the comment
In my opinion, you should not modify the original group, as that would indirectly modify the original DataFrame. At the very least pandas displays a warning about this, and it is considered bad practice; search the web for SettingWithCopyWarning for a more extensive description.
My code (the get_uniq_t function) does not modify the original group. It only returns one row from the current group: the row with the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new DataFrame whose consecutive rows are the results of this function for consecutive groups.
You can perform an operation equivalent to modification when you create some new content, e.g. as the result of a groupby instruction, and then save it under the same variable that so far held the source DataFrame.
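For what it is worth, the same rows can be picked without apply at all; a sketch, assuming the goal is to keep, per group in A, the row with the largest C * 10 + B (the resulting index differs from the apply version):

key = df.C * 10 + df.B
result = df.loc[key.groupby(df.A).idxmax()]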

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1.
I want to extract an equal number of rows that have 0's and 1's in the target column. I was thinking of using the pandas sampling function but I'm not sure how to declare the equal number of samples I want from both classes based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample,
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: use GroupBy.sample
You get similar results using GroupBy.sample:
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percent of data you want back from the original dataframe.
You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 by filtering df on the target column. If you provide sample data I can help you construct that logic.
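A minimal sketch of that split-and-sample approach, assuming a binary target column and 5000 rows per class:

df0 = df[df['target'] == 0].sample(n=5000, random_state=1)
df1 = df[df['target'] == 1].sample(n=5000, random_state=1)
dfsample = pd.concat([df0, df1]).sample(frac=1, random_state=1)  # shuffle the combined result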