How to apply a list comprehension in Panda Dataframe? - pandas

From a list of values, I try to identify any sequential pair of values whose sum exceeds 10
a = [1,9,3,4,5]
...so I wrote a for loop...
values = []
for i in range(len(a)-2):
if sum(a[i:i+2]) >10:
values += [a[i:i+2]]
...which I rewritten as a list comprehension...
values = [a[i:i+2] for i in range(len(a)-2) if sum(a[i:i+2]) >10]
Both produce same output:
values = [[1,9], [9,3]]
My question is how best may I apply the above list comprehension in a DataFrame.
Here is the sample 5 rows DataFrame
import pandas as pd
df = pd.DataFrame({'A': [1,1,1,1,0],
'B': [9,8,3,2,2],
'C': [3,3,3,10,3],
'E': [4,4,4,4,4],
'F': [5,5,5,5,5]})
df['X'] = df.values.tolist()
where:
- a is within a df['X'] which is a list of values Columns A - F
df['X'] = [[1,9,3,4,5],[1,8,3,4,5],[1,3,3,4,5],[1,2,10,4,5],[0,2,3,4,5]]
and, result of the list comprehension is to be store in new column df['X1]
Desired output is:
df['X1'] = [[[1,9], [9,3]],[[8,3]],[[NaN]],[[2,10],[10,4]],[[NaN]]]
Thank you.

You could use pandas apply function, and put your list comprehension in it.
df = pd.DataFrame({'A': [1,1,1,1,0],
'B': [9,8,3,2,2],
'C': [3,3,3,10,3],
'E': [4,4,4,4,4],
'F': [5,5,5,5,5]})
df['x'] = df.apply(lambda a: [a[i:i+2] for i in range(len(a)-2) if sum(a[i:i+2]) >= 10], axis=1)
#Note the axis parameters tells if you want to apply this function by rows or by columns, axis = 1 applies the function to each row.
This will give the output as stated in df['X1']

Related

(Python)Dataframe copy with selected columns and index

Whole dataframe can be copied to df2 as below.
How to copy only 'B' column and index in df to df2?
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30],'B': [100, 200, 300]}, index=['2021-11-24', '2021-11-25', '2021-11-26'])
df2 = df.copy()
You can simply select and then copy as follows:
df2 = df[['B']].copy()
I am using a list as the selection in order to have a DataFrame instead of a pd.Series.

Check similarity of 2 pandas dataframes

I am trying to compare 2 pandas dataframes in terms of column names and datatypes. With assert_frame_equal, I get an error since shapes are different. Is there a way to ignore it, as I could not find it in the documentation.
With df1_dict == df2_dict, it just says whether its similar or not, I am trying to print if there are any differences in terms of feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datsets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.
It seems to me that if the two dataframe descriptions are outer joined, you would have all the information you want.
example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
if pd.isnull(x.df1):
return 'df1-missing'
if pd.isnull(x.df2):
return 'df2-missing'
if x.df1 != x.df2:
return 'type-mismatch'
return 'ok'
diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
index df1 df2 diff_status
0 a int64 float64 type-mismatch
1 b object object ok
2 c NaN int64 df1-missing

Get max value of 3 columns from pandas DataFrame?

I've a Pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is Nan instead of the expected output: US.
How can I get the max value for these 3 columns? (and what if one of them contains Nan)
Use:
c={'a': [['US', 'BE'],['US']],'b': [['US'],['US']], 'c': [['US','BE'],['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
from collections import Counter
df = df[['a','b','c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print (df)
0 [US, BE]
1 [US]
dtype: object
if it as # Erfan stated, most common value in a row then .agg(), mode
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
while your data are lists, you can't use pandas.mode(). because lists objects are unhashable and mode() function won't work.
a solution is converting the elements of your dataframe's row to strings and then use pandas.mode().
check this:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
As I can see you have some elements as a list type, So I think the below-mentioned code will work fine.
First, append all value into an array
Then, find the most occurring element from that array.
from scipy.stats import mode
arr = []
for i in df:
for j in range(len(df[i])):
for k in range(len(df[i][j])):
arr.append(df[i][j][k])
from collections import Counter
b = Counter(arr)
print(b.most_common())
this will give you an answer as you want.

How to find pearson correlation between rows in two dataframes

I have a dataframe that I split into two dataframes of the same amount of columns and rows (df1 and df2). I want to write a function that will go through each row and feed their values into the scipy.stats.pearsonr() function. How would I do this?
Something like:
for index, row in d1.iterrows():
print(scipy.stats.pearsonr(df1.loc[index], df2.loc[index]))
If you just want the function, try this:
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(
{
'A': [0,2,3,4,5],
'B': [2,3,4,5,6],
'C': [5,6,7,8,9],
}
)
df2 = pd.DataFrame(
{
'A': [2,1,3,4,5],
'B': [3,2,4,5,6],
'C': [7,7,7,3,3],
}
)
def pandas_pearsonr(df1, df2):
assert len(df1)==len(df2)
coefs = []
for i in range(0, len(df1)):
coefs.append(pearsonr(df1.iloc[i].values, df2.iloc[i].values))
print(coefs)
return pd.DataFrame(index=df1.index, data=coefs, columns=['coef', 'p-value'])
pandas_pearsonr(df1, df2)
Output looks like this:
coef p-value
0 0.976221 0.139109
1 0.996271 0.054996
2 1.000000 0.000000
3 -0.720577 0.487754
4 -0.838628 0.366717
But I think, it can be more pythonic. And maybe you can use pandas.DataFrame.corr

Getting standard deviation on a specific number of dates

In this dataframe...
import pandas as pd
import numpy as np
import datetime
tf = 365
dt = datetime.datetime.now()-datetime.timedelta(days=365)
df = pd.DataFrame({
'Cat': np.repeat(['a', 'b', 'c'], tf),
'Date': np.tile(pd.date_range(dt, periods=tf), 3),
'Val': np.random.rand(3*tf)
})
How can I get a dictionary of standard deviation of each 'Cat' values for a specific number of days - back from the last day for a large dataset?
This code gives the standard deviation for 10 days...
{s: np.std(df[(df.Cat == s) &
(df.Date > today-datetime.timedelta(days=10))].Val)
for s in df.Cat.unique()}
...looks clunky.
Is there a better way?
First filter by boolean indexing and then aggregate std, but because default value ddof=1 is necessary set it to 0:
d1 = df[(df.Date>dt-datetime.timedelta(days=10))].groupby('Cat')['Val'].std(ddof=0).to_dict()
print (d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
Another solution is use custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
Difference between solutions is if some values in group not matched conditions then is removed and for second solution is assignd NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}