I have a pandas DataFrame containing an 'ID' column and a 'Code' column whose values are lists:
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Code': [['A', 'B'], ['A', 'B'], ['A', 'B', 'C'], ['A'], ['A'],
                            ['A', 'C'], ['D', 'C'], ['A', 'D']]})
I would like to groupby ID and get a list of all codes associated with each ID:
df_groupby = pd.DataFrame(df.groupby('ID')['Code'].apply(list))
After executing the above, I have a DataFrame at the ID level whose 'Code' column contains a list of lists. How would I flatten each list of lists so that each ID has a single flat list of all its codes?
Try this. You can use np.hstack to stack arrays in sequence horizontally, which flattens each list of lists:
import numpy as np
df_groupby["Code"] = df_groupby["Code"].apply(lambda x: np.hstack(x))
or
df_groupby["Code"] = df_groupby["Code"].apply(np.hstack)
Use a list comprehension:
df = df.groupby('ID')['Code'].agg(lambda x: [z for y in x for z in y]).to_frame()
print(df)
                     Code
ID
1   [A, B, A, B, A, B, C]
2                  [A, A]
3            [A, C, D, C]
4                  [A, D]
Answering my own question. Applying numpy's hstack does the trick:
df_groupby['Code'] = df_groupby['Code'].apply(np.hstack)
I have a Pandas dataframe with two columns:
col1: a list column
col2: an integer that specifies the index of the list element that I would like to extract and store in col3. It can be NaN, in which case the outcome should be NaN as well.
Sample input:
df = pd.DataFrame({
    'col1': [['A', 'B'], ['C', 'D', 'E'], ['F', 'G']],
    'col2': [0, 2, np.nan]})
Expected output:
df_out = pd.DataFrame({
    'col1': [['A', 'B'], ['C', 'D', 'E'], ['F', 'G']],
    'col2': [0, 2, np.nan],
    'col3': ['A', 'E', np.nan]})
You can use a basic apply:
def func(row):
    # A NaN index means there is nothing to extract, so propagate NaN
    if np.isnan(row.col2):
        return np.nan
    else:
        # Otherwise use col2 as an integer position into the col1 list
        return row.col1[int(row.col2)]

df['col3'] = df.apply(func, axis=1)
output:
        col1  col2 col3
0     [A, B]   0.0    A
1  [C, D, E]   2.0    E
2     [F, G]   NaN  NaN
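As an aside (my assumption, not from the original answer): np.isnan raises a TypeError if col2 ever contains None or another non-float, so pd.isnull is a slightly more robust check:
# pd.isnull handles NaN, None and pd.NA alike
df['col3'] = df.apply(
    lambda row: np.nan if pd.isnull(row.col2) else row.col1[int(row.col2)],
    axis=1)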
You can also use a list comprehension that zips the two columns:
df['col3'] = [i[int(j)] if not np.isnan(j) else j for i, j in zip(df.col1, df.col2)]
        col1  col2 col3
0     [A, B]   0.0    A
1  [C, D, E]   2.0    E
2     [F, G]   NaN  NaN
I have a pandas dataframe df with the contents below:
df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [15, 10, 5]})
I would like to get a third column z that shows the result of dividing each value in y by the value of y in the row where x == 'c'.
I tried the following, but it did not work:
df['z'] = df['y']/df.loc[df['x']== 'c', 'y']
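The attempt fails because df.loc[df['x'] == 'c', 'y'] returns a one-element Series, so the division aligns on index labels instead of broadcasting a scalar. A minimal sketch of a fix, assuming the df above:
# Extract the scalar y value from the row where x == 'c' ...
c_val = df.loc[df['x'] == 'c', 'y'].iloc[0]
# ... then broadcast the division across the whole column
df['z'] = df['y'] / c_val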
I am trying to compare 2 pandas dataframes in terms of column names and datatypes. With assert_frame_equal, I get an error since the shapes are different. Is there a way to ignore that? I could not find one in the documentation.
With df1_dict == df2_dict, it only tells me whether they are equal or not; I am trying to print any differences in terms of feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datasets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.
It seems to me that if the two dataframes' dtype listings are outer-joined, you would have all the information you want.
example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
    df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
    # Column exists only in df2
    if pd.isnull(x.df1):
        return 'df1-missing'
    # Column exists only in df1
    if pd.isnull(x.df2):
        return 'df2-missing'
    # Column exists in both but with different dtypes
    if x.df1 != x.df2:
        return 'type-mismatch'
    return 'ok'
diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
  index     df1      df2    diff_status
0     a   int64  float64  type-mismatch
1     b  object   object             ok
2     c     NaN    int64    df1-missing
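A possible follow-up (my addition, not from the original answer) is to print only the rows that differ:
# Keep only columns whose presence or dtype differs between the frames
print(diff[diff['diff_status'] != 'ok'])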
In this dataframe...
import pandas as pd
import numpy as np
import datetime
tf = 365
dt = datetime.datetime.now() - datetime.timedelta(days=365)
df = pd.DataFrame({
    'Cat': np.repeat(['a', 'b', 'c'], tf),
    'Date': np.tile(pd.date_range(dt, periods=tf), 3),
    'Val': np.random.rand(3 * tf)
})
How can I get a dictionary of the standard deviation of 'Val' for each 'Cat', over a specific number of days counted back from the last day, for a large dataset?
This code gives the standard deviation for 10 days...
today = datetime.datetime.now()  # reference point for the lookback window
{s: np.std(df[(df.Cat == s) &
              (df.Date > today - datetime.timedelta(days=10))].Val)
 for s in df.Cat.unique()}
...looks clunky.
Is there a better way?
First filter with boolean indexing and then aggregate std per group; because pandas' std uses ddof=1 by default while np.std uses ddof=0, set ddof=0 to match the question:
d1 = df[(df.Date>dt-datetime.timedelta(days=10))].groupby('Cat')['Val'].std(ddof=0).to_dict()
print (d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
Another solution is to use a custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
The difference between the solutions: if no values in a group match the condition, the first solution drops that group from the dict entirely, while the second assigns NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
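If "back from the last day" should be anchored to the most recent date in the data rather than to the current time, a sketch of a variant (my assumption, not from the original answer):
# Anchor the 10-day window to the last date present in the data
cutoff = df['Date'].max() - datetime.timedelta(days=10)
d3 = df[df['Date'] > cutoff].groupby('Cat')['Val'].std(ddof=0).to_dict()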
Assuming the following dataframe:
In [26]: x = { 'a': 9 }
In [27]: y = { 'b': 10 }
In [28]: 'a' in x
Out[28]: True
In [29]: 'a' in y
Out[29]: False
In [32]: df = DataFrame({'data': Series([x, y])})
In [34]: df
Out[34]:
data
0 {'a': 9}
1 {'b': 10}
How can I obtain a new dataframe that only contains rows where the dictionary in the data column has the key a?
df['a' in df['data']] results in an error.
This is still quite simple:
df[df['data'].apply(lambda x: 'a' in x)]
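An equivalent list-comprehension form (an alternative I'm adding, not from the original answer) builds the boolean mask directly:
# Keep rows whose dict contains the key 'a'
df[['a' in d for d in df['data']]]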