Pandas - Splitting data from one column into multiple columns

I have a Dataframe in the below format:
id, data
101, [{"tree":[
{"Group":"1001","sub-group":3,"Child":"100267","Child_1":"8 cm"},
{"Group":"1002","sub-group":1,"Child":"102280","Child_1":"4 cm"},
{"Group":"1003","sub-group":0,"Child":"102579","Child_1":"0.1 cm"}]}]
102, [{"tree":[
{"Group":"2001","sub-group":3,"Child":"200267","Child_1":"6 cm"},
{"Group":"2002","sub-group":1,"Child":"202280","Child_1":"4 cm"}]}]
103,
I am trying to have data from this one column split into multiple columns
Expected output:
id, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1
101, 1001, 3, 100267, 8 cm, 1002, 1, 102280, 4 cm, 1003, 0, 102579, 0.1 cm
102, 2001, 3, 200267, 6 cm, 2002, 1, 202280, 4 cm
103
Output of df.loc[:15, ['id','data']].to_dict()
{'id': {1: '101',
4: '102',
11: '103',
15: '104',
16: '105'},
'data': {1: '[{"tree":[{"Group":"","sub-group":"3","Child":"100267","Child_1":"8 cm"}]}]',
4: '[{"tree":[{"sub-group":"0.01","Child_1":"4 cm"}]}]',
11: '[{"tree":[{"sub-group":null,"Child_1":null}]}]',
15: '[{"tree":[{"Group":"1003","sub-group":15,"Child":"child_","Child_1":"41 cm"}]}]',
16: '[{"tree":[{"sub-group":"0.00","Child_1":"0"}]}]'}}

You can use explode on the column data, create a DataFrame from it, add a cumcount column, then reshape it with set_index, stack, unstack and droplevel to fit your expected output, and finally join back to the column id:
s = df['data'].dropna().str['tree'].explode()
df_f = df[['id']].join(pd.DataFrame(s.tolist(), s.index)
                         .assign(cc=lambda x: x.groupby(level=0).cumcount() + 1)
                         .set_index('cc', append=True)
                         .stack()
                         .unstack(level=[-2, -1])
                         .droplevel(0, axis=1),
                       how='left')
print(df_f)
id Group sub-group Child Child_1 Group sub-group Child Child_1 Group \
0 101 1001 3 100267 8 cm 1002 1 102280 4 cm 1003
1 102 2001 3 200267 6 cm 2002 1 202280 4 cm NaN
2 103 NaN NaN NaN NaN NaN NaN NaN NaN NaN
sub-group Child Child_1
0 0 102579 0.1 cm
1 NaN NaN NaN
2 NaN NaN NaN
Note: while this does fit your expected output, having the same column name several times is not really good practice. I would rather remove the droplevel call and flatten the MultiIndex columns instead (a sketch of that follows below).
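For reference, a minimal sketch of that flattened-column variant, reusing s from above; the Group_1, Child_2, ... naming scheme is my own choice, not from the question:
import pandas as pd

wide = (pd.DataFrame(s.tolist(), s.index)
          .assign(cc=lambda x: x.groupby(level=0).cumcount() + 1)
          .set_index('cc', append=True)
          .stack()
          .unstack(level=[-2, -1]))

# flatten the (cc, name) column MultiIndex into e.g. 'Group_1', 'Child_2', ...
wide.columns = [f'{name}_{cc}' for cc, name in wide.columns]

df_f = df[['id']].join(wide, how='left')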
Edit: after some comments, here is one way to actually parse the whole column, given its somewhat odd format:
import ast

def f(x):
    try:
        # replace the JSON null so ast.literal_eval can parse the string
        return ast.literal_eval(x.replace('null', "'nan'"))[0]['tree']
    except:
        # NaN rows and malformed strings fall through here
        return [{}]

# then create s with
s = df['data'].apply(f).explode()
# then create df_f like above
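If the raw cells are valid JSON strings, which the to_dict output above suggests (an assumption on my part), json.loads is an alternative to ast.literal_eval and handles null natively:
import json

def f(x):
    try:
        # json handles the null values natively (they become None)
        return json.loads(x)[0]['tree']
    except (TypeError, ValueError):
        # NaN rows (floats) and malformed strings end up here
        return [{}]

s = df['data'].apply(f).explode()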

Related

Find difference between groupby values for specific categories in Pandas

I'd like to find the difference between values in a Pandas groupby dataframe, but for specific column values. I've read multiple posts about using the diff command, but that applies to subsequent rows regardless of groupings.
In the dataframe below (provided as a dictionary), there are columns for a user id trial_id, a condition placebovstreatment, a moderator variable expbin, and a value.
I want to calculate the difference between values within users, but only if they have values for certain condition categories.
For instance, user 1 has values of
correct_placebo_baseline 10.000
correct_treatment 21.000
The difference is 11.
User 2 has values of
0 22.000
correct_placebo_baseline 8.688
The difference is roughly 13.
User 1 has a difference between the column categories correct_placebo_baseline and correct_treatment. User 2 has a difference between correct_placebo_baseline and category '0'.
How do I calculate the difference only if a user has both a correct_placebo_baseline and a correct_treatment grouping? Or, alternatively, how do you create columns where the differences are specific per group per user?
The formula could create columns 'difference from baseline for correct placebo' and 'difference from baseline for 0' for each trial_id.
The challenge is that some users don't have a baseline score. Some users have a baseline score but nothing else. I need difference values only if they have both.
I tried to find a way to run a function when groupby categories meet certain criteria, but couldn't.
Thanks for any help and let me know if I can make this question easier to answer.
{'trial_id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5},
'placebovstreatment': {0: '0',
1: 'correct_placebo_baseline',
2: 'correct_treatment',
3: '0',
4: 'correct_placebo_baseline',
5: 'correct_placebo_baseline',
6: 'incorrect_placebo',
7: 'correct_placebo_baseline',
8: 'incorrect_placebo',
9: '0'},
'expbin': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1, 9: 1},
'value': {0: 31.5,
1: 10.0,
2: 21.0,
3: 22.0,
4: 8.688,
5: 20.0,
6: 37.5,
7: 12.0,
8: 32.5,
9: 10.0}}
You can pivot to get the conditions as columns:
df2 = df.pivot(index=['trial_id', 'expbin'], columns='placebovstreatment', values='value')
Output:
placebovstreatment 0 correct_placebo_baseline correct_treatment incorrect_placebo
trial_id expbin
1 1 31.5 10.000 21.0 NaN
2 2 22.0 8.688 NaN NaN
3 2 NaN 20.000 NaN 37.5
4 1 NaN 12.000 NaN 32.5
5 1 10.0 NaN NaN NaN
You can then easily perform computations:
df2['correct_treatment'] - df2['correct_placebo_baseline']
Output:
trial_id expbin
1 1 11.0
2 2 NaN
3 2 NaN
4 1 NaN
5 1 NaN
dtype: float64
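To also cover the second part of the question (per-user difference columns, computed only when both conditions are present), one possible sketch is to build the differences from the pivoted frame and merge them back onto the original rows; the column names diff_treatment_vs_baseline and diff_0_vs_baseline are just illustrative:
import pandas as pd

# differences come out as NaN automatically whenever one of the two
# conditions is missing for a user, which is the requested behaviour
diffs = pd.DataFrame({
    'diff_treatment_vs_baseline': df2['correct_treatment'] - df2['correct_placebo_baseline'],
    'diff_0_vs_baseline': df2['0'] - df2['correct_placebo_baseline'],
}).reset_index()

# attach the per-user differences back to every original row of that user
out = df.merge(diffs, on=['trial_id', 'expbin'], how='left')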

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouping column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as seen in the str.join documentation, joining NaN results in an empty string:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried to convert col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then empty values literally hold the string 'nan', and are hence considered non-empty.
Not only does str.join now include the 'nan' strings, but other calculations in the script that rely on those NaNs are ruined as well.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So I tried fillna('') and even dropna(). Neither provided the desired results.
You get the vicious cycle here, right?
astype(str) => 'nan' strings in the join, and other calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Convertin col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove the missing values with DataFrame.dropna, or with Series.notna in boolean indexing:
df = pd.DataFrame({'col_a':[1,2,3,4,1,2,3,4,1,2],
'col_str':['a','b','c','d',np.nan, np.nan, np.nan, np.nan,'a', 's']})
df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                               .groupby('col_a')['col_str']
                               .unique().str.join(', ')).rename('labels')))
print (df2)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
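Another possible variant, for completeness, is to drop the NaNs inside the aggregation itself instead of pre-filtering the frame; a sketch on the same toy data:
labels = (df.groupby('col_a')['col_str']
            .agg(lambda s: ', '.join(s.dropna().unique())))

df3 = df.join(df['col_a'].map(labels).rename('labels'))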

How to get the groupby nth row directly in the row as an item?

I have Date, Time, Open, High, Low, Close data on a minute basis for a stock, arranged in ascending order by date. I want to make a new column and, for every row of a day, insert the previous day's price taken from the second row of that previous date. So, for instance, I have put a price of 18812.3 in front of 11th Jan, since the last date was 10th Jan and its second row has a price of 18812.3. Similarly, I have done it for the day before yesterday too. I tried using nth on a groupby object, but for that I have to create a groupby object. The code below gets me a new DataFrame, but I would like to create a column directly holding the desired values.
test = bn_futures.groupby('Date')[['Open', 'High', 'Low', 'Close']].nth(1).reset_index()
Try: (check comments)
# Convert Date to datetime64 and set it as index
df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True)).set_index('Date')
# Find second value for each day
prices = df.groupby(level=0)['Open'].nth(1).squeeze()
# Find last row for each day
mask = ~df.index.duplicated(keep='last')
# Create new columns
df.loc[mask, 'price at yesterday'] = prices.shift(1)
df.loc[mask, 'price 2d ago'] = prices.shift(2)
Output:
>>> df
Open price at yesterday price 2d ago
Date
2015-01-09 1 NaN NaN
2015-01-09 2 NaN NaN
2015-01-09 3 NaN NaN
2015-01-10 4 NaN NaN
2015-01-10 5 NaN NaN
2015-01-10 6 2.0 NaN
2015-01-11 7 NaN NaN
2015-01-11 8 NaN NaN
2015-01-11 9 5.0 2.0
Setup a MRE:
df = pd.DataFrame({'Date': ['09-01-2015', '09-01-2015', '09-01-2015',
                            '10-01-2015', '10-01-2015', '10-01-2015',
                            '11-01-2015', '11-01-2015', '11-01-2015'],
                   'Open': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
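If you would rather have the value on every row of a day (not only on the day's last row), one possible variation, assuming prices is indexed by Date as in the output above, is to map the shifted series back through the index:
# prices is the per-day "second row" series computed above (indexed by Date)
df['price at yesterday'] = df.index.map(prices.shift(1))
df['price 2d ago'] = df.index.map(prices.shift(2))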

DataFrame: Moving average with rolling, mean and shift while ignoring NaN

I have a data set, let's say, 420x1. Now I would like to calculate the moving average of the past 30 days, excluding the current date.
If I do the following:
df.rolling(window = 30).mean().shift(1)
my df results in a window with lots of NaNs, which is probably caused by NaNs in the original dataframe here and there (one NaN within the 30 data points causes the MA to be NaN).
Is there a method that ignores NaN (avoiding apply-method, I run it on large data so performance is key)? I do not want to replace the value with 0 because that could skew the results.
The same applies to the moving standard deviation.
For example, you can add min_periods, and the NaN is gone:
df=pd.DataFrame({'A':[1,2,3,np.nan,2,3,4,np.nan]})
df.A.rolling(window=2,min_periods=1).mean()
Out[7]:
0 1.0
1 1.5
2 2.5
3 3.0
4 2.0
5 2.5
6 3.5
7 4.0
Name: A, dtype: float64
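Applied to the question's setup (mean of the past 30 days, excluding the current date, with NaNs ignored), this would look roughly as follows; the column name value is a placeholder of mine and min_periods=1 is just one possible threshold:
# 'value' is a placeholder column name for the 420x1 series; adjust to your data
df['ma_30'] = df['value'].rolling(window=30, min_periods=1).mean().shift(1)
df['sd_30'] = df['value'].rolling(window=30, min_periods=1).std().shift(1)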
Option 1
df.dropna().rolling('30D').mean()
Option 2
df.interpolate('index').rolling('30D').mean()
Option 2.5
df.interpolate('index').rolling(30).mean()
Option 3
s.rolling('30D').apply(np.nanmean)
Option 3.5
df.rolling(30).apply(np.nanmean)
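A side note on the apply-based options: since performance was a concern in the question, passing raw=True (so the function receives a plain NumPy array rather than a Series) usually speeds them up:
import numpy as np

# raw=True hands an ndarray to np.nanmean instead of a Series
df.rolling(30).apply(np.nanmean, raw=True)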
You can try dropna() to remove the NaN values or fillna() to replace the NaN with a specific value.
Alternatively, you can filter out the NaN values with notnull() or isnull() within your operation.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)
one two three
a 0.434024 -0.749472 -1.393307
b NaN NaN NaN
c 0.897861 0.032307 -0.602912
d NaN NaN NaN
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
g NaN NaN NaN
h -1.772906 -1.342019 -0.948151
df3 = df2[df2['one'].notnull()]
# use ~isnull() would return the same result
# df3 = df2[~df2['one'].isnull()]
print(df3)
one two three
a 0.434024 -0.749472 -1.393307
c 0.897861 0.032307 -0.602912
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
h -1.772906 -1.342019 -0.948151
For further reference, pandas has clean documentation about handling missing data (read this).

Selecting columns of a pandas dataframe based on criteria

I have a DF containing the UK election results, with one column per party. So the DF is something like:
In[107]: Results.columns
Out[107]:
Index(['Press Association ID Number', 'Constituency Name', 'Region', 'Country',
'Constituency ID', 'Constituency Type', 'Election Year', 'Electorate',
' Total number of valid votes counted ', 'Unnamed: 9',
...
'Wessex Reg', 'Whig', 'Wigan', 'Worth', 'WP', 'WRP', 'WVPTFP', 'Yorks',
'Young', 'Zeb'],
dtype='object', length=147)
e.g.
Results.head(2)
Out[108]:
Press Association ID Number Constituency Name Region Country \
0 1 Aberavon Wales Wales
1 2 Aberconwy Wales Wales
Constituency ID Constituency Type Election Year Electorate \
0 W07000049 County 2015 49,821
1 W07000058 County 2015 45,525
Total number of valid votes counted Unnamed: 9 ... Wessex Reg Whig \
0 31,523 NaN ... NaN NaN
1 30,148 NaN ... NaN NaN
Wigan Worth WP WRP WVPTFP Yorks Young Zeb
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 147 columns]
The columns containing the votes for the different parties are Results.ix[:, 'Unnamed: 9':]
Most of these parties poll very few votes in any constituency, and so I would like to exclude them. Is there a way (short of iterating through each row and column myself) of returning only those columns which meet a particular condition, for example having at least one value > 1000? I would ideally like to be able to specify something like
Results.ix[:, 'Unnamed: 9': > 1000]
You can do it this way:
In [94]: df
Out[94]:
a b c d e f g h
0 -1.450976 -1.361099 -0.411566 0.955718 99.882051 -1.166773 -0.468792 100.333169
1 0.049437 -0.169827 0.692466 -1.441196 0.446337 -2.134966 -0.407058 -0.251068
2 -0.084493 -2.145212 -0.634506 0.697951 101.279115 -0.442328 -0.470583 99.392245
3 -1.604788 -1.136284 -0.680803 -0.196149 2.224444 -0.117834 -0.299730 -0.098353
4 -0.751079 -0.732554 1.235118 -0.427149 99.899120 1.742388 -1.636730 99.822745
5 0.955484 -0.261814 -0.272451 1.039296 0.778508 -2.591915 -0.116368 -0.122376
6 0.395136 -1.155138 -0.065242 -0.519787 100.446026 1.584397 0.448349 99.831206
7 -0.691550 0.052180 0.827145 1.531527 -0.240848 1.832925 -0.801922 -0.298888
8 -0.673087 -0.791235 -1.475404 2.232781 101.521333 -0.424294 0.088186 99.553973
9 1.648968 -1.129342 -1.373288 -2.683352 0.598885 0.306705 -1.742007 -0.161067
In [95]: df[df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]]
Out[95]:
e h
0 99.882051 100.333169
1 0.446337 -0.251068
2 101.279115 99.392245
3 2.224444 -0.098353
4 99.899120 99.822745
5 0.778508 -0.122376
6 100.446026 99.831206
7 -0.240848 -0.298888
8 101.521333 99.553973
9 0.598885 -0.161067
Explanation:
In [96]: (df.loc[:, 'e':] > 50).any()
Out[96]:
e True
f False
g False
h True
dtype: bool
In [97]: df.loc[:, 'e':].columns
Out[97]: Index(['e', 'f', 'g', 'h'], dtype='object')
In [98]: df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]
Out[98]: Index(['e', 'h'], dtype='object')
Setup:
In [99]: df = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))
In [100]: df.loc[::2, list('eh')] += 100
UPDATE:
Starting from pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
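Translated to the question's DataFrame with .loc instead of .ix (and assuming the party columns from 'Unnamed: 9' onwards are numeric), it would look roughly like this:
# grab the per-party vote columns (everything from 'Unnamed: 9' onwards)
party_votes = Results.loc[:, 'Unnamed: 9':]
# keep only parties that polled more than 1000 votes in at least one constituency
keep = party_votes.columns[(party_votes > 1000).any()]
filtered = Results[keep]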