Selecting columns of a pandas dataframe based on criteria - pandas

I have a DataFrame containing the UK election results, with one column per party. The DataFrame looks something like this:
In[107]: Results.columns
Out[107]:
Index(['Press Association ID Number', 'Constituency Name', 'Region', 'Country',
'Constituency ID', 'Constituency Type', 'Election Year', 'Electorate',
' Total number of valid votes counted ', 'Unnamed: 9',
...
'Wessex Reg', 'Whig', 'Wigan', 'Worth', 'WP', 'WRP', 'WVPTFP', 'Yorks',
'Young', 'Zeb'],
dtype='object', length=147)
e.g.
Results.head(2)
Out[108]:
Press Association ID Number Constituency Name Region Country \
0 1 Aberavon Wales Wales
1 2 Aberconwy Wales Wales
Constituency ID Constituency Type Election Year Electorate \
0 W07000049 County 2015 49,821
1 W07000058 County 2015 45,525
Total number of valid votes counted Unnamed: 9 ... Wessex Reg Whig \
0 31,523 NaN ... NaN NaN
1 30,148 NaN ... NaN NaN
Wigan Worth WP WRP WVPTFP Yorks Young Zeb
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 147 columns]
The columns containing the votes for the different parties are Results.ix[:, 'Unnamed: 9':]
Most of these parties poll very few votes in any constituency, and so I would like to exclude them. Is there a way (short of iterating through each row and column myself) of returning only those columns which meet a particular condition, for example having at least one value > 1000? I would ideally like to be able to specify something like
Results.ix[:, 'Unnamed: 9': > 1000]

You can do it this way:
In [94]: df
Out[94]:
a b c d e f g h
0 -1.450976 -1.361099 -0.411566 0.955718 99.882051 -1.166773 -0.468792 100.333169
1 0.049437 -0.169827 0.692466 -1.441196 0.446337 -2.134966 -0.407058 -0.251068
2 -0.084493 -2.145212 -0.634506 0.697951 101.279115 -0.442328 -0.470583 99.392245
3 -1.604788 -1.136284 -0.680803 -0.196149 2.224444 -0.117834 -0.299730 -0.098353
4 -0.751079 -0.732554 1.235118 -0.427149 99.899120 1.742388 -1.636730 99.822745
5 0.955484 -0.261814 -0.272451 1.039296 0.778508 -2.591915 -0.116368 -0.122376
6 0.395136 -1.155138 -0.065242 -0.519787 100.446026 1.584397 0.448349 99.831206
7 -0.691550 0.052180 0.827145 1.531527 -0.240848 1.832925 -0.801922 -0.298888
8 -0.673087 -0.791235 -1.475404 2.232781 101.521333 -0.424294 0.088186 99.553973
9 1.648968 -1.129342 -1.373288 -2.683352 0.598885 0.306705 -1.742007 -0.161067
In [95]: df[df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]]
Out[95]:
e h
0 99.882051 100.333169
1 0.446337 -0.251068
2 101.279115 99.392245
3 2.224444 -0.098353
4 99.899120 99.822745
5 0.778508 -0.122376
6 100.446026 99.831206
7 -0.240848 -0.298888
8 101.521333 99.553973
9 0.598885 -0.161067
Explanation:
In [96]: (df.loc[:, 'e':] > 50).any()
Out[96]:
e True
f False
g False
h True
dtype: bool
In [97]: df.loc[:, 'e':].columns
Out[97]: Index(['e', 'f', 'g', 'h'], dtype='object')
In [98]: df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]
Out[98]: Index(['e', 'h'], dtype='object')
Setup:
In [99]: df = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))
In [100]: df.loc[::2, list('eh')] += 100
UPDATE: starting from pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
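Applied to the original Results frame, the same idea with .loc would look like this (a sketch; 'Unnamed: 9' as the first party column and the 1000-vote cut-off are taken from the question):
party_cols = Results.loc[:, 'Unnamed: 9':]
# if the vote counts were read as strings like '1,234', convert them to numbers first, e.g.:
# party_cols = party_cols.apply(lambda c: pd.to_numeric(c.astype(str).str.replace(',', ''), errors='coerce'))
keep = party_cols.columns[(party_cols > 1000).any()]
Results[keep]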

Y axis in panel

I have a DataFrame dft like this:
Date Apple Amazon Facebook US Bond
0 2018-01-02 NaN NaN NaN NaN
1 2018-01-03 NaN NaN NaN NaN
2 2018-01-04 NaN NaN NaN NaN
3 2018-01-05 NaN NaN NaN NaN
4 2018-01-08 NaN NaN NaN NaN
... ... ... ... ... ...
665 2020-08-24 0.708554 0.528557 0.152367 0.185932
666 2020-08-25 0.639243 0.534403 0.106550 0.133563
667 2020-08-26 0.520858 0.562482 0.018176 0.133283
668 2020-08-27 0.549531 0.593006 -0.011161 0.261187
669 2020-08-28 0.552725 0.595580 -0.038886 0.278847
Change the Date type
dft["Date"] = pd.to_datetime(dft["Date"]).dt.date
idf = dft.interactive()
date_from = datetime.date(yearStart, 1, 1)
date_to = datetime.date(yearEnd, 8, 31)
date_slider = pn.widgets.DateSlider(name="date", start = date_from, end = date_to, steps=1, value=date_from)
date_slider
and I see a date slider. All good. More controls:
tickerNames = ['Apple', 'Amazon', 'Facebook', 'US Bond']
# Radio buttons for metric measures
yaxis = pn.widgets.RadioButtonGroup(
    name='Y axis',
    options=tickerNames,
    button_type='success'
)
pipeline = (
    idf[
        (idf.Date <= date_slider)
    ]
    .groupby(['Date'])[yaxis].mean()
    .to_frame()
    .reset_index()
    .sort_values(by='Date')
    .reset_index(drop=True)
)
If I now type
pipeline
I see a table with a date slider above it, where each symbol is its own "tab". If I click on a symbol and change the slider, I see more or less data. Again, all good. Here is where I get confused. I want to plot the values of the columns:
plot = pipeline.hvplot(x='Date', by='WHAT GOES IN HERE', y=yaxis, line_width=2, title="Prices")
NOTE the WHAT GOES IN HERE placeholder. I need the values from the `dft` dataframe above, but I can't hardwire the symbol since it depends on what the user chooses in the table. I want an interactive chart, so that as I slide the date_slider, more and more of the data for each symbol gets plotted.
If I do it the old-fashioned way:
fig = plt.figure(figsize=(15, 7))
ax1 = fig.add_subplot(1, 1, 1)
dft.plot(ax=ax1)
ax1.set_xlabel('Date')
ax1.set_ylabel('21days rolling daily change')
ax1.set_title('21days rolling daily change of financial assets')
plt.show()
it works as expected.
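For reference, the interactive snippet above assumes roughly these imports (a sketch; which accessors get registered can vary with the hvplot version):
import datetime
import pandas as pd
import panel as pn
import hvplot.pandas  # registers .hvplot (and, in recent versions, .interactive) on DataFrames
pn.extension()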

Pandas - Splitting data from one column into multiple columns

I have a DataFrame in the following format:
id, data
101, [{"tree":[
{"Group":"1001","sub-group":3,"Child":"100267","Child_1":"8 cm"},
{"Group":"1002","sub-group":1,"Child":"102280","Child_1":"4 cm"},
{"Group":"1003","sub-group":0,"Child":"102579","Child_1":"0.1 cm"}]}]
102, [{"tree":[
{"Group":"2001","sub-group":3,"Child":"200267","Child_1":"6 cm"},
{"Group":"2002","sub-group":1,"Child":"202280","Child_1":"4 cm"}]}]
103,
I am trying to split the data from this one column into multiple columns.
Expected output:
id, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1
101, 1001, 3, 100267, 8 cm, 1002, 1, 102280, 4 cm, 1003, 0, 102579, 0.1 cm
102, 2001, 3, 200267, 6 cm, 2002, 1, 202280, 4 cm
103
Output of df.loc[:15, ['id','data']].to_dict()
{'id': {1: '101',
4: '102',
11: '103',
15: '104',
16: '105'},
'data': {1: '[{"tree":[{"Group":"","sub-group":"3","Child":"100267","Child_1":"8 cm"}]}]',
4: '[{"tree":[{"sub-group":"0.01","Child_1":"4 cm"}]}]',
11: '[{"tree":[{"sub-group":null,"Child_1":null}]}]',
15: '[{"tree":[{"Group":"1003","sub-group":15,"Child":"child_","Child_1":"41 cm"}]}]',
16: '[{"tree":[{"sub-group":"0.00","Child_1":"0"}]}]'}}
You can use explode on the column data, create a DataFrame from it, add a cumcount column, then reshape with set_index, stack, unstack and droplevel to fit your expected output, and join back to the column id:
s = df['data'].dropna().str['tree'].explode()
df_f = df[['id']].join(pd.DataFrame(s.tolist(), s.index)
                         .assign(cc=lambda x: x.groupby(level=0).cumcount() + 1)
                         .set_index('cc', append=True)
                         .stack()
                         .unstack(level=[-2, -1])
                         .droplevel(0, axis=1),
                       how='left')
print(df_f)
id Group sub-group Child Child_1 Group sub-group Child Child_1 Group \
0 101 1001 3 100267 8 cm 1002 1 102280 4 cm 1003
1 102 2001 3 200267 6 cm 2002 1 202280 4 cm NaN
2 103 NaN NaN NaN NaN NaN NaN NaN NaN NaN
sub-group Child Child_1
0 0 102579 0.1 cm
1 NaN NaN NaN
2 NaN NaN NaN
Note: while this fits your expected output, having the same column name several times is not really good practice. I would rather remove the droplevel call and flatten the MultiIndex columns instead.
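For instance, a minimal sketch of that alternative, reusing s from the code above (the Group_1-style flat labels are an assumption about what the names should look like):
wide = (pd.DataFrame(s.tolist(), s.index)
          .assign(cc=lambda x: x.groupby(level=0).cumcount() + 1)
          .set_index('cc', append=True)
          .stack()
          .unstack(level=[-2, -1]))
wide.columns = [f'{col}_{cc}' for cc, col in wide.columns]  # e.g. Group_1, sub-group_1, ...
df_f = df[['id']].join(wide, how='left')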
Edit: after some comments, here is one way to actually work through the whole column with its somewhat irregular format:
import ast

def f(x):
    try:
        return ast.literal_eval(x.replace('null', "'nan'"))[0]['tree']
    except:
        return [{}]

# then create s with
s = df['data'].apply(f).explode()
# then create df_f as above

DataFrame: Moving average with rolling, mean and shift while ignoring NaN

I have a data set of, let's say, 420x1. Now I would like to calculate the moving average of the past 30 days, excluding the current date.
If I do the following:
df.rolling(window = 30).mean().shift(1)
my df results in a window with lots of NaNs, which is probably caused by NaNs in the original dataframe here and there (a single NaN within the 30 data points causes the MA to be NaN).
Is there a method that ignores NaN (avoiding the apply method; I run it on large data, so performance is key)? I do not want to replace the values with 0 because that could skew the results.
The same applies to the moving standard deviation.
For example, you can add min_periods, and the NaN is gone:
df=pd.DataFrame({'A':[1,2,3,np.nan,2,3,4,np.nan]})
df.A.rolling(window=2,min_periods=1).mean()
Out[7]:
0 1.0
1 1.5
2 2.5
3 3.0
4 2.0
5 2.5
6 3.5
7 4.0
Name: A, dtype: float64
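Applied to the original 30-day question (a sketch, assuming the same df and that you still want to exclude the current row via shift):
# skip NaNs inside each 30-row window instead of propagating them,
# then shift so the current date is excluded from its own average
df.rolling(window=30, min_periods=1).mean().shift(1)
# the same idea works for the moving standard deviation
df.rolling(window=30, min_periods=1).std().shift(1)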
Option 1
df.dropna().rolling('30D').mean()
Option 2
df.interpolate('index').rolling('30D').mean()
Option 2.5
df.interpolate('index').rolling(30).mean()
Option 3
s.rolling('30D').apply(np.nanmean)
Option 3.5
df.rolling(30).apply(np.nanmean)
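Note that the offset-based '30D' windows in Options 1-3 require a (monotonic) DatetimeIndex; a minimal sketch of that setup (the 'Date' column name is an assumption):
df = df.set_index('Date')            # 'Date' assumed to hold the dates
df.index = pd.to_datetime(df.index)  # rolling('30D') needs a DatetimeIndex
df.dropna().rolling('30D').mean()    # Option 1, now over calendar days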
You can try dropna() to remove the NaN values, or fillna() to replace the NaN with a specific value.
Or you can filter out all NaN values with notnull() or isnull() within your operation.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)
one two three
a 0.434024 -0.749472 -1.393307
b NaN NaN NaN
c 0.897861 0.032307 -0.602912
d NaN NaN NaN
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
g NaN NaN NaN
h -1.772906 -1.342019 -0.948151
df3 = df2[df2['one'].notnull()]
# use ~isnull() would return the same result
# df3 = df2[~df2['one'].isnull()]
print(df3)
one two three
a 0.434024 -0.749472 -1.393307
c 0.897861 0.032307 -0.602912
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
h -1.772906 -1.342019 -0.948151
For further reference, pandas has clean documentation about handling missing data (read this).

Pandas organise delimited rows of data frame into dictionary

After reading a csv file with pandas by:
df = pd.read_csv(file_name, names=['x', 'y', 'z'], header=None, delim_whitespace=True)
print(df)
Outputs something like:
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
Now, ideally, I would like to organise all the data with the assumption that everything below a "ROW" entry in the data frame belongs together. Maybe I would like a dictionary of NumPy arrays, so that
dict = {ROW1: [[60.1662 30.5987 -29.2246], [60.1680 30.5951 -29.2212], [60.1735 30.5843 -29.2101]], ROW2: [[60.1955 30.5410 -29.1664]], ... }
Basically, each dictionary entry is a NumPy array of the coordinates in the data frame. What would be the best way to do this?
Sounds like we need some dictionary comprehension here:
In [162]:
print(df)
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
In [163]:
df['label'] = df.loc[df.x=='ROW', ['x','y']].apply(lambda x: x[0]+'%i'%x[1], axis=1)
In [164]:
df.label.fillna(method='pad', inplace=True)
df = df.dropna().set_index('label')
In [165]:
{k: df.loc[k].values.tolist() for k in df.index.unique()}
Out[165]:
{'ROW1': [['60.1662', 30.5987, -29.2246],
['60.1680', 30.5951, -29.2212],
['60.1735', 30.5843, -29.2101]],
'ROW2': [['60.1955', 30.541, -29.1664]],
'ROW3': [['60.1955', 30.541, -29.1664],
['60.1958', 30.5412, -29.1665],
['60.1965', 30.5419, -29.1667]]}
Here is another way.
df['label'] = (df.x == 'ROW').astype(int).cumsum()
Out[24]:
x y z label
0 ROW 1.0000 NaN 1
1 60.1662 30.5987 -29.2246 1
2 60.1680 30.5951 -29.2212 1
3 60.1735 30.5843 -29.2101 1
4 ROW 2.0000 NaN 2
5 60.1955 30.5410 -29.1664 2
6 ROW 3.0000 NaN 3
7 60.1955 30.5410 -29.1664 3
8 60.1958 30.5412 -29.1665 3
9 60.1965 30.5419 -29.1667 3
Then, by grouping on the label column, you can process the df however you like. You have all the column names within each group, which is very convenient to work with.
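A sketch of that final step (names taken from the frames above; deriving the 'ROW<n>' keys from the integer label is an assumption about the desired keys):
rows = df[df.x != 'ROW']                      # drop the separator rows
result = {'ROW%d' % k: g[['x', 'y', 'z']].astype(float).values
          for k, g in rows.groupby('label')}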

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't alphabetical but makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
It seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# assign directly
merged.columns = cols
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data=np.random.randn(5, 3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=np.random.randn(5, 3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged.columns = cols
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.028935 NaN -0.687143 NaN 1.528579
1 0.943432 NaN -2.055357 NaN -0.720132
2 0.035234 NaN 0.020756 NaN 1.556319
3 1.447863 NaN 0.847496 NaN -1.458852
4 0.132337 NaN -0.255578 NaN -0.222660
0 NaN 0.131085 NaN 0.850022 NaN
1 NaN -1.942110 NaN 0.672965 NaN
2 NaN 0.944052 NaN 1.274509 NaN
3 NaN -1.796448 NaN 0.130338 NaN
4 NaN 0.961545 NaN -0.741825 NaN
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.727619
1 0.022209
2 -0.350757
3 1.116637
4 1.947526
The same alphabetical sorting of the columns happens with concat as well, so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found "advanced indexing" to work quite well:
df2 = df.loc[:, 'order of columns']
As I see it, the order is lost on append, but the original data should have the correct order. To maintain it, assuming a DataFrame 'alldata' and a DataFrame 'newdata' with the data to be appended, appending while keeping the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
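A minimal sketch of that idiom (the frame names come from the answer above; the columns and values are made up):
import pandas as pd

alldata = pd.DataFrame([[1, 30, 45]], columns=['Month', 'Minute', 'Second'])
newdata = pd.DataFrame([{'Second': 10, 'Month': 7, 'Minute': 5}])
# older pandas sorts the appended columns alphabetically; reindexing with
# list(alldata) restores the original order either way
alldata = alldata.append(newdata)[list(alldata)]
print(alldata.columns)  # Index(['Month', 'Minute', 'Second'], dtype='object')
# note: DataFrame.append was removed in pandas 2.0;
# pd.concat([alldata, newdata])[list(alldata)] gives the same column order there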