Pandas organise delimited rows of data frame into dictionary - numpy

After reading a CSV file with pandas using:
df = pd.read_csv(file_name, names= ['x', 'y', 'z'], header=None, delim_whitespace=True)
print(df)
Outputs something like:
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
Ideally, I would like to organise the data on the assumption that every row below a "ROW" entry belongs to that entry, up to the next "ROW" entry. Perhaps I would like a dictionary of arrays, so that
dict = {ROW1: [[60.1662 30.5987 -29.2246], [60.1680 30.5951 -29.2212], [60.1735 30.5843 -29.2101]], ROW2: [[60.1955 30.5410 -29.1664]], ... }
Basically, each dictionary entry is a numpy array of the coordinates in the data frame. What would be the best way to do this?

Sounds like we need some dictionary comprehension here:
In [162]:
print(df)
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
In [163]:
df['label'] = df.loc[df.x=='ROW', ['x','y']].apply(lambda r: r['x'] + '%i' % r['y'], axis=1)
In [164]:
df['label'] = df['label'].ffill()
df = df.dropna().set_index('label')
In [165]:
{k: df.loc[[k]].values.tolist() for k in df.index.unique()}
Out[165]:
{'ROW1': [['60.1662', 30.5987, -29.2246],
['60.1680', 30.5951, -29.2212],
['60.1735', 30.5843, -29.2101]],
'ROW2': [['60.1955', 30.541, -29.1664]],
'ROW3': [['60.1955', 30.541, -29.1664],
['60.1958', 30.5412, -29.1665],
['60.1965', 30.5419, -29.1667]]}

Here is another way.
df['label'] = (df.x == 'ROW').astype(int).cumsum()
Out[24]:
x y z label
0 ROW 1.0000 NaN 1
1 60.1662 30.5987 -29.2246 1
2 60.1680 30.5951 -29.2212 1
3 60.1735 30.5843 -29.2101 1
4 ROW 2.0000 NaN 2
5 60.1955 30.5410 -29.1664 2
6 ROW 3.0000 NaN 3
7 60.1955 30.5410 -29.1664 3
8 60.1958 30.5412 -29.1665 3
9 60.1965 30.5419 -29.1667 3
Then group by the label column and process each group however you like. Every group keeps all the column names, which makes it convenient to work with.
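A minimal sketch of how the cumsum label and groupby could produce the dictionary of numpy arrays asked for in the question. It assumes the marker rows are numbered consecutively from 1, so the running count can stand in for the number in the y column; the names `raw`, `is_marker`, `block_id` and `blocks` are made up for this illustration, and the sample text mirrors the data shown above.

```python
import io

import pandas as pd

# Sample input mirroring the question's data (whitespace-delimited, ragged rows).
raw = """ROW 1
60.1662 30.5987 -29.2246
60.1680 30.5951 -29.2212
60.1735 30.5843 -29.2101
ROW 2
60.1955 30.5410 -29.1664
ROW 3
60.1955 30.5410 -29.1664
60.1958 30.5412 -29.1665
60.1965 30.5419 -29.1667
"""

df = pd.read_csv(io.StringIO(raw), names=['x', 'y', 'z'], header=None, sep=r'\s+')

# True on the "ROW" marker lines; the running sum numbers each block 1, 2, 3, ...
is_marker = df['x'] == 'ROW'
block_id = is_marker.cumsum()

# Drop the marker lines and convert the remaining coordinates to floats.
coords = df.loc[~is_marker, ['x', 'y', 'z']].astype(float)

# One numpy array of shape (n_points, 3) per block.
blocks = {
    f'ROW{number}': group.to_numpy()
    for number, group in coords.groupby(block_id[~is_marker])
}
```

If the marker numbers are not consecutive, the label would have to be built from the y value of each marker row instead of from the running count.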

Related

Y axis in panel

I have a DataFrame dft like this:
Date Apple Amazon Facebook US Bond
0 2018-01-02 NaN NaN NaN NaN
1 2018-01-03 NaN NaN NaN NaN
2 2018-01-04 NaN NaN NaN NaN
3 2018-01-05 NaN NaN NaN NaN
4 2018-01-08 NaN NaN NaN NaN
... ... ... ... ... ...
665 2020-08-24 0.708554 0.528557 0.152367 0.185932
666 2020-08-25 0.639243 0.534403 0.106550 0.133563
667 2020-08-26 0.520858 0.562482 0.018176 0.133283
668 2020-08-27 0.549531 0.593006 -0.011161 0.261187
669 2020-08-28 0.552725 0.595580 -0.038886 0.278847
Change the Date type
dft["Date"] = pd.to_datetime(dft["Date"]).dt.date
idf = dft.interactive()
date_from = datetime.date(yearStart, 1, 1)
date_to = datetime.date(yearEnd, 8, 31)
date_slider = pn.widgets.DateSlider(name="date", start = date_from, end = date_to, steps=1, value=date_from)
date_slider
and I see a date slider, as expected. Next, more controls:
tickerNames = ['Apple', 'Amazon', 'Facebook', 'US Bond']
# Radio buttons for metric measures
yaxis = pn.widgets.RadioButtonGroup(
name='Y axis',
options=tickerNames,
button_type='success'
)
pipeline = (
idf[
(idf.Date <= date_slider)
]
.groupby(['Date'])[yaxis].mean()
.to_frame()
.reset_index()
.sort_values(by='Date')
.reset_index(drop=True)
)
If I now type
pipeline
I see a table with a date slider above it, where each symbol is its own "tab". If I click on a symbol and move the slider, I see more or less data. Again, all good. Here is where I get confused. I want to plot the values of the columns:
plot = pipeline.hvplot(x = 'Date', by='WHAT GOES IN HERE', y=yaxis,line_width=2, title="Prices")
NOTE the 'WHAT GOES IN HERE' placeholder. I need the values from the `dft` dataframe above, but I can't hard-wire the symbol, since it depends on what the user chooses in the table. I want an interactive chart, so that as I move the date_slider, more and more of the data for each symbol gets plotted.
If I do it the old-fashioned way:
fig = plt.figure(figsize=(15, 7))
ax1 = fig.add_subplot(1, 1, 1)
dft.plot(ax=ax1)
ax1.set_xlabel('Date')
ax1.set_ylabel('21days rolling daily change')
ax1.set_title('21days rolling daily change of financial assets')
plt.show()
it works as expected.

Using python pandas, how can I merge datasets together and create a column that has the unique modifier? [duplicate]

Here is the current dataset that I am working with.
df contains Knn, Kss, and Ktt in three separate columns.
What I have been unable to figure out is how to merge the three into a single column and have a column that has a label.
Here is what I currently have:
df_CohBeh = pd.concat([pd.DataFrame(Knn),
pd.DataFrame(Kss),
pd.DataFrame(Ktt)],
keys=['Knn', 'Kss', 'Ktt'],
ignore_index=True)
Which looks like this:
display(df_CohBeh)
Knn Kss Ktt
0 24.579131 NaN NaN
1 21.673524 NaN NaN
2 25.785409 NaN NaN
3 20.686215 NaN NaN
4 21.504863 NaN NaN
.. ... ... ...
106 NaN NaN 27.615440
107 NaN NaN 27.636029
108 NaN NaN 26.215347
109 NaN NaN 27.626850
110 NaN NaN 25.473380
This in essence filters them, but I would rather have a single column of values plus a string label ("Knn", "Kss", "Ktt") that I can use to plot them on the same seaborn graph and compare their distributions.
I'm not sure how to create a label column that marks each value as, say, Knn.
If df looks like this:
>>> df
Knn Kss Ktt
0 96.054660 72.301166 15.355594
1 36.221933 72.646999 41.670382
2 96.503307 78.597493 71.959442
3 53.867432 17.315678 35.006592
4 43.014227 75.122762 83.666844
5 63.301808 72.514763 64.597765
6 0.201688 1.069586 98.816202
7 48.558265 87.660352 9.140665
8 64.353999 43.534200 15.202242
9 41.260903 24.128533 25.963022
10 63.571747 17.474933 47.093538
11 91.006290 90.834753 37.672980
12 61.960163 87.308155 64.698762
13 87.403750 86.402637 78.946980
14 22.238364 88.394919 81.935868
15 56.356764 80.575804 72.925204
16 30.431063 4.466978 32.257898
17 21.403800 46.752591 59.831690
18 57.330671 14.172341 64.764542
19 54.163311 66.037043 0.822948
Try df.melt() to merge the three columns into a single column, with a second column holding the label:
variable value
0 Knn 96.054660
1 Knn 36.221933
2 Knn 96.503307
3 Knn 53.867432
4 Knn 43.014227
5 Knn 63.301808
...
20 Kss 72.301166
21 Kss 72.646999
22 Kss 78.597493
23 Kss 17.315678
24 Kss 75.122762
25 Kss 72.514763
...
40 Ktt 15.355594
41 Ktt 41.670382
42 Ktt 71.959442
43 Ktt 35.006592
44 Ktt 83.666844
45 Ktt 64.597765
...
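As a small, self-contained sketch of the melt call described above: the frame below reuses the first two rows of the example data, and the column names `label` and `value` passed to `var_name` and `value_name` are choices made for this illustration rather than anything required by pandas.

```python
import pandas as pd

# First two rows of the example frame shown above.
df = pd.DataFrame({
    'Knn': [96.054660, 36.221933],
    'Kss': [72.301166, 72.646999],
    'Ktt': [15.355594, 41.670382],
})

# Stack the three columns into one value column plus a label column.
long_df = df.melt(var_name='label', value_name='value')
print(long_df)
```

A long frame like this can then be passed to seaborn with the label column as the grouping variable, for example `sns.histplot(data=long_df, x='value', hue='label')`.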
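You should use a pandas Series.
import numpy as np
knn = pd.DataFrame({...})
kss = pd.DataFrame({...})
ktt = pd.DataFrame({...})
# concatenate the flattened values; `+` on NumPy arrays would add them element-wise
l = np.concatenate([knn.values.flatten(), kss.values.flatten(), ktt.values.flatten()])
s = pd.Series(l, name="Knn")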

Pandas columns headers split

I have a dataframe whose column headers are made up of 3 tags separated by '__'
E.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and just use the first tag with this code: df.columns=df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example the 1st and 3rd?
giving
A__66 B__45
0
1
2
3
4
5
I've tried the below but it's not working:
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]
With a specific regex substitution (recent pandas versions need regex=True for the pattern to be treated as a regex):
In [124]: df.columns.str.replace(r'__[^_]+__', '__', regex=True)
Out[124]: Index(['A__66', 'B__45'], dtype='object')
Use Index.map with f-strings to select the first and third values of each list:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
You can also try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I find your own solution perfectly fine, and probably the most readable. It just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or, for the sake of efficiency, avoid calling str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
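For an arbitrary combination of tags, the split-and-join idea from the answers above can be wrapped in a small helper. This is only a sketch: `keep_tags` and its `positions` argument are hypothetical names introduced here, and it assumes every header contains enough tags for the requested positions.

```python
import pandas as pd


def keep_tags(columns, positions, sep='__'):
    """Rebuild each header from the tags at the given positions."""
    return [sep.join(name.split(sep)[i] for i in positions) for name in columns]


# Empty frame with the headers from the question.
df = pd.DataFrame(columns=['A__2__66', 'B__4__45'], index=range(6))

# Keep the first and third tags (positions 0 and 2).
df.columns = keep_tags(df.columns, positions=(0, 2))
print(list(df.columns))
```

Passing `positions=(0,)` reproduces the first-tag-only result from the question.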

Selecting columns of a pandas dataframe based on criteria

I have a DF which contains the UK election results, with one column per party. The DF is something like:
In[107]: Results.columns
Out[107]:
Index(['Press Association ID Number', 'Constituency Name', 'Region', 'Country',
'Constituency ID', 'Constituency Type', 'Election Year', 'Electorate',
' Total number of valid votes counted ', 'Unnamed: 9',
...
'Wessex Reg', 'Whig', 'Wigan', 'Worth', 'WP', 'WRP', 'WVPTFP', 'Yorks',
'Young', 'Zeb'],
dtype='object', length=147)
e.g.
Results.head(2)
Out[108]:
Press Association ID Number Constituency Name Region Country \
0 1 Aberavon Wales Wales
1 2 Aberconwy Wales Wales
Constituency ID Constituency Type Election Year Electorate \
0 W07000049 County 2015 49,821
1 W07000058 County 2015 45,525
Total number of valid votes counted Unnamed: 9 ... Wessex Reg Whig \
0 31,523 NaN ... NaN NaN
1 30,148 NaN ... NaN NaN
Wigan Worth WP WRP WVPTFP Yorks Young Zeb
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 147 columns]
The columns containing the votes for the different parties are Results.ix[:, 'Unnamed: 9':]
Most of these parties poll very few votes in any constituency, and so I would like to exclude them. Is there a way (short of iterating through each row and column myself) of returning only those columns which meet a particular condition, for example having at least one value > 1000? I would ideally like to be able to specify something like
Results.ix[:, 'Unnamed: 9': > 1000]
You can do it this way:
In [94]: df
Out[94]:
a b c d e f g h
0 -1.450976 -1.361099 -0.411566 0.955718 99.882051 -1.166773 -0.468792 100.333169
1 0.049437 -0.169827 0.692466 -1.441196 0.446337 -2.134966 -0.407058 -0.251068
2 -0.084493 -2.145212 -0.634506 0.697951 101.279115 -0.442328 -0.470583 99.392245
3 -1.604788 -1.136284 -0.680803 -0.196149 2.224444 -0.117834 -0.299730 -0.098353
4 -0.751079 -0.732554 1.235118 -0.427149 99.899120 1.742388 -1.636730 99.822745
5 0.955484 -0.261814 -0.272451 1.039296 0.778508 -2.591915 -0.116368 -0.122376
6 0.395136 -1.155138 -0.065242 -0.519787 100.446026 1.584397 0.448349 99.831206
7 -0.691550 0.052180 0.827145 1.531527 -0.240848 1.832925 -0.801922 -0.298888
8 -0.673087 -0.791235 -1.475404 2.232781 101.521333 -0.424294 0.088186 99.553973
9 1.648968 -1.129342 -1.373288 -2.683352 0.598885 0.306705 -1.742007 -0.161067
In [95]: df[df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]]
Out[95]:
e h
0 99.882051 100.333169
1 0.446337 -0.251068
2 101.279115 99.392245
3 2.224444 -0.098353
4 99.899120 99.822745
5 0.778508 -0.122376
6 100.446026 99.831206
7 -0.240848 -0.298888
8 101.521333 99.553973
9 0.598885 -0.161067
Explanation:
In [96]: (df.loc[:, 'e':] > 50).any()
Out[96]:
e True
f False
g False
h True
dtype: bool
In [97]: df.loc[:, 'e':].columns
Out[97]: Index(['e', 'f', 'g', 'h'], dtype='object')
In [98]: df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]
Out[98]: Index(['e', 'h'], dtype='object')
Setup:
In [99]: df = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))
In [100]: df.loc[::2, list('eh')] += 100
UPDATE:
Starting from pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers; it was removed entirely in pandas 1.0.
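Applied to a frame shaped like the election results, the same idea might look like the sketch below. The party names and vote counts are invented purely for illustration, and it assumes the vote columns are already numeric (counts stored as strings such as '49,821' would need converting first).

```python
import pandas as pd

# Hypothetical miniature of the results table; the values are illustrative only.
results = pd.DataFrame({
    'Constituency Name': ['Aberavon', 'Aberconwy'],
    'Big Party': [15416.0, 8514.0],
    'Tiny Party': [352.0, None],
    'Other': [None, 1200.0],
})

# Slice out the party columns by label, as in the question.
party_cols = results.loc[:, 'Big Party':]

# Keep only the parties with at least one value above the threshold.
keep = party_cols.columns[(party_cols > 1000).any()]
filtered = results[['Constituency Name', *keep]]
print(filtered)
```

NaN compares as False against the threshold, so parties that did not stand in a constituency do not need special handling.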

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the columns are not in alphabetical order but in an order that makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
It seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# select the columns in that order (assigning to merged.columns would only rename them)
merged = merged[cols]
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
Example:
In [26]:
from numpy.random import randn
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data = randn(5,3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=randn(5,3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
# select by the ordered list so each column keeps its own data
merged = merged[cols]
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 1.528579 0.028935 -0.687143 NaN NaN
1 -0.720132 0.943432 -2.055357 NaN NaN
2 1.556319 0.035234 0.020756 NaN NaN
3 -1.458852 1.447863 0.847496 NaN NaN
4 -0.222660 0.132337 -0.255578 NaN NaN
0 NaN NaN NaN 0.727619 0.131085
1 NaN NaN NaN 0.022209 -1.942110
2 NaN NaN NaN -0.350757 0.944052
3 NaN NaN NaN 1.116637 -1.796448
4 NaN NaN NaN 1.947526 0.961545
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.850022
1 0.672965
2 1.274509
3 0.130338
4 -0.741825
The same alphabetical sorting of the columns also happens with concat, so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
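A small sketch of the join alternative, using made-up values in place of the random data above. Because the two frames share the same row index and have no column names in common, join places the columns of df2 to the right of those of df1 without reordering them.

```python
import pandas as pd

# Illustrative values only; both frames use the default 0..n-1 index.
df1 = pd.DataFrame({
    'Org_Goals_1': [1.0, 2.0],
    'Calc_Goals_1': [1.1, 2.1],
    'Diff_Goals_1': [0.1, 0.1],
})
df2 = pd.DataFrame({
    'Org_Goals_2': [3.0, 4.0],
    'Calc_Goals_2': [3.2, 4.2],
    'Diff_Goals_2': [0.2, 0.2],
})

# Column-wise combination aligned on the index.
joined = df1.join(df2)
print(list(joined.columns))
```

Note that join combines columns side by side, whereas append stacks rows, so it only fits when each run contributes new columns rather than new rows.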
Actually, I found "advanced indexing" to work quite well:
df2 = df.loc[:, 'order of columns']
As I see it, the order is lost when appending, but the original data should keep its column order. To maintain that, assuming an existing dataframe 'alldata' and a dataframe of new data 'newdata', appending while keeping the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
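DataFrame.append was removed in pandas 2.0, so on current versions the same pattern can be sketched with pd.concat. The frames and values below are invented for illustration; the point is that selecting with the original column list restores the order of 'alldata', whatever order concat returns.

```python
import pandas as pd

# Existing log with the desired (non-alphabetical) column order.
alldata = pd.DataFrame({
    'Org_Goals_1': [1.0],
    'Calc_Goals_1': [1.5],
    'Diff_Goals_1': [0.5],
})

# New run logged from a dictionary whose keys arrive in a different order.
newdata = pd.DataFrame([{'Diff_Goals_1': 0.2, 'Org_Goals_1': 2.0, 'Calc_Goals_1': 2.2}])

# Stack the rows, then select the columns in the order of the original log.
combined = pd.concat([alldata, newdata], ignore_index=True)[list(alldata.columns)]
print(combined)
```

Selecting by a list of names reorders the columns and keeps each column's data attached to its name, unlike assigning a new list to `.columns`, which only renames.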