Pandas DataFrame + object type + HDF + PyTables 'table' - pandas

(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None and a valid trial can produce [] or [1,2] etc etc.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure as a regular table as I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable length nature of data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas. The NaN solution leads me to potentially extremely wide tables and an additional column marking valid/invalid trials.
If I used a database I would make the data a binary blob column. With Pandas, my current working solution is to save data as an object array in a regular frame, load it all in, and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
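Here is a minimal sketch of that working scheme, with hypothetical in-memory stand-ins for my data and trials tables (the HDF round-trip itself is omitted; this only shows the read-everything-then-select-by-index step):

```python
import pandas as pd
import numpy as np

# hypothetical stand-ins for the 'data' and 'trials' tables:
# obs = None marks an invalid trial; arrays (possibly empty) mark valid trials
data = pd.DataFrame({"obs": [np.array([1.0, 2.0]), None, np.array([])]})
trials = pd.DataFrame({"valid": [True, False, True]})

# read everything in one go, then pull out rows for valid trials only
picked = data.loc[trials.index[trials["valid"]]]
```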
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]  # numpy.empty leaves values uninitialized, hence the garbage numbers below
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype

Here's a simple example along the lines of what you have described (this session assumes from pandas import DataFrame, import numpy as np, and from numpy.random import randn):
In [17]: df = DataFrame(randn(10,10))
In [18]: df.iloc[5:10,7:9] = np.nan
In [19]: df.iloc[7:10,4:9] = np.nan
In [22]: df.iloc[7:10,-1] = np.nan
In [23]: df
Out[23]:
0 1 2 3 4 5 6 7 8 9
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN
In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)
In [25]: df
Out[25]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
Note that in 0.12 you should use table=True, rather than fmt (this is in the process of changing)
In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [27]: pd.read_hdf('test.h5','df')
Out[27]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
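Going the other way, the stop column lets you unpack the padded rows back into variable-length lists. Here is a minimal sketch on a tiny hand-made frame; note that an all-NaN row comes back as an empty list (with stop of None), so you would still need a separate flag to distinguish an invalid trial from a zero-length one, as the question points out:

```python
import pandas as pd
import numpy as np

# tiny padded frame: trial 0 has two values, trial 1 has one, trial 2 is all-NaN
df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [5.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])
df["stop"] = df.apply(lambda r: r.last_valid_index(), axis=1)

# rebuild the variable-length lists (NaN padding is dropped)
lists = [row.drop("stop").dropna().tolist() for _, row in df.iterrows()]
```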

Related

Pandas Long to Wide conversion

I am new to Pandas. I have a data set in this format.
UserID ISBN BookRatings
0 276725.0 034545104U 0.0
1 276726.0 155061224 5.0
2 276727.0 446520802 0.0
3 276729.0 052165615W 3.0
4 276729.0 521795028 6.0
I would like to create this
ISBN        276725      276726      276727      276729
UserID
0       034545104U           0           0           0
1                0   155061224           0           0
2                0           0   446520802           0
3                0           0           0  052165615W
4                0           0           0   521795028
I tried pivot but was not successful. Any kind advice please?
I think that pivot() is the right approach here. The most difficult part is to get the arguments right. We need to keep the original index, the new columns should be the values in column UserID, and the new dataframe should be filled with the values from column ISBN.
To do this, I first extract the original index as a column and then apply the pivot() function:
df = df.reset_index()
result = df.pivot(index='index', columns='UserID', values='ISBN')
# Convert the float column labels to integers (only works if all user IDs
# are numbers; drop NaN values first)
result.columns = [int(c) for c in result.columns]
Output:
276725 276726 276727 276729
index
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
Edit: If you want the same appearance as in the original dataframe you have to apply the following line as well:
result = result.rename_axis(None, axis=0)
Output:
276725 276726 276727 276729
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
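For reference, the whole sequence can be run end to end. This sketch rebuilds the sample frame from the question and uses a list comprehension for the integer conversion (a bare map would leave an un-materialized iterator in Python 3):

```python
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({
    "UserID": [276725.0, 276726.0, 276727.0, 276729.0, 276729.0],
    "ISBN": ["034545104U", "155061224", "446520802", "052165615W", "521795028"],
})

# keep the original index as a column, then pivot
result = df.reset_index().pivot(index="index", columns="UserID", values="ISBN")
result.columns = [int(c) for c in result.columns]   # float labels -> ints
result = result.rename_axis(None, axis=0)           # drop the 'index' axis name
```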

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and stuck on this specific problem. I have 2 DataFrames, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # needed if your dtypes are not already float
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
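A runnable sketch with the frames from the question (note that df1's integer columns become floats once NaN is introduced):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [9, 6, 11, 8]})
df2 = pd.DataFrame({"A": [np.nan, np.nan, 0.16, 0.16],
                    "B": [0.05, 0.05, np.nan, np.nan]})

# keep df1's values only where df2 is not NaN
df3 = df1[df2.notna()]          # equivalently: df1.where(df2.notna())
```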

For every row in pandas, do until sample ID change

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2", "sample3"]
out = pd.DataFrame()
for sample in samples:
    if my_df["ID"] == sample:
        my_list = []
        for index, row in my_df.iterrows():
            other_list = [row.loc_start]
            my_list.append(other_list)
        my_list = pd.DataFrame(my_list)
        out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize of course that this could be done easier if my_df really would look like this. However, what I'm after is the principle to iterate over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
Output:
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
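The one-liner can be checked end to end; this sketch reconstructs my_df from the question:

```python
import pandas as pd

my_df = pd.DataFrame({
    "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
    "loc_start": [10, 15, 10, 20, 5],
})

# one column per ID; then compact each column by dropping its NaN gaps
out = (my_df.pivot(columns="ID", values="loc_start")
            .rename_axis(None, axis=1)
            .apply(lambda x: pd.Series(x.dropna().values)))
```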
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from jezrael's answer,
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data, you would use groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe df2 that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 to Experiment 12 in df2, and the first column is filled with the right values, but all following columns are filled with N/A.
> Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment
> 10 Experiment 11 Experiment 12
> 0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values. Column assignment aligns on the index, and every slice after the first has index labels (3-5, 6-8, ...) that don't exist in df2's index (0-2), so those assignments produce NaN. Using .values strips the index and assigns the numbers positionally. So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output
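For reference, here is the fixed loop run end to end on hypothetical stand-in data (36 dummy values instead of the CSV); a numpy reshape also works and avoids the loop entirely:

```python
import pandas as pd
import numpy as np

# hypothetical stand-in for the CSV data: 36 values, 3 per experiment
df1 = pd.DataFrame({"SQ": np.arange(36.0)})

df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]["SQ"].values   # .values drops the index

# equivalent without the loop: reshape to (experiments, rows), then transpose
df3 = pd.DataFrame(df1["SQ"].values.reshape(12, 3).T,
                   columns=["Experiment %d" % (i + 1) for i in range(12)])
```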

Boxplot with pandas and groupby

I have the following dataset sample:
0 1
0 0 0.040158
1 2 0.500642
2 0 0.005694
3 1 0.065052
4 0 0.034789
5 2 0.128495
6 1 0.088816
7 1 0.056725
8 0 -0.000193
9 2 -0.070252
10 2 0.138282
11 2 0.054638
12 2 0.039994
13 2 0.060659
14 0 0.038562
And need a box and whisker plot, grouped by column 0. I have the following:
plt.figure()
grouped = df.groupby(0)
grouped.boxplot(column=1)
plt.savefig('plot.png')
But I end up with three subplots. How can I place all three on one plot?
Thanks.
In pandas 0.16.0 or later, you can simply do this (if the grouping column's label is the integer 0 rather than the string '0', pass by=0 instead):
df.boxplot(by='0')
Result: (figure omitted; all three boxes appear on a single plot)
I don't believe you need to use groupby.
df2 = df.pivot(columns=df.columns[0], index=df.index)
df2.columns = df2.columns.droplevel()
>>> df2
0 0 1 2
0 0.040158 NaN NaN
1 NaN NaN 0.500642
2 0.005694 NaN NaN
3 NaN 0.065052 NaN
4 0.034789 NaN NaN
5 NaN NaN 0.128495
6 NaN 0.088816 NaN
7 NaN 0.056725 NaN
8 -0.000193 NaN NaN
9 NaN NaN -0.070252
10 NaN NaN 0.138282
11 NaN NaN 0.054638
12 NaN NaN 0.039994
13 NaN NaN 0.060659
14 0.038562 NaN NaN
df2.boxplot()
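A condensed, runnable variant of this pivot approach (the sample values here are made up, and passing values explicitly to pivot avoids the droplevel step):

```python
import pandas as pd

df = pd.DataFrame({0: [0, 2, 0, 1, 0, 2, 1],
                   1: [0.040, 0.500, 0.006, 0.065, 0.035, 0.128, 0.089]})

# one column per group; boxplot then draws every group on a single axes
df2 = df.pivot(columns=0, values=1)
# df2.boxplot()   # uncomment to draw (requires matplotlib)
```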