I want to use a MultiIndex with dates as one of the hierarchical index levels. I also want to save the DataFrame as a frame_table so that I can select subsets from disk without loading the whole thing. I currently get this error: TypeError: [date] is not implemented as a table column. Am I using the MultiIndex incorrectly, or is this indeed a limitation of pandas? Thanks!
import pandas as pd, numpy, datetime
print pd.__version__ #-> 0.13.0rc1
idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
with pd.get_store('test1.h5') as f:
    f.put('trials', df1)  #-> OK
with pd.get_store('test2.h5') as f:
    f.put('trials', df1, data_columns=True, format='t')  #-> TypeError: [date] is not implemented as a table column
Use datetime.datetime rather than datetime.date, as these types can be stored efficiently. Docs are here for an example of storing a multi-index frame in an HDFStore.
When storing a multi-index, you MUST specify names for the levels (ATM HDFStore won't warn you if you try to store one without names; this will be addressed in the next release).
In [20]: idx1 = pd.MultiIndex.from_tuples([(datetime.datetime(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)],names=['date','s','t'])
In [21]: df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
You need to store it as a table (put writes the fixed format unless format='table' or append is specified).
In [22]: df1.to_hdf('test.h5','df',mode='w',format='table')
In [23]: pd.read_hdf('test.h5','df')
Out[23]:
                a  b
date       s t
2013-12-01 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0
2013-12-02 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0

[12 rows x 2 columns]
Sample selection
In [8]: pd.read_hdf('test.h5','df',where='date=20131202')
Out[8]:
                a  b
date       s t
2013-12-02 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0

[6 rows x 2 columns]
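Since the levels are named, they are stored as queryable columns, so (as I understand it) you can combine conditions on several levels in the where string as well; a sketch, output not shown:

In [9]: pd.read_hdf('test.h5','df',where='date>=20131202 & s=0')  # should return the three rows for s=0 on 2013-12-02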
Related
I have a list of 18 different dataframes. The only thing these dataframes have in common is that each contains a variable that ends with "_spec". The computations I would like to perform on each dataframe in the list are as follows:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above); and
store the results of #2 in a separate list of 18 dataframes
I can get the output that I would like for each individual dataframe with the following:
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) # count (negative) no. of numeric vars in df
lvmo_spec = df[df.sum(numeric_only=True,axis=1)==lvmo_numlength].filter(regex='_spec') # does ^ = sum of numeric vars?
lvmo_spec.to_list()
but I don't want to copy and paste this 18(+) times...
I am new to writing functions and loops, but I know these can be utilized to perform the procedure I desire; yet I don't know how to execute it. The below code shows the abomination I have created, which can't even make it off the ground. Any suggestions?
# make list of dataframes
name_list = [lvmo, trx_nonrx, pd, odose_drg, fx, cpn_use, dem_hcc, dem_ori, drg_man, drg_cou, nlx_gvn, nlx_ob, opd_rsn, opd_od, psy_yn, sti_prep_tkn, tx_why, tx_curtx]

# create variable that satisfies condition 1
def numlen(name):
    return name + "_numlen"

# create variable that satisfies condition 2
def spec(name):
    return name + "_spec"

# loop it all together
for name in name_list:
    numlen(name) = -len(name.select_dtypes('number').columns.tolist())
    spec(name) = name[name.sum(numeric_only=True,axis=1)]==numlen(name).filter(regex='spec')
You can achieve what I believe your question is asking as follows, given an input df_list, which is a list of dataframes:
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
Explanation:
for each input dataframe, create a new dataframe as follows: for rows where the sum of the values in numeric columns is <=0 and is equal in magnitude to the number of numeric columns, select only those columns with a label ending in '_spec'
use a list comprehension to compile the above new dataframes into a list
Note that this can also be expressed using a standard for loop instead of a list comprehension as follows:
res_list = []
for df in df_list:
    res_list.append(df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec'))
Sample code (using 7 input dataframe objects instead of 18):
import pandas as pd
df_list = [pd.DataFrame({'b':['a','b','c','d']} | {f'col{i+1}{"_spec" if not i%3 else ""}':[-1,0,0]+([0 if i!=n-1 else -n]) for i in range(n)}) for n in range(7)]
for df in df_list: print(df)
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
for df in res_list: print(df)
Input:
b
0 a
1 b
2 c
3 d
b col1_spec
0 a -1
1 b 0
2 c 0
3 d -1
b col1_spec col2
0 a -1 -1
1 b 0 0
2 c 0 0
3 d 0 -2
b col1_spec col2 col3
0 a -1 -1 -1
1 b 0 0 0
2 c 0 0 0
3 d 0 0 -3
b col1_spec col2 col3 col4_spec
0 a -1 -1 -1 -1
1 b 0 0 0 0
2 c 0 0 0 0
3 d 0 0 0 -4
b col1_spec col2 col3 col4_spec col5
0 a -1 -1 -1 -1 -1
1 b 0 0 0 0 0
2 c 0 0 0 0 0
3 d 0 0 0 0 -5
b col1_spec col2 col3 col4_spec col5 col6
0 a -1 -1 -1 -1 -1 -1
1 b 0 0 0 0 0 0
2 c 0 0 0 0 0 0
3 d 0 0 0 0 0 -6
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
col1_spec
0 -1
3 -1
col1_spec
0 -1
3 0
col1_spec
0 -1
3 0
col1_spec col4_spec
0 -1 -1
3 0 -4
col1_spec col4_spec
0 -1 -1
3 0 0
col1_spec col4_spec
0 -1 -1
3 0 0
Also, a couple of comments about the original question:
lvmo_spec.to_list() doesn't work because to_list() is not defined. There is a method named tolist(), but it will only work for a Series (not a DataFrame).
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) gives a negative result. I have assumed this is your intention, and that you want the sum of each row's numeric values to have a negative value, but this is slightly at odds with your description which states:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above);
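Since you mention being new to functions and loops, the same logic can also be wrapped in a named helper; a minimal sketch (filter_spec is a name I made up):

def filter_spec(df):
    # minus the number of numeric columns, to match the row-sum comparison above
    n_numeric = -len(df.select_dtypes('number').columns)
    # keep qualifying rows, then keep only columns whose label contains '_spec'
    return df[df.sum(numeric_only=True, axis=1) == n_numeric].filter(regex='_spec')

res_list = [filter_spec(df) for df in df_list]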
I have to take this dataframe:
d = {'Apple': [0,0,1,0,1,0], 'Aurora': [0,0,0,0,0,1], 'Barn': [0,1,1,0,0,0]}
df = pd.DataFrame(data=d)
Apple Aurora Barn
0 0 0 0
1 0 0 1
2 1 0 1
3 0 0 0
4 1 0 0
5 0 1 0
And count the frequency of the number one in each column, and create a new dataframe that looks like this:
df = pd.DataFrame([['Apple',0.3333], ['Aurora',0.166666], ['Barn', 0.3333]], columns = ['index', 'value'])
index value
0 Apple 0.333300
1 Aurora 0.166666
2 Barn 0.333300
I have tried this:
df['freq'] = df.groupby(1)[1].transform('count')
But I get an error: KeyError: 1
So I'm not sure how to count the value 1 across rows and columns, and group by column names and the frequency of 1 in each column.
If I understand correctly, you could simply take the column means; since each column contains only 0s and 1s, the mean of a column is exactly the frequency of 1s:
freq = df.mean()
Output:
>>> freq
Apple 0.333333
Aurora 0.166667
Barn 0.333333
dtype: float64
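If you specifically want the two-column frame from your example, one way (a sketch) is to reset the index and name the values column:

out = freq.reset_index(name='value')  # index labels become an 'index' column
print(out)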
Let's assume I have a pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
'2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
'2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
'2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
'2003-01-27']
a = pd.DataFrame([1,2,0,0,1,2,3,0,0,0,1,2,3,4,5,0,1],
columns = ['original'], index = pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros.
In that example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
1. find the zeros
2. cumsum to make groups
3. mask the zeros into their own group, -1
4. find the max location in each group with idxmax
5. drop the entry for group -1, which was only for the zeros anyway
6. take a.original at the found max locations, then reindex to the full index and fill with zeros
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
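To see what step 3 produces, here are the intermediate group labels for the sample series: every zero falls into group -1 and each nonzero run gets its own label (the list below was worked out by hand from the sample data):

m = a.original.eq(0)
print(m.cumsum().mask(m, -1).tolist())
# [0, 0, -1, -1, 2, 2, 2, -1, -1, -1, 5, 5, 5, 5, 5, -1, 6]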
Here is the question I have in mind. Given a table:
Id type
0 1 [a,b]
1 2 [c]
2 3 [a,d]
I want to convert it into the form of:
Id a b c d
0 1 1 1 0 0
1 2 0 0 1 0
2 3 1 0 0 1
I need a very efficient way to convert a large table. Any comment is welcome.
====================================
I have received several good answers, and really appreciate your help.
Now a new question comes along: my laptop's memory is insufficient for generating the whole dataframe with pd.get_dummies.
Is there any way to generate a sparse vector row by row and then stack them together?
Try this
>>> df
Id type
0 1 [a, b]
1 2 [c]
2 3 [a, d]
>>> df2 = pd.DataFrame([x for x in df['type'].apply(
... lambda item: dict(map(
... lambda x: (x,1),
... item))
... ).values]).fillna(0)
>>> df2.join(df)
a b c d Id type
0 1 1 0 0 1 [a, b]
1 0 0 1 0 2 [c]
2 1 0 0 1 3 [a, d]
It basically converts the list of lists to a list of dicts and constructs a DataFrame out of that:
[ ['a', 'b'], ['c'], ['a', 'd'] ] # list of list
[ {'a':1, 'b':1}, {'c':1}, {'a':1, 'd':1} ] # list of dict
A DataFrame is then built from the list of dicts.
Try this:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
To explain:
df.type.apply(lambda x: pd.Series([i for i in x]))
gets you a column for each index position in your lists. You can then use get_dummies to get the count of each value:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
outputs:
a c b d
0 1 0 1 0
1 0 1 0 0
2 1 0 0 1
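For the memory concern in the edit above, one option (a sketch, not benchmarked) is to explode the lists to one row per tag and ask get_dummies for sparse columns, then aggregate back to one row per Id; explode needs pandas >= 0.25, and the groupby step may densify the result depending on your version:

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3], 'type': [['a','b'], ['c'], ['a','d']]})

exploded = df['type'].explode()                  # one row per (row, tag) pair
dummies = pd.get_dummies(exploded, sparse=True)  # sparse indicator columns
out = df[['Id']].join(dummies.groupby(level=0).max())
print(out)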
This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
Where there may also be a step prior to this, with the 1s as NaN, and fillna(1) is used to insert 1s where strings have been.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and the codes can recombine, so doing this with a dict and replace() would take about as long as doing it by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.
This works on 0.13.1; docs are here and here.
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
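A side note for readers on current pandas: convert_objects was deprecated and later removed; a rough equivalent of the steps above (a sketch) is pd.to_numeric with errors='coerce':

# coerce non-numeric strings to NaN column by column, then fill the NaNs with 1
df = df.apply(pd.to_numeric, errors='coerce').fillna(1)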