Perform similar computations in every dataframe in a list of dataframes - pandas

I have a list of 18 different dataframes. The only thing these dataframes have in common is that each contains a variable that ends with "_spec". The computations I would like to perform on each dataframe in the list are as follows:
1. return the number of columns in each dataframe that are numeric;
2. filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above); and
3. store the results of #2 in a separate list of 18 dataframes.
I can get the output that I would like for each individual dataframe with the following:
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) # count (negative) no. of numeric vars in df
lvmo_spec = df[df.sum(numeric_only=True,axis=1)==lvmo_numlength].filter(regex='_spec') # does ^ = sum of numeric vars?
lvmo_spec.to_list()
but I don't want to copy and paste this 18(+) times...
I am new to writing functions and loops; I know they can be used to perform this procedure, but I don't know how to execute it. The code below shows the abomination I have created, which can't even get off the ground. Any suggestions?
# make list of dataframes
name_list = [lvmo, trx_nonrx, pd, odose_drg, fx, cpn_use, dem_hcc, dem_ori, drg_man, drg_cou, nlx_gvn, nlx_ob, opd_rsn, opd_od, psy_yn, sti_prep_tkn, tx_why, tx_curtx]
# create variable that satisfies condition 1
def numlen(name):
    return name + "_numlen"
# create variable that satisfies condition 2
def spec(name):
    return name + "_spec"
# loop it all together
for name in name_list:
    numlen(name) = -len(name.select_dtypes('number').columns.tolist())
    spec(name) = name[name.sum(numeric_only=True,axis=1)]==numlen(name).filter(regex='spec')

You can achieve what I believe your question is asking as follows, given an input df_list, which is a list of dataframes:
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
Explanation:
- For each input dataframe, create a new dataframe as follows: for rows where the sum of the values in the numeric columns is <= 0 and equal in magnitude to the number of numeric columns, select only those columns with a label ending in '_spec'.
- Use a list comprehension to compile the above new dataframes into a list.
Note that this can also be expressed using a standard for loop instead of a list comprehension as follows:
res_list = []
for df in df_list:
    res_list.append(df[df.sum(numeric_only=True, axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec'))
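As an aside: in the original attempt, numlen(name) = ... fails because you cannot assign to a function call. If the goal is to know which result came from which dataframe, one option is a dict keyed by a name of your choosing. A minimal sketch, assuming the dataframe names from the question (only two shown; extend the dict with the rest):
# keep each result keyed by the name of the dataframe it came from,
# instead of trying to create new variable names dynamically
named = {'lvmo': lvmo, 'trx_nonrx': trx_nonrx}  # ...and the other 16 dataframes
res_by_name = {name: df[df.sum(numeric_only=True, axis=1) == -len(df.select_dtypes('number').columns)].filter(regex='_spec')
               for name, df in named.items()}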
Sample code (using 7 input dataframe objects instead of 18):
import pandas as pd
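# build 7 test dataframes of increasing width; every third numeric column
# (col1, col4, ...) is tagged '_spec' (the dict-merge operator | requires Python 3.9+)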
df_list = [pd.DataFrame({'b':['a','b','c','d']} | {f'col{i+1}{"_spec" if not i%3 else ""}':[-1,0,0]+([0 if i!=n-1 else -n]) for i in range(n)}) for n in range(7)]
for df in df_list: print(df)
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
for df in res_list: print(df)
Input:
b
0 a
1 b
2 c
3 d
b col1_spec
0 a -1
1 b 0
2 c 0
3 d -1
b col1_spec col2
0 a -1 -1
1 b 0 0
2 c 0 0
3 d 0 -2
b col1_spec col2 col3
0 a -1 -1 -1
1 b 0 0 0
2 c 0 0 0
3 d 0 0 -3
b col1_spec col2 col3 col4_spec
0 a -1 -1 -1 -1
1 b 0 0 0 0
2 c 0 0 0 0
3 d 0 0 0 -4
b col1_spec col2 col3 col4_spec col5
0 a -1 -1 -1 -1 -1
1 b 0 0 0 0 0
2 c 0 0 0 0 0
3 d 0 0 0 0 -5
b col1_spec col2 col3 col4_spec col5 col6
0 a -1 -1 -1 -1 -1 -1
1 b 0 0 0 0 0 0
2 c 0 0 0 0 0 0
3 d 0 0 0 0 0 -6
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
col1_spec
0 -1
3 -1
col1_spec
0 -1
3 0
col1_spec
0 -1
3 0
col1_spec col4_spec
0 -1 -1
3 0 -4
col1_spec col4_spec
0 -1 -1
3 0 0
col1_spec col4_spec
0 -1 -1
3 0 0
Also, a couple of comments about the original question:
lvmo_spec.to_list() doesn't work because DataFrame has no to_list() method. Series has tolist() (and to_list()), but lvmo_spec is a DataFrame, not a Series; one way to get a flat list anyway is sketched after these notes.
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) gives a negative result. I have assumed this is intentional, and that you want the sum of each row's numeric values to be negative, but this is slightly at odds with your description, which states:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above);
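If you do want a flat Python list of the '_spec' values, a minimal sketch (stack() and tolist() are standard pandas methods; lvmo_spec is the filtered frame from the question):
# stack the '_spec' columns into a single Series, then convert that Series to a list
flat = lvmo_spec.stack().tolist()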

Related

pandas creating new columns for each value in categorical columns

I have a pandas dataframe with some numeric and some categorical columns. I want to create a new column for each value of every categorical column, with a value of 1 in every row where that value occurs and 0 in every row where it doesn't. So the df is something like this:
col1 col2 col3
A P 1
B P 3
A Q 7
expected result is something like this:
col1 col2 col3 A B P Q
A P 1 1 0 1 0
B P 3 0 1 1 0
A Q 7 1 0 0 1
Is this possible? Can someone please help me?
Use df.select_dtypes and pd.get_dummies with pd.concat:
# First select all columns which have object dtypes
In [826]: categorical_cols = df.select_dtypes('object').columns
# Create one-hot encoding for the above cols and concat with df
In [817]: out = pd.concat([df, pd.get_dummies(df[categorical_cols])], axis=1)
In [818]: out
Out[818]:
col1 col2 col3 col1_A col1_B col2_P col2_Q
0 A P 1 1 0 1 0
1 B P 3 0 1 1 0
2 A Q 7 1 0 0 1
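Note that pd.get_dummies prefixes each dummy column with the source column name (col1_A, col2_P, ...). If you want the bare value names from the expected output (A, B, P, Q), a sketch using get_dummies's prefix and prefix_sep parameters to suppress the prefix:
# drop the column-name prefix so the dummy columns are named after the values themselves
out = pd.concat([df, pd.get_dummies(df[categorical_cols], prefix='', prefix_sep='')], axis=1)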

Pandas, create new column applying groupby values

I have a DF:
Col1 Col2 Label
0 0 5345
1 0 7574
2 0 3445
0 1 2126
1 1 4653
2 1 9566
So I'm grouping by Col1 and Col2 to get an index value based on the Label column, like this:
df_gb = df.groupby(['Col1','Col2'])['Label'].agg(['sum', 'count'])
df_gb['sum_count'] = df_gb['sum'] / df_gb['count']
sum_count_total = df_gb['sum_count'].sum()
index = df_gb['sum_count'] / 10
Col2 Col1
0 0 2.996036
1 3.030063
2 3.038579
1 0 2.925314
1 2.951295
2 2.956083
2 0 2.875549
1 2.899254
2 2.905063
Everything so far is as I expected. But now I would like to assign this 'index' groupby result back to my original 'df' based on those two groupby columns. With a single column this works with the map() function, but not when I want to assign index values based on the combination of two columns.
df_index = df.copy()
df_index['index'] = df.groupby([]).apply(index)
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I tried agg() and transform(), but without success. Any ideas on how to proceed?
Thanks in advance.
Hristo.
I believe you need join:
a = df.join(index.rename('new'), on=['Col1','Col2'])
print (a)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
Or GroupBy.transform:
df['new']=df.groupby(['Col1','Col2'])['Label'].transform(lambda x: x.sum() / x.count()) / 10
print (df)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
And if there are no NaNs in the Label column, use the solution from Zero's suggestion (thank you):
df.groupby(['Col1','Col2'])['Label'].transform('mean') / 10
If you need to count only non-NaN values with count, use the solution with transform and the explicit lambda.
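As a quick sanity check that the 'mean' spelling and the sum/count lambda agree, a small sketch (the frame is reconstructed from the question):
import pandas as pd
df = pd.DataFrame({'Col1': [0, 1, 2, 0, 1, 2],
                   'Col2': [0, 0, 0, 1, 1, 1],
                   'Label': [5345, 7574, 3445, 2126, 4653, 9566]})
# both aggregations skip NaNs by default, so they agree here
a = df.groupby(['Col1', 'Col2'])['Label'].transform('mean') / 10
b = df.groupby(['Col1', 'Col2'])['Label'].transform(lambda x: x.sum() / x.count()) / 10
print(a.equals(b))  # True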

Pandas iterate max value of a variable length slice in a series

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
       '2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
       '2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
       '2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
       '2003-01-27']
a = pd.DataFrame([1, 2, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 5, 0, 1],
                 columns=['original'], index=pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros.
In that example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
1. find the zeros
2. cumsum to make groups
3. mask the zeros into their own group, -1
4. find the max location in each group with idxmax
5. drop the entry for group -1, which was just the zeros
6. take a.original at the found max locations, reindex to the full index, and fill with zeros
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
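To see what the masked cumsum produces before grouping, a quick sketch (the labels were derived by hand from the example frame above):
# each run between zeros gets its own group id; every zero row is collapsed into group -1
m = a.original.eq(0)
print(m.cumsum().mask(m, -1).tolist())
# [0, 0, -1, -1, 2, 2, 2, -1, -1, -1, 5, 5, 5, 5, 5, -1, 6]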

Pandas, multiindex, dates, HDFstore and frame_tables

I want to use a MultiIndex with dates as one of the hierarchical index types. I also want to save the DataFrame as a frame_table, so that I can select subsets from disk without loading the whole thing. I currently get an error, TypeError: [date] is not implemented as a table column, and I was wondering if I am using the MultiIndex incorrectly or if this is indeed a limitation of Pandas. Thanks!
import pandas as pd, numpy, datetime
print pd.__version__ #-> 0.13.0rc1
idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
with pd.get_store('test1.h5') as f:
    f.put('trials', df1) #-> OK
with pd.get_store('test2.h5') as f:
    f.put('trials', df1, data_columns=True, format='t') #-> TypeError: [date] is not implemented as a table column
Use datetime.datetime as these types can be stored efficiently. Docs are here for an example of storing a multi-index frame in a HDFStore.
When storing a multi-index, you MUST specify names for the levels (HDFStore currently won't warn you if you store one without names; this will be addressed in the next release).
In [20]: idx1 = pd.MultiIndex.from_tuples([(datetime.datetime(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)],names=['date','s','t'])
In [21]: df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)
You need to store as a table (put stores in Fixed format, unless append is specified).
In [22]: df1.to_hdf('test.h5','df',mode='w',format='table')
In [23]: pd.read_hdf('test.h5','df')
Out[23]:
a b
date s t
2013-12-01 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[12 rows x 2 columns]
Sample selection
In [8]: pd.read_hdf('test.h5','df',where='date=20131202')
Out[8]:
a b
date s t
2013-12-02 0 0 0 0
1 0 0
2 0 0
1 0 0 0
1 0 0
2 0 0
[6 rows x 2 columns]
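If you already have a frame indexed by datetime.date, a minimal sketch for converting it in place before storing (set_levels, pd.to_datetime, and the index names attribute are standard pandas features; df1 is the frame from the question):
# convert the date level to datetime64 and name the levels, then store as a table
df1.index = df1.index.set_levels(pd.to_datetime(df1.index.levels[0]), level=0)
df1.index.names = ['date', 's', 't']
df1.to_hdf('test.h5', 'df', mode='w', format='table')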

Setting values in a matrix in bulk

The question is about bulk-changing values in a matrix based on data contained in a vector.
Suppose I have a 5x4 matrix of zeroes.
octave> Z = zeros(5,4)
Z =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
And a column vector of length equal to the number of rows in Z, that is, 5. The rows in the vector y correspond to rows in the matrix Z.
octave> y = [1; 3; 2; 1; 3]
y =
1
3
2
1
3
What I want is to set 1's in the matrix Z in the columns whose indices are given by the values in the corresponding rows of the vector y. Namely, I'd like the Z matrix to look like this:
Z = # y =
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
0 1 0 0 # <-- 2 nd column
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
Is there a concise way of doing it? I know I can implement it using a loop over y, but I have a feeling Octave could have a more laconic way. I am new to Octave.
Since Octave has automatic broadcasting (you'll need Octave 3.6.0 or later), the easiest way I can think of is to use it with a comparison. Here's how (note the range runs over the 4 columns of Z):
octave> 1:4 == [1 3 2 1 3]'
ans =
1 0 0 0
0 0 1 0
0 1 0 0
1 0 0 0
0 0 1 0
Broadcasting is explained in the Octave manual, but SciPy also has a good explanation for it with nice pictures.
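As an aside for readers coming from the pandas/NumPy side of this page, the same comparison trick carries over to NumPy broadcasting; a minimal sketch with the values from the question:
import numpy as np
y = np.array([1, 3, 2, 1, 3])  # 1-based column indices from the question
# compare a row of column numbers against a column vector of targets
Z = (np.arange(1, 5) == y[:, None]).astype(int)
print(Z)
# [[1 0 0 0]
#  [0 0 1 0]
#  [0 1 0 0]
#  [1 0 0 0]
#  [0 0 1 0]]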
Found another solution that does not use broadcasting. It does not need a matrix of zeroes either.
octave> y = [1; 3; 2; 1; 3]
octave> eye(4)(y,:)
ans =
1 0 0 0
0 0 1 0
0 1 0 0
1 0 0 0
0 0 1 0
Relevant reading here:
http://www.gnu.org/software/octave/doc/interpreter/Creating-Permutation-Matrices.html