I have been struggling to find an elegant way of looking up multiple values from a pandas DataFrame. Assume we have a DataFrame df that holds a result column R, indexed by multiple key columns, and another DataFrame keys where each row is a combination of key values to look up in df. The problem is to loop over the keys and look up the corresponding value from df; if a combination does not exist in df, I expect to get np.nan.
So far I have come up with three different methods, but I feel that all of them lack elegance. So my question is: is there another, prettier method for multiple lookups? Note that the three methods below all give the same result.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(5),
                   'B': range(10, 15),
                   'C': range(100, 105),
                   'R': np.random.rand(5)}).set_index(['A', 'B', 'C'])
print('df')
print(df)

keys = pd.DataFrame({'A': [0, 0, 5], 'B': [10, 10, 10], 'C': [100, 100, 100]})
print('--')
print('keys')
print(keys)

# By merge
print('--')
print(pd.merge(df.reset_index(), keys, on=['A', 'B', 'C'], how='right').reset_index().R)

# By reindex
print('--')
print(df.reindex(keys.set_index(['A', 'B', 'C']).index).reset_index().R)

# By apply
print('--')
print(keys.apply(lambda s: df.R.get((s.A, s.B, s.C)), axis=1).to_frame('R').R)
I think update is pretty.
result = keys.set_index(['A', 'B', 'C'])  # same index structure as df
result['R'] = np.nan  # add an R column of NaN
Then use update:
result.update(df)
                 R
A B  C
0 10 100  0.068085
     100  0.068085
5 10 100       NaN
I found an even simpler solution:
keys = (pd.DataFrame({'A':[0,0,5],'B':[10,10,10],'C':[100,100,100]})
.set_index(['A','B','C']))
keys['R'] = df
or similarly (and more chaining-compatible):
keys.assign(R=df)
That's all that is needed. The automatic alignment of the index does the rest of the work! :-)
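For anyone who wants to run this end to end, here is a minimal self-contained sketch of the alignment approach, reusing the df and keys from the question (I assign df['R'] rather than the bare frame, which behaves the same here but makes explicit which column aligns):

import numpy as np
import pandas as pd

# Source frame: result column R indexed by the three key columns.
df = pd.DataFrame({'A': range(5),
                   'B': range(10, 15),
                   'C': range(100, 105),
                   'R': np.random.rand(5)}).set_index(['A', 'B', 'C'])

# Keys to look up; (5, 10, 100) is absent from df, so it yields NaN.
keys = (pd.DataFrame({'A': [0, 0, 5], 'B': [10, 10, 10], 'C': [100, 100, 100]})
        .set_index(['A', 'B', 'C']))

# Index alignment fills matching rows and leaves NaN for missing keys.
keys['R'] = df['R']
print(keys)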
Related
I need to fill a pandas DataFrame from within a list comprehension. Rows satisfying the criteria appear to get appended to the DataFrame, yet at the end the DataFrame is empty. Is there a way to resolve this?
In my real code I'm doing many other calculations; this is a simplified version that reproduces the problem.
import pandas as pd

main_df = pd.DataFrame(columns=['a', 'b', 'c', 'd'])
main_df = main_df.append({'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1'}, ignore_index=True)
main_df = main_df.append({'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2'}, ignore_index=True)
main_df = main_df.append({'a': 'a3', 'b': 'b3', 'c': 'c3', 'd': 'd3'}, ignore_index=True)
main_df = main_df.append({'a': 'a4', 'b': 'b4', 'c': 'c4', 'd': 'd4'}, ignore_index=True)
print(main_df)
sub_df = pd.DataFrame()
df_columns = main_df.columns.values

def search_using_list_comprehension(row, sub_df, df_columns):
    if row[0] == 'a1' or row[0] == 'a2':
        dict = {a: b for a, b in zip(df_columns, row)}
        print('dict: ', dict)
        sub_df = sub_df.append(dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row, sub_df, df_columns) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
The problem is that you define an empty frame with sub_df = pd.DataFrame(), then shadow that variable with the function parameter, and inside the list comprehension you always pass in the same, still-empty sub_df. The frame you append to within the function is local to that function only. Another issue is shadowing Python's built-in dict with your own variable; don't do this.
Here is what can be changed in your code in order to work, but I would strongly advise against it:
import pandas as pd

df_columns = main_df.columns.values
sub_df = pd.DataFrame(columns=df_columns)

def search_using_list_comprehension(row):
    global sub_df
    if row[0] == 'a1' or row[0] == 'a2':
        my_dict = {a: b for a, b in zip(df_columns, row)}
        print('dict: ', my_dict)
        sub_df = sub_df.append(my_dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
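For reference, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and the idiomatic replacement for this whole pattern is a single boolean mask over main_df; a minimal sketch:

# Select the matching rows in one vectorized step instead of
# appending inside a loop.
sub_df = main_df[main_df['a'].isin(['a1', 'a2'])].reset_index(drop=True)
print(sub_df)
print(sub_df.shape)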
I have multiple dataframes. Each dataframe has the columns variable_code, variable_description, year. For example:
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get the union of all variable_codes across df1 and df2:
variable_code, variable_description
N1 Number of returns
N2 Number of Exemptions
NUMDEP # of dependent
There is documentation for merge in the pandas docs.
Since the columns you want to merge on are both called "variable_code", you can use on='variable_code'.
so the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want blanks where you have data in only one of the tables. Use how='inner' if you want only data that is in both tables (no blanks).
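One caveat: merging on variable_code alone leaves two suffixed description columns (variable_description_x and variable_description_y) in the outer case; to reproduce the single-column result in the question, merge on both shared columns. A sketch with the question's df1 and df2:

merged = df1.merge(df2, on=['variable_code', 'variable_description'], how='outer')
print(merged)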
To achieve what you need, try this:
import pandas as pd
from functools import reduce  # reduce lives in functools in Python 3

# Create the first dataframe from a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1', 'N2'],
         'variable_description': ['Number of returns', 'Number of Exemptions']}
df1 = pd.DataFrame(data=data1)

# Create the second dataframe.
data2 = {'variable_code': ['N1', 'NUMDEP'],
         'variable_description': ['Number of returns', '# of dependent']}
df2 = pd.DataFrame(data=data2)

# Place the dataframes in a list.
dfs = [df1, df2]  # additional dfs can be added here

# You could loop over the list merging the dfs, but here reduce and a lambda are used.
resultant_df = reduce(lambda left, right: pd.merge(left, right,
                                                   on=['variable_code', 'variable_description'],
                                                   how='outer'), dfs)
This gives:
>>> resultant_df
variable_code variable_description
0 N1 Number of returns
1 N2 Number of Exemptions
2 NUMDEP # of dependent
There are several options available for how, each catering to different needs. outer, used here, keeps even the rows that appear in only one of the frames. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2 using
final_df = pd.concat([df1, df2])
Then convert the variable_code and variable_description columns into a dictionary, with variable_code as the keys and variable_description as the values (this also deduplicates the codes):
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
Then convert d back into a dataframe:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
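An equivalent sketch without the dict round-trip uses drop_duplicates; note it keeps the first description seen for each code, whereas the dict keeps the last:

# Stack both frames and keep one row per variable_code.
d_df = (pd.concat([df1, df2])
        .drop_duplicates('variable_code')
        .reset_index(drop=True))
print(d_df)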
This turned out to be a more difficult problem than I expected when I started.
I want to pull a particular column from several tables in an SQLite database, make each into a Series, and then combine them into a single DataFrame.
I have tried this but failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3

con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my sqldb
tmplist = ['A003060', 'A003070']  # the db contains those tables; I call only two for practice

for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con,
                             index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only prints the separate pieces, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question before and got an answer, but that one used predefined variables. Here I must use a loop because I have to deal with a series of large databases. I have already tried DataFrame.append and transpose(), but failed.
I would appreciate some small hints. Thank you.
I think you can create a list, append the data to it, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con,
                             index_col=None)['Close'].head(5)
    dfs.append(listSeries)

df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
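If you prefer, the loop collapses into a dict comprehension fed straight to concat; a sketch assuming the same con connection and tmplist from the question (interpolating table names into SQL is only acceptable here because tmplist is hard-coded, never user input):

# Each table name becomes a column label via the dict keys.
df = pd.concat(
    {name: pd.read_sql("SELECT * FROM %s" % name, con)['Close'].head(5)
     for name in tmplist},
    axis=1)
print(df)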
I'm reading an HTML page, which brings back a list of dataframes. I want to be able to choose a dataframe from the list and set my index (index_col) in the fewest lines possible.
Here is what I have right now:
import pandas as pd

df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)
df2 = df[4]  # assign df2 to dataframe #4 from the list of dataframes I read
df2.set_index('Date', inplace=True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to pull one dataframe out of the list, or can I select the dataframe as soon as I read the list of dataframes (df)?
Thanks.
You can chain it all into one line:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)[4].set_index('Date')
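As a variant, read_html itself accepts an index_col parameter that is applied to every parsed table; a sketch assuming Date is the first column of table 4 (check the position against your actual page):

df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue',
                  header=0, index_col=0)[4]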
I would like to create a new column newcol in a dataframe df as the result of
df.groupby('keycol').apply(somefunc)
The obvious:
df['newcol'] = df.groupby('keycol').apply(somefunc)
does not work: either df['newcol'] ends up containing all NaNs (which is certainly not what the RHS evaluates to), or some exception is raised (the details of the exception vary wildly depending on what somefunc returns).
I have tried many variations of the above, including stuff like
import pandas as pd
df['newcol'] = pd.Series(df.groupby('keycol').apply(somefunc), index=df.index)
They all fail.
The only thing that has worked requires defining an intermediate variable:
import pandas as pd
tmp = df.groupby('keycol').apply(lambda x: pd.Series(somefunc(x)))
tmp.index = df.index
df['rank'] = tmp
Is there a way to achieve this without having to create an intermediate variable?
(The documentation for GroupBy.apply is almost content-free.)
Let's build up an example and I think I can illustrate why your first attempts are failing:
Example data:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})
print(df.head(10))
results in:
coef expenditure groupid
0 0.874076 bar one
1 -0.972586 foo two
2 -0.003457 bar one
3 -0.893106 bar one
4 -0.387922 bar two
5 -0.109405 bar two
6 1.275657 foo two
7 -0.318801 foo two
8 -1.134889 bar two
9 1.812964 foo two
So if we apply a simple function, mean, to the grouped data we get the following (selecting the numeric column first, since mean is undefined for the string columns):
df2 = df.groupby('groupid')[['coef']].apply(np.mean)
print(df2)
Which is:
coef
groupid
one -0.215539
two 0.149459
So the dataframe above is indexed by groupid and has one column, coef.
What you tried to do first was, effectively, the following:
df['newcol'] = df2
That gives all NaNs for newcol, because the assignment aligns on the index: df2 is indexed by groupid ('one'/'two') while df has a plain integer index, so no labels match. I think what you really want to do is merge df2 back into df.
To merge df and df2 we need to remove the index from df2, rename the new column, then merge:
df2 = df.groupby('groupid')[['coef']].apply(np.mean)
df2.reset_index(inplace=True)
df2.columns = ['groupid', 'newcol']
df.merge(df2)
which I think is what you were after.
This is such a common idiom that Pandas includes the transform method which wraps all this up into a much simpler syntax:
df['newcol'] = df.groupby('groupid')['coef'].transform('mean')
print(df.head())
results:
coef expenditure groupid newcol
0 1.705825 foo one -0.025112
1 -0.608750 bar one -0.025112
2 -1.215015 bar one -0.025112
3 -0.831478 foo two -0.073560
4 2.174040 bar one -0.025112
Better documentation is in the pandas groupby documentation.
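As a usage note, transform also accepts arbitrary functions, as long as they return either a scalar per group or an array the same length as the group; a sketch demeaning coef within each group (the coef_demeaned column name is just for illustration):

# transform broadcasts each group's result back to the original row order.
df['coef_demeaned'] = df.groupby('groupid')['coef'].transform(lambda s: s - s.mean())
print(df.head())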