How to fill a pandas dataframe in a list comprehension? - pandas

I need to fill a pandas dataframe from within a list comprehension.
Rows satisfying the criteria appear to be appended to the dataframe inside the function, yet at the end the dataframe is empty.
Is there a way to resolve this?
In my real code I'm doing many other calculations; this is a simplified version that reproduces the problem.
import pandas as pd

main_df = pd.DataFrame(columns=['a', 'b', 'c', 'd'])
main_df = main_df.append({'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1'}, ignore_index=True)
main_df = main_df.append({'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2'}, ignore_index=True)
main_df = main_df.append({'a': 'a3', 'b': 'b3', 'c': 'c3', 'd': 'd3'}, ignore_index=True)
main_df = main_df.append({'a': 'a4', 'b': 'b4', 'c': 'c4', 'd': 'd4'}, ignore_index=True)
print(main_df)

sub_df = pd.DataFrame()
df_columns = main_df.columns.values

def search_using_list_comprehension(row, sub_df, df_columns):
    if row[0] == 'a1' or row[0] == 'a2':
        dict = {a: b for a, b in zip(df_columns, row)}
        print('dict: ', dict)
        sub_df = sub_df.append(dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row, sub_df, df_columns) for row in main_df.values]
print(sub_df)
print(sub_df.shape)

The problem is that you create an empty frame with sub_df = pd.DataFrame() and then reuse the same name as a function parameter. The list comprehension always passes in the same, still-empty sub_df, and the frame you append to inside the function is bound to a local name only, so the outer sub_df never changes. Another issue is shadowing Python's built-in dict with your own variable; don't do this.
Here is what can be changed in your code to make it work, although I would strongly advise against it:
import pandas as pd

df_columns = main_df.columns.values
sub_df = pd.DataFrame(columns=df_columns)

def search_using_list_comprehension(row):
    global sub_df  # rebind the module-level frame instead of a local copy
    if row[0] == 'a1' or row[0] == 'a2':
        my_dict = {a: b for a, b in zip(df_columns, row)}
        print('my_dict: ', my_dict)
        # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0
        sub_df = sub_df.append(my_dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
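For this particular task neither the global nor the list comprehension is needed: a plain boolean mask produces the same sub-frame directly. A minimal sketch of the idiomatic alternative (standard pandas, works in 2.x as well):

# select the rows whose column 'a' is 'a1' or 'a2' in one step
sub_df = main_df[main_df['a'].isin(['a1', 'a2'])].reset_index(drop=True)
print(sub_df)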


Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with the code below, but pandas does not let me; instead it outputs
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
              'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:

import re

pattern = '|'.join(map(re.escape, prohibited_symbols))  # re.escape keeps '?' from acting as a regex quantifier
df = df[~df[columns_df].astype(str).apply(lambda col: col.str.contains(pattern)).any(axis=1)]

The regex alternation '|' joins your prohibited symbols into one pattern; since the .str accessor only exists on a Series, the pattern is checked column by column, and any row containing one of the symbols is dropped.
Because what you are trying does not do what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
The line above compares the frame element-wise against the list, and pandas tries to align the list with the ten selected columns, which is exactly why you get the length error. != performs only a simple inequality check, none of your cells equals the whole list, and even when such a comparison succeeds that syntax does not delete values from cells.
You'll have to use a for loop and clean every column, like this:

import re

pattern = '|'.join(map(re.escape, prohibited_symbols))  # escape '?' so the regex stays valid
for column in columns_df:
    df[column] = df[column].astype(str).str.replace(pattern, '', regex=True)
You can as well specify the values you consider as null with the na_values argument when reading the data and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?','Nan''n.a'])
df = df.dropna()
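If the goal is to drop whole rows that contain any prohibited value, rather than to blank the values out, an isin mask avoids regular expressions entirely. A minimal sketch (standard pandas; column names as in the question):

# True for every row that has one of the prohibited symbols in any selected column
bad_rows = df[columns_df].isin(['?', 'Nan', 'n.a']).any(axis=1)
df = df[~bad_rows]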

Empty cells when using an apply function

So I am trying to calculate a value from one column or another, based on which one has data available, into a new column. This is the code I have right now. It doesn't seem to notice when there is no data present and always goes to the "else" statement. My dataframe is an imported Excel file. Thanks for any advice!
def create_sulfide_col(row):
    if row["Sulphate-S(HCL Leachable)_%S"] is None:
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val

df["Sulphide-S(calc)-C_%S"] = df.apply(lambda row: create_sulfide_col(row), axis='columns')
This can be done with numpy.where:

import numpy as np

# where the HCL-leachable value is missing, fall back to the plain sulphate column
df['newcol'] = np.where(
    df["Sulphate-S(HCL Leachable)_%S"].isna(),
    df["Total-S_%S"] - df["Sulphate-S_%S"],
    df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"],
)

The if branch of your function never fires because empty cells read from a file come back as NaN, not None, and NaN is not caught by is None; isna() is the check you want.
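An equivalent and arguably clearer spelling uses fillna, which substitutes the fallback column only where the preferred one is missing (a sketch under the same column-name assumptions as above):

# subtract the HCL-leachable value where present, otherwise the plain sulphate value
df['newcol'] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"].fillna(df["Sulphate-S_%S"])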

pandas groupby transform: multiple functions applied at the same time with custom names

As the title suggests, I want to be able to do the following, best explained with some code [pandas 0.20.1 is mandatory]:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.rand(10, 4),
                 columns=[['a', 'a', 'b', 'b'], ['alfa', 'beta', 'alfa', 'beta']])

def as_is(x):
    return x

def power_2(x):
    return x**2

# desired result
a.transform([as_is, power_2])
The problem is that the functions could be more complex than this, and I would then lose the "naming" feature: pandas.DataFrame.transform only accepts a list to be passed, whereas a dictionary would have been most convenient.
going back to the basics, I got to this:
dict_funct = {'as_is': as_is, 'power_2': power_2}

def wrapper(x):
    return pd.concat({k: x.apply(v) for k, v in dict_funct.items()}, axis=1)

a.groupby(level=[0, 1], axis=1).apply(wrapper)
but the output DataFrame is all NaN, presumably due to the ordering of the multi-index columns. Is there any way I can fix this?
If you need the dict, remove the axis parameter in concat so it falls back to the default (axis=0), but then it is necessary to add the parameter group_keys=False and call unstack:
def wrapper(x):
    return pd.concat({k: x.apply(v) for k, v in dict_funct.items()})

a.groupby(level=[0, 1], axis=1, group_keys=False).apply(wrapper).unstack(0)
Similar solution:
def wrapper(x):
    return pd.concat({k: x.transform(v) for k, v in dict_funct.items()})

a.groupby(level=[0, 1], axis=1, group_keys=False).apply(wrapper).unstack(0)
Another solution is to simply pass a list comprehension:
a.transform([v for k, v in dict_funct.items()])
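If the goal is custom labels rather than the functions' own names, note that transform labels each sub-column with the function's __name__ attribute, so renaming the function is enough. A small sketch, assuming the as_is/power_2 setup above (the 'squared' label is just an example):

def power_2(x):
    return x**2

power_2.__name__ = 'squared'  # transform uses __name__ as the column label
a.transform([as_is, power_2])  # the function level now reads 'as_is', 'squared'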

when reading an html (pandas.read_html), how to select dataframe and set_index in one line

I'm reading an html which brings back a list of dataframes. I want to be able to choose the dataframe from the list and set my index (index_col) in the least amount of lines.
Here is what I have right now:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)
df2 = df[4]  # here I'm assigning df2 to dataframe #4 from the list of dataframes I read
df2.set_index('Date', inplace=True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to pick one dataframe from the list, or can I select it as soon as I read the list of dataframes (df)?
Thanks.
Anyway:
import pandas as pd
df = (pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)[4]
        .set_index('Date'))
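A side note: hard-coding [4] breaks as soon as the page layout changes. read_html also accepts a match argument (a string or compiled regex) that keeps only tables containing that text, which can make the selection more robust. A hedged sketch, assuming the insider-trading table contains the text 'Date':

df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue',
                  header=0, match='Date')[0].set_index('Date')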

Looking up multiple values from a pandas DataFrame

I have been struggling to find an elegant way of looking up multiple values from a pandas DataFrame. Assume we have a dataframe df that holds the “result” R, that depends on multiple index keys, and we have another dataframe keys where each row is a list of values to look up from df. The problem is to loop over the keys and look up the corresponding value from df. If the value does not exist in df, I expect to get a np.nan.
So far I have come up with three different methods, but I feel that all of them lack elegance. So my question is there another prettier method for multiple lookups? Note that the three methods below all give the same result.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(5),
                   'B': range(10, 15),
                   'C': range(100, 105),
                   'R': np.random.rand(5)}).set_index(['A', 'B', 'C'])
print('df')
print(df)
keys = pd.DataFrame({'A': [0, 0, 5], 'B': [10, 10, 10], 'C': [100, 100, 100]})
print('--')
print('keys')
print(keys)

# By merge
print('--')
print(pd.merge(df.reset_index(), keys, on=['A', 'B', 'C'], how='right').reset_index().R)

# By reindex
print('--')
print(df.reindex(keys.set_index(['A', 'B', 'C']).index).reset_index().R)

# By apply
print('--')
print(keys.apply(lambda s: df.R.get((s.A, s.B, s.C)), axis=1).to_frame('R').R)
I think update is pretty.
result = keys.set_index(['A', 'B', 'C'])  # same index shape as df
result['R'] = np.nan  # add a NaN column to be filled
Then use update:
result.update(df)
                 R
A B  C
0 10 100  0.068085
     100  0.068085
5 10 100       NaN
I found an even simpler solution:
keys = (pd.DataFrame({'A':[0,0,5],'B':[10,10,10],'C':[100,100,100]})
.set_index(['A','B','C']))
keys['R'] = df
or equivalently (and friendlier for method chaining):
keys.assign(R = df)
That's all that is needed. The automatic alignment of the index does the rest of the work! :-)
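The same alignment also works when you assign just the Series, which is a little more robust across pandas versions than assigning the whole frame (a minimal sketch):

keys['R'] = df['R']  # aligns on the (A, B, C) MultiIndex; keys absent from df become NaN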