Calculate a new value using another dataframe - pandas

I am looking for a way to divide all columns in a dataframe by the values of a column from another df. This can be done using either of the two options below.
df_amenity_normalized = df_amenity.apply(
    lambda row: row / df_targets['Population'].loc[row.name], axis=1)
Or join the tables and then calculate:
ndf=df_amenity.merge(df_targets, left_index=True, right_index=True)
ndft=ndf.apply(lambda x: x/ndf.Population, axis='rows' )
df_amenity_normalized1 = ndft.drop(columns=['Population', 'GNI', 'GDP', 'BM Dollar', 'HDI'])
Is there any other way to achieve the same results?
Data is available here...
df_targets = pd.read_csv('https://raw.githubusercontent.com/njanakiev/osm-predict-economic-measurements/master/data/economic_measurements.csv', index_col='country')
df_targets.drop(columns='country_code', inplace=True)
df_targets = df_targets[['Population', 'GNI', 'GDP', 'BM Dollar', 'HDI']]
df_amenity = pd.read_csv('https://raw.githubusercontent.com/njanakiev/osm-predict-economic-measurements/master/data/country_amenity_counts.csv')
df_amenity.set_index('country', inplace=True)
df_amenity.drop(columns='country_code', inplace=True)

You can use the df.div() function from pandas. See below:
df_amenity.div(df_targets['Population'], axis = 0)
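As a minimal sketch of what div() does here, the example below uses small made-up frames rather than the linked CSV files; the names counts and population and all values are hypothetical. With axis=0 the Series is aligned on the row index, so each row is divided by the population of the matching country.

```python
import pandas as pd

# Hypothetical amenity counts, indexed by country.
counts = pd.DataFrame({"cafe": [10, 40], "school": [2, 8]}, index=["A", "B"])

# Hypothetical population Series sharing the same index labels.
population = pd.Series({"A": 2, "B": 4}, name="Population")

# axis=0 aligns the Series with the row index and divides every column by it.
normalized = counts.div(population, axis=0)
print(normalized)
```

Rows whose label is missing from the Series come out as NaN, so it is worth checking that both indexes contain the same countries.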

Related

python: aggregate columns in pivot table with multiindex structure

If I have a multi-index pivot table like this:
What would be the way to aggregate the total 'sum' and 'count' for all dates?
I want to see additional column with totals for all rows in the table.
Thanks to @Nik03 for the idea. The concat method returns the required data frame, but with a single index level. To add it to the original dataframe, you have to create the columns first and assign the new dataframes to them:
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(1), table_to_record.filter(like='count').sum(1)], axis=1)
table_to_show.columns = ['sum', 'count']
table_to_record['total_sum'] = table_to_show['sum']
table_to_record['total_count'] = table_to_show['count']
column_1st = table_to_record.pop('total_sum')
column_2nd = table_to_record.pop('total_count')
table_to_record.insert(0, 'total_sum', column_1st)
table_to_record.insert(1,'total_count', column_2nd)
and here is the result:
One way:
df1 = pd.concat([df.filter(like='sum').sum(1), df.filter(like='mean').sum(1)], axis=1)
df1.columns = ['sum', 'mean']
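Since the original table is not shown, here is a minimal sketch on made-up data. It assumes the pivot table's columns are a two-level MultiIndex with the aggregation name ('sum', 'count') on top and the dates below; the names table and totals and all values are hypothetical.

```python
import pandas as pd

# Hypothetical pivot table: top column level = aggregation, second level = date.
cols = pd.MultiIndex.from_product([["sum", "count"], ["2021-01", "2021-02"]])
table = pd.DataFrame(
    [[10, 20, 1, 2], [30, 40, 3, 4]], index=["a", "b"], columns=cols
)

# table["sum"] selects every date column under the 'sum' level,
# so summing across axis=1 gives one total per row.
totals = pd.DataFrame(
    {
        "total_sum": table["sum"].sum(axis=1),
        "total_count": table["count"].sum(axis=1),
    }
)
print(totals)
```

Selecting by the top level avoids relying on substring matches in the column names, which may matter if a date or value label happens to contain 'sum' or 'count'.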

How to convert the result of DataFrame groupby().agg() to a new Dataframe

Sounds basic, but...
I have a dataframe df with (yy, mm, dd, value1, value2,...)
df1 = df.groupby(['yy','dd'], as_index = False).agg({'value1':['count'],'value2':['sum']})
This works OK and returns df1, a multi-index object that I can 'visualize' with e.g. df1.info()
Q: How do I convert this df1 into a 'basic' 2D DataFrame?
You need to drop the multilevel from the pandas column, and then reset index. You can try this:-
df1 = df.groupby(['yy','dd'], as_index = True).agg({'value1':['count'],'value2':['sum']})
df1.columns = df1.columns.droplevel()
df1.reset_index(inplace=True)
Hope this solves your problem.
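Another option, not mentioned in the answer above, is named aggregation, which produces flat column names directly so there is no level to drop. This is only a sketch on made-up data; the output names value1_count and value2_sum are hypothetical choices.

```python
import pandas as pd

# Hypothetical data with the columns described in the question.
df = pd.DataFrame(
    {
        "yy": [2020, 2020, 2021],
        "mm": [1, 2, 1],
        "dd": [1, 1, 2],
        "value1": [5, 6, 7],
        "value2": [1.0, 2.0, 3.0],
    }
)

# Each keyword becomes a flat output column: name=(source column, function).
flat = df.groupby(["yy", "dd"], as_index=False).agg(
    value1_count=("value1", "count"),
    value2_sum=("value2", "sum"),
)
print(flat)
```

With as_index=False the group keys stay as ordinary columns, so the result is already a plain 2D DataFrame.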

convert rows of dataframe to separate dataframes

I need to convert the rows of a dataframe to separate 1-row dataframes. Looking for the most efficient / clean approach here.
I need to preserve the column names; it is for a machine learning model and I basically need a list of dataframes.
My current solution:
def get_data(filename):
    dataframe = pd.read_csv(filename, sep=';')
    dataframes = []
    for i,row in dataframe.iterrows():
        dataframes.append(row.to_frame().T)
    return dataframes
This looks very inefficient, maybe there is a cleaner shorter solution.
Use:
dataframe = pd.read_csv(filename, sep=';')
dataframes = [dataframe.iloc[[i]] for i in range(len(dataframe))]
Or:
dataframe = pd.read_csv(filename, sep=';')
dataframes = [x.to_frame().T for i,x in dataframe.T.items()]
Try:
df_list = []
_ = dataframe.apply(lambda x: df_list.append(x.to_frame().T),axis=1)
If I understood correctly, what you want is something like this:
start = 0
end = dataframe.shape[0]
dataframes = dataframe.loc[start:end]
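One practical difference between the approaches above is how they treat dtypes. The sketch below, on a small made-up frame instead of the CSV file, compares the iloc[[i]] version with the iterrows() version; the names frames and via_iterrows are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
dataframe = pd.DataFrame({"n": [1, 2, 3], "x": [0.5, 1.5, 2.5]})

# Double brackets keep each selection a one-row DataFrame with the original dtypes.
frames = [dataframe.iloc[[i]] for i in range(len(dataframe))]

# iterrows() yields each row as a Series with a single dtype,
# so the integer column is upcast to float here.
via_iterrows = [row.to_frame().T for _, row in dataframe.iterrows()]

print(frames[0].dtypes)
print(via_iterrows[0].dtypes)
```

If the model expects the original column types, the iloc[[i]] variant is probably the safer choice.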

How to resample a dataframe with different functions applied to each column if we have more than 20 columns?

I know this question has been asked before. The answer is as follows:
df.resample('M').agg({'col1': np.sum, 'col2': np.mean})
But I have 27 columns and I want to sum the first 25 and average the remaining two. Should I write ('col1' through 'col25': np.sum) for 25 columns and ('col26': np.mean, 'col27': np.mean) for the other two?
My dataframe contains hourly data and I want to convert it to monthly data. I tried something like the following, but it is nonsense:
for i in col_list:
    df = df.resample('M').agg({i-2: np.sum, 'col26': np.mean, 'col27': np.mean})
Is there any shortcut for this situation?
You can try this, without a for loop:
sum_col = ['col1','col2','col3','col4', ...]
sum_df = df.resample('M')[sum_col].sum()
mean_col = ['col26','col27']
mean_df = df.resample('M')[mean_col].mean()
df = sum_df.join(mean_df)
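A variation that keeps the single agg() call from the question is to build the dictionary programmatically instead of typing 27 entries. This is a sketch on made-up hourly data; it assumes the columns really are named col1 to col27, and it resamples to days ('D') only to keep the example small. The same dictionary can be passed to agg() with a monthly rule.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data: 48 hours, 27 columns named col1..col27, all ones.
index = pd.date_range("2021-01-01", periods=48, freq=pd.Timedelta(hours=1))
columns = [f"col{i}" for i in range(1, 28)]
df = pd.DataFrame(np.ones((48, 27)), index=index, columns=columns)

sum_cols = columns[:25]   # first 25 columns are summed
mean_cols = columns[25:]  # last 2 columns are averaged

# Map every column name to the aggregation it should receive.
agg_map = {**dict.fromkeys(sum_cols, "sum"), **dict.fromkeys(mean_cols, "mean")}

daily = df.resample("D").agg(agg_map)
print(daily.iloc[:, [0, 25, 26]])
```

Slicing the column list by position means the column names never have to be written out by hand.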

Conditional on pandas DataFrame's

Let df1, df2, and df3 be pandas.DataFrames having the same structure but different numerical values. I want to perform:
res=if df1>1.0: (df2-df3)/(df1-1) else df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() generates result as a flat array.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["paramter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["paramter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
          x         y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2  0.324390  1.222632
3 -0.138606  0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
                   columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
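For a version that stays in pandas throughout, DataFrame.where is one candidate: it keeps the values where the condition holds and takes the replacement elsewhere, and it preserves the index and columns. A minimal sketch with made-up values (the numbers below are hypothetical, chosen so the result is easy to check by hand):

```python
import pandas as pd

# Hypothetical frames with identical index and columns.
df1 = pd.DataFrame({"x": [2.0, 3.0], "y": [0.5, 1.0]})
df2 = pd.DataFrame({"x": [5.0, 5.0], "y": [5.0, 5.0]})
df3 = pd.DataFrame({"x": [1.0, 1.0], "y": [1.0, 1.0]})

# Keep the ratio where df1 > 1, otherwise fall back to df3.
res = ((df2 - df3) / (df1 - 1)).where(df1 > 1, df3)
print(res)
```

The ratio is computed for every cell before the condition is applied, so cells where df1 equals 1 briefly hold inf; those cells are then replaced by the df3 values.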
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust values for your condition.
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)