I need to convert the rows of a dataframe into separate one-row dataframes. I am looking for the most efficient and clean approach.
I need to keep the column names. It is for a machine learning model, and I basically need a list of dataframes.
My current solution:
def get_data(filename):
    dataframe = pd.read_csv(filename, sep=';')
    dataframes = []
    for i, row in dataframe.iterrows():
        dataframes.append(row.to_frame().T)
    return dataframes
This looks very inefficient. Maybe there is a cleaner, shorter solution?
Use:
dataframe = pd.read_csv(filename, sep=';')
dataframes = [dataframe.iloc[[i]] for i in range(len(dataframe))]
Or:
dataframe = pd.read_csv(filename, sep=';')
dataframes = [x.to_frame().T for i,x in dataframe.T.items()]
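The two variants above are not quite equivalent. As a minimal sketch, assuming a small made-up frame in place of the CSV file, selecting with a one-element list keeps each column's dtype, while going through a row Series and transposing back upcasts mixed-type rows to object:

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV file.
dataframe = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Selecting with a one-element list returns a one-row DataFrame
# with the original column names and dtypes.
by_iloc = [dataframe.iloc[[i]] for i in range(len(dataframe))]

# A row Series holding an int and a str has dtype object,
# so the transposed one-row frame has object columns.
by_transpose = [row.to_frame().T for _, row in dataframe.iterrows()]

print(by_iloc[0].dtypes["a"])       # same dtype as dataframe["a"]
print(by_transpose[0].dtypes["a"])  # object
```

If the model expects numeric dtypes, the iloc variant is probably the safer choice.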
Try:
df_list = []
_ = dataframe.apply(lambda x: df_list.append(x.to_frame().T),axis=1)
If I understood correctly, you want something like this:
start = 0
end = dataframe.shape[0]
dataframes = dataframe.loc[start:end]
I have a problem with Pandas' DataFrame Object.
I have read first excel file and I have DataFrame like this:
First DataFrame
And read second excel file like this:
Second DataFrame
I need to concatenate the rows, and the result should look like this:
Third DataFrame
I have code like this:
import pandas as pd
import numpy as np
x1 = pd.ExcelFile("x1.xlsx")
df1 = pd.read_excel(x1, "Sheet1")
x2 = pd.ExcelFile("x2.xlsx")
df2 = pd.read_excel(x2, "Sheet1")
result = pd.merge(df1, df2, how="outer")
The second df just follows the first df. How can I get a dataframe laid out like the third one?
merge does not concatenate the dfs the way you want. On older pandas versions you can use append (deprecated in pandas 1.4 and removed in 2.0):
ndf = df1.append(df2).sort_values('name')
On current versions, use concat:
ndf = pd.concat([df1, df2]).sort_values('name')
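A minimal sketch of the concat approach on made-up data. The original screenshots are not available, so the column names 'name' and 'value' and the rows are assumptions:

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel sheets.
df1 = pd.DataFrame({"name": ["a", "c"], "value": [1, 3]})
df2 = pd.DataFrame({"name": ["b", "d"], "value": [2, 4]})

# Stack the rows, then sort so rows from both frames are interleaved by name.
ndf = (
    pd.concat([df1, df2], ignore_index=True)
    .sort_values("name")
    .reset_index(drop=True)
)
print(ndf)
```

Passing ignore_index=True avoids duplicate index labels in the stacked frame.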
I'm trying to harmonize the column names in all my data frames so that I can concatenate them and create one table. I'm struggling to create a loop over multiple dataframes. The code does not fail, but it does not work either. Here is an example of two dataframes and a list that includes the dataframes:
df_test = pd.DataFrame({'HHLD_ID': [6, 7, 8, 9, 10],
                        'sales': [25, 50, 25, 25, 50],
                        'units': [1, 2, 1, 1, 2],
                        })
df_test2 = pd.DataFrame({'HHLD_ID': [1, 2, 3, 4, 5],
                         'sale': [25, 50, 25, 25, 50],
                         'unit': [1, 2, 1, 1, 2],
                         })
list_df_export = [df_test, df_test2]
Here is what I have tried...
for d in list_df_export:
    if 'sale' in d:
        d = d.rename(columns={"sale": "sales", 'unit': 'units'})
Here is what I would like df_test2 to look like...
You can use:
d = {'sale':'sales','unit':'units'}
pd.concat(i.rename(columns=d) for i in list_df_export)
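The loop in the question has no visible effect because rename returns a new DataFrame, and assigning it to the loop variable d only rebinds that name; the list and the original frames are untouched. A minimal sketch of the rebuild-then-concat approach, using shortened versions of the question's frames (the names mapping, harmonized and combined are illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({"HHLD_ID": [6, 7], "sales": [25, 50], "units": [1, 2]})
df_test2 = pd.DataFrame({"HHLD_ID": [1, 2], "sale": [25, 50], "unit": [1, 2]})
list_df_export = [df_test, df_test2]

mapping = {"sale": "sales", "unit": "units"}

# rename returns a new frame; keys missing from a frame are ignored by default,
# so no "if 'sale' in d" check is needed.
harmonized = [d.rename(columns=mapping) for d in list_df_export]
combined = pd.concat(harmonized, ignore_index=True)

print(list(combined.columns))
print(list(df_test2.columns))  # the original frame is unchanged
```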
Maybe the "inplace" option can help you:
for d in list_df_export:
    if 'sale' in d:
        # rename(..., inplace=True) modifies d and returns None, so do not assign the result
        d.rename(columns={"sale": "sales", 'unit': 'units'}, inplace=True)
You can try:
df_test2.columns = df_test.columns
This will make the columns in df_test2 have the same names as those in df_test, provided both frames have the same number of columns in the same order.
Is this what you need?
Is there a way to conveniently merge two data frames side by side?
Both data frames have 30 rows but a different number of columns: say, df1 has 20 columns and df2 has 40 columns.
How can I easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
Or maybe there is some special parameter in append:
df3 = pd.append(df1, df2, left_index=False, right_index=false, how='left')
PS: if possible, I hope duplicated column names can be resolved automatically.
thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
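Note that concat keeps duplicated column names as they are. As a minimal sketch on made-up frames (the suffixes "_df1" and "_df2" are arbitrary choices), DataFrame.join can rename overlapping columns while placing the frames side by side:

```python
import pandas as pd

# Hypothetical frames sharing the column name "b".
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [7, 8, 9], "c": [10, 11, 12]})

# concat keeps both "b" columns under the same name.
side_by_side = pd.concat([df1, df2], axis=1)

# join aligns on the index and adds suffixes only to overlapping names.
joined = df1.join(df2, lsuffix="_df1", rsuffix="_df2")

print(list(side_by_side.columns))
print(list(joined.columns))
```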
I came across your question while I was trying to achieve something like the following:
Once I had sliced my dataframes, I first ensured that their indexes were the same. In your case both dataframes need to be indexed from 0 to 29. Then I merged both dataframes on the index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
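The reason the index matters: side-by-side operations align rows by index label, not by position. A minimal sketch with made-up frames whose indexes do not overlap:

```python
import pandas as pd

# Hypothetical slices that kept their original, non-matching index labels.
df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"b": [3, 4]}, index=[5, 6])

# Labels do not match, so the result has the union of labels and NaN gaps.
misaligned = pd.concat([df1, df2], axis=1)

# Resetting both indexes makes the rows line up by position.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)

print(misaligned.shape)  # (4, 2)
print(aligned.shape)     # (2, 2)
```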
If you want to combine two data frames that share a common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is a way to do this with a Pipeline.
Use a pipeline to transform your numerical data, for example (DataFrameSelector is a custom column-selecting transformer, not part of scikit-learn):
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
num_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(num_columns)),  # num_columns: list of numerical column names
    ("imputer", SimpleImputer(strategy="median")),
])
And for categorical data:
cat_pipeline = Pipeline([
    ("select_cat", DataFrameSelector(cat_columns)),  # cat_columns: list of categorical column names
    ("cat_encoder", OneHotEncoder(sparse=False)),  # on scikit-learn >= 1.2 use sparse_output=False
])
Then use a FeatureUnion to combine these transformations:
preprocess_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()
I am looking for a way to divide all columns in a dataframe by the value of a column from another df. This can be done using either of the two options below.
df_amenity_normalized = df_amenity.apply(
    lambda row: row / df_targets['Population'].loc[row.name], axis=1)
Or join the tables and then calculate:
ndf=df_amenity.merge(df_targets, left_index=True, right_index=True)
ndft=ndf.apply(lambda x: x/ndf.Population, axis='rows' )
df_amenity_normalized1 = ndft.drop(columns=['Population', 'GNI', 'GDP', 'BM Dollar', 'HDI'])
Is there any other way to achieve the same results?
Data is available here...
df_targets = pd.read_csv('https://raw.githubusercontent.com/njanakiev/osm-predict-economic-measurements/master/data/economic_measurements.csv', index_col='country')
df_targets.drop(columns='country_code', inplace=True)
df_targets = df_targets[['Population', 'GNI', 'GDP', 'BM Dollar', 'HDI']]
df_amenity = pd.read_csv('https://raw.githubusercontent.com/njanakiev/osm-predict-economic-measurements/master/data/country_amenity_counts.csv')
df_amenity.set_index('country', inplace=True)
df_amenity.drop(columns='country_code', inplace=True)
You can use the df.div() function from pandas. See below:
df_amenity.div(df_targets['Population'], axis = 0)
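A minimal sketch of why axis=0 is needed, using a tiny made-up frame in place of the downloaded CSV files (the country labels and counts are assumptions). With axis=0 the Series is matched against the row index, so each row is divided by its own country's population:

```python
import pandas as pd

# Hypothetical amenity counts and populations, both indexed by country.
df_amenity = pd.DataFrame(
    {"cafe": [10, 40], "school": [20, 80]}, index=["A", "B"]
)
df_targets = pd.DataFrame({"Population": [10, 20]}, index=["A", "B"])

# Divide every column row-wise by the matching Population value.
normalized = df_amenity.div(df_targets["Population"], axis=0)

# Same result as the row-by-row apply from the question.
via_apply = df_amenity.apply(
    lambda row: row / df_targets["Population"].loc[row.name], axis=1
)

print(normalized)
```

The div version is vectorized and avoids calling a Python function once per row.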
Let df1, df2, and df3 be pandas DataFrames that have the same structure but different numerical values. I want to compute:
res=if df1>1.0: (df2-df3)/(df1-1) else df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() returns the result as a plain NumPy array, without the index and column labels.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["paramter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["paramter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
          x         y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2  0.324390  1.222632
3 -0.138606  0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write, as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
columns=df1.columns)
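As a possible pandas-only alternative, DataFrame.where keeps the labels without rebuilding the frame. This is a minimal sketch on small made-up frames, not the asker's data; where keeps values where the condition is True and takes the replacement elsewhere:

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same shape and labels.
df1 = pd.DataFrame({"x": [2.0, 0.5], "y": [3.0, 1.0]})
df2 = pd.DataFrame({"x": [5.0, 5.0], "y": [9.0, 9.0]})
df3 = pd.DataFrame({"x": [1.0, 7.0], "y": [3.0, 8.0]})

ratio = (df2 - df3) / (df1 - 1)

# Keep ratio where df1 > 1, otherwise fall back to df3.
res = ratio.where(df1 > 1, df3)

# The same values as a bare array from numpy.where.
as_array = np.where(df1 > 1, ratio, df3)

print(res)
```

Both versions evaluate the ratio for every cell first, so a zero denominator where df1 equals 1 yields inf in the intermediate result but is replaced by the df3 value in the output.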
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
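Then adjust the values where your condition holds. Use .loc for this; chained indexing such as df[mask]['res'] = ... assigns to a temporary copy and leaves df unchanged.
mask = df['df1'] > 1.0
df.loc[mask, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)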
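A worked instance of this single-frame approach with concrete numbers. It assumes, as the answer does, that the three inputs are columns named 'df1', 'df2' and 'df3' of one frame; the values are made up:

```python
import pandas as pd

# Hypothetical data: the three inputs as columns of one frame.
df = pd.DataFrame(
    {"df1": [2.0, 0.5], "df2": [5.0, 5.0], "df3": [1.0, 7.0]}
)

# Start from the fallback value.
df["res"] = df["df3"]

# Overwrite only the rows where the condition holds; the right-hand side
# is aligned on the index, so only the masked rows are taken from it.
mask = df["df1"] > 1.0
df.loc[mask, "res"] = (df["df2"] - df["df3"]) / (df["df1"] - 1)

print(df["res"].tolist())  # [4.0, 7.0]
```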