I'm trying to understand how to concat three individual dataframes (i.e df1, df2, df3) into a new dataframe say df4 whereby each individual dataframe has its own column left to right order.
I've tried using concat with axis = 1 to do this, but it appears not possible to automate this with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column
For example,
get_table1_3P =
get_table1_2P_max_3Io =
get_table1_3Io =
Ultimately, i would like to see the following:
I believe you need first concat and tthen change order by list of columns names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
Related
I want to add a new column to a datframe "table" (name: conc) which uses the values in columns (plate, ab) to get the numeric value from the dataframe "concs"
Below is what I mean, with the dataframe "exp" used to show what I expect the data to look like
what is the proper way to do this. Is it using some multiple condition, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt with left join for new column concs, if no match is created NaNs:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table,on=['plate', 'ab'], how='left')
Solution should be simplify - if same columns names 'plate', 'ab' in both DataFrames and need merge by both is possible omit on parameter:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
First melt the concs dataframe and then merge with table:
out = concs.melt(id_vars=['plate'],
value_vars=concs.columns.drop('plate').tolist(),
var_name='ab').merge(table, on=['plate', 'ab'
]).rename(columns={'value': 'concs'})
or just make good use of parameters of melt like in jezraels' answer:
out = concs.melt(id_vars=['plate'],
value_name='concs',
var_name='ab').merge(table, on=['plate', 'ab'])
is there a way to conveniently merge two data frames side by side?
both two data frames have 30 rows, they have different number of columns, say, df1 has 20 columns and df2 has 40 columns.
how can i easily get a new data frame of 30 rows and 60 columns?
df3 = pd.someSpecialMergeFunct(df1, df2)
or maybe there is some special parameter in append
df3 = pd.append(df1, df2, left_index=False, right_index=false, how='left')
ps: if possible, i hope the replicated column names could be resolved automatically.
thanks!
You can use the concat function for this (axis=1 is to concatenate as columns):
pd.concat([df1, df2], axis=1)
See the pandas docs on merging/concatenating: http://pandas.pydata.org/pandas-docs/stable/merging.html
I came across your question while I was trying to achieve something like the following:
So once I sliced my dataframes, I first ensured that their index are the same. In your case both dataframes needs to be indexed from 0 to 29. Then merged both dataframes by the index.
df1.reset_index(drop=True).merge(df2.reset_index(drop=True), left_index=True, right_index=True)
If you want to combine 2 data frames with common column name, you can do the following:
df_concat = pd.merge(df1, df2, on='common_column_name', how='outer')
I found that the other answers didn't cut it for me when coming in from Google.
What I did instead was to set the new columns in place in the original df.
# list(df2) gives you the column names of df2
# you then use these as the column names for df
df[list(df2)] = df2
There is way, you can do it via a Pipeline.
** Use a pipeline to transform your numerical Data for ex-
Num_pipeline = Pipeline
([("select_numeric", DataFrameSelector([columns with numerical value])),
("imputer", SimpleImputer(strategy="median")),
])
**And for categorical data
cat_pipeline = Pipeline([
("select_cat", DataFrameSelector([columns with categorical data])),
("cat_encoder", OneHotEncoder(sparse=False)),
])
** Then use a Feature union to add these transformations together
preprocess_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Read more here - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
This solution also works if df1 and df2 have different indices:
df1.loc[:, df2.columns] = df2.to_numpy()
I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following to achieve the traditional calculation like mean(), sum()etc.
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max()) to column 'A' and 'D' for rows 2 to 4
How can I append row to the existing df where two values would be mean() of the A and D and two values would be of specific formula result of C and B.
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
Q2:
You can use dictionaries and combine them and add it as a row.
First one is mean, the second one is some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = df.append(ad, ignore_index=True)
Is it possible to only merge some columns? I have a DataFrame df1 with columns x, y, z, and df2 with columns x, a ,b, c, d, e, f, etc.
I want to merge the two DataFrames on x, but I only want to merge columns df2.a, df2.b - not the entire DataFrame.
The result would be a DataFrame with x, y, z, a, b.
I could merge then delete the unwanted columns, but it seems like there is a better method.
You want to use TWO brackets, so if you are doing a VLOOKUP sort of action:
df = pd.merge(df,df2[['Key_Column','Target_Column']],on='Key_Column', how='left')
This will give you everything in the original df + add that one corresponding column in df2 that you want to join.
You could merge the sub-DataFrame (with just those columns):
df2[list('xab')] # df2 but only with columns x, a, and b
df1.merge(df2[list('xab')])
If you want to drop column(s) from the target data frame, but the column(s) are required for the join, you can do the following:
df1 = df1.merge(df2[['a', 'b', 'key1']], how = 'left',
left_on = 'key2', right_on = 'key1').drop(columns = ['key1'])
The .drop(columns = 'key1') part will prevent 'key1' from being kept in the resulting data frame, despite it being required to join in the first place.
You can use .loc to select the specific columns with all rows and then pull that. An example is below:
pandas.merge(dataframe1, dataframe2.iloc[:, [0:5]], how='left', on='key')
In this example, you are merging dataframe1 and dataframe2. You have chosen to do an outer left join on 'key'. However, for dataframe2 you have specified .iloc which allows you to specific the rows and columns you want in a numerical format. Using :, your selecting all rows, but [0:5] selects the first 5 columns. You could use .loc to specify by name, but if your dealing with long column names, then .iloc may be better.
This is to merge selected columns from two tables.
If table_1 contains t1_a,t1_b,t1_c..,id,..t1_z columns,
and table_2 contains t2_a, t2_b, t2_c..., id,..t2_z columns,
and only t1_a, id, t2_a are required in the final table, then
mergedCSV = table_1[['t1_a','id']].merge(table_2[['t2_a','id']], on = 'id',how = 'left')
# save resulting output file
mergedCSV.to_csv('output.csv',index = False)
Slight extension of the accepted answer for multi-character column names, using inner join by default:
df1 = df1.merge(df2[["Key_Column", "Target_Column1", "Target_Column2"]])
This assumes that Key_Column is the only column both dataframes have in common.
I have two dataframes, one contains screen names/display names and another contains individuals, and I am trying to create a third dataframe that contains all the data from each dataframe in a new row for each time a last name appears in the screen name/display name. Functionally this will create a list of possible matching names. My current code, which works perfectly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')
# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')
userid, screen_name, real_name, last_name, first_name = [],[],[],[],[]
for index1, row1 in individuals.iterrows():
for index2, row2 in usernames.iterrows():
if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
userid.append(row2['UserID'])
screen_name.append(row2['Screen_Name'])
real_name.append(row2['Real_Name'])
last_name.append(row1['Last_Name'])
first_name.append(row1['First_Name'])
cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row from each column, the list comprehension methods I have tried do not seem to work.
The thing you want is to iterate through a dataframe faster. Doing that with a list comprehension is, taking data out of a pandas dataframe, handling it using operations in python, then putting it back in a pandas dataframe. The fastest way (currently, with small data) would be to handle it using pandas iteration methods.
The next thing you want to do is work with 2 dataframes. There is a tool in pandas called join.
result = pd.merge(usernames, individuals, on=['Screen_Name', 'Last_Name'])
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html