How to access the (multi)index of a DataFrame?

I have a data frame and use some of its columns to group by:
grouped = df.groupby(['col1', 'col2'])
Now I use the mean function to get a new data frame from the groupby object created above:
df_new = grouped.mean()
Now I have two data frames (df and df_new) and I would like to merge them using col1 and col2. The problem is that df_new does not have these columns: after the groupby operation, col1 and col2 are "shifted" into the index. So, to resolve this problem, I tried to create these columns:
df_new['col1'] = df_new['index'][0]
df_new['col2'] = df_new['index'][1]
But it does not work because 'index' is not recognized as a column of the data frame.

As an alternative to Andy Hayden's method, you could pass as_index=False to keep the grouping columns as regular columns rather than index levels:
df2 = df.groupby(['col1', 'col2'], as_index=False).mean()
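A minimal, self-contained sketch of the difference (val is a made-up column name):
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': ['x', 'y', 'x'],
                   'val': [1.0, 2.0, 3.0]})

# Default: the grouping columns become a MultiIndex on the result.
df.groupby(['col1', 'col2']).mean()

# as_index=False: col1 and col2 stay ordinary columns, so a later
# merge on them works directly.
df2 = df.groupby(['col1', 'col2'], as_index=False).mean()
print(df2.columns.tolist())  # ['col1', 'col2', 'val']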

You can use left_index (or right_index) arguments of merge:
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s).
If it is a MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels
and left_on (or right_on) to specify which columns of the other frame the index should be matched against.
So it'll be something like:
pd.merge(df, df_new, left_on=['col1', 'col2'], right_index=True)
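Equivalently, you can turn the index levels back into ordinary columns first with reset_index and then merge on them by name:
# col1 and col2 become regular columns again
df_new = grouped.mean().reset_index()
pd.merge(df, df_new, on=['col1', 'col2'])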

Related

DataFrame Groupby apply on second dataframe?

I have 2 dataframes df1, df2. Both have id as a column. I want to compute a new column, weighted_average, in df1 that is a function of the values in df2 with the same id.
First, I think I should do df1.groupby("id"). Is it possible to use GroupBy.apply(...) and have it use values from df2? In the examples I've seen, it usually just operates on df1 values.
If both frames have the same id positions and length, you can do something like:
df2["new column name"] = df1["column name"].apply(...)

Proper way to join data based on conditions

I want to add a new column (name: conc) to the dataframe "table", using the values in its columns (plate, ab) to look up the numeric value in the dataframe "concs".
Below is what I mean, with the dataframe "exp" used to show what I expect the data to look like.
What is the proper way to do this? Is it using some multiple condition, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt with a left join to create the new column concs; where there is no match, NaN is produced:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table,on=['plate', 'ab'], how='left')
The solution can be simplified: if both DataFrames share the column names 'plate' and 'ab' and the merge is by both, the on parameter can be omitted:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
First melt the concs dataframe and then merge with table:
out = (concs.melt(id_vars=['plate'],
                  value_vars=concs.columns.drop('plate').tolist(),
                  var_name='ab')
            .merge(table, on=['plate', 'ab'])
            .rename(columns={'value': 'concs'}))
or just make good use of melt's parameters, as in jezrael's answer:
out = (concs.melt(id_vars=['plate'],
                  value_name='concs',
                  var_name='ab')
            .merge(table, on=['plate', 'ab']))
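For reference, a toy example of the layouts these answers assume (a wide concs with one column per ab, and a table keyed by plate and ab; all names made up):
import pandas as pd

concs = pd.DataFrame({'plate': ['P1', 'P2'],
                      'abA': [0.1, 0.2],
                      'abB': [0.3, 0.4]})
table = pd.DataFrame({'plate': ['P1', 'P1', 'P2'],
                      'ab': ['abA', 'abB', 'abB'],
                      'sample': ['s1', 's2', 's3']})

# Long format + left join; (P2, abA) has no row in table, so its
# sample is NaN.
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(
    table, how='left')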

How do I combine multiple dataframes using a repeating index system?

I have multiple dataframes that I want to combine and only want to use the indexing system of the first dataframe. The problem is the indices I want to use are repeating and I want to keep it that way.
df = pd.concat([df1, df2, df3], axis=1, join='inner')
This gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Just so it's clear: df1 has repeating indices (0-9, repeated multiple times), whereas df2 and df3 are single-column dataframes with non-repeating indices. The number of rows does match, though.
From what I understand, your index repeats itself on df1. That is what causes the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects: because the index cycles through the values 0-9, pandas can never tell which row to align with which, since the labels are not unique. My approach would be to just use join, but if you want to use concat for your own reasons:
A few ways to do this:
Using the join function:
df1.join([df2, df3])
But if you insist on using concat, I would:
x = df1.index
df1 = df1.reset_index(drop=True)  # reset_index returns a copy; assign it back
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # restore the original repeating index
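A quick check of the concat route with toy frames (a repeating 0-1 index on df1, default indices on df2 and df3):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[0, 1, 0, 1])
df2 = pd.DataFrame({'b': [5, 6, 7, 8]})
df3 = pd.DataFrame({'c': [9, 10, 11, 12]})

x = df1.index
df1 = df1.reset_index(drop=True)
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # index is 0, 1, 0, 1 again, with columns a, b, c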

Remove rows from multiple dataframes that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding the rows that contain "bad" values in a given dataframe is done with, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (or n) dataframes' columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date and then apply the cleanup function; dropping "bad" values is automatic, since the entire row gets dropped. But what happens if a date is missing from one dataframe and not the other? [and they still happen to be the same length?]
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concat the frames column-wise, using an inner join, and drop the bad rows:
newdf = pd.concat([df1, df2], axis=1, keys=[1, 2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with dropna on the original dfs:
df1, df2 = [s[1].loc[:, s[0]].combine_first(x.dropna())
            for x, s in zip([df1, df2], newdf.groupby(level=0, axis=1))]
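For the general n-dataframe case, a simpler sketch (assuming all frames share the same index) is to build one combined bad-row mask and apply it everywhere:
import numpy as np
import pandas as pd

idx = pd.date_range('2009-10-08', periods=3)
df1 = pd.DataFrame({'a': [1.0, np.inf, 3.0]}, index=idx)
df2 = pd.DataFrame({'b': [4.0, 5.0, 6.0]}, index=idx)

dfs = [df1, df2]
# True for any row that is NaN or +/-inf in ANY of the frames.
bad = pd.concat([d.replace([np.inf, -np.inf], np.nan).isna().any(axis=1)
                 for d in dfs], axis=1).any(axis=1)
# 2009-10-09 is dropped from both frames.
df1, df2 = (d[~bad] for d in dfs)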

Add a column using another column as its index [duplicate]

Is it possible to only merge some columns? I have a DataFrame df1 with columns x, y, z, and df2 with columns x, a, b, c, d, e, f, etc.
I want to merge the two DataFrames on x, but I only want to merge columns df2.a, df2.b - not the entire DataFrame.
The result would be a DataFrame with x, y, z, a, b.
I could merge then delete the unwanted columns, but it seems like there is a better method.
You want to use TWO brackets (a list of column names inside the selection), so if you are doing a VLOOKUP sort of action:
df = pd.merge(df, df2[['Key_Column', 'Target_Column']], on='Key_Column', how='left')
This will give you everything in the original df + add that one corresponding column in df2 that you want to join.
You could merge the sub-DataFrame (with just those columns):
df2[list('xab')] # df2 but only with columns x, a, and b
df1.merge(df2[list('xab')])
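A quick usage example with the column layout from the question (toy data):
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
df2 = pd.DataFrame({'x': [1, 2], 'a': [7, 8], 'b': [9, 10], 'c': [11, 12]})

out = df1.merge(df2[list('xab')])   # merges on the shared column x
print(out.columns.tolist())         # ['x', 'y', 'z', 'a', 'b']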
If you want to drop column(s) from the target data frame, but the column(s) are required for the join, you can do the following:
df1 = df1.merge(df2[['a', 'b', 'key1']], how='left',
                left_on='key2', right_on='key1').drop(columns=['key1'])
The .drop(columns=['key1']) part will prevent 'key1' from being kept in the resulting data frame, despite it being required to join in the first place.
You can use .iloc to select specific columns across all rows and merge on that selection. An example is below:
pandas.merge(dataframe1, dataframe2.iloc[:, 0:5], how='left', on='key')
In this example, you are merging dataframe1 and dataframe2 with a left join on 'key'. For dataframe2, .iloc lets you specify the rows and columns you want numerically: the : selects all rows, and 0:5 selects the first five columns (note that 'key' must be among them for the merge to work). You could use .loc to specify columns by name instead, but if you're dealing with long column names, .iloc may be easier.
This is to merge selected columns from two tables.
If table_1 contains columns t1_a, t1_b, t1_c, ..., id, ..., t1_z,
and table_2 contains columns t2_a, t2_b, t2_c, ..., id, ..., t2_z,
and only t1_a, id, t2_a are required in the final table, then:
mergedCSV = table_1[['t1_a', 'id']].merge(table_2[['t2_a', 'id']], on='id', how='left')
# save resulting output file
mergedCSV.to_csv('output.csv',index = False)
Slight extension of the accepted answer for multi-character column names, using an inner join by default:
df1 = df1.merge(df2[["Key_Column", "Target_Column1", "Target_Column2"]])
This assumes that Key_Column is the only column both dataframes have in common.