Pandas: Joining information from multiple data frames

Suppose I have three data structures:
A data frame df1, with columns A, B, C of length 10000
A data frame df2, with columns A, some extra misc. columns... of length 8000
A Python list labels of length 8000, where the element at index i corresponds with row i in df2.
I'm trying to create a data frame that, for every element in df2.A, pairs the relevant row from df1 with the corresponding label. It's possible that an entry in df2.A is NOT present in df1.A.
Currently, I'm doing this through a for i in xrange(len(df2)) loop, checking if df2.A.iloc[i] is present in df1.A; if it is, I store df1.A, df1.B, df1.C, labels[i] in a dictionary, with the first element as the key and the rest of the elements as a list.
Is there a more efficient way to do this and store the outputs df1.A, df1.B, df1.C, labels[i] in a four-column dataframe? The for loop is really slow.
Sample data:
df1
A B C
'uid1' 'Bob' 'Rock'
'uid2' 'Jack' 'Pop'
'uid5' 'Cat' 'Country'
...
df2
A
'uid10'
'uid3'
'uid1'
...
labels
[label10, label3, label1, ...]

OK, from what I understand, the following should work:
# create a new column for your labels, this will align to your index
df2['labels'] = labels
# now merge the rows from df1 on column 'A'
df2 = df2.merge(df1, on='A', how='left')
Example:
# setup my sample data
import io
import pandas as pd

temp = """A B C
'uid1' 'Bob' 'Rock'
'uid2' 'Jack' 'Pop'
'uid5' 'Cat' 'Country'"""
temp1 = """A
'uid10'
'uid3'
'uid1'"""
labels = ['label10', 'label3', 'label1']
df1 = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df2 = pd.read_csv(io.StringIO(temp1))
In [97]:
# do the work
df2['labels'] = labels
df2 = df2.merge(df1, on='A', how='left')
df2
Out[97]:
A labels B C
0 'uid10' label10 NaN NaN
1 'uid3' label3 NaN NaN
2 'uid1' label1 'Bob' 'Rock'
This will be considerably faster than looping.
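As a small aside (not part of the original answer), if you also need to know which rows of df2 failed to find a match in df1, merge's indicator parameter adds a _merge column you can filter on. A sketch using the same sample data:

```python
import io
import pandas as pd

temp = """A B C
'uid1' 'Bob' 'Rock'
'uid2' 'Jack' 'Pop'
'uid5' 'Cat' 'Country'"""
temp1 = """A
'uid10'
'uid3'
'uid1'"""

df1 = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df2 = pd.read_csv(io.StringIO(temp1))
df2['labels'] = ['label10', 'label3', 'label1']

# indicator=True adds a '_merge' column marking each row as
# 'both' (matched in df1) or 'left_only' (no match in df1)
merged = df2.merge(df1, on='A', how='left', indicator=True)
matched = merged[merged['_merge'] == 'both']
```

This lets you split matched from unmatched rows without a second lookup.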

Related

Concatenate single row dataframe with multiple row dataframe

I have a dataframe with large number of columns but single row as df1:
Col1 Col2 Price Qty
A B 16 5
I have another dataframe as follows, df2:
Price Qty
8 2.5
16 5
6 1.5
I want to achieve the following:
Col1 Col2 Price Qty
A B 8 2.5
A B 16 5
A B 6 1.5
Essentially I want to take the single row of df1 and repeat it once for every row of df2, but bring the Price and Qty columns from df2 in place of the ones originally present in df1.
I am not sure how to proceed with the above.
I believe the following approach will work,
# first, let's repeat the single-row df1 as many times as there are rows in df2
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.repeat(df1.values, len(df2.index), axis=0), columns=df1.columns)
# reset the indexes of both DataFrames just to be safe
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# now merge the two DataFrames on the index,
# after dropping the Price and Qty columns from df1
df3 = pd.merge(df1.drop(['Price', 'Qty'], axis=1), df2, left_index=True, right_index=True)
# finally, drop the leftover index columns
df3.drop(['index_x', 'index_y'], inplace=True, axis=1)
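On pandas 1.2 or newer, the repeat-and-align dance can be collapsed into a single cross join. This is a sketch of that shorter route (an alternative, not the answer above), using made-up data matching the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A'], 'Col2': ['B'], 'Price': [16], 'Qty': [5]})
df2 = pd.DataFrame({'Price': [8, 16, 6], 'Qty': [2.5, 5, 1.5]})

# a cross join pairs the single df1 row (minus Price/Qty) with every df2 row,
# so the Col1/Col2 values repeat and Price/Qty come from df2
df3 = df1.drop(columns=['Price', 'Qty']).merge(df2, how='cross')
```

Because df1 has exactly one row, the cross join produces one output row per row of df2, which is the desired shape.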

How to pull a specific value from one dataframe into another?

I have two dataframes
How would one populate the values in bold from df1 into the column 'Value' in df2?
Use melt on df1 before merging your two dataframes:
tmp = df1.melt('Rating', var_name='Category', value_name='Value2')
df2['Value'] = df2.merge(tmp, on=['Rating', 'Category'])['Value2']
print(df2)
# Output
Category Rating Value
0 Hospitals A++ 2.5
1 Education AA 2.1
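The question's input dataframes aren't shown, so here is a runnable reconstruction (the df1/df2 values are my assumption, chosen to reproduce the output above) showing the melt-then-merge approach end to end:

```python
import pandas as pd

# hypothetical inputs: a wide df1 with one column per category,
# and a df2 listing (Category, Rating) pairs whose Value we want
df1 = pd.DataFrame({'Rating': ['A++', 'AA'],
                    'Hospitals': [2.5, 1.8],
                    'Education': [3.0, 2.1]})
df2 = pd.DataFrame({'Category': ['Hospitals', 'Education'],
                    'Rating': ['A++', 'AA']})

# melt turns each (Rating, Category) cell of df1 into its own row,
# so the lookup becomes a plain merge on the two key columns
tmp = df1.melt('Rating', var_name='Category', value_name='Value2')
df2['Value'] = df2.merge(tmp, on=['Rating', 'Category'])['Value2']
```

Note the final assignment relies on the merge result sharing df2's default RangeIndex, which holds here because the merge is one-to-one.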

Perplexing pandas index change after left merge

I have a data frame and I am interested in a particular row. When I run
questionnaire_events[questionnaire_events['event_id'].eq(6506308)]
I get the row, and its index is 7,816. I then merge questionnaire_events with another data frame
merged = questionnaire_events.merge(
    ordinals,
    how='left',
    left_on='event_id',
    right_on='id')
(It is worth noting that the ordinals data frame has no NaNs and no duplicated ids, but questionnaire_events does have some rows with NaN values for event_id.)
merged[merged['event_id'].eq(6506308)]
The resulting row has index 7,581. Why? What has happened in the merge, a left outer merge, to mean that my row has moved from 7,816 to 7,581? If there were multiple rows with the same id in the ordinals data frame then I can see how the merged data frame would have more rows than the left data frame in the merge, but that is not the case, so why has the row moved?
(N.B. Sorry I cannot give a crisp code sample. When I try to produce test data the row index change does not happen, it is only happening on my real data.)
pd.DataFrame.merge does not preserve the original dataframe indexes.
import pandas as pd

df1 = pd.DataFrame({'key':[*'ABCDE'], 'val':[1,2,3,4,5]}, index=[100,200,300,400,500])
print('df1 dataframe:')
print(df1)
print('\n')
df2 = pd.DataFrame({'key':[*'AZCWE'], 'val':[10,20,30,40,50]}, index=[*'abcde'])
print('df2 dataframe:')
print(df2)
print('\n')
df_m = df1.merge(df2, on='key', how='left')
print('df_m dataframe:')
print(df_m)
Now, if your df1 has the default range index, the fresh index on the merged dataframe may happen to line up with it. But if you subset or filter your df1 beforehand, the indexing will not match.
Work Around:
df1 = df1.reset_index()
df_m2 = df1.merge(df2, on='key', how='left')
df_m2 = df_m2.set_index('index')
print('df_m2 work around dataframe:')
print(df_m2)
Output:
df_m2 work around dataframe:
key val_x val_y
index
100 A 1 10.0
200 B 2 NaN
300 C 3 30.0
400 D 4 NaN
500 E 5 50.0
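The "perplexing" index shift in the question is reproducible once the left frame has a non-default index, e.g. after filtering. A minimal sketch (with made-up data, since the asker's real data isn't available):

```python
import pandas as pd

df = pd.DataFrame({'event_id': [1, 2, 3, 4], 'x': list('abcd')})
filtered = df[df['event_id'] > 1]          # index is now 1, 2, 3 (not 0-based)
ordinals = pd.DataFrame({'id': [2, 3, 4], 'ordinal': [20, 30, 40]})

# merge throws away filtered's index and emits a fresh RangeIndex 0, 1, 2,
# so any row you were tracking by its old label appears to "move"
merged = filtered.merge(ordinals, left_on='event_id', right_on='id', how='left')
```

This is the same mechanism as in the example above; the reset_index/set_index work-around applies identically.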

Iterating over a dictionary of empty pandas dataframes to append them with data from an existing dataframe based on a list of column names

I'm a biologist and very new to Python (I use v3.5) and pandas. I have a pandas dataframe (df), from which I need to make several dataframes (df1... dfn) that can be placed in a dictionary (dictA), which currently has the correct number (n) of empty dataframes. I also have a dictionary (dictB) of n (individual) lists of column names that were extracted from df. The keys in 2 dictionaries match. I'm trying to append the empty dfs within dictA with parts of df based on the column names within the lists in dictB.
import pandas as pd
listA=['A', 'B', 'C',...]
dictA={i:pd.DataFrame() for i in listA}
let's say I have something like this:
dictA = {'A': df1, 'B': df2}
dictB = {'A': ['A1', 'A2', 'A3'],
         'B': ['B1', 'B2']}
df = pd.DataFrame({'A1': [0, 2, 4, 5],
                   'A2': [2, 5, 6, 7],
                   'A3': [5, 6, 7, 8],
                   'B1': [2, 5, 6, 7],
                   'B2': [1, 3, 5, 6]})
listA = ['A', 'B']
what I'm trying to get is for df1 and df2 to be populated with the matching portions of df, so that the output for df1 is like this:
A1 A2 A3
0 0 2 5
1 2 5 6
2 4 6 7
3 5 7 8
df2 would have columns B1 and B2.
I tried the following loop and some variations, but it doesn't yield populated dfs (DataFrame.append returns a new dataframe rather than modifying values in place, so the result is discarded each iteration):
for key, values in dictA.items():
    values.append(df[dictB[key]])
Thanks and sorry if this was already addressed elsewhere but I couldn't find it.
You could create the dataframes you want like this instead:
df = ...  # your original dataframe containing all the columns
df_A = df[[col for col in df.columns if 'A' in col]]
df_B = df[[col for col in df.columns if 'B' in col]]
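Since the asker already has dictB mapping each key to its list of column names, the whole dictA can also be built in one dict comprehension, with no empty frames or append calls. A sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A1': [0, 2, 4, 5],
                   'A2': [2, 5, 6, 7],
                   'A3': [5, 6, 7, 8],
                   'B1': [2, 5, 6, 7],
                   'B2': [1, 3, 5, 6]})
dictB = {'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2']}

# slice df once per key: each value is a sub-dataframe with just those columns
dictA = {key: df[cols] for key, cols in dictB.items()}
```

This avoids pre-creating empty dataframes entirely and keys the slices exactly by the lists in dictB rather than by substring matching.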

Pandas: Selecting rows by list

I tried the following code to select columns from a dataframe. My dataframe has about 50 columns. At the end, I want to create the sum of the selected columns, store it in a new column, and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns;
you can then do df['sum_1'] = df[columns_selected].sum(axis=1).
To filter the df to just the cols of interest, pass a list of the columns: df = df[columns_selected]. Note that a common error is to pass the names without the inner list: df = df['a','b','c'], which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would've worked: firstly you needed columns, not column; secondly you can use the boolean mask as a mask against the columns by passing it to loc as the column-selection arg:
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028
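Putting the asker's full goal together (select the columns, sum them into a new column, then drop them), here is a sketch with made-up data; the column names follow the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
                   'D': [7, 8], 'E': [9, 10], 'F': [0, 0]})
columns_selected = ['A', 'B', 'C', 'D', 'E']

# sum across the selected columns row-wise into a new column,
# then drop the selected columns in one go
df['sum_1'] = df[columns_selected].sum(axis=1)
df = df.drop(columns=columns_selected)
```

Summing before dropping matters, of course; after the drop only the untouched columns and sum_1 remain.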