Merge and interleave rows of two dataframes [duplicate] - pandas

This question already has answers here:
Pandas - Interleave / Zip two DataFrames by row
Suppose we have:
>>> df1
A B
0 1 a
1 2 a
2 3 a
3 4 a
>>> df2
A B
0 1 b
1 2 b
2 3 b
3 5 b
I would like to merge them on "A" and then list them by interleaving rows like:
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
I tried merge, but it lists them column by column. For example, if I have 3 or more dataframes, merge can join them on some columns, but my problem would then be how to interleave them.

If you need to match rows by A, filter them with Series.isin in boolean indexing, pass the results to concat, and then sort with DataFrame.sort_index:
df = pd.concat([df1[df1.A.isin(df2.A)],
df2[df2.A.isin(df1.A)]]).sort_index(kind='stable')
print (df)
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
EDIT:
For general data, it is possible to sort by A and create a default index so that the rows interleave correctly:
df = (pd.concat([df1[df1.A.isin(df2.A)].sort_values('A', kind='stable').reset_index(drop=True),
df2[df2.A.isin(df1.A)].sort_values('A', kind='stable').reset_index(drop=True)])
.sort_index(kind='stable'))
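As a rough, self-contained sketch of the approach above, using the question's data: the helper name `prepare` is hypothetical, and `kind='mergesort'` is swapped in for `kind='stable'` only because it is also a stable sort that older pandas versions accept.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a'] * 4})
df2 = pd.DataFrame({'A': [1, 2, 3, 5], 'B': ['b'] * 4})

def prepare(left, right):
    """Keep rows of `left` whose A also occurs in `right`,
    sort them by A, and give them a fresh 0..n-1 index."""
    kept = left[left['A'].isin(right['A'])]
    return kept.sort_values('A', kind='mergesort').reset_index(drop=True)

# After concat the index is 0,1,2,0,1,2; a stable sort on the index
# puts the df1 row before the df2 row for every position.
interleaved = pd.concat([prepare(df1, df2), prepare(df2, df1)]).sort_index(kind='mergesort')
print(interleaved)
```

This lines rows up by position after sorting, so it assumes each value of A occurs the same number of times in both dataframes.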

Related

Remove duplicates from dataframe, based on two columns A,B, keeping [list of values] in another column C

I have a pandas dataframe which contains duplicate values according to two columns (A and B):
A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8
I want to remove the duplicates while collecting the values of column C into a list of length N (N = 2 in this example). This would lead to:
A B C
1 2 [1,4]
2 7 1
3 4 [0,8]
I cannot figure out how to do that. Maybe use groupby and drop_duplicates?
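One possible direction, offered only as a sketch of the groupby idea mentioned in the question: group on A and B and aggregate C into a list. Note that, unlike the desired output, a group with a single row yields a one-element list rather than a bare scalar.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Collect every C value of each (A, B) pair into a Python list.
out = df.groupby(['A', 'B'])['C'].agg(list).reset_index()
print(out)
```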

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column name (e.g. "B val"), in which case bracket notation is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
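To illustrate the "select n rows" extension mentioned above, here is a minimal sketch on the question's data. It uses GroupBy.head(n) directly instead of apply with DataFrame.head, and sorts by both A and B so the row order is deterministic; both choices are substitutions made for this example, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# Sort so the smallest B values come first within each A,
# then keep the first two rows of every group.
two_smallest = (df.sort_values(['A', 'B'])
                  .groupby('A')
                  .head(2)
                  .reset_index(drop=True))
print(two_smallest)
```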
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
If you use this solution and then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I added dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the minimum of its group:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
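One difference between the approaches is worth keeping in mind: idxmin returns a single label per group (the first occurrence of the minimum), while comparing against transform('min') keeps every row that ties for the minimum. A small sketch with made-up data containing a tie:

```python
import pandas as pd

# Illustrative data: group A == 1 has two rows sharing the minimum B.
df = pd.DataFrame({'A': [1, 1, 2],
                   'B': [2, 2, 5],
                   'C': [10, 20, 30]})

# One row per group: the first row holding the minimum.
first_only = df.loc[df.groupby('A')['B'].idxmin()]

# Every row whose B equals its group's minimum, ties included.
all_ties = df[df['B'] == df.groupby('A')['B'].transform('min')]

print(first_only)
print(all_ties)
```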

most efficient way to set dataframe column indexing to other columns

I have a large DataFrame. One of my columns contains the names of other columns. I want to evaluate this column and set each row to the value of the referenced column:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| B |
|2|5|3| A |
|3|5|9| C |
Desired output:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| 3 |
|2|5|3| 2 |
|3|5|9| 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0+, lookup is deprecated; you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is going to be deprecated, try the NumPy method with get_indexer:
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
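The snippets above use df.index as row positions, which only works when the index is the default 0..n-1 range. A slightly more defensive sketch, assuming the value columns are known up front (the list `value_cols` and the non-default index are introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Non-default index on purpose, to show that positions are used, not labels.
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [3, 5, 5],
                   'C': [4, 3, 9],
                   'Column': ['B', 'A', 'C']},
                  index=[10, 20, 30])

value_cols = ['A', 'B', 'C']  # assumed: the columns that may be referenced

# Position of each referenced column within value_cols.
col_pos = pd.Index(value_cols).get_indexer(df['Column'])
# Row positions 0..n-1, independent of the index labels.
row_pos = np.arange(len(df))

# Fancy indexing picks one cell per row from the numeric block.
df['new'] = df[value_cols].to_numpy()[row_pos, col_pos]
print(df)
```

Restricting the array to the numeric columns also avoids converting the whole frame to an object array because of the string column.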

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced with the result of multiplying each row of the second dataframe against each row of the first dataframe, where it multiplies the value of Name1 against the value in the column of dataframe1 which matches the Name4 value of dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factor from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
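For comparison, a loop-free variant of the same idea can be sketched with Index.repeat and NumPy. This is an alternative technique, not the answer's method, and the names `n`, `out`, and `factors` are introduced here for illustration; it uses the fake data from the question's first edit.

```python
import pandas as pd

d1 = {'D': [1, 2, 3, 4, 5, 6], 'W': [2, 2, 2, 2, 2, 2],
      'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]}
d2 = {'col1': [3, 2, 7, 4, 5, 6], 'col2': [2, 2, 2, 2, 3, 4],
      'col3': ['data'] * 6, 'col4': ['D', 'L', 'D', 'W', 'S', 'S']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

n = len(df1)

# Repeat every row of df2 once per row of df1.
out = df2.loc[df2.index.repeat(n)].reset_index(drop=True)

# Pick the df1 column named by col4 for each df2 row (duplicates allowed),
# then flatten so the values line up with the repeated rows.
factors = df1[df2['col4'].tolist()].to_numpy().T.ravel()

out['col1'] = out['col1'] * factors
print(out.head(12))
```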

Pandas : dataframe cumsum , reset if other column is false [duplicate]

This question already has an answer here:
How to reset cumsum after change in sign of values?
I have a dataframe with 2 columns. The objective here is simple: reset the df.cumsum() whenever the condition column of a row is False:
df
value condition
0 1 1
1 2 1
2 3 1
3 4 0
4 5 1
The wanted result is as follows:
df
value condition
0 1 1
1 3 1
2 6 1
3 4 0
4 9 1
If I loop over the dataframe as described in the post "Python pandas cumsum() reset after hitting max", I can achieve the wanted result, but I was looking for a more vectorized way using standard pandas functions.
How about:
df['cSum'] = df.groupby((df.condition == 0).cumsum()).value.cumsum()
Output:
value condition cSum
0 1 1 1
1 2 1 3
2 3 1 6
3 4 0 4
4 5 1 9
You'll group consecutive rows together until you encounter a 0 in the condition column, and then you apply the cumsum within each group separately.
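To make the grouping step visible, the sketch below builds the question's data and keeps the intermediate group key in its own variable (the name `group_id` is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5],
                   'condition': [1, 1, 1, 0, 1]})

# Each 0 in `condition` increments the counter, so a new group
# starts at that row: [0, 0, 0, 1, 1].
group_id = (df['condition'] == 0).cumsum()

# Cumulative sum restarts at the first row of every group.
df['cSum'] = df.groupby(group_id)['value'].cumsum()
print(df)
```

Because the row with condition 0 opens the new group, its own value is the first term of the restarted sum.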