Lookup a pandas df for a column value by matching rows with another dataframe - pandas

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to this each unique combination. But I don't want to store these values in df1. Just need these values for some computation purpose and then I want to store the result in df2 only by creating a new column.
To summarize, lets say ,I want to do some computation using df1.Dist[3] for which I need values from columns df2.ABC and df2.XYZ also, so first find the row index in df2 where,
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(In this case its row#1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 Nan
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.

You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = (df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day']) & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID']) & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID']) & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID'])
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.

Related

How to keep only the last index in groups of rows where a condition is met in pandas?

I have the following dataframe:
d = {'value': [1,1,1,1,1,1,1,1,1,1], 'flag_1': [0,1,0,1,1,1,0,1,1,1],'flag_2':[1,0,1,1,1,1,1,0,1,1],'index':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag 1 and flag 2 are equal keep the row with the maximum index from the consecutive indices. Below for rows 4,5,6 and rows 9,10 flag 1 and flag 2 are equal. From the group of consecutive indices 4,5,6 therefore I wish to keep only row 6 and drop rows 4 and 5. For the next group of rows 9 and 10 I wish to keep only row 10. The rows where flag 1 and 2 are not equal should all be retained. I want my final output to look as shown below:
I am really not sure how to achieve what is required so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10

How to split pandas dataframe into multiple dataframes (holding together rows) based upon a column's value

My problem is similar to split a dataframe into chunks of N rows problem, expect that the number of rows in each chunk will be different. I have a datafame as such:
A
B
C
1
2
0
1
2
1
1
2
2
1
2
0
1
2
1
1
2
2
1
2
3
1
2
4
1
2
0
A and B are just whatever don't pay attention. Column C though starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe included the first 3 rows are a new dataframe, then the next 5 are a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to group all the subgroups and then with this groupby I can select each subgroups as a separate dataframe.

drop consecutive duplicates of groups

I am removing consecutive duplicates in groups in a dataframe. I am looking for a faster way than this:
def remove_consecutive_dupes(subdf):
dupe_ids = [ "A", "B" ]
is_duped = (subdf[dupe_ids].shift(-1) == subdf[dupe_ids]).all(axis=1)
subdf = subdf[~is_duped]
return subdf
# dataframe with columns key, A, B
df.groupby("key").apply(remove_consecutive_dupes).reset_index()
Is it possible to remove these without grouping first? Applying the above function to each group individually takes a lot of time, especially if the group count is like half the row count. Is there a way to do this operation on the entire dataframe at once?
A simple example for the algorithm if the above was not clear:
input:
key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5
5 x 1 2
output:
key A B
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
5 x 1 2
Row 2 was dropped because A=1 B=2 was also the previous row in group x.
Row 5 will not be dropped because it is not a consecutive duplicate in group x.
According to your code, you drop only lines if they appear below each other if
they are grouped by the key. So rows with another key inbetween do not influence this logic. But doing this, you want to preserve the original order of the records.
I guess the biggest influence in the runtime is the call of your function and
possibly not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns= ['original_order'] + list(df.columns[1:])
# add a group column, that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns= ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group']= (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and it's order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this, results in:
key A B
original_order
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
If you don't like the index name above (original_order), you just need to add the following line to remove it:
df.index.name= None
Testdata:
from io import StringIO
infile= StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df= pd.read_csv(infile, sep='\s+') #.set_index('Date')
df

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead is very time consuming.
We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort values ascending false gives similar to nlargest and True gives similar to nsmallest.
The value inside the head is the same as the value we give inside nlargest to get the number of values to display for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values in top-n values, and want only unique values, you can do like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than groupby().apply() and groupby().nlargest() calls as suggested in the other answers on here(1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that it was 24-150 times faster than those solutions.
Also, instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Looking up values from one dataframe in specific row of another dataframe

I'm struggling with a bit of a complex (to me) lookup-type problem.
I have a dataframe df1 that looks like this:
Grade Value
0 A 16
1 B 12
2 C 5
And another dataframe (df2) where the values in one of the columns('Grade') from df1 forms the index:
Tier 1 Tier 2 Tier 3
A 20 17 10
B 16 11 3
C 7 6 2
I've been trying to write a bit of code that for each row in df1, look ups the row corresponding with 'Grade' in df2, finds the smallest value in df2 greater than 'Value', and returns the name of that column.
E.g. for the second row of df1, it looks up the row with index 'B' in df2: 16 is the smallest value greater than 12, so it returns 'Tier 1'. Ideal output would be:
Grade Value Tier
0 A 16 Tier 2
1 B 12 Tier 1
2 C 5 Tier 2
My novice, downloaded-Python-last-week attempt so far has been as follows, which is throwing up all manner of errors and doesn't even try returning the column name. Sorry also about the micro-ness of the question: any help appreciated!
for i, row in input_df1.iterrows():
Tier = np.argmin(df1['Value']<df2.loc[row,0:df2.shape[1]])
df2.loc[df1.Grade].eq(df1.Value, 0).idxmax(1)