How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently? - pandas

I have a DataFrame with one ID column and two data columns, X and Y, containing numeric values. For each ID there are several rows of data.
I have a second DataFrame with the same ID column and two numeric columns specifying the lower and upper limits for the X values of each ID.
I want to use the second DataFrame to filter the first so that it keeps only the rows whose X values lie within the X_min to X_max range of their ID.
I can solve this by looping over the second DataFrame and filtering the groupby(ID) groups of the first one, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real DataFrame is obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible solution:
def filter_ranges(grp,df_ranges):
    x_min=df_ranges.loc[df_ranges.ID==grp.name,'X_min'].values[0]
    x_max=df_ranges.loc[df_ranges.ID==grp.name,'X_max'].values[0]
    return grp.loc[(grp.X>=x_min)&(grp.X<=x_max),:]
target_df_grp=df.groupby('ID').apply(filter_ranges,df_ranges=df_ranges)

Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X>=merged.X_min)&(merged.X<=merged.X_max)][['ID', 'X', 'Y']] # Here, desired filter is applied.
print(target_df) will give:
    ID    X    Y
0  0.0  2.1  3.1
1  0.0  2.2  3.5
3  0.0  2.4  2.7
7  0.1  3.2  4.3
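The merge approach above can also be wrapped in a small helper. This is only a sketch under stated assumptions: the name filter_by_ranges is made up for illustration, and it assumes that each ID appears at most once in df_ranges. It uses a left merge so the rows stay in their original order, and Series.between for the inclusive range check.

```python
import pandas as pd

def filter_by_ranges(df, df_ranges):
    """Keep rows of df whose X lies in [X_min, X_max] for their ID (hypothetical helper)."""
    # validate="many_to_one" raises if an ID occurs more than once in df_ranges
    merged = df.merge(df_ranges, on="ID", how="left", validate="many_to_one")
    # between() is inclusive on both ends; IDs without a range get NaN limits and are dropped
    mask = merged["X"].between(merged["X_min"], merged["X_max"])
    # the left merge keeps df's row order, so the mask can be applied positionally
    return df[mask.to_numpy()]

# Same sample data as in the question
df = pd.DataFrame({
    "ID": [0] * 4 + [0.1] * 4,
    "X": [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2],
    "Y": [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3],
})
df_ranges = pd.DataFrame({"ID": [0, 0.1], "X_min": [2.0, 3.0], "X_max": [2.5, 3.4]})

result = filter_by_ranges(df, df_ranges)
print(result)
```

Applying the mask to the original df keeps its index labels and avoids carrying the X_min and X_max columns into the result.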

Related

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up like a list are the ones I want to remove. The dataset is a couple of million rows, so I just need code that removes all of the array-like values in that specific dataframe column.
df = df[~df['ORIGIN_AIRPORT_ID'].astype(str).str.contains(',')]
Next time, please consider giving us a data sample as text instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted to a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, just use .dropna():
df = df.dropna().astype('int')
This results in your desired DataFrame:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
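The steps above can be put together as one short, self-contained sketch. It assumes the column was read in as strings (for example from a CSV file), so the list-like cell is the text "[10397, 10398, 10399, 10400]"; if the cells hold real Python lists, the behaviour may differ. The name clean is only illustrative.

```python
import pandas as pd

# Sample data from the answer; the list-like cell is assumed to be a string
df = pd.DataFrame({
    "ITIN_ID": [20194146, 20194147, 20194148, 20194149, 20194150],
    "ORIGIN_AIRPORT_ID": ["10397", "10397", "10397",
                          "[10397, 10398, 10399, 10400]", "10397"],
})

# Anything that cannot be parsed as a number becomes NaN
df["ORIGIN_AIRPORT_ID"] = pd.to_numeric(df["ORIGIN_AIRPORT_ID"], errors="coerce")

# Drop only rows where this column is NaN, then restore an integer dtype
clean = (
    df.dropna(subset=["ORIGIN_AIRPORT_ID"])
      .astype({"ORIGIN_AIRPORT_ID": "int64"})
)
print(clean)
```

Passing subset to dropna limits the check to that one column, so rows with missing values in unrelated columns are kept.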

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation and keep the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column to look for
largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it to select the rows from your original df:
idx = df.groupby(df.index.year)['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that effect, I can't test now without any test data)
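Since the question came without test data, here is a small made-up example of the sort-then-head idea. The timestamps, the temperature column and the choice of the top 2 per year are all assumptions for illustration. The year key is taken from the sorted frame so that it lines up with the rows being grouped.

```python
import pandas as pd

# Made-up sample: three readings in 2019 and three in 2020
idx = pd.to_datetime([
    "2019-01-01 00:00", "2019-06-01 12:00", "2019-12-31 23:00",
    "2020-01-01 00:00", "2020-03-01 06:00", "2020-07-01 18:00",
])
df = pd.DataFrame(
    {"totaldemand": [10, 30, 20, 5, 50, 40], "temperature": [1, 2, 3, 4, 5, 6]},
    index=idx,
)

# Sort once, then take the first n rows of every year; all columns are kept
ranked = df.sort_values("totaldemand", ascending=False)
top = ranked.groupby(ranked.index.year).head(2).sort_index()
print(top)
```

head returns whole rows, so the other columns come along without a second lookup.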

How to duplicate a row in pandas based on a column condition?

I have a pandas data frame and I would like to duplicate those rows which meet some column condition (i.e. having multiple elements in CourseID column)
I tried iterating over the data frame to identify the rows which should be duplicated, but I don't know how to duplicate them.
Using Pandas version 0.25 or later, it is quite easy:
The first step is to split df.CourseID (converting each element to a list)
and then to explode it (break each list into multiple rows,
repeating other columns in each row):
course = df.CourseID.str.split(',').explode()
The result is:
0 456
1 456
1 799
2 789
Name: CourseID, dtype: object
Then all that remains is to join df with course, but in order to avoid
repeating column names, you have to drop the original CourseID column first.
Fortunately, it can be expressed in a single instruction:
df.drop(columns=['CourseID']).join(course)
If you have some older version of Pandas this is a good reason to
upgrade it.
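As a compact variant of the same idea, DataFrame.explode (also available since Pandas 0.25) can do the split and the row duplication in one chain. The sample frame below is made up: the Student column and its values are hypothetical, and the CourseID values are chosen to match the output shown above.

```python
import pandas as pd

# Hypothetical sample data; only the second row holds several course IDs
df = pd.DataFrame({
    "Student": ["A", "B", "C"],
    "CourseID": ["456", "456,799", "789"],
})

# Turn each CourseID string into a list, then emit one row per list element
out = df.assign(CourseID=df["CourseID"].str.split(",")).explode("CourseID")
print(out)
```

The duplicated rows keep the index label of the row they came from; call reset_index(drop=True) afterwards if a fresh index is needed.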

Fast remove element of list if contained by pandas dataframe

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with list comprehension. I first concatenate the two dataframes. I then loop through the list, and using an if statement check if it is contained in the concatenated dataframe values.
patches = [patch for patch in patches if patch not in bad_patches.values]
The first 5 elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes, with the second being the same but containing fewer rows. Note that the first row contains patches[2].
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed as it's contained in the first row of bad_patches. My method does work but takes absolutely ages. bad_patches is 60,000 rows and the length of patches is variable. Right now, for 1000 patches it takes 2.04 seconds, but I need to scale up to 500k patches, so I'm hoping there is a faster way. Thanks!
I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)
Now you just subtract all values in patch_set from the values in patches, and you will be left with only values in patches that do not show up in cloud_patches nor snow_patches:
cleaned_list = list(patches - patch_set)
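Note that a set difference does not keep the original order of the list and removes duplicate entries. If order matters, a variant is to build the set of bad names once and keep the list comprehension, since membership tests against a set are fast. The sketch below uses a few of the names from the question; the tiny frames and the NaN entry are made up for illustration.

```python
import numpy as np
import pandas as pd

patches = [
    "S2A_MSIL2A_20170613T101031_11_52",
    "S2A_MSIL2A_20170717T113321_35_89",
    "S2A_MSIL2A_20170613T101031_12_39",
    "S2A_MSIL2A_20170613T101031_11_77",
]
# Tiny stand-ins for the real frames; the single column is named 0 as in the question
cloud_patches = pd.DataFrame({0: ["S2A_MSIL2A_20170717T113321_35_89",
                                  "S2A_MSIL2A_20170717T113321_39_84"]})
snow_patches = pd.DataFrame({0: ["S2B_MSIL2A_20170831T095029_27_76", np.nan]})

# Build the lookup set once, ignoring NaNs
bad = set(cloud_patches[0].dropna()) | set(snow_patches[0].dropna())

# Order-preserving filter against the set
cleaned = [p for p in patches if p not in bad]
print(cleaned)
```

This also skips the concat step, so no NaN padding is created in the first place.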

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
This method is referenced in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
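A small sketch of the per-group forward fill, using the groupby ffill method directly rather than fillna(method='ffill'), which newer pandas versions deprecate. The second ID (abc123) and the reduced set of columns are made up for illustration. The filled values are written back into the original frame so that the id column is kept.

```python
import numpy as np
import pandas as pd

# Made-up sample in the shape of the question's data, already ordered within each id
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "abc123", "abc123"],
    "is_current": ["Expired", None, "Current", "Expired", None],
    "location": ["CA", np.nan, "CA", np.nan, "NY"],
})

# Every column except the grouping key gets filled
value_cols = [c for c in df.columns if c != "id"]

# Forward fill within each id; values never leak from one id into the next
df[value_cols] = df.groupby("id")[value_cols].ffill()
print(df)
```

The first abc123 row keeps its missing location because there is no earlier row with the same id to fill from.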
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
here's an easy way to do this.
url:https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, ffill() is not optimized when you do a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)