Fast remove element of list if contained by pandas dataframe - pandas

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with a list comprehension: I first concatenate the two dataframes, then loop through the list and, with an if statement, check whether each item is contained in the concatenated dataframe's values.
patches = [patch for patch in patches if not patch in bad_patches.values]
A few elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes; the second has the same structure but fewer rows. Note that the first row contains the second element of patches.
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed because it's contained in the first row of bad_patches. My method works but takes absolutely ages. bad_patches is 60,000 rows and the length of patches is variable. Right now it takes 2.04 seconds for 1,000 patches, but I need to scale up to 500k patches, so I'm hoping there is a faster way. Thanks!

I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)
Now you just subtract patch_set from patches, and you will be left with only the values of patches that do not appear in either cloud_patches or snow_patches:
cleaned_list = list(patches - patch_set)
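If the original ordering of patches matters, you can keep the list comprehension and only move the membership test into a set. A minimal sketch along the same lines, assuming the columns are still named 0 and may contain NaN:

# Build the lookup set once; dropna() skips any NaN values in either column.
bad_set = set(cloud_patches[0].dropna()) | set(snow_patches[0].dropna())

# Set membership is O(1), so this stays fast while preserving the list's order.
patches = [patch for patch in patches if patch not in bad_set]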

Related

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
from copy import deepcopy

outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    for key, value in e.items():
        outdf.loc[index, key] = value
But it doesn't seem Pythonic or idiomatic pandas, and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the list and the data frame are in the same order, you can convert your list of dictionaries into a data frame:
mylist = [
{'a':1,'b':2,'c':3},
{'e':11,'f':22,'c':33},
{'a':111,'b':222,'c':333}
]
mylist_df = pd.DataFrame(mylist)
     a    b    c    e    f
0    1    2    3  NaN  NaN
1  NaN  NaN   33   11   22
2  111  222  333  NaN  NaN
Then you can use pd.concat to merge the list to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is created for every unique key across the dictionaries, even if a key exists in one dictionary and not another.
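As a rough, self-contained sketch of the whole flow (input_df here is a hypothetical stand-in for your indf; the only requirement is that its rows line up with the list):

import pandas as pd

# Hypothetical stand-in for the real input dataframe; it only needs the same
# number of rows, in the same order, as the list of dictionaries.
input_df = pd.DataFrame({'id': ['x', 'y', 'z']})

mylist = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333},
]

# One column per unique key across all dictionaries; missing keys become NaN.
mylist_df = pd.DataFrame(mylist)

# Column-wise concatenation relies on both frames sharing the same index.
result = pd.concat([input_df, mylist_df], axis=1)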

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up as a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column.
My attempt so far:
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pandas pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, just use .dropna:
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
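If the column truly holds Python list objects rather than strings that merely look like lists, another option (a sketch, not tested against your data) is to filter on the element type directly:

import numpy as np

# Keep only rows whose ORIGIN_AIRPORT_ID is a plain scalar, dropping list-like cells.
is_scalar = df['ORIGIN_AIRPORT_ID'].apply(
    lambda x: not isinstance(x, (list, tuple, np.ndarray)))
df = df[is_scalar]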

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where every value is a list (usually a list of one item per row), and I would like to use get_dummies to one-hot encode all the values. However, a few rows have no value for that column. Originally these show up as NaN; I have replaced the NaN with an empty list, but in either case get_dummies does not produce 0s and 1s for those rows. Each generated column is blank, whereas I would expect it to be 0.
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt False ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt False []
111th-congress_senate-bill_852.txt False ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt False ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt False ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode on the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason the third column needs to be a list is that I have to do the same to a related column, where each row can hold anywhere from 0 to n values, with n as high as 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(), ['sponsor_list']] = X.loc[X['sponsor_list'].isnull(), 'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to achieve what I was trying to do. I still replace NaN with an empty list before applying it, but then fit_transform creates the 0/1 values, which can result in no 1s in a row, or many 1s in a row.
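A minimal, self-contained illustration of the same idea (the sponsor names here are shortened, made-up stand-ins):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy column of lists; the middle row has an empty list, like a bill with no sponsor.
X = pd.DataFrame({'sponsor_list': [['Menendez'], [], ['Vitter']]})

mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))

# The empty-list row is encoded as all zeros:
#    Menendez  Vitter
# 0         1       0
# 1         0       0
# 2         0       1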

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02 [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03 [-0.0028882814454597745, -0.005829869983964528...
df2.ix[1:3]
DateTime
2018-01-02 [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03 [-0.0001314381449719178, -0.006278235444742629...
len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So the lists are being concatenated instead of summed element-wise.
How do I add the lists in df1 and df2 element-wise?
When an operation is applied between two dataframes, it is broadcast at element level. In your case each element is a list, and the '+' operator applied to two lists concatenates them. That's why the resulting dataframe contains concatenated lists.
There are multiple approaches for actually summing up the elements of the lists instead of concatenating them.
One approach is to convert the list elements into columns, add the dataframes, and then merge the columns back into a single list (this has been suggested in another answer, but in a way that doesn't quite work).
Step 1: Converting list elements to columns
df1=df1.apply(lambda row:pd.Series(row[0]), axis=1)
df2=df2.apply(lambda row:pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of the column index associated with the series.
Step 2: Add dataframes
df=df1+df2 #this dataframe will have 500 columns
Step 3: Merge columns back to lists
df=df.apply(lambda row:pd.Series({0:list(row)}),axis=1)
This is the interesting part. Why do we return a Series here? Why doesn't simply returning list(row) work, instead keeping the 500 columns?
The reason is that if the length of the returned list equals the number of columns, the list is fitted back into the columns and it looks as if nothing happened; if its length differs from the number of columns, it is kept as a single list.
Let's look at an example.
Suppose I have a dataframe with columns 0, 1 and 2.
df=pd.DataFrame({0:[1,2,3],1:[4,5,6],2:[7,8,9]})
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
The number of columns in the original dataframe is 3. If I return a list of two values, it works and a series of lists is returned:
df1=df.apply(lambda row:[row[0],row[1]],axis=1)
0 [1, 4]
1 [2, 5]
2 [3, 6]
dtype: object
If instead I return a list of three numbers, it gets fitted back into the columns.
df1=df.apply(list,axis=1)
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
So if we want to return a list whose size equals the number of columns, we have to wrap it in a Series whose single value is that list.
Another approach is to bring one dataframe's column into the other and then add the columns using the apply function.
import numpy as np

df1[1] = df2[0]
df = df1.apply(lambda r: list(np.array(r[0]) + np.array(r[1])), axis=1)
We can take advantage of numpy arrays here: the '+' operator on numpy arrays sums corresponding values and returns a single numpy array.
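For instance, a throwaway comparison of the two '+' behaviours:

import numpy as np

[1, 2] + [3, 4]                             # lists: '+' concatenates -> [1, 2, 3, 4]
list(np.array([1, 2]) + np.array([3, 4]))   # arrays: '+' adds element-wise -> [4, 6]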
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2
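For reference, a self-contained toy version of the expand, add, and repack approach, using two-element lists instead of 500 and made-up dates:

import pandas as pd

# Two toy frames whose single column (named 0) holds lists, as in the question.
idx = ['2018-01-02', '2018-01-03']
df1 = pd.DataFrame({0: [[1.0, 2.0], [3.0, 4.0]]}, index=idx)
df2 = pd.DataFrame({0: [[0.5, 0.5], [1.0, 1.0]]}, index=idx)

# Expand the lists into columns, add element-wise, then pack rows back into lists.
expanded = df1[0].apply(pd.Series) + df2[0].apply(pd.Series)
summed = pd.Series(expanded.values.tolist(), index=expanded.index)

print(summed['2018-01-02'])  # [1.5, 2.5]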

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now have about 200,000 rows of unique data which track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled out in one row but blank in the next; I know the location has not changed, but because it's blank the row is captured as a unique row.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced in the pandas documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
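As a quick illustration of what the grouped forward fill does (toy data shaped like the example above, with made-up ids and values):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':       ['xyz987', 'xyz987', 'xyz987', 'abc123', 'abc123'],
    'location': ['CA',      np.nan,   'CA',     np.nan,   'NY'],
})

# Forward fill within each id only; a NaN at the start of a group stays NaN,
# so values never leak from one id into another.
df['location'] = df.groupby('id')['location'].ffill()
print(df)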
From github, jreback: "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this."
url: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, ffill() is not optimized when done within a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)