What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns? - pandas

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something on the line of:
from copy import deepcopy
outdf = deepcopy(indf)
for index, row in indf.iterrows():
e = l[index]
for key, value in e:
outdf.loc[index, key] = value
But it doesn't seem pythonic and pandasnic and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

If the sorting of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
{'a':1,'b':2,'c':3},
{'e':11,'f':22,'c':33},
{'a':111,'b':222,'c':333}
]
mylist_df = pd.DataFrame(mylist)
a
b
c
e
f
0
1
2
3
nan
nan
1
nan
nan
33
11
22
2
111
222
333
nan
nan
Then you can use pd.concat to merge the list to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
In this way, there is always a column created for all unique keys in your dictionary, regardless of they exist in one dictionary and not the other.

Related

new_df = df1[df2['pin'].isin(df1['vpin'])] UserWarning: Boolean Series key will be reindexed to match DataFrame index

I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
The df1 and df2 has only one similar column and they do not have same number of rows.
I want to filter df1 based on the column in df2. If df2.pin is in df1.vpin I want those rows.
There are multiple rows in df1 for same df2.pin and I want to retrieve them all.
pin
count
1
10
2
20
vpin
Column B
1
Cell 2
1
Cell 4
The command is working. I'm trying to overcome the warning.
It doesn't really make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1 as this mean will have the indices of df2, thus the reindexing performed by pandas.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe looks like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the .to_numeric pandas function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number to a NaN ( Not a Number ), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows now just use .dropna
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397

Fast remove element of list if contained by pandas dataframe

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with list comprehension. I first concatenate the two dataframes. I then loop through the list, and using an if statement check if it is contained in the concatenated dataframe values.
patches = [patch for patch in patches if not patch in bad_patches.values]
The first 5 elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes, with the second being the same but containing less rows. Note first row contains patches[2].
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed as it's contained in the first row of bad_patches. My method does work but takes absolutely ages. Bad_patches is 60,000 rows and the length of patches is variable. Right now for a length of 1000 patches it takes a 2.04 seconds but I need to scale up to 500k patches so hoping there is a faster way. Thanks!
I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0])
patches = set(patches)
Now you just subtract all values in patch_set from the values in patches, and you will be left with only values in patches that do not show up in cloud_patches nor snow_patches:
cleaned_list = list(patches - patch_set)

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02 [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03 [-0.0028882814454597745, -0.005829869983964528...
df2.ix[1:3]
DateTime
2018-01-02 [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03 [-0.0001314381449719178, -0.006278235444742629...
len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So, the lists instead of being summation is being concatenated.
How do I add element wise the lists in the dataframes df1 and df2.
When an operation is applied between two dataframes, it gets broadcasted at element level. Element in your case is a list and when '+' operator is applied between two lists, it concatenates them. That's why resulting dataframe contains concatenated lists.
There can be multiple approaches for actually summing up elements of lists instead of concatenating.
One approach can be converting list elements into columns and then adding dataframes and then merging columns back to a single list.(which has been suggested in first answer but in a wrong way)
Step 1: Converting list elements to columns
df1=df1.apply(lambda row:pd.Series(row[0]), axis=1)
df2=df2.apply(lambda row:pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of column index associated with series.
Step 2: Add dataframes
df=df1+df2 #this dataframe will have 500 columns
Step 3: Merge columns back to lists
df=df.apply(lambda row:pd.Series({0:list(row)}),axis=1)
This is an interesting part. Why are we returning a series here? Why only returning list(row) doesn't work and keep retaining 500 columns?
Reason is - if length of list returned is same as length of columns in the beginning, then this list gets fit in columns and to us it seems nothing happened. Whereas if length of the list is not equal to number of columns, then it is returned as single list.
Let's look at an example.
Suppose I've a dataframe, having columns 0 ,1 and 2.
df=pd.DataFrame({0:[1,2,3],1:[4,5,6],2:[7,8,9]})
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
Number of columns in original dataframe are 3. If I try to return a list with two columns, it works and a series is returned,
df1=df.apply(lambda row:[row[0],row[1]],axis=1)
0 [1, 4]
1 [2, 5]
2 [3, 6]
dtype: object
Instead if try to return list of three numbers, it would get fit in columns.
df1=df.apply(list,axis=1)
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
So if we want to return list of same size as number of columns, we'll have to return it in form of Series where one row's value has been given as list.
Another approach can be, introduce one column of a dataframe into other and then add columns using apply function.
df1[1]=df2[0]
df=df1.apply(lambda r: list(np.array(r[0])+np.array(r[1])),axis=1)
We can take advantage of numpy arrays here. '+' operator on numpy arrays sums up corresponding values and gives a single numpy array.
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2

Pandas: fill in NaN values with dictionary references another column

I have a dictionary that looks like this
dict = {'b' : '5', 'c' : '4'}
My dataframe looks something like this
A B
0 a 2
1 b NaN
2 c NaN
Is there a way to fill in the NaN values using the dictionary mapping from columns A to B while keeping the rest of the column values?
You can map dict values inside fillna
df.B = df.B.fillna(df.A.map(dict))
print(df)
A B
0 a 2
1 b 5
2 c 4
This can be done simply
df['B'] = df['B'].fillna(df['A'].apply(lambda x: dict.get(x)))
This can work effectively for a bigger dataset as well.
Unfortunately, this isn't one of the options for a built-in function like pd.fillna().
Edit: Thanks for the correction. Apparently this is possible as illustrated in #Vaishali's answer.
However, you can subset the data frame first on the missing values and then apply the map with your dictionary.
df.loc[df['B'].isnull(), 'B'] = df['A'].map(dict)