KeyError after dropping null values in pandas dataframe - pandas

I have a simple dataframe as:
0    1     2        3
1  NaN  like  dislike
2  Cow   dog    snail
After dropping the NaN value the dataframe is:
0    1     2        3
2  Cow   dog    snail
Now when I try the following code to print the values, it raises a KeyError:
for i in range(len(data)):
    print(data.loc[i, :])
Any help will be appreciated.

Please add the following line after dropping the NaN values:
data = data.reset_index(drop=True)
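For context, a minimal sketch of the whole flow (the frame below is a reconstruction of the example in the question, so the labels are assumed):
import numpy as np
import pandas as pd

# assumed reconstruction of the question's frame: columns 1, 2, 3 and index 1, 2
data = pd.DataFrame({1: [np.nan, 'Cow'], 2: ['like', 'dog'], 3: ['dislike', 'snail']},
                    index=[1, 2])

data = data.dropna()                  # only the row at index 2 is left
data = data.reset_index(drop=True)    # relabel the index as 0, 1, ...

for i in range(len(data)):
    print(data.loc[i, :])             # no KeyError, because loc[0] now exists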

You're dropping a row with a specific index label, and after dropping the NA values that label no longer exists. In your specific case, index 1 is dropped, so you can no longer iterate over your dataframe via range(len(...)). I highly recommend using iterrows instead, as it's better form. Example:
for index, value in mydataframe.iterrows():
    print(index, " ", value)
# 2
# 1      Cow
# 2      dog
# 3    snail
The value is of class 'pandas.core.series.Series', which in your case behaves a lot like a dictionary. Pay attention to its labels, which are exactly the column names from your example.
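As a small follow-up (assuming the column labels 1, 2 and 3 from the question), individual fields of that row Series can be pulled out by column label:
for index, row in data.iterrows():
    print(index, row.loc[1], row.loc[2], row.loc[3])   # e.g. 2 Cow dog snail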

Related

What is "roundtripping" in the context of Pandas?

The documentation for pandas.read_excel mentions something called 'roundtripping', under the description of the index_col parameter.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.
I have never heard of this term before, and if I search for a definition, I can find one only in the context of finance. I have seen it referred to in the context of merging dataframes in Pandas, but I have not found a definition.
For context, this is the complete description of the index_col parameter:
index_col : int, list of int, default None
    Column (0-indexed) to use as the row labels of the DataFrame. Pass
    None if there is no such column. If a list is passed, those columns
    will be combined into a MultiIndex. If a subset of data is selected
    with usecols, index_col is based on the subset.
    Missing values will be forward filled to allow roundtripping with
    to_excel for merged_cells=True. To avoid forward filling the
    missing values use set_index after reading the data instead of
    index_col.
For a general idea of the meaning of roundtripping, have a look at the answers to this post on SE. Applied to your example, "allow roundtripping" is used to mean something like this:
facilitate an easy back-and-forth between the data in an Excel file
and the same data in a df. I.e. while maintaining the intended
structure throughout.
Example round trip
The usefulness of this idea is perhaps best seen if we start with a somewhat complex df with both index and columns as named MultiIndices (for the constructor, see pd.MultiIndex.from_product):
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.rand(4,4),
                  columns=pd.MultiIndex.from_product([['A','B'],[1,2]],
                                                     names=['col_0','col_1']),
                  index=pd.MultiIndex.from_product([[0,1],[1,2]],
                                                   names=['idx_0','idx_1']))
print(df)
col_0                A                   B
col_1                1         2         1         2
idx_0 idx_1
0     1       0.952749  0.447125  0.846409  0.699479
      2       0.297437  0.813798  0.396506  0.881103
1     1       0.581273  0.881735  0.692532  0.725254
      2       0.501324  0.956084  0.643990  0.423855
If we now use df.to_excel with the default for merge_cells (i.e. True) to write this data to an Excel file, we will end up with data as follows:
df.to_excel('file.xlsx')
Result: (screenshot omitted; the sheet contains the two header rows and the two index columns, with repeated labels written as merged cells)
Aesthetics aside, the structure here is very clear, and indeed the same as the structure in our df. Take notice of the merged cells especially.
Now, let's suppose we want to retrieve this data again from the Excel file at some later point, and we use pd.read_excel with default parameters. Problematically, we will end up with a complete mess:
df_mess = pd.read_excel('file.xlsx')  # read into a new name so the original df stays available for the comparison below
print(df_mess)
  Unnamed: 0  col_0         A  Unnamed: 3         B  Unnamed: 5
0        NaN  col_1  1.000000    2.000000  1.000000    2.000000
1      idx_0  idx_1       NaN         NaN       NaN         NaN
2          0      1  0.952749    0.447125  0.846409    0.699479
3        NaN      2  0.297437    0.813798  0.396506    0.881103
4          1      1  0.581273    0.881735  0.692532    0.725254
5        NaN      2  0.501324    0.956084  0.643990    0.423855
Getting this data "back into shape" would be quite time-consuming. To avoid such a hassle, we can rely on the parameters index_col and header inside pd.read_excel:
df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])
print(df2)
col_0                A                   B
col_1                1         2         1         2
idx_0 idx_1
0     1       0.952749  0.447125  0.846409  0.699479
      2       0.297437  0.813798  0.396506  0.881103
1     1       0.581273  0.881735  0.692532  0.725254
      2       0.501324  0.956084  0.643990  0.423855
# check for equality
df.equals(df2)
# True
As you can see, we have made a "round trip" here, and index_col and header allow for it to have been smooth sailing!
Two final notes:
(minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The header section is missing a similar comment (or a reference to the comment at index_col). This is somewhat confusing. As we saw above, the two behave exactly the same (for present purposes, at least).
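As an aside (my addition, not part of the original answer), the same idea covers the simpler single-level case too. A minimal sketch, assuming a plain frame and a throwaway file name:
import pandas as pd

simple = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]}, index=['r1', 'r2'])
simple.to_excel('simple.xlsx')                    # the index is written as the first column
back = pd.read_excel('simple.xlsx', index_col=0)  # index_col=0 restores that column as the index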

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
from copy import deepcopy
outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    for key, value in e.items():
        outdf.loc[index, key] = value
But it doesn't seem Pythonic (or idiomatic pandas), and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the ordering of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333}
]
mylist_df = pd.DataFrame(mylist)
     a    b    c    e    f
0    1    2    3  nan  nan
1  nan  nan   33   11   22
2  111  222  333  nan  nan
Then you can use pd.concat to join it to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is created for every unique key across your dictionaries, regardless of whether a key exists in one dictionary and not in another.
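One caveat worth noting (my addition, not part of the answer above): pd.concat(axis=1) aligns on the index, so if input_df does not carry the same default 0..n-1 index as mylist_df, the new columns will not line up with the intended rows. A small sketch of one way to guard against that:
mylist_df.index = input_df.index                   # give the new columns input_df's row labels
result = pd.concat([input_df, mylist_df], axis=1)  # now the rows pair up positionally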

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are arrays. I simply want to drop all of the rows that contain an array (object dtype, I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up as a list are the ones I want to remove. The dataset is a couple million rows, so I just need code that removes all of the array-like values in that specific dataframe column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pandas function pd.to_numeric:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
With errors='coerce', every cell that cannot be converted into a number is replaced with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, now just use .dropna:
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
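A small caveat (my addition, not part of the original answer): .dropna() with no arguments drops a row if any of its columns holds NaN. If other columns may legitimately contain NaN, it is safer to restrict the check to the coerced column:
df = df.dropna(subset=['ORIGIN_AIRPORT_ID'])
df['ORIGIN_AIRPORT_ID'] = df['ORIGIN_AIRPORT_ID'].astype('int')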

Copy a column value from another dataframe based on a condition

Let us say I have two dataframes: df1 and df2. Assume the following initial values.
df1 = pd.DataFrame({'ID': ['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
                    'Costperlb': [4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2 = df1[df1['Costperlb'] > 4560]
As you can see, df2 is a proper subset of df1 (it was created from df1 by imposing a condition on selection of rows).
I added a column to df2, which contains certain values based on a calculation. Let us call this df2['grade'].
df2['grade']=[1,4,3,5,1,1]
df1 and df2 contain one column named 'ID' which is guaranteed to be unique in each dataframe.
I want to:
Create a new column in df1 and initialize it to 0. Easy. df1['grade']=0.
Copy df2['grade'] values over to df1['grade'], ensuring that df1['ID']=df2['ID'] for each such copy.
The result should be the grade values for the corresponding IDs copied over.
Step 2 is what is perplexing me a bit. A naive df1['grade']=df2['grade'].values obviously does not work, as the lengths of the two dataframes are different.
Now, if I think hard enough, I could possibly come up with a monstrosity like:
df1['grade'].loc[(df1['ID'].isin(df2)) & ...] but I am uncomfortable with doing that.
I am a newbie with Python, and furthermore, the indices of df1 are used elsewhere after this assignment, so I do not want to drop or reset the indices, as some of the solutions in the search results I found suggest.
I just want to find the rows in df1 whose 'ID' value matches an 'ID' value in df2, and then copy over the 'grade' value from that specific row. How do I do this?
Your code:
df1 = pd.DataFrame({'ID': ['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
                    'Costperlb': [4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2 = df1[df1['Costperlb'] > 4560]
df2['grade'] = [1,4,3,5,1,1]
You can use merge with how="left". This keeps every row of df1 in its original order, and since df1 has the default RangeIndex the resulting index comes out the same:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna(0)
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 0.0
1 YTR-789 5856 1.0
2 ASX-124 3313 0.0
3 UYT-908 9909 4.0
4 TYE=456 8980 3.0
5 ERW-234 9088 5.0
6 UUI-675 6765 1.0
7 GHV-805 3456 0.0
8 NMB-653 9012 1.0
9 WSX-123 1237 0.0
Here I called the merged dataframe new_df, but you can simply change it to df1.
EDIT
If instead of 0 you want to replace the NaN with a string, try this:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna("No transaction possible")
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 No transaction possible
1 YTR-789 5856 1
2 ASX-124 3313 No transaction possible
3 UYT-908 9909 4
4 TYE=456 8980 3
5 ERW-234 9088 5
6 UUI-675 6765 1
7 GHV-805 3456 No transaction possible
8 NMB-653 9012 1
9 WSX-123 1237 No transaction possible
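If preserving df1's original index labels matters (they are used elsewhere, per the question), a map-based alternative that assigns in place is also possible; this is my own sketch rather than part of the answer above:
# build a Series mapping ID -> grade from df2, then look it up for each ID in df1
df1['grade'] = df1['ID'].map(df2.set_index('ID')['grade']).fillna(0)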

Pandas: fill in NaN values with dictionary references another column

I have a dictionary that looks like this:
dict = {'b' : '5', 'c' : '4'}
My dataframe looks something like this:
   A    B
0  a    2
1  b  NaN
2  c  NaN
Is there a way to fill in the NaN values using the dictionary mapping from columns A to B while keeping the rest of the column values?
You can map dict values inside fillna
df.B = df.B.fillna(df.A.map(dict))
print(df)
A B
0 a 2
1 b 5
2 c 4
This can be done simply with:
df['B'] = df['B'].fillna(df['A'].apply(lambda x: dict.get(x)))
This works effectively for bigger datasets as well.
Unfortunately, this isn't one of the built-in options of fillna() itself.
Edit: Thanks for the correction. Apparently this is possible, as illustrated in @Vaishali's answer.
However, you can subset the data frame first on the missing values and then apply the map with your dictionary.
df.loc[df['B'].isnull(), 'B'] = df['A'].map(dict)
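For completeness, a self-contained sketch of the fillna + map approach above; I use the name mapping instead of dict so the Python built-in isn't shadowed (an editorial choice, not from the answers themselves):
import numpy as np
import pandas as pd

mapping = {'b': '5', 'c': '4'}
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['2', np.nan, np.nan]})

# fill the gaps in B with a per-row lookup built from column A
df['B'] = df['B'].fillna(df['A'].map(mapping))
print(df)
#    A  B
# 0  a  2
# 1  b  5
# 2  c  4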