Drop rows where a subset of columns are empty in Pandas - pandas

I have a pandas dataframe in the below format
No ei1 ei2 ei3 ei4 ei1_val ei2_val ei3_val ei4_val
123
124
125 0 0 0 1 low low high high
To simplify, I have shown only a subset of columns here but actually the pandas dataframe has columns from ei1 to ei24 and ei1_val to ei24_val.
I have retrieved the column names using the below code:
val_cols = df[[col for col in df.columns if col.endswith("_val")]]
cols = [col.replace('_val', '') for col in val_cols.columns]
After that, I need to drop the rows from dataframe df if all columns in val_cols and all columns in cols are empty. The output dataframe would therefore drop the rows with No 123 and 124. Is there a way to do this efficiently in Pandas, rather than looping over the columns and checking the values?
Any suggestions would be appreciated.

IIUC, try:
m = ~df.filter(regex='_val$').isna().all(axis=1)
df[m]
Output:
No ei1 ei2 ei3 ei4 ei1_val ei2_val ei3_val ei4_val
2 125 0.0 0.0 0.0 1.0 low low high high
Select all the columns whose header ends with _val using a regex in the pd.DataFrame.filter method.
Check whether all of those values are NaN in each row using isna and all with axis=1, then negate the mask with ~ so that only rows with at least one value are kept.
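The question asks for both the ei* and the ei*_val columns to be empty, while the mask above only looks at the _val columns. A minimal sketch that covers both groups with dropna is shown here. It assumes that "empty" means NaN/None (not empty strings), and the small sample frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two ei/ei_val pairs instead of 24
df = pd.DataFrame({
    "No": [123, 124, 125],
    "ei1": [np.nan, np.nan, 0],
    "ei2": [np.nan, np.nan, 1],
    "ei1_val": [None, None, "low"],
    "ei2_val": [None, None, "high"],
})

# Column names ending in _val, and the matching base names
val_cols = [c for c in df.columns if c.endswith("_val")]
base_cols = [c[: -len("_val")] for c in val_cols]

# Drop a row only when every column in both groups is missing
result = df.dropna(subset=val_cols + base_cols, how="all")
print(result)
```

If the empty cells are actually empty strings, they would need to be converted to NaN first, for example with df.replace("", np.nan).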

Related

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I have also a list, with the same len as the number of rows in the dataframe.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something on the line of:
from copy import deepcopy
outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    for key, value in e.items():  # iterate over key/value pairs, not just the keys
        outdf.loc[index, key] = value
But it doesn't seem pythonic and pandasnic and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the sorting of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
    {'a':1,'b':2,'c':3},
    {'e':11,'f':22,'c':33},
    {'a':111,'b':222,'c':333}
]
mylist_df = pd.DataFrame(mylist)
     a    b    c    e    f
0    1    2    3  nan  nan
1  nan  nan   33   11   22
2  111  222  333  nan  nan
Then you can use pd.concat to join the new columns to your input data frame (with axis=1, pd.concat aligns rows on the index, so both frames need the same index):
result = pd.concat([input_df, mylist_df], axis=1)
In this way, a column is always created for every unique key in your dictionaries, regardless of whether it exists in one dictionary and not in another.
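As a rough end-to-end sketch of the concat approach: the frame contents and the index labels r1..r3 are made up for illustration, and the variable names indf and l follow the question. Passing index=indf.index when building the second frame is one way to keep the rows aligned when the input frame does not use the default 0..n-1 index:

```python
import pandas as pd

# Hypothetical input frame with a non-default index
indf = pd.DataFrame({"x": [10, 20, 30]}, index=["r1", "r2", "r3"])

# One dictionary per row, in the same order as the rows of indf
l = [
    {"a": 1, "b": 2, "c": 3},
    {"e": 11, "f": 22, "c": 33},
    {"a": 111, "b": 222, "c": 333},
]

# Build all new columns at once and reuse the index of indf for alignment
extra = pd.DataFrame(l, index=indf.index)

# A single concat avoids the fragmentation caused by adding cells one by one
outdf = pd.concat([indf, extra], axis=1)
print(outdf)
```

Keys missing from a given dictionary simply end up as NaN in that row.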

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up like a list are the ones I want to remove. The dataset is a couple of million rows, so I just need code that removes all of the array-like values in that specific dataframe column.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted to a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, use .dropna and then cast the columns back to integers:
df = df.dropna().astype('int')
This results in your desired DataFrame:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
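Note that df.dropna() without arguments drops rows with a NaN in any column, and astype('int') casts every column. A slightly more targeted sketch is shown here, under the assumption that the list-like cells are stored as strings (as they would be after reading a CSV); the sample values mirror the data above:

```python
import pandas as pd

# Sample data; the list-like cell is assumed to be a string
df = pd.DataFrame({
    "ITIN_ID": [20194146, 20194147, 20194148, 20194149, 20194150],
    "ORIGIN_AIRPORT_ID": [
        "10397", "10397", "10397",
        "[10397, 10398, 10399, 10400]",
        "10397",
    ],
})

# Values that cannot be parsed as a single number become NaN
numeric_ids = pd.to_numeric(df["ORIGIN_AIRPORT_ID"], errors="coerce")

# Keep only the rows where this one column parsed successfully
keep = numeric_ids.notna()
clean = df[keep].copy()
clean["ORIGIN_AIRPORT_ID"] = numeric_ids[keep].astype("int64")
print(clean)
```

Only the ORIGIN_AIRPORT_ID column decides which rows are dropped here, so missing values in other columns do not remove rows.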

Pandas Check Excel import for isnumeric() data in Dataframe

I am importing data into Pandas from Excel and I need to verify that the data is numeric based on the Columns.
month value dp wd ... mg fee pr comment
0 2013-07-31 208372.33 4206.84 4692.22 ... 0 0 0 some comment
1 2013-08-31 210669.77 0.00 1270.28 ... 0 0 0
There are about 20 columns and I only need to exclude the "month" and "comment" columns.
Is there something like: df.iloc[:, 2: 18].isnumeric() or will this require a loop?
I would like to get a True / False response.
thank you.
One way is select_dtypes and compare:
np.array_equal(df.select_dtypes(include='number').columns, df.columns[1:-1])
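You can use the apply method to apply Series methods to each column of a dataframe. Note that .str only works on string columns, so the columns are cast to str first, and that str.isnumeric is True only for strings made up entirely of digit characters, so values such as 4206.84 or -5 come back False.
df2 = df.drop(["month", "comment"], axis=1)
df2 = df2.apply(lambda x: x.astype(str).str.isnumeric())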
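If the goal is a True / False answer that also accepts decimals and negative numbers, one possible sketch is to try a conversion with pd.to_numeric and see what fails. The sample frame and the bad cell "oops" are made up for illustration, and treating empty cells as acceptable is an assumption:

```python
import pandas as pd

# Hypothetical import result; "dp" contains one non-numeric cell
df = pd.DataFrame({
    "month": ["2013-07-31", "2013-08-31"],
    "value": [208372.33, 210669.77],
    "dp": ["4206.84", "oops"],
    "comment": ["some comment", ""],
})

to_check = df.drop(columns=["month", "comment"])

# Cells that cannot be parsed as numbers become NaN
converted = to_check.apply(pd.to_numeric, errors="coerce")

# A cell passes if it parsed, or if it was already missing in the input
cell_ok = converted.notna() | to_check.isna()

per_column = cell_ok.all()            # one True/False per column
all_numeric = bool(per_column.all())  # single True/False for the whole frame
print(per_column)
print(all_numeric)
```

The cell_ok frame can also be used to locate the offending cells before fixing the Excel file.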

Copy a column value from another dataframe based on a condition

Let us say I have two dataframes: df1 and df2. Assume the following initial values.
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
As you can see, df2 is a proper subset of df1 (it was created from df1 by imposing a condition on selection of rows).
I added a column to df2, which contains certain values based on a calculation. Let us call this df2['grade'].
df2['grade']=[1,4,3,5,1,1]
df1 and df2 contain one column named 'ID' which is guaranteed to be unique in each dataframe.
I want to:
Create a new column in df1 and initialize it to 0. Easy. df1['grade']=0.
Copy df2['grade'] values over to df1['grade'], ensuring that df1['ID']=df2['ID'] for each such copy.
The result should be the grade values for the corresponding IDs copied over.
Step 2 is what is perplexing me a bit. A naive df1['grade']=df2['grade'].values obviously does not work, as the lengths of the two dataframes are different.
Now, if I think hard enough, I could possibly come up with a monstrosity like:
df1['grade'].loc[(df1['ID'].isin(df2)) & ...] but I am uncomfortable with doing that.
I am a newbie with Python, and furthermore, the indices of df1 are being used elsewhere after this assignment, so I do not want to drop or reset indices, as some of the solutions in the search results I found suggest.
I just want to find the rows in df1 where the 'ID' value matches an 'ID' value in df2, and then copy the 'grade' column value of that specific row over. How do I do this?
Your code:
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
df2['grade']=[1,4,3,5,1,1]
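You can use merge with how="left". This keeps every row of df1 in its original order. Note that merge returns a fresh 0..n-1 index, which matches the index of df1 here only because df1 uses the default index: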
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna(0)
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 0.0
1 YTR-789 5856 1.0
2 ASX-124 3313 0.0
3 UYT-908 9909 4.0
4 TYE=456 8980 3.0
5 ERW-234 9088 5.0
6 UUI-675 6765 1.0
7 GHV-805 3456 0.0
8 NMB-653 9012 1.0
9 WSX-123 1237 0.0
Here I called the merged dataframe new_df, but you can simply change it to df1.
EDIT
If instead of 0 you want to replace the NaN with a string, try this:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna("No transaction possible")
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 No transaction possible
1 YTR-789 5856 1
2 ASX-124 3313 No transaction possible
3 UYT-908 9909 4
4 TYE=456 8980 3
5 ERW-234 9088 5
6 UUI-675 6765 1
7 GHV-805 3456 No transaction possible
8 NMB-653 9012 1
9 WSX-123 1237 No transaction possible
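If the index of df1 must stay untouched, an alternative sketch is to look the grades up by ID with Series.map instead of merging. This is not part of the answer above, just another option; it relies on the IDs in df2 being unique, as stated in the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['ASX-112', 'YTR-789', 'ASX-124', 'UYT-908', 'TYE=456',
           'ERW-234', 'UUI-675', 'GHV-805', 'NMB-653', 'WSX-123'],
    'Costperlb': [4515, 5856, 3313, 9909, 8980, 9088, 6765, 3456, 9012, 1237],
})

# .copy() avoids a SettingWithCopyWarning when adding the new column
df2 = df1[df1['Costperlb'] > 4560].copy()
df2['grade'] = [1, 4, 3, 5, 1, 1]

# Lookup table: ID -> grade
grade_by_id = df2.set_index('ID')['grade']

# map keeps the index of df1; IDs missing from df2 become NaN, then 0
df1['grade'] = df1['ID'].map(grade_by_id).fillna(0).astype(int)
print(df1)
```

Because the assignment goes through df1['ID'], no rows are reordered and no index is reset.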

How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently?

I have a Dataframe with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second Dataframe with the same ID column and two numeric columns specifying the lower and upper limit for the X values of each ID.
I want to use the second Dataframe to filter the first Dataframe so that it only keeps rows whose X values lie within the X_min-X_max range of the specific ID.
I can solve this by looping over the second Dataframe and filtering the groupby(ID) elements of the first Dataframe, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real Dataframe is obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible Solution:
def filter_ranges(grp,df_ranges):
    x_min=df_ranges.loc[df_ranges.ID==grp.name,'X_min'].values[0]
    x_max=df_ranges.loc[df_ranges.ID==grp.name,'X_max'].values[0]
    return grp.loc[(grp.X>=x_min)&(grp.X<=x_max),:]
target_df_grp=df.groupby('ID').apply(filter_ranges,df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X>=merged.X_min)&(merged.X<=merged.X_max)][['ID', 'X', 'Y']] # Here, desired filter is applied.
print(target_df) will give:
ID X Y
0 0.0 2.1 3.1
1 0.0 2.2 3.5
3 0.0 2.4 2.7
7 0.1 3.2 4.3
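A self-contained variant of the same idea, sketched with Series.between and a left merge so that the mask can be applied back to df and the original index of df is kept. It assumes every ID appears exactly once in df_ranges, which validate="many_to_one" checks:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0] * 4 + [0.1] * 4,
    "X": [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2],
    "Y": [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3],
})
df_ranges = pd.DataFrame({
    "ID": [0, 0.1],
    "X_min": [2.0, 3.0],
    "X_max": [2.5, 3.4],
})

# A left merge keeps one output row per row of df, in the same order
merged = df.merge(df_ranges, on="ID", how="left", validate="many_to_one")

# Row-wise check X_min <= X <= X_max (both bounds inclusive by default)
mask = merged["X"].between(merged["X_min"], merged["X_max"])

# Apply the mask positionally so the index of df is preserved
target_df = df[mask.to_numpy()]
print(target_df)
```

Rows whose ID has no entry in df_ranges get NaN bounds after the left merge and are therefore filtered out.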