Pandas: How can I check if a pandas dataframe contains a specific value?

Specifically, I want to see if my pandas dataframe contains a False.
It's an nxn dataframe, where the indices are labels.
So if all values were True except for even one cell, then I would want to return False.

Your question is a bit confusing, but assuming you want to know whether there is at least one False in a given column of your dataframe, you could simply use
mask = df.mycol == False
mask.value_counts()
mask.sum()
mask.sum() > 0
Any of these will tell you whether a False is present.

If you just want to scan your whole dataframe looking for a value, check df.values. It returns a NumPy array containing all of the dataframe's rows.
value = False  # the value you're searching for
df_contains_value = any(value in row for row in df.values)
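For the whole-frame check the original question asks about, a vectorized sketch (the small df here is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[True, True], [True, False]],
                  index=['a', 'b'], columns=['x', 'y'])

# True if any cell anywhere in the frame equals False
contains_false = (df == False).any().any()

# Equivalently, for an all-boolean frame: not every cell is True
contains_false_alt = not df.all().all()
```

Either form avoids the Python-level loop over df.values.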

Related

Fill empty cell with what's on the next row

I've scraped a PDF table and it came with an annoying formatting feature.
The table has two columns. In some cases, one row holds what should be the column A value and the next row holds what should be the column B value. Like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['names'] = ['John','Mary',np.nan,'George']
df['numbers'] = ['1',np.nan,'2','3']
I want to reformat the dataframe so that wherever there is an empty cell in df['numbers'], it is filled with the value from the next row. Then I apply .dropna() to eliminate the still-wrong cells.
I tried this:
for i in range(len(df)):
    if df['numbers'][i] == np.nan:
        df['numbers'][i] = df['numbers'][i+1]
The dataframe doesn't change, though, and there is no error message either.
What am I missing?
While I don't think this solves all your problems, the reason you are not updating the dataframe is the line
if df['numbers'][i] == np.nan:, since this always evaluates to False: NaN compares unequal to everything, including itself.
To implement a valid test for NaN in this case you must use
if pd.isnull(df['numbers'][i]): — this will evaluate to True or False depending on the cell contents.
This is the solution I found:
df[['numbers']] = df[['numbers']].fillna(method='bfill')
df = df[~df['names'].isna()]
It's probably not the most elegant, but it worked.
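Putting the accepted approach together on the sample data from the question, as a runnable sketch (using .bfill(), the modern spelling of fillna(method='bfill')):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['John', 'Mary', np.nan, 'George'],
                   'numbers': ['1', np.nan, '2', '3']})

# Back-fill each missing number from the next row, then drop the
# leftover rows whose name is missing.
df['numbers'] = df['numbers'].bfill()
df = df[~df['names'].isna()]
```

After this, Mary's row has picked up the '2' that was stranded on the following line, and that following line is gone.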

Remove a specific string value from the whole dataframe without specifying the column or row

I have a dataframe that has some cells with the value of "?". now this value causes an error ("Could not convert string to float: "?") whenever i try to use the multi information metric.
I already found a solution by simply using:
df.replace("?",0,inplace=True)
And it worked. BUT I'm wondering: if I wanted to remove the whole row when one of its cells has the value "?", how can I do that?
Note that I don't know which columns contain this value; it's spread across different columns, which is why I can't use df.drop.
You can check for each cell if they are equal to "?" and then get a boolean series over rows that contain that character in any one of their cells. Then get the indices of rows that gave True and drop them:
has_ques_mark = df.eq("?").any(axis=1) # a boolean series
inds = has_ques_mark[has_ques_mark].index # row indices where above is True
new_df = df.drop(inds)
You can do it the following way:
df.drop(df.loc[df['column_name'] == "?"].index, inplace=True)
or in a slightly simpler syntax but maybe a bit less performant:
df = df.loc[df['column_name'] != "?"]
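The first answer's three steps can also be collapsed into a single boolean-mask filter that works without knowing which columns hold the "?" (the small df here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '?', '3'],
                   'b': ['4', '5', '?']})

# Keep only the rows where no cell equals "?"
clean = df[~df.eq('?').any(axis=1)]
```

Only the first row survives, since the other two each contain a "?" somewhere.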

Assigning value to an iloc slice with condition on a separate column?

I would like to slice my dataframe using iloc (rather than loc) + some condition based on one of the dataframe's columns and assign a value to all the items in this slice (which is effectively a subset of the main dataframe).
My simplified attempt:
df.iloc[:, 1:21][df['column1'] == 'some_value'] = 1
This is meant to take a slice of the dataframe:
All rows;
Columns 2 to 21 (positions 1 to 20);
Then slice it again:
Only the rows where column1 = some_value.
The slicing works fine, but equalling this to 1 doesn't work. Nothing changes in df and I get this warning
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I really need to use iloc rather than loc if possible. It feels like there should be a way of doing this?
You can search for that warning on Stack Overflow. In short, you should do the update in one single loc/iloc call:
df.loc[df['column1']=='some_value', df.columns[1:21]] = 1
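If you really want to keep the positional column slice from the question, note that iloc also accepts a boolean NumPy array as the row indexer, so the condition can be converted and used directly. A sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 25)))
df['column1'] = ['some_value', 'other', 'some_value', 'other']

# Convert the condition to a plain NumPy boolean array so iloc
# accepts it, and do the assignment in one indexing call.
mask = (df['column1'] == 'some_value').to_numpy()
df.iloc[mask, 1:21] = 1
```

Because both indexers live in the same .iloc call, the assignment hits the original frame rather than a copy, so no SettingWithCopy warning is raised.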

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (usually a list of one item per row), so I would like to use get_dummies to one-hot encode all the values. However, a few rows may have no value for that column. Originally the missing value was a NaN, and I have also tried replacing that NaN with an empty list, but in either case the get_dummies result does not contain 0s and 1s for those rows: each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt False ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt False []
111th-congress_senate-bill_852.txt False ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt False ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt False ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode on the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason the third column needs to be a list is that I need to do the same to a related column as well, where each row can have between 0 and n values, with n as large as 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace nan with empty list before applying, but then I fit_transform to create the 0/1 values which can result in no 1's in a row, or many 1's in a row.
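If you'd rather stay inside pandas, explode plus get_dummies gives the same all-zero rows for empty lists, since explode turns an empty list into a single NaN row and get_dummies encodes NaN as all zeros. A sketch with illustrative data:

```python
import pandas as pd

X = pd.DataFrame({'sponsor_list': [['A'], [], ['B'], ['A']]})

# explode() gives each list element its own row (an empty list becomes
# one NaN row); get_dummies leaves the NaN row all zeros; grouping by
# the original index folds multi-element lists back into one row.
dummies = pd.get_dummies(X['sponsor_list'].explode()).groupby(level=0).sum()
X = X.join(dummies)
```

Row 1 (the empty list) ends up with 0 in every generated column, while the others get a single 1.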

Pandas Dataframe: How to get the cell instead of its value

I have a task to compare two dataframes with the same column names but different sizes; call them previous and current. I am trying to get the difference between previous and current in the Quantity and Booked columns and highlight it in yellow. The common key between the two dataframes is the 'SN' column.
I have coded out the following
for idx, rows in df_n.iterrows():
    if rows["Quantity"] == rows['Available'] + rows['Booked']:
        continue
    else:
        rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
        df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
        # pdb.set_trace()
        if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
            df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help
df_n.loc[idx, 'Quantity'] returns a value instead of a dataframe. How can I get a dataframe from one cell? Do I have to use pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity')? Will this create a copy or update the reference?
How do I compare the SN of both dataframes? I'm looking for a better way to compare them. One thing I could think of is to set the index on both dataframes and, when finished using them, reset it back.
My dataframe:
Previous dataframe
Current Dataframe
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type. How can I get a dataframe from one cell. Do I have to pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity'). Will this create a copy or will update the reference?
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
How do I compare the SN of both dataframe, looking for a better way to compare. One thing I could think of is to use set index for both dataframe and when finished using them, reset them back?
You can use df_n.merge(df_o, on='SN') to merge the dataframes and compare the columns side by side.
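A sketch of that merge-based comparison, with illustrative data standing in for the previous and current frames:

```python
import pandas as pd

df_o = pd.DataFrame({'SN': [1, 2], 'Booked': [5, 7]})   # previous
df_n = pd.DataFrame({'SN': [1, 2], 'Booked': [5, 9]})   # current

# Merging on the common key aligns the rows, so the columns can be
# compared side by side; suffixes distinguish the two sources.
merged = df_n.merge(df_o, on='SN', suffixes=('_new', '_old'))
changed = merged[merged['Booked_new'] != merged['Booked_old']]
```

The rows in `changed` are the ones whose Booked value differs between the two frames, which is exactly the set you would then pass to a Styler for highlighting.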