I've scraped a PDF table and it came with an annoying formatting quirk.
The table has two columns. In some cases, one row ended up with only what should be the column A value, and the next row with only what should be the column B value. Like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['names'] = ['John', 'Mary', np.nan, 'George']
df['numbers'] = ['1', np.nan, '2', '3']
I want to reformat that dataframe so that wherever there is an empty cell in df['numbers'], it is filled with the value from the next line. Then I can apply .dropna() to eliminate the still-wrong cells.
I tried this:
for i in range(len(df)):
    if df['numbers'][i] == np.nan:
        df['numbers'][i] = df['numbers'][i + 1]
No change in the dataframe, though. No error message, either.
What am I missing?
While I don't think this solves all your problems, the reason you are not updating the dataframe is the line
if df['numbers'][i] == np.nan:, since this always evaluates to False. np.nan compares unequal to everything, including itself.
To implement a valid test for NaN in this case you must use
if pd.isnull(df['numbers'][i]): which will evaluate to True or False depending on the cell contents.
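A minimal sketch of the corrected loop (cell-by-cell writes are slow and chained indexing can silently write to a copy, so .loc is used for the assignment; this only illustrates the fix, not the recommended approach):
for i in range(len(df) - 1):  # stop before the last row so i + 1 stays in bounds
    if pd.isnull(df['numbers'][i]):  # valid NaN test
        df.loc[i, 'numbers'] = df['numbers'][i + 1]  # .loc writes to the original frame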
This is the solution I found:
df['numbers'] = df['numbers'].bfill()  # back-fill; fillna(method='bfill') is deprecated in newer pandas
df = df[~df['names'].isna()]  # drop the leftover rows with no name
It's probably not the most elegant, but it worked.
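For what it's worth, the same idea can be written as one chained expression (a sketch equivalent to the two lines above):
df = df.assign(numbers=df['numbers'].bfill()).dropna(subset=['names'])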
Related
I have a dataframe that has some cells with the value "?". This value causes an error ("could not convert string to float: '?'") whenever I try to use the mutual information metric.
I already found a solution by simply using:
df.replace("?", 0, inplace=True)
And it worked. But I'm wondering: if I wanted to remove the whole row when one of its cells has the value "?", how can I do that?
Note that I don't know which columns contain this value; it's spread across different columns, and that's why I can't use df.drop.
You can check each cell for equality with "?" to get a boolean series marking the rows that contain it in any of their cells, then take the indices of the rows that gave True and drop them:
has_ques_mark = df.eq("?").any(axis=1) # a boolean series
inds = has_ques_mark[has_ques_mark].index # row indices where above is True
new_df = df.drop(inds)
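Equivalently, the boolean mask can be used directly, without extracting the indices first:
new_df = df[~df.eq("?").any(axis=1)]  # keep only rows with no "?" in any cell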
You can do it the following way:
df.drop(df.loc[df['column_name'] == "?"].index, inplace=True)
or in a slightly simpler syntax but maybe a bit less performant:
df = df.loc[df['column_name'] != "?"]
I would like to slice my dataframe using iloc (rather than loc) plus a condition based on one of the dataframe's columns, and assign a value to all the items in this slice (which is effectively a subset of the main dataframe).
My simplified attempt:
df.iloc[:, 1:21][df['column1'] == 'some_value'] = 1
This is meant to take a slice of the dataframe:
All rows;
Columns 2 through 21 (positions 1:21);
Then slice it again:
Only the rows where column1 = some_value.
The slicing works fine, but assigning 1 to it doesn't work. Nothing changes in df, and I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I really need to use iloc rather than loc if possible. It feels like there should be a way of doing this?
You can search for that warning on SO. In short, you should do the update in one single loc/iloc call:
df.loc[df['column1']=='some_value', df.columns[1:21]] = 1
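If positional column selection really is required, note that .iloc accepts a boolean NumPy array (but not a boolean Series) as the row indexer, so a sketch of an iloc-only equivalent is:
mask = (df['column1'] == 'some_value').to_numpy()  # plain boolean array, not a Series
df.iloc[mask, 1:21] = 1  # single indexing operation, so the assignment sticks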
I'm trying to create a new column in my dataframe based on the condition below:
dataFrame01['final'] = dataFrame01.apply(lambda x: x['Name'] if x['Eval'] == 'NAN' else x['Eval'], axis=1)
but every time only the ELSE branch executes; the values from the else condition get populated, but never the ones from the if condition.
Please help and let me know what mistake I am making here.
Hard to say without seeing the data. It appears as though the expression below never evaluates to True:
x['Eval'] == 'NAN'
As a hunch, check that you are specifying your NaN correctly. In pandas, missing values are typically represented as np.nan, and the string 'NAN' will not match them. One way to test for missing values in pandas is pd.isnull(). Thus, the code would look something like this:
dataFrame01['final'] = dataFrame01.apply(lambda x: x['Name'] if pd.isnull(x['Eval']) else x['Eval'], axis=1)
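If the column might hold either real NaNs or the literal string 'NAN' (an assumption, since the data isn't shown), a sketch that covers both cases with numpy.where instead of apply:
import numpy as np

# rows where Eval is missing or is the literal string 'NAN'
mask = dataFrame01['Eval'].isna() | dataFrame01['Eval'].eq('NAN')
dataFrame01['final'] = np.where(mask, dataFrame01['Name'], dataFrame01['Eval'])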
Specifically, I want to see if my pandas dataframe contains a False.
It's an nxn dataframe, where the indices are labels.
So if all values were True except for even one cell, then I would want to return False.
Your question is a bit confusing, but assuming you want to know whether there is at least one False in your dataframe, you could simply use
mask = df.mycol == False
mask.value_counts()
mask.sum()
mask.sum() > 0
Any of these will tell you the truth.
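Since the question is about the whole dataframe rather than a single column, a frame-wide check (a sketch) would be:
has_false = not df.all(axis=None)  # True if any cell anywhere is False; same as (df == False).any().any()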
If you just want to scan your whole dataframe looking for a value, check df.values. It returns a NumPy array of the dataframe's rows.
value = False  # this is the value you're searching for
df_contains_value = any([value in row for row in df.values])
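A vectorized alternative that skips the Python-level loop (equivalent for simple dtypes like booleans and numbers):
df_contains_value = (df.to_numpy() == value).any()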
I am indexing a subset of cells from a DataFrame column and attempting to assign a boolean True to said subset:
df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25] = True
However, when I then slice df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25], the initial values are still there. In other words, the changes were not applied!
I'm about to rip my hair out. What am I doing wrong?
Try this:
df.loc[df['myothercolumn']==val, some_column_name] = True
some_column_name should be the name of the column you want to add or change.
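Note that this sets every row matching the condition. If the positional idx: idx + 25 window from the question is also needed, one sketch is to slice the matching index labels first and assign through a single .loc call:
rows = df.index[df['myothercolumn'] == val][idx: idx + 25]  # positional slice of the matching labels
df.loc[rows, 'mycolumn'] = True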