Pandas boolean indexing with column boolean array

Dataset: The DataFrame I am working on is named 'f500'. Here are the first five rows of the dataframe.
Goal: Select only the columns with numeric values.
What I've tried:
1) I tried to use a boolean array to filter out the non-numeric columns, and there was no error:
numeric_only_bool = (f500.dtypes != object)
2) However, when I tried to index with that boolean array, an error occurred:
numeric_only = f500[:, numeric_only_bool]
I have seen index-wise (row-wise) boolean indexing examples, but could not find column-wise boolean indexing.
Can anyone help me fix this code?
Thank you in advance.

Use DataFrame.loc:
numeric_only = f500.loc[:, numeric_only_bool]
Another solution with DataFrame.select_dtypes:
#only numeric columns (np is the usual numpy import)
numeric_only = f500.select_dtypes(np.number)
#exclude object columns
numeric_only = f500.select_dtypes(exclude=object)
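For reference, a minimal self-contained sketch of both approaches (the tiny frame below is an illustrative stand-in, not the real f500 data):
import numpy as np
import pandas as pd

# toy stand-in for f500: one object column, two numeric columns
f500 = pd.DataFrame({'company': ['A', 'B'], 'revenue': [10, 20], 'rank': [1, 2]})

numeric_only_bool = (f500.dtypes != object)   # boolean array over columns
via_loc = f500.loc[:, numeric_only_bool]      # column-wise boolean indexing
via_select = f500.select_dtypes(np.number)    # same columns via select_dtypes
print(via_loc.columns.tolist())               # ['revenue', 'rank']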

Related

How to type cast an Array as an empty String array

I can type cast NULL as a string.
How can I type cast an empty array as an empty array of strings?
I need to solve it inside the SQL query.
The following snippet throws a ValueError: Some of types cannot be determined after inferring
df = spark.sql("select Array()").collect()
display(df)
I only found a somewhat roundabout way of doing this with pure SQL:
select from_json("[]", "array<string>")
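A runnable sketch of that workaround, assuming an active SparkSession named spark:
# from_json parses the literal "[]" into a typed, zero-element array<string>
df = spark.sql('select from_json("[]", "array<string>") as empty_arr')
df.printSchema()   # empty_arr: array with string elements
df.show()          # shows [] - an empty array, not an array of one empty string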
I think just keeping the quotes gives an array of strings, though note it contains a single empty string rather than zero elements:
df = spark.sql("select array('')").collect()
display(df)

Aggregating DataFrame string columns expected to be the same

I am calling DataFrame.agg on a dataframe with various numeric and string columns. For string columns, I want the result of the aggregation to be (a) the value of an arbitrary row if every row has that same string value or (b) an error otherwise.
I could write a custom aggregation function to do this, but is there a canonical way to approach this?
You can test whether each column is numeric and apply an aggregate function like sum; for string columns, return the first value if all values are the same, otherwise raise an error:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a':['s','s3'], 'b':[5,6]})

def f(x):
    # numeric columns: aggregate normally, e.g. sum
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    # string columns: require every value to equal the first
    if x.eq(x.iat[0]).all():
        return x.iat[0]
    raise ValueError('not same strings values')

s = df.agg(f)   # raises ValueError here: column 'a' mixes 's' and 's3'
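For completeness, the success path on a frame whose string column is uniform (illustrative data):
df2 = pd.DataFrame({'a': ['s', 's'], 'b': [5, 6]})
s = df2.agg(f)
print(s)
# a     s
# b    11
# dtype: object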

Remove a specific string value from the whole dataframe without specifying the column or row

I have a dataframe that has some cells with the value "?". This value causes an error ("could not convert string to float: '?'") whenever I try to use the mutual information metric.
I already found a solution by simply using:
df.replace("?",0,inplace=True)
And it worked. BUT I'm wondering: if I wanted to remove the whole row when one of its cells has the value "?", how can I do that?
Note that I don't know which columns contain this value; it's spread across different columns, and that's why I can't use df.drop.
You can check each cell for equality with "?" to get a boolean Series that is True for rows containing it in any cell, then take the indices of the True rows and drop them:
has_ques_mark = df.eq("?").any(axis=1) # a boolean series
inds = has_ques_mark[has_ques_mark].index # row indices where above is True
new_df = df.drop(inds)
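Equivalently, as a single boolean-indexing step (a minimal sketch of the same idea):
new_df = df[~df.eq("?").any(axis=1)]   # keep only rows with no "?" in any cell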
You can do it the following way:
df.drop(df.loc[df['column_name'] == "?"].index, inplace=True)
or with a slightly simpler syntax, though maybe a bit less performant:
df = df.loc[df['column_name'] != "?"]

How to apply custom string matching function to pandas dataframe and return summary dataframe about correct/ incorrect patterns?

I have written a pattern-matching function to classify whether a dataframe column value matches a given pattern or not. I created a column 'Correct_Pattern' to store the boolean answers in that dataframe. I also created a new dataframe called Incorrect_Pattern_df, which contains only the values that do not match the desired pattern; I did this because I later want to see whether I can correct those incorrect numbers. Now, every time I correct a batch of numbers, I would like to check the number format again and regenerate Incorrect_Pattern_df. Please see my code below. What do I need to do to make it work?
#data
mylist = ['850/07-498745', '850/07-148465', '07-499015']
#create dataframe
df = pd.DataFrame(mylist)
df.rename(columns={df.columns[0]: "mycolumn"}, inplace=True)
#function to check if my numbers follow the correct pattern
def check_number_format(dataframe, rm_pattern, column_name):
    #create a column Correct_pattern that contains a boolean, depending on whether the pattern was matched or not
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(pattern)
    #filter all incorrect patterns and put them in a dataframe called Incorrect_Pattern_df
    Incorrect_Pattern_df = dataframe[dataframe.Correct_pattern == False]
    #return both the original dataframe with the added Correct_pattern column and the dataframe containing the Incorrect_Pattern_df
    return Incorrect_Pattern_df
#apply check_number_format to the dataframe
Incorrect_Pattern_df = df['mycolumn'].apply(check_number_format, args=(df, r'^\d{2}-\d+$', 'mycolumn'))
The desired output should look as follows:
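A minimal sketch of one possible fix, assuming the goal is simply to regenerate Incorrect_Pattern_df after each correction round: use the rm_pattern parameter inside the function (the name pattern is undefined there) and call the function once on the whole DataFrame instead of going through Series.apply:
def check_number_format(dataframe, rm_pattern, column_name):
    # flag each value: True if it matches the desired pattern
    dataframe['Correct_pattern'] = dataframe[column_name].str.match(rm_pattern)
    # keep only the rows that failed the check
    return dataframe[~dataframe['Correct_pattern']]

Incorrect_Pattern_df = check_number_format(df, r'^\d{2}-\d+$', 'mycolumn')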

Pandas: How can I check if a pandas dataframe contains a specific value?

Specifically, I want to see if my pandas dataframe contains a False.
It's an nxn dataframe, where the indices are labels.
So if all values were True except for even one cell, then I would want to return False.
Your question is a bit confusing, but assuming you want to know whether there is at least one False in your dataframe, you could simply use:
mask = df.mycol == False
mask.value_counts()
mask.sum()
mask.sum() > 0
Any of these will tell you.
If you just want to scan your whole dataframe looking for a value, check df.values. It returns an array of all the rows in the dataframe:
value = False  # the value you're searching for
df_contains_value = any(value in row for row in df.values)
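Alternatively, a fully vectorized whole-frame check (a small illustrative sketch):
import pandas as pd

df = pd.DataFrame([[True, True], [True, False]])
contains_false = df.eq(False).any().any()   # True if any cell equals False
print(contains_false)                       # prints True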