Syntax for subsetting when using pandas (printing) - pandas

I am trying to address a certain column, but only the values that are within a subsetting rule for another column.
I tried:
Dataframe[Dataframe[ColumnA == 'Value'][Dataframe[Dataframe[ColumnB]]
Can someone point me in the direction of the correct syntax?
I would use this for printing.

You can access the data using chained indexing as follows. The
Dataframe['ColumnA'] == 'Value'
piece is a boolean mask used to select the rows. You could also use .loc, but I've tried to keep this as similar to your initial approach as possible.
Dataframe[Dataframe['ColumnA'] == 'Value']['ColumnB']
or
Dataframe['ColumnB'][Dataframe['ColumnA'] == 'Value']
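For completeness, a minimal .loc sketch (same Dataframe variable and column names as in the question); .loc avoids chained indexing, which matters if you ever assign to the result rather than just printing it:
Dataframe.loc[Dataframe['ColumnA'] == 'Value', 'ColumnB']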

Related

pandas copy vs slice view

I am fully aware of the pandas dataframe view vs copy issue.
Pandas dataframe index slice view vs copy
I would think the below code <approach 1> will be "safe" and robust:
mydf = mydf[mydf.something == some_condition]
mydf['some column'] = something_else
Note that by doing the above, I replace the parent dataframe altogether, rather than creating a separate view.
<approach 2> Or I use the explicit .copy() method:
mydf = mydf[mydf.something == some_condition].copy()
mydf['some column'] = something_else
In fact, I would think the latter is unnecessary overhead?
However, occasionally (not consistently) I will still receive the warning message below when using the first approach (without the .copy()):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Am I missing any subtlety in approach 1? Or should one always use approach 2 for robustness? Is the .copy() going to be a meaningful overhead?
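A minimal sketch of the two approaches on toy data (whether the warning actually fires depends on whether pandas still sees a live parent frame behind the slice, which is why it shows up only sporadically):
import pandas as pd

df = pd.DataFrame({'something': [1, 2, 1], 'some column': ['a', 'b', 'c']})

# Approach 1: the filtered frame can keep an internal reference to its
# parent, so this assignment may raise SettingWithCopyWarning.
sub = df[df['something'] == 1]
sub['some column'] = 'x'  # may warn

# Approach 2: .copy() severs the link to the parent; no warning is possible.
sub2 = df[df['something'] == 1].copy()
sub2['some column'] = 'x'  # never warns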

Pandas dataframe: create new column using conditional and slicing string

I have this piece of code that creates a new dataframe column, using first a conditional, and then slicing some string, with a fixed slicing index (0, 5):
df.loc[df['operation'] == 'dividend', ['order_adj']] = df['comment'].str.slice(0, 5)
But instead of a fixed slicing index, I need to use str.find() at the end of this code, to get a dynamic slice index on df['comment'] based on its characters.
As I'm creating a new column by broadcasting, I couldn't find the correct syntax to use str.find('some_string') inside str.slice(). Thanks.
Option using split:
df['comment'].str.split("some_string").str[0]
Or an option using a regex (move the capture group depending on whether you want 'some_string' excluded or included):
df['comment'].str.extract(r"(.*?)some_string")
df['comment'].str.extract(r"(.*?some_string)")
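Plugged back into the original assignment, a sketch assuming 'some_string' is the marker you want to slice up to:
mask = df['operation'] == 'dividend'
# everything in 'comment' before the first occurrence of 'some_string'
df.loc[mask, 'order_adj'] = df.loc[mask, 'comment'].str.split('some_string').str[0]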

DataFrame difference between where and query

I was able to solve a problem with pandas thanks to the answer provided in Grouping by with Where conditions in Pandas.
I was first trying to make use of the .where() function like the following:
df['X'] = df['Col1'].where(['Col1'] == 'Y').groupby('Z')['S'].transform('max').astype(int)
but got this error: ValueError: Array conditional must be same shape as self
By writing it like
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
it worked.
I'm trying to understand the difference, as I thought .where() would do the trick.
You have a typo in your first statement: .where(['Col1'] == 'Y') compares a single-element list with 'Y'. I think you meant .where(df['Col1'] == 'Y'); however, this will not work either, because you are narrowing the dataframe to just 'Col1' in front of the where method. This is what you really wanted to do, in my opinion:
df['X'] = df.where(df['Col1'] == 'Y').groupby('Z')['S'].transform('max')
Which is equivalent to using
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
Also, note that the astype(int) is not going to do any good in either of these statements, because of a side effect in pandas: any column with an 'int' dtype that contains a NaN is automatically upcast to 'float' (and calling astype(int) on a column that already contains NaN will raise an error).
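To make the difference concrete, a small sketch with made-up data: .where() keeps the original shape and fills non-matching rows with NaN, while .query() drops those rows entirely.
import pandas as pd

df = pd.DataFrame({'Col1': ['Y', 'N', 'Y'], 'Z': [1, 1, 2], 'S': [10, 20, 30]})

print(df.where(df['Col1'] == 'Y'))  # 3 rows; the 'N' row becomes all NaN
print(df.query('Col1 == "Y"'))      # 2 rows; the 'N' row is dropped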

How to do na_values when creating pandas dataframe from Google BigQuery

I had used pd.read_csv(my_csv, na_values=['N/A', '--']) so that the strings 'N/A' and '--' get interpreted as NULL, NaN, etc.
But when using the BigQuery client, I couldn't figure out how to achieve the same feat. I read the quick help for .to_dataframe(), which "Return[s] a pandas DataFrame from a QueryJob", but it didn't seem to take any extra arguments.
Is this possible? Or do I have to do my own custom post-processing to track missing values?
You can achieve the same as follows (note that applymap returns a new dataframe, so assign the result back):
import numpy as np

dataFrame = dataFrame.applymap(lambda x: np.nan if x in ['N/A', '--'] else x)
If you are running a query before pulling its results into the dataframe, you can easily do it on the BigQuery side, without worrying about filtering your results on the client side.
Something like IF(column IN ('N/A', '--'), NULL, column) AS column should do the job for you.
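Putting the client-side route together, a minimal sketch assuming the google-cloud-bigquery package (the project, dataset, and table names here are hypothetical):
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
df = client.query('SELECT * FROM `my_project.my_dataset.my_table`').to_dataframe()

# post-process on the pandas side, mirroring read_csv's na_values;
# .replace is an alternative to the applymap shown above
df = df.replace(['N/A', '--'], np.nan)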

Defining a function in Pandas

I am new to Pandas and I am taking this course online. I know there is a way to define a function to make this code cleaner but I'm not sure how to go about it.
noshow = len(df[
    (df['Gender'] == 'M')
    & (df['No_show'] == 'Yes')
    & (df['Persons_age'] == 'Child')
])
noshow
There are multiple genders, multiple No_show answers, and multiple values of Persons_age, and I don't want to have to write out the code for each one of those.
I've gotten the code for a single function, but not for the multiple iterations.
def print_noshow_percentage(column_name, value, percentage_text):
    total = (df[column_name] == value).sum()
    noshow = len(df[(df[column_name] == value) & (df['No_show'] == 'Yes')])
    print(int((noshow / total) * 100), percentage_text)
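For example, calling it for one combination looks like this (the label text is just illustrative):
print_noshow_percentage('Gender', 'M', '% of male patients were no-shows')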
I hope this makes sense. Thanks for any help!
Welcome to Stack Exchange. You are not too clear about your desired output, but I think what you are trying to do is get a summary of every possible combination of age, gender, and No_show in your df. To accomplish this you can use pandas' built-in groupby method (documentation here).
As mentioned by @ALollz, the following code will get you everything you need to know about your counts in terms of percentages:
counts = df.groupby(['Gender', 'Persons_age'])['No_show'].value_counts(normalize=True)
Now you need to decide what to do with it. You can iterate through the result printing each line, look up specific combinations, or print out the whole thing.
In general, it is better to look for a built in method than to try to build a function outside of pandas. There are a lot of different ways to do things and checking the documentation is a good place to start.
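A small sketch with made-up data, showing what the groupby produces and how to pull out one combination:
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F'],
    'Persons_age': ['Child', 'Adult', 'Child', 'Child'],
    'No_show': ['Yes', 'No', 'Yes', 'No'],
})

counts = df.groupby(['Gender', 'Persons_age'])['No_show'].value_counts(normalize=True)
print(counts)                         # no-show fractions per (Gender, Persons_age) group
print(counts[('M', 'Child', 'Yes')])  # one specific combination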