DataFrame difference between where and query - pandas

I was able to solve a problem with pandas thanks to the answer provided in Grouping by with Where conditions in Pandas.
I was first trying to make use of the .where() function like the following:
df['X'] = df['Col1'].where(['Col1'] == 'Y').groupby('Z')['S'].transform('max').astype(int)
but got this error: ValueError: Array conditional must be same shape as self
By writing it like
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
it worked.
I'm trying to understand what the difference is, as I thought .where() would do the trick.

You have a typo in your first statement. .where(['Col1'] == 'Y') compares a single-element list with 'Y'. I think you meant to use .where(df['Col1'] == 'Y'); however, that will not work either, because you narrow the DataFrame down to just the 'Col1' column before calling .where(). This is what you really wanted to do, in my opinion:
df['X'] = df.where(df['Col1'] == 'Y').groupby('Z')['S'].transform('max')
Which is equivalent to using
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
Also, note that the astype(int) is not going to do any good in either of these statements, because one side effect in pandas is that any column with an 'int' dtype that contains a NaN is automatically upcast to 'float'.
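To make the difference concrete, here is a minimal sketch using a made-up frame (the column names Col1, Z and S mirror the question): .where() keeps every row and masks the non-matching ones with NaN, while .query() drops those rows entirely.

import pandas as pd

df = pd.DataFrame({'Col1': ['Y', 'N', 'Y', 'N'],
                   'Z':    ['a', 'a', 'b', 'b'],
                   'S':    [1, 2, 3, 4]})

# .where keeps the original shape and fills non-matching rows with NaN,
# which is why the condition must have the same shape as the object it is called on
masked = df.where(df['Col1'] == 'Y')

# .query drops the non-matching rows entirely, returning a shorter frame
filtered = df.query('Col1 == "Y"')

# Either result gives the same per-group maximum once assigned back;
# rows that were masked or dropped become NaN through index alignment
df['X'] = masked.groupby('Z')['S'].transform('max')
df['X2'] = filtered.groupby('Z')['S'].transform('max')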

Related

pandas copy vs slice view

I am fully aware of the pandas DataFrame view vs copy issue.
Pandas dataframe index slice view vs copy
I would think the code below (approach 1) would be "safe" and robust:
mydf = mydf[mydf.something == some_condition]
mydf['some column'] = something_else
Note that by doing the above I replace the parent DataFrame altogether, rather than keeping a separate view around.
In approach 2, I call the explicit .copy() method:
mydf = mydf[mydf.something == some_condition].copy()
mydf['some column'] = something_else
In fact, I would think the latter is unnecessary overhead?
However, occasionally (not consistently) I still receive the warning below when using the first approach (without the .copy()):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Am I missing some subtlety with approach 1? Or should one always use approach 2 for robustness? Is the .copy() going to be meaningful overhead?
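For reference, here is a self-contained version of the two approaches, with hypothetical column names ('flag' and 'score') standing in for the placeholders above; the comments reflect my reading of why the warning shows up only sometimes.

import pandas as pd

# Approach 1: rebind the name to the filtered result, then assign.
mydf = pd.DataFrame({'flag': [True, False, True], 'score': [1, 2, 3]})
mydf = mydf[mydf['flag']]
# The filtered object still carries an internal reference to the frame it was
# sliced from, so this assignment may or may not raise SettingWithCopyWarning,
# which matches the sporadic behaviour described in the question.
mydf['score'] = 0

# Approach 2: make the copy explicit.
mydf2 = pd.DataFrame({'flag': [True, False, True], 'score': [1, 2, 3]})
mydf2 = mydf2[mydf2['flag']].copy()
# The explicit copy no longer refers back to a parent frame,
# so this assignment does not trigger the warning.
mydf2['score'] = 0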

Syntax for subsetting when using pandas (printing)

I am trying to address a certain column, but only the values that are within a subsetting rule for another column.
I tried:
Dataframe[Dataframe[ColumnA == 'Value'][Dataframe[Dataframe[ColumnB]]
Can someone point me in the direction of the correct syntax?
I would use this for printing.
You can access the data using a chained index as follows. The
Dataframe['ColumnA'] == 'Value'
piece is a boolean mask used to select the matching rows. You could also use .loc, but I've tried to keep this as similar to your initial approach as possible.
Dataframe[Dataframe['ColumnA'] == 'Value']['ColumnB']
or
Dataframe['ColumnB'][Dataframe['ColumnA'] == 'Value']
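Putting it together on a tiny example frame (made-up values), including the .loc form mentioned above:

import pandas as pd

Dataframe = pd.DataFrame({'ColumnA': ['Value', 'Other', 'Value'],
                          'ColumnB': [10, 20, 30]})

# Boolean mask: True for rows where ColumnA equals 'Value'
mask = Dataframe['ColumnA'] == 'Value'

print(Dataframe[mask]['ColumnB'])      # chained indexing: filter rows, then pick the column
print(Dataframe.loc[mask, 'ColumnB'])  # .loc does the row and column selection in one step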

Access pandas DataFrame attributes inside chained methods

Good afternoon,
I have a few .csv files to transform into pandas DataFrames. Although they contain the same type of data in the same columns, the columns have different names. I am trying to do all the small transformations on the fly so that I can concatenate the DataFrames all at once. The problem I am having is that, as far as I know, there is no way to access the attributes of a DataFrame on the fly: first you assign it to a variable, and then you access the data, in the following way:
df = pd.read_csv("my_csv.csv")
df = df.rename(columns=dict(zip(df.columns, [my_columns])))
So I was wondering if anyone knows a way to do something like the following:
(pd.read_csv("my_csv.csv")
.rename(columns=dict(zip(SELF.columns, [my_columns])))
)
where SELF references the DataFrame that has just been created.
So far I have tried, unsuccessfully, to use lambda functions, since I know they can be used to subset the DataFrame by conditions on the just-created object, like [lambda x: x.ColumnA > 20].
Thank you in advance.
EDIT:
I was able to do what I was looking for with the help of .pipe(). I did the following:
def rename_columns(self, columns):
    return self.rename(columns=dict(zip(self.columns, columns)))

(pd.DataFrame([{'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}])
   .pipe(rename_columns, ['b'])
)
You can use .set_axis for this:
(pd.DataFrame(np.random.randn(5, 5))
.set_axis(['A', 'B', 'C', 'D', 'E'], axis=1, inplace=False)
)
axis=1 operates on the column labels. When this answer was written, inplace defaulted to True and was due to change in a future version of pandas; in current releases the inplace argument has been removed, and set_axis always returns a new object, so the inplace=False above is only needed on older versions.
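If you would rather not define a named helper such as rename_columns above, the same self-referencing effect can be written inline by passing a lambda to .pipe. A small sketch (the target column names are arbitrary):

import numpy as np
import pandas as pd

renamed = (pd.DataFrame(np.random.randn(5, 5))
           # .pipe hands the freshly built DataFrame to the lambda as d,
           # so its columns can be read mid-chain without a temporary variable
           .pipe(lambda d: d.rename(columns=dict(zip(d.columns, ['A', 'B', 'C', 'D', 'E'])))))

print(renamed.columns.tolist())  # ['A', 'B', 'C', 'D', 'E']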

What's the difference between .col and ['col'] in pandas

I've been using pandas for a little while now and I've realised that I use
df.col
df['col']
interchangeably. Are they actually the same or am I missing something?
Following on from the link in the comments.
df.col
Simply refers to an attribute of the dataframe, similar to say
df.shape
Now, if 'col' is a column name in the DataFrame, then accessing this attribute returns the column as a Series. This will sometimes be sufficient, but
df['col']
will always work, and can also be used to add a new column to a dataframe.
I think this is kind of obvious, but:
You cannot use df.col if the column name 'col' has a space in it, whereas df['col'] always works.
e.g.,
df['my col'] works, but df.my col will not.
I'll note that there is a difference in how some methods consume the data. For example, in the Lifetimes library, if I pass dataframe.col to some methods, the method treats the column as an ndarray and throws an exception saying the data must be 1-dimensional.
If, however, I use dataframe['col'], the method consumes the data as expected.
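A short sketch of the cases where the two spellings diverge (made-up data):

import pandas as pd

df = pd.DataFrame({'col': [1, 2], 'my col': [3, 4], 'shape': [5, 6]})

print(df['col'])     # always works
print(df.col)        # works: 'col' is a valid identifier and not an existing attribute

print(df['my col'])  # works despite the space
# df.my col          # SyntaxError: attribute access cannot contain a space

print(df['shape'])   # returns the 'shape' column
print(df.shape)      # returns the (rows, columns) tuple, not the column

df['new'] = 0        # adds a new column
df.other = 0         # sets a plain instance attribute, not a column (pandas warns about this)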

How to Mutate a DataFrame?

I am trying to remove some columns from my data frame, and would prefer not to return the modified data frame and reassign it to the old name. Instead, I would like the function to just modify the data frame in place. This is what I tried, but it does not seem to do what I expect. I was under the impression that arguments are passed by reference and not by value?
function remove_cols!(df::DataFrame, cols)
    df = df[setdiff(names(df), cols)];
end
df = DataFrame(x = [1:10], y = [11:20]);
remove_cols!(df, [:y]); # this does not modify the original data frame
Of course the below works, but I would prefer that remove_cols! just changed df in place:
df = remove_cols!(df, [:y]);
How can I change the df in place inside my function?
Thanks!
As I understand it, Julia uses what is called pass-by-sharing, meaning that the reference is passed by value. When you pass the DataFrame to the function, a new local binding to that DataFrame is created inside the function. Reassigning the local df variable only rebinds that local name; it has no effect on the separate global variable, which still refers to the original DataFrame.
DataFrames.jl also provides a function for deleting columns from a DataFrame in place (select!(df, Not(cols)) in current releases; older versions used delete!).
To answer the more general question of how to mutate a DataFrame inside your own function, the key is to use functions and operations that mutate the DataFrame in place within the function. For example, see the function below, which builds on the standard append! function with some added benefits: it can append any number of DataFrames, the order of columns does not matter, and missing columns are added to the DataFrames:
function append_with_missing!(df1::DataFrame, dfs::AbstractDataFrame...)
    columns = Dict{Symbol, Type}(zip(names(df1), colwise(eltype, df1)))
    for df in dfs
        columns_temp = Dict(zip(names(df), colwise(eltype, df)))
        merge!(columns, columns_temp)
    end
    for (n, t) in columns, df in [df1; [i for i in dfs]]
        n in names(df) || (df[n] = Vector{Union{Missing,t}}(missing, size(df, 1)))
    end
    for df in dfs
        append!(df1, df[names(df1)])
    end
end
Here, the first DataFrame passed in is itself mutated, with rows appended from the other DataFrames.
(The functionality for adding missing columns is based upon the answer given by Bogumił Kamiński here: Breaking change on vcat when columns are missing.)