How can you call different aggregate operations on a pandas groupby object

df.groupby("home_team_name")[["home_team_goal_count", "away_team_goal_count"]].sum()
I want to group the examples in my dataframe based on the variable home_team_name, and I would like to perform different operations on different attributes: the sum of one of them, the mean of another, and the last occurrence of a third.
As of now I only know how to perform the same operation on all of them, as in my code example.

You can do:
import numpy as np
df.groupby("home_team_name").agg({'home_team_goal_count': sum,
                                  'away_team_goal_count': np.mean})
Refer to the documentation for more examples.
To get the last value, you could do:
df.groupby("home_team_name").agg({'home_team_goal_count': 'last',
                                  'away_team_goal_count': 'last'})
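A single agg call can also mix all three reductions from the question. A minimal sketch, where the third column stadium_name is invented for illustration:
import pandas as pd

# Hypothetical miniature of the frame in the question; 'stadium_name'
# is an invented column used to demonstrate 'last'.
df = pd.DataFrame({
    'home_team_name': ['A', 'A', 'B'],
    'home_team_goal_count': [1, 2, 3],
    'away_team_goal_count': [0, 2, 1],
    'stadium_name': ['Old', 'New', 'East'],
})

# One agg call, a different reduction per column.
out = df.groupby('home_team_name').agg({
    'home_team_goal_count': 'sum',   # total scored
    'away_team_goal_count': 'mean',  # average conceded
    'stadium_name': 'last',          # last occurrence per group
})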

Custom distance function between every row of two dataframes

I have two dataframes, and I want to calculate the "distance" between every row in one dataframe and every row in the other, using a custom distance measure (for example, Euclidean for the first column, taxicab for the second, etc.). Is there a way to do this quickly using broadcasting?
You can create a custom function and use it with apply. For example:
def custom_func(a, b):
    result = a + b  # for example
    return result

df = df.apply(custom_func, args=(10,))  # each column is passed as `a`, 10 as `b`
If you want an answer tested on some dataset, please add the data. But the idea is that you can create any custom function and pass it to pandas' apply.
More about the apply function, with examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
You may also want to check the applymap function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
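To address the broadcasting part of the question directly, here is a minimal sketch; the frames, column names, and the mixed metric are invented for illustration:
import numpy as np
import pandas as pd

# Hypothetical frames; every row of `a` is compared with every row of `b`.
a = pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [0.0, 1.0, 2.0]})
b = pd.DataFrame({'x': [1.0, 3.0], 'y': [0.0, 4.0]})

# Broadcasting: a column reshaped to (n, 1) minus one shaped (1, m)
# yields the full (n, m) matrix of pairwise differences in one step.
dx = a['x'].to_numpy()[:, None] - b['x'].to_numpy()[None, :]
dy = a['y'].to_numpy()[:, None] - b['y'].to_numpy()[None, :]

# Mix metrics per column: Euclidean-style for 'x', taxicab for 'y'.
dist = np.sqrt(dx ** 2) + np.abs(dy)  # shape (len(a), len(b))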

Extracting data as a list from a Pandas dataframe while preserving order

Suppose I have some Pandas dataframe df that has a column called "HEIGHT", among many other columns.
If I issue list(df["HEIGHT"]), then this will give me a list of the items in that column in the exact order in which they were in the dataframe, i.e. ordered by the index of the dataframe.
Is that always the case? The df["HEIGHT"] command will return a Series, and list() will convert it to a list. But are these operations always order-preserving? Interestingly, in the book by the pandas author (!), from my reading so far, it is unclear to me when these elementary operations preserve order; is order perhaps always preserved, or is there some simple rule for knowing when order is preserved?
The order of elements in a pandas Series (i.e., a column in a pandas DataFrame) will not change unless you do something that makes it change. And the order of a python list is guaranteed to reflect insertion order (SO thread).
So yes, df[0].tolist() (slightly faster than list(df[0])) should always yield a Python list of elements in the same order as the elements in df[0].
Order will always be preserved. When you use the list function, you provide it an iterable, and a list is constructed by iterating over it. For more information on iterators, you might want to read PEP 234 on iterators.
The iteration order is determined by the iterator you provide it. Iterators for a Series are provided by pd.Series.__iter__() (the standard way to access an iterator for an object, which is what list() and similar constructors look up). For more information on iteration and indexing in pandas, consider reading the relevant API reference section and the much more in-depth indexing documentation.
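As a quick illustration with a tiny invented frame, note the deliberately shuffled index: the values come back in stored row order, not sorted by index value:
import pandas as pd

df = pd.DataFrame({'HEIGHT': [180, 165, 172]}, index=[2, 0, 1])

# Both conversions follow the existing row order of the Series.
assert list(df['HEIGHT']) == [180, 165, 172]
assert df['HEIGHT'].tolist() == [180, 165, 172]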

Bar plot with groupby

My categorical variable case_status takes on four unique values. I have data from 2014 to 2016. I would like to plot the distribution of case_status grouped by year. I try to do this using:
df.groupby('year').case_status.value_counts().plot.barh()
And I get the following plot:
What I would like to have is a nicer representation: for example, one color for each year, with all the "DENIED" bars standing next to each other.
I think it can be achieved, since the groupby result has a MultiIndex, but I don't understand it well enough to create the plot I want.
The solution is:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
and results in the grouped plot I wanted.
I think you need to add unstack to get a DataFrame:
df.groupby('year').case_status.value_counts().unstack().plot.barh()
It is also possible to change the level:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
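A self-contained sketch of this approach, with invented sample data standing in for the real frame:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with the columns from the question.
df = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015, 2016, 2016],
    'case_status': ['DENIED', 'CERTIFIED', 'DENIED', 'DENIED',
                    'CERTIFIED', 'WITHDRAWN'],
})

# unstack(0) moves 'year' to the columns, so each year gets its own
# color and the bars for each status sit next to each other.
counts = df.groupby('year').case_status.value_counts().unstack(0)
counts.plot.barh()
plt.show()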

no method matching size(::DataFrames.GroupedDataFrame)

It's the first time I post a question, so I will try to give an example, but I might not be totally aware of the best way to do it.
I am using the groupby() function to divide a DataFrame according to a pooled variable. My intent is to create, from the SubDataFrames, a new DataFrame in which the rows split by groupby() become two separate columns. For instance, in DataFrame A I have :meanX and :Treatment; in DataFrame B I want to have :meanX_Treatment1 and :meanX_Treatment2.
Now I found a way to use join() for this purpose, but having many other variables to block I need to repeat the operation several times, and I need to know how many SubDataFrames the initial call of groupby() created. The number varies, so I can't simply read it off; I need to store it in a variable. That's why I tried size(::DataFrames.GroupedDataFrame).
Is there a solution?
To get the number of groups in a GroupedDataFrame use the length method. For example:
using DataFrames
df = DataFrame(x=repeat(1:4, inner=2, outer=2), y='a':'p')
grouped = groupby(df, :x)
num_of_groups = length(grouped)  # returns 4
# to do something with each group, `for g in grouped ... end` is useful
As noted in the comments, you might also consider using Query.jl (see the documentation at http://www.david-anthoff.com/Query.jl/stable) for data processing along the lines of the question.

pandas read_sql not reading all rows

I am running the exact same query both through pandas' read_sql and through an external app (DbVisualizer).
DbVisualizer returns 206 rows, while pandas returns 178.
I have tried reading the data with pandas in chunks, based on the information provided at How to create a large pandas dataframe from an sql query without running out of memory?, but it didn't make a difference.
What could be the cause of this, and what are ways to remedy it?
The query:
select *
from rainy_days
where year='2010' and day='weekend'
The columns contain: date, year, weekday, amount of rain on that day, temperature, geo_location (one row per location), wind measurements, amount of rain the day before, etc.
The exact python code (minus connection details) is:
import pandas
from sqlalchemy import create_engine
engine = create_engine(
    'postgresql://user:pass@server.com/weatherhist?port=5439',
)
query = """
select *
from rainy_days
where year='2010' and day='weekend'
"""
df = pandas.read_sql(query, con=engine)
See https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/14: if you use pure engine.execute, you have to take care of formatting manually.
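One thing worth ruling out is the typographic quotes in the query string. A sketch using bound parameters instead, so the driver handles all quoting (the connection string is a placeholder):
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@server.com:5439/weatherhist')

# Named bind parameters: no hand-written quotes in the SQL at all.
query = text("select * from rainy_days where year = :y and day = :d")
df = pd.read_sql(query, con=engine, params={'y': '2010', 'd': 'weekend'})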
The problem is that pandas returns a packed dataframe (DF). For some reason this is always on by default, and the results vary widely as to what is shown. The solution is to use the unpacking operator (*) before/when trying to print the df, like this:
print(*df)
(This is also known as the splat operator, for Ruby enthusiasts.)
To read more about this, please check out these references & tutorials:
https://treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/
https://www.geeksforgeeks.org/python-star-or-asterisk-operator/
https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558
https://towardsdatascience.com/unpacking-operators-in-python-306ae44cd480
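For reference, a minimal demonstration of what the unpacking operator itself does (invented toy function):
def add(a, b, c):
    return a + b + c

args = [1, 2, 3]
print(add(*args))  # the * unpacks the list into three positional arguments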
It's not a fix, but what worked for me was to rebuild the indices:
1. Drop the indices.
2. Export the whole table to a CSV.
3. Delete all the rows: DELETE FROM table
4. Import the CSV back in.
5. Rebuild the indices.
With pandas:
df = pandas.read_csv(..)
df.to_sql(..)
If that works, then at least you know you have a problem somewhere with the indices keeping up to date.
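A sketch of that round trip with pandas (table name, file name, and connection details are placeholders):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@server.com:5439/weatherhist')

# 1. Dump the table to a CSV.
df = pd.read_sql('select * from rainy_days', con=engine)
df.to_csv('rainy_days.csv', index=False)

# 2. Reload the CSV and rewrite the table. if_exists='replace' drops
#    and recreates the table, so any indices must be recreated afterwards.
df = pd.read_csv('rainy_days.csv')
df.to_sql('rainy_days', con=engine, if_exists='replace', index=False)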