convert pandas groupby object to dataframe while preserving group semantics - pandas

I have miserably failed extrapolating from any answers I have found for grouping a dataframe, and then merging back the group semantics computed by groupby into the original dataframe. Seems documentation is lacking and SO answers are not applicable to current pandas versions.
This code:
grouped = df.groupby(pd.Grouper(
key = my_time_column,
freq = '15Min',
label='left',
sort=True)).apply(pd.DataFrame)
Yields back a dataframe, but I have found no way of making the transition to a dataframe having the same data as the original df, while also populating a new column with the start datetime, of the group that each row belonged to in the groupby object.
Here's my current hack that solves it:
grouped = df.groupby(pd.Grouper(
key = my_datetime_column,
freq = '15Min',
label='left',
sort=True))
sorted_df = grouped.apply(pd.DataFrame)
interval_starts = []
for group_idx, group_member_indices in grouped.indices.items():
for group_member_index in group_member_indices:
interval_starts.append(group_idx)
sorted_df['interval_group_start'] = interval_starts
Wondering if there's an elegant pandas way.
pandas version: 0.23.0

IIUC, this should do what you're looking for:
grouped = df.groupby(pd.Grouper(key=my_time_column,
freq = '15Min',
label='left',
sort=True))\
.apply(pd.DataFrame)
grouped['start'] = grouped.loc[:, my_time_column] \
.groupby(level=0) \
.transform('min')

Related

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list(df_list), unless there's a better solution, of these dataframes instead modifying each dataframe to:
Change the datatype of the MP(minutes played column from str to int.
Modify the dataframe where there are only players with 1000 or more MP and there are no duplicate players(Rk)
(for instance in a season, a player(Rk) can play for three teams in a season and have 200MP, 300MP, and 400MP mins with each team. He'll have a column for each team and a column called TOT which will render his MP as 900(200+300+400) for a total of four rows in the dataframe. I only need the TOT row
Use simple algebra with various and individual columns columns, for example: being able to total the MP column and the PTS column and then diving the sum of the PTS column by the MP column.
Or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional line of code(I received an error code that said "html5lib not found, please install it" so I downloaded both html5lib and requests). I say that to say...this distinction in creating the DF may have to considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this:
I tried this but it didn't do anything:
df_list = [
eightyfour, eightyfive, eightysix, eightyseven,
eightyeight, eightynine, ninety, ninetyone,
ninetytwo, ninetyfour, ninetyfive,
ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
df = df.loc[df['Tm'] == 'TOT']
df = df.copy()
df['MP'] = df['MP'].astype(int)
df['Rk'] = df['Rk'].astype(int)
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code will solves a portion of problem # 2
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take rows with values "TOT" in the "Tm" and create a new DF with them, and these rows from the original data frame...
could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names from the new data frame?
the problem is that the df you are working on in the loop is not the same df that is in the df_list. you could solve this by saving the new df back to the list, overwriting the old df
for i,df in enumerate(df_list):
df = df.loc[df['Tm'] == 'TOT']
df = df.copy()
df['MP'] = df['MP'].astype(int)
df['Rk'] = df['Rk'].astype(int)
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
df_list[i] = df
the2 lines are probably wrong as well
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
perhaps you want this
for i,df in enumerate(df_list):
df = df.loc[df['Tm'] == 'TOT']
df = df.copy()
df['MP'] = df['MP'].astype(int)
df['Rk'] = df['Rk'].astype(int)
#df = list(df[df['MP'] >= 1000]['Rk'])
#df = df[df['Rk'].isin(df)]
# just the rows where MP > 1000
df_list[i] = df[df['MP'] >= 1000]

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply value_counts method to a Dataframe based on the columns selected dynamically in the Streamlit app
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
all_columns = df.columns.tolist()
selected_columns = st.multiselect("Select", all_columns)
new_df = df[selected_columns]
st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how could I apply value_counts/groupby method on this output in Streamlit app
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame (which doesn't have value_counts method defined on it).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable on a Series and not dataframe.
You can try Converting ".value_counts" output to dataframe
If you want to apply on one single column
def value_counts_df(df, col):
"""
Returns pd.value_counts() as a DataFrame
Parameters
----------
df : Pandas Dataframe
Dataframe on which to run value_counts(), must have column `col`.
col : str
Name of column in `df` for which to generate counts
Returns
-------
Pandas Dataframe
Returned dataframe will have a single column named "count" which contains the count_values()
for each unique value of df[col]. The index name of this dataframe is `col`.
Example
-------
>>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
count
a
2 3
1 2
"""
df = pd.DataFrame(df[col].value_counts())
df.index.name = col
df.columns = ['count']
return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply for all object columns in the dataframe
def valueCountDF(df, object_cols):
c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
dropna=False)).T.stack() * 100).round(2)
cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And Finally, you can use st.table or st.dataframe to show the dataframe in your streamlit app

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0,40,size=10),
columns=["Random"],
index=pd.date_range("20200101", freq='6h',periods=10))
df["Random2"] = np.random.randint(70,100,size=10)
df["Random3"] = 2
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)
#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index","Date"],axis=1,inplace = True)
#Creat new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()
#random function for an example
def any_func(df):
df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
return df["Value"]
#loop by unique group name
for date in df.index.get_level_values('Date').unique():
#I can print the data I want
print(any_func(df.loc[date])[0])
#But when I add it to a new dataframe, it only shows the value from the last group
df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions is possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]

python How to select the latest sample per user as testing data?

My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train and test split? What I have tried is using pandas to sort_values timestamp and then groupby 'userid'. But I only get a groupby object. What is the correct way to do that? Is pyspark a better tool?
After I get the dataframe of the testing data, how should split data? Obviously I cannot use sklearn's train_test_split.
You could do the following:
# Sort the data by time stamp
df = df.sort_values('timestamp')
# Group by userid and get the last entry from each group
test_df = df.groupby(by='userid', as_index=False).nth(-1)
# The rest of the values
train_df = df.drop(test_df.index)
You can do the following:
import pyspark.sql.functions as F
max_df = df.groupby("userid").agg(F.max("timestamp"))
# join it back to the original DF
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max(timestamp)"])
test_df = df.filter(df["timestamp"] == df["max(timestamp)"])

Reassigning pandas column in place from a slice of another dataframe

So I learned from this answer to this question that in pandas 0.20.3 and above reassigning the values of a dataframe works without giving the SettingWithCopyWarning in many ways as follows:
df = pd.DataFrame(np.ones((5,6)),columns=['one','two','three',
'four','five','six'])
df.one *=5
df.two = df.two*5
df.three = df.three.multiply(5)
df['four'] = df['four']*5
df.loc[:, 'five'] *=5
df.iloc[:, 5] = df.iloc[:, 5]*5
HOWEVER
If I were to take a part of that dataframe like this for example:
df1 = df[(df.index>1)&(df.index<5)]
And then try one of the above methods for reassigning a column like so:
df.one *=5
THEN I will get the SettingWithCopyWarning.
So is this a bug or am I just missing something about the way pandas expects for this kind of operation to work?