How to select the latest sample per user as testing data?

My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train and test split? What I have tried is using pandas to sort_values by timestamp and then groupby 'userid', but I only get a groupby object. What is the correct way to do that? Is pyspark a better tool?
After I get the dataframe of the testing data, how should I split the data? Obviously I cannot use sklearn's train_test_split.

In pandas, you could do the following:
# Sort the data by time stamp
df = df.sort_values('timestamp')
# Group by userid and get the last entry from each group
test_df = df.groupby(by='userid', as_index=False).nth(-1)
# The remaining rows (matched by original index) form the training set
train_df = df.drop(test_df.index)
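If you prefer a more compact route to the same result, here is a rough sketch (assuming the columns really are named 'userid' and 'timestamp'): sort by timestamp and keep only the last occurrence of each user, which preserves the original index so the complement can be dropped:
# Latest row per user; keep='last' keeps the most recent row after sorting
test_df = df.sort_values('timestamp').drop_duplicates('userid', keep='last')
# Everything else becomes training data
train_df = df.drop(test_df.index)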

Alternatively, in PySpark you can do the following:
import pyspark.sql.functions as F

# Latest timestamp per user, aliased so the column is easy to reference
max_df = df.groupBy("userid").agg(F.max("timestamp").alias("max_timestamp"))
# Join it back to the original DF
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max_timestamp"])
test_df = df.filter(df["timestamp"] == df["max_timestamp"])
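Another PySpark sketch (assuming the same 'userid' and 'timestamp' columns) uses a window function instead of the join; row_number ranks each user's rows from newest to oldest, so rank 1 is the test row:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("userid").orderBy(F.col("timestamp").desc())
ranked = df.withColumn("rn", F.row_number().over(w))
test_df = ranked.filter(F.col("rn") == 1).drop("rn")
train_df = ranked.filter(F.col("rn") > 1).drop("rn")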

Related

How do you split a pandas multiindex dataframe into train/test sets?

I have a multi-index pandas dataframe consisting of a date element and an index representing store locations. I want to split into training and test sets based on the time index. So, everything before a certain time being my training data set and after being my testing dataset. Below is some code for a sample dataset.
import pandas as pd
from scipy import stats

data = stats.poisson(mu=[5, 2, 1, 7, 2]).rvs([60, 5]).T.ravel()
dates = pd.date_range('2017-01-01', freq='M', periods=60)
locations = [f'location_{i}' for i in range(5)]
df_train = pd.DataFrame(data,
                        index=pd.MultiIndex.from_product([dates, locations]),
                        columns=['eaches'])
df_train.index.names = ['date', 'location']
I would like df_train to represent everything before 2021-01 and df_test to represent everything after.
I've tried using df[df.loc['dates'] > '2020-12-31'] but that yielded errors.
You have 'date' as an index, which is why your query doesn't work. For the index, you can use:
df_train.loc['2020-12-31':]
That selects all rows where the date index is >= '2020-12-31'. So if you only want rows strictly after '2020-12-31', use df_train.loc['2021-01-01':]
You can't do df.loc['dates'] > '2020-12-31' because df.loc['dates'] still represents your numerical data, and you can't compare those to a string.
You can use query which works with index:
df.query('date>"2020-12-31"')
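Putting it together, a minimal sketch of the actual split, assuming the MultiIndex built above with levels 'date' and 'location':
# Boolean masks on the 'date' level split the frame without touching the index
dates_idx = df_train.index.get_level_values('date')
train = df_train[dates_idx <= '2020-12-31']
test = df_train[dates_idx > '2020-12-31']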

How to get access to dropped values in pandas?

My datasets have two columns with values. In order to find the top 1% of the data in each column, I used the quantile method. After that,
I dropped the values which are higher than the top 1% from my datasets with the drop method.
Now I want to get my dropped values back. How can I access the dropped values in a separate column?
features = ['HYG_FT01', 'HYG_PU12_PW_PV']
for features in df:
    new_df = df[[features]].quantile(q=.99, axis=0, numeric_only=True).iloc[0]
    df.drop(df[df[features] > new_df].index, inplace=True)
Here is my code, hope it helps; if you want me to clarify anything, let me know in the comments:
import numpy as np

features = ['HYG_FT01', 'HYG_PU12_PW_PV']
for features in df:
    new_df = df[[features]].quantile(q=.9, axis=0, numeric_only=True).iloc[0]
    df[features + '_dropped'] = np.where(df[features] <= new_df, None, df[features])
    df[features] = np.where(df[features] > new_df, None, df[features])
df
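Another sketch, assuming the same two feature columns and the 99th-percentile cutoff from the question: capture the rows above the cutoff before dropping them, so the removed values remain available on their own:
import pandas as pd

features = ['HYG_FT01', 'HYG_PU12_PW_PV']
dropped = {}
for feature in features:
    cutoff = df[feature].quantile(0.99)
    mask = df[feature] > cutoff
    dropped[feature] = df.loc[mask, feature]  # values about to be removed
    df = df[~mask]                            # keep only rows at or below the cutoff
dropped_df = pd.DataFrame(dropped)            # one column per feature, NaN elsewhere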

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np

# Build random dataframe
df = pd.DataFrame(np.random.randint(0, 40, size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70, 100, size=10)
df["Random3"] = 2
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)

# Set up groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index", "Date"], axis=1, inplace=True)

# Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()

# Random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]

# Loop by unique group name
for date in df.index.get_level_values('Date').unique():
    # I can print the data I want
    print(any_func(df.loc[date])[0])
    # But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions if possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]
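For reference, a vectorized sketch of the same idea, assuming the df built in the question (a MultiIndex whose first level is 'Date'): compute the value for every row at once, then keep the first entry of each date group:
# One value per row, no Python loop
values = df["Random"] * df["Random2"] / df["Random3"]
# First row of each Date group, collected into the result frame
df2 = values.groupby(level="Date").first().to_frame("newValue")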

convert pandas groupby object to dataframe while preserving group semantics

I have miserably failed to extrapolate from any of the answers I have found for grouping a dataframe and then merging the group semantics computed by groupby back into the original dataframe. The documentation seems lacking and existing SO answers don't apply to current pandas versions.
This code:
grouped = df.groupby(pd.Grouper(
    key=my_time_column,
    freq='15Min',
    label='left',
    sort=True)).apply(pd.DataFrame)
yields back a dataframe, but I have found no way of making the transition to a dataframe that has the same data as the original df while also populating a new column with the start datetime of the group each row belonged to in the groupby object.
Here's my current hack that solves it:
grouped = df.groupby(pd.Grouper(
    key=my_datetime_column,
    freq='15Min',
    label='left',
    sort=True))
sorted_df = grouped.apply(pd.DataFrame)

interval_starts = []
for group_idx, group_member_indices in grouped.indices.items():
    for group_member_index in group_member_indices:
        interval_starts.append(group_idx)
sorted_df['interval_group_start'] = interval_starts
Wondering if there's an elegant pandas way.
pandas version: 0.23.0
IIUC, this should do what you're looking for:
grouped = df.groupby(pd.Grouper(key=my_time_column,
                                freq='15Min',
                                label='left',
                                sort=True)) \
            .apply(pd.DataFrame)
grouped['start'] = grouped.loc[:, my_time_column] \
                          .groupby(level=0) \
                          .transform('min')
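If the goal is just the left edge of each 15-minute bin, a simpler sketch (assuming my_time_column holds datetimes; Series.dt.floor is available in the pandas version mentioned above) computes it directly from the timestamps, with no groupby round-trip:
# Bin start derived straight from the timestamp
df['interval_group_start'] = df[my_time_column].dt.floor('15Min')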

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, and which constantly has more rows appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column over the last 1000 rows, every week. (So when new rows are added weekly, it'll just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
# How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine than the one my pandas is installed on, so I'm going from memory, but I think what you'll want to do is...
import numpy as np

df = pd.read_csv("fds.csv", index_col=False, header=0)
# Let's pretend your 5th column has a name (header) of `Stuff`
df_1 = df['Stuff']
last_thousand = df_1.tail(1000)
np.mean(last_thousand)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My code below works.
import numpy
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))
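For completeness, the same computation as a single pandas expression (a sketch, assuming the column really is named 'Results'):
avg = pd.read_csv("fds.csv")["Results"].tail(1000).mean()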