How do you split a pandas multiindex dataframe into train/test sets?

I have a multi-index pandas dataframe consisting of a date element and an index representing store locations. I want to split it into training and test sets based on the time index: everything before a certain time is my training dataset and everything after is my testing dataset. Below is some code for a sample dataset.
import pandas as pd
from scipy import stats

# 60 months of Poisson counts for 5 store locations
data = stats.poisson(mu=[5, 2, 1, 7, 2]).rvs([60, 5]).T.ravel()
dates = pd.date_range('2017-01-01', freq='M', periods=60)
locations = [f'location_{i}' for i in range(5)]
df_train = pd.DataFrame(data, index=pd.MultiIndex.from_product([dates, locations]), columns=['eaches'])
df_train.index.names = ['date', 'location']
I would like df_train to represent everything before 2021-01 and df_test to represent everything after.
I've tried using df[df.loc['dates'] > '2020-12-31'] but that yielded errors.

You have 'date' as an index level, which is why your query doesn't work. For an index you can slice with .loc:
df_train.loc['2020-12-31':]
That selects all rows whose date is >= '2020-12-31'. So, if you would like to choose only rows whose date is > '2020-12-31', you should use df_train.loc['2021-01-01':]
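For completeness, a minimal sketch of the full split (using the df_train frame built in the question, here called df); filtering on get_level_values avoids relying on the sort order that label slicing needs:
cutoff = pd.Timestamp('2020-12-31')
dates = df.index.get_level_values('date')
df_train = df[dates <= cutoff]  # everything up to and including 2020-12-31
df_test = df[dates > cutoff]    # everything after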

You can't do df.loc['dates'] > '2020-12-31' because 'date' is an index level, not a row label or column, so .loc can't look it up; and even if it could, you would be comparing your numerical data to a string.
You can use query, which does work with index levels:
df.query('date > "2020-12-31"')
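The same split written with query (a sketch, assuming the sample frame from the question):
df_train = df.query('date <= "2020-12-31"')
df_test = df.query('date > "2020-12-31"')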

Related

Interpolate values based on date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each pair of columns in the first df, X-Axis Value and Y-Axis Value, where the first one is a date and the second one is a value, I would like to create rows that share the same date. For instance, at df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, the following columns of the same row may contain other dates with different Y-Axis Values associated. The first one in that specific row is X-Axis Value NCVE-064 HPNDE with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values over a time interval, maybe 10 minutes, and then merge the results so that each row has the same date.
For df2 it is more or less the same: interpolate the values over a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed with respect to some time.
Without further context, the hardest part is deciding at what times you want to have the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

df = pd.read_excel(
    "https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
    sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]

# What times do we want to align the columns to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)

def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # Repeat for our target times.
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)

output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
    target_times,
    df["X-Axis Value NCVE-063 VPNDE"],
    df["Y-Axis Value NCVE-063 VPNDE"],
)
# Repeat for the other columns, better in a loop.
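A possible loop for the remaining columns, assuming every "Y-Axis ..." column has a matching "X-Axis ..." column that differs only in the prefix (which is how this sheet appears to be laid out):
y_columns = [col for col in df.columns if col.startswith("Y-Axis")]
for y_col in y_columns:
    x_col = y_col.replace("Y-Axis", "X-Axis", 1)
    output_df[y_col] = interpolate_column(target_times, df[x_col], df[y_col])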

select top n rows after resampling DatetimeIndex

I need to get top n rows by some value per week (and I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1H")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give the top 5 rows per week, but all columns except date and value are missing, and I want to keep them all (my real dataset has lots of columns; another_column is here just to show that it goes missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
resampled.reset_index(),
df,
how="left",
on=["date", "value"]
)
But it all feels wrong; I know there should be a much simpler solution. Any help?
The output I was looking for. Thanks @wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
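For reference, an equivalent sketch that keeps every column without any merging, assuming the same df with a 'date' column: sort by value, then take the first 5 rows of each week group.
top5 = (
    df.assign(week=df["date"].dt.isocalendar().week)
      .sort_values("value", ascending=False)
      .groupby("week")
      .head(5)
)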
Groupby, and mask on nlargest:
df.set_index('date', inplace=True)
# Keep only the rows whose value is among the 5 largest of its ISO week
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]

pandas groupby returns multiindex with two more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                    a
        count_nonzero      mean
letter
a                 6.0  0.539313
b                 4.0  0.456702
This is not what I expected: I would have expected the index to be 0, 1 with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that has neither the group-by columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
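For reference, an alternative sketch using named aggregation (pandas >= 0.25), which also avoids both the index and the column MultiIndex; the output column names here are my own choice:
summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),
    mean=('a', 'mean'),
)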

How to select the latest sample per user as testing data?

My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train and test split? What I have tried is using pandas to sort_values by timestamp and then groupby('userid'), but I only get a groupby object. What is the correct way to do that? Is pyspark a better tool?
After I get the dataframe of the testing data, how should I split the data? Obviously I cannot use sklearn's train_test_split.
You could do the following:
# Sort the data by time stamp
df = df.sort_values('timestamp')
# Group by userid and get the last entry from each group
test_df = df.groupby(by='userid', as_index=False).nth(-1)
# The rest of the values
train_df = df.drop(test_df.index)
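An equivalent sketch using drop_duplicates instead of nth, assuming the same 'userid' and 'timestamp' columns:
# After sorting by timestamp, the last row per userid is the latest sample
test_df = df.sort_values('timestamp').drop_duplicates('userid', keep='last')
# Everything else is training data
train_df = df.drop(test_df.index)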
You can do the following:
import pyspark.sql.functions as F
max_df = df.groupby("userid").agg(F.max("timestamp"))
# join it back to the original DF
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max(timestamp)"])
test_df = df.filter(df["timestamp"] == df["max(timestamp)"])
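A window-function variant of the same idea (a sketch; it assumes a single row per userid carries the maximum timestamp):
from pyspark.sql import Window
import pyspark.sql.functions as F

# Rank rows within each userid, newest first
w = Window.partitionBy("userid").orderBy(F.col("timestamp").desc())
ranked = df.withColumn("rn", F.row_number().over(w))
test_df = ranked.filter(F.col("rn") == 1).drop("rn")
train_df = ranked.filter(F.col("rn") > 1).drop("rn")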

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, and which constantly has more rows appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column over the last 1000 rows, every week. (So when new rows are added each week, it will just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
# How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine than what my pandas is installed on so I'm going on memory, but I think what you'll want to do is...
import numpy as np

df = pd.read_csv("fds.csv", index_col=False, header=0)
# Let's pretend your 5th column has a name (header) of `Results`
df_1 = df['Results']
last_thousand = df_1.tail(1000)
np.mean(last_thousand)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean of each column over the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
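If you only need the last column, as asked, a minimal sketch (assuming the last column in the file is the one of interest):
last_col_mean = df.tail(1000).iloc[:, -1].mean()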
So basically I needed to use the pandas tail function. My code below works.
import numpy as np

df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
np.average(df_1.tail(1000))