select top n rows after resampling DatetimeIndex - pandas

I need to get top n rows by some value per week (and I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1H")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give the top 5 rows per week, but every column except date and value is dropped - and I want to keep them all (my real dataset has lots of columns; another_column is here just to show that it goes missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
    resampled.reset_index(),
    df,
    how="left",
    on=["date", "value"]
)
But it all feels wrong; I know there should be a much simpler solution. Any help?

The output I was looking for. Thanks @wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
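A possibly simpler alternative (my own sketch, not part of the original answer) keeps all columns in a single step by grouping on a weekly Grouper and taking each group's nlargest rows of the whole frame, so no merge is needed:
import pandas as pd

# Sketch: group the original df by calendar week and keep each week's top-5 rows by "value",
# preserving every other column. group_keys=False avoids adding an extra week level to the index.
top5_per_week = (
    df.groupby(pd.Grouper(key="date", freq="W"), group_keys=False)
      .apply(lambda g: g.nlargest(5, "value"))
)
print(top5_per_week.head())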

Group by week and mask the rows that fall in each week's nlargest:
df.set_index('date', inplace=True)
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]
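An equivalent masking sketch (my addition, assuming the same date-indexed df) uses GroupBy.rank instead of a per-group lambda, which guarantees exactly five rows per week even when values are tied:
# Rank "value" within each ISO week, descending; keep rows ranked 1..5.
weekly_rank = df.groupby(df.index.isocalendar().week)['value'].rank(method='first', ascending=False)
top5_per_week = df[weekly_rank <= 5]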

Related

In a pyspark.sql.dataframe.Dataframe, how to resample the "TIMESTAMP" column to daily intervals, for each unique id in the "ID" column?

The title almost says it already. I have a pyspark.sql.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be done for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a spark_df to play around with. The input_spark_df represents the input spark dataframe; the desired output is like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import col, to_date, sum, avg

# Truncate the timestamp to a date, then group by "ID" and the daily "TIMESTAMP"
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.show()
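An alternative sketch (my addition, not from the original answer) uses pyspark.sql.functions.window to aggregate into tumbling one-day windows, which also keeps the window boundaries if you need them:
from pyspark.sql.functions import col, window, sum, avg

# Tumbling 1-day windows per ID; "window" is a struct column with start/end timestamps.
daily = (input_spark_df
    .groupBy("ID", window(col("TIMESTAMP"), "1 day"))
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col("TEMPERATURE")).alias("AVERAGE_DAILY_TEMPERATURE")
    )
    .withColumn("TIMESTAMP", col("window.start"))
    .drop("window"))
daily.show()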

How do you split a pandas multiindex dataframe into train/test sets?

I have a multi-index pandas dataframe consisting of a date element and an index representing store locations. I want to split into training and test sets based on the time index. So, everything before a certain time being my training data set and after being my testing dataset. Below is some code for a sample dataset.
import pandas as pd
from scipy import stats
data = stats.poisson(mu=[5,2,1,7,2]).rvs([60, 5]).T.ravel()
dates = pd.date_range('2017-01-01', freq='M', periods=60)
locations = [f'location_{i}' for i in range(5)]
df_train = pd.DataFrame(data, index=pd.MultiIndex.from_product([dates, locations]), columns=['eaches'])
df_train.index.names = ['date', 'location']
I would like df_train to represent everything before 2021-01 and df_test to represent everything after.
I've tried using df[df.loc['dates'] > '2020-12-31'] but that yielded errors.
You have 'date' as an index; that's why your query doesn't work. For an index, you can use:
df_train.loc['2020-12-31':]
That will select all rows where the date index is >= '2020-12-31'. So, if you would like to choose only rows where the date is > '2020-12-31', you should use df_train.loc['2021-01-01':]
You can't do df.loc['dates'] > '2020-12-31' because df.loc['dates'] still represents your numerical data, and you can't compare that to a string.
You can use query which works with index:
df.query('date>"2020-12-31"')
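Putting it together, a minimal sketch of the full train/test split (my addition, assuming the MultiIndex levels are named 'date' and 'location' as in the question's setup):
import pandas as pd

# Boolean-mask on the 'date' level so both halves keep the full MultiIndex.
cutoff = pd.Timestamp('2020-12-31')
dates = df_train.index.get_level_values('date')
train = df_train[dates <= cutoff]
test = df_train[dates > cutoff]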

Set DateTime to index and then sum over a day

I would like to change the index of my dataframe to datetime so I can sum the column "Heizung" over a day.
But it doesn't work.
After I set the new index, I'd like to use resample to sum over a day.
Here is an extraction from my dataframe.
Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
2;25.04.21 12:59:06;21.9;1
3;25.04.21 12:59:18;21.9;1
4;25.04.21 12:59:29;21.9;1
5;25.04.21 12:59:41;22.0;1
6;25.04.21 12:59:53;22.0;1
7;25.04.21 13:00:05;22.1;1
8;25.04.21 13:00:16;22.1;0
9;25.04.21 13:00:28;22.1;0
10;25.04.21 13:00:40;22.1;0
11;25.04.21 13:00:52;22.2;0
12;25.04.21 13:01:03;22.2;0
13;25.04.21 13:01:15;22.2;1
14;25.04.21 13:01:27;22.2;1
15;25.04.21 13:01:39;22.3;1
16;25.04.21 13:01:50;22.3;1
17;25.04.21 13:02:02;22.4;1
18;25.04.21 13:02:14;22.4;1
19;25.04.21 13:02:26;22.4;0
20;25.04.21 13:02:37;22.4;1
21;25.04.21 13:02:49;22.4;0
22;25.04.21 13:03:01;22.4;0
23;25.04.21 13:03:13;22.5;0
24;25.04.21 13:03:25;22.4;0
This is my code
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit","Erdtemp","Heizung"]].copy()
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'])
Tab1.plot(x='DatumZeit', figsize=(20, 5),subplots=True)
#Tab1.index.to_datetime()
#Tab1.index = pd.to_datetime(Tab1.index)
Tab1.set_index('DatumZeit')
Tab.info()
Tab1.resample('D').sum()
print(Tab1.head(10))
This is how we can set the index, create a DatetimeIndex, and then resample by 'D' and sum a column over it.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1 = Tab1.set_index('DatumZeit') ## missed here
Tab1.resample('D').Heizung.sum()
If we don't want to set the index explicitly, another way to resample is pd.Grouper.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum()
If we want the output to be a dataframe, we can use the to_frame method.
Tab1 = Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum().to_frame()
Output
            Heizung
DatumZeit
2021-04-25       15
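For completeness, an end-to-end sketch (my addition; it assumes the file path from the question and the day-first `25.04.21` timestamp format shown in the sample):
import pandas as pd

# Parse the day-first timestamps explicitly, then sum "Heizung" per day.
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab['DatumZeit'] = pd.to_datetime(Tab['DatumZeit'], format='%d.%m.%y %H:%M:%S')
daily = Tab.set_index('DatumZeit').resample('D')['Heizung'].sum()
print(daily)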
Pivot tables to the rescue (pivot on the date part of the timestamp so the sum is per day, not per individual reading):
import pandas as pd
import numpy as np
Tab1.pivot_table(index=Tab1["DatumZeit"].dt.date, values=["Heizung"], aggfunc=np.sum)
If you need to do it with setting the index first, you need to use inplace=True on set_index:
Tab1.set_index("DatumZeit", inplace=True)
Just note that if you do it this way, "DatumZeit" becomes the index and is no longer available as a column for the pivot table above. In the end, it's whatever works best for you.

get less correlated variable names

I have a dataset (50 columns, 100 rows).
I also have 50 variable names, 0, 1, 2, ..., 49, for the 50 columns.
I have to find less correlated variables, say correlation < 0.7.
I tried as follows:
import os, glob, time, numpy as np, pandas as pd
data = np.random.randint(1,99,size=(100, 50))
dataframe = pd.DataFrame(data)
print (dataframe.shape)
codes = np.arange(50).astype(str)
dataframe.columns = codes
corr = dataframe.corr()
corr = corr.unstack().sort_values()
print (corr)
corr = corr.values
indices = np.where(corr < 0.7)
print (indices)
res = codes[indices[0]].tolist() + codes[indices[1]].tolist()
print (len(res))
res = list(set(res))
print (len(res))
The result is 50 (all variables!), which is unexpected.
How to solve this problem, guys?
As mentioned in the comments, your question is somewhat ambiguous. First, there is the possibility that no column pair is correlated. Second, the unstacking doesn't make sense, because you create an index array that you can't directly use on your 2D array. Third (which should really be first, but I was blind to it): as @AmiTavory mentioned, there is no point in "correlating names".
The correlation procedure per se works, as you can see in the following example:
import numpy as np
import pandas as pd
A = np.arange(100).reshape(25, 4)
#random order in column 2, i.e. a low correlation to the first columns
np.random.shuffle(A[:,2])
#flip column 3 to create a negative correlation with the first columns
A[:,3] = np.flipud(A[:,3])
#column 1 is unchanged, therefore positively correlated to column 0
df = pd.DataFrame(A)
print(df)
#establish a correlation matrix
corr = df.corr()
#retrieve index of pairs below a certain value
#use only the upper triangle with np.triu to filter for symmetric solutions
#use np.abs to take also negative correlation into account
res = np.argwhere(np.triu(np.abs(corr.values) <0.7))
print(res)
Output:
[[0 2]
 [1 2]
 [2 3]]
As expected, column 2 is the only one that is not correlated to any other column, meaning that all the other columns are correlated with each other.
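To get back the variable names the question actually asked for, a small follow-up sketch (my addition, reusing corr and res from the example above):
# Translate the row/column positions in `res` into the corresponding column labels.
low_corr_pairs = [(corr.columns[i], corr.columns[j]) for i, j in res]
print(low_corr_pairs)  # prints the column-label pairs with |corr| < 0.7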

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns and constantly has more rows appended to it on a weekly basis.
What I'm trying to do is to find the arithmetic mean of the last column over the last 1000 rows, every week. (So when new rows are added weekly, it'll just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
#How should I write the next line of codes to get the average for the most 1000 rows?
I'm on a different machine than what my pandas is installed on so I'm going on memory, but I think what you'll want to do is...
df = pd.read_csv("fds.csv", index_col=False, header=0)
# Your 5th column has a name (header) of `Results`
last_thousand = df['Results'].tail(1000)
np.mean(last_thousand)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
Results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My code below works.
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
np.average(df_1.tail(1000))
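If the column of interest is always the last one regardless of its header, a positional variant (my addition, assuming the same fds.csv layout) avoids hard-coding the name:
import pandas as pd

# Mean of the last 1000 rows of the last column, selected by position.
df = pd.read_csv("fds.csv", header=0)
last_col_mean = df.iloc[-1000:, -1].mean()
print(last_col_mean)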