Merging time intervals in pandas dataframes after grouping by Day - pandas

This is how the data frame looks. I want to group by day and then do the following to get merged intervals for each group. Is there an elegant way to merge overlapping intervals for each day?
(df["startTime"]>df["endTime"].shift()).cumsum()
I know that I can add a column that denotes the partition on day like so
df["partition"]=df.groupby(["actualDay","weekDay"]).ngroup()
but how do I make the shift apply exclusively within each group?
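One way to keep the shift inside each day is to use groupby on the same key before shifting, so the first row of every group compares against NaT and starts a new block. A minimal sketch with hypothetical column values (the real frame's contents are not shown in the question):

```python
import pandas as pd

# Hypothetical data matching the question's columns.
df = pd.DataFrame({
    "actualDay": ["2021-01-01"] * 3 + ["2021-01-02"] * 2,
    "startTime": pd.to_datetime(
        ["2021-01-01 09:00", "2021-01-01 09:30", "2021-01-01 12:00",
         "2021-01-02 08:00", "2021-01-02 08:15"]),
    "endTime": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-01 11:00", "2021-01-01 13:00",
         "2021-01-02 08:30", "2021-01-02 09:00"]),
})

df = df.sort_values(["actualDay", "startTime"])

# Shift within each day so the comparison never crosses a group boundary;
# the NaT produced for the first row of each group compares False, which
# starts a new block there.
prev_end = df.groupby("actualDay")["endTime"].shift()
df["block"] = (df["startTime"] > prev_end).cumsum()

# Collapse each block of overlapping intervals into one row per day.
merged = df.groupby(["actualDay", "block"]).agg(
    startTime=("startTime", "min"),
    endTime=("endTime", "max"),
).reset_index()
```

The block counter is global, but because the final groupby also includes actualDay, blocks never mix across days.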

Related

Resampling timeseries data per group in SQL

I have timeseries data that I want to query. The data is collected over multiple sensors. I want to resample the data directly when loading, so each sensor is resampled separately. With pandas this can be achieved like this:
# df is a pandas DataFrame; the index is a timestamp (datetime64).
df = df.groupby('group').resample('1H').mean()
In SQL I tried an approach like this:
SELECT date_trunc('hour', timestamp) AS timestamp,
       avg(t_signal.value) AS value,
       t_signal.source_name
FROM signal AS t_signal
GROUP BY 1, t_signal.source_name
This gives me different results: in the pandas case, the resampling will create a row with a unique timestamp even if the original data did not have a datapoint within a specific hour.
date_trunc only aggregates existing data. Is there a function that does the same as pandas resampling?
Creating a SELECT or table with only the timestamps you want (from-to) and then doing a full outer join with your resampled data should work.
Then you only have to fill the NULLs with whatever you want the missing data to be.
Does this help?
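For reference, this is the pandas behavior the SQL side needs to replicate: resample emits a row for every hour in each sensor's range, with NaN where no reading fell in that hour. A small sketch with made-up readings:

```python
import pandas as pd

# Illustrative data: two sensors; sensor "a" has no reading in the 01:00 hour.
idx = pd.to_datetime(["2021-01-01 00:10", "2021-01-01 02:20",
                      "2021-01-01 00:30", "2021-01-01 01:40"])
df = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 2.0, 4.0]},
    index=idx,
)

hourly = df.groupby("group").resample("1h").mean()
# Sensor "a" gets rows for 00:00, 01:00, and 02:00, with NaN at 01:00
# even though no datapoint fell in that hour -- the gap the date_trunc
# query silently drops.
```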

Calculate mean for pandas dataframe based on date

My dataset is as follows (two columns: DATE and RATE).
I want to get the mean of RATE for each day (from the dataset, you can see that there are multiple rate values for the same day). I have about 1,000 rows, so I am trying to find an easier way to calculate the mean for each day and then save the results to a data frame.
You have to group by date then aggregate
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
In your case
df.groupby('DATE').agg({'RATE': ['mean']})
You can group by the date and take the mean.
new_df = df.groupby('DATE').mean()
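A tiny runnable illustration of both answers, using made-up values since the question's data isn't shown:

```python
import pandas as pd

# Hypothetical sample with repeated dates.
df = pd.DataFrame({
    "DATE": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "RATE": [1.0, 3.0, 5.0],
})

# One row per date with the average RATE for that date.
daily_mean = df.groupby("DATE")["RATE"].mean().reset_index()
# 2021-03-01 -> 2.0, 2021-03-02 -> 5.0
```

Selecting the RATE column before calling mean() avoids problems with non-numeric columns in newer pandas versions.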

How to select a range of consecutive dates of a dataframe with many users in pandas

I have a dataframe with 19M rows for different customers (~10K customers) and their daily consumption over different date ranges. I have resampled this data into weekly consumption, and the resulting dataframe has 2M rows. I want to find the ranges of consecutive dates for each customer and select the one with the longest range. Any ideas? Thank you!
It would be great if you could post some example code, so the replies will be more specific.
You probably want something like earliest = df.groupby('Customer_ID')['Consumption_date'].min() to get the earliest consumption date per customer, and latest = df.groupby('Customer_ID')['Consumption_date'].max() for the latest. Then time_span = latest - earliest gives the time span per customer.
Knowing the specific df and variable names would be great
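Since min/max alone ignore gaps, finding truly consecutive ranges needs a run-detection step. A sketch under assumed column names (Customer_ID, Consumption_date) and weekly spacing:

```python
import pandas as pd

# Hypothetical weekly data; customer 1 has a 3-week run then a 2-week run,
# customer 2 has two isolated weeks.
df = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 1, 1, 2, 2],
    "Consumption_date": pd.to_datetime(
        ["2021-01-04", "2021-01-11", "2021-01-18",
         "2021-02-08", "2021-02-15",
         "2021-01-04", "2021-01-18"]),
})

df = df.sort_values(["Customer_ID", "Consumption_date"])

# A new run starts whenever the gap to the previous row within a customer
# is not exactly 7 days (the first row of each customer yields NaT, which
# also starts a run).
gap = df.groupby("Customer_ID")["Consumption_date"].diff()
df["run"] = (gap != pd.Timedelta(days=7)).cumsum()

# Summarize each run, then keep the longest run per customer.
runs = df.groupby(["Customer_ID", "run"])["Consumption_date"].agg(
    ["min", "max", "count"])
longest = runs.loc[runs.groupby("Customer_ID")["count"].idxmax()]
```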

Max value in pandas based on descending windows

I have some experience with pandas, but cannot figure out the following:
I have several weeks of timestamped data with multiple records within one day.
I want to add a column in which, for each day, the maximum value of the remaining records of that day is displayed.
So if 5 records remain in a particular day, I need the max of the next 5 records; after that, the max of the next 4 records, and so on.
I have tried to use groupby, but this does not seem to do the trick.
Can somebody help me out?
This is not the fastest, but you can try this -
dt['mvalue'] = dt.sort_values('datetime', ascending=False).groupby('date')['value'].cummax()
It simply takes a cumulative max over the reverse-sorted series.
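A runnable sketch of that answer, with made-up readings since the question's example data isn't shown:

```python
import pandas as pd

# Hypothetical frame: several timestamped readings per day.
dt = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-01-01 09:00", "2021-01-01 12:00",
                                "2021-01-01 15:00", "2021-01-02 10:00",
                                "2021-01-02 14:00"]),
    "value": [5, 3, 4, 2, 7],
})
dt["date"] = dt["datetime"].dt.date

# Reverse-sort within each day, take the cumulative max, and let index
# alignment place the results back on the original rows.
dt["mvalue"] = (dt.sort_values("datetime", ascending=False)
                  .groupby("date")["value"].cummax())
# Each row now holds the max of that row and all later rows of the same day.
```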

Using Hive, how to query data that is split across multiple partitions?

From a table partitioned on a date field (a new partition is generated every day), I need to extract records that range over the last three months. This means I need to query every partition from the last three months, using "where date < 'today's date' and date >= 'today - 90 days'".
I think this query would not be very efficient.
Is there a better way of accessing data that is spread across multiple partitions?