I have a use case where:
Data is of the form: Col1, Col2, Col3 and Timestamp.
Now, I just want to get the counts of the rows vs Timestamp Bins.
i.e. for every half-hour bucket (even the ones with no corresponding rows), I need a count of how many rows fall into it.
Timestamps are spread over a one year period, so I can't divide it into 24 buckets.
I have to bin them at 30 minutes interval.
groupby via pd.Grouper
# optionally, if needed
# df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).count()
resample
df.set_index('Timestamp').resample('30min').count()
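As a quick sanity check, a toy sketch (column names assumed) showing that empty buckets are kept with a count of 0:

```python
import pandas as pd

# Made-up toy data: three rows, with a gap so two 30-minute buckets are empty
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2023-01-01 00:05", "2023-01-01 00:10", "2023-01-01 01:40",
    ]),
    "Col1": [1, 2, 3],
})

# size() counts rows per bucket; empty buckets come back as 0 rather
# than being dropped
counts = df.set_index("Timestamp").resample("30min").size()
```

`size()` is slightly nicer than `count()` here because it returns a single row count per bucket instead of one non-null count per column.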
This is how the data frame looks. I want to group by day and then apply the following to get merged intervals for each group. Is there an elegant way to shrink/merge the overlapping intervals for each day?
(df["startTime"]>df["endTime"].shift()).cumsum()
I know that I can add a column that denotes the partition on day like so
df["partition"] = df.groupby(["actualDay", "weekDay"]).ngroup()
but how do I make a shift exclusively within the group?
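For what it's worth, a per-group shift can be sketched with `groupby(...).shift()`, which restarts at each group boundary (toy values; the column names come from the question, the merge step is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "actualDay": ["Mon", "Mon", "Mon", "Tue", "Tue"],
    "startTime": [1, 3, 10, 2, 4],
    "endTime":   [5, 8, 12, 3, 9],
})

# the shift happens inside each day, so the first interval of a day is
# never compared against the previous day's endTime
prev_end = df.groupby("actualDay")["endTime"].shift()
df["interval_id"] = (df["startTime"] > prev_end).cumsum()

# collapse each run of overlapping intervals into one row per day
merged = (df.groupby(["actualDay", "interval_id"])
            .agg(startTime=("startTime", "min"), endTime=("endTime", "max")))
```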
I have time series data that I want to query. The data is collected from multiple sensors, and I want to resample it directly when loading, i.e. each sensor resampled separately. Using pandas this can be achieved like this:
#df is a pandas dataframe. Index is a timestamp (datetime64).
df=df.groupby('group').resample('1H').mean()
In SQL I tried an approach like this:
SELECT date_trunc('hour', t_signal.timestamp) AS timestamp,
       avg(t_signal.value) AS value,
       t_signal.source_name
FROM signal AS t_signal
GROUP BY 1, t_signal.source_name
This gives me different results: in the pandas case, resampling creates a row with a unique timestamp even if the original data has no datapoint within a specific hour, whereas date_trunc only aggregates data that exists. Is there a function that does the same as pandas resampling?
Creating a SELECT or table with only the timestamps you want (from-to) and then a full-outer-join with your resampled data should work.
Then you only have to fill the NULLs with whatever you want the missing data to be.
Does this help?
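A hypothetical PostgreSQL sketch of that idea, assuming a `signal` table with `timestamp` and `value` columns and a from/to range of one day (names and range are assumptions, not from the question):

```sql
-- generate every hour in the range, then LEFT JOIN the truncated
-- signal timestamps onto it; hours with no rows come back with NULL
SELECT g.ts AS timestamp,
       avg(s.value) AS value
FROM generate_series('2023-01-01'::timestamp,
                     '2023-01-02'::timestamp,
                     interval '1 hour') AS g(ts)
LEFT JOIN signal AS s
       ON date_trunc('hour', s.timestamp) = g.ts
GROUP BY g.ts
ORDER BY g.ts;
```

Wrap `avg(s.value)` in `COALESCE(..., 0)` if you want zeros instead of NULLs for the empty hours.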
In my data frame I have three months of data, recorded per day (each day has a different number of samples; for example, 1st January has 20K rows and 2nd January has 15K).
What I need is to take the mean number of rows per day and apply it to the whole data frame.
For example, if the mean is 8K, I want 8K random rows from 1st January, 8K random rows from 2nd January, and so on.
As far as I know, rand() draws random rows from the whole data frame, but I need to sample per day, since the date is stored in a column of the data frame.
Thanks
You can use groupby followed by sample, after computing the mean number of records per day:
# Suppose 'date' is the name of your column
sample = df.groupby('date').sample(n=int(df['date'].value_counts().mean()))
# Or
g = df.groupby('date')
sample = g.sample(n=int(g.size().mean()))
Update
Is there any solution for dates whose row count is lower than the mean? For those dates I get this error: Cannot take a larger sample than population when 'replace=False'
import numpy as np

n = np.floor(df['date'].value_counts().mean()).astype(int)
# sample with replacement so small groups don't raise, then drop the
# duplicated rows that replacement can introduce
sample = (df.groupby('date').sample(n, replace=True)
            .loc[lambda x: ~x.index.duplicated()])
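An alternative sketch that sidesteps replacement entirely by capping each day's sample at the rows it actually has (toy data; `val` is a made-up column):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1"] * 5 + ["d2"] * 2,
    "val": range(7),
})

n = int(df["date"].value_counts().mean())  # 3 here (mean of 5 and 2 is 3.5)
# each group contributes min(n, len(group)) rows, so small days never raise
sample = pd.concat(
    g.sample(n=min(n, len(g))) for _, g in df.groupby("date")
)
```

The trade-off versus the replacement trick above: days bigger than the mean always contribute exactly `n` rows, while smaller days contribute everything they have.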
My dataset is as follows (two columns: DATE and RATE).
I want the mean RATE for each day (as you can see, there are multiple rate values for the same day). I have about 1,000 rows, so I am looking for an easy way to compute the mean per day and then save the results to a data frame.
You have to group by date, then aggregate:
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
In your case
df.groupby('DATE').agg({'RATE': ['mean']})
You can group by the date and take the mean:
new_df = df.groupby('DATE').mean()
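A tiny end-to-end sketch with made-up rates, just to show the shape of the result:

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "RATE": [2.0, 4.0, 5.0],
})

# one row per day, RATE averaged over that day's rows
new_df = df.groupby("DATE")["RATE"].mean().reset_index()
```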
I have a db table containing a datetime column with values stretching over 24 hours. If I use the pandas dataframe groupby function to get a minute-by-minute aggregation, it throws everything into 0-59 buckets regardless of which hour they were in.
How do I get minute-by-minute aggregations spread over the timeframe of the table, in this case 24 hours? Also, for the minutes in which there are no values in the table, how do I insert a zero count into the dataframe?
Try using pd.Grouper (pd.TimeGrouper was deprecated and later removed from pandas):
import pandas as pd
df = pd.DataFrame(index=pd.date_range("11:00", "21:30", freq="100ms"))
df['x'] = 1
g = df.groupby(pd.Grouper(freq='s')).sum()
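To also get explicit zero counts for the empty minutes, resample plus size over a sparse index can be sketched like this (timestamps made up):

```python
import pandas as pd

# events at 11:00 and 11:03 only, so 11:01 and 11:02 are empty minutes
idx = pd.to_datetime([
    "2023-01-01 11:00:10",
    "2023-01-01 11:00:40",
    "2023-01-01 11:03:05",
])
events = pd.DataFrame({"x": 1}, index=idx)

# every minute between the first and last event gets a row; empty
# minutes show up as 0 instead of being dropped
per_minute = events.resample("min").size()
```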