Creating custom timestamp buckets bigquery - sql

I have an hourly (timestamp) dataset of events from the past month.
I would like to check the performance of events that occurred between certain hours, group them together and average the results.
For example: AVG income of the hours 23:00-02:00 per user:
So if I have this data set below. I'd like to summarise the coloured rows and then average them (the result should be 218).
I tried NTILE but it couldn't divide the data properly, ignoring the irrelevant hours.
Is there a good way to create these custom buckets using SQL?
dataset

From description not exactly sure how you want to aggregate. If you provide an example dataset can update answer.
However you can easily achieve this with an AVG and IF statement.
AVG(IF(EXTRACT(HOUR FROM timestamp_field) BETWEEN 0 AND 4, value, NULL) as avg_value
Using the above you can then group by either day or month to get the aggregation level you want.

Related

SQL - Returning max count, after breaking down a day into hourly rows

I need to write a SQL query that helps return the highest count in a given hourly range. The problem is that in my table, it just logs orders as they come and doesn’t have a unique identifier that separates hours from hours.
So basically, I need to find the highest number of orders (on any given hour), from 7/08/2022, - 7/15/2022, have a table that does not distinguish distinct hour sets, and logs orders as they come.
I have tried to use a query that combines MAX(), COUNT(), and DATETIME(), but to no avail.
Can I please receive some help?
I've had to tackle this kind of measurement in the past..
Here's what I did for 15 minute intervals:
My datetime column is named datreg in my database log area.
cast(round(floor(cast(datreg as float(53))*24*4)/(24*4),5) as smalldatetime
I times by 4 in this formula, to get 4 intervals inside my 24 hour period.. For you it would look like this to get just hourly intervals:
cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime
This is a little piece of magic when it comes to dashboards and reports.

Want to get order cancellation rate per week for prior 12 months

I have a table that has each transaction along with a field that shows how many units were cancelled in the order. If I filter the table on cancelled_units > 0 i can pull all transactions that are cancelled. There is also detailed date information for each transaction but I think I only need date. I need to create a rate calculation of total cancelled orders / total orders to get cancellation rate and then spread that out across every week for the past 12 months. I was thinking maybe using a CASE statement with some sort of counter in place? Also, I am using Databricks so maybe there is some built in function or operators that would make this easier. Appreciate you taking a look at my question.
From the context provided, you have a data frame with the list of transactions. It is also clear that there is a transaction column, the timestamp indicating when the order was placed, and the number of units cancelled in each transaction. So, when you filter your data frame on condition cancelled_units>0, and count the number of fields, you get number of cancelled_orders.
Using spark in Databricks:
Now, to find the rate of cancellation (cancelled_orders/total_orders) for every week in the past 12 months. I was able to find a way that allows you to calculate the rate of cancellation in a PARTICULAR YEAR (not past 12 months). So, gather all the records in a certain year first.
Since the timestamp indicating when the order was placed is already available in the data frame, we can use this timestamp to find out which week of the year this transaction was made. You can use the following way to achieve this (similar sytax for both pyspark and spark with scala).
df.withColumn("order_placed_week",date_format(col("transaction_date"), "w")).show()
Here transaction_date is a timestamp. If it is a date, then use the following method.
df.withColumn("order_placed_week", date_format(to_date("transaction_date", "dd/mm/yyyy"), "w")).show()
to_date() function helps you to specify the format of the transaction_date in your data frame.
The libraries that are required to be imported are:
For pyspark:
from pyspark.sql.functions import to_date, date_format, col
Reference: https://www.datasciencemadesimple.com/get-month-year-and-quarter-from-date-in-pyspark/
For spark:
import org.apache.spark.sql.functions._
Reference: https://sparkbyexamples.com/spark/spark-how-to-get-a-day-and-week-of-year/
After completing this process, you can use the resulting data frame with order_placed_week column to get cancellation rate.
Get the count of orders for each week number, and then the count of orders with cancelled units using groupBy and filter. Dividing the count_of_cancelled/total_count for each week will give your desired result.

SQL GROUPING SETS averages with multiple many-to-many dimensions

I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,1,1,1.0
,08/05/2016,1,1,1.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows we see that getting the Average purchase per user per day is a simple division of the first two columns as the dimension in these rows represent a single date only. But when we look at the final 3 rows we see that the division is not the way to achieve the desired result. This is because it needs to take an average for each day in turn to get the overall per day amount.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!

Aggregating 15-minute data into weekly values

I'm currently working on a project in which I want to aggregate data (resolution = 15 minutes) to weekly values.
I have 4 weeks and the view should include a value for each week AND every station.
My dataset includes more than 50 station.
What I have is this:
select name, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name
order by name
But it only displays the avg value of all weeks. What I need is avg values for each week and each station.
Thanks for your help!
The problem is that when you do a 'GROUP BY' on just name you then flatten the weeks and you can only perform aggregate functions on them.
Your best option is to do a GROUP BY on both name and week so something like:
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name, week
order by name
PS - It' not entirely clear whether you're suggesting that you need one set of results for stations and one for weeks, or whether you need a set of results for every week at every station (which this answer provides the solution for). If you require the former then separate queries are the way to go.

Sql Queries for finding the sales trend

Suppose ,I have a table which has all the billing records. Now I want to see the sales trend for a user given time duration group by each 3 days ...what should be the sql query regarding this?
please help,Otherwise I am gone ...
I can only give a vague suggestion as per the question, however you may want to have a derived column with a standardised date (as per MS date format, just a number per day) that you could then use a modulus (3) on so that days are equal per 3 day period. You can then group and aggregate over this column to get the values for a 3 day period. Obviously to display the date nicely you would have to multiply back and convert your column as well.
Again I'm not sure of the specifics, but I think this general idea could be achieved to get a result (may well not be the best way so it would help to add more to the question...)