Oracle SQL group by time interval

I have a table with the following columns:
order_number
customer_number
creation_date
estimated_ship_date
I need to select the max estimated_ship_date for any records that have the same customer_number and whose creation_date values are within 15 minutes of each other. There could be any number of such records; I would say 1 to 50 at most.
So basically group by customer_number and creation_date within 15 minutes of each other. It is the creation_date within 15 minutes of each other that I am stuck on.

If you will only ever have to consider rows as a group if they all fall within a single 15 minute span then you could use a windowing clause:
select order_number, customer_number, creation_date, estimated_ship_date,
  max(estimated_ship_date) over (partition by customer_number order by creation_date
    range between 15/1440 preceding and 15/1440 following) as max_estimated_ship_date
from cust_orders;
That gets each row of your table back with an additional column that shows the maximum ship date for any row fifteen minutes* either side of the current one.
It might not do quite what you want if you have a series of orders where each is within 15 minutes of the previous one, but they are not all within 15 minutes of all of the others - as in the example in my earlier comment. It sounds like you maybe don't expect that situation or wouldn't want to group them all together anyway, but you'd need to look at how the grouping worked if it did happen and maybe adjust it slightly.
* Oracle date arithmetic is based on 1 representing a full day, so 15 minutes is expressed as a fraction of a day: (15*60)/(24*60*60), i.e. seconds divided by seconds per day; or 900/86400; or 15/(24*60); or 15/1440; or 1/96; etc. Which representation you use is a matter of taste and maintainability. You can also use intervals if you prefer.
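For the chained case described above (each order within 15 minutes of the previous one, but not all within 15 minutes of each other), one option is to start a new group only when the gap to the previous order exceeds 15 minutes. The following is a minimal Python sketch of that grouping rule, not the windowing query itself; the function name `chain_groups` and the sample rows are made up for illustration.

```python
from datetime import datetime, timedelta

def chain_groups(rows, gap=timedelta(minutes=15)):
    """Group (customer_number, creation_date, estimated_ship_date) tuples.

    A new group starts whenever the customer changes or the gap to the
    previous order of the same customer exceeds `gap`.
    """
    groups = []
    for cust, created, ship in sorted(rows):
        last = groups[-1] if groups else None
        if last and last["customer"] == cust and created - last["last_created"] <= gap:
            # Still chained to the previous order: extend the current group.
            last["last_created"] = created
            last["max_ship"] = max(last["max_ship"], ship)
        else:
            groups.append({"customer": cust, "first_created": created,
                           "last_created": created, "max_ship": ship})
    return groups

# Hypothetical sample data: three chained orders, then one isolated order.
orders = [
    (1, datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 5)),
    (1, datetime(2024, 1, 1, 9, 10), datetime(2024, 1, 7)),
    (1, datetime(2024, 1, 1, 9, 20), datetime(2024, 1, 6)),
    (1, datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 9)),
]
result = chain_groups(orders)
```

Here the 09:00 and 09:20 orders end up in the same group even though they are 20 minutes apart, because the 09:10 order links them. Whether that is the desired behaviour depends on how you want to define "within 15 minutes of each other".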

Related

Finding the initial sampled time window after using SAMPLE BY again

I can't seem to find a perhaps easy solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest gallon price.
Output
Using the following query, I can get the average gallon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                    avg_per_month
------------------------------------------
2000-06-05T00:00:00.000000Z  1.6724
2000-07-05T00:00:00.000000Z  1.69275
2000-08-05T00:00:00.000000Z  1.635
...                          ...
Then, I take all these monthly averages, group them by year and return the maximum gallon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                    max_per_year
-----------------------------------------
2000-01-05T00:00:00.000000Z  1.69275
2001-01-05T00:00:00.000000Z  1.767399999999
2002-01-05T00:00:00.000000Z  1.52075
...                          ...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum gallon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as its average price? I'd like to display this month in an additional column.
For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.
timestamp                    max_per_year    which_month_is_responsible
------------------------------------------------------------------------
2000-01-05T00:00:00.000000Z  1.69275         2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z  1.767399999999  2001-05-05T00:00:00.000000Z
...                          ...             ...
What did I try?
I tried by adding a subquery to the SELECT to have a "duplicate" of some sort for the timestamp column but that's apparently never valid in QuestDB (?), so probably the solution is by adding even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
SELECT timestamp_floor('d', ts) as day,
ts,
avg_in_15_min as max_per_day,
row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day
FROM
(
SELECT ts, avg(consumption) as avg_in_15_min
FROM electricity
SAMPLE BY 15m
)
) WHERE rn_per_day = 1
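The query above ranks the sampled averages inside each day and keeps the top-ranked row, which preserves the timestamp of the winning bucket. The same idea applied to the original question (monthly averages, best month per year) can be sketched in plain Python to check the logic outside the database. This is only an illustration with made-up sample prices; `peak_month_per_year` is a hypothetical helper, not a QuestDB function.

```python
from collections import defaultdict
from datetime import date

def peak_month_per_year(rows):
    """rows: iterable of (date, price). Returns {year: (month, avg_price)}."""
    # Step 1: average price per (year, month), like SAMPLE BY 1M.
    sums = defaultdict(lambda: [0.0, 0])
    for day, price in rows:
        bucket = sums[(day.year, day.month)]
        bucket[0] += price
        bucket[1] += 1
    # Step 2: keep the month with the highest average per year,
    # like filtering on row_number() = 1 within each year.
    best = {}
    for (year, month), (total, count) in sums.items():
        avg = total / count
        if year not in best or avg > best[year][1]:
            best[year] = (month, avg)
    return best

# Made-up sample prices, not the real gas_prices data.
sample = [
    (date(2000, 6, 5), 1.0), (date(2000, 6, 12), 2.0),
    (date(2000, 7, 3), 2.0), (date(2000, 7, 10), 3.0),
    (date(2001, 5, 7), 4.0), (date(2001, 6, 4), 1.0),
]
peaks = peak_month_per_year(sample)
```

With this sample, `peaks` maps 2000 to July and 2001 to May, each paired with that month's average.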

PL/SQL check time period, repeat, up until 600 records, from large database

What would be the best way to check if there has been data within a 3 month period up until a maximum of 600 records, then repeat for the 3 months before that if 600 hasn't been reached? Also it's a large table so querying the whole thing could take a few minutes or completely hang Oracle SQL Developer.
ROWNUM seems to give row numbers to the whole table before returning the result of the query, so that seems to take too long. The way we are currently doing it is entering a time period explicitly that we guess there will be enough records within and then limiting the rows to 600. This only takes 5 seconds, but needs to be changed constantly.
I was thinking to do a FOR loop through each row, but am having trouble storing the number of results outside of the query itself to check whether or not 600 has been reached.
I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?
Thank you
"check if there has been data within a 3 month period up until a maximum of 600 records, then repeat for the 3 months before that if 600 hasn't been reached?"
Find the latest date and filter to only allow the rows that are within 6 months of it and then fetch the first 600 rows:
SELECT *
FROM (
SELECT t.*,
MAX(date_column) OVER () AS max_date_column
FROM table_name t
)
WHERE date_column > ADD_MONTHS( max_date_column, -6 )
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
If there are 600 or more rows within the latest 3 months then those will be returned; otherwise the result set extends into the 3-month period before that.
If you intend to repeat the extension over more than two 3-month periods then just use:
SELECT *
FROM table_name
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
"I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?"
Yes, creating an index on the date column would, typically, make filtering the table faster.

SQL performance issues with window functions on daily basis

Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed) ? Start date of a customer is its first ever login, end date is today.
Desired output
c_id day nb_logins_past_6_months
----------------------------------------------
1 2019-01-01 10
1 2019-01-02 10
1 2019-01-03 9
...
1 today 5
➔ One line per user per day with the number of logins between current day and 179 days in the past
Approach 1
1. Cross join each customer ID with calendar table
2. Left join on login table on day
3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)
+ Easy to understand and maintain
- Really heavy, quite impossible to run on daily basis
- Incremental does not bring much benefit : still have to go 179 days in the past
Approach 2
1. Cross join each customer ID with calendar table
2. Left join on login table on day between today and 179 days in the past
3. Group by customer ID and day to get nb logins within 179 days
+ Easier to do incremental
- Table at step 2 is exceeding 300 billion rows
What is the common way to deal with this knowing this is not the only use case, we have to compute other columns like this (nb logins in the past 12 months etc.)
In standard SQL, you would use:
select l.*,
count(*) over (partition by customerid
order by login_date
range between interval '6 month' preceding and current row
) as num_logins_180day
from logins l;
This assumes that the logins table has a date of the login with no time component.
I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 billion rows to answer this question.
For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.
More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables
(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)
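To illustrate the subtotal idea, here is a minimal Python sketch that takes pre-aggregated daily login counts for one customer and produces the rolling 180-day total for every calendar day, including days without logins. It keeps a running total and subtracts days as they leave the window instead of re-summing 180 rows each time. The function name `rolling_logins` and the sample counts are assumptions for illustration only.

```python
from collections import deque
from datetime import date, timedelta

def rolling_logins(daily_counts, start, end, window_days=180):
    """Yield (day, logins in the window ending on day) for every calendar day.

    daily_counts: {date: number of logins that day} for ONE customer,
    i.e. the pre-aggregated daily subtotals.
    The window is the current day plus the (window_days - 1) days before it.
    """
    window = deque()   # (day, count) pairs currently inside the window
    total = 0
    day = start
    while day <= end:
        count = daily_counts.get(day, 0)   # days without logins count as 0
        window.append((day, count))
        total += count
        # Drop days that are no longer among the last `window_days` days.
        while window[0][0] <= day - timedelta(days=window_days):
            total -= window.popleft()[1]
        yield day, total
        day += timedelta(days=1)

# Hypothetical subtotals: 3 logins on the first day, 1 on the second.
first_day = date(2019, 1, 1)
counts = {first_day: 3, first_day + timedelta(days=1): 1}
series = dict(rolling_logins(counts, first_day, first_day + timedelta(days=181)))
```

The same pattern works per customer in any engine that can stream rows in date order; only the daily subtotals need to be stored, not one row per customer per day per window.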

SQL Select statement Where time is *:00

I'm attempting to make a filtered table based off an existing table. The current table has rows for every minute of every hour of 24 days based off of locations (tmcs).
I want to filter this table into another table that has just one row per hour for each of the 24 days, per location (tmc).
Here is the SQL statement that I thought would have done it...
SELECT
Time_Format(t.time, '%H:00') as time, ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time FROM traffic t
GROUP BY time, tmc, Date
The problem is I still get 247,000 rows affected...and according to simple math I should only have:
Locations (TMCS): 14
Hours in a day: 24
Days tracked: 24
Total = 14 * 24 * 24 = 12,096
My original table has 477,277 rows
When I make a new table off this query I get right around 247,000 rows, which makes no sense, so my query must be wrong.
The reason I did this method instead of a where clause is because I wanted to find the average speed(avg)per hour. This is not mandatory so I'd be fine with using a Where clause for time, but I just don't know how to do this based off *:00
Any help would be much appreciated
Fix the GROUP BY so it's standard, rather than the random MySQL extension
SELECT
Time_Format(t.time, '%H:00') as time,
ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time
FROM traffic t
GROUP BY
Time_Format(t.time, '%H:00'), tmc, Date, Date_Time
Run this with SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY'; to see the errors that other RDBMS will give you and make MySQL work properly
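As a small, self-contained check of the "group by the hour expression" idea, the sketch below uses Python's built-in sqlite3 module rather than MySQL, so `strftime` stands in for `Time_Format`. The column names `obs_date`, `obs_time` and `speed` and the sample rows are hypothetical. Note that this sketch leaves the per-minute timestamp out of both the SELECT list and the GROUP BY; if a per-minute column is grouped on, each minute stays in its own group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified version of the traffic table.
conn.execute(
    "CREATE TABLE traffic (tmc TEXT, obs_date TEXT, obs_time TEXT, speed REAL)"
)
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?, ?)",
    [
        ("A", "2024-01-01", "08:00:00", 50),
        ("A", "2024-01-01", "08:01:00", 60),
        ("A", "2024-01-01", "09:00:00", 40),
        ("B", "2024-01-01", "08:30:00", 30),
    ],
)
# Group on the same hour expression that is selected, not on the raw time.
hourly = conn.execute(
    """
    SELECT tmc, obs_date, strftime('%H:00', obs_time) AS hour,
           ROUND(AVG(speed), 0) AS avg_speed
    FROM traffic
    GROUP BY tmc, obs_date, strftime('%H:00', obs_time)
    ORDER BY tmc, obs_date, hour
    """
).fetchall()
```

The four per-minute rows collapse into three rows, one per location, date and hour.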

Query to find a weekly average

I have an SQLite database with the following fields for example:
date (yyyymmdd format)
total (0.00 format)
There are typically 2 months of records in the database. Does anyone know a SQL query to find a weekly average?
I could easily just execute:
SELECT COUNT(1) as total_records, SUM(total) as total FROM stats_adsense
Then just divide total by 7, but unless the number of days in the db is exactly divisible by 7 I don't think it will be very accurate, especially if there are fewer than 7 days of records.
To get a daily summary it's obviously just total / total_records.
Can anyone help me out with this?
You could try something like this:
SELECT strftime('%W', thedate) theweek, avg(total) theaverage
FROM table GROUP BY strftime('%W', thedate)
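One caveat: SQLite's `strftime` expects dates like `yyyy-mm-dd`, so a `yyyymmdd` string needs to be reshaped first. The sketch below, using Python's built-in sqlite3 module, rebuilds the date with `substr` and groups by year plus week number so that weeks from different years do not merge. The table and column names follow the question; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_adsense (date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO stats_adsense VALUES (?, ?)",
    [
        ("20230102", 1.0),   # Monday
        ("20230103", 3.0),   # Tuesday, same week
        ("20230109", 5.0),   # Monday of the following week
    ],
)
weekly = conn.execute(
    """
    SELECT strftime('%Y-%W',
                    substr(date, 1, 4) || '-' ||
                    substr(date, 5, 2) || '-' ||
                    substr(date, 7, 2)) AS week,
           SUM(total) AS week_total,
           AVG(total) AS daily_average
    FROM stats_adsense
    GROUP BY week
    ORDER BY week
    """
).fetchall()
```

`%W` numbers weeks starting on Monday, so the first two rows fall in one week and the third row in the next.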
I'm not sure how the syntax would work in SQLite, but one way would be to parse out the date parts of each [date] field, and then specifying which WEEK and DAY boundaries in your WHERE clause and then GROUP by the week. This will give you a true average regardless of whether there are rows or not.
Something like this (using T-SQL):
SELECT DATEPART(wk, theDate), Avg(theAmount) as Average
FROM Table
GROUP BY DATEPART(w, theDate)
This will return a row for every week. You could filter it in your WHERE clause to restrict it to a given date range.
Hope this helps.
Your weekly average is
daily * 7
Obviously this doesn't take in to account specific weeks, but you can get that by narrowing the result set in a date range.
You'll have to omit those records in the addition which don't belong to a full week. So, prior to summing up, you'll have to find the min and max of the dates, manipulate them such that they form "whole" weeks, and then run your original query with a WHERE that limits the date values according to the new range. Maybe you can even put all this into one query. I'll leave that up to you. ;-)
Those values which are "truncated" are not used then, obviously. If there aren't enough values for a full week at all, there's no result at all. But there's no solution to that, apparently.
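The "whole weeks" trimming described above could look roughly like this in Python. It is a sketch assuming weeks run Monday to Sunday; `whole_week_range` is a hypothetical helper whose result would feed the WHERE clause of the original query.

```python
from datetime import date, timedelta

def whole_week_range(first, last):
    """Return (start, end) covering only complete Monday-Sunday weeks, or None.

    first/last are the min and max dates found in the table.
    """
    # Move forward to the first Monday on or after `first`.
    start = first + timedelta(days=(7 - first.weekday()) % 7)
    # Move back to the last Sunday on or before `last`.
    end = last - timedelta(days=(last.weekday() + 1) % 7)
    return (start, end) if start <= end else None

# Tuesday 2023-01-03 to Friday 2023-01-20 contains one complete week.
trimmed = whole_week_range(date(2023, 1, 3), date(2023, 1, 20))
# Tuesday to Friday of the same week contains no complete week.
too_short = whole_week_range(date(2023, 1, 3), date(2023, 1, 6))
```

Dividing the total over the trimmed range by the number of weeks in it, `((end - start).days + 1) // 7`, then gives an average per complete week.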