Finding the initial sampled time window after using SAMPLE BY again - sql

I can't seem to find a (perhaps easy) solution to what I'm trying to accomplish here using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words, so bear with me.
Input
My real input is different, of course, but the gas_prices table on QuestDB's demo page is a similar dataset. On https://demo.questdb.io you can write and run queries directly against a sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest gallon price.
Output
Using the following query, I can get the average gallon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                      avg_per_month
-----------------------------  -------------
2000-06-05T00:00:00.000000Z    1.6724
2000-07-05T00:00:00.000000Z    1.69275
2000-08-05T00:00:00.000000Z    1.635
...                            ...
Then, I take all these monthly averages, group them by year, and return the maximum gallon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                      max_per_year
-----------------------------  --------------
2000-01-05T00:00:00.000000Z    1.69275
2001-01-05T00:00:00.000000Z    1.767399999999
2002-01-05T00:00:00.000000Z    1.52075
...                            ...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum gallon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as its average price? I'd like to display this month in an additional column.
For the first row, July 2000 is shown in the additional column because that month is responsible for the highest average price in 2000. For the second row it is May 2001, as that month had the highest average price of 2001.
timestamp                      max_per_year    which_month_is_responsible
-----------------------------  --------------  ---------------------------
2000-01-05T00:00:00.000000Z    1.69275         2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z    1.767399999999  2001-05-05T00:00:00.000000Z
...                            ...             ...
What did I try?
I tried adding a subquery to the SELECT to get a "duplicate" of sorts of the timestamp column, but apparently that's never valid in QuestDB (?). So is the solution to add even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.

I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
    SELECT timestamp_floor('d', ts) as day,
           ts,
           avg_in_15_min as max_per_day,
           row_number() OVER (PARTITION BY timestamp_floor('d', ts)
                              ORDER BY avg_in_15_min DESC) as rn_per_day
    FROM
    (
        SELECT ts, avg(consumption) as avg_in_15_min
        FROM electricity
        SAMPLE BY 15m
    )
) WHERE rn_per_day = 1
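Adapted to the gas_prices example from the question, the same pattern might look roughly like this (only a sketch, assuming the demo table and column names shown above and a QuestDB version that supports window functions and a 'y' unit in timestamp_floor): compute the monthly averages with SAMPLE BY 1M, rank the months within each year, and keep the top-ranked month per year.
SELECT year, max_per_year, which_month_is_responsible
FROM
(
    SELECT timestamp_floor('y', timestamp) as year,
           timestamp as which_month_is_responsible,
           avg_per_month as max_per_year,
           row_number() OVER (PARTITION BY timestamp_floor('y', timestamp)
                              ORDER BY avg_per_month DESC) as rn_per_year
    FROM
    (
        SELECT timestamp, avg(galon_price) as avg_per_month
        FROM 'gas_prices'
        SAMPLE BY 1M
    )
) WHERE rn_per_year = 1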

Related

Fill in gaps in time series after a group by date in BigQuery

I have a big dataset with multiple time series of different products and quantities sold. After I group by Product_ID and Date (monthly), products that have not been sold during a specific month produce gaps in the time series. How can I fill all these gaps with rows where the quantity sold is 0?
I found similar stuff but haven't been able to adapt it to my situation.
Your question is rather vague (it's best to provide example input and output in future), but here is my guess as to what you are trying to do.
You will probably want to create a calendar table with the intervals of your data. I will use daily as the example since you have not specified an interval; you can adapt it for other granularities as required.
CREATE OR REPLACE TABLE `project.my_dataset.daily_calendar` (
MY_DATE DATE NOT NULL
);
INSERT INTO `project.my_dataset.daily_calendar`
SELECT
MY_DATE
FROM
(
SELECT DATE_ADD(DATE '2000-01-01', INTERVAL param DAY) AS MY_DATE
FROM unnest(GENERATE_ARRAY(0, 10000, 1)) as param
)
;
The above creates a calendar with daily intervals covering 10,000 days from the start of 2000; adjust the range according to your requirements.
You will then want to cross join the calendar with your list of products and left join your sales data onto that, which creates a record for every day and every product:
select
  cal.MY_DATE,
  prod.product_name,
  ifnull(your_data.quantity, 0) as quantity -- if no data then make the value 0
from `project.my_dataset.daily_calendar` cal
cross join (select distinct product_name from `project.dataset.table`) prod
left join `project.dataset.table` your_data
  on your_data.product_name = prod.product_name
  and your_data.date = cal.MY_DATE
From your vague description, this could work. Ideally provide example input and output in future to get a more accurate response.
Edit:
An alternate (arguably better) method to fill the calendar table, as suggested by @Cylldby:
INSERT INTO `project.my_dataset.daily_calendar`
SELECT
MY_DATE
FROM
(
SELECT param AS MY_DATE
FROM unnest(GENERATE_DATE_ARRAY('2010-01-01', '2030-12-31')) as param
)
;

Sum dates with different timestamps and picking the min date?

Beginner here. I want to have only one row for each delivery date but it is important to keep the hours and the minutes. I have the following table in Oracle (left):
As you can see, there are days when a certain SKU (e.g. SKU A) was delivered twice on the same day. The table on the right is the desired result. Essentially, I want to have the quantities that arrived on the 28th summed up, and in the Supplier_delivery column I want to have the earliest delivery timestamp.
I need to keep the hours and the minutes; otherwise I know I could achieve this by writing something like: SELECT SKU, TRUNC(TO_DATE(SUPPLIER_DELIVERY), 'DDD'), SUM(QTY) FROM TABLE GROUP BY SKU, TRUNC(TO_DATE(SUPPLIER_DELIVERY), 'DDD')
Any ideas?
You can use MIN():
SELECT SKU, MIN(SUPPLIER_DELIVERY), SUM(QTY)
FROM TABLE
GROUP BY SKU, TRUNC(SUPPLIER_DELIVERY);
This assumes that SUPPLIER_DELIVERY is a date and does not need to be converted to one. But it would work with TO_DATE() in the GROUP BY as well.
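For instance, if SUPPLIER_DELIVERY were stored as text, a converted version might look like this (only a sketch; the table name and format mask are placeholders and should match how the column is actually stored):
-- "deliveries" and the format mask below are assumed placeholders
SELECT SKU,
       MIN(TO_DATE(SUPPLIER_DELIVERY, 'YYYY-MM-DD HH24:MI')) AS SUPPLIER_DELIVERY,
       SUM(QTY) AS QTY
FROM deliveries
GROUP BY SKU, TRUNC(TO_DATE(SUPPLIER_DELIVERY, 'YYYY-MM-DD HH24:MI'));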

Calculation of weighted average counts in SQL

I have a query that I am currently using to find counts:
select Name, Count(Distinct(ID)), Status, Team, Date from list
In addition to the counts, I need to calculate a goal based on weighted average of counts per status and team, for each day.
For example, if Name 1 counts are divided into 50% Status1-Team1(X) and 50% Status2-Team2(Y) yesterday, then today's goal for Name1 needs to be (X+Y)/2.
The table would look like this, with the 'Goal' field needed as the output:
What is the best way to do this in the same query?
I'm almost guessing here since you did not provide more details, but maybe you want to do this:
SELECT name, status, team, data,
       (SELECT SUM(data) / (SELECT COUNT(*) FROM list WHERE name = q.name)) AS goal
FROM (
    SELECT Name, COUNT(DISTINCT ID) AS data, Status, Team, Date
    FROM list
    GROUP BY Name, Status, Team, Date
) AS q

Query to find a weekly average

I have an SQLite database with the following fields for example:
date (yyyymmdd format)
total (0.00 format)
There are typically 2 months of records in the database. Does anyone know a SQL query to find a weekly average?
I could easily just execute:
SELECT COUNT(1) as total_records, SUM(total) as total FROM stats_adsense
Then just divide total by 7, but unless the number of days in the db is exactly divisible by 7 I don't think it will be very accurate, especially if there are fewer than 7 days of records.
To get a daily summary it's obviously just total / total_records.
Can anyone help me out with this?
You could try something like this:
SELECT strftime('%W', thedate) theweek, avg(total) theaverage
FROM table GROUP BY strftime('%W', thedate)
I'm not sure how the syntax would work in SQLite, but one way would be to parse out the date parts of each [date] field, specify the WEEK and DAY boundaries in your WHERE clause, and then GROUP BY the week. This will give you a true average regardless of whether there are rows or not.
Something like this (using T-SQL):
SELECT DATEPART(wk, theDate) AS Week, AVG(theAmount) AS Average
FROM Table
GROUP BY DATEPART(wk, theDate)
This will return a row for every week. You could filter it in your WHERE clause to restrict it to a given date range.
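For example, restricting to a given range could look like this (the date boundaries below are just placeholders):
SELECT DATEPART(wk, theDate) AS Week, AVG(theAmount) AS Average
FROM Table
WHERE theDate >= '2024-01-01' AND theDate < '2024-03-01' -- placeholder range
GROUP BY DATEPART(wk, theDate)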
Hope this helps.
Your weekly average is
daily * 7
Obviously this doesn't take into account specific weeks, but you can get that by narrowing the result set to a date range.
You'll have to omit from the sum those records that don't belong to a full week. So, before summing up, you'll have to find the min and max dates, adjust them so that they span "whole" weeks, and then run your original query with a WHERE clause that limits the date values to the new range. Maybe you can even put all this into one query; I'll leave that up to you. ;-)
The "truncated" values are then simply not used. If there aren't enough values for even one full week, there's no result at all, but there's no way around that.
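Put into a single query, that idea might look roughly like this (only a sketch, assuming the stats_adsense table and yyyymmdd text dates from the question):
-- convert yyyymmdd to ISO dates, count the complete 7-day weeks in the data,
-- then average only over the days that fall inside those complete weeks
WITH iso AS (
    SELECT substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2) AS d,
           total
    FROM stats_adsense
),
bounds AS (
    SELECT MIN(d) AS start_d,
           CAST((julianday(MAX(d)) - julianday(MIN(d)) + 1) / 7 AS INTEGER) AS full_weeks
    FROM iso
)
SELECT SUM(iso.total) / bounds.full_weeks AS weekly_average
FROM iso, bounds
WHERE iso.d < date(bounds.start_d, '+' || (bounds.full_weeks * 7) || ' days')
GROUP BY bounds.full_weeks;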

SQL: Calculating system load statistics

I have a table like this that stores messages coming through a system:
Message
-------
ID (bigint)
CreateDate (datetime)
Data (varchar(255))
I've been asked to calculate the messages saved per second at peak load. The only data I really have to work with is the CreateDate. The load on the system is not constant, there are times when we get a ton of traffic, and times when we get little traffic. I'm thinking there are two parts to this problem: 1. Determine ranges of time that are considered peak load, 2. Calculate the average messages per second during these times.
Is this the right approach? Are there things in SQL that can help with this? Any tips would be greatly appreciated.
I agree: you have to figure out what peak load is before you can start to create reports on it.
The first thing I would do is figure out how I am going to define peak load, e.g. am I going to look at an hour-by-hour breakdown?
Next, I would do a GROUP BY on the CreateDate formatted to whole seconds (no milliseconds). As part of the group by, I would take an average based on the number of records, along the lines of the sketch below.
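A sketch of that idea in T-SQL (assuming the Message table from the question; the hour-level rollup is just one way to slice it):
-- count rows per one-second bucket, then summarise those per-second counts for each hour
SELECT DATEPART(hour, SecondStart) AS HourOfDay,
       AVG(CountPerSecond * 1.0)   AS AvgMessagesPerSecond,
       MAX(CountPerSecond)         AS PeakMessagesPerSecond
FROM (
    SELECT MIN(CreateDate) AS SecondStart,
           COUNT(*)        AS CountPerSecond
    FROM Message
    GROUP BY CONVERT(char(19), CreateDate, 120)   -- one bucket per second
) buckets
GROUP BY DATEPART(hour, SecondStart)
ORDER BY PeakMessagesPerSecond DESC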
I don't think you'd need to know the peak hours; you can generate them with SQL, wrapping the full query and selecting the top 20 entries, for example:
select top 20 *
from (
[...load query here...]
) qry
order by LoadPerSecond desc
This answer had a good lesson about averages. You can calculate the load per second by looking at the load per hour, and dividing by 3600.
To get a first glimpse of the load for the last week, you could try (SQL Server syntax):
select datepart(dy, createdate) as DayOfYear,
       datepart(hour, createdate) as Hour,
       count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(week, -1, getdate())
group by datepart(dy, createdate), datepart(hour, createdate)
To find the peak load per minute:
select max(MessagesPerMinute)
from (
    select count(*) as MessagesPerMinute
    from message
    where CreateDate > dateadd(day, -7, getdate())
    group by datepart(dy, createdate), datepart(hour, createdate), datepart(minute, createdate)
) as LoadPerMinute
Grouping by datepart(dy, ...) is an easy way to distinguish between days without worrying about month borders. It works until you select more than a year back, but that would be unusual for performance queries.
Warning: these will run slowly!
This will group your data into one-second buckets and list them from the most activity to the least:
SELECT
CONVERT(char(19),CreateDate,120) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY CONVERT(Char(19),CreateDate,120)
ORDER BY 2 Desc
This will group your data into one-minute buckets and list them from the most activity to the least:
SELECT
LEFT(CONVERT(char(19),CreateDate,120),16) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY LEFT(CONVERT(char(19),CreateDate,120),16)
ORDER BY 2 Desc
I'd take those values and calculate what they want.