SQL: Calculating system load statistics - sql

I have a table like this that stores messages coming through a system:
Message
-------
ID (bigint)
CreateDate (datetime)
Data (varchar(255))
I've been asked to calculate the messages saved per second at peak load. The only data I really have to work with is the CreateDate. The load on the system is not constant, there are times when we get a ton of traffic, and times when we get little traffic. I'm thinking there are two parts to this problem: 1. Determine ranges of time that are considered peak load, 2. Calculate the average messages per second during these times.
Is this the right approach? Are there things in SQL that can help with this? Any tips would be greatly appreciated.

I agree, you have to figure out what Peak Load is first before you can start to create reports on it.
The first thing I would do is figure out how I am going to define peak load. Ex. Am I going to look at an hour by hour breakdown.
Next I would do a group by on the CreateDate formated in seconds (no milleseconds). As part of the group by I would do an avg based on number of records.

I don't think you'd need to know the peak hours; you can generate them with SQL, wrapping a the full query and selecting the top 20 entries, for example:
select top 20 *
from (
[...load query here...]
) qry
order by LoadPerSecond desc
This answer had a good lesson about averages. You can calculate the load per second by looking at the load per hour, and dividing by 3600.
To get a first glimpse of the load for the last week, you could try (Sql Server syntax):
select datepart(dy,createdate) as DayOfYear,
hour(createdate) as Hour,
count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(week,-7,getdate())
group by datepart(dy,createdate), hour(createdate)
To find the peak load per minute:
select max(MessagesPerMinute)
from (
select count(*) as MessagesPerMinute
from message
where CreateDate > dateadd(days,-7,getdate())
group by datepart(dy,createdate),hour(createdate),minute(createdate)
)
Grouping by datepart(dy,...) is an easy way to distinguish between days without worrying about month borders. It works until you select more that a year back, but that would be unusual for performance queries.

warning, these will run slow!
this will group your data into "second" buckets and list them from the most activity to least:
SELECT
CONVERT(char(19),CreateDate,120) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY CONVERT(Char(19),CreateDate,120)
ORDER BY 2 Desc
this will group your data into "minute" buckets and list them from the most activity to least:
SELECT
LEFT(CONVERT(char(19),CreateDate,120),16) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY LEFT(CONVERT(char(19),CreateDate,120),16)
ORDER BY 2 Desc
I'd take those values and calculate what they want

Related

Finding the initial sampled time window after using SAMPLE BY again

I can't seem to find a perhaps easy solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest galon price.
Output
Using the following query, I can get the average galon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp
avg_per_month
2000-06-05T00:00:00.000000Z
1.6724
2000-07-05T00:00:00.000000Z
1.69275
2000-08-05T00:00:00.000000Z
1.635
...
...
Then, I get all these monthly averages, group them by year and return the maximum galon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp
max_per_year
2000-01-05T00:00:00.000000Z
1.69275
2001-01-05T00:00:00.000000Z
1.767399999999
2002-01-05T00:00:00.000000Z
1.52075
...
...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum galon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as average price? I'd like to display this month in an additional column.
For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.
timestamp
max_per_year
which_month_is_responsible
2000-01-05T00:00:00.000000Z
1.69275
2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z
1.767399999999
2001-05-05T00:00:00.000000Z
...
...
What did I try?
I tried by adding a subquery to the SELECT to have a "duplicate" of some sort for the timestamp column but that's apparently never valid in QuestDB (?), so probably the solution is by adding even more subqueries in the FROM? Or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
SELECT timestamp_floor('d', ts) as day,
ts,
avg_in_15_min as max_per_day,
row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day
FROM
(
SELECT ts, avg(consumption) as avg_in_15_min
FROM electricity
SAMPLE BY 15m
)
) WHERE rn_per_day = 1

Grafana User Growth Time Series with SQL Server

I'm trying display the user growth per day using Grafana Time Series with SQL Server. However I found the documentation to be unhelpful and my queries are incorrect.
The following returns a constant value of 1 for every day. What do I need to change to display the number of new users created per day?
Thank you very much in advance.
SELECT
$__timeGroup([created_at],'1d') as time,
COUNT(id) as value,
'users' as metric
FROM [db].[user]
WHERE $__timeFilter([created_at])
GROUP BY [created_at]
ORDER BY 1
This works for me:
SELECT
$__timeGroup(created_at, '1d') AS time,
COUNT(id) as 'New Users'
FROM [db].[user]
GROUP BY $__timeGroup(created_at, '1d')
ORDER BY 1

Use SQL to ensure I have data for each day of a certain time period

I'm looking to only select one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, as we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data - want to make sure no date was left out), and this data goes from now through last summer. I could do a COUNT DISTINCT clause to just make sure the number of days between the first data point and yesterday (the latest data point), but I want to verify each day is accounted for. Should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
**I know FIRST is probably not the best clause for this case, and it's currently acting to grab the very first data point in created_at, but just as a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
BigQuery equivalent of the query in your question
SELECT created_at
FROM 'large_table'
ORDER BY created_at
LIMIT 1

Is there a way to handle immutability that's robust and scalable?

Since bigquery is append-only, I was thinking about stamping each record I upload to it with an 'effective date' similar to how peoplesoft works, if anybody is familiar with that pattern.
Then, I could issue a select statement and join on the max effective date
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the big query 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what the exact query you are trying to run is.. it looks like you are trying to return the most recent sales times of every individual item - and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used FROM
[publicdata:samples.wikipedia] WHERE (contributor_username != '' or
contributor_username IS NOT NULL) AND timestamp > 1133395200
AND timestamp < 1157068800 GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL AND timestamp > 1265073722 GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to use first build a large aggregate (for example, you need to run essentially an accurate COUNT DISTINCT) another option is to break this query up into two queries. The first query could provide the max effective date by month along with a count and save this result as a new table. Then, could run a sum query on the resulting table.
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.

SQL Select statement Where time is *:00

I'm attempting to make a filtered table based off an existing table. The current table has rows for every minute of every hour of 24 days based off of locations (tmcs).
I want to filter this table into another table that has rows for just 1 an hour for each of the 24 days based off the locations (tmcs)
Here is the sql statement that i thought would have done it...
SELECT
Time_Format(t.time, '%H:00') as time, ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time FROM traffic t
GROUP BY time, tmc, Date
The problem is i still get 247,000 rows effected...and according to simple math I should only have:
Locations (TMCS): 14
Hours in a day: 24
Days tracked: 24
Total = 14 * 24 * 24 = 12,096
My original table has 477,277 rows
When I make a new table off this query i get right around 247,000 which makes no sense, so my query must be wrong.
The reason I did this method instead of a where clause is because I wanted to find the average speed(avg)per hour. This is not mandatory so I'd be fine with using a Where clause for time, but I just don't know how to do this based off *:00
Any help would be much appreciated
Fix the GROUP BY so it's standard, rather then the random MySQL extension
SELECT
Time_Format(t.time, '%H:00') as time,
ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time
FROM traffic t
GROUP BY
Time_Format(t.time, '%H:00'), tmc, Date, Date_Time
Run this with SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY'; to see the errors that other RDBMS will give you and make MySQL work properly