Meaning of SQL code in reviewed code

I need to review some code from a test application written in PHP/MySQL.
The author of this code wrote three SQL queries.
I can't tell whether he intended some performance optimization here.
DB::fetch("SELECT COUNT( * ) AS count, `ip`,`datetime`
FROM `logs`
WHERE `datetime` > ('2006-02-03' - INTERVAL 10 DAY)
GROUP BY `ip`
ORDER BY `datetime` DESC");
$hits = DB::fetchAll("SELECT COUNT( * ) AS count, `datetime`
FROM `logs`
WHERE `datetime` > ( '2006-02-03' - INTERVAL 10
DAY ) AND `is_doc` = 1
GROUP BY `datetime`
ORDER BY `datetime` DESC");
$hosts = DB::fetchAll("SELECT COUNT( * ) AS hosts , datetime
FROM (
SELECT `ip` , datetime
FROM `logs`
WHERE `is_doc` = 1
GROUP BY `datetime` , `ip`
ORDER BY `logs`.`datetime` DESC
) AS cnt
WHERE cnt.datetime > ( '2006-02-03' - INTERVAL 10
DAY )
GROUP BY cnt.datetime
ORDER BY datetime DESC ");
The results of the first query are not used in the application.

The 1st query is invalid, as it selects 2 columns + 1 aggregate and only groups by 1 of the 2 columns selected.
The 2nd query gets a count of all rows in logs per date within the 10 days before 2006-02-03.
The 3rd query gets a count of all distinct ip values per date from logs within the same 10-day window, and could be better written as
SELECT COUNT(DISTINCT ip) hosts, datetime
FROM logs
WHERE is_doc = 1
GROUP BY datetime
ORDER BY datetime desc
If this was a submission for a job interview, you might wonder why the cutoff date isn't passed in as a variable.
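As a sanity check on the suggested rewrite, here is a minimal sketch in Python with an in-memory SQLite database (the table contents are made up for illustration):

```python
import sqlite3

# Hypothetical miniature of the `logs` table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ip TEXT, datetime TEXT, is_doc INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("1.1.1.1", "2006-01-30", 1),
        ("1.1.1.1", "2006-01-30", 1),  # same host, same day: counted once
        ("2.2.2.2", "2006-01-30", 1),
        ("1.1.1.1", "2006-01-31", 1),
        ("3.3.3.3", "2006-01-31", 0),  # is_doc = 0, excluded
    ],
)

# The simplified third query: distinct hosts per day.
rows = conn.execute("""
    SELECT COUNT(DISTINCT ip) AS hosts, datetime
    FROM logs
    WHERE is_doc = 1
    GROUP BY datetime
    ORDER BY datetime DESC
""").fetchall()
print(rows)  # [(1, '2006-01-31'), (2, '2006-01-30')]
```

COUNT(DISTINCT ip) per datetime gives the per-day host counts directly, with no derived table.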

Related

Fill Sparse Data with SQL (Rockset)

We have created the following query to convert sparse time-series data into dense data with specific time slots. The idea is that a time range (e.g. 1 hour) is split into distinct time slots (e.g. 60 x 1 min slots). For each slot (1 min in this example) we check whether there are one or more values; if there are, we use a MAX function to get our value. If there are no values in the slot, we carry over the one from the previous slot.
Here is the basic query:
WITH readings AS (
(
-- Get the first value before the time window to set the entry value
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T08:42:06.000000Z'
ORDER BY
ts DESC
LIMIT
1
)
UNION
(
-- Get the values in the time range
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) > TIMESTAMP '2021-10-26T08:42:06.000000Z'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T09:42:06.000000Z'
)
),
slots AS (
-- Create time slots at the correct resolution
SELECT
TIMESTAMP '2021-10-26T08:42:06.000000Z' + MINUTES(u.i - 1) AS last_ts,
TIMESTAMP '2021-10-26T08:42:06.000000Z' + MINUTES(u.i) AS ts
FROM
UNNEST(SEQUENCE(0, 60, 1) AS i) AS u
),
slot_values AS (
-- Get the values for each time slot from the readings retrieved
SELECT
slots.ts,
(
SELECT
r.value
FROM
readings r
WHERE
r.ts <= slots.ts
ORDER BY
r.ts DESC
LIMIT
1
) AS last_val,
(
SELECT
MAX(r.value)
FROM
readings r
WHERE
r.ts <= slots.ts
AND r.ts >= slots.last_ts
) AS slot_agg_val,
FROM
slots
)
SELECT
-- Use either the MAX value if several are in the same slot or the last if none
CAST(ts AT TIME ZONE 'Europe/Paris' AS string) AS ts,
COALESCE(
slot_agg_val,
LAG(slot_agg_val, 1) OVER(
ORDER BY
ts
),
last_val
) AS value
FROM
slot_values
ORDER BY
ts;
The good news is that the query works. The bad news is that the performance is terrible!
Interestingly, the part of the query that retrieves the data from storage is very performant; in our case it returns all its results in ~50 ms:
WITH readings AS (
(
-- Get the first value before the time window to set the entry value
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T08:42:06.000000Z'
ORDER BY
ts DESC
LIMIT
1
)
UNION
(
-- Get the values in the time range
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) > TIMESTAMP '2021-10-26T08:42:06.000000Z'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T09:42:06.000000Z'
)
)
Having analysed the different parts of the query, the one that is blowing up the performance is this:
slot_values AS (
-- Get the values for each time slot from the readings retrieved
SELECT
slots.ts,
(
SELECT
r.value
FROM
readings r
WHERE
r.ts <= slots.ts
ORDER BY
r.ts DESC
LIMIT
1
) AS last_val,
(
SELECT
MAX(r.value)
FROM
readings r
WHERE
r.ts <= slots.ts
AND r.ts >= slots.last_ts
) AS slot_agg_val,
FROM
slots
)
For some reason this part takes ~25 seconds to execute! I would really appreciate some help in optimizing this query.
I would use JOIN and AGGREGATION logic to compute this; SQL works well with map-and-reduce-style logic.
Try
SELECT
filled_slots.ts,
MAX(value) AS last_val,
slot_agg_val
FROM
(
SELECT
slots.ts,
MAX(previous_r.ts) last_previous_time,
MAX(in_interval_r.value) AS slot_agg_val,
FROM
slots
LEFT JOIN readings previous_r ON previous_r.ts <= slots.ts
LEFT JOIN readings in_interval_r ON in_interval_r.ts < slots.ts
AND in_interval_r.ts > slots.last_ts
GROUP BY
slots.ts
) filled_slots
LEFT JOIN readings ON filled_slots.last_previous_time = readings.ts
GROUP BY
filled_slots.ts,
slot_agg_val
The last aggregation is there to avoid issues due to duplicated data.
The code is not tested.
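The join-and-aggregate idea above can be sketched in miniature with SQLite in Python, using integer slot numbers in place of timestamps (the table contents and the half-open slot interval are assumptions of this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (ts INTEGER, value REAL);
    INSERT INTO readings VALUES (0, 10.0), (1, 12.0), (4, 9.0);
    CREATE TABLE slots (last_ts INTEGER, ts INTEGER);
    INSERT INTO slots VALUES (0, 1), (1, 2), (2, 3), (3, 4), (4, 5);
""")

# For each slot: the MAX of readings inside the slot if any exist,
# otherwise the value of the latest reading at or before the slot end.
rows = conn.execute("""
    SELECT filled_slots.ts,
           COALESCE(filled_slots.slot_agg_val, MAX(readings.value)) AS value
    FROM (
        SELECT slots.ts,
               MAX(previous_r.ts) AS last_previous_time,
               MAX(in_interval_r.value) AS slot_agg_val
        FROM slots
        LEFT JOIN readings previous_r ON previous_r.ts <= slots.ts
        LEFT JOIN readings in_interval_r ON in_interval_r.ts <= slots.ts
                                        AND in_interval_r.ts > slots.last_ts
        GROUP BY slots.ts
    ) filled_slots
    LEFT JOIN readings ON filled_slots.last_previous_time = readings.ts
    GROUP BY filled_slots.ts, filled_slots.slot_agg_val
    ORDER BY filled_slots.ts
""").fetchall()
print(rows)  # [(1, 12.0), (2, 12.0), (3, 12.0), (4, 9.0), (5, 9.0)]
```

The two correlated subqueries of the slow version become two joins that the engine can evaluate once, which is the essence of the proposed optimization.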

Retrieve date from subtracting variable number of days from date on calendar table

In our system we have created a table that lists all the days out to 20230, with a special field that flags a holiday/weekend.
SAMPLE BELOW:
DATE_FIELD HOLIDAY_FIELD
20200430 N
20200501 N
20200502 Y
20200503 Y
20200504 N
20200505 N
20200506 N
20200507 N
..............
My goal is to provide a date variable and subtract x number of days from the provided date.
The number of days is not constant; it can vary, so a hard-coded FETCH or LIMIT won't work.
I've already tried the code below, and it works just as I want if I always subtract 5 days from the given date:
select date_field
from table.calendar
where date_field <= '20200507' and holiday_field = 'N'
order by date_field desc
LIMIT 5,1
This will give me the result I want '20200430' because it skips the weekends.
However I want to be able to do something like below:
select date_field
from table.calendar
where date_field <= (variable date) and holiday_field = 'N'
order by date_field desc
LIMIT (variable n),1
But from what I've read, you cannot specify a variable for a FETCH or LIMIT.
Also, this select statement will be used in a sub-select.
So it will most likely be used as below:
SELECT table1.*,
( select date_field
from table.calendar
where date_field <= (table1.date) and holiday_field = 'N'
order by date_field desc
LIMIT (table1.days n),1 ) AS DATE
from table1
order by table1.date
I've tried using row_number(), but I have no clue how to pass the date and days variables.
The attempt below starts from the absolute top of the list and goes down, whereas I need it to start from a specific date.
with CALENDAR AS(
SELECT x.* FROM (
select date_field,
row_number() over() as rownum
from table.calendar X
where holiday_field = 'N'
order by date_field
) AS t
)
select table1.*, A.date_field
from table1
left join CALENDAR A on A.date_field <= table1.date and A.rownum = 5
I also understand I could easily do this in a user-defined function, but my ultimate goal is to produce SQL views to export to third-party software. There is a severe performance slowdown when using user functions in SQL views.
Any suggestions?
The solution with global variables.
CREATE OR REPLACE VARIABLE GV_DATE_FIELD VARCHAR(8) DEFAULT (TO_CHAR(CURRENT DATE, 'YYYYMMDD'));
CREATE OR REPLACE VARIABLE GV_DAYS INT DEFAULT 5;
CREATE OR REPLACE VIEW CALENDAR_V AS
select date_field
from
(
select date_field, rownumber() over (order by date_field desc) rn
from calendar
where date_field <= GV_DATE_FIELD and holiday_field = 'N'
)
where rn = GV_DAYS;
-- GV_DATE_FIELD == TO_CHAR(CURRENT DATE, 'YYYYMMDD')
-- GV_DAYS == 5
select * from CALENDAR_V;
SET GV_DAYS = 4;
-- GV_DATE_FIELD == TO_CHAR(CURRENT DATE, 'YYYYMMDD')
-- GV_DAYS == 4
select * from CALENDAR_V;
These two global variables are set to their default values for every session and work as parameters.
You may set their values explicitly (with the SET statement, as shown) before running the statement that uses them (the CALENDAR_V view in this case) to get the corresponding result.
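The ROW_NUMBER() approach can be sketched with SQLite in Python, using the sample calendar from the question. Note the off-by-one: LIMIT 5,1 skips five rows, so it corresponds to rn = n + 1 here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (date_field TEXT, holiday_field TEXT)")
conn.executemany(
    "INSERT INTO calendar VALUES (?, ?)",
    [
        ("20200430", "N"), ("20200501", "N"), ("20200502", "Y"),
        ("20200503", "Y"), ("20200504", "N"), ("20200505", "N"),
        ("20200506", "N"), ("20200507", "N"),
    ],
)

def working_days_back(given_date, n):
    # Number the non-holiday dates at or before the given date,
    # newest first, and pick row n + 1 (row 1 is the given date itself).
    row = conn.execute("""
        SELECT date_field FROM (
            SELECT date_field,
                   ROW_NUMBER() OVER (ORDER BY date_field DESC) AS rn
            FROM calendar
            WHERE date_field <= ? AND holiday_field = 'N'
        ) WHERE rn = ? + 1
    """, (given_date, n)).fetchone()
    return row[0] if row else None

print(working_days_back("20200507", 5))  # '20200430'
```

Because both the date and the day count are ordinary parameters here, the same shape works as a correlated sub-select or, as in the answer, with global variables standing in for the parameters.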

Standard deviation of a set of dates

I have a table of transactions with columns id | client_id | datetime, and I have calculated the mean number of days between transactions to know how often these transactions are made by each client:
SELECT *, ((date_last_transaction - date_first_transaction)/total_transactions) AS frequency
FROM (
SELECT client_id, COUNT(id) AS total_transactions, MIN(datetime) AS date_first_transaction, MAX(datetime) AS date_last_transaction
FROM transactions
GROUP BY client_id
) AS t;
What methods exist to calculate the standard deviation (in days) of a set of dates in PostgreSQL? Preferably with only one query, if it is possible :-)
I have found this way:
SELECT extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 1 THEN
0
ELSE
SUM(time_since_last_invoice)/(COUNT(*)-1)
END
) * '1 day'::interval)) AS days_between_purchases,
extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 2 THEN
0
ELSE
STDDEV(time_since_last_invoice)
END
) * '1 day'::interval)) AS range_of_days
FROM (
SELECT client_id, datetime, COALESCE(datetime - lag(datetime)
OVER (PARTITION BY client_id ORDER BY client_id, datetime
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
), 0
) AS time_since_last_invoice
FROM my_table
GROUP BY client_id, datetime
ORDER BY client_id, datetime
)
Explanation:
The inner query groups by client and date, calculates the difference between each pair of consecutive transaction dates (datetime) per client_id, and returns a table with these results. After this, the outer query processes that table and calculates the average of the differences greater than 0 (the first value in each group is excluded because it is the first transaction, so its interval is 0).
The standard deviation is only calculated when there are 2 or more transaction dates for the same client, to avoid division-by-zero errors.
All differences are returned in PostgreSQL interval format.
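For intuition, the same statistics can be computed outside the database. Here is a minimal Python sketch with made-up transaction dates, where the gaps play the role of time_since_last_invoice and statistics.stdev matches PostgreSQL's sample STDDEV:

```python
import statistics
from datetime import date

# Hypothetical transaction dates for one client.
dates = [date(2023, 1, 1), date(2023, 1, 4), date(2023, 1, 6), date(2023, 1, 13)]

# Differences in days between consecutive transactions,
# mirroring what lag(datetime) produces per client.
gaps = [(b - a).days for a, b in zip(dates, dates[1:])]  # [3, 2, 7]

mean_days = statistics.mean(gaps)     # average gap between purchases
stddev_days = statistics.stdev(gaps)  # sample stddev, like STDDEV in Postgres
print(mean_days, round(stddev_days, 2))
```

With n transactions there are only n - 1 gaps, which is exactly why the query guards the mean with COUNT(*) <= 1 and the standard deviation with COUNT(*) <= 2.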

How to write an SQL query for the following case?

I have a Change Report table which has two columns, ChangedTime and FileName.
Please consider that this table has over 1000 records.
I need to query all the changes based on the following factors:
i) interval (e.g. 1 min)
ii) number of files
That means, given an interval of 1 min and a number of files of 10:
if the number of changed files in any 1-minute interval is more than 10, we need to get all the changed files in that 1-minute interval.
Example:
i) Consider we have 15 changes in the interval 11:52 to 11:53.
ii) And consider we have 20 changes in the interval 12:58 to 12:59.
Then my expected result would be 35 records.
Thanks in advance.
You need to aggregate by the interval and then do the count. Assuming that an interval starting at time 0 is ok, the following should work:
declare @interval int = 1;
declare @limit int = 10;
select sum(cnt)
from (select count(*) as cnt
from t
group by DATEDIFF(minute, 0, ChangedTime)/@interval
) t
where cnt >= @limit;
If you have another time in mind for when intervals should start, then substitute that for 0.
EDIT:
For your particular query:
select sum(ChangedTime)
from (select count(*) as ChangedTime
from [MyDB].[dbo].[Log_Table.in_PC]
group by DATEDIFF(minute, 0, ChangedTime)/@interval
) t
where ChangedTime >= @limit;
You can't have a three part alias name on a subquery. t will do.
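The bucket-then-filter idea can be sketched with SQLite in Python (strftime stands in for DATEDIFF-based minute bucketing; the sample rows reproduce the 15 + 20 = 35 example from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ChangedTime TEXT, FileName TEXT)")
rows = [("2024-01-01 11:52:%02d" % i, "f%d.txt" % i) for i in range(15)]   # 15 changes at 11:52
rows += [("2024-01-01 12:58:%02d" % i, "g%d.txt" % i) for i in range(20)]  # 20 changes at 12:58
rows += [("2024-01-01 13:05:00", "h.txt")]                                 # below the limit
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

limit = 10
# Bucket by minute, then sum the counts of only those buckets
# that hold at least `limit` changes.
total = conn.execute("""
    SELECT SUM(cnt) FROM (
        SELECT COUNT(*) AS cnt
        FROM t
        GROUP BY strftime('%Y-%m-%d %H:%M', ChangedTime)
    ) WHERE cnt >= ?
""", (limit,)).fetchone()[0]
print(total)  # 35
```

The 13:05 bucket has only one change, so it is filtered out by the cnt >= limit condition, and only the two busy minutes contribute.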
Something like this should work:
You count the number of records using the COUNT() function.
Then you limit the selection with the WHERE clause:
SELECT COUNT(FileName)
FROM "YourTable"
WHERE ChangedTime >= "StartInterval"
AND ChangedTime <= "EndInterval";
Another method that is useful in a WHERE clause is BETWEEN: http://msdn.microsoft.com/en-us/library/ms187922.aspx.
You didn't state which SQL DB you are using, so I assume it's MSSQL.
select count(*) from (select a.FileName,
b.ChangedTime startTime,
a.ChangedTime endTime,
DATEDIFF ( minute , a.ChangedTime , b.ChangedTime ) timeInterval
from yourtable a, yourtable b
where a.FileName = b.FileName
and a.ChangedTime > b.ChangedTime
and DATEDIFF ( minute , a.ChangedTime , b.ChangedTime ) = 1) temp
group by temp.FileName

SQL query records within a range of boundaries and max/min outside the range

I have the following three simple T-SQL queries. The first one is to get records within a range of boundaries (DATETIME type):
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN @startDT AND @endDT
the second one is to get the closest record to @startDT (DATETIME type):
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > @startDT
ORDER BY timestamp DESC
and the last one is to get the closest record after @endDT:
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < @endDT
ORDER BY timestamp ASC
I would like to get all the records of the above three queries as one result set. I tried to use UNION, but it seems that sub-queries within a UNION do not allow an ORDER BY clause. Is there an efficient way to get my result?
. . * | * * * * * | * . . .
start end
The graph above shows the records marked * as my required records; |...| marks the boundaries.
By the way, the amount of data in myTable is huge, and my understanding is that UNION is not an efficient way to combine the results. Is there an efficient way to get the data without UNION?
As you wish, without UNION.
MySQL (TESTED)
SELECT
dv1.timestamp, dv1.value
FROM
myTable AS dv1
WHERE
dv1.timestamp
BETWEEN (
SELECT dv2.timestamp
FROM myTable AS dv2
WHERE dv2.timestamp < '#START_DATE'
ORDER BY dv2.timestamp DESC
LIMIT 1
)
AND ( SELECT dv3.timestamp
FROM myTable AS dv3
WHERE dv3.timestamp > '#END_DATE'
ORDER BY dv3.timestamp ASC
LIMIT 1
)
EDIT Sorry, I forgot to notice about T-SQL.
T-SQL (NOT TESTED)
SELECT
dv1.timestamp, dv1.value
FROM
myTable AS dv1
WHERE
dv1.timestamp
BETWEEN (
SELECT TOP 1 dv2.timestamp
FROM myTable AS dv2
WHERE dv2.timestamp > @START_DATE
ORDER BY dv2.timestamp DESC
)
AND ( SELECT TOP 1 dv3.timestamp
FROM myTable AS dv3
WHERE dv3.timestamp < @END_DATE
ORDER BY dv3.timestamp ASC
)
Note: if the result is not right, you can just swap the sub-queries (i.e. the operators and ASC/DESC).
Think outside the box :)
You can use MAX/MIN to get the value you need. ORDER BY + TOP 1 isn't the best way to get a max value, from what I can see in your queries. Sorting n items is O(n log n), while getting the max is only O(n).
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN @startDT AND @endDT
union
select A.Value, A.TimeStamp
From (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > @startDT
ORDER BY timestamp DESC ) A
Union
Select A.Value, A.TimeStamp
From (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < @endDT
ORDER BY timestamp ASC ) A
The second and third queries in your post don't make much sense because
WHERE timestamp > @startDT
and
WHERE timestamp < @endDT
result in timestamps INSIDE the range, but your descriptions
. . * | * * * * * | * . . .
start end
The graph above shows the records marked * as my required records; |...| marks the boundaries.
means something different.
So following the descriptions and using the following mapping
myTable = Posts
value = score
timestamp = creationdate
I wrote this query on data.stackexchange.com (modified from exodream's answer, but with the comparison operators reversed so they point in the correct direction):
DECLARE @START_DATE datetime
DECLARE @END_DATE datetime
SET @START_DATE = '2010-10-20'
SET @END_DATE = '2010-11-01'
SELECT score,
creationdate
FROM posts
WHERE creationdate BETWEEN (SELECT TOP 1 creationdate
FROM posts
WHERE creationdate < @START_DATE
ORDER BY creationdate DESC)
AND
(SELECT TOP 1 creationdate
FROM posts
WHERE creationdate > @END_DATE
ORDER BY creationdate ASC)
ORDER by creationDate
Which outputs
score creationdate
----- -------------------
4 2010-10-19 23:55:48
3 2010-10-20 2:24:50
6 2010-10-20 2:55:54
...
...
7 2010-10-31 23:14:48
4 2010-10-31 23:18:17
4 2010-10-31 23:18:48
0 2010-11-01 3:59:38
(382 row(s) affected)
Note how the first and last rows are just outside the limits of the range.
You can put those ordered queries into subqueries to get around not being able to UNION them directly. A little annoying, but it'll get you what you want.
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN @startDT AND @endDT
UNION
SELECT value, timestamp
FROM (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > @startDT
ORDER BY timestamp DESC
) x
UNION
SELECT value, timestamp
FROM (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < @endDT
ORDER BY timestamp ASC
) x
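The subquery-wrapped UNION from this answer can be sketched with SQLite in Python (LIMIT 1 plays the role of TOP 1, the boundary comparisons follow the corrected directions from the earlier answer, and the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (value INTEGER, timestamp TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-03"), (3, "2024-01-05"),
     (4, "2024-01-07"), (5, "2024-01-09")],
)

start, end = "2024-01-04", "2024-01-08"
# Rows inside the range, plus the closest row before the start
# and the closest row after the end.
rows = conn.execute("""
    SELECT value, timestamp FROM myTable
    WHERE timestamp BETWEEN ? AND ?
    UNION
    SELECT value, timestamp FROM (
        SELECT value, timestamp FROM myTable
        WHERE timestamp < ? ORDER BY timestamp DESC LIMIT 1)
    UNION
    SELECT value, timestamp FROM (
        SELECT value, timestamp FROM myTable
        WHERE timestamp > ? ORDER BY timestamp ASC LIMIT 1)
    ORDER BY timestamp
""", (start, end, start, end)).fetchall()
print(rows)  # [(2, '2024-01-03'), (3, '2024-01-05'), (4, '2024-01-07'), (5, '2024-01-09')]
```

Wrapping each ordered-and-limited query in a derived table is exactly the trick that lets it participate in the UNION, since a bare SELECT inside a UNION cannot carry its own ORDER BY.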