Finding gaps in server data using presto - sql

I have the data for requests sent to a server with the API endpoint name and the epoch at which the request was sent. There are many API endpoints (in thousands). There was an issue due to which the requests didn't come into the server. I want to find out the list of APIs whose data didn't come in, i.e., I want to find out the gaps in this server data.
I followed some examples in https://blog.jooq.org/how-to-find-the-longest-consecutive-series-of-events-in-sql/. That approach uses window functions and works well for consecutive dates within a single group, but it doesn't extend to multiple groups of epochs, which might not be consecutive. I want to extend this to my use case, where there are multiple APIs, each with its own epochs and gaps. How can I do it in Presto?
Example schema in the table
endpointid (string)
serverepoch (integer)
Example Data
endpointid1, 123
endpointid2, 123
endpointid1, 234
endpointid2, 567
From this data, say I want to find all gaps of 300 seconds or more; then I should get
endpointid2 123
endpointid2 567

You could use the lead and lag functions as follows:
with next_prev_serverepoch_gaps as
(
  select *,
    case
      when lead(serverepoch, 1, serverepoch) over (partition by endpointid order by serverepoch) - serverepoch >= 300
        or serverepoch - lag(serverepoch, 1, serverepoch) over (partition by endpointid order by serverepoch) >= 300
      then 1 else 0
    end as gap_flag
  from table_name
)
select endpointid, serverepoch
from next_prev_serverepoch_gaps
where gap_flag = 1
order by endpointid, serverepoch
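To try the lead/lag idea without a Presto cluster, here is a minimal sketch that runs the same logic on the question's four example rows. It assumes Python's built-in sqlite3 module with an SQLite build that supports window functions (3.25 or later); SQLite is only a stand-in, not Presto. The table name table_name comes from the answer above, and the CTE name neighbours and the columns next_epoch and prev_epoch are made up for this illustration.

```python
import sqlite3

# Example rows from the question: (endpointid, serverepoch).
rows = [
    ("endpointid1", 123),
    ("endpointid2", 123),
    ("endpointid1", 234),
    ("endpointid2", 567),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (endpointid TEXT, serverepoch INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)", rows)

GAP_SECONDS = 300  # threshold from the question

# For every row, look at the next and previous epoch of the same endpoint.
# The third argument of LEAD/LAG is the default used at the partition edges,
# so the first and last row of an endpoint compare against themselves (gap 0).
query = """
WITH neighbours AS (
    SELECT endpointid,
           serverepoch,
           LEAD(serverepoch, 1, serverepoch)
               OVER (PARTITION BY endpointid ORDER BY serverepoch) AS next_epoch,
           LAG(serverepoch, 1, serverepoch)
               OVER (PARTITION BY endpointid ORDER BY serverepoch) AS prev_epoch
    FROM table_name
)
SELECT endpointid, serverepoch
FROM neighbours
WHERE next_epoch - serverepoch >= :gap
   OR serverepoch - prev_epoch >= :gap
ORDER BY endpointid, serverepoch
"""

gap_rows = conn.execute(query, {"gap": GAP_SECONDS}).fetchall()
print(gap_rows)
```

Each returned row is one edge of a gap, so a single gap shows up as two rows: the last epoch before it and the first epoch after it.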

Related

SQL- Sample the 3rd from the beginning and backwards

I have a table with id and score. I want to create a new set of data with a sampling method. The sampling method would be to order the ids in decreasing order of score and sample every 3rd id, starting from the beginning, until we get 10k positive samples. We would like to do the same in the other direction, starting from the end, to get 10k negative samples.
id   score
24   0.55
58   0.43
987  0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, it would be more straightforward to answer this if you had included the database you are using (SQL Server, MySQL, etc.). Different SQL dialects have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so sorting includes
TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER() which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - in SQL Server signified by a % symbol - so, for example
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and take every third value (e.g., where the row number is a multiple of 3)
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
      FROM #DataTable) AS Ranked  -- SQL Server requires an alias on a derived table
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc)
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
      FROM #DataTable) AS Ranked  -- SQL Server requires an alias on a derived table
WHERE rn % 3 = 0
ORDER BY rn;
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
FINAL ANSWER
Take the above 10,000 rows, do the same in the other direction (e.g., to get the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
  (SELECT TOP 10000 id, score
   FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) as rn
         FROM #DataTable
        ) AS RankedScores_down
   WHERE RankedScores_down.rn % 3 = 0
   ORDER BY RankedScores_down.rn
  ),
LowScores AS
  (SELECT TOP 10000 id, score
   FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
         FROM #DataTable
        ) AS RankedScores_up
   WHERE RankedScores_up.rn % 3 = 0
   ORDER BY RankedScores_up.rn
  )
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
Notes
I used 'UNION' rather than 'UNION ALL' because, if there is any overlap (e.g., you have fewer than 60,000 datapoints), we only want to include each sample once.
If you use a different database, you'll need to translate this! This is the benefit of specifying the database you use.
Note that taking every third value (when sorted by score) is not really 'independent' sampling. One would ask why you don't just use all of the top/bottom 30,000 scores. If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to use half the data to check the validity of stats e.g., take half the data, do your model - then check against the other half how good your model is.
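For anyone who wants to check the every-third-row logic on a small data set, here is a rough sketch of the same two-CTE query. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later for ROW_NUMBER) rather than SQL Server, so TOP becomes LIMIT and the temp table #DataTable becomes a plain table called DataTable. The 30 rows and the sample size of 3 per side are made up so the result is easy to verify by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataTable (id INTEGER PRIMARY KEY, score REAL)")
# 30 made-up rows: ids 1..30, where a higher id has a higher score.
conn.executemany(
    "INSERT INTO DataTable VALUES (?, ?)",
    [(i, i / 100) for i in range(1, 31)],
)

SAMPLES_PER_SIDE = 3  # stands in for the 10,000 in the question

query = """
WITH TopScores AS (
    SELECT id, score
    FROM (SELECT id, score,
                 ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) AS rn
          FROM DataTable) AS ranked_down
    WHERE rn % 3 = 0          -- keep every third row, counting from the top
    ORDER BY rn
    LIMIT :n
),
LowScores AS (
    SELECT id, score
    FROM (SELECT id, score,
                 ROW_NUMBER() OVER (ORDER BY score, id) AS rn
          FROM DataTable) AS ranked_up
    WHERE rn % 3 = 0          -- keep every third row, counting from the bottom
    ORDER BY rn
    LIMIT :n
)
SELECT id FROM TopScores
UNION
SELECT id FROM LowScores
ORDER BY id
"""

sampled_ids = [row[0] for row in conn.execute(query, {"n": SAMPLES_PER_SIDE})]
print(sampled_ids)
```

With this data, the low side picks ids 3, 6 and 9 and the high side picks ids 28, 25 and 22, which makes it easy to confirm that the row numbering runs in the intended direction on each side.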

How to check changes in column values?

I need to try to check some device IDs for work. These are values (15 characters, random string of numbers+letters) that mostly remain constant for users. However, every now and then these deviceIDs will change. And I'm trying to detect when they do change. Is there a way to write this kind of a dynamic query with SQL? Say, perhaps with a CASE statement?
user  device           date
1     23127dssds1272d  10-11
1     23127dssds1272d  10-11
1     23127dssds1272d  10-12
1     23127dssds1272d  10-12
1     04623jqdnq3000x  10-12
Count distinct device by id having count > 1?
Consider below approach
select *
from your_table
where true
qualify device != lag(device, 1, '') over(partition by user order by date)
if applied to sample data in your question - output is
As you can see here, the first 'change' (the initial assignment) happened for user=1 on 10-11, and then on 10-12 the device changed.
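If your database has no QUALIFY clause, the same lag comparison can be moved into a subquery. Below is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later). The table name device_log and the columns user_id, day and seq are hypothetical. In particular, seq is an insertion counter I added as a tie-breaker, because the sample data has several rows per date, and with ties the "previous" row is otherwise not well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is a hypothetical insertion counter used only to break ties within a day.
conn.execute(
    "CREATE TABLE device_log ("
    "seq INTEGER PRIMARY KEY, user_id INTEGER, device TEXT, day TEXT)"
)
conn.executemany(
    "INSERT INTO device_log (user_id, device, day) VALUES (?, ?, ?)",
    [
        (1, "23127dssds1272d", "10-11"),
        (1, "23127dssds1272d", "10-11"),
        (1, "23127dssds1272d", "10-12"),
        (1, "23127dssds1272d", "10-12"),
        (1, "04623jqdnq3000x", "10-12"),
    ],
)

# Compare each row's device with the previous row's device for the same user.
# The default '' makes the very first row of a user count as a change.
query = """
SELECT user_id, device, day
FROM (
    SELECT user_id, device, day, seq,
           LAG(device, 1, '')
               OVER (PARTITION BY user_id ORDER BY day, seq) AS prev_device
    FROM device_log
) AS with_prev
WHERE device != prev_device
ORDER BY user_id, day, seq
"""

changes = conn.execute(query).fetchall()
print(changes)
```

If the initial assignment should not count as a change, one option is to drop the default value from LAG, so the first row compares against NULL and is filtered out.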

SQL to return records that do not have a complete set according to a second table

I have two tables. I want to find the erroneous records in the first table based on the fact that they aren't complete set as determined by the second table. eg:
custID service transID
1 20 1
1 20 2
1 50 2
2 49 1
2 138 1
3 80 1
3 140 1
comboID combinations
1 Y00020Y00050
2 Y00049Y00138
3 Y00020Y00049
4 Y00020Y00080Y00140
So in this example I would want a query to return the first row of the first table because it does not have a matching 49 or 50 or (80 and 140), and the last two rows as well (because there is no 20). The second transaction is fine, and the second customer is fine.
I couldn't figure this out with a query, so I wound up writing a program that loads the services per customer and transid into an array, iterates over them, and ensures that there is at least one matching combination record where all the services in the combination are present in the initially loaded array. Even that came off as hamfisted, but it was less of a nightmare than the awkward outer joining of multiple joins I was trying to accomplish with SQL.
Taking a step back, I think I need to restructure the combinations table into something more accommodating, but I still can't think of what the approach would be.
I do not have DB2, so I have tested on Oracle; however, the listagg function should be available there as well. The table service is the first table and comb is the second one. I assume the services within each combinations value are sorted in ascending order, and that every service code is 'Y' followed by the service number zero-padded to five digits.
select service.*
from service
join
(
  select S.custid, S.transid
  from
  (
    -- build codes such as 20 -> 'Y00020' and 138 -> 'Y00138'
    select custid, transid, listagg(concat('Y', lpad(service, 5, '0'))) within group(order by service) as agg
    from service
    group by custid, transid
  ) S
  where not exists
  (
    select *
    from comb
    where S.agg = comb.combinations
  )
) NOT_F on NOT_F.custid = service.custid and NOT_F.transid = service.transid
I dare to say that your database design does not conform to the first normal form since the combinations column is not atomic. Think about it.
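To make the normalization point concrete, here is one possible restructuring, sketched with Python's built-in sqlite3 module rather than DB2 or Oracle. The table comb_item (one row per combination and service) is a hypothetical replacement for the combinations column, and the CTE names trans and complete are made up. The query treats a transaction as fine when at least one combination has all of its services present in that transaction, which is how I read the check described in the question. It also assumes a service appears at most once per transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service (custid INTEGER, service INTEGER, transid INTEGER)")
conn.executemany(
    "INSERT INTO service VALUES (?, ?, ?)",
    [(1, 20, 1), (1, 20, 2), (1, 50, 2), (2, 49, 1),
     (2, 138, 1), (3, 80, 1), (3, 140, 1)],
)

# Hypothetical normalized form of the combinations table:
# 'Y00020Y00050' for comboID 1 becomes the rows (1, 20) and (1, 50), and so on.
conn.execute("CREATE TABLE comb_item (comboid INTEGER, service INTEGER)")
conn.executemany(
    "INSERT INTO comb_item VALUES (?, ?)",
    [(1, 20), (1, 50), (2, 49), (2, 138), (3, 20), (3, 49),
     (4, 20), (4, 80), (4, 140)],
)

query = """
WITH trans AS (
    SELECT DISTINCT custid, transid FROM service
),
complete AS (
    -- Pair every transaction with every combination item, then keep the
    -- (transaction, combination) pairs where no item is missing.
    SELECT t.custid, t.transid
    FROM trans AS t
    CROSS JOIN comb_item AS c
    LEFT JOIN service AS s
           ON s.custid = t.custid
          AND s.transid = t.transid
          AND s.service = c.service
    GROUP BY t.custid, t.transid, c.comboid
    HAVING COUNT(*) = COUNT(s.service)
)
SELECT s.custid, s.service, s.transid
FROM service AS s
WHERE NOT EXISTS (
    SELECT 1 FROM complete AS c
    WHERE c.custid = s.custid AND c.transid = s.transid
)
ORDER BY s.custid, s.transid, s.service
"""

incomplete_rows = conn.execute(query).fetchall()
print(incomplete_rows)
```

The cross join is fine for a sketch but grows with transactions times combination items, so on a large table you would probably want to restrict it to combinations that share at least one service with the transaction.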

SQL Custom Sorting Bandwidth Data

Beloved SO Cronies,
I'm trying to custom sort bandwidth data using ORDER BY or any performance-focused solution likely involving a temp table. I've scoured SO and Google and have only turned up parts of functions that I can use, so I've arrived at posting here as a final stop.
Data (example)
VALUE
---------
10 Kbps
5 Kbps
1 Mbps
10 Mbps
100 Mbps
10 Gbps
1 Gbps
SQL fiddle with the below. Can you hear it playing in the background?
Bandwidth Sorting Start (SQL Fiddle)
select * from Bandwidth
order by (
case
when Value like '%kbps%' then 1
when Value like '%mbps%' then 2
when Value like '%gbps%' then 3
else 4
end)
My thinking is splitting the string Value into a parameter and running a case on the metric type (e.g. Kbps, Mbps) then applying a multiplier to the parameter based on that and presenting that in a temp table that I can sort and return on an int-based sort without showing the column in the results!
Thanks in advance. I tried to post on DBA StackExchange but existing work location presently blocks the login creation there.
Just use the space as a delimiter to extract the number, convert it to an integer, and sort by unit first and by number second:
order by
  (
    case
      when Value like '%Kbps%' then 1
      when Value like '%Mbps%' then 2
      when Value like '%Gbps%' then 3
      else 4
    end
  ),
  CONVERT(INT, SUBSTRING(Value, 0, CHARINDEX(' ', Value)))
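The multiplier idea from the question also works as a single sort key and needs no temp table. Here is a rough sketch, assuming Python's built-in sqlite3 module instead of SQL Server, so substr and instr stand in for SUBSTRING and CHARINDEX. The decimal multipliers (1 Kbps = 1000 bps and so on) are my assumption, as is the format "number, one space, unit" for every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bandwidth (Value TEXT)")
conn.executemany(
    "INSERT INTO Bandwidth VALUES (?)",
    [("10 Kbps",), ("5 Kbps",), ("1 Mbps",), ("10 Mbps",),
     ("100 Mbps",), ("10 Gbps",), ("1 Gbps",)],
)

# Sort key = leading number * unit multiplier, i.e. the value in bits per second.
# The key only appears in ORDER BY, so it is not shown in the result.
query = """
SELECT Value
FROM Bandwidth
ORDER BY CAST(substr(Value, 1, instr(Value, ' ') - 1) AS INTEGER)
         * CASE
               WHEN Value LIKE '%Kbps' THEN 1000
               WHEN Value LIKE '%Mbps' THEN 1000000
               WHEN Value LIKE '%Gbps' THEN 1000000000
               ELSE 1
           END
"""

sorted_values = [row[0] for row in conn.execute(query)]
print(sorted_values)
```

Unlike sorting by unit rank first, this key would also place a value such as "2000 Kbps" after "1 Mbps", which may or may not be what you want.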

Manually specify starting value for Row_Number()

I want to define the start of ROW_NUMBER() as 3258170 instead of 1.
I am using the following SQL query
SELECT ROW_NUMBER() over(order by (select 3258170)) as 'idd'
However, the above query is not working. By not working I mean it executes, but the numbering does not start from 3258170. Can somebody help me?
The reason I want to specify the row number is I am inserting Rows from one table to another. In the first Table the last record's row number is 3258169 and when I insert new records I want them to have the row number from 3258170.
Just add the value to the result of row_number():
select 3258170 - 1 + row_number() over (order by (select NULL)) as idd
The order by clause of row_number() is specifying what column is used for the order by. By specifying a constant there, you are simply saying "everything has the same value for ordering purposes". It has nothing, nothing at all to do with the first value chosen.
To avoid confusion, I replaced the constant value with NULL. In SQL Server, I have observed that this assigns a sequential number without actually sorting the rows -- an observed performance advantage, but not one that I've seen documented, so we can't depend on it.
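A quick way to convince yourself that the offset is plain arithmetic on the result of row_number() is to run it on a few rows. This is a minimal sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or later) rather than SQL Server; the table new_rows and its name column are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_rows (name TEXT)")
conn.executemany("INSERT INTO new_rows VALUES (?)", [("a",), ("b",), ("c",)])

START_AT = 3258170  # first number wanted, from the question

# ROW_NUMBER() always starts at 1, so adding (start - 1) shifts the whole series.
query = """
SELECT :start - 1 + ROW_NUMBER() OVER (ORDER BY name) AS idd, name
FROM new_rows
ORDER BY idd
"""

numbered = conn.execute(query, {"start": START_AT}).fetchall()
print(numbered)
```

Here the window orders by name so the numbering is deterministic; whatever you put in the ORDER BY only decides which row gets which number, not where the numbering starts.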
I feel this is easier
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 3258169 AS FieldAlias (To start from 3258170)
Sometimes....
ROW_NUMBER() may not be the best solution, especially when there could be duplicate records in the underlying data set (for JOIN queries etc.), which may result in more rows being returned than expected. You may consider creating a SEQUENCE instead, which in some cases can be a cleaner solution.
For example:
CREATE SEQUENCE myRowNumberId
START WITH 1
INCREMENT BY 1
GO
SELECT NEXT VALUE FOR myRowNumberId AS 'idd' -- your query
GO
DROP SEQUENCE myRowNumberId; -- just to clean-up after ourselves
GO
The downside is that sequences may be difficult to use in complex queries with DISTINCT, WINDOW functions etc. See the complete sequence documentation here.
I had a situation where I was importing a hierarchical structure into an application where a seq number had to be unique within each hierarchical level and start at 110 (for ease of subsequent manual insertion). The data beforehand looked like this...
Level Prod Type Component Quantity Seq
1 P00210005 R NZ1500 57.90000000 120
1 P00210005 C P00210005M 1.00000000 120
2 P00210005M R M/C Operation 20.00000000 110
2 P00210005M C P00210006 1.00000000 110
2 P00210005M C P00210007 1.00000000 110
I wanted the row_number() function to generate the new sequence numbers, but adding 10 and then multiplying by 10 did not work as expected, because multiplication is evaluated before addition. To force the order of the arithmetic, you have to enclose the addition, including the entire row_number() call with its partition clause, in brackets. Without brackets, only simple addition and subtraction on row_number() behaves the way you would expect.
So, my solution for this problem was
,10*(10+row_number() over (partition by Level order by Type desc, [Seq] asc)) [NewSeq]
Note the position of the brackets to allow the multiplication to occur after the addition.
Level Prod Type Component Quantity [Seq] [NewSeq]
1 P00210005 R NZ1500 57.90000000 120 110
1 P00210005 C P00210005M 1.00000000 120 120
2 P00210005M R M/C Operation 20.00000000 110 110
2 P00210005M C P00210006 1.00000000 110 120
2 P00210005M C P00210007 1.00000000 110 130
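The bracket placement can be checked against the rows above with a small script. This is a sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or later) instead of SQL Server; the table name bom and the lower-case column names are made up, and the Quantity column is left out. I also added component as a final tie-breaker, which is not in the original expression, because the two level-2 'C' rows share the same Type and Seq and would otherwise be numbered in an arbitrary order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bom (level INTEGER, prod TEXT, type TEXT, component TEXT, seq INTEGER)"
)
conn.executemany(
    "INSERT INTO bom VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P00210005", "R", "NZ1500", 120),
        (1, "P00210005", "C", "P00210005M", 120),
        (2, "P00210005M", "R", "M/C Operation", 110),
        (2, "P00210005M", "C", "P00210006", 110),
        (2, "P00210005M", "C", "P00210007", 110),
    ],
)

# The brackets make the addition happen first: 10 * (10 + 1) = 110, then 120, 130...
# Without them, 10 + ROW_NUMBER() * 10 would give 20, 30, 40...
query = """
SELECT level, component,
       10 * (10 + ROW_NUMBER() OVER (
                 PARTITION BY level
                 ORDER BY type DESC, seq, component)) AS new_seq
FROM bom
ORDER BY level, new_seq
"""

new_seq_rows = conn.execute(query).fetchall()
print(new_seq_rows)
```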
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 2862717 AS FieldAlias (To start from 2862718)