An SQL query with OFFSET/FETCH is returning unexpected results

I have an SQL Server 2019 database table named User with 1,000 rows. I am having a hard time understanding how this SELECT query with OFFSET/FETCH returns unexpected results:
SELECT *
FROM [User]
WHERE (([NameGiven] LIKE '%1%')
OR ([NameFamily] LIKE '%2%'))
ORDER BY [Id] ASC
OFFSET 200 ROWS FETCH NEXT 100 ROWS ONLY;
Query results:
The results range from 264 to 452 with a total of 100 rows. Why would the records 201, 211, etc. not show up? Am I wrong in my expectations or is there a mistake in the query criteria?
If I remove the OFFSET/FETCH options from the ORDER BY clause, the results are as expected. That makes me think that the WHERE clause is not the problem.
Any advice would be appreciated.

The problem is that you expect the offset to be applied before the filter, but in actuality it is applied after the filter. Think about a simpler example where you want all the people named 'sam' and there are more people named 'sam' than your offset:
CREATE TABLE dbo.foo(id int, name varchar(32));
INSERT dbo.foo(id, name) VALUES
(1, 'sam'),
(2, 'sam'),
(3, 'bob'),
(4, 'sam'),
(5, 'sam'),
(6, 'sam');
If you just say:
SELECT id FROM dbo.foo WHERE name = 'sam';
You get:
1
2
4
5
6
If you then add an offset of 3,
-- this offsets 3 rows _from the filtered result_,
-- not the full table
SELECT id FROM dbo.foo
WHERE name = 'sam'
ORDER BY id
OFFSET 3 ROWS FETCH NEXT 2 ROWS ONLY;
You get:
5
6
It takes all the rows that match the filter, then skips the first three of those filtered rows (1, 2, 4), not (1, 2, 3) as your question implies you expect.
Example db<>fiddle
Going back to your case in the question, you are filtering out rows like 77 and 89 because they don't contain a 1 or a 2. So the offset you asked for is 200, but in terms of which rows that means, the offset is actually more like:
200 PLUS the number of rows that *don't* match your filter
until you hit the 200th row that *does*
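To see where that effective offset lands, here is a sketch (not part of the original query) that numbers only the filtered rows with ROW_NUMBER and looks at the first row of the requested page:
SELECT [Id], rn
FROM (
    SELECT [Id], ROW_NUMBER() OVER (ORDER BY [Id]) AS rn
    FROM [User]
    WHERE ([NameGiven] LIKE '%1%')
       OR ([NameFamily] LIKE '%2%')
) AS f
WHERE rn = 201; -- the first row after skipping 200 matches; per the question's results this would be Id 264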
You could try to force the filter to happen after, e.g.:
;WITH u AS
(
SELECT *
FROM [User]
ORDER BY [Id]
OFFSET 200 ROWS FETCH NEXT 100 ROWS ONLY
)
SELECT * FROM u
WHERE (([NameGiven] LIKE '%1%')
OR ([NameFamily] LIKE '%2%'))
ORDER BY [Id]; -- yes you still need this one
...but then you would almost certainly never get 100 rows in each page because some of those 100 rows would then be removed by the filter. I don't think this is what you're after.
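If it helps to sanity-check the paging math, a quick sketch: counting the rows that match the filter tells you how many pages of 100 the filtered result set can actually produce.
SELECT COUNT(*) AS matching_rows
FROM [User]
WHERE ([NameGiven] LIKE '%1%')
   OR ([NameFamily] LIKE '%2%');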

Related

How to create a new table that only keeps rows with more than 5 data records under the same id in Bigquery

I have a table like this:
Id | Date       | Steps | Distance
1  | 2016-06-01 | 1000  | 1
There are over 1000 records and 50 Ids in this table; most ids have about 20 records, and some ids only have 1 or 2 records, which I think are useless.
I want to create a table that excludes those ids with less than 5 records.
I wrote this code to find the ids that I want to exclude:
SELECT Id, COUNT(Id) AS num_id
FROM `table`
GROUP BY Id
ORDER BY num_id
Since there are only two ids I need to exclude, I use WHERE clause:
CREATE TABLE `` AS
SELECT
*
FROM ``
WHERE
Id <> 2320127002
AND Id <> 7007744171
Although I can get the result I want, I think there are better ways to solve this kind of problem. For example, if there are over 20 ids with less than 5 records in this table, what shall I do? Thank you.
Consider this:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE TRUE
QUALIFY COUNT(*) OVER (PARTITION BY Id) >= 5
Note: You can remove WHERE TRUE if it runs successfully without it.
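If you prefer not to use QUALIFY, an alternative sketch with a plain GROUP BY/HAVING subquery (against the same assumed table and threshold) does the same filtering:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE Id IN (
    SELECT Id
    FROM `table`
    GROUP BY Id
    HAVING COUNT(*) >= 5
);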

Get distinct information across many fields some of which are NULL

I have a table with just over 65 million rows and 140 columns. The data comes from several sources and is submitted at least every month.
I'm looking for a quick way to grab specific fields from this data only where they are unique. The thing is, I want to process all the information to link which invoice was sent with which identifying numbers, and by whom it was sent. The issue is, I don't want to iterate over 65 million records. If I can get distinct values, then I will only have to process, say, 5 million records as opposed to 65 million. See below for a description of the data and SQL Fiddle for a sample.
- If, say, a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month, I only want one row where this appears, i.e. the 4 fields have got to be unique.
- If they submit the above for 30 months and then on the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row also, since the national_identity field is new and hence the whole row is unique.
- By "linked to" I mean they appear on the same row.
- For all fields, it's possible to have NULL occurring at some point.
- The 'pivot/composite' columns are the invoice_number and submitted_by. If either of those is missing, drop that row.
- I also need to include the database_id with the above data, i.e. the primary_id which is auto-generated by the postgresql database.
- The only fields that don't need to be returned are the other_column and yet_another_column. Remember the table has 140 columns, so I don't need them.
- With the results, create a new table that will hold these unique records.
See this SQL fiddle for an attempt to recreate the scenario.
From that fiddle, I'd expect a result like:
- Rows 1, 2 & 11: only one of them shall be kept, as they are exactly the same. Preferably the row with the smallest id.
- Rows 4 and 9: one of them would be dropped, as they are exactly the same.
- Rows 5, 7 & 8: would be dropped, since they are missing either the invoice_number or submitted_by.
The result would then have Row (1, 2 or 11), 3, (4 or 9), 6 and 10.
To get one representative row (with additional fields) from a group with the four distinct fields:
SELECT DISTINCT ON (
        invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
    )
    * -- specify the columns you want here
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL;
Note that it is unpredictable exactly which row of each group is returned unless you specify an ordering (see the documentation on DISTINCT).
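Since the question prefers the row with the smallest id per group, a sketch: with DISTINCT ON, the ORDER BY must begin with the DISTINCT ON expressions, and a trailing id then decides which row wins within each group:
SELECT DISTINCT ON (
        invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
    )
    * -- specify the columns you want here
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY invoice_number, passport_number, national_id_number,
         driving_license_number, id; -- id last: the lowest id wins per group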
Edit:
To order this result by id, simply adding ORDER BY id at the end doesn't work (with DISTINCT ON, the leading ORDER BY expressions must match the DISTINCT ON expressions), but it can be done by either using a CTE
with distinct_rows as (
SELECT
distinct on (
invoice_number
, passport_number
, national_id_number
, driving_license_number
-- ...
)
* -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
)
select *
from distinct_rows
order by id;
or making the original query a subquery
select *
from (
SELECT
distinct on (
invoice_number
, passport_number
, national_id_number
, driving_license_number
-- ...
)
* -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
) t
order by id;
"quick way to grab specific fields from this data only where they are unique"
I don't think so. I think you mean you want to select a distinct set of rows from a table in which they are not unique.
As far as I can tell from your description, you simply want
SELECT distinct invoice_number, passport_number,
driving_license_number, national_id_number
FROM my_table
where invoice_number is not null
and submitted_by is not null;
In your SQLFiddle example, that produces 5 rows.
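Since the question also asks to store these unique records in a new table, a sketch using CREATE TABLE ... AS (the name unique_invoices is hypothetical):
CREATE TABLE unique_invoices AS
SELECT DISTINCT invoice_number, passport_number,
       driving_license_number, national_id_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL;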

SQL select id=1

I have a table with an id_categoria field holding comma-separated values, e.g. 1,2,3,4,64,31,12,14, because a record can belong to multiple categories. If I want to select records that belong to category 1, I have to run the following SQL query:
SELECT *
FROM cme_notizie
WHERE id_categoria LIKE '1%'
ORDER BY `id` ASC
and then, from that result set, select the records whose id_categoria is exactly 1. Let's assume the value 1 on its own does not exist, but column values like 12, 15, 120 ... still contain the digit 1.
Is there a way to match only the value 1, without also matching these other values that merely contain it?
As comments say, you probably shouldn't do that. Instead, you should have another table with one row per category. But if you decide to go with this inferior solution, you can do the following:
SELECT *
FROM cme_notizie
WHERE CONCAT(',', id_categoria, ',') LIKE '%,1,%'
ORDER BY id ASC
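MySQL also has FIND_IN_SET, which matches a whole element of a comma-separated list; a sketch (it assumes no spaces around the commas, as in your sample value):
SELECT *
FROM cme_notizie
WHERE FIND_IN_SET('1', id_categoria) > 0
ORDER BY id ASC;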

How to update each row of a table with a random row from another table

I'm building my first de-identification script, and running into issues with my approach.
I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data; every row has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.
What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.
To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200. Every time I do SELECT ABS(Checksum(NewId())) % 200 I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
I get VERY intermittent results. I'd say about 30% of the results return one name picked out of the table (this is the expected result), about 30% come back with more than one result (which is baffling, there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column)
I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.
Thoughts?
Why your query in the question returns unexpected results
Your original query selects from Pseudonyms. The server scans through each row of the table, picks the ID from that row, generates a random number, and compares the generated number to that ID.
When, by chance, the generated number for a particular row happens to be the same as the ID of that row, the row is returned in the result set. It is quite possible that the generated number never coincides with any ID, and just as possible that it coincides with an ID several times.
A bit more detailed:
Server picks a row with ID=1.
Generates a random number, say 25. Why not? A decent random number.
Is 1 = 25 ? No => This row is not returned.
Server picks a row with ID=2.
Generates a random number, say 125. Why not? A decent random number.
Is 2 = 125 ? No => This row is not returned.
And so on...
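You can observe this per-row evaluation directly; run this sketch several times and watch the count fluctuate around 1 (0, 1, 2, ...):
SELECT COUNT(*) AS hits
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(CHECKSUM(NEWID())) % 200;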
Here is a complete solution on SQL Fiddle
Sample data
DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);
INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
    'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
    'PseudonymName' + CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
Table Users has 1000 rows with the same UserName for each row. Table Pseudonyms has 200 rows with different PseudonymNames:
SELECT * FROM @VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM @VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
First attempt
At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:
SELECT
    U.ID
    ,U.UserName
    ,CA.PseudonymName
FROM
    @VarUsers AS U
    CROSS APPLY
    (
        SELECT TOP(1)
            P.PseudonymName
        FROM @VarPseudonyms AS P
        ORDER BY CRYPT_GEN_RANDOM(4)
    ) AS CA
;
It turns out that the optimizer is too smart: this produced a random PseudonymName, but the same one for every User, which is not what I expected:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
So, I tweaked this approach a bit and generated a random number for each row in Users first. Then I used the generated number to find the Pseudonym with this ID for each row in Users using CROSS APPLY.
CTE_Users has an extra column with a random number from 1 to 200. In CTE_Joined we pick a row from Pseudonyms for each User.
Finally we UPDATE the original Users table.
Final solution
WITH
CTE_Users
AS
(
    SELECT
        U.ID
        ,U.UserName
        -- scales a random 4-byte value into roughly [1, 201);
        -- the CAST below truncates it to an int in 1..200
        ,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
    FROM @VarUsers AS U
)
,CTE_Joined
AS
(
    SELECT
        CTE_Users.ID
        ,CTE_Users.UserName
        ,CA.PseudonymName
    FROM
        CTE_Users
        CROSS APPLY
        (
            SELECT P.PseudonymName
            FROM @VarPseudonyms AS P
            WHERE P.ID = CAST(CTE_Users.rnd AS int)
        ) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;
Results
SELECT * FROM @VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
SQL Fiddle
A simpler approach:
UPDATE u
SET u.FirstName = p.Name
FROM Users u
CROSS APPLY (
    SELECT TOP(1) ps.Name
    FROM pseudonyms ps
    -- referencing the outer row forces the subquery to be re-evaluated
    -- per row of Users; must be some unique identifier on Users
    WHERE u.Id IS NOT NULL
    ORDER BY NEWID()
) p
Full example from: https://stackoverflow.com/a/36185100/6620329
Update a random User's id into the UpdatedBy column of Table01:
UPDATE a
SET a.UpdatedBy = b.id
FROM [dbo].[Table01] a
CROSS APPLY (
    SELECT
        u.id,
        ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY NEWID()) AS RN
    FROM Users u
    WHERE a.id != u.id -- correlates with the outer row, as above
) b
WHERE RN = 1

MySQL querying with a dynamic range?

Given the table snippet:
id | name | age
I am trying to form a query that will return 10 people within a certain age range. However, if there are not enough people in that range, I want to extend the range until I can find 10 people.
For instance, if I only find 5 people in a range of 30-40, I would find 5 others in a 25-45 range.
In addition, I would like the query to be able use order by RAND() or similar, in order to be able to get different results each time.
Is this going beyond what MySQL can handle? Will I have to put some of this logic in the application instead?
UPDATED for performance:
My original solution worked but required a table scan. Am's solution is a good one and doesn't require a table scan, but its hard-coded ranges won't work when the only matches are far outliers. Plus it requires de-duping records. But combining both solutions can get you the best of both worlds, provided you have an index on age. (If you don't have an index on age, then all solutions will require a table scan.)
The combined solution first picks only the rows which might qualify (the desired range, plus the 10 rows over and 10 rows under that range), and then uses my original logic to rank the results. Caveat: I don't have enough sample data present to verify that MySQL's optimizer is indeed smart enough to avoid a table scan here-- MySQL might not be smart enough to weave those three UNIONs together without a scan.
[just updated again to fix 2 embarrassing SQL typos: DESC where DESC shouldn't have been!]
SELECT * FROM
(
    SELECT id, name, age,
        CASE WHEN age BETWEEN 25 and 35 THEN RAND()
             ELSE ABS(age-30) END as distance
    FROM
    (
        SELECT * FROM (SELECT * FROM Person WHERE age > 35 ORDER BY age LIMIT 10) u1 -- 10 closest above the range
        UNION
        SELECT * FROM (SELECT * FROM Person WHERE age < 25 ORDER BY age DESC LIMIT 10) u2 -- 10 closest below the range
        UNION
        SELECT * FROM (SELECT * FROM Person WHERE age BETWEEN 25 and 35) u3 -- everyone inside the range
    ) p2
    ORDER BY distance
    LIMIT 10
) p ORDER BY RAND() ;
Original Solution:
I'd approach it this way:
- First, compute how far each record is from the center of the desired age range, and order the results by that distance. For all results inside the range, treat the distance as a random number between zero and one. This ensures that records inside the range will be selected in a random order, while records outside the range, if needed, will be selected in order closest to the desired range.
- Trim the number of records in that distance-ordered result set to 10 records.
- Randomize the order of the resulting records.
Like this:
CREATE TABLE Person (id int AUTO_INCREMENT PRIMARY KEY, name varchar(50) NOT NULL, age int NOT NULL);
INSERT INTO Person (name, age) VALUES ("Joe Smith", 26);
INSERT INTO Person (name, age) VALUES ("Frank Johnson", 32);
INSERT INTO Person (name, age) VALUES ("Sue Jones", 24);
INSERT INTO Person (name, age) VALUES ("Ella Frederick", 44);
SELECT * FROM
(
    SELECT id, name, age,
        CASE WHEN age BETWEEN 25 and 35 THEN RAND()
             ELSE ABS(age-30) END as distance
    FROM Person
    ORDER BY distance -- ascending: in-range rows (distance < 1) come first
    LIMIT 10
) p ORDER BY RAND() ;
Note that I'm assuming that, if there are not enough records inside the range, the records you want to append are the ones closest to that range. If this assumption is incorrect, please add more details to the question.
re: performance, this requires a scan through the table, so won't be fast-- I'm working on a scan-less solution now...
I would do something like this:
select * from (
    select * from (select * from ppl_table where age > 30 and age < 40 order by rand() limit 10) as Momo1
    union
    select * from (select * from ppl_table where age > 25 and age < 45 order by rand() limit 20) as Momo2
) as FinalMomo
limit 10
basically selecting 10 users from the first group and then more from the second group. If the first group doesn't add up to 10, more rows will come from the second group.
The reason we are selecting 20 from the second group is that UNION removes duplicate rows, and you want to have at least 10 users in the final result.
Edit
I added the as aliases to the inner SELECTs, and wrapped each ORDER BY in a separate inner SELECT, since MySQL doesn't like ORDER BY directly inside a UNION.
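If you'd rather not rely on UNION de-duplication and on the engine keeping the first group first, a sketch with an explicit priority column (pri is a helper I'm introducing, against the same ppl_table) makes the intent unambiguous by excluding the inner band from the second group:
select id, name, age from (
    select Momo1.*, 0 as pri
    from (select * from ppl_table where age > 30 and age < 40
          order by rand() limit 10) as Momo1
    union all
    select Momo2.*, 1 as pri
    from (select * from ppl_table
          where (age > 25 and age <= 30) or (age >= 40 and age < 45) -- only the widened band
          order by rand() limit 10) as Momo2
) as FinalMomo
order by pri -- in-range rows always win; the widened band only fills the gap
limit 10;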