Why does tsql Rand function not work in where clause?

I am trying to select a single row at random from a table. I am curious as to why the two statements below don't work:
select LastName from DataGeneratorNameLast where id = (ABS(CHECKSUM(NewId())) % 3)+1
select LastName from DataGeneratorNameLast where id = cast(Ceiling(RAND(convert(varbinary, newid())) *4) as int)
Both statements return, at random, either 1 row, no rows, or multiple rows. For the life of me I can't figure out why. Just adding top 1 to the query only solves the problem of multiple rows - but not of no rows returned.
Yes I could do the same thing by selecting top 1 and ordering by newid(). But the mystery of why this does not work is driving me crazy.
Thoughts on why I get multiple rows back?
Here is the table I am using to select from:
Create Table dbo.DataGeneratorNameLast
(
[Id] [int] IDENTITY(1,1) NOT NULL,
LastName varchar(50) NOT NULL,
)
Go
insert into DataGeneratorNameLast (LastName) values ('SMITH')
insert into DataGeneratorNameLast (LastName) values ('JOHNSON')
insert into DataGeneratorNameLast (LastName) values ('Booger')
insert into DataGeneratorNameLast (LastName) values ('Tiger')

NEWID() is evaluated for every row it is compared against, so each row is tested against a different number. To do what you want, generate the random value into a variable before the select and then reference the variable.
Declare @randId int = (abs(checksum(newid())) % 3) + 1;
select LastName from DataGeneratorNameLast where id = @randId;
As Martin said in the comments, Rand() without a seed argument would behave differently, being evaluated only once per query.
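To see the per-row evaluation for yourself, here is a minimal sketch. It assumes SQL Server 2008 or later (for the VALUES row constructor) and inlines the four sample rows rather than reading the real table; the column alias RandPerRow is just an illustrative name.

```sql
-- NEWID() is called once per row, so each row usually shows a different value.
-- In the original WHERE clause, each row's id was compared against its own value.
SELECT v.Id,
       v.LastName,
       (ABS(CHECKSUM(NEWID())) % 3) + 1 AS RandPerRow
FROM (VALUES (1, 'SMITH'), (2, 'JOHNSON'), (3, 'Booger'), (4, 'Tiger')) AS v(Id, LastName);
```

Any row whose Id happens to equal its own RandPerRow would have passed the filter, which is why the original query returns zero, one, or several rows.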

If the table has at least one row, this query is guaranteed to return exactly one row.
select TOP (1) LastName
from DataGeneratorNameLast
ORDER BY NEWID()
Notice that this solution can be slow if the table has a large number of rows.
About select LastName from DataGeneratorNameLast where id = @Rand: this solution does not guarantee that a row with that id exists. Even an IDENTITY column can contain gaps. If you definitely need one row, do a preliminary check: IF EXISTS (select * from DataGeneratorNameLast where id = @Rand) SELECT ...
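The preliminary check could be sketched roughly as follows. The ELSE branch (falling back to ORDER BY NEWID()) is my own assumption about what to do when the id is missing; it is not part of the suggestion above.

```sql
-- Pick a candidate id once, outside the query, so it is stable for every row.
DECLARE @Rand int = (ABS(CHECKSUM(NEWID())) % 3) + 1;

IF EXISTS (SELECT * FROM dbo.DataGeneratorNameLast WHERE Id = @Rand)
    SELECT LastName FROM dbo.DataGeneratorNameLast WHERE Id = @Rand;
ELSE
    -- Assumed fallback: the id fell into a gap, so pick any existing row at random.
    SELECT TOP (1) LastName FROM dbo.DataGeneratorNameLast ORDER BY NEWID();
```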

I've had a similar issue and fixed it by making the ID a PRIMARY KEY.
NEWID() is computed per-row. Without a primary key, there is no access pattern other than a table scan, and the filter is checked for each row, so a different value is computed for each row, and you get however many rows match.
With the key, a seek is available, so the predicate is computed once and used as a search argument for a seek.

Related

First name should randomly match with other FIRST name

Every first name should be randomly paired with another first name, and when I run the query again the names should be paired differently from the first run.
For example I have 6 records in one table ...
First name column looks like:
JHON
LEE
SAM
HARRY
JIM
KRUK
So I want result like
First name1 First name2
Jhon. Harry
LEE. KRUK
HARRY SAM
The simplest solution is to first randomly sort the records, then calculate the grouping and a sequence number within the group and then finally select out the groups as rows.
You can follow along with the logic in this fiddle: https://dbfiddle.uk/9JlK59w4
DECLARE @Sorted TABLE
(
    Id INT PRIMARY KEY,
    FirstName varchar(30),
    RowNum INT IDENTITY(1,1)
);
-- Insert in random order so that RowNum records the shuffled position
INSERT INTO @Sorted (Id, FirstName)
SELECT Id, FirstName
FROM People
ORDER BY NEWID();
WITH Pairs as
(
    SELECT *
    , (RowNum+1)/2 as PairNum -- rows 1 and 2 form pair 1, rows 3 and 4 form pair 2, ...
    , RowNum % 2 as Ordinal   -- 1 for the first row of a pair, 0 for the second
    FROM @Sorted
)
SELECT
    Person1.FirstName as [First name1], Person2.FirstName as [First name2]
FROM Pairs Person1
LEFT JOIN Pairs Person2 ON Person1.PairNum = Person2.PairNum AND Person2.Ordinal = 1
WHERE Person1.Ordinal = 0
ORDER BY Person1.PairNum
ORDER BY NEWID() is used here to randomly sort the records. Note that it is non-deterministic and will return a new value with each execution. It's not very efficient, but is suitable for our requirement.
You can't easily use CTEs for producing lists of randomly sorted records because the result of a CTE is not cached. Each reference to the CTE in the subsequent logic can cause the expression to be re-evaluated. Run this fiddle a few times and watch how it often allocates the names incorrectly: https://dbfiddle.uk/rpPdkkAG
Due to the volatility of NEWID() this example stores the results in a table valued variable. For a very large list of records a temporary table might be more efficient.
PairNum uses the simple divide by n logic to assign a group number with a length of n
It is necessary to add 1 to the RowNum because the integer math will round down, see this in action in the fiddle.
Ordinal uses the modulo of RowNum and is a value we can use to differentiate between Person 1 and Person 2 in the pair. This helps us keep the rest of the logic deterministic.
In the final SELECT we select first from the Pairs that have an Ordinal of 0, then we join on the Pairs that have an Ordinal of 1 matching by the PairNum
You can see in the fiddle I added a solution using groups of 3 to show how this can be easily extended to larger groupings.
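I can't reproduce the fiddle here, but the same arithmetic generalised to groups of n might look like the sketch below. The names @n, GroupNum and the inlined sample names are my own assumptions, and unlike the query above it returns one row per person rather than one row per group.

```sql
-- Shuffle the names into a table variable; RowNum records the random position.
DECLARE @Sorted TABLE (RowNum int IDENTITY(1,1), FirstName varchar(30));

INSERT INTO @Sorted (FirstName)
SELECT FirstName
FROM (VALUES ('JHON'), ('LEE'), ('SAM'), ('HARRY'), ('JIM'), ('KRUK')) AS People(FirstName)
ORDER BY NEWID();

DECLARE @n int = 3;  -- hypothetical group size

SELECT (RowNum + @n - 1) / @n AS GroupNum,  -- rows 1..n -> group 1, n+1..2n -> group 2, ...
       (RowNum - 1) % @n + 1  AS Ordinal,   -- position 1..n inside the group
       FirstName
FROM @Sorted
ORDER BY GroupNum, Ordinal;
```

With six names and @n = 3 this gives two groups of three; setting @n = 2 gives the same pairing as PairNum above.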

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are many rows with the same userid value but also the same dobyr value, one of them is kept (doesn't matter which one), rest gets discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with a not in, not exists or exists condition. You can also choose which row of each combination to keep by adding columns at the end of the order by.
Updated demo with tests and more rows.
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, plain userid's from records with (userid,dobyr) pairs matching your criteria are enough, as they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
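For comparison, here is a variant of the same idea that compares MIN and MAX instead of counting distinct values. It is only a sketch: the CTE inlines the sample rows from the question so it can be run on its own, and whether it is faster than COUNT(DISTINCT ...) on the real table is something you would have to measure.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
WITH userinteractions(userid, dobyr) AS (
    VALUES (1, 1995), (1, 1995), (2, 1999), (3, 1990), (3, 1999), (4, 1989), (4, 1989)
)
SELECT userid,
       MIN(dobyr) AS dobyr
FROM userinteractions
GROUP BY userid
HAVING MIN(dobyr) = MAX(dobyr)   -- all non-null dobyr values in the group are identical
ORDER BY userid;
-- returns (1, 1995), (2, 1999), (4, 1989)
```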

Finding Duplicates: GROUP BY and DISTINCT giving different answers

I have looked through all the questions about group by and distinct and they seem to be different in terms of allowing aggregate functions, but none of them answered my question... so here goes..
I have a database table with 126226 rows of data, each complete row should be unique, but I'm not using row numbers.
I'm trying to find all the duplicate values in this table (as I know they exist) and then delete them. None of the columns are aggregates.
Table:
CREATE TABLE [dbo].[DBAScanResults](
[ScanNumber] [float] NOT NULL,
[DB_ID] [bigint] NOT NULL,
[PluginID] [bigint] NOT NULL,
[PluginID_Version] [bigint] NOT NULL,
[Result] [nvarchar](50) NULL,
[ActualValue] [nvarchar](max) NULL
)
I've got foreign keys on: ScanNumber, DB_ID, PluginID_Version. Each related primary key is on a different table. (So my database is four tables currently)
If I do a group by, it gives me 12745 rows, which are my duplicate rows:
Select top 1000000 [ScanNumber]
,[DB_ID]
,[PluginID]
,[PluginID_Version]
,[Result]
,[ActualValue]
FROM [ITSecMaster].[dbo].[DBAScanResultsNew]
group by [ScanNumber]
,[DB_ID]
,[PluginID]
,[PluginID_Version]
,[Result]
,[ActualValue]
HAVING COUNT(*) >1
If I do a distinct ( Select distinct * from [dbo].[DBAScanResults]) it gives me 78,871 rows, which I am guessing is my unique count of rows without duplicates. My issue here is that 12745+78871 does not equal 126226 ...
So which one is actually right? Do I have 12745 duplicates, or 47,355 duplicates?
And Once I've worked out which is right, I then need to delete the duplicate values from the table ... Normally I'd do this to delete values with a fk, but I can't get the syntax right for multiple fks across 2+ tables.
DELETE a
FROM DBAScanResults a
INNER JOIN DBAScanDate b
ON a.ScanNumber = b.ScanNumber
WHERE (expression)
Any help with this would be greatly appreciated.
Thanks in advance!
Your counting logic is off, and mine was too, until I came up with a simple example to better understand your question. Imagine a simple table with only one column, text:
text
----
A
B
B
C
C
C
Running SELECT COUNT(*) just yields 6 records, as expected. SELECT DISTINCT text returns 3 records, for A, B, C. Finally, SELECT text with GROUP BY text HAVING COUNT(*) > 1 returns only two records, for the B and C groups.
None of these numbers add up at all. The issue here is that a distinct select also returns records which are not duplicate, in addition to records which are duplicate. Also, a given duplicate record could occur more than two times. Your current comparison is somewhat apples to oranges.
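The three counts from the toy example can be reproduced with a small script. This is a sketch using a table variable and variable names I made up; the last column is the figure that actually matters for a clean-up.

```sql
DECLARE @t TABLE ([text] char(1));
INSERT INTO @t ([text]) VALUES ('A'), ('B'), ('B'), ('C'), ('C'), ('C');

DECLARE @totalRows    int = (SELECT COUNT(*) FROM @t);
DECLARE @distinctRows int = (SELECT COUNT(DISTINCT [text]) FROM @t);
DECLARE @dupValues    int = (SELECT COUNT(*)
                             FROM (SELECT [text] FROM @t
                                   GROUP BY [text]
                                   HAVING COUNT(*) > 1) AS g);

SELECT @totalRows                 AS TotalRows,        -- 6
       @distinctRows              AS DistinctRows,     -- 3
       @dupValues                 AS DuplicatedValues, -- 2 (B and C)
       @totalRows - @distinctRows AS SurplusRows;      -- 3 (one extra B, two extra C)
```

Read this way, the GROUP BY ... HAVING count in the question is the number of distinct rows that occur more than once, while total minus distinct (126226 - 78871 = 47355) is the number of surplus copies that a de-duplication would remove.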
Edit:
If you want to remove all duplicates in your six-column table, leaving only one distinct record from all columns, then try using a deletable CTE:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ScanNumber, DB_ID, PluginID,
PluginID_Version, Result, ActualValue
ORDER BY (SELECT NULL)) rn
FROM DBAScanResults
)
DELETE
FROM cte
WHERE rn > 1;
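If you want to try the deletable CTE on throwaway data first, a cut-down sketch could look like this. The temp table #Demo and its two columns are stand-ins I invented; the real table would list all six columns in the PARTITION BY.

```sql
CREATE TABLE #Demo (ScanNumber float NOT NULL, Result nvarchar(50) NULL);
INSERT INTO #Demo (ScanNumber, Result)
VALUES (1, N'A'), (2, N'B'), (2, N'B'), (3, N'C'), (3, N'C'), (3, N'C');

-- Number the copies of each identical row, then delete everything after the first copy.
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ScanNumber, Result
                                 ORDER BY (SELECT NULL)) AS rn
    FROM #Demo
)
DELETE FROM cte
WHERE rn > 1;

SELECT ScanNumber, Result FROM #Demo ORDER BY ScanNumber;  -- three rows remain

DROP TABLE #Demo;
```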

Split one large, denormalized table into a normalized database

I have a large (5 million row, 300+ column) csv file I need to import into a staging table in SQL Server, then run a script to split each row up and insert data into the relevant tables in a normalized db. The format of the source table looks something like this:
(fName, lName, licenseNumber1, licenseIssuer1, licenseNumber2, licenseIssuer2..., specialtyName1, specialtyState1, specialtyName2, specialtyState2..., identifier1, identifier2...)
There are 50 licenseNumber/licenseIssuer columns, 15 specialtyName/specialtyState columns, and 15 identifier columns. There is always at least one of each of those, but the remaining 49 or 14 could be null. The first identifier is unique, but is not used as the primary key of the Person in our schema.
My database schema looks like this
People(ID int Identity(1,1))
Names(ID int, personID int, lName varchar, fName varchar)
Licenses(ID int, personID int, number varchar, issuer varchar)
Specialties(ID int, personID int, name varchar, state varchar)
Identifiers(ID int, personID int, value)
The database will already be populated with some People before adding the new ones from the csv.
What is the best way to approach this?
I have tried iterating over the staging table one row at a time with select top 1:
WHILE EXISTS (Select top 1 * from staging)
BEGIN
INSERT INTO People Default Values
SET @LastInsertedID = SCOPE_IDENTITY() -- might use the output clause to get this instead
INSERT INTO Names (personID, lName, fName)
SELECT top 1 @LastInsertedID, lName, fName from staging
INSERT INTO Licenses(personID, number, issuer)
SELECT top 1 @LastInsertedID, licenseNumber1, licenseIssuer1 from staging
IF (select top 1 licenseNumber2 from staging) is not null
BEGIN
INSERT INTO Licenses(personID, number, issuer)
SELECT top 1 @LastInsertedID, licenseNumber2, licenseIssuer2 from staging
END
-- Repeat the above 49 times, etc...
DELETE top (1) from staging
END
One problem with this approach is that it is prohibitively slow, so I refactored it to use a cursor. This works and is significantly faster, but has me declaring 300+ variables for Fetch INTO.
Is there a set-based approach that would work here? That would be preferable, as I understand that cursors are frowned upon, but I'm not sure how to get the identity from the INSERT into the People table for use as a foreign key in the others without going row-by-row from the staging table.
Also, how could I avoid copy and pasting the insert into the Licenses table? With a cursor approach I could try:
FETCH INTO ...@LicenseNumber1, @LicenseIssuer1, @LicenseNumber2, @LicenseIssuer2...
INSERT INTO #LicenseTemp (number, issuer) Values
(@LicenseNumber1, @LicenseIssuer1),
(@LicenseNumber2, @LicenseIssuer2),
... Repeat 48 more times...
.
.
.
INSERT INTO Licenses(personID, number, issuer)
SELECT @LastInsertedID, number, issuer
FROM #LicenseTEMP
WHERE number is not null
There still seems to be some redundant copy and pasting there, though.
To summarize the questions, I'm looking for idiomatic approaches to:
Break up one large staging table into a set of normalized tables, retrieving the Primary Key/identity from one table and using it as the foreign key in the others
Insert multiple rows into the normalized tables that come from many repeated columns in the staging table with less boilerplate/copy and paste (Licenses and Specialties above)
Short of discrete answers, I'd also be very happy with pointers towards resources and references that could assist me in figuring this out.
Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.
Calculate the personId on the staging table
As @Shnugo suggested before me, calculating the personId in the staging table will ease the next steps
Use a sequence for the personID
From SQL Server 2012 you can define sequences. If you use it for every person insert, you'll never risk an overlapping of IDs. If you have (as it seems) personId that were loaded before the sequence you can create the sequence with the first free personID as starting value
Create a numbers table
Create a utility table holding the numbers from 1 to n (you need n to be at least 50; you can look at this question for some implementations)
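One possible way to build such a table is sketched below. It assumes the table is called numbers with a single column n (the names used in the queries further down) and borrows rows from the system view sys.all_objects purely as a row source.

```sql
CREATE TABLE dbo.numbers (n int NOT NULL PRIMARY KEY);

-- sys.all_objects is used only because it reliably holds more than 50 rows;
-- ROW_NUMBER() turns those rows into the sequence 1..50.
INSERT INTO dbo.numbers (n)
SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;
```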
Use set logic to do the insert
I'd avoid cursors and row-by-row logic: you are right that it is better to limit the number of accesses to the table, and I'd say you should strive to limit it to one access per target table.
You could proceed like this:
People:
INSERT INTO People (personID)
SELECT personId from staging;
Names:
INSERT INTO Names (personID, lName, fName)
SELECT personId, lName, fName from staging;
Licenses:
here we'll need the Number table
INSERT INTO Licenses (personId, number, issuer)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then licenseNumber1
when 2 then licenseNumber2
...
when 50 then licenseNumber50
end as licenseNumber,
case nbrs.n
when 1 then licenseIssuer1
when 2 then licenseIssuer2
...
when 50 then licenseIssuer50
end as licenseIssuer
from staging
cross join
(select n from numbers where n>=1 and n<=50) nbrs
) AS unpivoted WHERE licenseNumber is not null;
Specialties:
INSERT INTO Specialties(personId, name, state)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then specialtyName1
when 2 then specialtyName2
...
when 15 then specialtyName15
end as specialtyName,
case nbrs.n
when 1 then specialtyState1
when 2 then specialtyState2
...
when 15 then specialtyState15
end as specialtyState
from staging
cross join
(select n from numbers where n>=1 and n<=15) nbrs
) AS unpivoted WHERE specialtyName is not null;
Identifiers:
INSERT INTO Identifiers(personId, value)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then identifier1
when 2 then identifier2
...
when 15 then identifier15
end as value
from staging
cross join
(select n from numbers where n>=1 and n<=15) nbrs
) AS unpivoted WHERE value is not null;
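As an alternative to the numbers table plus CASE, SQL Server 2008 and later can unpivot repeated column pairs with CROSS APPLY over a VALUES list. The following is only a sketch on a cut-down staging table variable with two license pairs (my own sample data); the real statement would list all 50 pairs.

```sql
DECLARE @staging TABLE (
    personId       int,
    licenseNumber1 varchar(20), licenseIssuer1 varchar(20),
    licenseNumber2 varchar(20), licenseIssuer2 varchar(20)
);
INSERT INTO @staging
VALUES (1, 'A-1', 'NY', 'A-2', 'NJ'),
       (2, 'B-1', 'CA', NULL,  NULL);

-- Each staging row is expanded into one row per (number, issuer) pair,
-- and the empty pairs are filtered out.
SELECT s.personId, v.number, v.issuer
FROM @staging AS s
CROSS APPLY (VALUES
    (s.licenseNumber1, s.licenseIssuer1),
    (s.licenseNumber2, s.licenseIssuer2)
    -- ... one line per remaining column pair
) AS v(number, issuer)
WHERE v.number IS NOT NULL;
-- three rows: two licenses for person 1, one for person 2
```

The staging table is still read only once, and number and issuer stay on the same row without needing two parallel CASE expressions.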
Hope it helps.
You say: but the staging table could be modified
I would
add a PersonID INT NOT NULL column and fill it with DENSE_RANK() OVER(ORDER BY fname,lname)
add an index to this PersonID
use this ID in combination with GROUP BY to fill your People table
do the same with your names table
And then use this ID for a set-based insert into your three side tables
Do it like this
SELECT AllTogether.PersonID, AllTogether.TheValue
FROM
(
SELECT PersonID,SomeValue1 AS TheValue FROM StagingTable
UNION ALL SELECT PersonID,SomeValue2 FROM StagingTable
UNION ALL ...
) AS AllTogether
WHERE AllTogether.TheValue IS NOT NULL
UPDATE
You say: might cause a conflict with IDs that already exist in the People table
You did not tell anything about existing People...
Is there any sure and unique mark to identify them? Use a simple
UPDATE StagingTable SET PersonID=xyz WHERE ...
to set existing PersonIDs into your staging table and then use something like
UPDATE StagingTable
SET PersonID=DENSE_RANK() OVER(...) + MaxExistingID
WHERE PersonID IS NULL
to set new IDs for PersonIDs still being NULL.
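Note that SQL Server does not accept a window function directly in the SET clause, so the "something like" above needs a small detour. One way, sketched here on made-up table variables (@People, @Staging and the sample names are assumptions), is to compute the rank in a CTE and update through it:

```sql
DECLARE @People TABLE (ID int);
INSERT INTO @People (ID) VALUES (1), (2), (3);

DECLARE @Staging TABLE (PersonID int NULL, fname varchar(20), lname varchar(20));
INSERT INTO @Staging (fname, lname)
VALUES ('Ann', 'Lee'), ('Ann', 'Lee'), ('Bob', 'Ray');

DECLARE @MaxExistingID int = (SELECT ISNULL(MAX(ID), 0) FROM @People);

-- The CTE exposes the rank as an ordinary column, which the UPDATE can then use.
WITH ranked AS (
    SELECT PersonID,
           DENSE_RANK() OVER (ORDER BY fname, lname) AS rnk
    FROM @Staging
    WHERE PersonID IS NULL
)
UPDATE ranked
SET PersonID = rnk + @MaxExistingID;

SELECT PersonID, fname, lname FROM @Staging;  -- both 'Ann Lee' rows get 4, 'Bob Ray' gets 5
```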

How to insert values from column A of table X to column B of table Y - and order them randomly

I need to collect the values from the column "EmployeeID" of the table "Employees" and insert them into the column "EmployeeID" of the table "Incident".
At the end, the Values in the rows of the column "EmployeeID" should be arranged randomly.
More precisely;
I created 10 employees with their ID's, counting from 1 up to 10.
Those Employees, in fact the ID's, should receive random Incidents to work on.
So ... there are 10 ID's to spread on all Incidents - which might be 1000s.
How do I do this?
It's just for personal exercise on the local machine.
I googled, but didn't find an explicit answer to my problem.
Should be simple to solve for you champs. :)
May anyone help me, please?
NOTES:
1) I've already created a column called "EmployeeID" in the table "Incident", therefore I'll need an update statement, won't I?
2) Schema:
[dbo].[EmployeeType]
[dbo].[Company]
[dbo].[Division]
[dbo].[Team]
[dbo].[sysdiagrams]
[dbo].[Incident]
[dbo].[Employees]
3) 1. Pre-solution:
CREATE TABLE IncidentToEmployee
(
IncidentToEmployeeID BIGINT IDENTITY(1,1) NOT NULL,
EmployeeID BIGINT NULL,
Incident FLOAT NULL,
PRIMARY KEY CLUSTERED (IncidentToEmployeeID)
)
INSERT INTO IncidentToEmployee
SELECT
EmployeeID,
Incident
FROM dbo.Employees,
dbo.Incident
ORDER BY NEWID()
SELECT * FROM IncidentToEmployee
GO
3) 2. Output by INNER JOIN ON
In case you are wondering about the "Alias" column;
Nobody really knows which persons are behind the ID's - that's why I used an Alias column.
SELECT Employees.Alias,
IncidentToEmployee.Incident
FROM Employees
INNER JOIN
IncidentToEmployee ON
Employees.EmployeeID = IncidentToEmployee.EmployeeID
ORDER BY Alias
4) Final Solution
As I mentioned, I added at first a column called "EmployeeID" already to my "Incident" table. That's why I couldn't use an INSERT INTO statement at first and had to use an UPDATE statement. I found the most suitable solution now - without creating a new table as I did as a pre-solution.
Take a look at the following code:
ALTER Table Incident
ADD EmployeeID BIGINT NULL
UPDATE Incident
SET Incident.EmployeeID = EmployeeID
FROM Incident INNER JOIN Employees
ON Incident = EmployeeID
SELECT
EmployeeID,
Incident
FROM dbo.Employees,
dbo.Incident
ORDER BY NEWID()
Thank you all for your help. It took way longer to find a solution than I thought it would, but I finally made it. Thanks!
UPDATE
I think you need to allocate different tasks to different users. A better approach is to create a new table, say EmployeeIncidents, with columns Id (primary key), EmployeeID and IncidentID.
You can then insert random EmployeeID and IncidentID combinations into the new table, which also lets you keep a record of the assignments.
Updating the Incident table would not be a smart choice.
INSERT INTO EmployeeIncidents
SELECT TOP ( 10 )
EmployeeID ,
IncidentID
FROM dbo.Employees,
dbo.Incident
ORDER BY NEWID()
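If you do decide to update the Incident table in place after all, a per-row random assignment could be sketched like this. It relies on the statement in the question that the employee IDs run from 1 to 10 without gaps; the table variable @Incident and its columns are stand-ins for the real table.

```sql
DECLARE @Incident TABLE (IncidentID int IDENTITY(1,1), EmployeeID bigint NULL);

-- Create 1000 dummy incidents; the cross join is only a convenient row source.
INSERT INTO @Incident (EmployeeID)
SELECT TOP (1000) NULL
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- NEWID() is evaluated once per updated row, so every incident draws its own employee.
UPDATE @Incident
SET EmployeeID = ABS(CHECKSUM(NEWID())) % 10 + 1;

SELECT EmployeeID, COUNT(*) AS Incidents
FROM @Incident
GROUP BY EmployeeID
ORDER BY EmployeeID;
```

The final SELECT is just a sanity check that the incidents are spread roughly evenly across the ten employees.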
Written by hand, so may need to tweak syntax, but something like this should do it. The Rand() function will give the same value unless seeded, so you can seed it with something like the date to get randomness.
Insert Into Incidents
Select Top 10
EmployeeID
From Employees
Order By
Rand(GetDate())