Higher Query result with the DISTINCT Keyword? - sql

Say I have a table with 100,000 User IDs (UserID is an int).
When I run a query like
SELECT COUNT(Distinct User ID) from tableUserID
the result I get is HIGHER than the result from the following statement:
SELECT COUNT(User ID) from tableUserID
I thought Distinct implied unique, which would mean a lower result. What would cause this discrepancy and how would I identify those user IDs that don't show up in the 2nd query?
Thanks
UPDATE - 11:14 AM EST
Hi All
I sincerely apologize as I should've taken the trouble to reproduce this in my local environment. But I just wanted to see if there was a general consensus about this. Here are the full details:
The query is a result of an inner join between 2 tables.
One has this information:
TABLE ACTIVITY (NO PRIMARY KEY)
UserID int (not Nullable)
JoinDate datetime
Status tinyint
LeaveDate datetime
SentAutoMessage tinyint
SectionDetails varchar
And here is the second table:
TABLE USER_INFO (CLUSTERED PRIMARY KEY)
UserID int (not Nullable)
UserName varchar
UserActive int
CreatedOn datetime
DisabledOn datetime
The tables are joined on UserID and the UserID being selected in the original 2 queries is the one from the TABLE ACTIVITY.
Hope this clarifies the question.

This is not technically an answer, but since I took the time to analyze this, I might as well post it (although I run the risk of being downvoted).
There was no way I could reproduce the described behavior.
This is the scenario:
declare @table table ([user id] int)
insert into @table values
(1),(1),(1),(1),(1),(1),(1),(2),(2),(2),(2),(2),(2),(null),(null)
And here are some queries and their results:
SELECT COUNT(User ID) FROM @table --error: this does not run
SELECT COUNT(distinct User ID) FROM @table --error: this does not run
SELECT COUNT([User ID]) FROM @table --result: 13 (nulls not counted)
SELECT COUNT(distinct [User ID]) FROM @table --result: 2 (nulls not counted)
And something interesting:
SELECT user --result: 'dbo' in my sandbox DB
SELECT count(user) from @table --result: 15 (nulls are counted because the user value is not null)
SELECT count(distinct user) from @table --result: 1 (user is always the same value)
I find it very odd that you are able to run the queries exactly as you described. You'd have to let us know the table structure and the data to get further help.

how would I identify those user IDs that don't show up in the 2nd query
Try this query
SELECT UserID from tableUserID Where UserID not in (SELECT Distinct UserID from tableUserID)
I think this will return no rows.
Edit:
User is a reserved keyword. Do you mean UserID in your queries?
Ray: Yes

I tried to reproduce the problem in my environment, and my conclusion is that, given the conditions you described, the result of the first query cannot be higher than that of the second one. Even if there were NULLs, that just wouldn't happen.
Did you run the query @Jean-Charles suggested?
I'm very intrigued with this, please let us know what turns out to be the problem.


How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are multiple rows with the same userid value and the same dobyr value, one of them is kept (it doesn't matter which one) and the rest get discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with NOT IN, NOT EXISTS, or EXISTS conditions. Also, you can select which combination to keep by adding columns at the end of the ORDER BY.
Updated demo with tests and more rows.
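For example, a NOT EXISTS variant of the first query (just a sketch, assuming the same userinteractions table and no NULL dobyr values) could look like this:
-- keep only users whose rows never disagree on dobyr
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions u
where not exists (
  select 1
  from userinteractions d
  where d.userid = u.userid
    and d.dobyr <> u.dobyr )
order by userid, dobyr;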
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, plain userid values from records with (userid,dobyr) pairs matching your criteria are enough, as they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1

SQL - Select duplicates based on two columns in DB2

I am using DB2 and am trying to count duplicate rows in a table called ML_MEASURE. What I define as a duplicate in this table is a row containing the same DATETIME and TAG_NAME values. So I tried this below:
SELECT
DATETIME,
TAG_NAME,
COUNT(*) AS DUPLICATES
FROM
ML_MEASURE
GROUP BY DATETIME, TAG_NAME
HAVING COUNT(*) > 1
The query doesn't fail, but I get an empty result, even though I know for a fact that I have at least one duplicate. When I tried the query below, I got the correct result for this specific tag_name and datetime:
SELECT
DATETIME,
TAG_NAME,
COUNT(*) AS DUPLICATES
FROM
ML_MEASURE
WHERE
DATETIME='2018-03-23 15:09:30' AND
TAG_NAME='HOG.613KU201'
GROUP BY
DATETIME,
TAG_NAME
The result of the second query looked like this:
DATETIME TAG_NAME DUPLICATES
--------------------- ------------ ----------
2018-03-23 15:09:30.0 HOG.613KU201 3
What am I doing wrong in the first query?
UPDATE: My table is row organized; I'm not sure if that makes any difference.
Yes, you should get the same row back from the first query. If you had a NOT ENFORCED TRUSTED primary key or unique constraint on those two columns, then the optimizer would be within its rights to trust the constraint and return no rows. However, from a quick test, I don't believe it does that for this query.
Do you have any indexes defined on the table?
(P.S. I assume you are not running the query from a shell prompt and redirecting the output to a file of the name 1)
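If it helps narrow this down, a quick sketch using the Db2 LUW catalog views (you may need to add a TABSCHEMA filter for your environment) lists the constraints and indexes defined on the table:
-- show constraints and indexes on ML_MEASURE
SELECT * FROM SYSCAT.TABCONST WHERE TABNAME = 'ML_MEASURE';
SELECT * FROM SYSCAT.INDEXES WHERE TABNAME = 'ML_MEASURE';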
This worked for me:
SELECT * FROM (
SELECT DATETIME, TAG_NAME, COUNT(*) AS DUPLICATES
FROM ML_MEASURE
GROUP BY DATETIME, TAG_NAME
) WHERE DUPLICATES > 1

Finding a match for an entry on a date

I am looking for a query that would find all entries that have a login without a logout.
My data looks like this
Key Date Employee
LOGIN 20171225 111
LOGIN 20171225 111
LOGIN 20171226 111
There should be a record here. I need to catch that.
LOGIN 20171227 111
LOGIN 20171227 111
12345 20171227 222 (There is also a LOT of other random data in the table.)
Select Date, Employee
From [My Table]
Where [Key] = 'LOGIN'
Group by Date, Employee
Order by Employee
I don't know how to filter this to see whether there are one or two logins for that day. I need to see where there's only one, because that indicates they have not logged out. This isn't giving me the correct information.
Thank you.
You might use this
DECLARE @dummyTbl TABLE([Key] VARCHAR(100),[Date] DATE,Employee INT);
INSERT INTO @dummyTbl VALUES
('LOGIN','20171225',111)
,('LOGIN','20171225',111)
,('LOGIN','20171226',111)
,('LOGIN','20171227',111)
,('LOGIN','20171227',111);
SELECT *
FROM @dummyTbl
GROUP BY [Key],[Date],Employee
HAVING COUNT(*)=1
But I wonder why your Key is LOGIN in both cases; why not use LOGOUT?
If your key is always LOGIN, then you really want to look for any ODD number of entries, which you can do by checking that the remainder (%) of division by 2 is not equal to 0.
DECLARE @dummyTbl TABLE([Key] VARCHAR(100),[Date] DATE,Employee INT);
INSERT INTO @dummyTbl VALUES
('LOGIN','20171225',111)
,('LOGIN','20171225',111)
,('LOGIN','20171226',111)
,('LOGIN','20171227',111)
,('LOGIN','20171227',111);
SELECT *
FROM @dummyTbl
GROUP BY [Key],[Date],Employee
HAVING COUNT(*) % 2 <> 0
If you have multiple keys (login/logout) and you are attempting to figure out whether the user is logged out, then it is best to look at the last value for the user for the day; if it is not LOGOUT, then you know they are still logged in.
DECLARE @dummyTbl TABLE(Id INT IDENTITY(1,1), [Key] VARCHAR(100),[Date] DATE,Employee INT);
INSERT INTO @dummyTbl VALUES
('LOGIN','20171225',111)
,('LOGIN','20171225',111)
,('LOGOUT','20171225',111)
,('LOGIN','20171226',111)
,('LOGIN','20171227',111)
,('LOGOUT','20171227',111);
;WITH cteRowNum AS (
SELECT *
,LastDailyActivityRowNum = ROW_NUMBER() OVER (PARTITION BY Date, Employee ORDER BY Id DESC)
FROM
@dummyTbl
)
SELECT *
FROM
cteRowNum
WHERE
LastDailyActivityRowNum = 1
AND [Key] = 'LOGIN'
If you potentially have dirty data (a missing login or logout record), then it gets a bit more complicated and you will have to make some business decisions, but the last-record method is still generally the way to go. When you have employees who can work past midnight without logging out, it gets a bit more complicated too.
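As a rough sketch of one way to handle that (assuming an Id column that reflects event order, as in the example above, and reusing the same @dummyTbl data), you could pair each LOGIN with the next event for the same employee and flag the ones that are not followed by a LOGOUT. Because it does not partition by date, this also tolerates shifts that run past midnight:
-- pair each LOGIN with the employee's next event (LEAD requires SQL Server 2012+)
-- note: a LOGIN immediately followed by another LOGIN is also flagged
;WITH cteNextEvent AS (
    SELECT *
        ,NextKey = LEAD([Key]) OVER (PARTITION BY Employee ORDER BY Id)
    FROM
        @dummyTbl
)
SELECT *
FROM
    cteNextEvent
WHERE
    [Key] = 'LOGIN'
    AND (NextKey IS NULL OR NextKey <> 'LOGOUT')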

SQL query to get records

I don't know how to frame this question - so I'm putting the SQL statement directly here.
declare @tbl table(Id varchar(4),Userid varchar(10),Name varchar(max),Course varchar(max))
insert into @tbl values
('1','UserID1','UserA','Physics'),
('2','UserID2','UserB','Chemistry'),
('3','UserID1','UserA','Chemistry')
Now,
To get a list of users who have taken Chemistry, I would write
select * from @tbl where Course='Chemistry'
Similarly for Physics, I would write
select * from @tbl where Course='Physics'
The problem is when I try to write the query "get a list of students who haven't taken Physics". Without thinking much, I wrote this
select * from @tbl where Course!='Physics'
and it gives the result below, which is wrong (it includes UserA, even though he has registered for Physics):
Id Userid Name Course
2 UserID2 UserB Chemistry
3 UserID1 UserA Chemistry
To fix this, I rewrote the query like this - but somehow I think this is not the right way.
select * from @tbl where Course!='Physics'
and Userid not in (select Userid from @tbl where Course='Physics')
Please help!
Try the following:
SELECT *
FROM @tbl U
WHERE NOT EXISTS (SELECT *
FROM @tbl i
WHERE i.UserId = U.UserId
AND i.Course = 'Physics')
For a full discussion of NOT IN versus EXISTS, see this question. The consensus seems to be that NOT EXISTS is preferable.
(I note that your table definition does not mark the columns as NOT NULL; if it would be appropriate to add that in your scenario, it would be a good idea.)
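As a quick illustration of why the NULL question matters (a hypothetical sketch, not using the original tables): if the subquery used with NOT IN can return a NULL, the predicate yields no rows at all, while NOT EXISTS behaves as expected.
-- NOT IN vs NOT EXISTS when the inner column contains a NULL
DECLARE @a TABLE (x INT);
DECLARE @b TABLE (y INT);
INSERT INTO @a VALUES (1),(2);
INSERT INTO @b VALUES (1),(NULL);
SELECT * FROM @a WHERE x NOT IN (SELECT y FROM @b); -- no rows: x <> NULL evaluates to UNKNOWN
SELECT * FROM @a WHERE NOT EXISTS (SELECT * FROM @b WHERE y = x); -- returns x = 2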
If you want the list of students who haven't taken physics, then I would suggest aggregation with the having clause:
select userId
from @tbl
group by userId
having sum(case when course = 'Physics' then 1 else 0 end) = 0;
This has the obvious advantage of only returning the student ids, and not multiple rows per student (when there are multiple other courses). It is also an example of a "set-within-sets" query, and it is more easily generalized than the WHERE version. On the downside, the NOT EXISTS version may be better able to take advantage of indexes.
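As a sketch of that generalization (a hypothetical extra requirement: students who have taken Chemistry but not Physics), the same pattern just adds another condition:
select userId
from @tbl
group by userId
having sum(case when course = 'Physics' then 1 else 0 end) = 0
and sum(case when course = 'Chemistry' then 1 else 0 end) > 0;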

How can I query rankings for the users in my DB, but only consider the latest entry for each user?

Let's say I have a database table called "Scrape", possibly set up like:
UserID (int)
UserName (varchar)
Wins (int)
Losses (int)
ScrapeDate (datetime)
I'm trying to be able to rank my users based on their Wins/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table.
How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?
Also, do you think it matters that people will be hitting the site while the scrape may still be in the middle of completing?
For example I could have:
1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09
Where, this represents a scrape that has only updated Bob so far, and is in the process of updating Frank but has yet to be inserted. How would you handle this situation as well?
So, my question is:
How would you handle querying only the most recent scrape of each user to determine the rankings
Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet matters? If so, how would you handle this?
Thank you, and thank you for the responses you have given me on my related question:
When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?
This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.
I solve this type of problem using an outer join technique:
SELECT s1.*, s1.wins * 1.0 / NULLIF(s1.losses, 0) AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;
This will return only one row for each username -- the row with the greatest value in the ScrapeDate column. That's what the outer join is for, to try to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.
This should also work when you have a partially-completed scrape in progress.
This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.
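For comparison (not from the original answer, just a sketch assuming the same Scrape table), the same "latest row per username" filter can also be written with a correlated subquery, which is similarly portable across SQL flavors:
-- keep only each username's row with the greatest ScrapeDate
SELECT s1.*, s1.wins * 1.0 / NULLIF(s1.losses, 0) AS win_loss_ratio
FROM Scrape s1
WHERE s1.ScrapeDate = (SELECT MAX(s2.ScrapeDate)
                       FROM Scrape s2
                       WHERE s2.username = s1.username)
ORDER BY win_loss_ratio DESC;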
Try something like this:
Select the user id and the max date of the last entry for each user.
Select and order the records to get a ranking based on the results of the above query.
This should work; however, it depends on your database size.
DECLARE
@last_entries TABLE(id int, dte datetime)
-- insert date (dte) of last entry for each user (id)
INSERT INTO
@last_entries (id, dte)
SELECT
UserID,
MAX(ScrapeDate)
FROM
Scrape WITH (NOLOCK)
GROUP BY
UserID
-- select ranking
SELECT
-- optionally you can use the RANK() OVER () function to get a rank value
UserName,
Wins,
Losses
FROM
@last_entries
JOIN
Scrape WITH (NOLOCK)
ON
UserID = id
AND ScrapeDate = dte
ORDER BY
Wins,
Losses
I have not tested this code, so it might not compile on the first run.
The answer to part one of your question depends on the version of SQL Server you are using - SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and before. I'll update this with more detail if you indicate which platform you're using.
I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise; otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).
To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.
EDIT
The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:
create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)
INSERT #scrape
select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob' ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'
;with latestScrapeCTE
AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY userName
ORDER BY scrapeDate desc
) AS rn
,wins + losses AS totalPlayed
,wins - losses as winDiff
from #scrape
)
SELECT userName
,wins
,losses
,scrapeDate
,winDiff
,totalPlayed
,RANK() OVER (ORDER BY winDiff desc
,totalPlayed desc
) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos
EDIT 2
An illustration of the use of a metadata table to select the latest complete scrape:
create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)
create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)
INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete
INSERT #scrape
select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob' ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1
;with latestScrapeCTE
AS
(
SELECT TOP 1 runID
,startDate
FROM #scrape_run
WHERE completedDate IS NOT NULL
ORDER BY completedDate DESC
)
SELECT userName
,wins
,losses
,startDate AS scrapeDate
,wins - losses AS winDiff
,wins + losses AS totalPlayed
,RANK() OVER (ORDER BY (wins - losses) desc
,(wins + losses) desc
) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON runID = scrapeRunID
ORDER BY rankPos