Why does a SELECT of COUNT fetch a lot of rows? - sql

Table teams contains 1169 rows; 1133 of them have UserId != 0. There is an index on the UserId field.
Query:
EXPLAIN SELECT count(*) FROM teams WHERE UserId != 0
returns output with an estimated number of rows to be examined equal to 1133.
Why does the query need to examine all rows? Shouldn't it just use the index for this purpose?
Thank you.

It will examine almost all rows because you asked for almost all rows (you said UserId != 0). Sure, you then take a COUNT so only one record is shown, but all of them had to be examined in order to count them.
If you were to do
select count(1) from teams where UserId = 100
then it will examine only a few rows, because you are asking for a precise value (UserId = xx as opposed to UserId != yy).
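A minimal sketch to reproduce this, assuming MySQL (the "rows to be examined" estimate suggests it) and a stripped-down version of the table; the Id column and index name are illustrative:
CREATE TABLE teams (
    Id INT PRIMARY KEY,
    UserId INT NOT NULL,
    KEY idx_userid (UserId)
);

-- Matches nearly every index entry: the row estimate stays close to the
-- table size even if the optimizer scans the UserId index instead of the table
EXPLAIN SELECT COUNT(*) FROM teams WHERE UserId != 0;

-- Matches only a handful of entries: a cheap ref lookup on the index
EXPLAIN SELECT COUNT(*) FROM teams WHERE UserId = 100;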

Related

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr rows, limited to the cases where there is only one dobyr value for a given userid in userinteractions.
For example, I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single SQL query?
I am using Postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are multiple rows with the same userid value and also the same dobyr value, one of them is kept (it doesn't matter which one) and the rest get discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions
where userid in (
    select userid
    from userinteractions
    group by userid
    having count(distinct dobyr) = 1)
order by userid, dobyr;
This could also be done with NOT IN, NOT EXISTS or EXISTS conditions. You can also select which row to keep by adding columns at the end of the ORDER BY.
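For instance, a minimal sketch of the NOT EXISTS variant (shown without the CREATE TABLE wrapper):
select distinct on (userid, dobyr) *
from userinteractions u
where not exists (
    -- any row for the same user with a different dobyr disqualifies the user
    select 1
    from userinteractions u2
    where u2.userid = u.userid
      and u2.dobyr <> u.dobyr)
order by userid, dobyr;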
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, then plain userids from records with (userid, dobyr) pairs matching your criteria are enough, as they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr) = 1;
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
    userid,
    MAX(dobyr) AS dobyr
FROM
    userinteractions
GROUP BY
    userid
HAVING
    COUNT(DISTINCT dobyr) = 1

Limiting output of rows based on count of values in another table?

As a base example, I have a query that effectively produces a table with a list of values (ID numbers), each of which is attached to a specific category. As a simplified example, it would produce something like this (but at a much larger scale):
IDS    | Categories
-------+-----------
12345  | type 1
12456  | type 6
77689  | type 3
32456  | type 4
67431  | type 2
13356  | type 2
.....  | .....
Using this table, I want to populate another table that gives me a list of ID numbers, with a limit placed on how many of each category are in that list, cross-referenced against a sort of range-based chart. For instance, if there are 5-15 IDS of type 1 in my first table, I want the new table with the column of IDS to have 3 type 1 IDS in it; if there are 15-30 type 1 IDS in the first table, I want to have 6 type 1 IDS in the new table.
This sort of range based limit would apply to each category, and the IDS would all populate the same column in the new table. The order, or specific IDS that end up in the final table don't matter, as long as the correct number of IDS end up as a part of that final list of ID numbers. This is being used to provide a semi-random sampling of ID numbers based on categories for a sort of QA related process.
If parts of this are unclear, I can do my best to explain more. My initial thought was using a variable for a LIMIT clause, but that isn't possible. I have been trying to sort out how to do this with a CASE statement, but I'm really just not making any headway; I feel like I'm at this sort of paper-thin wall I just can't break through.
You can use two window functions:
COUNT to keep track of the number of ids in each category
ROW_NUMBER to uniquely identify each id within its category
Once you have collected this information, it is sufficient to keep all those rows that satisfy either of the following conditions:
count of rows less than or equal to 15 >> row number less than or equal to 3
count of rows between 16 and 30 >> row number less than or equal to 6
WITH cte AS (
    SELECT IDS,
           Categories,
           ROW_NUMBER() OVER(PARTITION BY Categories ORDER BY IDS) AS rn,
           COUNT(IDS) OVER(PARTITION BY Categories) AS cnt
    FROM tab
)
SELECT *
FROM cte
WHERE (cnt <= 15 AND rn <= 3)
   OR (cnt > 15 AND cnt <= 30 AND rn <= 6)
Note: If you have concerns regarding a specific ordering, you need to fix the ORDER BY clause inside the ROW_NUMBER window function.
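If the range chart has more brackets, a hedged variation is to encode it in its own table rather than hard-coding the boundaries; the ranges table and its lo, hi and max_ids columns below are hypothetical:
WITH cte AS (
    SELECT IDS,
           Categories,
           ROW_NUMBER() OVER(PARTITION BY Categories ORDER BY IDS) AS rn,
           COUNT(IDS) OVER(PARTITION BY Categories) AS cnt
    FROM tab
)
SELECT cte.IDS, cte.Categories
FROM cte
-- pick the bracket whose range contains this category's row count
JOIN ranges r
  ON cte.cnt BETWEEN r.lo AND r.hi
WHERE cte.rn <= r.max_ids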

SQL Server : index for finding latest value which is greater than a passed value

I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EventDate for a set of UserIds (~1000 different ids) where the Score is larger than a specific value and only consider those entries with a specific Version.
SELECT M.*
FROM (VALUES
    (5237),
    -- ………1000 more
    (27054)) C (USER_ID)
CROSS APPLY
    (SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
     FROM MY_HUGE_TABLE M
     WHERE C.USER_ID = M.USER_ID
       AND M.VERSION = 'xxxx-xx-xx'
       AND M.SCORE > 2 -- removing this filter makes the query much faster
     ORDER BY M.EVENT_DATE DESC) M
Once I execute the query, the runtime is poor, presumably due to a missing index on the SCORE column.
If I remove the filter M.SCORE > 2, I get my results ten times faster; nevertheless, the latest scores may then be less than 2.
Could anyone please hint at how to set up an index that would improve my query performance?
Thank you very much in advance
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match. Only the first and third columns match, but they would need to match in order. So only the USER_ID can help, and that probably doesn't do much to filter the data.
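In T-SQL that would be something like the following sketch; the index name is illustrative:
-- covers the equality predicates (USER_ID, VERSION), then the sort
-- (EVENT_DATE DESC), with SCORE available for the residual filter
CREATE NONCLUSTERED INDEX IX_MY_HUGE_TABLE_USER_VERSION_DATE
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);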

ORDER BY or WHERE clause: which is efficient when retrieving a record from a versioned table

This might be a basic SQL question, but I was curious to know the answer.
I need to fetch the top one record from the db. Which query would be more efficient: one with a WHERE clause, or one with ORDER BY?
Example:
Table
Movie
id name isPlaying endDate isDeleted
Above is a versioned table for storing movie records.
If endDate is not null and isDeleted = 1, then the record is old and an updated one already exists in this table.
So to fetch the movie "Gladiator" which is currently playing, I can write a query in two ways:
1.
Select m.isPlaying
From Movie m
where m.name = :name (given)
and m.endDate is null and m.isDeleted = 0
2.
Select TOP 1 m.isPlaying
From Movie m
where m.name = :name (given)
order by m.id desc -- This will always give me the active record (one which is not deleted)
Which query is faster and the correct way to do it?
Update:
id is the only indexed column, and id is the unique key. I am expecting the queries to return only one result.
Update:
Examples:
Movie
id | name      | isPlaying | endDate   | isDeleted
---+-----------+-----------+-----------+----------
3  | Gladiator | 1         | 03/1/2017 | 1
4  | Gladiator | 1         | 03/1/2017 | 1
5  | Gladiator | 0         | null      | 0
I would go with the where clause:
Select m.isPlaying
From Movie m
where m.id = :id and m.endDate is null and m.isDeleted = 0;
This can take advantage of an index on (id, isDeleted, endDate).
Also, the two are not equivalent. The second might return multiple rows when the first returns 1. Or the second might return one row when the first returns none.
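If you go this route, a minimal sketch of that index (the name is illustrative):
CREATE INDEX IX_Movie_Id_IsDeleted_EndDate
    ON Movie (id, isDeleted, endDate);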
The first option might return more than one row. Maybe you know it won't because you know what data you have stored, but the SQL engine doesn't, and that will affect its execution plan.
Considering that you only have one index and it's on the id column, the second query should be faster in theory, since it would do an index scan from the highest id with a predicate for the given name, stopping at the first match.
The first query will do a full table scan while comparing the name, endDate and isDeleted columns, since it won't stop at the first result that matches.
Posting your execution plans for both queries might shed more light.

Fetch only part of your result set at a time?

I am fetching a huge result set of about 5 million rows (with 10-15 columns) with my query. There is no ID column, and one cannot even be created (not my fault), so I cannot partition my data on the basis of an ID and load it in parts. What makes it worse is that this is SQL Server 2000, so most of the convenient SQL coding features might not even be available for this DB. Is there any way I can do something like this:
Select top 10000 column_list from myTable
then, select the next top 10000 column_list from myTable (i.e. rows 10001 to 20000)
and so on...
If you have a useful index, you can grab 10000 rows at a time by tracking the value based on the index.
Suppose the useful index is LastName + FirstName
Select top 10000 column_list from MyTable
order by LastName, FirstName
Then when you get the next 10000 rows, use the query
Select top 10000 column_list from MyTable
where LastName >= PreviousLastname and FirstName > PreviousFirstname
order by LastName, FirstName
The pseudocode above assumes no duplicates on the combination. If you could have duplicates, the easiest method is to add another column (even if not indexed) that makes it unique; you would then need that third column in the ORDER BY clause as well (see the sketch after the revision below).
PreviousLastname is the value from the 10,000th record of the previous query.
ADDED
A useful index in this context is any index that has high cardinality -- mostly distinct values, or at most a minimal number of non-distinct values. An extremely non-useful index would be something like gender (M/F/null).
Since you are using this for data loading, the index selection is not important (ignoring performance considerations) as long as it has high cardinality. Note that the index and the ORDER BY clause must match, or you will put a heavy load on your database.
REVISION -- I saw an obvious mistake in the additional-data where clause
where LastName >= PreviousLastname and FirstName > PreviousFirstname
This should have been
where (LastName > PreviousLastname)
or (LastName = PreviousLastname and FirstName > PreviousFirstname)
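Putting the revision together with the uniquifying third column mentioned above, a minimal sketch; UniqueCol stands for whatever column you add to break ties and is purely illustrative:
Select top 10000 column_list from MyTable
where (LastName > PreviousLastname)
or (LastName = PreviousLastname and FirstName > PreviousFirstname)
-- tie-breaker: resume within an exact (LastName, FirstName) duplicate run
or (LastName = PreviousLastname and FirstName = PreviousFirstname
    and UniqueCol > PreviousUniqueCol)
order by LastName, FirstName, UniqueCol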