Fetch only part of your result set at a time? - sql

I am fetching a huge result set of about 5 million rows (with 10-15 columns) with my query. There is no ID column and one cannot be created (not my fault), so I cannot partition the data by ID and load it in parts. What makes it worse is that this is SQL Server 2000, so most of the convenient SQL features may not be available for this DB. Is there any way I can do something like this:
Select top 10000 column_list from myTable
then, select next top 10000 column_list from myTable (ie 10001 to 20000)
and so on...

If you have a useful index, you can grab 10000 rows at a time by tracking the value based on the index.
Suppose the useful index is LastName + FirstName
Select top 10000 column_list from MyTable
order by LastName, FirstName
Then when you get the next 10000 rows, use the query
Select top 10000 column_list from MyTable
where LastName >= PreviousLastname && FirstName > PreviousFirstname
order by LastName, FirstName
The pseudocode above assumes no duplicates on the combination. If you could have duplicates, the easiest method is to add another column (even if not indexed) that makes the combination unique. You would need that third column in the ORDER BY clause as well.
PreviousLastname and PreviousFirstname are the values from the 10,000th record of the previous query.
ADDED
A useful index in this context is any index that has a high cardinality -- mostly distinct values, or at most a small number of non-distinct values. An extremely non-useful index would be something like gender (M/F/null).
Since you are using this for data loading, the choice of index is not important (ignoring performance considerations) as long as it has a high cardinality. Note that the index and the ORDER BY clause must match or you will put a heavy load on your database.
REVISION -- I saw an obvious mistake in the WHERE clause used for the subsequent batches
where LastName >= PreviousLastname && FirstName > PreviousFirstname
This should have been
where (LastName > PreviousLastname)
or (LastName = PreviousLastname AND FirstName > PreviousFirstname)
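As a rough illustration of the batching idea with the corrected WHERE clause, here is a minimal Python sketch. It assumes SQLite (via the standard sqlite3 module) in place of SQL Server 2000, so LIMIT stands in for TOP, and the table contents, the index name idx_name and the helper fetch_in_batches are made up for the demo. It also assumes the (LastName, FirstName) combination is unique.

```python
import sqlite3

def fetch_in_batches(conn, batch_size):
    """Yield batches ordered by (LastName, FirstName), resuming after the last row seen."""
    last = None  # (LastName, FirstName) of the final row in the previous batch
    while True:
        if last is None:
            rows = conn.execute(
                "SELECT LastName, FirstName FROM MyTable "
                "ORDER BY LastName, FirstName LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            # Corrected predicate: strictly after the previous (LastName, FirstName)
            rows = conn.execute(
                "SELECT LastName, FirstName FROM MyTable "
                "WHERE LastName > ? OR (LastName = ? AND FirstName > ?) "
                "ORDER BY LastName, FirstName LIMIT ?",
                (last[0], last[0], last[1], batch_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last = rows[-1]

# Hypothetical demo data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (LastName TEXT, FirstName TEXT)")
conn.execute("CREATE INDEX idx_name ON MyTable (LastName, FirstName)")
people = [("Smith", "Ann"), ("Smith", "Bob"), ("Adams", "Zoe"),
          ("Jones", "Al"), ("Adams", "Amy")]
conn.executemany("INSERT INTO MyTable VALUES (?, ?)", people)

batches = list(fetch_in_batches(conn, 2))
all_rows = [row for batch in batches for row in batch]
```

With the original predicate (LastName >= previous AND FirstName > previous), a batch ending at (Adams, Zoe) would be followed by nothing at all, because rows such as (Jones, Al) fail the FirstName comparison.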

Related

First name should randomly match with other FIRST name

Every first name should be randomly paired with another first name, and when I run the query again the names should be paired differently, not the same way as the first time.
For example I have 6 records in one table ...
First name column looks like:
JHON
LEE
SAM
HARRY
JIM
KRUK
So I want result like
First name1 First name2
Jhon. Harry
LEE. KRUK
HARRY SAM
The simplest solution is to first randomly sort the records, then calculate the grouping and a sequence number within the group and then finally select out the groups as rows.
You can follow along with the logic in this fiddle: https://dbfiddle.uk/9JlK59w4
DECLARE @Sorted TABLE
(
    Id INT PRIMARY KEY,
    FirstName varchar(30),
    RowNum INT IDENTITY(1,1)
);
-- Store the rows in random order; RowNum records that order
INSERT INTO @Sorted (Id, FirstName)
SELECT Id, FirstName
FROM People
ORDER BY NEWID();
WITH Pairs as
(
    SELECT *
    , (RowNum+1)/2 as PairNum -- 1, 1, 2, 2, 3, 3, ...
    , RowNum % 2 as Ordinal -- 1, 0, 1, 0, ...
    FROM @Sorted
)
SELECT
Person1.FirstName as [First name1], Person2.FirstName as [First name2]
FROM Pairs Person1
LEFT JOIN Pairs Person2 ON Person1.PairNum = Person2.PairNum AND Person2.Ordinal = 1
WHERE Person1.Ordinal = 0
ORDER BY Person1.PairNum
ORDER BY NEWID() is used here to randomly sort the records. Note that it is non-deterministic and will return a new value with each execution. It's not very efficient, but it is suitable for our requirement.
You can't easily use CTEs for producing lists of randomly sorted records because the result of a CTE is not cached. Each reference to the CTE in the subsequent logic can cause the expression to be re-evaluated. Run this fiddle a few times and watch how it often allocates the names incorrectly: https://dbfiddle.uk/rpPdkkAG
Due to the volatility of NEWID() this example stores the results in a table variable. For a very large list of records a temporary table might be more efficient.
PairNum uses simple divide-by-n logic to assign a group number to groups of length n.
It is necessary to add 1 to RowNum because integer division rounds down; see this in action in the fiddle.
Ordinal uses the modulo of RowNum and is a value we can use to differentiate between Person 1 and Person 2 in the pair. This helps us keep the rest of the logic deterministic.
In the final SELECT we first select the Pairs rows that have an Ordinal of 0, then join the Pairs rows that have an Ordinal of 1, matching on PairNum.
You can see in the fiddle I added a solution using groups of 3 to show how this can be easily extended to larger groupings.
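The PairNum and Ordinal arithmetic described above can be tried outside the database. The following Python sketch is only an illustration of that logic, not the fiddle's code: random.shuffle plays the role of ORDER BY NEWID(), and the helper name random_pairs is made up. Unlike the SQL, which drops an unpaired last row, this sketch keeps it with None as the partner.

```python
import random

def random_pairs(names):
    """Shuffle the names, then group consecutive rows into pairs."""
    shuffled = list(names)
    random.shuffle(shuffled)  # stands in for ORDER BY NEWID()
    pairs = {}
    for row_num, name in enumerate(shuffled, start=1):
        pair_num = (row_num + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        ordinal = row_num % 2          # 1, 0, 1, 0, ...
        pairs.setdefault(pair_num, {})[ordinal] = name
    # One output row per pair: the Ordinal 0 member joined to the Ordinal 1 member
    return [(p.get(0), p.get(1)) for _, p in sorted(pairs.items())]

names = ["JHON", "LEE", "SAM", "HARRY", "JIM", "KRUK"]
result = random_pairs(names)
```

Because the shuffle happens once and the result is stored before the pairing step, every name lands in exactly one pair, which is the same reason the answer stores the random order in a table variable.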

Order by or where clause: which is efficient when retrieving a record from versioned table

This might be a basic SQL question, but I am curious to know the answer.
I need to fetch the top one record from the db. Which query would be more efficient, one with a where clause or one with order by?
Example:
Table
Movie
id name isPlaying endDate isDeleted
Above is a versioned table for storing records for movie.
If the endDate is not null and isDeleted = 1, then the record is old and an updated one already exists in this table.
So to fetch the movie "Gladiator" which is currently playing, I can write a query in two ways:
1.
Select m.isPlaying
From Movie m
where m.name=:name (given)
and m.endDate is null and m.isDeleted=0
2.
Select TOP 1 m.isPlaying
From Movie m
where m.name=:name (given)
order by m.id desc -- This will always give me the active record (one which is not deleted)
Which query is faster and the correct way to do it?
Update:
id is the only indexed column and id is the unique key. I am expecting the queries to return me only one result.
Update:
Examples:
Movie
id  name       isPlaying  EndDate    isDeleted
3   Gladiator  1          03/1/2017  1
4   Gladiator  1          03/1/2017  1
5   Gladiator  0          null       0
I would go with the where clause:
Select m.isPlaying
From Movie m
where m.id = :id and m.endDate is null and m.isDeleted = 0;
This can take advantage of an index on (id, isDeleted, endDate).
Also, the two are not equivalent. The first might return multiple rows where the second returns one. Or the second might return one row when the first returns none.
The first option might return more than 1 row. Maybe you know it won't because you know what data you have stored, but the SQL engine doesn't, and that will affect its execution plan.
Considering that you only have 1 index and it's on the ID column, the 2nd query should be faster in theory, since it would do an index scan from the highest ID with a predicate for the given name, stopping at the first match.
The first query will do a full table scan while comparing column name, endDate and isDeleted, since it won't stop at the first result that matches.
Posting the execution plans for both queries might help clear this up.
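To see that the two queries are not equivalent, here is a small Python sketch using the example rows from the question. It assumes SQLite through the sqlite3 module, so LIMIT 1 replaces TOP 1, and it selects id instead of isPlaying to make the chosen row visible. The later update that closes record 5 is a hypothetical scenario, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movie (id INTEGER PRIMARY KEY, name TEXT, "
    "isPlaying INTEGER, endDate TEXT, isDeleted INTEGER)"
)
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?, ?, ?)",
    [
        (3, "Gladiator", 1, "03/1/2017", 1),
        (4, "Gladiator", 1, "03/1/2017", 1),
        (5, "Gladiator", 0, None, 0),
    ],
)

WHERE_QUERY = ("SELECT id FROM Movie "
               "WHERE name = ? AND endDate IS NULL AND isDeleted = 0")
TOP_QUERY = "SELECT id FROM Movie WHERE name = ? ORDER BY id DESC LIMIT 1"

def run_both(name):
    """Return the ids found by the filtering query and by the newest-row query."""
    by_where = [row[0] for row in conn.execute(WHERE_QUERY, (name,))]
    by_top = [row[0] for row in conn.execute(TOP_QUERY, (name,))]
    return by_where, by_top

before = run_both("Gladiator")

# Hypothetical: the active record is closed and no new version is inserted
conn.execute("UPDATE Movie SET endDate = '04/1/2017', isDeleted = 1 WHERE id = 5")
after = run_both("Gladiator")
```

While exactly one active record exists, both queries pick row 5. Once no active record exists, the filtering query returns nothing but the ORDER BY query still returns the newest (deleted) row.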

What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:
+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ... | .... | ... |
and I need to perform the following query:
SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)
I'm aware of how single and multi-column indexes work, so I created these ones:
(name, age)
(lastname, age)
Is that all the indexes I need?
The above condition can be rephrased as:
... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21))
but I'm not sure how smart RDBMS are, and if those indexes are the correct ones
Your approach is reasonable. Two factors are essential here:
Postgres can combine multiple indexes very efficiently with bitmap index scans.
PostgreSQL versus MySQL for EAV structures storage
B-tree index usage is by far most effective when only leading columns of the index are involved.
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
Test case
If you don't have enough data to measure tests, you can always whip up a quick test case like this:
CREATE TABLE customer (id int, name text, lastname text, age int);
INSERT INTO customer
SELECT g
, left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
, left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
, ((random()^2) * 100)::int
FROM generate_series(1, 30000) g; -- 30k rows for quick test case
For your query (reformatted):
SELECT *
FROM customer
WHERE (name = 'john' OR lastname = 'doe')
AND age = 21;
I would go with
CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);
However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.
CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);
The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.
Updated SQL Fiddle.
Why age first in the index?
This is not very important and needs deeper understanding to explain. But since you ask ...
The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:
Multicolumn index and performance
I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:
Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.
And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.
age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris": no space is lost to additional padding, which results in a smaller index. I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate that it's about 10 % bigger. And size matters.
For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
Configuring PostgreSQL for read performance
Measure the size of a PostgreSQL table row
Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.
UNION query equivalent?
Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.
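The duplicate-folding caveat is easy to reproduce. The sketch below is an illustration only: it uses SQLite through Python's sqlite3 module rather than Postgres, and the four rows are invented. It selects a subset of columns so that two distinct table rows look identical in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER, name TEXT, lastname TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, "john", "smith", 21),
        (2, "john", "smith", 21),  # same name/lastname/age as row 1
        (3, "jane", "doe", 21),
        (4, "john", "doe", 30),    # wrong age, never matches
    ],
)

# Original form: one row per matching table row
or_rows = conn.execute(
    "SELECT name, lastname FROM customer "
    "WHERE (name = 'john' OR lastname = 'doe') AND age = 21"
).fetchall()

# UNION form: identical result rows are folded into one
union_rows = conn.execute(
    "SELECT name, lastname FROM customer WHERE age = 21 AND name = 'john' "
    "UNION "
    "SELECT name, lastname FROM customer WHERE age = 21 AND lastname = 'doe'"
).fetchall()
```

Here the OR query returns three rows and the UNION query only two, because rows 1 and 2 collapse into a single ('john', 'smith') result.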
Turn the OR into two queries UNIONed:
SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'
Then create an index over (Age, Name) and another over (Age, LastName).

Why select of count fetches a lot of rows?

Table teams contains 1169 rows, 1133 of which have a UserId field != 0. There is an index on the "UserId" field.
Query:
EXPLAIN SELECT count(*) FROM teams WHERE UserId != 0
returns output that has Estimate of rows to be examined equal to 1133.
Why does the query need to examine all those rows? Shouldn't it just use the index for this purpose?
Thank you.
It will examine almost all rows because you want almost all rows (because you said UserId != 0). Sure, you then apply a "count" so only one record is shown, but they all had to be fetched in order to count them.
If you were to do
select count(1) from teams where UserId = 100
then it would examine only a few rows, because you are asking for a precise value (UserId = XX as opposed to UserId != yy).
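A toy model may make the difference concrete. The Python sketch below treats the index as a plain sorted list and counts how many entries each predicate has to visit. This is a simplification for illustration, not how the database engine is implemented, and it assumes (hypothetically) that the 1133 non-zero UserId values are all distinct.

```python
from bisect import bisect_left, bisect_right

# 1169 rows in total: 36 with UserId 0 and 1133 with a non-zero UserId
index = sorted([0] * 36 + list(range(1, 1134)))

def entries_for_equal(idx, value):
    """Entries visited for UserId = value: one contiguous slice of the index."""
    return bisect_right(idx, value) - bisect_left(idx, value)

def entries_for_not_equal(idx, value):
    """Entries visited for UserId != value: everything outside that slice."""
    return len(idx) - entries_for_equal(idx, value)

examined_not_zero = entries_for_not_equal(index, 0)
examined_equal_100 = entries_for_equal(index, 100)
```

Even with the index, counting the rows where UserId != 0 means walking every matching entry, which is nearly the whole table here, while the equality lookup touches a single entry.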

Fetch two next and two previous entries in a single SQL query

I want to display an image gallery, and on the view page, one should be able to have a look at a bunch of thumbnails: the current picture, wrapped with the two previous entries and the two next ones.
The problem with fetching the two next/previous entries is that I can't (unless I'm mistaken) select something like MAX(id) WHERE id < xx.
Any idea?
Note: of course the ids are not consecutive, since the rows are the result of multiple WHERE conditions.
Thanks
Marshall
You'll have to forgive the SQL Server style variable names, I don't remember how MySQL does variable naming.
(SELECT *
FROM photos
WHERE photo_id = @current_photo_id)
UNION ALL
(SELECT *
FROM photos
WHERE photo_id > @current_photo_id
ORDER BY photo_id ASC
LIMIT 2)
UNION ALL
(SELECT *
FROM photos
WHERE photo_id < @current_photo_id
ORDER BY photo_id DESC
LIMIT 2);
This query assumes that you might have non-contiguous IDs. It could become problematic in the long run, though, if you have a lot of photos in your table, since LIMIT (like TOP) is often evaluated after the entire result set has been retrieved from the database. YMMV.
In a high load scenario, I would probably use these queries, but I would also prematerialize them on a regular basis so that each photo had a PreviousPhotoOne, PreviousPhotoTwo, etc column. It's a bit more maintenance, but it works well when you have a lot of static data and need performance.
If your IDs are continuous you could do
where id >= @id-2 and id <= @id+2
Otherwise I think you'd have to union 3 queries, one to get the record with the given id and two others messing about with top and order by like this
select *
from table
where id = @id
union
select *
from (select top 2 * from table where id < @id order by id desc) as prev_rows
union
select *
from (select top 2 * from table where id > @id order by id) as next_rows
Performance will not be too bad as you aren't retrieving massive sets of data but it won't be great due to using a union.
If you find performance starts being a problem you could add columns to hold the ids of the previous and next items; calculating the ids using a trigger or overnight process or something. This will mean you only do the hard query once rather than each time you need it.
I think this method should work fine for non-contiguous IDs and should be more efficient than using UNIONs. currentID would be set either using a constant in SQL or by passing it from your program.
SELECT * FROM photos WHERE ID = currentID OR ID IN (
SELECT ID FROM photos WHERE ID < currentID ORDER BY ID DESC LIMIT 2
) OR ID IN (
SELECT ID FROM photos WHERE ID > currentID ORDER BY ID ASC LIMIT 2
) ORDER BY ID ASC
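As a quick check of the subquery approach with non-contiguous IDs, here is a minimal Python sketch. It assumes SQLite through the sqlite3 module, which accepts LIMIT inside an IN subquery; other databases may not. The sample IDs and the helper name neighbours are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (ID INTEGER PRIMARY KEY)")
# Hypothetical, deliberately non-contiguous IDs
conn.executemany("INSERT INTO photos VALUES (?)",
                 [(i,) for i in (1, 4, 5, 9, 12, 20, 21)])

def neighbours(current_id):
    """Return the current ID with up to two IDs on each side, in ascending order."""
    sql = """
        SELECT ID FROM photos WHERE ID = :cur OR ID IN (
            SELECT ID FROM photos WHERE ID < :cur ORDER BY ID DESC LIMIT 2
        ) OR ID IN (
            SELECT ID FROM photos WHERE ID > :cur ORDER BY ID ASC LIMIT 2
        ) ORDER BY ID ASC
    """
    return [row[0] for row in conn.execute(sql, {"cur": current_id})]

middle = neighbours(9)
first = neighbours(1)
```

At the edges of the table the result simply contains fewer rows, as the lookup for ID 1 shows.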
If you are just interested in the previous and next records by id, couldn't you just restrict the query to id = xx-2, xx-1, xx, xx+1, xx+2, either with multiple conditions in the WHERE clause or with WHERE id IN (...)?