Paginated UNION ALL result - best performance - sql

We have a situation where we need results from 4 different tables combined into one list and paginated through OFFSET/FETCH.
We want to select records from tables a, b, c & d, order them by CreatedDatetime, and then OFFSET X, FETCH Y. The tables are quite big (in terms of number of rows), and doing a plain UNION ALL followed by pagination sounds horrible, because it would probably mean building the whole combined list of records and only then taking the requested page.
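For reference, the naive version would look roughly like this (just a sketch; the Id column and the literal offsets are illustrative):

-- Naive approach: materialize the whole combined list, then take one page.
SELECT u.*
FROM (
    SELECT Id, CreatedDatetime, 'a' AS Src FROM a UNION ALL
    SELECT Id, CreatedDatetime, 'b' AS Src FROM b UNION ALL
    SELECT Id, CreatedDatetime, 'c' AS Src FROM c UNION ALL
    SELECT Id, CreatedDatetime, 'd' AS Src FROM d
) u
ORDER BY u.CreatedDatetime DESC
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;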
The problem is that none of the tables can be used as a reference to derive a Start/End datetime window, because a given page may or may not contain records from any particular table. For example, the final result might contain records from any combination of tables: a; a/b; a/b/c; a/b/c/d; b; b/c; ... and we need a fixed number of rows to be returned (a page size of, for example, 20).
Any ideas on how to most effectively approach this?
UPDATE
Based on a question from @HABO:
Unfortunately, there are no special clues like that about the queries. We are showing user activities in the system, and there are different kinds of them (the tables we select over). The query shows data to an administrator who reviews the activities. How the administrator looks at the data can vary drastically: some users will have thousands of activities in the last few hours and the admin will want to see them all; in other cases a user will have 3 actions in a day and the admin will look at just the first page of data.
PS: These aren't pure log tables, as the activities act as state machines over time, each with its own states, which we also query against.

If you know the page size (e.g. 100), you can simply write 4 TOP 100 queries (ORDER BY the create date) and then do a UNION ALL on the results.
That way, even if all of the first 100 records come from one table, you are covered.
For subsequent paging queries, you'll need to record the last displayed row from each table and use it as your high-water mark for the next fetch (SELECT TOP 100 * FROM TableA WHERE RowID > @HighWater).
This should be fairly efficient...
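A sketch of that first page in T-SQL (the Id and Src columns are illustrative; each branch sits in its own derived table because a UNION branch cannot carry its own ORDER BY directly):

-- First page: newest 100 per table, union them, keep the newest 100 overall.
SELECT TOP (100) *
FROM (
    SELECT * FROM (SELECT TOP (100) 'a' AS Src, Id, CreatedDatetime FROM a ORDER BY CreatedDatetime DESC) ta
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'b' AS Src, Id, CreatedDatetime FROM b ORDER BY CreatedDatetime DESC) tb
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'c' AS Src, Id, CreatedDatetime FROM c ORDER BY CreatedDatetime DESC) tc
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'd' AS Src, Id, CreatedDatetime FROM d ORDER BY CreatedDatetime DESC) td
) x
ORDER BY CreatedDatetime DESC;

For subsequent pages, each inner branch would additionally filter on the high-water mark described above.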

This is where a cache comes in useful. You can either cache the result of the query in your application layer and do the paging there if it is not too large, or cache the results of the query in a table (or temp table) if it is large.
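A minimal sketch of the temp-table variant (the #ActivityCache name and the Id/Src columns are illustrative):

-- Materialize the combined list once (e.g. per admin request or short-lived cache)...
SELECT 'a' AS Src, Id, CreatedDatetime INTO #ActivityCache FROM a
UNION ALL SELECT 'b', Id, CreatedDatetime FROM b
UNION ALL SELECT 'c', Id, CreatedDatetime FROM c
UNION ALL SELECT 'd', Id, CreatedDatetime FROM d;

CREATE CLUSTERED INDEX IX_ActivityCache ON #ActivityCache (CreatedDatetime DESC, Src, Id);

-- ...then every page is a cheap, ordered read from the cache.
SELECT Src, Id, CreatedDatetime
FROM #ActivityCache
ORDER BY CreatedDatetime DESC, Src, Id
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;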

There will be filters, I suppose. From what you say, those may vary a lot, so in the worst case every column can be a filter.
My suggestion is to use 5 views: one per table and a final one that UNIONs them. Just make sure all filter columns map back to the physical tables as directly as possible.
Finally, select from the master view and fetch, but be careful with the ORDER BY clause. Make sure the ORDER BY produces a unique combination of values, otherwise a row can change pages on a simple refresh. If there is a user-defined ORDER BY, force-append some key columns at the end.
How to ensure the ORDER BY has distinct values for a 100% safe OFFSET/FETCH:
In each of the 4 views, add a column with a simple constant number as its value, e.g. 1, 2, 3, 4 AS [TableSource].
Make sure you select the PK of each table. If you don't have one, you have to create one in the views, probably using ROW_NUMBER or NEWID, as [Pk] for example.
Finally, when selecting from the master view, ORDER BY CreateDate, Pk, TableSource. That way you are 100% sure that, for the same set of data, every row lands in exactly the same position, resulting in correct paging.
Example of safely isolating a page of 30 rows ordered by CreateDate (rows 3001-3030):
SELECT * FROM (
    SELECT src, id, ROW_NUMBER() OVER (ORDER BY dt DESC, src, id) AS rn FROM (
        SELECT 1 AS src, id, dt FROM table1 /*WHERE x=y*/ UNION ALL
        SELECT 2 AS src, id, dt FROM table2 /*WHERE x=y*/ UNION ALL
        SELECT 3 AS src, id, dt FROM table3 /*WHERE x=y*/ UNION ALL
        SELECT 4 AS src, id, dt FROM table4 /*WHERE x=y*/
    ) alltables
) data
WHERE data.rn BETWEEN 3001 AND 3030
ORDER BY data.rn

Related

Implementing LIMIT against rows of single table while using LEFT JOIN

Given a fairly stereotypical scenario with an item table referenced by an images table holding multiple images for the same item, I'm trying to figure out how to retrieve a specific number of items, while collecting all of the image rows.
The setup is trivial, and looks like:
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    ...
    ...
)
CREATE TABLE images (
    item_id INTEGER REFERENCES items(id), -- not a primary key: one item can have many images
    url STRING,
    ...
    ...
)
LIMIT 20 clamps the maximum number of rows in the entire result set, and incidentally truncates mid-way through an item record.
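Concretely, the single-query attempt looks roughly like this (a sketch; the real column lists are longer):

-- Caps the whole result at 20 rows, so the last item's images can be cut off mid-item.
SELECT items.id, images.url
FROM items
LEFT JOIN images ON images.item_id = items.id
ORDER BY items.id
LIMIT 20;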
I'm currently doing two queries, but besides being architecturally not at all ideal, it's practically proving quite awkward to coordinate (which makes a lot of sense since I'm definitely doing it wrong). I'm not finding any info on how to coordinate LIMIT with LEFT JOINs, so I thought I'd ask. Thanks!
NB: Similar questions do exist, but they ask how to do things like retrieving the first (say) 5 images for each item. I'm looking to retrieve (say) 10 items and get all the images for each.
A query like:
SELECT * FROM items ORDER BY ? LIMIT 10
will return 10 rows (at most) from items.
You need to provide the column(s) for the ORDER BY clause if you have some specific sorting condition in mind.
If not, then remove the ORDER BY clause, but in that case nothing guarantees which rows you will get.
So all you have to do is LEFT join the above query to images:
SELECT it.*, im.*
FROM (SELECT * FROM items ORDER BY ? LIMIT 10) it
LEFT JOIN images im
ON im.item_id = it.id

Select latest and earliest times within a time group and a pivot statement

I have attendance data that contains a username, time, and status (IN or OUT). I want to show attendance data that contains a name and the check-in/out times. I expect a person to check in and out no more than twice a day. The data looks like this:
As you can see, my problem is that one person can have multiple data entries, a few seconds apart, for the same login attempt. This is because I get the data from a fingerprint attendance scanner, and the machine in some cases makes multiple entries, sometimes just 5-10 seconds apart. I want to select the data to look like this:
How can I identify the proper time for the login attempt, and then select the data with a PIVOT?
First, you need to normalize your data by removing the duplicate entries. In your situation, that's a challenge because the duplicated data isn't easily identified as a duplicate. You can make some assumptions though. Below, I assume that no one will make multiple login attempts in a two minute window. You can do this by first using a Common Table Expression (CTE, using the WITH clause).
Within the CTE, you can use the LAG function. Essentially what this code is saying is "for each partition of user and entry type, if the previous value was within 2 minutes of this value, then put a number, otherwise put null." I chose null as the flag that will keep the value because LAG of the first entry is going to be null. So, your CTE will just return a table of entry events (ID) that were distinct attempts.
Now, you prepare another CTE for the PIVOT to pull from; it has everything from your table, but only for the entry IDs you care about. The PIVOT then takes the MIN/MAX of your IN/OUT times.
WITH UNIQUE_LOGINS AS (
    -- Window functions can't appear in WHERE, so compute the flag in a subquery first
    SELECT ID FROM (
        SELECT ID,
               CASE WHEN LAG(TIME) OVER (PARTITION BY USERNAME, STATUS ORDER BY TIME)
                         + (2/60/24) > TIME THEN 1 END AS DUP_FLAG -- within 2 minutes of the previous entry
        FROM LOGIN_TABLE
    )
    WHERE DUP_FLAG IS NULL -- NULL = first entry of its burst (or first entry overall): keep it
),
TEMP_FOR_PIVOT AS (
    SELECT USERNAME, TIME, STATUS FROM LOGIN_TABLE WHERE ID IN (SELECT ID FROM UNIQUE_LOGINS)
)
SELECT * FROM TEMP_FOR_PIVOT
PIVOT (
    MIN(TIME) AS FIRST_TIME, MAX(TIME) AS LAST_TIME FOR STATUS IN ('IN' AS CHECK_IN, 'OUT' AS CHECK_OUT)
)
From there, if you need to rearrange or rename your columns, then you can just put that last SELECT into yet another CTE and then select your values from it. There is some more about PIVOT here: Rotate/pivot table with aggregation in Oracle

Randomly choose which group values to include in GROUP BY

I have a fairly large table. I'd like to do a GROUP BY over the entire table across users, but only return data for 10% of the users, sampled randomly. I know how to sample rows uniformly, but is there a one-step way to randomly decide which users to include in a GROUP BY on the user field?
Maybe you could order the results of the SELECT randomly, like this:
SELECT TOP x *
FROM table_name
ORDER BY newid()
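Applied to the user field rather than to individual rows, that idea might look like this (a sketch; the events table and user_id column are illustrative):

-- Pick a random 10% of users, then aggregate only their rows.
SELECT e.user_id, COUNT(*) AS events
FROM events e
JOIN (
    SELECT TOP 10 PERCENT user_id
    FROM (SELECT DISTINCT user_id FROM events) u
    ORDER BY NEWID()
) sampled ON sampled.user_id = e.user_id
GROUP BY e.user_id;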

SQL Server - Speed up count on large table

I have a table with close to 30 million records and just a few columns. One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.
For now I have this (example where the year I'm searching for is '1970'; it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born parameter used) returns just over 1 million results.
I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause the whole thing speeds up a lot.
Is there a way to speed up the count in that query? What I care about is getting the paged results back together with the total count.
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high-cardinality index (say, on the full date of birth rather than just the year) would be much quicker.
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
You have a couple of options:
1. Data Aggregation
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
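A do-it-yourself sketch of that rollup cache (the BornCountCache name is illustrative; refresh it on whatever schedule keeps the staleness acceptable):

-- Persist the per-year counts so the paging query never has to count a million rows.
IF OBJECT_ID('dbo.BornCountCache') IS NOT NULL DROP TABLE dbo.BornCountCache;
SELECT Born, COUNT_BIG(*) AS Total
INTO dbo.BornCountCache
FROM Person
GROUP BY Born;

-- The paging query then reads its total from the cache:
-- SELECT Total FROM dbo.BornCountCache WHERE Born = '1970'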
2. Parallel Queries
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run these either in parallel server side, or add it to the user interface.
3. No-SQL
This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging, then I'd probably go with option 2 and an AJAX call to get the count.
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force SQL Server's hand, at the potential cost of making writes more expensive (measure that) and of blocking schema changes to the Person table while the view exists.
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)
By putting a clustered index on a view, you make it a system-maintained copy. The size of this copy is much smaller than 30 million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
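If you do want to reference it explicitly, something like this should work (NOEXPAND makes the optimizer use the view's index even outside Enterprise edition):

-- Read the pre-aggregated count straight from the indexed view.
SELECT NumRows
FROM dbo.BornCounts WITH (NOEXPAND)
WHERE Born = '1970';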
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, max(Row) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Why not do it like that?
(Edit: I tried to bold the TotalPeople line above, but bold doesn't render inside code.)
Here is a novel approach using system DMVs. It works if you can get by with a "good enough" count, you don't mind creating an index for every distinct value of [Born], and you don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to get a row count.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' -- or concatenate the @p1 parameter instead of the literal
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close). Also, this approach won't work if you need to filter on anything but [Born].

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no way an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
To find the "row" number of a a row with a given id:
-- rank first, then filter: a WHERE on id before the window function would always yield row_number = 1
select id, rownum
from (select id, row_number() over (order by name, id) as rownum
      from t
     ) ranked
where id = <id>
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, name, row_number() over (order by name, id) as rownum
      from t
     ) t
     join
     (select rownum
      from (select id, row_number() over (order by name, id) as rownum
            from t
           ) ranked
      where id = <id>
     ) tid
     on t.rownum between tid.rownum - 5 and tid.rownum + 5
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.