SQL query to select a certain number of records with a WHERE clause

Let's say I have a table that looks like the following:
ID | EntityType | Foo  | Bar
----------------------------
1  | Business   | test | test
2  | Family     | welp | testing
3  | Individual | hmm  | 100
4  | Family     | test | test
5  | Business   | welp | testing
6  | Individual | hmm  | 100
This table is fairly large, and there are random (fairly infrequent) instances of "Business" in the EntityType column.
A query like
SELECT TOP 500 * FROM Records WHERE EntityType='Business' ORDER BY ID DESC
works perfectly for grabbing the first set of Businesses. Now how would I page backwards and get the previous set of 500 records that meet my criteria?
I understand I could look at records between IDs, but there is no guarantee of what those IDs would be; for example, it wouldn't just be the last ID of the previous query minus 500, because the Business EntityType is so infrequent.
I've also looked at some paging models, but I'm not sure how to integrate them while keeping my WHERE clause just as it is (only accepting an EntityType of Business) and still guaranteeing 500 records. (I've used one that "pages" back 500 records but shows only about 18 businesses, because those are all that fall within the 500 total records returned.)
I appreciate any help on this matter!

select * from (
    select top 500 * from (
        select top 1000 * from Records where EntityType = 'Business' order by ID desc
    ) x
    order by ID
) y
order by ID desc
- Innermost query: take the top 1000 rows, covering both the page 1 and page 2 results.
- Middle query: order ascending and take the top 500, i.e. the page 2 records from the first query.
- Outermost query: restore the descending order for display.
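If you're on SQL Server 2012 or later, OFFSET/FETCH expresses the same paging more directly. A sketch, assuming a 0-based @page variable (variable names are my own):
declare @page int = 1, @page_size int = 500;

select *
from Records
where EntityType = 'Business'
order by ID desc
offset @page * @page_size rows
fetch next @page_size rows only;
With @page = 1 this returns rows 501-1000 of the filtered, ID-descending result, i.e. the same page 2 as the nested query above.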

I believe what you need is called paging. There is a great article on paging on CodeGuru (I think it was mentioned here before):
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
I think you will find everything you need there.

So, I'd do this slightly differently from the other answers. My query always pulls the last 500 rows starting from a minimum row number, and it requires a row count.
Note that computing the row count outside the query makes the SQL considerably easier to write. I wish it weren't necessary.
DECLARE @row_min integer;
DECLARE @row_count integer;
SET @row_min = 500;
SELECT @row_count = COUNT(*) FROM Records WHERE EntityType = 'Business';

WITH MyCTE AS
(
    SELECT
        ID, EntityType, Foo, Bar,
        ROW_NUMBER() OVER (ORDER BY ID) AS RowNum
    FROM Records
    WHERE EntityType = 'Business'
)
SELECT TOP 500 *, (SELECT MAX(RowNum) FROM MyCTE) AS RowMax
FROM MyCTE
WHERE RowNum >
      CASE SIGN(@row_count - 500 - @row_min)
           -- fewer than 500 rows beyond @row_min: back up so a full page is returned
           WHEN -1 THEN (@row_count - 500)
           ELSE @row_min
      END
  AND RowNum <=
      CASE SIGN(@row_count - 500 - @row_min)
           WHEN -1 THEN @row_count
           ELSE @row_min + 500
      END;

-- Note: for debugging purposes.
SELECT SIGN(@row_count - 500 - @row_min), (@row_count - 500 - @row_min), @row_count, @row_min;

pagination and filtering on a very large table in postgresql (keyset pagination?)

I have a scientific database with currently 4,300,000 records, and an API is feeding it. In June 2020, I will probably have about 100,000,000 records.
This is the layout of the table 'output':
ID | sensor_ID | speed | velocity | direction
-----------------------------------------------------
1  | 1         | 10    | 1        | up
2  | 2         | 12    | 2        | up
3  | 2         | 11.5  | 1.5      | down
4  | 1         | 9.5   | 0.8      | down
5  | 3         | 11    | 0.75     | up
...
BTW, this is dummy data. But output is a table with 5 columns: ID, sensor_ID, speed, velocity and direction.
What I want to achieve is a decent pagination and filter method. I want to create a website (in node.js) where these 4,000,000+ records (for now) will be displayed, 10,000 records per page. I also want to be able to filter on sensor_ID, speed, velocity or direction.
For now, I have this query for selecting specific rows:
SELECT * FROM output ORDER BY ID DESC OFFSET 0 LIMIT 10000;     -- first 10,000 rows
SELECT * FROM output ORDER BY ID DESC OFFSET 10000 LIMIT 10000; -- next 10,000 rows
...
I'm searching for some information/tips about creating a decent pagination method. For now, it's still quite fast the way I do it, but I think it will be a lot slower when we hit 50,000,000+ records.
First of all, I found this page: https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/. I'm interested in the keyset pagination. But to be honest, I have no clue how to start.
What I think I must do:
Create an index on the ID-field:
CREATE UNIQUE INDEX index_id ON output USING btree (ID)
I also found this page: https://leopard.in.ua/2014/10/11/postgresql-paginattion. When you scroll down to "Improvement #2: The Seek Method", you can see that they dropped the OFFSET-clause, and are using a WHERE-clause. I also see that they are using the last insert ID in their query:
SELECT * FROM output WHERE ID < <last_insert_id_here> ORDER BY ID DESC LIMIT 10000
I do not fully understand this. For the first page, I need the very last insert ID. Then I fetch the 10,000 newest records. But after that, to get the second page, I don't need the very last insert ID; I need the 10,000th-from-last insert ID (I guess).
Can someone give me a good explanation of how to paginate and filter in a fast way?
The stuff I'm using:
- postgresql
- pgadmin (for database management)
- node.js (latest version)
Thanks everyone! And have a nice 2020!
EDIT 1: I have no clue, but could massiveJS (https://massivejs.org/) be something good to use? And should I use it on ALL queries, or only on the pagination queries?
EDIT 2: I THINK I got it figured out a little bit (correct me if I'm wrong).
Let's say I have 100,000 records:
1) Get the last inserted ID
2) Use this last inserted ID to fetch the last 10,000 records
SELECT * FROM output WHERE ID < 100000 ORDER BY ID DESC LIMIT 10000 -- the last insert ID is 100,000 here because I have 100,000 records
3) Show the 10,000 records, but also save the insert ID of the 10,000th record to use in the next query
4) Get the next 10,000 records with the new last insert id
SELECT * FROM output WHERE ID < 90000 ORDER BY ID DESC LIMIT 10000 -- 90,000 is the very last insert ID minus 10,000
5) ...
Is this correct?
Here's how I handle this. For the first page I fetch, I use
SELECT id, col, col, col
FROM output
ORDER BY id DESC
LIMIT 10000
Then, in my client program (node.js) I capture the id value from the last row of the result set. When I need the next page, I do this.
SELECT id, col, col, col
FROM output
WHERE id < my_captured_id_value
ORDER BY id DESC
LIMIT 10000
This exploits the index. And it works correctly even if you have deleted some rows from the table.
By the way, you probably want a descending index if your first pagination page has the largest ids. CREATE UNIQUE INDEX index_id ON output USING btree (ID DESC).
Pro tip: SELECT * is harmful to performance on large databases. Always list the columns you actually need.
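Since you also want to filter, note that keyset pagination composes with a WHERE clause as long as an index covers both the filter and the sort. A sketch, assuming a filter on sensor_ID (column names taken from the question; the index name is my own):
CREATE INDEX idx_output_sensor_id ON output (sensor_ID, ID DESC);

SELECT ID, sensor_ID, speed, velocity, direction
FROM output
WHERE sensor_ID = 2
  AND ID < 90000   -- last ID seen on the previous page
ORDER BY ID DESC
LIMIT 10000;
The composite index lets Postgres seek straight to the first matching row instead of scanning and discarding non-matching ones.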
In keyset pagination, the WHERE clause should reference the same columns as the ORDER BY clause; for a DESC column you use <, and for an ASC column you use >.
For the first page you can use something like this:
SELECT Col1, Col2, Col3
FROM db.tbl
WHERE Col3 LIKE '%search_term%'
ORDER BY Col1 DESC , Col2 ASC
LIMIT 10000
and for the next page, you should feed the Col1 and Col2 values from the last row of the result into the query, taking care to break ties on Col1 by comparing Col2, like this:
SELECT Col1, Col2, Col3
FROM db.tbl
WHERE Col3 LIKE '%search_term%'
  AND (Col1 < Col1_last_row_value
       OR (Col1 = Col1_last_row_value AND Col2 > Col2_last_row_value))
ORDER BY Col1 DESC, Col2 ASC
LIMIT 10000
and on the server or client side you should check whether the query brings back any results; if not, you are done and the loading icon of the "infinite scroll" has to be hidden
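A common refinement, sketched here: request one row more than the page size, so you can tell whether another page exists without an extra round trip.
SELECT Col1, Col2, Col3
FROM db.tbl
WHERE Col3 LIKE '%search_term%'
ORDER BY Col1 DESC, Col2 ASC
LIMIT 10001;   -- page size + 1
If 10,001 rows come back, display 10,000 and keep the loader visible; otherwise you have reached the last page.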

SQL - Update top n records for each value in column a where n = count of column b

I have one table with the following columns and sample values:
[test]
ID | Sample | Org | EmployeeNumber
 1 |        | 100 | 6513241
 2 |        | 200 | 3216542
 3 |        | 300 | 5649841
 4 |        | 100 | 9879871
 5 |        | 200 | 6546548
 6 |        | 100 | 1116594
My example count query based on [test] returns these sample values grouped by Org:
Org | Count of EmployeeNumber
100 | 3
200 | 2
300 | 1
My question is: can I use this count to update test.Sample to 'x' for the top 3 records of Org 100, the top 2 records of Org 200, and the top 1 record of Org 300? It does not matter which records are updated, as long as the number of records updated for each Org equals the count of EmployeeNumber.
I realize that I could just update all records in this example, but I have 175 Orgs and 900,000 records, and my real count query includes an IIf that returns only a partial count based on other columns.
The db that I am taking over uses a recordset and loop to update. I am trying to write this in one SQL update statement. I have tried several variations of nested select statements but can't quite figure it out. Any help would save my brain from exploding. Thanks!
Assuming that id is the unique ID of the row, you could use a correlated subquery: count the rows that share the current row's organization and have an ID less than or equal to the current row's ID, and check that this count is less than or equal to the number of records from that organization you want to designate.
For example to mark 3 records of the organization 100 you could use:
UPDATE test
SET sample = 'x'
WHERE org = 100
  AND (SELECT COUNT(*)
       FROM test t
       WHERE t.org = test.org
         AND t.id <= test.id) <= 3;
And analogously for the other cases.
(Disclaimer: I don't have access to Access (ha, ha, pun), so I could not test it. But I guess it's basic enough, to work in almost every DBMS, also in Access.)
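To cover all 175 Orgs in one statement, the same correlated count can be compared against a per-org target instead of a literal. A sketch in the same spirit, equally untested (Access can be picky about aggregates inside UPDATE; domain aggregates like DCount are a common fallback), with <your condition> as a placeholder for the IIf logic of your real count query:
UPDATE test
SET Sample = 'x'
WHERE (SELECT COUNT(*)
       FROM test t
       WHERE t.Org = test.Org
         AND t.ID <= test.ID)
   <= (SELECT COUNT(IIf(<your condition>, t2.EmployeeNumber, Null))
       FROM test t2
       WHERE t2.Org = test.Org);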

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
    ID INT
    , TimeFrom INT
    , TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom falls before the TimeTo of another record, i.e. the intervals overlap:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened, linear idle report, but with too many of these overlaps I end up with negative time in use. E.g., if the window above for ID = 10 were 150 seconds long, and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150 - (20 + 20 + 90 + 75) = -55. I tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "which seconds have an interval": this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
    SELECT A.N, ROW_NUMBER() OVER (ORDER BY A.N) RowID
    FROM (SELECT TOP 60 1 N FROM master..spt_values) A
       , (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
    SELECT 1
    FROM Times SE
    WHERE SE.ID = 10
      AND SE.TimeFrom <= C.RowID
      AND SE.TimeTo >= C.RowID
      AND EXISTS (
          SELECT 1
          FROM Times2 D
          WHERE D.ID = SE.ID
            AND D.TimeFrom <= C.RowID
            AND D.TimeTo >= C.RowID
      )
    GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate why avoiding this approach is for the best: even though the Row Count Spool reduces the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)

INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY ID, TimeFrom

DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0

-- quirky update: carries @id/@timeTo from row to row to number the groups
UPDATE @tmp
SET @groupId = CASE WHEN ID != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = ID

SELECT ID, MIN(TimeFrom), MAX(TimeTo)
FROM @tmp
GROUP BY ID, GroupId
ORDER BY ID
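For comparison, a single-SELECT sketch of the same flattening, assuming SQL Server 2012+ for the windowed running MAX (untested at scale): a row starts a new group exactly when its TimeFrom lies past every TimeTo seen so far for that ID.
WITH Marked AS (
    SELECT ID, TimeFrom, TimeTo,
           -- 1 when this row starts a new group, 0 when it overlaps an earlier row
           CASE WHEN TimeFrom <= MAX(TimeTo) OVER (PARTITION BY ID
                                                   ORDER BY TimeFrom, TimeTo
                                                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0 ELSE 1 END AS IsStart
    FROM Times
), Grouped AS (
    SELECT ID, TimeFrom, TimeTo,
           -- a running total of the start flags numbers the groups
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY TimeFrom, TimeTo
                              ROWS UNBOUNDED PRECEDING) AS GroupId
    FROM Marked
)
SELECT ID, MIN(TimeFrom) AS TimeFrom, MAX(TimeTo) AS TimeTo
FROM Grouped
GROUP BY ID, GroupId;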
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now for each row in the result-set of LHS left join RHS the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
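A sketch of that query in T-SQL, with the caveat that it assumes each row has at most one overlapping successor (as in the sample data); rows with several overlapping successors would be counted more than once:
SELECT LHS.ID,
       SUM(ISNULL(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times LHS
LEFT JOIN Times RHS
       ON RHS.ID = LHS.ID
      AND RHS.TimeFrom > LHS.TimeFrom
      AND RHS.TimeFrom < LHS.TimeTo
GROUP BY LHS.ID;
For the sample rows of ID 10 this yields 20 + 10 + 15 + 75 = 120, the same total as the flattened intervals (10-30) and (50-150).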
What about something like below (assumes SQL Server 2005+ due to the CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
how can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he must have been back by around second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) the user might be using your application with. (Each has its own "idle detection".)
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table at all? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID, TimeFrom, TimeTo) VALUES ('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID  TIMEFROM  TIMETO
10  10        30
10  50        70
10  75        150
Note: the output is slightly different from the output you desired, but I feel that this is more accurate, for the reason outlined above: a user cannot go idle a second time without first returning from his current idle state. Either he STAYS idle (and a second thread/tab runs into the idle event), or he came back in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This may cost 3 ms upon insert, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions are the cause of the wrong inserts, you would also need to check that most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded idle on tab2 a few seconds later, because tab2 ran into its idle timeout while the user only refreshed tab1.
I once had the 'same' problem with 'days' (additionally without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start down to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally you can "ignore" e.g. 10-second granularity by dividing; you lose some precision but gain speed:
select count(distinct sec) * d.d
from times t, seconds s,
     (select min(timefrom) m from times where id=10) as m,
     (select 10 d) as d
where s.sec between (t.timefrom - m.m) / d.d and (t.timeto - m.m) / d.d - 1
  and id = 10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

SQL Get n last unique entries by date

I have an Access database that I'm well aware is quite poorly designed; unfortunately, it is what I must use. It looks a little like the following:
(Row# is not a column in the database, it's just there to help me describe what I'm after)
Row#  ID   Date       Misc
1     001  01/8/2013  A
2     001  01/8/2013  B
3     001  01/8/2013  C
4     002  02/8/2013  D
5     002  02/8/2013  A
6     003  04/8/2013  B
7     003  04/8/2013  D
8     003  04/8/2013  D
What I'm trying to do is obtain all information entered for the last n (by date) 'entries' where an 'entry' is all rows with a unique ID.
So if I want the last 1 entry I will get rows 6, 7 and 8. The last two entries will get me rows 4-8 etc.
I've tried to get the IDs needed in a subselect and then select all rows where those IDs appear, but I couldn't get it to work. Any help appreciated.
Thanks.
The proper Access syntax:
select *
from t
where ID in (select top 10 ID
from t
group by ID
order by max([date]) desc
)
I think this will work:
select *
from table
where Date in (
select distinct(Date) as unique_date from table order by unique_date DESC limit <num>
)
The idea is to use the subselect with a limit to identify only the dates you care about.
EDIT: Some databases do not allow a limit in a subquery (I'm looking at you, mysql). In that case, you'll have to make a temporary table out of the subquery then select * from it.
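Since Access itself has no LIMIT clause, the same idea can be sketched with TOP instead (n = 2 shown; untested):
select *
from table
where [Date] in (
    select top 2 [Date]
    from table
    group by [Date]
    order by [Date] desc
)
Grouping by [Date] deduplicates the dates, so no DISTINCT is needed inside the TOP subquery.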

Update a variable number of rows each with different values in a single SQL command?

I have a table of ordered data like this
ID ORDER
12 1
13 2
14 3
15 4
...
200 189
201 190
...
I would like to be able to update a few or all of their "order"s.
How should I do that?
For example, I might switch the ordering between ID=12 and ID=13 so it'd go like
ID ORDER
12 2
13 1
14 3
15 4
...
This would just be a couple of simple updates: UPDATE TABLE SET ORDER=1 WHERE ID=13 and UPDATE TABLE SET ORDER=2 WHERE ID=12
But if I wanted to move ID=200 all the way to the top,..
ID ORDER
12 2
13 3
14 4
15 5
...
200 190
201 1
...
then everything would have to be updated...? How do I do that? Is there a better way? Decimals?
edit: I'm using MSSQL btw
edit: clarification of use: I have a table with a long list of URL links, and the order of those links matters. I want to be able to rearrange their order. I have a web page that retrieves that list from the db and displays the names as an unordered list, and I can rearrange the items on that list. I'm stuck on how to get the newly ordered list's order updated into the database.
If you want to move an item up to the top, and update the order of all the others in a single statement, you can do the following:
UPDATE MyTable
SET [Order] = (CASE [Order] WHEN 190 THEN 1
                            ELSE [Order] + 1
               END)
WHERE [Order] BETWEEN 1 AND 190
To move Id 200 to the top, you have to do this:
1) Take everything which is ordered before Id 200 and increase its order by 1
update MyTable
set [Order] = [Order] + 1
where [Order] < (select [Order] from MyTable where Id = 200)
2) Put Id 200 at the top of the list (Order = 1)
update MyTable
set [Order] = 1
where Id = 200
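The two steps can also be collapsed into a single statement, which is atomic on its own. A sketch (untested), with [Order] bracketed because ORDER is a reserved word in T-SQL:
update MyTable
set [Order] = case when Id = 200 then 1 else [Order] + 1 end
where Id = 200
   or [Order] < (select [Order] from MyTable where Id = 200);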
I think what you are looking for is
Select * from TableName order by ID Desc/Asc
i.e. ordering the rows ascending or descending.
I don't think the answers above quite cover it. I believe the answer to your question is what you already suspect: to move a row from the bottom to order = 1, you will have to rescore every other value, or use a number format with decimals.
However, I strongly suspect that if you elaborated more on why you need to do it this way, we could chime in with some better recommended methods.
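If the page posts back the complete new ordering, the full rescore can still be a single set-based statement. A sketch (T-SQL, assuming a hypothetical #NewOrder staging table of (Id, NewOrder) pairs populated from the posted list):
CREATE TABLE #NewOrder (Id int PRIMARY KEY, NewOrder int);
-- ... populate #NewOrder from the posted list ...

UPDATE t
SET t.[Order] = n.NewOrder
FROM MyTable t
JOIN #NewOrder n ON n.Id = t.Id;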