Limit Records with SQL in Access

We get an Access DB (.accdb) from an external source and have no control over the structure or data. We need to ingest the data into our DB using code. This means I have control over the SQL.
Our issue is that one table contains almost 13k records (currently 12,997) and takes a long time to process. I'd like to query the data from the source DB but only a predefined number of records at a time - let's say 1000 at a time.
I tried generating my query inside a loop where I update the number of records to return on each pass. So far, the only thing I've found that comes close to working is something like this:
SELECT *
FROM (
    SELECT TOP " + pageSize + " sub.*
    FROM (
        SELECT TOP " + startPos + " [Product Description Codes].*
        FROM [Product Description Codes]
        ORDER BY [Product Description Codes].PRODDESCRIPCODE
    ) sub
    ORDER BY sub.PRODDESCRIPCODE DESC
) subOrdered
ORDER BY subOrdered.PRODDESCRIPCODE
Where I increment pageSize and startPos with each loop. The problem is that it always returns 1000 rows, even on what I think should be the last loop, when it should return only 997, and then zero rows after that.
Can anyone help me with this? I don't have another column to filter on. Is there a way to select a certain number of records in a loop and then increment that number until I've gotten all the records, and then stop?

If PRODDESCRIPCODE is the primary key, then you can simplify your select, i.e.:
SELECT TOP 1000 *
FROM [Product Description Codes]
where PRODDESCRIPCODE > @pcode;
and start by passing an @pcode parameter of 0 (if the key is an int, or '' if text, etc.). In the next loop you would set the parameter to the max PRODDESCRIPCODE you have received.
(I am not sure if by SQL you meant MS SQL Server, or how you are doing this.)
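A sketch of the loop's first two passes (assuming an integer key, and adding an ORDER BY so that TOP is deterministic):
SELECT TOP 1000 *
FROM [Product Description Codes]
WHERE PRODDESCRIPCODE > 0
ORDER BY PRODDESCRIPCODE;
Then, if the largest key returned by that pass was (say) 4321:
SELECT TOP 1000 *
FROM [Product Description Codes]
WHERE PRODDESCRIPCODE > 4321
ORDER BY PRODDESCRIPCODE;
Stop as soon as a pass returns fewer than 1000 rows.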

Do you absolutely have to update records, or can you afford to insert the entire Access table into your local table, slap on a timestamp field, and structure your local queries to grab the most recent entry? Based on some of your comments above, it doesn't sound like you have any cases where you are keeping a local record over an imported one.
SELECT PRODDESCRIPCODE, MAX(timestamp) FROM table GROUP BY PRODDESCRIPCODE
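To pull the full latest row per code (not just the key and timestamp), you could join that aggregate back; a sketch, with the table and column names as placeholders:
SELECT t.*
FROM table AS t
INNER JOIN (SELECT PRODDESCRIPCODE, MAX(timestamp) AS latest
            FROM table
            GROUP BY PRODDESCRIPCODE) AS m
ON (t.PRODDESCRIPCODE = m.PRODDESCRIPCODE AND t.timestamp = m.latest);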

I ended up using a variation of the method from here:
http://www.jertix.org/en/blog/programming/implementation-of-sql-pagination-with-ms-access.html
Thank you all very much for your suggestions.

Related

Numbering rows in a view

I am connecting to an SQL database using a PLC, and need to return a list of values. Unfortunately, the PLC has limited memory, and can only retrieve approximately 5,000 values at any one time, however the database may contain up to 10,000 values.
As such I need a way of retrieving these values in 2 operations. Unfortunately the PLC is limited in the query it can perform, and is limited to only SELECT and WHERE commands, so I cannot use LIMIT or TOP or anything like that.
Is there a way in which I can create a view, and auto number every field in that view? I could then query all records < 5,000, followed by a second query of < 10,000 etc?
Unfortunately it seems that views do not support the identity column, so this would need to be done manually.
Does anyone have any suggestions? My only realistic option at the moment seems to be to create two views, one with the first 5,000 and one with the next 5,000...
I am using SQL Server 2000 if that makes a difference...
There are two solutions. The easiest is to modify your SQL table and add an IDENTITY column. If that is not a possibility, then you'll have to do something like the query below. For 10,000 rows it shouldn't be too slow, but as the table grows its performance will get worse and worse.
SELECT Col1, Col2,
       (SELECT COUNT(i.Col1)
        FROM yourtable i
        WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o
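Given that the PLC can only issue SELECT ... WHERE, one way to consume this is to wrap the numbering query in a view and page with plain WHERE clauses (a sketch; numbered_view and the column names are assumptions):
CREATE VIEW numbered_view AS
SELECT Col1, Col2,
       (SELECT COUNT(i.Col1)
        FROM yourtable i
        WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o

-- First 5,000 and next 5,000, using only SELECT and WHERE:
SELECT Col1, Col2 FROM numbered_view WHERE RowID <= 5000
SELECT Col1, Col2 FROM numbered_view WHERE RowID > 5000 AND RowID <= 10000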
While the code provided by Derek does what I asked - i.e. numbers each row in the view - the performance is really poor: approximately 20 seconds to number 100 rows. As such, it is not a workable solution. An alternative is to number the first 5,000 records with a 1 and the next 5,000 with a 2. This can be done with 3 simple queries, and it is far quicker to execute.
The code to do so is as follows:
SELECT TOP (5000) BCode, SAPCode, 1 AS GroupNo FROM dbo.DB
UNION
SELECT TOP (10000) BCode, SAPCode, 2 AS GroupNo FROM dbo.DB p
WHERE ID NOT IN (SELECT TOP (5000) ID FROM dbo.DB)
Although, as pointed out by Andriy M, you should also specify an explicit sort to ensure that you don't miss any records.
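With the explicit sort folded in, it might look something like this (a sketch, assuming ID is the column to order on; derived tables are used because SQL Server only allows ORDER BY in a subquery together with TOP):
SELECT * FROM (SELECT TOP 5000 ID, BCode, SAPCode, 1 AS GroupNo
               FROM dbo.DB ORDER BY ID) a
UNION ALL
SELECT * FROM (SELECT TOP 5000 ID, BCode, SAPCode, 2 AS GroupNo
               FROM dbo.DB
               WHERE ID NOT IN (SELECT TOP 5000 ID FROM dbo.DB ORDER BY ID)
               ORDER BY ID) b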
One possibility might be to use a function with a temporary table such as
CREATE FUNCTION dbo.OrderedBCodeData()
RETURNS @Data TABLE (RowNumber int IDENTITY(1,1), BCode int, SAPCode int)
AS
BEGIN
    INSERT INTO @Data (BCode, SAPCode)
    SELECT BCode, SAPCode FROM dbo.DB ORDER BY BCode

    RETURN
END
And select from this function, such as:
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5000 AND 10000
I haven't ever used this in production; in fact, it was just a quick idea this morning, but it might be worth exploring as a neater alternative?
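If you do explore it, note that BETWEEN is inclusive at both ends, so non-overlapping pages of 5,000 would look more like:
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 1 AND 5000
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5001 AND 10000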

Processing a large table - how do I select the records page by page?

I need to run a process on all the records in a table. The table could be very big, so I'd rather process the records page by page. I need to remember the records that have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is how do I write such select statement?
I'm using Oracle, by the way, but it would be nice if this could run on any other DB too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go - unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break the table up into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this changes a bit due to the clustering factor, but it will still take longer than processing everything in one go.
This all depends on how long it takes to process one row of the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective you cannot beat a full table scan.
You will most likely want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
    -- rownum must be assigned in a block outside the sort, hence the extra nesting
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
)
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which, to be accurate, would need to represent the end value and not the number of rows actually returned to the page). That way you can set up the start and end values as bind variables, so you avoid hard parsing.
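For illustration, the bind-variable version of the rownum query might look like this (a sketch; :start_row and :end_row are placeholder bind names):
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= :end_row
)
where rn >= :start_row;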
For more details, you can check out this post

How do I limit the rowcount in an SSIS data flow task?

I have an Oracle source, and I'm getting the entire table, and it is being copied to a SQL Server 2008 table that looks the same. Just for testing, I would like to only get a subset of the table.
In the old DTS packages, under Options on the data transform, I could set a first and last record number, and it would only get that many records.
If I were doing a query, I could change it to a select top 5000 or set rowcount 5000 at the top (maybe? This is an Oracle source). But I'm grabbing the entire table.
How do I limit the rowcount when selecting an Oracle table?
You can use the Row Count component in the data flow and, after the component, put User::rowCount <= 500 in the precedence constraint condition while loading into the target. Whenever the count is > 500, the process stops inserting data into the target table.
It's been a while since I've touched PL/SQL, but I would think that you could simply put a where condition of "rownum <= n", where n is the number of rows that you want for your sample. ROWNUM is a pseudo-column that exists on each Oracle table... it's a handy feature for problems like this (it's equivalent to T-SQL's row_number() function, but without the ability to partition and sort, I think). This would keep you from having to bring the whole table into memory:
select col1, col2
from tableA
where rownum <= 10;
For future reference (and only because I've been working with it lately), DB2's equivalent is the clause "fetch first n rows only" at the end of the statement:
select col1, col2
from tableA
fetch first 10 rows only;
Hope I've not been too off base.
The Row Sampling component in the data flow restricts the number of rows. Just insert it between your source and destination and set the number of rows. It is very useful for a large amount of data and when you cannot modify the query - for instance, when the source executes a stored procedure.

Index of a record in a table

I need to locate the index position of a record in a large database table in order to preset a pager to that item's page. I need an efficient SQL query that can give me this number. Sadly SQL doesn't offer something like:
SELECT INDEX(*) FROM users WHERE userid='123'
Any bright ideas?
EDIT: Let's assume there is an ORDER BY clause appended to this. The point is I do not want to have to load all records to locate the position of a specific one. I am trying to open a pager to the page holding an existing item that had previously been chosen, because I want to provide information about that already chosen item within a context that allows the user to choose a different one.
You might use something like (pseudo-code):
counting query: $n = SELECT COUNT(uid) FROM {users} WHERE ... (your paging condition, including userid 123 as the limit)
$page = floor($n / $pager_size);
display query: SELECT what, you, want FROM {users} WHERE ... (your paging condition without the limit), passed to db_query_range($query, $page, $pager_size)
You should really look at pager_query, though, because that's what it's all about, and it basically works like this: a counting query and a display query, except it tries to build the counting query automatically.
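As a concrete sketch of that counting query (assuming the pager orders by a name column; the column and values here are invented), the position of user 123 would be:
SELECT COUNT(*) AS position
FROM users
WHERE name <= (SELECT name FROM users WHERE userid = '123');
with $page = floor(($position - 1) / $pager_size) computed afterwards.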
Assuming you are really asking how to page records in SQL Server 2005 onwards, have a look at this code from David Hayden:
(you will need to change Date, Description to be your columns)
CREATE PROCEDURE dbo.ShowUsers
    @PageIndex INT,
    @PageSize INT
AS
BEGIN
    WITH UserEntries AS (
        SELECT ROW_NUMBER() OVER (ORDER BY Date DESC) AS Row,
               Date, Description
        FROM users
    )
    SELECT Date, Description
    FROM UserEntries
    WHERE Row BETWEEN (@PageIndex - 1) * @PageSize + 1
                  AND @PageIndex * @PageSize
END
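Calling it for, say, the second page of 50 rows would then be:
EXEC dbo.ShowUsers @PageIndex = 2, @PageSize = 50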
SQL doesn't guarantee the order of rows in a table unless you use an ORDER BY clause. In other words, the index of any particular row may change in subsequent queries. Can you describe what you are trying to accomplish with the pager?
You might be interested in something that simulates the rownum() of Oracle in MySQL... if you are using MySQL of course as it's not specified in the question.
Notes:
You'll have to look through all the records of your pages for that to work, of course. You don't need to fetch them back to the PHP page from the database server, but you do have to include them in the query. There's no magic trick to determine the position of your row inside a result set other than querying the result set, as the position might change because of the where conditions, the orders, and the groups. It needs to be in context.
Of course, if all your rows are sequential with incremental ids, none are deleted, and you know the first and last ids, then you could use a count and get the position with simple math, without querying everything... but I doubt that's your case; it never is.
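In that dense-id case (an assumption that rarely holds), the position really is just arithmetic:
SELECT userid - (SELECT MIN(userid) FROM users) + 1 AS position
FROM users
WHERE userid = '123';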

SQL trigger for deleting old results

We have a database that we are using to store test results for an embedded device. There's a table with columns for different types of failures (details not relevant), along with a primary key 'keynum' and a 'NUM_FAILURES' column that lists the number of failures. We store passes and failures, so a pass has a '0' in 'NUM_FAILURES'.
In order to keep the database from growing without bounds, we want to keep the last 1000 results, plus any of the last 50 failures that fall outside of the 1000. So, worst case, the table could have 1050 entries in it. I'm trying to find the most efficient SQL insert trigger to remove extra entries. I'll give what I have so far as an answer, but I'm looking to see if anyone can come up with something better, since SQL isn't something I do very often.
We are using SQLITE3 on a non-Windows platform, if it's relevant.
EDIT: To clarify, the part that I am having problems with is the DELETE, and specifically the part related to the last 50 failures.
The reason you want to remove these entries is to keep the database from growing too big, not to keep it in some special state. For that I would really not use triggers, and would instead set up a job that runs at some interval and cleans up the table.
So far, I have ended up using a View combined with a Trigger, but I'm not sure it's going to work for other reasons.
CREATE VIEW tablename_view AS
SELECT keynum FROM tablename WHERE NUM_FAILURES != '0'
ORDER BY keynum DESC LIMIT 50;

CREATE TRIGGER tablename_trig
AFTER INSERT ON tablename
WHEN (((SELECT COUNT(*) FROM tablename) >= 1000) OR
      ((SELECT COUNT(NUM_FAILURES) FROM tablename WHERE NUM_FAILURES != '0') >= 50))
BEGIN
    DELETE FROM tablename
    WHERE ((((SELECT MAX(keynum) FROM tablename) - keynum) >= 1000)
           AND
           ((NUM_FAILURES = '0') OR ((SELECT MIN(keynum) FROM tablename_view) > keynum)));
END;
I think you may be using the wrong data structure. Instead, I'd create two tables and pre-populate one with 1,000 rows (successes) and the other with 50 (failures). Put a primary ID on each. Then, when you record a result, instead of inserting a new row, find the ID one past the last record entered (looping back to the first ID if you exceed the max ID in the table) and update that row with your new values.
This has the advantage of pre-allocating your storage, not requiring a trigger, and internally consistent logic. You can also adjust the size of the log very simply by just pre-populating more records rather than to have to change program logic.
There are several variations you can use on this, but the idea of using a closed-loop structure rather than an open list would appear to match the problem domain more closely.
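A minimal sketch of the overwrite step in SQLite (the table and column names here are invented): pick the oldest slot and reuse it rather than inserting.
UPDATE success_ring
SET NUM_FAILURES = 0,               -- the new result's values go here
    recorded_at  = CURRENT_TIMESTAMP
WHERE id = (SELECT id FROM success_ring ORDER BY recorded_at LIMIT 1);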
How about this:
DELETE
FROM table
WHERE ( id < ( SELECT MAX(id) - 1000 FROM table )
        AND num_failures = 0 )
   OR id < ( SELECT MAX(id) - 1050 FROM table )
If performance is a concern, it might be better to delete on a periodic basis rather than on each insert.