I would like to know about the performance of two queries. A page shows 18 records per page. It is implemented in two ways in two different applications, but using the same tables and joins.
When I compared the two queries in SQL Server Profiler:
The first query fetches all 18 records for the page at once and displays them. It reads 1.2 million records.
The second query fetches 6 records at a time; the browser still shows 18 records per page. It reads fewer than 250k records.
Both queries use almost the same joins.
Is it possible to implement 18-records-per-page pagination by querying in batches? If yes, how is it implemented?
Is there another way, other than Profiler, to find how many records are read during a SQL execution?
You can use OFFSET and FETCH to paginate. See
https://www.sqlshack.com/pagination-in-sql-server/
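For example, here is a minimal OFFSET/FETCH sketch (SQL Server 2012 and later); the table name, sort column, and variables below are placeholders rather than anything from your actual queries:

DECLARE @PageNumber INT = 1;   -- 1-based page index
DECLARE @PageSize   INT = 18;  -- 18 records per page, as in the question

SELECT *
FROM MyTable                                -- placeholder table name
ORDER BY SortColumn                         -- a stable ORDER BY is required before OFFSET/FETCH
OFFSET (@PageNumber - 1) * @PageSize ROWS   -- skip the rows of the previous pages
FETCH NEXT @PageSize ROWS ONLY;             -- return only the 18 rows for this page

Only the requested page is returned to the client, although for large offsets the server may still have to scan past the skipped rows.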
If you have the queries, you can run them in SSMS to get the execution plans. These include estimated and actual row count information.
https://www.sqlshack.com/execution-plans-in-sql-server/
I have an application using an AWS Aurora PostgreSQL 10 DB that expects over 5 million records per day on one table. The application will run in a Kubernetes environment with ~5 pods.
One of the application's requirements is to expose a method that builds an object with all the possible values of 5 of the table's columns, e.g. all distinct values of the name column.
We expect ~100 different values per column. A DISTINCT/GROUP BY takes more than 1 second per column, so the process does not meet the non-functional requirement (process time).
The solution I found was to create a table/view with the distinct values of each column; that table/view would be refreshed by a cron-like task.
Is this the most effective approach for meeting the non-functional/process-time requirement using only PostgreSQL tools?
One possible solution is a materialized view that you regularly refresh. Between these refreshes, the data will become slightly stale.
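A minimal sketch of that for a single column, assuming a placeholder table logs with a name column (neither identifier comes from the question):

-- Materialized view holding the distinct values of one column.
CREATE MATERIALIZED VIEW distinct_names AS
SELECT DISTINCT name FROM logs;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON distinct_names (name);

-- Run from a cron-like task; CONCURRENTLY keeps the view readable during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_names;

Queries against distinct_names then return the ~100 values immediately, at the cost of the staleness window between refreshes.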
Alternatively, you can maintain a separate table with just the distinct values and use triggers to keep the information up to date whenever rows are modified. This will require a combined index on all the affected columns to be fast.
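A sketch of the trigger-based variant for a single column, again with placeholder names; note it only covers INSERT/UPDATE, and removing values that disappear after a DELETE needs an extra existence check (which is where the index helps):

CREATE TABLE distinct_names (name text PRIMARY KEY);

CREATE OR REPLACE FUNCTION record_distinct_name() RETURNS trigger AS $$
BEGIN
    -- Keep exactly one row per value; duplicates are silently ignored.
    INSERT INTO distinct_names (name)
    VALUES (NEW.name)
    ON CONFLICT (name) DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER logs_distinct_name
AFTER INSERT OR UPDATE OF name ON logs
FOR EACH ROW EXECUTE PROCEDURE record_distinct_name();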
DISTINCT is always a performance problem if it affects many rows.
While practicing DBMS and SQL with Oracle Database, I noticed that when I fire 2 SELECT queries on a table, the database always waits for the first query to finish executing and apparently keeps the other one in a pipeline.
Consider a table MY_TABLE with 1 million records and a column id that holds the serial number of each record.
Now my queries are:
Query #1 - select * from MY_TABLE where id<500001; --I am fetching first 500,000 records here
Query #2 - select * from MY_TABLE where id>500000; --I am fetching next 500,000 records here
Since these are SELECT queries, they must be acquiring a read lock on the table, which is a shared lock. So why does this happen? Note that, to the best of my knowledge, the row sets touched by the two queries are mutually exclusive because of the WHERE filters I applied, which further aggravates my confusion.
Also, I am visualizing it like this: some process evaluates my query and then does a handshake with memory (i.e. the resource) to fetch the result, and any resource held in shared lock mode should be accessible to all processes that hold that lock.
Secondly, is there any way to override this behavior or execute multiple SELECT queries concurrently?
Note: I want to chunk a particular task (i.e. the data of a table) into pieces and enhance the speed of my script.
The database doesn't keep queries in a pipeline, it's simply the fact that your client is only sending one query at a time. The database will quite happily run multiple queries against the same data at the same time, e.g. from separate sessions.
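If you want to verify that both SELECTs really do run at the same time once they are issued from two separate connections (e.g. two SQL*Plus windows), you can watch Oracle's standard v$session view from a third session; the filter below is only illustrative:

-- Shows user sessions that are currently executing something.
select sid, status, sql_id
from   v$session
where  type = 'USER'
  and  status = 'ACTIVE';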
I am using Spark SQL to build a query UI on top of JSON logs stored on Amazon S3. In the UI, most queries use LIMIT to bring back the top results, usually just the first ten.
Is there a way with Spark SQL to show the total number of rows that matched the query, without re-running the query as a count?
This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot: what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all the data at once, it quickly overwhelms my machine's memory capacity (let's say 2 GB). Also, simply performing chained SELECTs with LIMIT or OFFSET fails the synchronization condition: the table is frequently updated, and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. In particular, how do I achieve a pagination/segmented effect for my SQL SELECTs, i.e. make recurring calls with a page number to retrieve the next segment of data? The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL-command question. It is a design-pattern question: what is the strategy for performing paginated SELECTs against a large table? (Especially when said table receives constant updates.)
What database are you using? In MySQL, for example, the following would select 20 rows starting from row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows this):
select * from cars limit 20 offset 40
If you want a "snapshot" effect, you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change tracking, but that's not what you said you wanted. If you need a snapshot of the exact table state, take the snapshot, write it to a separate table, and use LIMIT and OFFSET (or whatever) to create pages.
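A rough sketch of that snapshot-then-page approach for SQL Server 2008 (which predates OFFSET/FETCH, so ROW_NUMBER() does the paging); the holding-table name and the ID ordering column are assumptions:

-- 1. Copy the current table state into a holding table the updating process never touches.
SELECT *
INTO CarsSnapshot
FROM Cars;

-- 2. Page through the frozen snapshot in fixed-size chunks.
DECLARE @PageSize INT = 1000;
DECLARE @Page     INT = 1;   -- increment on each call until no rows come back

WITH numbered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY ID) AS rn
    FROM CarsSnapshot
)
SELECT *
FROM numbered
WHERE rn BETWEEN (@Page - 1) * @PageSize + 1 AND @Page * @PageSize;

Because the snapshot is never updated, the pages stay consistent with one another no matter how long the export takes.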
And at 5 million rows, I think it is likely the design requirement that might need to be modified: if you have 2,000 clients all taking 5-million-row snapshots, you are going to start having size issues if you don't watch out.
You should provide details of the format of the resulting data file. Depending on the format, this could be possible directly in your database, with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there would be an export utility; for SQL Server/Sybase it would be BCP, etc.
Alternatively, this is achievable by streaming the data instead of holding it all in memory; how to do that varies depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
Look at this post on using LIMIT and OFFSET to create paginated results from your SQL query:
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * FROM Cars LIMIT 10
and then
SELECT * FROM Cars LIMIT 10 OFFSET 10
And so on. You will have to figure out the best pagination for this.