My SQL table is too big: retrieving data via paging/segmenting the result? - sql

This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot: what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity of my machine (say, 2 GB). Also, simply performing chained SELECTs with LIMIT or OFFSET violates the synchronization requirement: the table is frequently updated, and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is a design-pattern question: what is the strategy for performing paginated SELECTs against a large table? (Especially when said table receives consistent updates.)

What database are you using? In MySQL, for example, the following would select 20 rows starting at row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows it):
select * from cars limit 20 offset 40

If you want a "snapshot" effect you have to copy the data into holding table where it will not get updated. You can accomplish some nice things with various types of change-tracking, but that's not what you stated you wanted. If you need a snapshot of the exact table state then take the snapshot and write it to a seperate table and use the limit and offset (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified...if you have 2000 clients all taking 5 million-row snapshots you are going to start having some size issues if you don't watch out.
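A minimal sketch of that approach for the poster's SQL Server 2008 setup (the snapshot table name Cars_Snapshot and the key column CarId are assumptions, not names from the question): copy the table once so the data is frozen, then page through the copy with ROW_NUMBER(), since 2008 has no OFFSET/FETCH.
SELECT * INTO Cars_Snapshot FROM Cars;  -- freeze the data once per package run

DECLARE @Page INT = 1, @PageSize INT = 1000;
WITH Numbered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY CarId) AS rn
    FROM Cars_Snapshot
)
SELECT *
FROM Numbered
WHERE rn BETWEEN (@Page - 1) * @PageSize + 1 AND @Page * @PageSize
ORDER BY rn;
The application increments @Page until a call returns fewer than @PageSize rows, then drops Cars_Snapshot when the package is complete.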

You should provide details of the format of the resultant data file. Depending on the format, this could be possible directly in your database with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there would be an export; for SQL Server/Sybase it would be BCP, etc.
Alternatively, this is achievable by streaming the data without holding it all in memory; the details vary depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
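An alternative, hedged sketch: keyset ("seek") pagination, which avoids the growing cost of large offsets because each chunk seeks directly on the index. It assumes ID is an indexed, unique column; the @lastSeenId variable name is made up for illustration, and the application remembers the largest ID of each chunk:
SET @lastSeenId = 0;          -- start before the first row
SELECT * FROM myTable
WHERE ID > @lastSeenId
ORDER BY ID
LIMIT 1000;                   -- next call: set @lastSeenId to the last ID returned
Note that rows inserted or deleted between calls can still change what later chunks contain, so this alone does not give the snapshot consistency the question asks for.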

Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * from Cars Limit 10
and then
SELECT * from Cars limit 10 offset 10
And so on. You will have to figure out the best pagination for this.
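A minimal sketch of the general page-N form of that (using the same MySQL/Postgres-style LIMIT syntax as the linked post; the id column is an assumption). Compute the offset as page * page_size in your application, and include an ORDER BY so the row order is deterministic between calls:
SELECT * FROM Cars ORDER BY id LIMIT 10 OFFSET 20;  -- page 2 (zero-based) with a page size of 10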

Related

Understanding data scanned when querying ORC with Presto/Athena

I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple. The first query just picks an arbitrary record from your data. On top of that, it is not guaranteed that you read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for a specific record.
Here is an analogy. Let's assume you have 100 coins, with a UUID and some other info imprinted on their backs. All of them are face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, looking at what is written on the back, and putting it back on the table face up. Next, someone comes and shuffles all of the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like wanting to look at the information written on the back of one specific coin. It is unlikely that you would flip the correct coin on your first try, so you would need to "scan" more coins (data).
One of the common ways to reduce the amount of scanned data is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. Then the "folder" names can be used as virtual columns within your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
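A hedged sketch of what that looks like in Athena DDL (the S3 path, column names, and the dt partition key are assumptions for illustration): queries that filter on the partition column only scan the matching "folders".
CREATE EXTERNAL TABLE my_table (
    uuid    string,
    payload string
)
PARTITIONED BY (dt string)
STORED AS ORC
LOCATION 's3://my-bucket/my_table/';

MSCK REPAIR TABLE my_table;  -- register the existing dt=... folders as partitions

SELECT uuid FROM my_table
WHERE dt = '2020-01-01' AND uuid = '<FIRST_ROW_UUID>'
LIMIT 1;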

How to show a sample of the data in BigQuery?

Let us suppose I have a 1TB dataset in BigQuery, and I want to be able to view the data in a columnar view, limiting to 1000 results. Here are a few of the queries I might use:
1. SELECT * FROM mytable LIMIT 1000
2. SELECT first_name, last_name FROM mytable LIMIT 1000
3. SELECT last_name, first_name FROM mytable LIMIT 1000
4. SELECT * FROM mytable ORDER BY first_name LIMIT 1000
If I ran these four queries I would be charged ~$20 ($5/TB; pretend * expands to just first_name, last_name). This seems like a very high amount to pay just to sample the data -- is there another way to query this data to view a limited view of the data, like the above?
This seems like a very high amount to pay just to sample the data -- is there another way to query this data to view a limited view of the data, like the above?
If your data is dynamic, meaning it is updated daily or in some other way, you can use Table Decorators.
For example
SELECT * FROM mytable@-3600000--1800000 LIMIT 1000
will query only the data inserted between one hour and 30 minutes ago (a range table decorator in legacy SQL), thus lowering cost a lot!
Another option is to use day-partitioned tables, so you can query only a specific day's worth of data.
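A hedged sketch of querying such a partitioned table (standard SQL; the dataset/table names are placeholders, and _PARTITIONTIME assumes ingestion-time day partitioning): filtering on the partition column limits the bytes billed to that one day.
SELECT *
FROM mydataset.mytable
WHERE _PARTITIONTIME = TIMESTAMP('2020-01-01')
LIMIT 1000;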
Is there a way to export a subset of the data instead of doing a query?
Yes. You can use the Tabledata.list API to read your original table page by page and insert into a new [sampled] table using whatever sampling logic you need. Note: this API is free, as it doesn't actually use the BigQuery query engine per se but rather reads from the underlying storage, so you can be reasonably wild :o)
Of course, you need to implement this in the client of your choice.
I assume you are accessing BQ through the online query interface (https://bigquery.cloud.google.com/table . . . ).
Click on the table in the data set. Go down to where it says "Table Details" in bold letters, beneath the "Run Query" icon.
In the second row below that is an option for "Preview". This will show you some data and it's free.
We have a sample table that's generated every day at work which I find extremely useful for many tasks. It's as simple as:
SELECT * FROM mytable WHERE RAND() < 0.01
The table is hierarchical, and this sampling is set to reproduce the whole structure; so queries can be tested/replicated in exactly the same form and then swapped over to the big table if needed. The 1% sample applies to the top level of the hierarchy (meaning you don't have to wonder whether you are getting valid results from branches).
For us, there is enough data that sums and ratios are generally very representative. The only kind of data that poses a significant problem is relatively rare events, which means counts of unique elements can't be relied on.
And of course, after the single daily charge for making this table, the billing goes from dollars to cents!
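A hedged sketch of how such a daily sample table might be generated (standard SQL; the dataset/table names are placeholders, and the assumption is that a daily scheduled job re-creates it):
CREATE OR REPLACE TABLE mydataset.mytable_sample AS
SELECT * FROM mydataset.mytable
WHERE RAND() < 0.01;
You pay the full scan once per day to build it, and afterwards every exploratory query runs against roughly 1% of the data.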

Select query too slow > 5min

I have a table MyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
The table has more than 10 columns. The database is on the hosting server.
From SSMS I execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?
First of all, retrieving all the data from a remote database is never a good idea. You are using a significant share of bandwidth. Hopefully, the query you are using is only for debugging purposes and will never hit production.
You did not mention if it took 5 minutes before you started receiving something or if you are receiving your data over the course of that 5 minutes, at a constant rate.
In the first situation, not receiving rows at all might indicate that a lock is active on your table, due to another operation.
In the latter situation, you are constantly receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get a rough idea of the amount of data that you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download the data.
Binary and XML fields (also called BLOB fields) usually take a lot of space, and you may not be able to control the amount of data stored by the user in those fields.
Try checking the size of your variable length fields (varchar, xml and varbinary) by running a DATALENGTH on your column:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB fields is to retrieve them only when needed and not when you are loading a list of data.
For example, assume an XML field stored in a PurchaseOrder table. If you wish to display the list of POs to your user, you usually don't need to retrieve that field unless the user opens the PO.
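A hedged illustration of that idea in T-SQL (the table and column names and the @SelectedPurchaseOrderId value below are made up for the example):
-- list screen: only the lightweight columns
SELECT PurchaseOrderId, OrderDate, TotalDue
FROM PurchaseOrder;

-- detail screen: fetch the heavy XML column only for the PO the user opened
DECLARE @SelectedPurchaseOrderId INT = 42;
SELECT OrderXml
FROM PurchaseOrder
WHERE PurchaseOrderId = @SelectedPurchaseOrderId;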
Many recent ORMs, like NHibernate, offer lazy loading for columns, along with paging so you can retrieve a small number of rows.
Ayende posted a rant about loading unbounded result sets two weeks ago.
You're right - the select query shouldn't take that long. It's not the number of rows. Likely it's the type of data you've got on that table/view, and perhaps the storage configuration (slow disk, filegroups config, etc).
Some ideas to consider to remedy this performance problem:
be specific in the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values of any columns of datatype text or varbinary will take proportionally longer depending on the data within those fields.
consider the indexes (do you have any?) on the table/view?
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.
I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so
try remoting into the box and see how long the query takes on that machine.
If it takes almost the same amount of time, then the problem lies either in the database design or the underlying hardware, or other factors (table datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc.)
If it takes significantly less time, then the problem is surely between your machine and the box - ideally this shouldn't be a big problem, because the web server will be closer to the db server, probably on the same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'SELECT *' in the actual app to pick 29,000 rows, so it shouldn't create a big problem.

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log'; it contains a huge number of records.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance. You could try changing your SELECT statement to only include the columns you are interested in; however, it won't change the number of logical reads that you are doing, and so I suspect it will only have a negligible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do, and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
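A hedged sketch of what that paging could look like for this log table on SQL Server (the logId and message columns and the parameter values are assumptions made for the example):
DECLARE @From DATETIME = '2020-01-01', @To DATETIME = '2020-01-08', @Page INT = 1;
WITH Filtered AS (
    SELECT logId, logLevel, creationData, message,
           ROW_NUMBER() OVER (ORDER BY creationData DESC) AS rn
    FROM log
    WHERE logLevel = 2 AND creationData BETWEEN @From AND @To
)
SELECT *
FROM Filtered
WHERE rn BETWEEN (@Page - 1) * 500 + 1 AND @Page * 500;
Each call returns at most 500 rows, which stays fast as long as an index supports the WHERE clause (see the other answers about a composite index).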
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive that to some back-up and then remove the unused records. That will keep the table size down, reducing the amount of time it takes to search the table (a sketch of such an archive job follows this list).
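A hedged, SQL Server-flavoured sketch of such an archive job (the log_archive table is an assumption, and the seven-day cutoff follows the answer's example):
INSERT INTO log_archive
SELECT * FROM log
WHERE creationData < DATEADD(DAY, -7, GETDATE());

DELETE FROM log
WHERE creationData < DATEADD(DAY, -7, GETDATE());
In practice you would run both statements in one transaction (or capture the cutoff timestamp in a variable once) so rows logged mid-job are not skipped or lost.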
Depending on what kind of SQL database you're using, you might look into horizontal partitioning. Oftentimes, this can be done entirely on the database side of things, so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives in your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
Do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 that the table has.
Get more RAM; the more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.
For me, there are two things that you can do:
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of the logs and logs_temp tables is identical. The application flow would be this way: all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records is far smaller, as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts will be fast too, because there is not much data in the logs table, as it holds data only for the last hour, so index regeneration on inserts takes much less time compared to a very large data set, hence making the inserts fast.
You can read more about the Shadow Table trick here.
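A hedged, MySQL-flavoured sketch of the hourly job described above (the column list of logs_summary and the hour bucketing are assumptions; step a. can also be done with the atomic RENAME-based shadow-table swap rather than copy-and-truncate):
-- a. move the current hour's rows out of the hot table
INSERT INTO logs_temp SELECT * FROM logs;
TRUNCATE TABLE logs;

-- b/c. aggregate the hour and save the result into the summary table
INSERT INTO logs_summary (hour_start, logLevel, hits)
SELECT DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00') AS hour_start,
       logLevel,
       COUNT(*) AS hits
FROM logs_temp
GROUP BY hour_start, logLevel;

-- d. archive and empty the temp table
INSERT INTO logs_archive SELECT * FROM logs_temp;
TRUNCATE TABLE logs_temp;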
I employed the pre-aggregation method in a news website built on WordPress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items; there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 seconds to under a second. I intend on making the plugin publicly available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values. What order you put them in will affect performance, but assuming you have a small number of possible logLevel values (and the data is not skewed), you'll get better performance putting creationData first.
Note that, optimally, an index will reduce the cost of a query to log(N), i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just the date seems a more likely scenario than filtering on just the log level.)
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least a few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
  l_ownname          VARCHAR2(255) := 'owner'; -- Owner (schema) of the table to analyze
  l_tabname          VARCHAR2(255) := 'log';   -- Table to analyze
  l_estimate_percent NUMBER(3)     := 5;       -- Percentage of rows to estimate (NULL means compute)
BEGIN
  dbms_stats.gather_table_stats (
    ownname          => l_ownname,
    tabname          => l_tabname,
    estimate_percent => l_estimate_percent,
    method_opt       => 'FOR ALL INDEXED COLUMNS',
    cascade          => TRUE
  );
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column (a minimal DDL sketch follows the links). See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle
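A hedged sketch of such range partitioning in Oracle (the column types, partition names and boundary dates are assumptions for illustration):
CREATE TABLE log (
    logLevel     NUMBER,
    creationDate DATE,
    message      VARCHAR2(4000)
)
PARTITION BY RANGE (creationDate) (
    PARTITION log_2020_q1 VALUES LESS THAN (DATE '2020-04-01'),
    PARTITION log_2020_q2 VALUES LESS THAN (DATE '2020-07-01'),
    PARTITION log_future  VALUES LESS THAN (MAXVALUE)
);
Queries that filter on creationDate then only touch the matching partitions (partition pruning).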

Check for changes to an SQL Server table?

How can I monitor an SQL Server database for changes to a table without using triggers or modifying the structure of the database in any way? My preferred programming environment is .NET and C#.
I'd like to be able to support any SQL Server 2000 SP4 or newer. My application is a bolt-on data visualization for another company's product. Our customer base is in the thousands, so I don't want to have to put in requirements that we modify the third-party vendor's table at every installation.
By "changes to a table" I mean changes to table data, not changes to table structure.
Ultimately, I would like the change to trigger an event in my application, instead of having to check for changes at an interval.
The best course of action given my requirements (no triggers or schema modification, SQL Server 2000 and 2005) seems to be to use the BINARY_CHECKSUM function in T-SQL. The way I plan to implement is this:
Every X seconds run the following query:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM sample_table
WITH (NOLOCK);
And compare that against the stored value. If the value has changed, go through the table row by row using the query:
SELECT row_id, BINARY_CHECKSUM(*)
FROM sample_table
WITH (NOLOCK);
And compare the returned checksums against stored values.
Take a look at the CHECKSUM command:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM sample_table WITH (NOLOCK);
That will return the same number each time it's run as long as the table contents haven't changed. See my post on this for more information:
CHECKSUM
Here's how I used it to rebuild cache dependencies when tables changed:
ASP.NET 1.1 database cache dependency (without triggers)
Unfortunately CHECKSUM does not always work properly to detect changes.
It is only a primitive checksum, not a cyclic redundancy check (CRC) calculation.
Therefore you can't use it to detect all changes, e.g. symmetrical changes result in the same CHECKSUM!
E.g. the solution with CHECKSUM_AGG(BINARY_CHECKSUM(*)) will always deliver 0 for all 3 of the following tables, despite their different content:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 1 as numB
UNION ALL
SELECT 1 as numA, 1 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 2 as numB
UNION ALL
SELECT 1 as numA, 2 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 0 as numA, 0 as numB
UNION ALL
SELECT 0 as numA, 0 as numB
) q
-- delivers 0!
Why don't you want to use triggers? They are a good thing if you use them correctly. If you use them as a way to enforce referential integrity that is when they go from good to bad. But if you use them for monitoring, they are not really considered taboo.
How often do you need to check for changes, and how large (in terms of row size) are the tables in the database? If you use the CHECKSUM_AGG(BINARY_CHECKSUM(*)) method suggested by John, it will scan every row of the specified table. The NOLOCK hint helps, but on a large database, you are still hitting every row. You will also need to store the checksum for every row so that you can tell which one has changed.
Have you considered going at this from a different angle? If you do not want to modify the schema to add triggers (which makes sense, it's not your database), have you considered working with the application vendor that does make the database?
They could implement an API that provides a mechanism for notifying accessory apps that data has changed. It could be as simple as writing to a notification table that lists what table and which row were modified. That could be implemented through triggers or application code. From your side, it wouldn't matter; your only concern would be scanning the notification table on a periodic basis. The performance hit on the database would be far less than scanning every row for changes.
The hard part would be convincing the application vendor to implement this feature. Since this can be handled entirely through SQL via triggers, you could do the bulk of the work for them by writing and testing the triggers and then bringing the code to the application vendor. Having the vendor support the triggers prevents the situation where a trigger you add inadvertently replaces a trigger supplied by the vendor.
Unfortunately, I do not think that there is a clean way to do this in SQL2000. If you narrow your requirements to SQL Server 2005 (and later), then you are in business. You can use the SQLDependency class in System.Data.SqlClient. See Query Notifications in SQL Server (ADO.NET).
Have a DTS job (or a job that is started by a windows service) that runs at a given interval. Each time it is run, it gets information about the given table by using the system INFORMATION_SCHEMA tables, and records this data in the data repository. Compare the data returned regarding the structure of the table with the data returned the previous time. If it is different, then you know that the structure has changed.
Example query to return information regarding all of the columns in table ABC (ideally listing out just the columns from the INFORMATION_SCHEMA table that you want, instead of using select * like I do here):
select * from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'ABC'
You would monitor different columns and INFORMATION_SCHEMA views depending on how exactly you define "changes to a table".
Wild guess here: if you don't want to modify the third party's tables, can you create a view and then put a trigger on that view?
Check the last commit date. Every database has a history of when each commit is made. I believe it's a standard of ACID compliance.