Redshift query output too big for RAM - sql

I have a Redshift query whose output is too big to fit into the RAM of my EC2 instance. I am using psycopg2 to execute the query. If I use the LIMIT keyword, will rows repeat as I increment the limit?
Say I enforce a limit of 0,1000 at first and get a block, then I enforce a limit of 1001,2000. Will there be a repetition of rows across those two blocks, considering Redshift fetches data in parallel?
Is there a better alternative to this?

You want to DECLARE a cursor to store the full results on Redshift and then FETCH rows from the cursor in batches as you need. This way the query only runs once when the cursor is filled. See https://docs.aws.amazon.com/redshift/latest/dg/declare.html (example is at the bottom of the page).
This is exactly how BI tools like Tableau get data from Redshift - in blocks of 10,000 rows. Using cursors prevents the tool/system/network from being overwhelmed by data when you select very large result sets.
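For the psycopg2 setup in the question, a minimal sketch of this DECLARE/FETCH pattern could look like the following; the connection details, cursor name big_cur, batch size, and the process() step are all placeholders, not anything prescribed by Redshift:

import psycopg2

# Connection parameters are placeholders for your cluster.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword")

with conn, conn.cursor() as cur:
    # DECLARE must run inside a transaction; psycopg2 opens one by default.
    cur.execute("DECLARE big_cur CURSOR FOR SELECT * FROM my_big_table")
    while True:
        cur.execute("FETCH FORWARD 10000 FROM big_cur")
        rows = cur.fetchall()
        if not rows:
            break
        process(rows)  # placeholder: handle one batch, so memory stays bounded
    cur.execute("CLOSE big_cur")

Each FETCH pulls the next batch from the result set already materialized by the cursor, so no rows repeat and the query itself runs only once.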

Related

The fastest way to extract all records from Oracle

I have an Oracle table containing 900 million records. The table is partitioned into 24 partitions and has indexes.
I tried using a parallel hint and set the fetch buffer to 100000:
select /*+ parallel(8) */ *
from table
It takes 30 minutes to get 100 million records.
My question is: is there any faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries, or should I use the indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since you can then read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
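A rough Python sketch of the same idea, using the python-oracledb driver (the credentials, DSN, table name, partition names, and the write_rows() step are all placeholders):

import oracledb
from concurrent.futures import ThreadPoolExecutor

PARTITIONS = ["P1", "P2", "P3"]  # placeholder: list all 24 partition names here

def dump_partition(part):
    # One connection per worker; credentials and DSN are placeholders.
    conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb")
    cur = conn.cursor()
    cur.arraysize = 10000  # large fetch size cuts client/server round trips
    # Partition names cannot be bind variables, so they go into the SQL text.
    cur.execute(f"select * from my_table partition ({part})")
    while True:
        rows = cur.fetchmany()  # fetches arraysize rows at a time
        if not rows:
            break
        write_rows(part, rows)  # placeholder: write this chunk to its output file
    conn.close()

# Run several partition extracts concurrently to use more client/network capacity.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(dump_partition, PARTITIONS)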
Not really an answer but too long for a comment.
Too many variables impact this for anyone to give firmly informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid the network. Once the file(s) are complete, it will be quicker to compress them and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC remember to set the cursor fetch size to reduce round tripping - setFetchSize. The default value is tiny (10 I think), try something like 1000 to see how that helps.
As for the query, you are writing to a file, so even though Oracle might process the query in parallel, your write-to-file process probably doesn't, so it's a bottleneck.
My approach would be to write the Java program to operate off a range of values passed as command line parameters, and experiment to find which range size and how many concurrent instances of the Java program give optimal performance. The range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range value is on an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run concurrent instances matching the number of CPU cores - 2; this is not a scientifically derived number, just one that I tend to use as a first stab to see what happens.
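As a rough Python illustration of that range-driven extract (rather than Java); the driver, credentials, column name ID, and write_row() step are all assumptions:

import sys
import oracledb

# Hypothetical usage: python extract_range.py 0 5000000
lo, hi = int(sys.argv[1]), int(sys.argv[2])

conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb")
cur = conn.cursor()
cur.arraysize = 10000
# Assumes ID is indexed (ideally the partition key) so the range can prune partitions.
cur.execute("select * from my_table where id between :lo and :hi", lo=lo, hi=hi)
for row in cur:
    write_row(row)  # placeholder for the file-writing step

Running several of these at once, each over a different range, is the experiment described above.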

How to do a fast DISTINCT on rows in a table? Is there any hint?

My table has 500 rows. I tried to select distinct row values, but it takes time. Is there any way to tune this process and reduce the time?
Although you will always have to invest time initially when running a SQL query on a large table, there are ways to reduce this when performing the same operation repeatedly within your application, such as caching (saving the returned rows in memory) or PDO prepared statements.
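As a small illustration of the caching idea, a sketch in Python; the sqlite3 connection, table, and column here are only placeholders for whatever database and query you actually run:

import sqlite3

_cache = {}

def distinct_values(conn, column):
    # Serve repeated calls from memory instead of re-running the query each time.
    if column not in _cache:
        cur = conn.execute(f"SELECT DISTINCT {column} FROM my_table")  # placeholder table
        _cache[column] = cur.fetchall()
    return _cache[column]

conn = sqlite3.connect("example.db")    # placeholder database
first = distinct_values(conn, "name")   # runs the query
second = distinct_values(conn, "name")  # served from the in-memory cache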

Random sampling complete rows

I know from this question that one can do random sampling with RAND():
SELECT * FROM [table] WHERE RAND() < percentage
But this would require a full table scan and incur equivalent cost. I'm wondering if there are more efficient ways?
I'm experimenting with the tabledata.list API but got java.net.SocketTimeoutException: Read timed out when the index is very large (e.g. > 10000000). Is this operation not O(1)?
bigquery
.tabledata()
.list(tableRef.getProjectId, tableRef.getDatasetId, tableRef.getTableId)
.setStartIndex(index)
.setMaxResults(1L)
.execute()
I would recommend paging tabledata.list with pageToken and collecting sample rows from each page. This should scale much better.
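A rough Python sketch of that page-by-page sampling, using the google-cloud-bigquery client (which handles the pageToken paging internally); the table reference, page size, and per-page sample size are placeholders:

import random
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder table

sample = []
# list_rows() walks tabledata.list page by page via pageToken under the hood.
for page in client.list_rows(table, page_size=10000).pages:
    rows = list(page)
    # Keep a few random rows from each page instead of the whole page.
    sample.extend(random.sample(rows, min(10, len(rows))))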
Another (totally different) option I see is the use of Table Decorators.
You can programmatically generate, in a loop, a random time (for a snapshot decorator) or time frame (for a range decorator) and query only those portions of the data, extracting the rows you need.
Note the limitation: this only allows you to sample data that is less than 7 days old.
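A snapshot decorator takes a timestamp in milliseconds since the epoch, so generating one random snapshot query (legacy SQL, hypothetical table name) might look like this:

import random
import time

now_ms = int(time.time() * 1000)
week_ms = 7 * 24 * 3600 * 1000
# Pick a random snapshot time within the last 7 days (the decorator limit).
snapshot_ms = now_ms - random.randint(0, week_ms - 1)

# Legacy SQL snapshot decorator: [project:dataset.table@<ms since epoch>]
query = f"SELECT * FROM [my-project:my_dataset.my_table@{snapshot_ms}] LIMIT 1000"

Repeat this in a loop to collect samples from different snapshots.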
tabledata.list is not especially performant for arbitrary lookups in a table, especially as you are looking later and later into the table. It is not really designed for efficient data retrieval of an entire table, it's more for looking at the first few pages of data in a table.
If you want to run some operation over all the data in your table, but not run a query, you should probably use an extract job to GCS instead, and sample rows from the output files.
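With the google-cloud-bigquery client, kicking off such an extract could look roughly like this; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Export the whole table to GCS as sharded CSV files, then sample rows from those files.
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",         # placeholder table
    "gs://my-bucket/exports/my_table-*.csv",  # wildcard URI => sharded output files
    location="US",
)
extract_job.result()  # wait for the export to finish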

My SQL table is too big: retrieving data via paging/segmenting the result?

This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot: what the table looked like at a single instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity of my machine (let's say, 2GB). Also, simply performing chained SELECTs with LIMIT or OFFSET fails the synchronization requirement: the table is frequently updated and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is design-pattern related question: what is the strategy to perform paginated SELECTs against a large table? (Especially when said table receives consistent updates.)
What database are you using? In MySQL, for example, the following would select 20 rows beginning from row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows this):
select * from cars limit 20 offset 40
If you want a "snapshot" effect you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change-tracking, but that's not what you stated you wanted. If you need a snapshot of the exact table state then take the snapshot, write it to a separate table, and use the limit and offset (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified...if you have 2000 clients all taking 5 million-row snapshots you are going to start having some size issues if you don't watch out.
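Since the question mentions MSSQL 2008, a rough sketch of that snapshot-plus-paging idea, shown in Python with pyodbc rather than C#; the connection string, table names, key column Id, and the process() step are placeholders:

import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()

# 1) Take the snapshot once, so every page read later sees the same data.
cur.execute("SELECT * INTO CarsSnapshot FROM Cars")
conn.commit()

# 2) Page through the snapshot with ROW_NUMBER() (SQL Server 2008 has no OFFSET/FETCH).
page_size, first = 1000, 1
while True:
    cur.execute("""
        SELECT * FROM (
            SELECT c.*, ROW_NUMBER() OVER (ORDER BY c.Id) AS rn
            FROM CarsSnapshot c
        ) t
        WHERE t.rn BETWEEN ? AND ?""", first, first + page_size - 1)
    rows = cur.fetchall()
    if not rows:
        break
    process(rows)  # placeholder: write/upload this page, then free the memory
    first += page_size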
You should provide details of the format of the resultant data file. Depending on the format, this could be possible directly in your database, with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there would be export, for SQL Server/Sybase it would be BCP, etc.
Or alternatively achievable by streaming the data, without holding it all in memory, this would vary depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * from Cars Limit 10
and then
SELECT * from Cars limit 10 offset 10
And so on. You will have to figure out the best pagination for this.

SQL: Select records at a time

I have a question: in a SELECT query, how many records can be selected at a time? That is, what is the maximum limit on the number of records a SELECT can return in SQL Server 2000, 2005, and 2008?
Thanks in advance.
There's no hard limit that I'm aware of on SQL Server's side on how many rows you can SELECT. If you could INSERT them all, you can read them all out at the same time.
However, if you select millions of rows at a time, you may experience issues like your client running out of memory or your connection timing out before being able to transmit all the data you SELECTed.
I don't believe there is a built-in 'limit' for selecting rows; it'll be down to the architecture that SQL Server is running on (i.e. 32-bit/64-bit, memory available, etc.). Certainly you can select hundreds of thousands of rows without issue.
But... if you ever find yourself asking for that many rows from a database you should optimise your code / stored procedures so that you only retrieve a subset at a time.
As @paolo says, there is no SQL-defined hard limit; you can specify a limit in your query with the LIMIT keyword (although that is database-dependent).
However, there is one important point: performing a SELECT query with actual database servers typically does not transfer all the data from the server to the client, or load everything into memory, at once. A query usually has a cursor into a result set, and as the client iterates through the result set more rows are fetched from the server, usually in chunks. So unless a client explicitly copies all data from the result set to memory, at no point will this implicitly happen.
This is all completely database-dependent, and in many cases drivers allow tweaking of chunk size etc.
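For example, with psycopg2 a server-side ("named") cursor streams rows in chunks controlled by itersize instead of loading the whole result set client-side; the connection string, query, and handle() step here are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser host=myhost")  # placeholder DSN
cur = conn.cursor(name="stream_cur")  # named => server-side cursor
cur.itersize = 2000                   # rows fetched from the server per round trip
cur.execute("SELECT * FROM big_table")
for row in cur:                       # iterates chunk by chunk, not all at once
    handle(row)                       # placeholder per-row processing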