I was using the PostgreSQL Java JDBC driver. This error popped up when I ran a large batch query SELECT * FROM mytable WHERE (pk1, pk2, pk3) IN ((?,?,?),(?,?,?).....) with ~20k composite ids (i.e., ~60k placeholders).
The call stack for the exception:
org.postgresql.util.PSQLException: ERROR: stack depth limit exceeded
Hint: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2552)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2284)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
at io.agroal.pool.wrapper.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:78)
...
This looks like a server-side error. It's tricky because:
it's hard for me to configure server-side settings...
even if I could configure that, it's hard to know how large a query will blow up the server-side stack
Any ideas? Or what's the best practice for running such a large id query?
I am not sure of the maximum number of entries for an IN clause (1000?), but it is way less than 20K. The common way to handle that many is to create a staging table, in this case containing the variables; call them v1, v2, v3. Load the staging table from a file (CSV), then use:
select *
from mytable
where (pk1, pk2, pk3) in
(select v1,v2,v3
from staging_table
);
With this approach there is no limit on the number of items in the staging table.
Once the process is complete, truncate the staging table in preparation for the next cycle.
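For concreteness, a minimal sketch of that setup, assuming bigint id columns and a hypothetical server-side path for the CSV (from a JDBC client you could stream the same file through the driver's CopyManager instead of a server-side COPY):
CREATE TABLE staging_table (
    v1 bigint,
    v2 bigint,
    v3 bigint
);
-- bulk-load the ~20k id tuples; psql's \copy works for a file on the client side
COPY staging_table (v1, v2, v3) FROM '/tmp/ids.csv' WITH (FORMAT csv);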
So, I need to process parts of a huge dataset (100,000,000+ records) simultaneously with multiple instances of a processing script running on separate servers. Each instance will process its own chunk of data; no data will be given to more than one instance. I would hand records to each instance in chunks of 50/100. The question is: how do I organise the pagination? I think some sort of global pointer should be stored on the DB side (PostgreSQL). A CURSOR is of no use here because it only exists within a transaction.
The dataset consists of multiple text files stored one line per row and will be queried through a b-tree index. Sample columns: uuid, file_name, line, line_nr, date.
The idea I have is to create a cursors table and store the cursor's current value (which will refer to line_nr) for each file_name after every request.
Is that an efficient way to do it, or is there some built-in functionality in PostgreSQL that would let me do it?
Use SELECT ... FOR UPDATE with SKIP LOCKED. You'll want to create a work queue table with claimed_at/completed_at timestamp columns that the workers update as they complete their work. The other column of the work queue table will be a reference to your dataset table's PK (you probably won't want to use a real foreign key for performance reasons). Then you can use a modified version of this query:
https://stackoverflow.com/a/49403339/16361
We'll use a larger limit in order to allocate chunks of tasks at once. And, instead of deleting, we'll set the claimed_at timestamp, and use a filter on claimed_at being null to avoid double-claiming. Your application code would be responsible for the 2nd update to set the completed_at timestamp. As a tertiary advantage, you can query your work queue table's completed_at - claimed_at timestamp to keep track of how long each task is taking, and you can query for completed_at IS NULL when everything is done to see any rows that caused workers to crash or that otherwise did not complete.
-- claim a batch of up to 50 unclaimed rows in one statement
UPDATE work_queue_table SET claimed_at = now()
WHERE dataset_row_uuid IN (
    SELECT dataset_row_uuid
    FROM work_queue_table
    WHERE claimed_at IS NULL
    ORDER BY dataset_row_uuid
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING dataset_row_uuid;
The setup of the work queue table could be as simple as this:
CREATE TABLE work_queue_table AS
SELECT uuid AS dataset_row_uuid,
NULL::timestamp AS claimed_at,
NULL::timestamp AS completed_at
FROM the_dataset_table;
Though you may need to instead turn it into a regular CREATE plus multiple INSERTs that you can run in parallel if this takes too long (I haven't created a huge table like this since PG gained multiprocessing features; it's possible that doesn't actually help anymore).
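A rough sketch of that variant, with a hypothetical file_name range predicate standing in for however you split the dataset between loaders:
CREATE TABLE work_queue_table (
    dataset_row_uuid uuid PRIMARY KEY,
    claimed_at timestamp,
    completed_at timestamp
);
-- each loader runs this with a different, non-overlapping range
INSERT INTO work_queue_table (dataset_row_uuid)
SELECT uuid
FROM the_dataset_table
WHERE file_name >= 'file_0000' AND file_name < 'file_1000';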
If it's good/helpful to have tasks from the same file handled by the same worker, you can change the ORDER BY to file_name, line_nr instead (which means carrying those columns into the work queue table as well). There are a lot of tweaks you can do for various use cases; hopefully this can get you started.
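To round out the flow described above, a minimal sketch of the worker's second update and the two monitoring queries (the parameter placeholder is whatever your client library uses):
-- mark a row done once the worker finishes it
UPDATE work_queue_table
SET completed_at = now()
WHERE dataset_row_uuid = $1;
-- how long are tasks taking?
SELECT avg(completed_at - claimed_at) AS avg_task_time
FROM work_queue_table
WHERE completed_at IS NOT NULL;
-- after the run: claimed but never completed (crashed or stuck workers)
SELECT dataset_row_uuid
FROM work_queue_table
WHERE claimed_at IS NOT NULL
  AND completed_at IS NULL;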
If the number of clients is constant, each could fetch its next batch with
SELECT ...
FROM atable
WHERE id > previous_id
AND id % number_of_clients = client_no
ORDER BY id
LIMIT 50;
Here previous_id is the maximal id from the previous batch, number_of_clients is the number of clients and client_no is different for each client.
That avoids taking locks if you don't need them.
I have a table with 1.3 billion rows (MemSQL, columnstore). I need to run a GROUP BY on 3 fields (id1, id2, text) and fetch the latest record for each such 3-tuple. The table gets populated through a pipeline mounted onto an EFS folder. Currently, it has about 200k CSV files of 2MB each.
I need help writing an optimized query for this case, or advice on whether it can be done some other way.
Edit: I am not able to find any blog/help online; most of them talk about solutions involving the creation of an extra table, which is not possible for me right now (very heavy memory usage in that case).
Something like below is not going to work and takes my 5-node cluster down:
select max(eventTime) from table1 group by id1, id2, field1
There are a couple of considerations here.
1) what is your shard key for the columnstore table?
2) are you using MemSQL 6.5, the most recent version?
3) have you reviewed this resource about Optimizing Table Data Structures? https://www.memsql.com/static/memsql_whitepaper_optimizing_table_data_structures.pdf
Ensure that columns common to all queries are in the columnstore key to improve segment elimination.
If the data is inserted in an order, like a timestamp, it's best to put that column first in the columnstore key to minimize the work of the background merger process.
If there are lots of distinct values in one of the keys of the composite key, put that last. Put the key part with less distinctness first to increase the likelihood that segment elimination will be able to affect later columns.
Also, what would help is if you ran
EXPLAIN select max(eventTime) from table1 group by id1, id2, field1;
so that we could see the explain plan.
It takes a long time because it needs a proper database design. You have to choose the shard key to be those three columns (id1, id2, field1). I recommend using a columnstore rather than a rowstore for that query.
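As an illustration only (the column types are guesses, and the exact key order should follow the distinctness advice above), a MemSQL columnstore definition along those lines might look like:
CREATE TABLE table1 (
    id1 BIGINT NOT NULL,
    id2 BIGINT NOT NULL,
    field1 VARCHAR(255) NOT NULL,
    eventTime DATETIME NOT NULL,
    -- insert-ordered column first in the columnstore key, per the advice above
    KEY (eventTime, id1, id2, field1) USING CLUSTERED COLUMNSTORE,
    -- shard on the GROUP BY columns so each partition can aggregate locally
    SHARD KEY (id1, id2, field1)
);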
In Oracle PL/SQL, I join a few tables and insert the result into another table, which could produce thousands, lakhs, or even millions of rows. I can insert with:
insert into tableA
select * from tableB;
Will there be any chance of failure because of the number of rows?
Or is there a better way to insert values when there is a large number of records?
Thanks in Advance
Well, everything is finite inside the machine, so if that select returns too many rows, it for sure won't work (although it would take a great many rows; the number depends on your storage and memory size, your OS, and maybe other things).
If you think your query can surpass the limit, then do the insertion in batches and commit after each batch. Of course you need to be aware that you must do something if, at 50% of the inserts, you decide to cancel the process (a rollback will not undo the batches that were already committed).
My recommended steps are different because performance typically increases when you load more data in one SQL statement using SQL or PL/SQL:
I would recommend checking the size of your rollback segment (RBS segment) and possibly bringing online a larger dedicated one for such a transaction.
For inserts, you can say something like 'rollback consumed' = 'amount of data inserted'. You know the typical row width from the database statistics (see user_tables after analyze table tableB compute statistics for table for all columns for all indexes).
Determine how many rows you can insert per iteration.
Insert this amount of data in big insert and commit.
Repeat.
Locking normally is not an issue with insert, since what does not yet exist can't be locked :-)
When running on partitioned tables, you might want to consider different scenarios that allow the (sub)partitions to distribute the work. When loading from text files with SQL*Loader, you might use a different approach too, such as a direct path load, which adds preformatted data blocks to the database directly instead of letting the RDBMS process the rows through the SQL engine.
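A minimal PL/SQL sketch of the batch-and-commit loop described above, assuming tableA and tableB share the same column layout; the 10,000-row limit is just a starting point to tune against your rollback segment size:
DECLARE
    CURSOR c_src IS SELECT * FROM tableB;
    TYPE t_rows IS TABLE OF tableB%ROWTYPE;
    l_rows t_rows;
BEGIN
    OPEN c_src;
    LOOP
        FETCH c_src BULK COLLECT INTO l_rows LIMIT 10000;
        EXIT WHEN l_rows.COUNT = 0;
        FORALL i IN 1 .. l_rows.COUNT
            INSERT INTO tableA VALUES l_rows(i);
        COMMIT;  -- keeps undo usage per iteration small
    END LOOP;
    CLOSE c_src;
END;
/
Each COMMIT releases the undo from that batch, at the cost of not being able to roll the whole load back in one go.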
To insert a limited number of rows you can use ROWNUM, which is a pseudocolumn.
For example, to populate a table with 10,000 rows from another table having 50,000 rows you can use:
insert into new_table_name select * from old_table_name where rownum <= 10000;
When I run queries in a VirtualBox sandbox with Hive, I notice that SELECT COUNT(*) is much slower than SELECT *.
Can anyone explain what is going on behind the scenes?
And why does this delay happen?
select * from table
can be a map-only job, but
select count(*) from table
is a map-and-reduce job.
Hope this helps.
There are three types of operations that a hive query can perform.
In order from cheapest and fastest to most expensive and slowest, here they are.
A hive query can be a metadata only request.
Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A hive query can be an hdfs get request.
Select * from table, would be an example. In this case hive can return the results by performing an hdfs operation. hadoop fs -get, more or less.
A hive query can be a Map Reduce job.
Hive has to ship the jar to hdfs, the jobtracker queues the tasks, the tasktracker execute the tasks, the final data is put into hdfs or shipped to the client.
The Map Reduce job has different possibilities as well.
It can be a Map only job.
Select * from table where id > 100, for example; all of that logic can be applied in the mapper.
It can be a Map and Reduce job,
Select min(id) from table;
Select * from table order by id ;
It can also lead to multiple map Reduce passes, but I think the above summarizes some behaviors.
This is because the DB is using a clustered primary key, so the query searches for the key row by agonizing row, not via a secondary index.
Run OPTIMIZE TABLE. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
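For example (MySQL syntax; the table name and the range value here are placeholders):
OPTIMIZE TABLE change_events;
CREATE INDEX idx_change_event_id ON change_events (change_event_id);
-- confirm the optimizer picks the new index for the range scan
EXPLAIN SELECT change_event_id FROM change_events WHERE change_event_id > 1000000;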
This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot, what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity for my machine (let's say, 2GB.) Also, simply performing chained SELECTs with LIMIT or OFFSET fails the condition of synchronization: the table is frequently updated and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is design-pattern related question: what is the strategy to perform paginated SELECTs against a large table? (Especially when said table receives consistent updates.)
What database are you using? In MySQL, for example, the following would select 20 rows starting after row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows this):
select * from cars limit 20 offset 40
If you want a "snapshot" effect, you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change tracking, but that's not what you stated you wanted. If you need a snapshot of the exact table state, then take the snapshot, write it to a separate table, and use the limit and offset (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified... if you have 2000 clients all taking 5-million-row snapshots, you are going to start having some size issues if you don't watch out.
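A sketch of the holding-table idea for SQL Server (the snapshot table name is a placeholder):
-- copy the current state into a table no other process writes to
SELECT *
INTO Cars_snapshot
FROM Cars;
-- then page out of Cars_snapshot at your leisure and drop it once the file is built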
You should provide details of the format of the resultant data file. Depending on the format, this could be possible directly in your database, with no app code involved, e.g. for MySQL:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For Oracle there would be export; for SQL Server/Sybase it would be BCP, etc.
Alternatively, this is achievable by streaming the data without holding it all in memory; the approach would vary depending on the app language.
In terms of paging, the easy option is to just use the LIMIT clause (if MySQL) or the equivalent in whatever RDBMS you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
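Since the question's edit mentions SQL Server 2008, which has no LIMIT clause, a keyset-style equivalent (assuming an indexed ID column) would be along these lines:
-- first page
SELECT TOP (1000) *
FROM Cars
ORDER BY ID;
-- subsequent pages: @lastId is the highest ID returned by the previous page, bound by the application
SELECT TOP (1000) *
FROM Cars
WHERE ID > @lastId
ORDER BY ID;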
Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * FROM Cars LIMIT 10
and then
SELECT * FROM Cars LIMIT 10 OFFSET 10
and so on. You will have to figure out the best pagination for this (and include an ORDER BY so the pages are deterministic).