How to read data in batches using JDBI? - kotlin

I want to read all data from a table. The table has 4 million rows with 25 columns in each row. I am setting the fetch size to 1_000 so that the JVM is not overloaded with lots of data, but the query itself is failing with a timeout exception.
Does JDBI provide a "cursor" which can read data in batches and thus avoid the statement timeout? Is there any other way with JDBI to read this table without hitting the statement timeout?
Code:
val handle = jdbi.open()
handle
    .createQuery("SELECT * FROM TEST_TABLE")
    .setFetchSize(1_000)
    .mapToMap()
Exception:
query execution canceled due to statement timeout [statement:"SELECT * FROM TEST_TABLE", arguments:{positional:{}, named:{}, finder:[]}]
at org.jdbi.v3.core.statement.SqlStatement.internalExecute(SqlStatement.java:1796)
at org.jdbi.v3.core.result.ResultProducers.lambda$getResultSet$2(ResultProducers.java:64)
at org.jdbi.v3.core.result.ResultIterable.lambda$of$0(ResultIterable.java:57)
at org.jdbi.v3.core.result.ResultIterable.iterator(ResultIterable.java:43)

First, try using a smaller fetch size.
If that doesn't work, you can try doing pagination in your application code. Check out the docs on this here: https://www.cockroachlabs.com/docs/stable/pagination.html
Lastly, what is your statement_timeout set to? Can you raise it?
Have you tested on a smaller table? What is the largest table you can load without timing out?
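If you go the pagination route from the CockroachDB docs above, one way to express it with JDBI from Kotlin is keyset pagination: many small queries that each finish well inside the statement timeout. A minimal sketch, assuming TEST_TABLE has a unique, indexed key column (called id here, which is an assumption about your schema):

import org.jdbi.v3.core.Jdbi

// Keyset pagination: many small queries, each finishing well inside the timeout.
// Assumes an indexed, unique "id" column - adjust to your actual key column.
fun readAllInPages(jdbi: Jdbi, pageSize: Int = 1_000) {
    var lastId = 0L
    while (true) {
        val page = jdbi.withHandle<List<Map<String, Any>>, Exception> { handle ->
            handle.createQuery(
                "SELECT * FROM TEST_TABLE WHERE id > :lastId ORDER BY id LIMIT :pageSize"
            )
                .bind("lastId", lastId)
                .bind("pageSize", pageSize)
                .mapToMap()
                .list()
        }
        if (page.isEmpty()) break
        page.forEach { row ->
            // process one row here
        }
        lastId = (page.last()["id"] as Number).toLong()
    }
}

Unlike OFFSET-based paging, each page only scans forward from the last key it saw, so later pages do not get slower as you work through the 4 million rows.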

Related

How to increase superset row limit and timeout cache for SQL Lab and Visualization

I have a dataset that has 1 billion rows. The data is stored in Hive. Also, I put Impala as a layer between Hive and Superset. The queries that are run in Superset have a maximum row limit of 100,000. I need to remove that row limit. Furthermore, I need to make a visualization from what the queries return from SQL Lab, but that cannot be done because there is also a timeout cache limit. Therefore, if I change/increase the row limit in SQL Lab and the timeout cache in visualization, then, I guess, there will be no problem.
I am trying my best to answer below. Please back up all config files before changing them.
For the SQL row limit issue -
modify the config.py file inside 'anaconda3/lib/python3.7/site-packages' and set
DEFAULT_SQLLAB_LIMIT to 1000000000
QUERY_SEARCH_LIMIT to 1000000000
modify viz.py and set
filter_row_limit to 1000000000
For the timeout issue, increase the parameter values below -
For synchronous queries - change superset_config.py:
SUPERSET_WEBSERVER_TIMEOUT
SQLLAB_TIMEOUT
SUPERSET_TIMEOUT (this value should be >= SQLLAB_TIMEOUT)
For async queries -
SQLLAB_ASYNC_TIME_LIMIT_SEC
There must be a config parameter to change the max row limit in site-packages/superset: DEFAULT_SQLLAB_LIMIT to set the default, and SQL_MAX_ROW to set the max in SQL Lab.
I guess we have to run superset_init again to make the change appear in Superset.
I've been able to solve the problem as follows:
modify config.py in site-packages/superset:
increase SQL_MAX_ROW from 100'000.

Query error: Resources exceeded during query execution: The query could not be executed in the allotted memory

I'm getting an error when I try to execute the following query:
select r.*
from dataset.table1 r
where id NOT IN (select id from staging_data.table1);
It's basically a query to load incremental data into a table. dataset.table1 has 360k rows and the incremental data in staging_data has 40k. But when I try to run this in my script to load into another table, I get the error:
Resources exceeded during query execution: The query could not be executed in the allotted memory
This started happening in the last week; before that it was working well.
I looked for solutions on the internet, but none of them work in my case.
Does anyone know how to solve it?
I changed the cronjob time and it worked. Thank you!
You can try writing the results to another table, as BigQuery has a limitation on the maximum response size that can be processed. You can do that whether you are using Legacy or Standard SQL, and you can follow the steps to do it in the documentation.
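One way to follow that advice from code is to run the query as a job that writes into an explicit destination table. A rough sketch with the google-cloud-bigquery Java client from Kotlin; the destination table name is a placeholder and the exact builder methods may differ between client versions:

import com.google.cloud.bigquery.BigQueryOptions
import com.google.cloud.bigquery.JobInfo
import com.google.cloud.bigquery.QueryJobConfiguration
import com.google.cloud.bigquery.TableId

fun loadIncremental() {
    val bigquery = BigQueryOptions.getDefaultInstance().service
    val queryConfig = QueryJobConfiguration.newBuilder(
        "SELECT r.* FROM dataset.table1 r WHERE id NOT IN (SELECT id FROM staging_data.table1)"
    )
        // Write the result straight into a table instead of returning it as a query response.
        .setDestinationTable(TableId.of("dataset", "table1_increment")) // placeholder target table
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .setUseLegacySql(false)
        .build()
    bigquery.query(queryConfig) // runs the job and waits for completion
}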

Simple queries take very long

When I execute a query for the first time in DBeaver it can take up to 10-15 seconds to display the result. In SQLDeveloper those queries only take a fraction of that time.
For example:
Simple "select column1 from table1" statement
DBeaver: 2006ms,
SQLDeveloper: 306ms
Example 2 (other way around, so there's no server-side caching):
Simple "select column1 from table2" statement
SQLDeveloper: 252ms,
DBeaver: 1933ms
DBeavers status box says:
Fetch resultset
Discover attribute column1
Find attribute column1
Late bind attribute column1
2, 3 and 4 use most of the query execution time.
I'm using Oracle 11g, SQLDeveloper 4.1.1.19 and DBeaver 3.5.8.
See http://dbeaver.jkiss.org/forum/viewtopic.php?f=2&t=1870
What could be the cause?
DBeaver looks up some metadata related to objects in your query.
On an Oracle DB, it queries catalog tables such as
SYS.ALL_ALL_TABLES / SYS.ALL_OBJECTS - only once after connection, for the first query you execute
SYS.ALL_TAB_COLS / SYS.ALL_INDEXES / SYS.ALL_CONSTRAINTS / ... - I believe each time you query a table not used before.
Version 3.6.10 introduced an option to enable/disable a hint used in those queries. Disabling the hint made a huge difference for me. The option is in the Oracle Properties tab of the connection edit dialog. Have a look at issue 360 on dbeaver's github for more info.
The best way to get insight is to perform a database trace.
Run the query a few times first to eliminate the caching effect.
Then repeat the following steps in both IDEs:
activate the trace
ALTER SESSION SET tracefile_identifier = test_IDE_xxxx;
alter session set events '10046 trace name context forever, level 12'; /* binds + waits */
Provide the xxxx to identify the test. You will see this string as a part of the trace file name.
Use level 12 to see the wait events and bind variables.
run the query
close the connection
This is important so that you do not trace other things.
Examine the two trace files to see:
what statements were performed
how many rows were fetched
how much time was spent in the DB
the client (IDE) is responsible for the rest of the time
This should give you enough evidence to tell whether one IDE behaves differently from the other, or whether the DB statements issued are simply different.

What problems may occur while querying SQL databases with a big amount of data over the internet

I have a big database on an MSSQL server that contains data indexed by a web crawler.
Every day I want to update the SOLR SearchEngine index using the DataImportHandler, which is located on another server and another network.
Solr's DataImportHandler uses a query to get data from SQL. For example, this query:
SELECT * FROM DB.Table WHERE DateModified > Config.LastUpdateDate
The ImportHandler does 8 selects of this type. Each select will get around 1000 rows from the database.
To connect to SQL Server I am using com.microsoft.sqlserver.jdbc.SQLServerDriver.
The parameters I can add for connection are:
responseBuffering="adaptive/all"
batchSize="integer"
So my question is:
What can go wrong while doing these queries every day (except network errors)?
I want to know how SQL Server works in this context.
Furthermore, I have to make a decision regarding the way I will implement this importing and how to handle errors, but first I need to know what errors can arise.
Thanks!
Later edit
My problem is that I don't know how these SQL queries can fail. When I call this importer every day it does 10 queries to the database. If the 5th query fails I have two options:
roll back the entire transaction and do it again, or commit the data I got from the first 4 queries and somehow redo queries 5 to 10. But if these queries always fail because of some other problem, I need to think of another way to import this data.
Can these SQL queries over the internet fail because of timeouts or something like that?
The only problem I identified after working with this type of import is:
Network problems - if the network connection fails, SOLR rolls back any changes and the commit doesn't take place. In my program I identify this as an error and don't log the changes in the database.
Thanks #GuidEmpty for providing his comment and clarifying this for me.
There could be issues with permissions (not sure if you control these).
Might be a good idea to catch exceptions you can think of and include a catch all (Exception exp).
Then take the catch-all as the worst case: roll back (where you can) and log the exception to investigate later on.
You don't say what types you are selecting either; keep in mind text/blob can take a lot more space and could cause issues internally if you buffer any data, etc.
Though, on a quick re-read, you don't need to roll back if you are only selecting.
I think you would be better off having a think about what you are hoping to achieve and whether knowing all possible problems will help.
HTH
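To make the "catch the exceptions you can think of, plus a catch-all" idea concrete, here is a rough Kotlin/JDBC sketch of one of the daily import selects with that kind of error handling. The connection URL, query timeout and logging are placeholders, not details from the original setup:

import java.sql.DriverManager
import java.sql.SQLException
import java.sql.SQLTimeoutException

fun runImportQuery(jdbcUrl: String, lastUpdateDate: java.sql.Timestamp) {
    try {
        DriverManager.getConnection(jdbcUrl).use { conn ->
            conn.prepareStatement("SELECT * FROM DB.Table WHERE DateModified > ?").use { stmt ->
                stmt.setTimestamp(1, lastUpdateDate)
                stmt.queryTimeout = 300 // seconds; fail fast instead of hanging on a slow network
                stmt.executeQuery().use { rs ->
                    while (rs.next()) {
                        // hand each row to the indexing step
                    }
                }
            }
        }
    } catch (e: SQLTimeoutException) {
        // query or network took too long: safe to retry later, nothing to roll back for a SELECT
        println("Timeout during import: ${e.message}")
    } catch (e: SQLException) {
        // permissions, connectivity, driver errors, etc.
        println("SQL error during import: ${e.message}")
    } catch (e: Exception) {
        // catch-all for anything unexpected; log it so it can be investigated later
        println("Unexpected error during import: ${e.message}")
    }
}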

is there a maximum number of inserts that can be run in a batch sql script?

I have a series of simple "INSERT INTO" type statements, but after running about 3 or 4 of them the script stops and I get empty sets when I try selecting from the appropriate tables. Aside from my specific code, I wonder whether there is an ideal way of running multiple insert-type queries.
Right now I just have a text file saved as a .sql with normal SQL commands separated by ";".
No, there is not. However, if it stops after 3 or 4 inserts, it's a good bet there's an error in the 3rd or 4th insert. Depending on which SQL engine you use, there are different ways of making it report errors during and after operations.
Additionally, if you have lots of inserts, it's a good idea to wrap them inside a transaction - this basically buffers all the insert commands until it sees the end command for the transaction, and then commits everything to your table. That way, if something goes wrong, your database doesn't get polluted with data that needs to be deleted again afterwards. More importantly, every insert without a transaction counts as a single transaction, which makes them really slow - doing 100 inserts inside one transaction can be as fast as doing two or three standalone inserts.
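As a rough illustration of the transaction idea, here is a minimal Kotlin/JDBC sketch; the connection URL, table and columns are placeholders:

import java.sql.DriverManager

fun insertAll(jdbcUrl: String, rows: List<Pair<Int, String>>) {
    DriverManager.getConnection(jdbcUrl).use { conn ->
        conn.autoCommit = false // buffer the inserts instead of committing each one
        try {
            conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)").use { stmt ->
                for ((id, name) in rows) {
                    stmt.setInt(1, id)
                    stmt.setString(2, name)
                    stmt.addBatch()
                }
                stmt.executeBatch()
            }
            conn.commit() // everything becomes visible at once
        } catch (e: Exception) {
            conn.rollback() // the table is left untouched if any insert failed
            throw e
        }
    }
}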
Maximum Capacity Specifications for SQL Server
Max Batch size = 65,536 * Network Packet Size
However, I doubt that the max batch size is your problem.