Camel Sql Consumer Performance for Large DataSets - sql

I am trying to cache some static data in Ignite cache in order to query faster so I need to read the data from DataBase in order to insert them into cache cluster.
But number of rows is like 3 million and it causes OutOfMemory error normally because SqlComponent is trying to process all the data as one and it tries to collect them once and for all.
Is there any way to split them when reading result set (for ex 1000 items per Exchange)?

You can add a limit in the SQL query depending on what SQL database you use.
Or you can try using jdbcTemplate.maxRows=1000 to use that option. But it depends on the JDBC driver if it supports limiting using that option or not.
And also mind you need some way to mark/delete the rows after processing, so they are not selected in the next query, such as using the onConsume option.
You can look in the unit tests to find some examples with onConsume etc: https://github.com/apache/camel/tree/master/components/camel-sql/src/test

Related

Databricks - automatic parallelism and Spark SQL

I have a general question about Databrick cells and auto-parallelism with Spark SQL. I have a summary table that has a number of fields of which most have a complex logic behind them.
If I put blocks (%SQL) of individual field logic in individual cells, will the scheduler automatically attempt to allocate the cells to different nodes on the cluster to improve performance ( depending on how many nodes my cluster has) ? Alternatively are their PySpark functions I can use to organise the parallel running myself ? I cant find much about this elsewhere...
I am using LTS 10.4 (Spark 3.2.1 Scala 2.12)
Many thanks
Richard
If you write python "pyspark" code over multiple cells there is something called "lazy execution" meaning the actual work only happens at the last possible moment (for example when data is written or displayed). So before you run for example a display(df) no actual work is done on the cluster. So technically here the code of multiple code cells is parallelized efficiently.
However, in Databricks Spark SQL a single cell is executed to completion before the next one is started. If you want to run those concurrently you can take a look at running multiple notebooks at the same time (or multiple parameterized instances of the same notebook) with dbutils.notebook.run(). Then the cluster will automatically split the resources evenly between those queries running at the same time.
You can try run the sql statements using spark.sql() and assign the outputs to different dataframes. In the last step, you could execute an operation (for ex: join) that brings all into one dataframe. The lazy evaluation should then evaluate all dataframes (i.e. your sql queries) in parallel.

Select random bin from a set in Aerospike Query Language?

I want to select a sample of random 'n' bins from a set in the namespace. Is there a way to achieve this in Aerospike Query Language?
In Oracle, we achieve something similar with the following query:
SELECT * FROM <table-name> sample block(10) where rownum < 101
The above query fetches blocks of size of 10 rows from a sample size of 100.
Can we do something similar to this in Aerospike also?
Rows are like records in Aerospike, and columns are like bins. You don’t have a way to sample random columns from a table, do you?
You can sample random records from a set using ScanPolicy.maxRecords added to a scan of that set. Note the new (optional) set indexes in Aerospike version 5.6 may accelerate that operation.
Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace evenly distributed to each of those using the characteristics of the 20-byte RIPEMD-160 digest. Therefore, Aerospike doesn't have a rownum, but you can leverage the data distribution to sample data.
Each partition is roughly 0.0244% of the namespace. That's a sample space you can use, similar to the SQL query above. Next, if you are using the ScanParition method of the client, you can give it the ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further you can start after an arbitrary digest (see PartitionFilter.after) if you'd like.
Ok, now let's talk data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC compatible data browser like DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.
AQL is a tool written using Aerospike C Client. Aerospike does not have a SQL like query language per se that the server understands. What ever functionality that AQL provides is documented - type HELP on the aql> prompt.
You can write an application in C or Java to achieve this. For example, in Java, you can do a scanAll() API call with maxRecords defined in the ScanPolicy. I don't see AQL tool offering that option for scans. (It just allows you to specify a scan rate, one of the other ScanPolicy options.)

SQL - When data are transfered

i need to get a large amount of data from a remote database. the idea is do a sort of pagination, like this
1 Select a first block of datas
SELECT * FROM TABLE LIMIT 1,10000
2 Process that block
while(mysql_fetch_array()...){
//do something
}
3 Get next block
and so on.
Assuming 10000 is an allowable dimension for my system, let us suppose i have 30000 records to get: i perform 3 call to remote system.
But my question is: when executing a select, the resultset is transmitted and than stored in some local part with the result that fetch is local, or result set is stored in remote system and records coming one by one at any fetch? Because if the real scenario is the second i don't perform 3 call, but 30000 call, and is not what i want.
I hope I explained, thanks for help
bye
First, it's highly recommended to utilize MySQLi or PDO instead of the deprecated mysql_* functions
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default with the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value of MYSQLI_STORE_RESULT causes the entire result set to be transfered to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as requested.
PDO by default will load data as needed when using PDO::query() or PDO::prepare() to execute the query and retrieving results with PDO::fetch().
To retrieve all data from the result set into a PHP array, you can use PDO::fetchAll()
Prepared statements can also use the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY constant, though PDO::fetchALL() is recommended.
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
You would be performing 3 calls, not 30.000. That's for sure.
Each 10.000 results batch is rendered on the server (by performing each of the 3 queries). Your while iterates through a set of data that has already been returned by MySQL (that's why you don't have 30.000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
//do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.
according to the documentation here all the data is fetched to the server, then you go through it.
from the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition it seams this is deprecated so you might want to use something else that is suggested there.

Solr sort different criteria for each subset

We are using Apache SOLR for full text search. We have specific requirement for sorting the search results - basically when querying for data, we need 2 sets of data - A and B, but each set should have its own sorting criteria and we cannot make 2 different calls. We can get 2 sets by using an OR condition, but how do we sort each set differently ? To illustrate, if :
Set A = {3,1,2}
Set B = {8,5,9}
So, the expected response can have set A returned in ascending order {1,2,3} but the set B can be returned in descending order {9,8,5}
I believe the default sort in SOLR will sort the entire results sets. Any suggestions or if the question is not clear,let me know.
You can possibly achieve this using FieldCollapsing
You might need to do a little more work - i.e. have a display order field(could be an integer) so that Solr knows one field that it needs to sort by.
Next you could use a query like this -
&q=*&group=true&group.field=set&group.sort=display_order
I would recommend keeping the logic such as this out of Solr, it isn't meant to be a substitute for Relational Databases, and getting it to do complex SQL like operation (while some are possible) is going to be tricky.
By the way there is an open issue in Solr's JIRA that addresses batch processing of multiple queries. Which means, when it is merged into a release, you could fire n different queries to fetch these sets in one call to Solr.
If you are keen to have SOLR perform this task for you, the patch is available in the JIRA card, you could create a build for yourself and let us all know how it goes :)

Sending huge vector to a Database in R

Good afternoon,
After computing a rather large vector (a bit shorter than 2^20 elements), I have to store the result in a database.
The script takes about 4 hours to execute with a simple code such as :
#Do the processing
myVector<-processData(myData)
#Sends every thing to the database
lapply(myVector,sendToDB)
What do you think is the most efficient way to do this?
I thought about using the same query to insert multiple records (multiple inserts) but it simply comes back to "chucking" the data.
Is there any vectorized function do send that into a database?
Interestingly, the code takes a huge amount of time before starting to process the first element of the vector. That is, if I place a browser() call inside sendToDB, it takes 20 minutes before it is reached for the first time (and I mean 20 minutes without taking into account the previous line processing the data). So I was wondering what R was doing during this time?
Is there another way to do such operation in R that I might have missed (parallel processing maybe?)
Thanks!
PS: here is a skelleton of the sendToDB function:
sendToDB<-function(id,data) {
channel<-odbcChannel(...)
query<-paste("INSERT INTO history VALUE(",id,",\"",data,"\")",sep="")
sqlQuery(channel,query)
odbcClose(channel)
}
That's the idea.
UPDATE
I am at the moment trying out the LOAD DATA INFILE command.
I still have no idea why it takes so long to reach the internal function of the lapply for the first time.
SOLUTION
LOAD DATA INFILE is indeed much quicker. Writing into a file line by line using write is affordable and write.table is even quicker.
The overhead I was experiencing for lapply was coming from the fact that I was looping over POSIXct objects. It is much quicker to use seq(along.with=myVector) and then process the data from within the loop.
What about writing it to some file and call LOAD DATA INFILE? This should at least give a benchmark. BTW: What kind of DBMS do you use?
Instead of your sendToDB-function, you could use sqlSave. Internally it uses a prepared insert-statement, which should be faster than individual inserts.
However, on a windows-platform using MS SQL, I use a separate function which first writes my dataframe to a csv-file and next calls the bcp bulk loader. In my case this is a lot faster than sqlSave.
There's a HUGE, relatively speaking, overhead in your sendToDB() function. That function has to negotiate an ODBC connection, send a single row of data, and then close the connection for each and every item in your list. If you are using rodbc it's more efficient to use sqlSave() to copy an entire data frame over as a table. In my experience I've found some databases (SQL Server, for example) to still be pretty slow with sqlSave() over latent networks. In those cases I export from R into a CSV and use a bulk loader to load the files into the DB. I have an external script set up that I call with a system() call to run the bulk loader. That way the load is happening outside of R but my R script is running the show.