Solr DataImportHandler - batchSize="-1" does not work

I am using Apache Solr 6.4.1.
Because I am using a really big database (over 3 million rows), I would like to add batchSize="-1" to db-data-config.xml.
But if I do this, it does not work. Without batchSize I can get the first 2,000 rows, then I get a "java.lang.RuntimeException: java.lang.StackOverflowError" error.
In solrconfig.xml:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
In db-data-config.xml:
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://***:1433;integratedSecurity=true;Initial Catalog=***;"
              batchSize="-1"/>
...
Why does batchSize="-1" not work? (batchSize="200" and other values do work.)
UPDATE
If I set Debug to false in the DataImportHandler, it works!

I don't think that setting batchSize to '-1' will help in your situation. This is written inside the source code of the Solr DataImportHandler:
if (batchSize == -1)
  batchSize = Integer.MIN_VALUE;  // the MySQL driver's row-streaming convention
[... omitted ...]
Statement statement = c.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
statement.setFetchSize(batchSize);
So double-check which values the MS JDBC driver accepts for the setFetchSize method.
setFetchSize - Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement. If the value specified is zero, then the hint is ignored. The default value is zero.
So the driver is free to ignore this hint; maybe it is just reading in the whole table. You could also try changing the version of your JDBC driver...
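If in doubt, you can probe the driver's behaviour outside Solr. A minimal Scala sketch, assuming the Microsoft JDBC driver is on the classpath; the URL, credentials, and table name are placeholders:

import java.sql.{DriverManager, ResultSet}

// Placeholder connection string; adjust host, port, database, and credentials.
val conn = DriverManager.getConnection(
  "jdbc:sqlserver://host:1433;databaseName=mydb", "user", "pass")
val stmt = conn.createStatement(
  ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
stmt.setFetchSize(200) // a positive hint: ask for ~200 rows per round trip
val rs = stmt.executeQuery("SELECT id FROM my_table") // placeholder query
var n = 0
while (rs.next()) n += 1 // rows should arrive in driver-sized chunks
println(s"read $n rows")
conn.close()

If this runs through a large table with a modest heap, the driver is honouring the hint; if memory blows up, it is buffering the whole result set regardless.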
I think you should first adapt the value to your network latency and the number of records you want to retrieve on each round trip.
Indexing performance and MS SQL Server load depend on the batch size. Try starting with a small size and then gradually increase it.
If this does not work, try switching to a different JDBC driver.
Returning to the batchSize parameter, there are only a few cases where you don't need it:
if you have configured your JVM with enough memory to read the entire table
if your JDBC driver raises an exception when setFetchSize() is invoked
if you're dealing with the MySQL JDBC driver, which has a known bug (its streaming convention is sketched below)
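For context, the -1 to Integer.MIN_VALUE mapping in the DataImportHandler code above matches the MySQL Connector/J streaming convention; other drivers, including the MS SQL one, are not required to honour it. A minimal sketch of that convention, with placeholder URL, credentials, and table:

import java.sql.{DriverManager, ResultSet}

// MySQL Connector/J only; placeholder connection details.
val conn = DriverManager.getConnection(
  "jdbc:mysql://localhost:3306/mydb", "user", "pass")
// With a forward-only, read-only statement and fetchSize = Integer.MIN_VALUE,
// Connector/J streams rows one at a time instead of buffering the full result.
val stmt = conn.createStatement(
  ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
stmt.setFetchSize(Integer.MIN_VALUE)
val rs = stmt.executeQuery("SELECT id FROM big_table") // placeholder query
while (rs.next()) { /* process one streamed row at a time */ }
conn.close()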

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads about 70K records from the DB into a data frame; each record has two fields.
After reading the data from the DB, we apply a minor mapping and load the result as a broadcast for later usage.
Now, in the local environment, there is an exception, a timeout from the RetryingBlockFetcher, while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with "spark.master" set to local.
When I limit the maximum number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configurations in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped! I just got the timeout after more time...
I realized that the data frame I'm reading from the DB has one partition of 62K records; when I repartitioned it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
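Roughly, this is what worked, reusing the dataframe from the question; the partition count of 2 is just the smallest value that helped here, so tune it for your data:

// Split the single 62K-row partition before collecting the result.
dataframe
  .repartition(2)
  .select("id", "mapping_id")
  .rdd
  .map(row => row.getString(0) -> row.getLong(1))
  .collectAsMap()
  .toMap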
Any idea why this solves the issue? Is there a Spark configuration that can solve this instead of repartitioning?
Thanks!

Ignore large or corrupt records when loading files with pig using PigStorage

I am seeing the following error when loading a large file using Pig.
java.io.IOException: Too many bytes before newline: 2147483971
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:123)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setupNewRecordReader(MRReaderMapReduce.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setSplit(MRReaderMapReduce.java:88)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.processSplitEvent(MRInput.java:631)
at org.apache.tez.mapreduce.input.MRInput.handleEvents(MRInput.java:590)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:732)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:106)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:809)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:748)
The command I am using is as follows:
A = LOAD 'file1.data' USING PigStorage('\u0001') AS (
    id:long,
    source:chararray
);
Is there any option that can be passed here to drop the record that is causing the issue and continue?
You can skip a certain number of records by using the following setting at the top of your Pig script:
set mapred.skip.map.max.skip.records 1000;
From the Hadoop documentation for this property: the number of acceptable skip records surrounding the bad record, per bad record, in the mapper. The number includes the bad record itself. To turn off detection/skipping of bad records, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met or all attempts are exhausted for the task. Set the value to Long.MAX_VALUE to indicate that the framework need not try to narrow down; whatever records get skipped (application-dependent) are acceptable.

CrateIO 2.0.2 - Problems with Insert and Update

I've done an upgrade from Crate 1.1.4 to 2.0.2. After this I also optimized all tables.
Crate runs on one server with one instance. I have not changed any default settings, except the node name and the cluster name.
But now I can't write anything to the database. Selects are fine, but every write operation ends with:
SQLActionException: INTERNAL_SERVER_ERROR 5000 UnavailableShardsException: [mytable][3] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m], request: [ShardUpsertRequest{items=[Item{id='10'}], shardId=[my_table][3]}]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:107)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:319)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:254)
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:839)
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:836)
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:142)
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1656)
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:848)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:271)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:250)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:242)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:550)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It does not matter whether I run the INSERT/UPDATE query via JDBC or directly in the Crate console.
Does anyone have a good idea how I can solve this problem?
Thanks!
CrateDB changed the default write consistency checks and also the default number_of_shards (to allow usage on a 1-node cluster with default table settings).
See https://crate.io/docs/reference/release_notes/2.0.1.html#breaking-changes
In your case, I guess your number_of_replicas is set to 1.
Setting it to 0 or 0-1 (auto-expand) will solve this.
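For example, a minimal sketch over JDBC; the table name and URL are placeholders, and the same ALTER statement can be run directly in the Crate console:

import java.sql.DriverManager

// Placeholder URL for the Crate JDBC driver; adjust host and port as needed.
val conn = DriverManager.getConnection("jdbc:crate://localhost:5432/")
// '0-1' auto-expands: 0 replicas on a single node, 1 once a second node joins.
conn.createStatement().execute(
  "ALTER TABLE mytable SET (number_of_replicas = '0-1')")
conn.close()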

hsqldb properties

I am using HSQLDB, which has the following settings in the properties file (not set by me):
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
The DB started giving errors in the logs:
java.sql.SQLException: S1000 General error java.util.NoSuchElementException
Some searching on Google pointed me to this happening because the limit of the .data file has been reached. The size of the .data file is around 0.7 GB.
If I increase hsqldb.cache_file_scale, will the above error disappear?
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
If hsqldb.cache_file_scale=3, does this mean that the database is in memory and will require 3 GB? If memory is an issue, how can it be reduced?
The current setting allows up to 2GB in the data file.
I suggest you perform a SHUTDOWN SCRIPT to clear up any problems. If you have further problems, contact the HSQLDB project.
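A minimal sketch of issuing that statement over JDBC; the database path and credentials are placeholders, and it assumes no other process has the database open:

import java.sql.DriverManager

// Connect to the file-based database; "mydb" is a placeholder path.
val conn = DriverManager.getConnection("jdbc:hsqldb:file:mydb", "sa", "")
// SHUTDOWN SCRIPT rewrites all data into a fresh .script file and closes the DB;
// the connection is closed by the shutdown, so reopen it to continue working.
conn.createStatement().execute("SHUTDOWN SCRIPT")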

NHibernate SysCache2 and SqlDependency problems

I've set enable_broker on my SQL Server 2008 to use SqlDependency.
I've configured my .NET app to use SysCache2 with a cache region as follows:
<syscache2>
<cacheRegion name="BlogEntriesCacheRegion" priority="High">
<dependencies>
<commands>
<add name="BlogEntries"
command="Select EntryId from dbo.Blog_Entries where ENABLED=1"
/>
</commands>
</dependencies>
</cacheRegion>
</syscache2>
My Hbm file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<hibernate-mapping xmlns="urn:nhibernate-mapping-2.2">
<class name="BlogEntry" table="Blog_Entries">
<cache usage="nonstrict-read-write" region="BlogEntriesCacheRegion"/>
....
</class>
</hibernate-mapping>
I also have query caching enabled for queries against BlogEntry.
When I first query, the results are cached in the second-level cache, as expected.
If I now go and change a row in Blog_Entries, everything works as expected: the cache is expired, and I get this message:
2010-03-03 12:56:50,583 [7] DEBUG NHibernate.Caches.SysCache2.SysCacheRegion - Cache items for region 'BlogEntriesCacheRegion' have been removed from the cache for the following reason : DependencyChanged
I expect that. On the next page request, the query and its results are stored back in the cache. However, the cache is immediately invalidated again, even though nothing further has changed.
DEBUG NHibernate.Caches.SysCache2.SysCacheRegion - Cache items for region 'BlogEntriesCacheRegion' have been removed from the cache for the following reason : DependencyChanged
My cache is constantly invalidated on every subsequent request, with no changes to the underlying data. Only a restart of the application allows the cache to operate again, and then only until the first time the data is cached and dirtied again (the first dirtying of the cache causes it to never work again).
Has anyone seen this problem or got any ideas what it could be? I was thinking that SysCache2 needs to handle the SqlDependency OnChange event, which it probably is doing, so I don't understand why SQL Server keeps sending SqlDependency dependencyChanged notifications.
Thanks
We are getting the same problem on one database instance, but not on the other. It definitely seems to be some kind of permission problem on the database end, because the exact same NHibernate configuration is used in both cases.
In the working case the cache behaves as expected; in the other (a database engine with much stricter permissions) we get exactly the behaviour you describe.