CrateIO 2.0.2 - Problems with Insert and Update - cratedb

I've upgraded from Crate 1.1.4 to 2.0.2 and afterwards optimized all tables.
Crate runs on a single server with one instance. I have not changed any default settings except the node name and the cluster name.
But now I can't write anything to the database. Selects work fine, but every write operation ends with:
SQLActionException: INTERNAL_SERVER_ERROR 5000 UnavailableShardsException: [mytable][3] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m], request: [ShardUpsertRequest{items=[Item{id='10'}], shardId=[my_table][3]}]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:107)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:319)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:254)
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:839)
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:836)
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:142)
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1656)
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:848)
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:271)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:250)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:242)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:550)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It does not matter whether I run the INSERT/UPDATE query via JDBC or directly in the Crate console.
Does anyone have an idea how I can solve this problem?
Thanks!

CrateDB changed the default write consistency checks and also the default number_of_shards (to allow usage on a 1-node cluster with default table settings).
See https://crate.io/docs/reference/release_notes/2.0.1.html#breaking-changes
In your case, I guess your number_of_replicas is set to 1.
Setting it to 0 or 0-1 (auto-expand) will solve this.
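For example, a minimal sketch of the fix (the table name is taken from your error message; '0-1' lets CrateDB expand replicas automatically if you add nodes later):
ALTER TABLE my_table SET (number_of_replicas = '0-1');
-- or, to drop replicas entirely on a single-node cluster:
ALTER TABLE my_table SET (number_of_replicas = 0);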

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads roughly 70K records from the DB into a DataFrame; each record has 2 fields.
After reading the data from the DB, we apply a minor mapping and load the result as a broadcast variable for later use.
Now, in the local environment, we get an exception, a timeout from the RetryingBlockFetcher, while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with a local "spark.master".
When I limit the number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configuration in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped; I just got the same timeout after a longer wait.
I realized that the DataFrame I'm reading from the DB has a single partition of 62K records. After repartitioning it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
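For reference, the workaround looks roughly like the sketch below; the column names come from the snippet above, and the partition count of 4 is just an illustrative value, not something we tuned:
// Repartition before collecting so the single 62K-record partition is split up.
val mapping: Map[String, Long] = dataframe
  .select("id", "mapping_id")
  .repartition(4)
  .rdd
  .map(row => row.getString(0) -> row.getLong(1))
  .collectAsMap()
  .toMap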
Any idea why this solves the issue? Is there a Spark configuration that could solve this instead of repartitioning?
Thanks!

Ignore large or corrupt records when loading files with pig using PigStorage

I am seeing the following error when loading a large file using Pig.
java.io.IOException: Too many bytes before newline: 2147483971
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:123)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setupNewRecordReader(MRReaderMapReduce.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setSplit(MRReaderMapReduce.java:88)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.processSplitEvent(MRInput.java:631)
at org.apache.tez.mapreduce.input.MRInput.handleEvents(MRInput.java:590)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:732)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:106)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:809)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:748)
The command I am using is as follows:
LOAD 'file1.data' USING PigStorage('\u0001') AS (
    id:long,
    source:chararray
);
Is there any option that can be passed here to drop the record that is causing the issue and continue?
You can skip a certain number of records by using the following setting at the top of your Pig script:
set mapred.skip.map.max.skip.records = 1000;
From the documentation of this property: "The number of acceptable skip records surrounding the bad record PER bad record in mapper. The number includes the bad record as well. To turn the feature of detection/skipping of bad records off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. Whatever records (depends on application) get skipped are acceptable."
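Put together, a minimal sketch of the script might look like this (the relation name A is just illustrative, and depending on your Pig version the SET statement may expect the property and value separated by whitespace rather than '='):
-- illustrative sketch only: enable bad-record skipping, then load as before
set mapred.skip.map.max.skip.records 1000;

A = LOAD 'file1.data' USING PigStorage('\u0001') AS (
    id:long,
    source:chararray
);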

How to handle a failed write behind from the GridCacheWriteBehindStore?

GridCacheWriteBehindStore does not log the object type or its fields when a write to the underlying store fails. We accept that the cache and the underlying store may be out of sync when using write behind, but we NEED to know what failed.
A simple, and likely, example is when there is a NOT NULL constraint on a field in a database table and no such check exists in the Java layer.
Here is all you see. Note also that it is logged only at WARN level, which also seems wrong.
[WARN ] 2016-11-30 14:23:17.178 [flusher-0-#57%null%]
GridCacheWriteBehindStore - Unable to update underlying store:
o.a.i.cache.store.jdbc.CacheJdbcPojoStore#3a60c416
You're right, the exception is ignored there. This will be fixed in the upcoming 1.8: https://issues.apache.org/jira/browse/IGNITE-3770

Configuring SLURM so it requires the user to specify --account

I'm trying to figure out how to configure SLURM so that a user is required to specify --account when using the SLURM commands (salloc, sbatch, srun). Effectively I want to disable the default account behavior.
Has anyone out there found a simple way to do this?
I had the same requirement to force users to specify accounts and, after finding several ways to fulfill it with Slurm, I decided to revive this post with the shortest/easiest solution.
The Slurm Lua job submit plugin sees the job description before the default account is applied. Hence, you can install the slurm-lua package, add "JobSubmitPlugins=lua" to slurm.conf, restart the slurmctld, and test directly whether the account was defined via the job_submit.lua script (create the script wherever you keep your slurm.conf, typically in /etc/slurm/):
-- /etc/slurm/job_submit.lua to reject jobs with no account specified
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.account == nil then
        slurm.log_error("User %s did not specify an account.", job_desc.user_id)
        slurm.log_user("You must specify an account!")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

return slurm.SUCCESS
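For completeness, the slurm.conf side of this is the single line below; the restart command is an assumption for a systemd-managed slurmctld:
JobSubmitPlugins=lua
# then restart the controller, e.g.:
systemctl restart slurmctld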
Errors resulting from not specifying an account appear as follows:
# srun --pty bash
srun: error: You must specify an account!
srun: error: Unable to allocate resources: Unspecified error
# sbatch submit.slurm
sbatch: error: You must specify an account!
sbatch: error: Batch job submission failed: Unspecified error
These errors are also printed out to the slurmctld log so that you know what the resource allocation issue was for the particular job:
[2017-09-12T08:32:00.697] error: job_submit.lua: User 0 did not specify an account.
[2017-09-12T08:32:00.697] _slurm_rpc_submit_batch_job: Unspecified error
As an addendum, the Slurm Submit Plugins Guide is only moderately useful and you will probably be much better off simply examining the Lua job_submit plugin implementation for guidance.
One option is to set the AccountingStorageEnforce parameter to associations in slurm.conf.
AccountingStorageEnforce
This controls what level of association-based enforcement to impose on job submissions. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all for all things (except nojobs and nosteps, which must be requested as well).
By enforcing Associations no new job is allowed to run unless a corresponding association exists in the system. If limits are enforced, users can be limited by association to whatever job size or run time limits are defined.
Then, with the sacctmgr command, make sure the default account has no access to the defined partitions. Effectively, the users will be denied submission if they do not specify a valid account.
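A minimal sketch of the corresponding slurm.conf line (the association and partition setup itself is done with sacctmgr as described):
AccountingStorageEnforce=associations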
Another option is to write a custom submission plugin, which you can write in Lua. In that script, you can check whether the --account parameter was set and deny submission with a custom message if it was not.

Lucene Search Error Stack

I am seeing the following error when trying to search using Lucene (version 1.4.3). Any ideas as to why I could be seeing this and how to fix it?
Caused by: java.io.IOException: read past EOF
at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:195)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:55)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:106)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:43)
In this same environment I also see the following error:
Caused by: java.io.IOException: Lock obtain timed out:
Lock#/tmp/lucene-3ec31395c8e06a56e2939f1fdda16c67-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:58)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
The same code works in a test environment but not in production; I cannot identify any obvious differences between the two environments.
Either the file permissions are wrong (the process needs write permission on the index directory) or you are not able to access a locked file that the current process needs.
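A quick sketch of what could be checked on the production host (the index path is hypothetical; the lock file name is taken from the stack trace above):
# check that the user running the application can write to the index directory
ls -ld /path/to/lucene/index
# check the lock file reported in the exception; a lock left behind by another
# (or crashed) process will also cause "Lock obtain timed out"
ls -l /tmp/lucene-3ec31395c8e06a56e2939f1fdda16c67-write.lock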