aws neptune bulk load parallelization - amazon-neptune

I am trying to insert 624,118,983 records divided into 1000 files, and it takes 35 hours to load them all into Neptune, which is very slow.
I have configured a db.r5.large cluster with 2 instances.
I have the 1000 files stored in an S3 bucket.
I have one load request pointing to the S3 bucket folder that contains the 1000 files.
When I get the load status, I get the response below.
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_NOT_STARTED" : 640
            },
            {
                "LOAD_IN_PROGRESS" : 1
            },
            {
                "LOAD_COMPLETED" : 358
            },
            {
                "LOAD_FAILED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://myntriplesfiles/ntriple-folder/",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_IN_PROGRESS",
            "totalTimeSpent" : 26870,
            "startTime" : 1639289761,
            "totalRecords" : 224444549,
            "totalDuplicates" : 17295821,
            "parsingErrors" : 1,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
What I see here is that LOAD_IN_PROGRESS is always 1, which means Neptune is not loading multiple files in parallel.
How do I tell Neptune to load the 1000 files with some degree of parallelism, for example a parallelization factor of 10?
Am I missing any configuration?
This is how I use the bulk load API:
curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
    "source" : "s3://myntriplesfiles/ntriple-folder/",
    "format" : "nquads",
    "iamRoleArn" : "my aws arn values goes here",
    "region" : "us-east-2",
    "failOnError" : "FALSE",
    "parallelism" : "HIGH",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "FALSE"
}'
Please advise.

The Amazon Neptune bulk loader does not load multiple files in parallel, but it does divide up the contents of each file among the available worker threads on the writer instance (limited by how you have the parallelism property set on the load command). If you have no other writes pending during the load period, you can set that field to OVERSUBSCRIBE, which will use all available worker threads. Secondly, larger files are better than smaller files, as they give the worker threads more work that they can do in parallel. Thirdly, using a larger writer instance just for the duration of the load will provide a lot more worker threads that can take on load tasks. The number of worker threads available in an instance is approximately twice the number of vCPUs the instance has. Quite often, people will use something like a db.r5.12xlarge just for the bulk load (for large loads) and then scale back down to something a lot smaller for regular query workloads.
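For example, if the cluster is dedicated to the load, the same request could be submitted with the parallelism raised (a minimal sketch reusing the placeholders from the question; only the parallelism value changes):

curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
    "source" : "s3://myntriplesfiles/ntriple-folder/",
    "format" : "nquads",
    "iamRoleArn" : "my aws arn values goes here",
    "region" : "us-east-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "FALSE"
}'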

In addition to the above, gzip-compressing the files would help make the network reads faster; Neptune understands gzip-compressed files by default.
Also, queueRequest: TRUE can be set to achieve better results. Neptune can queue up to 64 load requests, so instead of sending only one request you can submit multiple load jobs and let them queue up. You can even configure dependencies among the load jobs if you have to, as in the sketch below. Ref: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html
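As a rough sketch of that approach (the per-file source and the load ID shown here are illustrative placeholders), you would submit one queued request per file or per S3 prefix, optionally chaining them with the dependencies parameter, which takes the loadId values returned by earlier requests:

curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
    "source" : "s3://myntriplesfiles/ntriple-folder/part-0001.nq",
    "format" : "nquads",
    "iamRoleArn" : "my aws arn values goes here",
    "region" : "us-east-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE",
    "dependencies" : ["loadId-returned-by-an-earlier-request"]
}'

Each call returns a loadId in its payload, which you can poll individually or pass in the dependencies array of a later request.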
You need to move to a bigger writer instance only in cases where CPU usage is consistently higher than 60%.

Related

Nextflow and Preemptible Machines in Google Life Sciences

Consider a nextflow workflow that uses the Google Life Sciences API and uses preemptible machines. The config might look like this:
google {
    project = "cool-name"
    region = "cool-region"
    lifeSciences {
        bootDiskSize = "200 GB"
        preemptible = true
    }
}
Let's say you have only a single process and this process has the directive maxRetries = 5. After five retries (i.e. after the 6th attempt), the process will be considered failed.
Is it somehow possible to specify in Nextflow that, after a certain number of unsuccessful retries, Nextflow should request a non-preemptible machine instead and continue retrying a couple more times?
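For reference, a minimal sketch of the retry setup described above (the process name and script are placeholders; maxRetries comes from the question, while errorStrategy 'retry' is the directive that enables retrying at all):

process my_task {
    errorStrategy 'retry'   // resubmit the task when it fails, e.g. because the VM was preempted
    maxRetries 5            // after five retries (the 6th attempt) the task counts as failed

    script:
    """
    echo "task work goes here"
    """
}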

Azure Function Apps - maintain max batch size with maxDequeueCount

I have the following host file:
{
    "version": "2.0",
    "extensions": {
        "queues": {
            "maxPollingInterval": "00:00:02",
            "visibilityTimeout": "00:00:30",
            "batchSize": 16,
            "maxDequeueCount": 3,
            "newBatchThreshold": 8
        }
    }
}
I would expect that with this setup there could never be more than batchSize + newBatchThreshold instances running. But I realized that when messages are dequeued they are run instantly rather than just being added to the back of the queue. This means you can end up with a very high number of instances, causing a lot of 429s (too many requests). Is there any way to configure the function app to just add the dequeued messages to the back of the queue?
It was not related to maxDequeueCount. The problem was that the app was on a Consumption plan, where you can't control the number of instances. After changing to a Standard plan it worked as expected.

Node AWS.S3 SDK upload timeout

The Node AWS SDK S3.upload method is not completing multipart uploads for some reason.
A readable stream that receives uploads from a browser is set as the Body (the readable stream can be piped to a file writable stream without any problems).
S3.upload is given the following options object:
{
    partSize: 1024 * 1024 * 5,
    queueSize: 1
}
When trying to upload a ~8.5 MB file, the file is completely sent from the browser, but the request returned from S3.upload continually fires 'httpUploadProgress' events that indicate that all bytes have been uploaded. The following is received continually until the error occurs:
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
RequestTimeout: Your socket connection to the server was not read from
or written to within the timeout period. Idle connections will be
closed.
The loaded field in the progress events shows that all of the bytes have been uploaded, but the upload never completes; the end event on the readable stream even fires.
Console logging in the SDK itself shows that S3.upload consumes all of the available data from the readable stream even when the part size is set to 5 MB and the queue size is set to 1.
Do the part size and queue size have an impact on proper usage of S3.upload? How can this problem be investigated further?
I had to use createMultipartUpload and uploadPart for my larger (8 MB) file upload.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#uploadPart-property
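A rough sketch of that approach with the v2 JavaScript SDK (the bucket, key, and buffer-based chunking below are illustrative and not the original code, which streamed from the browser):

// Manual multipart upload: createMultipartUpload -> uploadPart -> completeMultipartUpload.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function uploadInParts(bucket, key, buffer, partSize = 5 * 1024 * 1024) {
  // 1. Start the multipart upload and remember its UploadId.
  const { UploadId } = await s3.createMultipartUpload({ Bucket: bucket, Key: key }).promise();

  try {
    // 2. Upload each chunk as a numbered part (PartNumber starts at 1).
    const parts = [];
    for (let offset = 0, partNumber = 1; offset < buffer.length; offset += partSize, partNumber++) {
      const { ETag } = await s3.uploadPart({
        Bucket: bucket,
        Key: key,
        UploadId,
        PartNumber: partNumber,
        Body: buffer.slice(offset, offset + partSize),
      }).promise();
      parts.push({ ETag, PartNumber: partNumber });
    }

    // 3. Complete the upload by listing every part's ETag and number.
    return await s3.completeMultipartUpload({
      Bucket: bucket,
      Key: key,
      UploadId,
      MultipartUpload: { Parts: parts },
    }).promise();
  } catch (err) {
    // 4. Abort on failure so S3 does not keep the orphaned parts around.
    await s3.abortMultipartUpload({ Bucket: bucket, Key: key, UploadId }).promise();
    throw err;
  }
}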

Google Dataflow stalled after BigQuery outage

I have a Google Dataflow job running. The Dataflow job reads messages from Pub/Sub, enriches them, and writes the enriched data into BigQuery.
Dataflow was processing approximately 5000 messages per second. I am using 20 workers to run the Dataflow job.
Yesterday there seems to have been a BigQuery outage, so the step that writes the data to BigQuery failed. After some time, my Dataflow job stopped working.
I see 1000 errors like the one below:
(7dd47a65ad656a43): Exception: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
"reason" : "invalid"
} ],
"message" : "The project xx-xxxxxx-xxxxxx has not enabled BigQuery.",
"status" : "INVALID_ARGUMENT"
}
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:285)
com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:175)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2728)
com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2685)
com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:159)
com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:194)
com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:719)
Stack trace truncated. Please see Cloud Logging for the entire trace.
Please note that Dataflow did not recover even after BigQuery started working again; I had to restart the Dataflow job to make it work.
This causes data loss, not only at the time of the outage but also until I notice the error and restart the Dataflow job. Is there a way to configure the retry behavior so that the Dataflow job does not go stale in these cases?

Graphdb's loadrdf tool loads ontology and data very slow

I am using the GraphDB loadrdf tool to load an ontology and a fairly big dataset. I set pool.buffer.size=800000 and the JVM -Xmx to 24g. I tried both parallel and serial modes; both slow down once the total statements in the repo go over about 10k, and it eventually drops to 1 or 2 statements/second. Does anyone know whether this is normal behavior for loadrdf, or is there a way to optimize the performance?
Edit: I have increased tuple-index-memory. See part of my repository ttl configuration:
owlim:entity-index-size "45333" ;
owlim:cache-memory "24g" ;
owlim:tuple-index-memory "20g" ;
owlim:enable-context-index "false" ;
owlim:enablePredicateList "false" ;
owlim:predicate-memory "0" ;
owlim:fts-memory "0" ;
owlim:ftsIndexPolicy "never" ;
owlim:ftsLiteralsOnly "true" ;
owlim:in-memory-literal-properties "false" ;
owlim:transaction-mode "safe" ;
owlim:transaction-isolation "true" ;
owlim:disable-sameAs "true";
But somehow the process still slows down. It starts with "Global average rate: 1,402 st/s" but drops to "Global average rate: 20 st/s" after "Statements in repo: 61,831". I give my JVM -Xms24g -Xmx36g.
Can you please post your repository configuration? Inside it there is a parameter, tuple-index-memory, which determines the amount of changes (disk pages) that we are allowed to keep in memory. The bigger this value is, the fewer flushes we have to do.
Check whether this is set to a value like 20G in your setup and retry the process.
I've looked at your repository configuration ttl. There is a parameter, entity-index-size=45333, whose value needs to be increased, e.g. set it to 100 million (entity-index-size=100000000). The default value for that parameter in GraphDB 7 is 10M, but since you've set it explicitly, the default gets overridden.
You can read more about that parameter here
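Putting the two suggestions together, the relevant lines of the repository ttl would look something like this (values taken from the advice above, everything else left unchanged):

owlim:entity-index-size "100000000" ;
owlim:tuple-index-memory "20g" ;
owlim:cache-memory "24g" ;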