Converting Cypher queries to Gremlin

I am converting Cypher queries to Gremlin with the help of the Cypher for Gremlin project.
I followed all the steps to configure it, but I am facing the issue below when running Cypher queries.
301195 [gremlin-server-worker-1] INFO org.opencypher.gremlin.server.op.cypher.CypherOpProcessor - Cypher: MATCH (n) RETURN n
301209 [gremlin-server-worker-1] INFO org.opencypher.gremlin.server.op.cypher.CypherOpProcessor - Gremlin: g.V().project('n').by(__.valueMap().with('~tinkerpop.valueMap.tokens'))
301209 [gremlin-server-worker-1] WARN io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.NoSuchFieldError: scriptEvaluationTimeout
at org.opencypher.gremlin.server.op.cypher.CypherOpProcessor.handleIterator(CypherOpProcessor.java:197)
at org.opencypher.gremlin.server.op.cypher.CypherOpProcessor.lambda$evalCypher$0(CypherOpProcessor.java:132)
at org.opencypher.gremlin.server.op.cypher.CypherOpProcessor.inTransaction(CypherOpProcessor.java:146)
at org.opencypher.gremlin.server.op.cypher.CypherOpProcessor.evalCypher(CypherOpProcessor.java:132)
at org.apache.tinkerpop.gremlin.server.handler.OpExecutorHandler.channelRead0(OpExecutorHandler.java:67)
at org.apache.tinkerpop.gremlin.server.handler.OpExecutorHandler.channelRead0(OpExecutorHandler.java:43)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
Any help is appreciated from anyone who knows or has worked on this project.

Looking at the queries (both the Cypher and Gremlin) as well as the error, this appears to be an issue where the query timed out trying to return an answer and the Cypher for Gremlin library does not handle it gracefully.
More importantly, the query you are trying to run is not a good one to experiment with, as it is the equivalent of asking an RDBMS to return all rows from all tables. Even with a small graph and a fast database this query will take a while to return. I suggest you add some filtering criteria or a limit, such as these:
MATCH (n) RETURN n LIMIT 10
//substitute appropriate labels and property names
MATCH (n:foo) WHERE n.name='bar' RETURN n

Ignore large or corrupt records when loading files with pig using PigStorage

I am seeing the following error when loading in a large file using pig.
java.io.IOException: Too many bytes before newline: 2147483971
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:123)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setupNewRecordReader(MRReaderMapReduce.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setSplit(MRReaderMapReduce.java:88)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.processSplitEvent(MRInput.java:631)
at org.apache.tez.mapreduce.input.MRInput.handleEvents(MRInput.java:590)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:732)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:106)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:809)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:748)
The command I am using is as follows:
LOAD 'file1.data' using PigStorage('\u0001') as (
id:long,
source:chararray,
)
Is there any option that can be passed here to drop the record that is causing the issue and continue?
You can skip a certain number of records by putting the following setting at the top of your Pig script:
set mapred.skip.map.max.skip.records = 1000;
The documentation describes this setting as follows: The number of acceptable skip records surrounding the bad record PER bad record in mapper. The number includes the bad record as well. To turn the feature of detection/skipping of bad records off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. Whatever records (depends on application) get skipped are acceptable.
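Before relying on record skipping, it may help to confirm that the file really contains a record longer than the line reader accepts (the error reports 2147483971 bytes, which is above Integer.MAX_VALUE, 2147483647). The following is a minimal standalone Java sketch, not part of Pig or Hadoop; the class and method names are hypothetical. It streams a local copy of the file and reports the length in bytes of the longest newline-terminated record.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Pig or Hadoop.
class LongestRecordScanner {

    /** Returns the length in bytes of the longest record, excluding the '\n' terminator. */
    static long longestRecordBytes(Path file) throws IOException {
        long longest = 0;
        long current = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    longest = Math.max(longest, current);
                    current = 0;
                } else {
                    current++;
                }
            }
        }
        // Account for a final record that has no trailing newline.
        return Math.max(longest, current);
    }

    public static void main(String[] args) throws IOException {
        long longest = longestRecordBytes(Paths.get(args[0]));
        System.out.println("Longest record: " + longest + " bytes");
        if (longest > Integer.MAX_VALUE) {
            System.out.println("At least one record exceeds Integer.MAX_VALUE bytes");
        }
    }
}
```

If the longest record turns out to be close to the size of the whole file, the file is probably missing its newline delimiters altogether, and skipping individual records is unlikely to help.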

Solr DataImportHandler - batchSize="-1" does not work

I am using Apache Solr 6.4.1.
Because I am using a really big database (over 3 million rows), I would like to add batchSize="-1" in the db-data-config.xml.
But if I do this, it does not work. Without batchSize I can get the first 2k rows, then I get a "java.lang.RuntimeException: java.lang.StackOverflowError" error.
In solrconfig.xml:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
In db-data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://***:1433;integratedSecurity=true;
Initial Catalog=***;"
batchSize="-1"/>
...
Why doesn't batchSize="-1" work? (batchSize="200" or other values do work.)
UPDATE
If I set debug in the DataImportHandler to false, then it works!
I don't think that setting batchSize to '-1' would help in your situation. This is written inside the source code of the Solr DataImportHandler:
if (batchSize == -1)
batchSize = Integer.MIN_VALUE;
[... omissis ...]
Statement statement = c.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
statement.setFetchSize(batchSize);
So double check what kind of values the MS JDBC driver accepts for the setFetchSize method.
setFetchSize - Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement. If the value specified is zero, then the hint is ignored. The default value is zero.
So the driver is free to ignore this hint; maybe it is just reading in the whole table. You could also try changing the version of your JDBC driver.
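The mapping quoted above can be illustrated in isolation. The sketch below is a hypothetical standalone class, not Solr code: it reproduces the batchSize-to-fetch-size translation from the snippet and checks the result against the rows >= 0 condition stated in the java.sql.Statement.setFetchSize Javadoc. A driver that enforces that condition strictly would be expected to reject the value produced for batchSize="-1"; whether the Microsoft driver does so is an assumption to verify against its documentation.

```java
// Hypothetical illustration, not Solr source code.
class FetchSizeCheck {

    /** Mirrors the quoted DataImportHandler logic: -1 is translated to Integer.MIN_VALUE. */
    static int toFetchSize(int batchSize) {
        return batchSize == -1 ? Integer.MIN_VALUE : batchSize;
    }

    /** The Statement.setFetchSize Javadoc requires rows >= 0. */
    static boolean satisfiesJdbcContract(int fetchSize) {
        return fetchSize >= 0;
    }

    public static void main(String[] args) {
        for (int batchSize : new int[] {-1, 0, 200}) {
            int fetchSize = toFetchSize(batchSize);
            System.out.println("batchSize=" + batchSize
                    + " -> fetchSize=" + fetchSize
                    + ", within JDBC contract: " + satisfiesJdbcContract(fetchSize));
        }
    }
}
```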
I think you should first adapt the value to the network latency and the number of records you want to retrieve in each round trip.
Indexing performance and the load on the SQL Server both depend on the batch size. Try starting with a small size and then gradually increase it.
If this does not work, try switching to a different JDBC driver.
Returning to the batchSize parameter, there are only a few cases where you don't need it. Generally this is the behaviour the method should have:
if you have configured your JVM with enough memory to read the entire table
if your JDBC driver raises an exception when the setFetchSize() method is invoked
if you're dealing with the MySQL JDBC driver, which has a known bug

I am getting 'Local Search phase started with an uninitialized Solution' when I run on a larger dataset

I am developing a solver using OptaPlanner 6.1.0 for a problem similar to the Vehicle Routing Problem. When I run my solver on 700 installers and 200 bookings, it successfully solves the planning problem. But when I run it against a larger dataset (700 installers and 1220 bookings), I get
Caused by: java.lang.IllegalStateException: Local Search phase started with an uninitialized Solution. First initialize the Solution. For example, run a Construction Heuristic phase first.
but right before the exception,
16:10:40,378 INFO [DefaultConstructionHeuristicPhase] [http-listener-1(4)] Construction Heuristic phase (0) ended: step total (194), time spent (30693), best score (-1hard/-688803soft).
I am using <constructionHeuristicType>FIRST_FIT_DECREASING</constructionHeuristicType>
in my config.
Am I using it wrong?
Maybe the value range for a planning variable is empty. This is more likely with a value range provider on the entity. Feel free to file a JIRA issue asking for a better error message in such a case.
Diagnostic to-do: comment out the local search phase, run the solver (so it only does the construction heuristic), then iterate through the planning entities and print out the value of each planning variable. Check whether there are any nulls in there.
The fact that your CH has 194 steps instead of 200 indicates this. (If those other 6 planning entities were immovable, this exception would not be triggered, so that's not the problem.)
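The diagnostic described above can be sketched as follows. This is a plain-Java illustration under stated assumptions: Booking and its installer planning variable are hypothetical stand-ins for your own planning entity classes, and no OptaPlanner API is used. After running only the Construction Heuristic phase, the entities from the best solution would be passed to a check like this one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical domain class; substitute your own planning entity and variable.
class Booking {
    final String id;
    final String installer; // planning variable; null means "not assigned"

    Booking(String id, String installer) {
        this.id = id;
        this.installer = installer;
    }
}

class UninitializedEntityCheck {

    /** Returns the entities whose planning variable is still null. */
    static List<Booking> findUninitialized(List<Booking> bookings) {
        List<Booking> result = new ArrayList<>();
        for (Booking booking : bookings) {
            if (booking.installer == null) {
                result.add(booking);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Booking> bookings = Arrays.asList(
                new Booking("b1", "installer-7"),
                new Booking("b2", null),
                new Booking("b3", "installer-2"));
        for (Booking booking : findUninitialized(bookings)) {
            System.out.println("Uninitialized planning entity: " + booking.id);
        }
    }
}
```

Any entity reported here is a candidate for an empty value range, which is worth checking first.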

Read Timed Out: synchronous query via BigQuery Java API

We are using the BigQuery Java API to retrieve results for our analytics reporting frontend. We are trying to retrieve the results synchronously. We often get a "Read timed out" error, well before the query timeout specified in the parameters. Here's the stack trace for a sample failure:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:331)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:830)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:787)
at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
I am not able to retrieve the job id of the resulting job because the error occurs before I can retrieve a JobReference object. The timeout specified in this case was 300 sec, and the query failed well before that. The query contains three JOINs and several GROUP EACH BY clauses. Can you suggest a possible way to debug this?
Adding the code snippet:
QueryRequest queryInfo = new QueryRequest().setQuery(sql)
.setTimeoutMs(timeOutInSec * 1000);
// get project id
BQGameConnectionDetails details = Config
.getBQConnectionDetails(gameId);
String projectId = details.getProjectId();
Bigquery.Jobs.Query queryRequest = getInstance(gameId).jobs()
.query(projectId, queryInfo);
QueryResponse response = queryRequest.execute();
There are two timeouts involved. The first is the timeout on the HTTP request you send to BigQuery. The second is the BigQuery query timeout. It sounds like you've set the latter to a large value, but the former is likely the one you're hitting. If the HTTP request times out before the BigQuery timeout, the connection is closed and BigQuery doesn't get a chance to respond.
There are two options. The first is to increase the HTTP request timeout (how to do this depends on the libraries you're using). The second is to decrease the BigQuery timeout. This means you'll have to use jobs.getQueryResults() to read the actual results, but it is the more robust method: no matter how long the query takes, you can just call getQueryResults() in a loop. I would post a link to a good Java sample that does this, but unfortunately I don't know that one exists.
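The loop described above might look roughly like this. The sketch deliberately avoids the BigQuery client classes: ResultFetcher is a hypothetical interface standing in for a single call such as jobs().getQueryResults(), returning an empty Optional while the job is still running. The names, the attempt limit, and the sleep interval are all assumptions.

```java
import java.util.Optional;

// Hypothetical stand-in for one "fetch results" call; empty means "job not complete yet".
interface ResultFetcher<T> {
    Optional<T> fetch() throws Exception;
}

class QueryResultPoller {

    /**
     * Calls the fetcher until it returns a result or maxAttempts is reached.
     * Each individual call can use a short timeout, so no single HTTP request
     * has to stay open for the full duration of the query.
     */
    static <T> T pollUntilDone(ResultFetcher<T> fetcher, int maxAttempts, long sleepMillis)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = fetcher.fetch();
            if (result.isPresent()) {
                return result.get();
            }
            Thread.sleep(sleepMillis);
        }
        throw new IllegalStateException("Query did not complete after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake fetcher that reports completion on the third call.
        ResultFetcher<String> fake = () -> {
            calls[0]++;
            return calls[0] < 3 ? Optional.<String>empty() : Optional.of("rows");
        };
        String rows = pollUntilDone(fake, 10, 100L);
        System.out.println(rows + " after " + calls[0] + " calls");
    }
}
```

In real code the fetcher body would wrap the client call and return the response only once it reports the job as complete.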

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:
if (err != null) {
log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
try {
log.error(err.toPrettyString());
}
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted as a NEWLINE_DELIMITED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seem to work correctly, as separate Java code to list the projects and the datasets in a project works correctly. So I just need help working out what the actual error is: does it not like the schema (I have records nested within records, for instance), or does it think that there is an error in the data I am submitting?
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(projectId, jobId), which returns a job object whose errorResult object holds the error that caused the job to fail (e.g. "too many errors"). The errorStream list is a list of all of the errors on the job, which should tell you which lines hit errors.
Note that if you have the job id, it may be easier to use bq to look up the job: you can run bq show -j <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
A hint you also might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error) you can look up the job to see what actually happened.
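Supplying your own job id could look like the sketch below. JobIds and the prefix are hypothetical; the only facts relied on are that BigQuery job ids may contain letters, digits, underscores and dashes, and are limited to 1,024 characters. The generated value would then be set on the job's JobReference before calling insert(), so the same id can be used to look the job up afterwards.

```java
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical helper for client-generated job ids.
class JobIds {

    // Letters, digits, underscores and dashes, up to 1,024 characters.
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_-]{1,1024}");

    /** Builds a unique job id of the form "<prefix>_<uuid>" and validates it. */
    static String newJobId(String prefix) {
        String id = prefix + "_" + UUID.randomUUID();
        if (!VALID.matcher(id).matches()) {
            throw new IllegalArgumentException("Invalid job id: " + id);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(newJobId("load_scans"));
    }
}
```

Logging the id before submitting the job means it is still available even when the insert() call itself fails.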
To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().