When allowing bad records in BigQuery I do not receive the bad record number - google-bigquery

I'm using the BigQuery command line tool to upload these records:
{"name": "a"}
{"name1": "b"}
{"name": "c"}
.
➜ ~ bq load --source_format=NEWLINE_DELIMITED_JSON my_dataset.my_table ./names.json
This is the result I get:
Upload complete.
Waiting on bqjob_r7fc5650eb01d5fd4_000001560878b74e_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'my_dataset:bqjob...4e_1': JSON table encountered too many errors, giving up.
Rows: 2; errors: 1.
Failure details:
- JSON parsing error in row starting at position 5819 at file:
file-00000000. No such field: name1.
When I use bq --format=prettyjson show -j <jobId> I get:
{
  "status": {
    "errorResult": {
      "location": "file-00000000",
      "message": "JSON table encountered too many errors, giving up. Rows: 2; errors: 1.",
      "reason": "invalid"
    },
    "errors": [
      {
        "location": "file-00000000",
        "message": "JSON table encountered too many errors, giving up. Rows: 2; errors: 1.",
        "reason": "invalid"
      },
      {
        "message": "JSON parsing error in row starting at position 5819 at file: file-00000000. No such field: name1.",
        "reason": "invalid"
      }
    ],
    "state": "DONE"
  }
}
As you can see, I receive an error that tells me in which row the error occurred: Rows: 2; errors: 1.
Now I'm trying to allow bad records by using max_bad_records:
➜ ~ bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=3 my_dataset.my_table ./names.json
Here is what I receive:
Upload complete.
Waiting on bqjob_...ce1_1 ... (4s) Current status: DONE
Warning encountered during job execution:
JSON parsing error in row starting at position 5819 at file: file-00000000. No such field: name1.
When I use bq --format=prettyjson show -j <jobId> I get:
{
  ...
  "status": {
    "errors": [
      {
        "message": "JSON parsing error in row starting at position 5819 at file: file-00000000. No such field: name1.",
        "reason": "invalid"
      }
    ],
    "state": "DONE"
  },
}
When I check, it actually uploads the good records to the table and ignores the bad record,
but now I do not know which record had the error.
Is this a BigQuery bug?
Can it be fixed so that I also receive the record number when allowing bad records?

Yes, this is what max_bad_records does: if the number of errors is below max_bad_records, the load will succeed. The error message still tells you the start position of the failed line, 5819, and the file name, file-00000000. The file name is changed because you're doing an upload-and-load.
The earlier "Rows: 2; errors: 1" means 2 rows were parsed and 1 error was found. That is not necessarily the 2nd row in the file: a big file can be processed by many workers in parallel, so worker n starts processing at some position, parses two rows, and hits an error. It reports the same kind of message, and the "2" does not refer to the 2nd row of the whole file. It also wouldn't make sense for worker n to scan the file from the beginning just to figure out which line number it started at, so instead it reports the byte position at which the failing line starts.
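If you want to locate the offending record yourself, you can use that byte position directly. A minimal sketch against the original file (assuming the reported position is a byte offset from the start of ./names.json):

# Print the line that starts at byte offset 5819 (tail -c is 1-based, hence +1)
tail -c +5820 ./names.json | head -n 1

# Count the newlines before that offset; the bad record is that count + 1
head -c 5819 ./names.json | wc -l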

Related

Databricks spark_jar_task failed when submitted via API

I am using the Databricks API to submit a sample spark_jar_task.
My sample spark_jar_task request to calculate Pi:
"libraries": [
{
"jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar"
}
],
"spark_jar_task": {
"main_class_name": "org.apache.spark.examples.SparkPi"
}
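For context, one way such a one-off run can be submitted is via the runs/submit endpoint; a sketch of a full call (the host, token, run name, and cluster spec are assumptions, only the libraries and spark_jar_task parts come from the request above):

curl -X POST https://<databricks-instance>/api/2.0/jobs/runs/submit \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "run_name": "spark-pi-test",
    "new_cluster": {
      "spark_version": "6.4.x-scala2.11",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 1
    },
    "libraries": [
      { "jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar" }
    ],
    "spark_jar_task": {
      "main_class_name": "org.apache.spark.examples.SparkPi"
    }
  }'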
The Databricks stdout log prints the Pi value as expected:
....
(This session will block until Rserve is shut down) Spark package found in SPARK_HOME: /databricks/spark DATABRICKS_STDOUT_END-19fc0fbc-b643-4801-b87c-9d22b9e01cd2-1589148096455
Executing command, time = 1589148103046.
Executing command, time = 1589148115170.
Pi is roughly 3.1370956854784273
Heap
.....
Although the spark_jar_task prints the Pi value in the log, the job terminated with a FAILED status without stating the error. Below is the response of the API /api/2.0/jobs/runs/list/?job_id=23.
{
  "runs": [
    {
      "job_id": 23,
      "run_id": 23,
      "number_in_job": 1,
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "FAILED",
        "state_message": ""
      },
      "task": {
        "spark_jar_task": {
          "jar_uri": "",
          "main_class_name": "org.apache.spark.examples.SparkPi",
          "run_as_repl": true
        }
      },
      "cluster_spec": {
        "new_cluster": {
          "spark_version": "6.4.x-scala2.11",
          ......
          .......
Why did the job fail here? Any suggestions will be appreciated!
EDIT:
The error log says:
20/05/11 18:24:15 INFO ProgressReporter$: Removed result fetcher for 740457789401555410_9000204515761834296_job-34-run-1-action-34
20/05/11 18:24:15 WARN ScalaDriverWrapper: Spark is detected to be down after running a command
20/05/11 18:24:15 WARN ScalaDriverWrapper: Fatal exception (spark down) in ReplId-a46a2-6fb47-361d2
com.databricks.backend.common.rpc.SparkStoppedException: Spark down:
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:493)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
20/05/11 18:24:17 INFO ShutdownHookManager: Shutdown hook called
I found the answer in this post: https://github.com/dotnet/spark/issues/126
It looks like we shouldn't explicitly call
spark.stop()
when running as a jar in Databricks.

GCP Bigquery: Can't query stackdriver access logs exported in cloudstorage because invalid json field "@type"

I store the access log of a pixel image in a Cloud Storage bucket dev-access-log-bucket using the standard "sink",
so the files look like this: requests/2019/05/08/15:00:00_15:59:59_S1.json
and one line looks like this (I formatted the JSON, but it's on one line normally):
{
  "httpRequest": {
    "cacheLookup": true,
    "remoteIp": "93.24.25.190",
    "requestMethod": "GET",
    "requestSize": "224",
    "requestUrl": "https://dev-snowplow.legalstart.fr/one_pixel_image.png?user_id=0&action=purchase&product_id=0&money=10",
    "responseSize": "779",
    "status": 200,
    "userAgent": "python-requests/2.21.0"
  },
  "insertId": "w6wyz1g2jckjn6",
  "jsonPayload": {
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "statusDetails": "response_sent_by_backend"
  },
  "logName": "projects/tracking-pixel-239909/logs/requests",
  "receiveTimestamp": "2019-05-08T15:34:24.126095758Z",
  "resource": {
    "labels": {
      "backend_service_name": "",
      "forwarding_rule_name": "dev-yolaw-pixel-forwarding-rule",
      "project_id": "tracking-pixel-239909",
      "target_proxy_name": "dev-yolaw-pixel-proxy",
      "url_map_name": "dev-urlmap",
      "zone": "global"
    },
    "type": "http_load_balancer"
  },
  "severity": "INFO",
  "spanId": "7d8823509c2dc94f",
  "timestamp": "2019-05-08T15:34:23.140747307Z",
  "trace": "projects/tracking-pixel-239909/traces/bb55577eedd5797db2867931f8de9162"
}
All of this, once again, is standard GCP; I did not customize anything here.
So now I want to run some queries on it from BigQuery. I create a dataset and an external table configured like this:
External Data Configuration
Source URI(s) gs://dev-access-log-bucket/requests/*
Auto-detect schema true (note: I don't know why it shows true even though I've manually defined the schema)
Ignore unknown values true
Source format NEWLINE_DELIMITED_JSON
Max bad records 0
and the following manual schema:
timestamp DATETIME REQUIRED
httpRequest RECORD REQUIRED
httpRequest.requestUrl STRING REQUIRED
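For reference, a similar external table setup could also be created from the command line; a rough sketch (the dataset, table, and file names here are made up):

# Generate an external table definition over the exported log files,
# letting BigQuery auto-detect the schema
bq mkdef --autodetect --source_format=NEWLINE_DELIMITED_JSON \
  "gs://dev-access-log-bucket/requests/*" > logs_def.json

# The definition file also accepts "ignoreUnknownValues": true, matching the
# "Ignore unknown values" option above. Then create the external table:
bq mk --external_table_definition=logs_def.json my_dataset.access_logs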
When I run a query like this:
SELECT
timestamp
FROM
`path.to.my.table`
LIMIT
1000
I get:
Invalid field name "@type". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
How can I work around this without having to pre-process the logs to remove the "@type" field?

Windows image fails to pull with error "failed to register layer"

I have an image stored in Azure Container Registry and it fails to launch with the following errors. I would assume, since the second entry is backing off from pulling the image, that the image actually downloaded successfully, but there is no additional information available about the second error. The image is based on microsoft/dotnet-framework, which is in turn based on the windowsservercore image.
{
  "count": 58,
  "firstTimestamp": "2018-01-23T04:06:59+00:00",
  "lastTimestamp": "2018-01-23T12:37:39+00:00",
  "message": "pulling image \"cr.azurecr.io/id-poc:latest\"",
  "name": "Pulling",
  "type": "Normal"
},
{
  "count": 331,
  "firstTimestamp": "2018-01-23T04:07:15+00:00",
  "lastTimestamp": "2018-01-23T12:29:53+00:00",
  "message": "Back-off pulling image \"cr.azurecr.io/id-poc:latest\"",
  "name": "BackOff",
  "type": "Normal"
},
{
  "count": 1,
  "firstTimestamp": "2018-01-23T11:52:33+00:00",
  "lastTimestamp": "2018-01-23T11:52:33+00:00",
  "message": "Failed to pull image \"cr.azurecr.io/id-poc:latest\": failed to register layer: re-exec error: exit status 1: output: failed in Win32: The system cannot find the file specified. (0x2) \nhcsshim::ImportLayer failed in Win32: The system cannot find the file specified. (0x2)",
  "name": "Failed",
  "type": "Warning"
},
(this continues 10 more times)
Edit 2: It turns out it was due to the base image being unsupported Windows 1709.
Edit: Thanks for the clarification; I made an assumption that it was 1709 there. Let me look into it!
Unfortunately, Windows 1709 images are not supported right now on Azure Container Instances. This will be fixed in the April time frame. In the meantime, please use Windows images based on other versions as a workaround. Apologies for that; we're working on it!
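If you want to check which Windows build an image targets before deploying it to ACI, one option (a sketch, assuming the image is available on a local Docker host) is:

# Windows images report the build they were built against;
# 10.0.16299.* corresponds to version 1709, 10.0.14393.* to Windows Server 2016
docker inspect --format '{{.OsVersion}}' cr.azurecr.io/id-poc:latest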

How to get most recent elasticsearch results using search API

I have an Elasticsearch cluster and am trying to query it using the RESTful Search API. My query returned the oldest results, but I wanted the newest, so I added a range filter:
curl -XGET 'https://cluster.com/_search' -d '{
  "from": 0, "size": 10000,
  "range" : {
    "@timestamp" : {
      "gt": "now-1h"
    }
  }
}'
But I get the following error
"error":"SearchPhaseExecutionException[Failed to execute phase [query],.....Parse Failure [Failed to parse source.........Parse Failure [No parser for element [range]]]
I've tried using @timestamp, timestamp, and _timestamp as the field name, but that didn't work. I've also confirmed that it is the range option that is causing the request to fail.
Any help would be appreciated.
Your query is not formatted correctly; you are missing a "query" level:
curl -XGET 'https://cluster.com/_search' -d '{
  "from": 0, "size": 10000,
  "query": {
    "range" : {
      "@timestamp" : {
        "gt": "now-1h"
      }
    }
  }
}'
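If the goal is specifically to get the newest documents first (rather than just restricting to the last hour), a sort on the timestamp field can be added as well; a sketch, assuming the field is @timestamp as above:

curl -XGET 'https://cluster.com/_search' -d '{
  "from": 0, "size": 10000,
  "sort": [ { "@timestamp": { "order": "desc" } } ],
  "query": {
    "range": { "@timestamp": { "gt": "now-1h" } }
  }
}'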

AWS xray put trace segment command return error

I am trying to send a segment document manually using the CLI, following the example on this page: https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sendingdata.html#xray-api-segments
I created my own trace ID and also the start and end times.
The commands I used are:
> DOC='{"trace_id": "'$TRACE_ID'", "id": "6226467e3f841234", "start_time": 1581596193, "end_time": 1581596198, "name": "test.com"}'
> echo $DOC
{"trace_id": "1-5e453c54-3dc3e03a3c86f97231d06c88", "id": "6226467e3f845502", "start_time": 1581596193, "end_time": 1581596198, "name": "test.com"}
> aws xray put-trace-segments --trace-segment-documents $DOC
{
  "UnprocessedTraceSegments": [
    {
      "ErrorCode": "ParseError",
      "Message": "Invalid segment. ErrorCode: ParseError"
    },
    {
      "ErrorCode": "MissingId",
      "Message": "Invalid segment. ErrorCode: MissingId"
    },
    {
      "ErrorCode": "MissingId",
      "Message": "Invalid segment. ErrorCode: MissingId"
    },
    .................
The put-trace-segments call keeps giving me errors. The segment document complies with the JSON schema too. Am I missing something else?
Thanks.
I needed to enclose the JSON in double quotes. The command that worked for me was: aws xray put-trace-segments --trace-segment-documents "$DOC"
Without the quotes, the shell splits the JSON on whitespace and passes each fragment to the CLI as a separate segment document, which is why the API returned one ParseError and several MissingId errors. This is probably due to an error in the documentation, or the X-Ray team was using a different kind of shell.
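You can see the quoting problem directly in the shell; a small sketch:

# Unquoted, the shell splits $DOC on whitespace, so each fragment becomes
# a separate (invalid) argument:
printf 'arg: %s\n' $DOC

# Quoted, the whole JSON document is passed as a single argument:
printf 'arg: %s\n' "$DOC"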