JSON loading error using Apache Beam on Vertex AI - google-bigquery

I am trying to load data from Datastore to BigQuery using Apache Beam in a Vertex AI notebook. This is the part of the code where the loading happens:
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
import apache_beam as beam
from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads

table_row = (p
             | 'DatastoreGetData' >> ReadFromDatastore(query=myquery)
             | 'EntityConversion' >> beam.Map(ent_to_json_func, table_schema)
             | 'Final' >> BigQueryBatchFileLoads(
                   destination=lambda row: f"myproject:dataset.mytable",
                   custom_gcs_temp_location=f'gs://myproject/beam',
                   write_disposition='WRITE_TRUNCATE',
                   schema=table_schema
               )
             )
table_schema is the JSON version of the BigQuery table schema (column mapping pic attached below).
ent_to_json_func converts the fields coming from Datastore to the corresponding BigQuery fields in the correct format.
I am trying to load just one row from Datastore, and it is giving an error. The data looks like this:
{ "key": { "namespace": null, "app": null, "path":
"Table/12345678", "kind": "Mykind", "name": null, "id":
12345678 }, "col1": false, "col2": { "namespace": null,
"app": null, "path": "abc/12345", "kind": "abc", "name":
null, "id": 12345 }, "col3": "6835218432", "col4": {
"namespace": null, "app": null, "path": null, "kind":
null, "name": null, "id": null }, "col5": false,
"col6": null, "col7": "https://www.somewebsite.com/poi/",
"col8": "0.00", "col9": "2022-03-12 03:44:17.732193+00:00",
"col10":
"{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", "provided":"Yes","someid2":"SDFTYI1090"}",
"col11": ""0.00"", "col12": "{}", "col13": [] }
The column mapping is here
The error is as follows:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in process(self, windowed_value)
1416 try:
-> 1417 return self.do_fn_invoker.invoke_process(windowed_value)
1418 except BaseException as exn:
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in invoke_process(self, windowed_value, restriction, watermark_estimator_state, additional_args, additional_kwargs)
837 self._invoke_process_per_window(
--> 838 windowed_value, additional_args, additional_kwargs)
839 return residuals
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in _invoke_process_per_window(self, windowed_value, additional_args, additional_kwargs)
982 windowed_value,
--> 983 self.process_method(*args_for_process, **kwargs_for_process),
984 self.threadsafe_watermark_estimator)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py in process(self, element, dest_ids_list)
753 # max_retries to 0.
--> 754 self.bq_wrapper.wait_for_bq_job(ref, sleep_duration_sec=10, max_retries=0)
755
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py in wait_for_bq_job(self, job_reference, sleep_duration_sec, max_retries)
637 'BigQuery job {} failed. Error Result: {}'.format(
--> 638 job_reference.jobId, job.status.errorResult))
639 elif job.status.state == 'DONE':
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_187_4cab298bbd73af86496c64ca35602a05_a5309204fb004ae0ba8007ac2169e079 failed.
Error Result: <ErrorProto
location: 'gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
File: gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
reason: 'invalid'>
Please let me know what to do. How can I determine the exact cause of the error? I have also validated this JSON with a JSON validator, and it reports no issue.
UPDATE:
I found out that the issue is due to a BYTES column. A bytes value comes from Datastore, which I am converting to a string using decode and saving in the JSON. When I upload that JSON into BigQuery, it gives an error.
How should I proceed in this case?

To solve your issue you have to set the STRING type for COL10, COL11, COL12 and COL13 in the BigQuery table.
The final Dict in your PCollection needs to match the schema of the BigQuery table exactly.
In your JSON I see these columns as strings, so your schema also needs to use the STRING type for these columns.
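A minimal sketch of the relevant part of such a schema, assuming table_schema is the usual dict with a 'fields' list that Beam's BigQuery sinks accept (the column names come from the question, everything else here is illustrative):

table_schema = {
    'fields': [
        # ... the other columns of the table ...
        {'name': 'col10', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'col11', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'col12', 'type': 'STRING', 'mode': 'NULLABLE'},
        # col13 is shown as [] in the sample row, so REPEATED is a guess here.
        {'name': 'col13', 'type': 'STRING', 'mode': 'REPEATED'},
    ]
}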

The error message is accurate. The file that was loaded into BigQuery is not a valid JSON string; see "{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", where the inner quotes are not escaped.
Loading into a BYTES type should actually be fine.
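On the conversion side, here is a minimal sketch of what ent_to_json_func could do to avoid both problems (this is not the asker's actual function, and it assumes the v1new Entity exposes its values via a plain .properties dict): dict-valued properties are serialized with json.dumps so the inner quotes get escaped, and bytes values destined for a BYTES column are base64-encoded, which is what BigQuery expects in a JSON load file.

import base64
import json

def ent_to_json_func(entity, table_schema):
    # Sketch only: build the row dict that the load job will read back as one JSON line.
    row = {}
    for name, value in entity.properties.items():
        if isinstance(value, bytes):
            # BYTES columns expect base64-encoded strings in a JSON load file,
            # not the result of value.decode().
            row[name] = base64.b64encode(value).decode('ascii')
        elif isinstance(value, dict):
            # json.dumps escapes the inner quotes, producing
            # "{\"someid\": \"NAME\", ...}" instead of the invalid "{"someid": "NAME", ...}".
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row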

Related

Extract information from a column

I have the below information in a temp table, where one of the columns (I think it is formatted as JSON) contains the data shown below. I'm trying to extract information from that column, such as path, brand, color, or activity_group, into different columns, i.e. split this information up.
{
"fragment": null,
"host": null,
"parameters": null,
"path": "\"Fashion\",\"activity_group\":\"Fleece - Hoods / Pants\",\"brand\":\"MICHAEL Michael Kors\",\"budget_curve\":\"Crew Sweaters\",\"category\":\"Clothing\",\"category_id\":\"Clothing\",\"color\":\"Pearl Hethr 036\",\"image_url\":\"https://images.sportsdirect.com/images/products/66003602_l.jpg\",\"is_full_price\":true,\"name\":\"Logo Tape Sweatshirt\",\"price\":58,\"price_excl\":48.33,\"price_excl_gbp\":48.33,\"price_gbp\":58,\"product_discount\":0,\"product_discount_gbp\":0,\"product_id\":\"660036\",\"quantity\":1,\"sku\":\"66003602390\",\"sub_category\":\"Crew Sweaters\",\"subtotal\":48.33,\"subtotal_gbp\":48.33,\"tax\":9.67,\"tax_gbp\":9.67,\"total\":58,\"total_gbp\":58,\"variant\":\"12 (M)\"},{\"activity\":\"Fashion\",\"activity_group\":\"Leggings\",\"brand\":\"MICHAEL Michael Kors\",\"budget_curve\":\"Leggings\",\"category\":\"Clothing\",\"category_id\":\"Clothing\",\"color\":\"Pearl Hthr 036\",\"image_url\":\"https://images.sportsdirect.com/images/products/67601302_l.jpg\",\"is_full_price\":false,\"name\":\"Logo Tape Leggings\",\"price\":50,\"price_excl\":41.67,\"price_excl_gbp\":41.67,\"price_gbp\":50,\"product_discount\":35,\"product_discount_gbp\":35,\"product_id\":\"676013\",\"quantity\":1,\"sku\":\"67601302390\",\"sub_category\":\"Leggings\",\"subtotal\":41.67,\"subtotal_gbp\":41.67,\"tax\":8.33,\"tax_gbp\":8.33,\"total\":50,\"total_gbp\":50,\"variant\":\"12 (M)\"}]",
"port": null,
"query": null,
"scheme": "[{\"activity\""
}
I tried to use parse_url and parse_json, however I am not sure I am using them correctly. Can someone advise what code I can use to parse this instead?
table name: order_dupe_check_cleaned
column name: PRODUCTS_NEED_CHECK
This looks like a JSON string. Use the Python json library to extract the information from the column.
import json
data = '{"fragment": null, "host": null, "parameters": null, "path": "", "Fashion","activity_group":"Fleece - Hoods / Pants","brand":"The North Face","budget_curve":"Full Zip Fleece Tops","category":"Clothing","category_id":"Clothing","color":"JK3 Black","image_url":"https://images.sportsdirect.com/images/products/55435003_l.jpg","is_full_price":true,"name":"Men’s 100 Glacier Full-Zip "scheme": "[{"activity"}'
try:
    data_dict = json.loads(data)
    brand = data_dict['brand']
    color = data_dict['color']
    activity_group = data_dict['activity_group']
    print(brand)
    print(color)
    print(activity_group)
except json.decoder.JSONDecodeError as e:
    print("There is an error in the JSON string: ", e)
This will print the brand, color, and activity_group values. You can then store the data in separate columns.
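As a further sketch, assuming the column value really is a JSON array that a URL parser split across the scheme and path keys (which is how the data above looks, but is not certain), the pieces can be glued back together, parsed, and spread into separate columns, for example with pandas. The scheme and path fragments below are shortened, illustrative stand-ins for the real values shown above:

import json
import pandas as pd

# Illustrative stand-ins for the real scheme/path values of PRODUCTS_NEED_CHECK:
scheme = '[{"activity"'
path = '"Fashion","brand":"MICHAEL Michael Kors","color":"Pearl Hethr 036","activity_group":"Fleece - Hoods / Pants"}]'

products = json.loads(scheme + ':' + path)   # glue the split pieces back into a JSON array
df = pd.DataFrame(products)                  # one row per product, one column per key
print(df[['brand', 'color', 'activity_group']])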

How to load a jsonl file into BigQuery when the file has mix data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
    {
        "fields": [
            {
                "meta": {
                    "app_type": "ios"
                },
                "name": "app_id",
                "value": 100
            },
            {
                "meta": {},
                "name": "country",
                "value": "AE"
            },
            {
                "meta": {
                    "name": "Top"
                },
                "name": "position",
                "value": 1
            }
        ],
        "metrics": {
            "click": 1,
            "price": 1,
            "count": 1
        }
    }
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:

app_type | app_id | country | position | click | price | count
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema on BigQuery to achieve this result, or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations when loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you handle the conversion of the JSON files to jsonl files and their storage to GCS, you will have to do some preprocessing.
You probably have two options:
precreate the target table with an app_id field defined as an INTEGER
preprocess the json file and enclose 100 in quotes, like "100" (a preprocessing sketch follows below)
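A minimal preprocessing sketch, assuming the API response always has the fields/metrics shape shown in the question and that every value should be written out as a string; the file names and the special handling of app_id and position are assumptions, not part of the original answer:

import json

def flatten_record(record):
    # Collapse the "fields" name/value pairs and the "metrics" map into one flat dict.
    flat = {}
    for field in record["fields"]:
        name = field["name"]
        meta = field.get("meta", {})
        if name == "app_id":
            # The desired table also keeps the app_type from the field's metadata.
            flat["app_type"] = meta.get("app_type")
            flat["app_id"] = str(field["value"])
        elif name == "position":
            # In the desired table "position" holds the label from meta ("Top"), not the numeric value.
            flat["position"] = meta.get("name")
        else:
            flat[name] = str(field["value"])
    for metric, value in record["metrics"].items():
        flat[metric] = str(value)
    return flat

# Write one JSON object per line (newline-delimited JSON) before uploading to GCS.
with open("response.json") as src, open("flat.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(flatten_record(record)) + "\n")

Running this on the sample above yields {"app_type": "ios", "app_id": "100", "country": "AE", "position": "Top", "click": "1", "price": "1", "count": "1"}, which matches the jsonl line shown earlier.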

Problems matching a long value in Rest Assured json body

I have the following response:
[
{
"id": 53,
"fileUri": "abc",
"filename": "abc.jpg",
"fileSizeBytes": 578466,
"createdDate": "2018-10-15",
"updatedDate": "2018-10-15"
},
{
"id": 54,
"fileUri": "xyz",
"filename": "xyz.pdf",
"fileSizeBytes": 88170994,
"createdDate": "2018-10-15",
"updatedDate": "2018-10-15"
}
]
and I am trying to match the id value to the object in JUnit like so:
RestAssured.given() //
.expect() //
.statusCode(HttpStatus.SC_OK) //
.when() //
.get(String.format("%s/%s/file", URL_BASE, id)) //
.then() //
.log().all() //
.body("", hasSize(2)) //
.body("id", hasItems(file1.getId(), file2.getId()));
But when the match occurs it tries to match an int to a long. Instead I get this output:
java.lang.AssertionError: 1 expectation failed.
JSON path id doesn't match.
Expected: (a collection containing <53L> and a collection containing <54L>)
Actual: [53, 54]
How does one tell Rest Assured that the value is indeed a long even though it might be short enough to fit in an int? I can cast the file's id to an int and it works, but that seems sloppy.
The problem is that when converting from JSON to a Java type, the int type is selected.
One solution is to compare int values:
instead of
.body("id", hasItems(file1.getId(), file2.getId()));
use
.body("id", hasItems(new Long(file1.getId()).intValue(), new Long(file2.getId()).intValue()));

Nested pymongo queries (mlab)

I have some documents in mlab mongodb; the format is:
{
"_id": {
"$oid": "58aeb1d074fece33edf2b356"
},
"sensordata": {
"operation": "chgstatus",
"user": {
"status": "0",
"uniqueid": "191b117fcf5c"
}
},
"created_date": {
"$date": "2017-02-23T15:26:29.840Z"
}
}
database name : mparking_sensor
collection name : sensor
I want to query in Python to extract only the status key-value pair and the created_date key-value pair.
My Python code is:
import sys
import pymongo

uri = 'mongodb://thorburn:tekush1!#ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database().sensor
print(db)
results = db.find()
for record in results:
    print(record["sensordata"], record['created_date'])
    print()
client.close()
which gives me everything under sensordata as expected, but dot notation gives me an error. Can somebody help?
PyMongo represents BSON documents as Python dictionaries, and subdocuments as dictionaries within dictionaries. To access a value in a nested dictionary:
record["sensordata"]["user"]["status"]
So a complete print statement might be:
print("%s %s" % (record["sensordata"]["user"]["status"], record['created_date']))
That prints:
0 {'$date': '2017-02-23T15:26:29.840Z'}
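If you only want the status and created_date fields returned by the server in the first place, a projection in find() is another option; a small sketch, using the field paths from the document shown above:

results = db.find(
    {},  # no filter: match every document
    {"sensordata.user.status": 1, "created_date": 1, "_id": 0},  # projection: only these fields
)
for record in results:
    print(record["sensordata"]["user"]["status"], record["created_date"])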

bigquery bug: do not receive all bad records when uploading data

I'm trying to upload data to a BigQuery table.
Here's the table schema:
[{
"name": "temp",
"type": "STRING"
}]
Here is the file I'm uploading:
{"temp" : "0"}
{"temp1" : "1"}
{"temp2" : "2"}
{"temp3" : "3"}
{"temp4" : "4"}
{"temp5" : "5"}
{"temp6" : "6"}
{"temp7" : "7"}
{"temp" : "8"}
{"temp" : "9"}
Here is the bq command for the upload, with bad records allowed:
bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=100 mydataset.mytable ./tmp.json
I receive:
Upload complete.
Waiting on bqjob_123.._1 ... (2s) Current status: DONE
Warnings encountered during job execution:
JSON parsing error in row starting at position 15 at file: file-00000000. No such field: temp1.
JSON parsing error in row starting at position 31 at file: file-00000000. No such field: temp2.
JSON parsing error in row starting at position 47 at file: file-00000000. No such field: temp3.
JSON parsing error in row starting at position 63 at file: file-00000000. No such field: temp4.
JSON parsing error in row starting at position 79 at file: file-00000000. No such field: temp5.
Now I'm using:
bq --format=prettyjson show -j <jobId>
and this is what I get (I copied here only relevant fields):
{
"configuration": {
...
"maxBadRecords": 100
}
,
"statistics": {
"load": {
"inputFileBytes": "157",
"inputFiles": "1",
"outputBytes": "9",
"outputRows": "3"
}
},
"status": {
"errors": [
{
"message": "JSON parsing error in row starting at position 15 at file: file-00000000. No such field: temp1.",
"reason": "invalid"
},
{
"message": "JSON parsing error in row starting at position 31 at file: file-00000000. No such field: temp2.",
"reason": "invalid"
},
{
"message": "JSON parsing error in row starting at position 47 at file: file-00000000. No such field: temp3.",
"reason": "invalid"
},
{
"message": "JSON parsing error in row starting at position 63 at file: file-00000000. No such field: temp4.",
"reason": "invalid"
},
{
"message": "JSON parsing error in row starting at position 79 at file: file-00000000. No such field: temp5.",
"reason": "invalid"
}
],
"state": "DONE"
}
}
Now when I go to my table I actually have 3 new records (which matches the outputRows: 3 field):
{"temp" : "0"}
{"temp" : "8"}
{"temp" : "9"}
Now these are my questions:
As you can see I had 6 bad records but I received only 5 of them; I didn't receive the temp6 error. I tried uploading files with more bad records and always receive only 5. Is this a BigQuery bug?
Assuming my records are larger and I upload many records with errors allowed, how can I know after uploading which records were the bad ones? I need to know which records weren't loaded into BigQuery.
All I get is "JSON parsing error in row starting at position 15 at file...". The position doesn't tell me much. Why can't I receive the number of the record? Or is there a way to calculate the record number from the position?
We only return the first 5 errors, as we don't want to make the reply too big.
As I explained in another thread, BigQuery is designed to process large files fast by processing them in parallel. If the file is 1GB, we might create hundreds of workers, and each worker processes a chunk of the file. If a worker is processing the last 10MB of the file and finds a bad record, then to know the number of this record it would need to read all the previous 990MB. Thus every worker just reports the start position of the bad record.
Some editors support seeking to an offset in a file. In vim, 1000go will move to position 1000. In less, it's 1000P. A small script that maps a byte offset back to a record number is sketched below.
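As a sketch that is not part of the original answer, a byte offset from the error message can be mapped back to a 1-based record number by counting the newlines that precede it in the original newline-delimited file (this assumes you still have the exact file that was loaded):

def record_number_at(path, byte_offset):
    # Count how many complete lines end before the reported byte offset;
    # the bad record is the one that starts right at that offset.
    with open(path, "rb") as f:
        prefix = f.read(byte_offset)
    return prefix.count(b"\n") + 1

# "position 15" from the error output points at the second line of tmp.json:
print(record_number_at("./tmp.json", 15))  # -> 2, i.e. the {"temp1" : "1"} record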
As I explained in another thread, BigQuery is designed to process large files fast by processing them in parallel. If the file is 1GB, we might create hundreds of workers and each worker processes a chunk of the file. If a worker is processing the last 10MB of the file and found a bad record, to know the number of this record it needs to read all the previous 990MB. Thus every worker just report the start position of the bad record. Some editors support seeking to a offset in a file. In vim, 1000go will move to position 1000. In less, it's 1000P.