I have the information below in a temp table, where one of the columns (I think it is formatted as JSON) contains the following data. I'm trying to extract values from that column, such as path, brand, color, or activity_group, into separate columns, i.e. split this information.
{
"fragment": null,
"host": null,
"parameters": null,
"path": "\"Fashion\",\"activity_group\":\"Fleece - Hoods / Pants\",\"brand\":\"MICHAEL Michael Kors\",\"budget_curve\":\"Crew Sweaters\",\"category\":\"Clothing\",\"category_id\":\"Clothing\",\"color\":\"Pearl Hethr 036\",\"image_url\":\"https://images.sportsdirect.com/images/products/66003602_l.jpg\",\"is_full_price\":true,\"name\":\"Logo Tape Sweatshirt\",\"price\":58,\"price_excl\":48.33,\"price_excl_gbp\":48.33,\"price_gbp\":58,\"product_discount\":0,\"product_discount_gbp\":0,\"product_id\":\"660036\",\"quantity\":1,\"sku\":\"66003602390\",\"sub_category\":\"Crew Sweaters\",\"subtotal\":48.33,\"subtotal_gbp\":48.33,\"tax\":9.67,\"tax_gbp\":9.67,\"total\":58,\"total_gbp\":58,\"variant\":\"12 (M)\"},{\"activity\":\"Fashion\",\"activity_group\":\"Leggings\",\"brand\":\"MICHAEL Michael Kors\",\"budget_curve\":\"Leggings\",\"category\":\"Clothing\",\"category_id\":\"Clothing\",\"color\":\"Pearl Hthr 036\",\"image_url\":\"https://images.sportsdirect.com/images/products/67601302_l.jpg\",\"is_full_price\":false,\"name\":\"Logo Tape Leggings\",\"price\":50,\"price_excl\":41.67,\"price_excl_gbp\":41.67,\"price_gbp\":50,\"product_discount\":35,\"product_discount_gbp\":35,\"product_id\":\"676013\",\"quantity\":1,\"sku\":\"67601302390\",\"sub_category\":\"Leggings\",\"subtotal\":41.67,\"subtotal_gbp\":41.67,\"tax\":8.33,\"tax_gbp\":8.33,\"total\":50,\"total_gbp\":50,\"variant\":\"12 (M)\"}]",
"port": null,
"query": null,
"scheme": "[{\"activity\""
}
I tried to use parse_url and parse_json; however, I am not sure I am using them correctly. Can someone advise what I should use to parse this instead?
table name: order_dupe_check_cleaned
column name: PRODUCTS_NEED_CHECK
This looks like a JSON string. Use the Python json library to extract the information from the column.
import json

# Trimmed to a single, valid product object for illustration; the real column value
# would need to be cleaned up before it parses as JSON.
data = '{"activity": "Fashion", "activity_group": "Fleece - Hoods / Pants", "brand": "The North Face", "color": "JK3 Black", "name": "Men’s 100 Glacier Full-Zip"}'

try:
    data_dict = json.loads(data)
    brand = data_dict['brand']
    color = data_dict['color']
    activity_group = data_dict['activity_group']
    print(brand)
    print(color)
    print(activity_group)
except json.decoder.JSONDecodeError as e:
    print("There is an error in the JSON string: ", e)
This will print the brand, color, and activity_group values. You can then store the data in separate columns.
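If the cleaned-up column value is a JSON array of product objects (as in the original data), a minimal sketch along these lines, assuming the value already parses as valid JSON, collects brand, color, and activity_group for every product so they can be written out as separate columns:

import json

# Hypothetical cleaned-up value: a JSON array of product objects (trimmed for brevity).
products_json = '''[
    {"brand": "MICHAEL Michael Kors", "color": "Pearl Hethr 036", "activity_group": "Fleece - Hoods / Pants"},
    {"brand": "MICHAEL Michael Kors", "color": "Pearl Hthr 036", "activity_group": "Leggings"}
]'''

rows = []
for product in json.loads(products_json):
    # Keep only the fields to be split into separate columns.
    rows.append({
        "brand": product.get("brand"),
        "color": product.get("color"),
        "activity_group": product.get("activity_group"),
    })

print(rows)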
I am trying to load data from Datastore to BigQuery using Apache Beam in a Vertex AI notebook. This is the part of the code where the loading happens:
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
import apache_beam as beam
from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads
table_row = (p
| 'DatastoreGetData' >> ReadFromDatastore(query=myquery)
| 'EntityConversion' >> beam.Map(ent_to_json_func, table_schema)
| 'Final' >> BigQueryBatchFileLoads(
destination=lambda row: f"myproject:dataset.mytable",
custom_gcs_temp_location=f'gs://myproject/beam',
write_disposition='WRITE_TRUNCATE',
schema=table_schema
)
)
table_schema is the JSON version of the BigQuery table schema (see the attached column mapping below).
ent_to_json_func converts fields coming from Datastore to the corresponding BigQuery fields in the correct format.
I am trying to load just one row from Datastore, and it is giving an error. The data looks like this:
{ "key": { "namespace": null, "app": null, "path":
"Table/12345678", "kind": "Mykind", "name": null, "id":
12345678 }, "col1": false, "col2": { "namespace": null,
"app": null, "path": "abc/12345", "kind": "abc", "name":
null, "id": 12345 }, "col3": "6835218432", "col4": {
"namespace": null, "app": null, "path": null, "kind":
null, "name": null, "id": null }, "col5": false,
"col6": null, "col7": "https://www.somewebsite.com/poi/",
"col8": "0.00", "col9": "2022-03-12 03:44:17.732193+00:00",
"col10":
"{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", "provided":"Yes","someid2":"SDFTYI1090"}",
"col11": ""0.00"", "col12": "{}", "col13": [] }
The column mapping is here
The error is as follows:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in process(self, windowed_value)
1416 try:
-> 1417 return self.do_fn_invoker.invoke_process(windowed_value)
1418 except BaseException as exn:
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in invoke_process(self, windowed_value, restriction, watermark_estimator_state, additional_args, additional_kwargs)
837 self._invoke_process_per_window(
--> 838 windowed_value, additional_args, additional_kwargs)
839 return residuals
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in _invoke_process_per_window(self, windowed_value, additional_args, additional_kwargs)
982 windowed_value,
--> 983 self.process_method(*args_for_process, **kwargs_for_process),
984 self.threadsafe_watermark_estimator)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py in process(self, element, dest_ids_list)
753 # max_retries to 0.
--> 754 self.bq_wrapper.wait_for_bq_job(ref, sleep_duration_sec=10, max_retries=0)
755
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py in wait_for_bq_job(self, job_reference, sleep_duration_sec, max_retries)
637 'BigQuery job {} failed. Error Result: {}'.format(
--> 638 job_reference.jobId, job.status.errorResult))
639 elif job.status.state == 'DONE':
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_187_4cab298bbd73af86496c64ca35602a05_a5309204fb004ae0ba8007ac2169e079 failed.
Error Result: <ErrorProto
location: 'gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
File: gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
reason: 'invalid'>
Please let me know what to do. How can I determine the exact cause of the error? I have also validated this JSON with a JSON validator, and it reports no issue.
UPDATE:
I found out that the issue is due to a BYTES column. The value comes from Datastore as bytes, which I am converting to a string using decode and saving in the JSON. When I upload that JSON into BigQuery, it gives an error.
How do I proceed in this case?
To solve your issue you have to set the STRING type for COL10, COL11, COL12, and COL13 in the BigQuery table.
The final dict in your PCollection needs to match the schema of the BigQuery table exactly.
In your JSON these columns are strings, so your schema also needs the STRING type for them.
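As a rough sketch of what that part of the schema could look like (using the placeholder column names from the question, not your real ones), the relevant portion of the table_schema dict passed to the load step might be:

# Sketch only: placeholder names from the question; adjust to the real schema.
table_schema = {
    "fields": [
        {"name": "col10", "type": "STRING", "mode": "NULLABLE"},
        {"name": "col11", "type": "STRING", "mode": "NULLABLE"},
        {"name": "col12", "type": "STRING", "mode": "NULLABLE"},
        {"name": "col13", "type": "STRING", "mode": "NULLABLE"},
        # ... the remaining columns ...
    ]
}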
The error message is accurate. The file attempted to load into BigQuery is not a valid JSON string; see "{"someid":"NAME","col7":"https://www.somewebsite.com/poi/" ... there are quotes without escapes.
Loading into a BYTES type should actually be fine.
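A minimal sketch of the fix, assuming a conversion function along the lines of your ent_to_json_func: build the nested value as a plain dict and let json.dumps produce the escaped string, rather than concatenating quotes by hand.

import json

# Hypothetical field values; in the real pipeline these come from the Datastore entity.
nested = {"someid": "NAME", "col7": "https://www.somewebsite.com/poi/",
          "provided": "Yes", "someid2": "SDFTYI1090"}

row = {
    "col10": json.dumps(nested),   # nested JSON stored as a properly escaped string
    "col11": "0.00",               # plain string, without extra quotes around it
}

# json.dumps escapes the inner quotes, so the line written to the load file is valid JSON.
print(json.dumps(row))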
Hello, I have the below JSON field (addressbooklist) from which I'm trying to extract addr1; however, my query returns NULL.
Can someone help?
Query:
select
addressbooklist:addressbook.addressbookAddress.addr1 as test
from customer;
JSON Field
{
"addressbook": [
{
"addressbookAddress": {
"addr1": "701 N. Brand Boulevard",
"addr2": null,
"addr3": null,
"addrPhone": null,
"addrText": "Executive Software International<br>701 N. Brand Boulevard<br>Glendale CA 912031242<br>United States",
"addressee": "Executive Software International",
"attention": null,
"city": "Glendale",
"country": {
"value": "_unitedStates"
},
"customFieldList": null,
"internalId": null,
"nullFieldList": null,
"override": false,
"state": "CA",
"zip": "912031242"
},
"defaultBilling": false,
"defaultShipping": true,
"internalId": "1587207",
"isResidential": false,
"label": "701 N. Brand Boulevard"
},
{
"addressbookAddress": {
"addr1": "701 N. Brand Boulevard",
"addr2": null,
"addr3": null,
"addrPhone": null,
"addrText": "Executive Software International<br>701 N. Brand Boulevard<br>Glendale CA 912031242<br>United States",
"addressee": "Executive Software International",
"attention": null,
"city": "Glendale",
"country": {
"value": "_unitedStates"
},
"customFieldList": null,
"internalId": null,
"nullFieldList": null,
"override": false,
"state": "CA",
"zip": "912031242"
},
"defaultBilling": true,
"defaultShipping": false,
"internalId": "1587208",
"isResidential": false,
"label": "701 N. Brand Boulevard"
}
],
"replaceAll": false
}
In your JSON document the addressbook element is an array, not a plain property. You can reference a specific element of an array with the [0] notation (0 = first element, 1 = second, ...). Your query would look like this:
select addressbooklist:addressbook[0].addressbookAddress.addr1 as test from customer;
If you want to have all of the addressbook items in the result, you can flatten the json structure with this kind of query:
select ab.value:addressbookAddress.addr1 as test
from customer
, lateral flatten(input => addressbooklist:addressbook) ab;
Result:
TEST
"701 N. Brand Boulevard"
"701 N. Brand Boulevard"
More info on Snowflake json syntax for arrays can be found here:
Get single array items: https://docs.snowflake.com/en/user-guide/querying-semistructured.html#retrieving-a-single-instance-of-a-repeating-element
Flatten function to get table from array:
https://docs.snowflake.com/en/user-guide/querying-semistructured.html#using-the-flatten-function-to-parse-arrays
The JSON object you referred to has the following structure:
Array[Object1{key1: value, key2: value}, Object2{key1: value, key2: value}]
Object1 is the first item of the array and has index 0, Object2 has index 1, and so on.
You access an item of an array like Array[0], and you reach a value inside an object like Object1.key1. In this example, since Array[0] is Object1, you would reach key1 with Array[0].key1.
In your case you can access addr1 in your first object like this:
addressbook[0].addressbookAddress.addr1
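As a plain-language illustration only (Python here, not Snowflake), the same navigation, assuming the addressbooklist JSON from the question is held in a variable, looks like this:

import json

# doc holds the addressbooklist JSON from the question (trimmed here for brevity).
doc = json.loads('''{
    "addressbook": [
        {"addressbookAddress": {"addr1": "701 N. Brand Boulevard"}}
    ],
    "replaceAll": false
}''')

# Index into the array first, then into the nested object.
print(doc["addressbook"][0]["addressbookAddress"]["addr1"])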
It would help to know which language you want to use. I just had to do this with MySQL. Something like this may help:
select
yourField->>'$.addressbook[0].addressbookAddress.addr1' as address
from yourTable
where 1
Good morning guys,
I'm stuck on extracting data from PostgreSQL queries.
I made a query like this:
pool.query(' SELECT * FROM images_info where id < 50', genericQueryHandler(res));
The data sent back is formatted like this:
{
"command": "SELECT",
"rowCount": 49,
"oid": null,
"rows": [...],
"fields": [],
"_parsers": [],
"_types": {},
"RowCtor": null,
"rowAsArray": false
}
I only need the data in "rows". How do I extract "rows"? I tried LIMIT and GROUP BY, but that doesn't work. Could you guys help me? I really appreciate your time and help.
I have a table where I am saving data in a column of type bytea; the data is actually a JSON object.
I need to implement a filter on the JSON data.
SELECT cast(job_data::TEXT as jsonb) FROM job_details where job_data ->> "organization" = "ABC";
This query does not work.
The JSON Object looks like
{
"uid": "FdUR4SB0h7",
"Type": "Reference Data Service",
"user": "hk#ss.com",
"SubType": "Reference Data Task",
"_version": 1,
"Frequency": "Once",
"Parameters": "sdfsdfsdfds",
"organization": "ABC",
"StartDateTime": "2020-01-20T10:30:00Z"
}
You need to apply the predicate to the converted column; also, that conversion may not necessarily work depending on the encoding. Try something like this:
SELECT
*
FROM
job_details
WHERE
convert_from(job_data, 'UTF-8')::json ->> 'organization' = 'ABC';
I have a CosmosDB database with valid data documents in it.
I have created an Azure Search service and correctly hooked up the CosmosDB endpoint, and I have run the indexer, which indexed over 500 documents with 149 KB of storage used. However, when I run a simple '*' search using Search Explorer, all my result sets are NULL except the primary key and another field that CosmosDB generates internally whenever I add a new document. What am I doing wrong? These fields are NOT null or empty in the database.
See below screenshots:
Showing the JSON from search explorer as well:
"value": [
{
"#search.score": 1,
"id": "9ce19abc-a26a-5102-1919-8dcf42100067",
"StoreName": "",
"StoreProfile": "",
"StoreType": "",
"StoreStatus": "",
"RBM": "",
"StoreStateManager": "",
"StoreFranchiseGroup": "",
"Street": "",
"City": "",
"State": "",
"Location": "",
"Precinct": "",
"AreaPopulation": "",
"MedianAge": "",
"MedianHHoldIncome": "",
"Grading": "",
"BayCount": "",
"TenancySqm": "",
"RetailAreaSqm": "",
"rid": "BFA1AM5ZblIBAAAAAAAAAA=="
},
{
"#search.score": 1,
"id": "5fe1ec72-1cc9-593a-fac0-891f8b84df27",
"StoreName": null,
"StoreProfile": null,
"StoreType": null,
"StoreStatus": null,
"RBM": null,
"StoreStateManager": null,
"StoreFranchiseGroup": null,
"Street": null,
"City": null,
"State": null,
"Location": null,
"Precinct": null,
"AreaPopulation": null,
"MedianAge": null,
"MedianHHoldIncome": null,
"Grading": null,
"BayCount": null,
"TenancySqm": null,
"RetailAreaSqm": null,
"rid": "BFA1AM5ZblICAAAAAAAAAA=="
},
The search being used is just *, e.g.
.../docs?api-version=2017-11-11&search=*
This issue turned out to be caused by column names with spaces in them. These originally came from the spreadsheet I was using to obtain the original data for CosmosDB. While CosmosDB will allow column names with spaces, in my experience it negatively affects both Azure Search and CosmosDB queries (I found this out by testing the same data set with a Web App Bot).
Removing the spaces from the column names and re-testing showed the issue to be resolved.