I have issues using the COPY command in Redshift to load JSON objects. The file I receive is in a JSON format that fails when attempting to use the COPY command; however, when I adjust the JSON file as shown below, it works. This is not an ideal solution, as I am not permitted to modify the JSON file.
This works fine:
{
"id": 1,
"name": "Major League Baseball"
}
{
"id": 2,
"name": "National Hockey League"
}
This does not work (notice the extra square brackets):
[
{"id":1,"name":"Major League Baseball"},
{"id":2,"name":"National Hockey League"}
]
This is my JSONPaths file:
{
"jsonpaths": [
"$['id']",
"$['name']"
]
}
The problem with the COPY command is that it does not accept a standard JSON document. Instead, it expects one JSON object per line, which is shown in the documentation but not called out very obviously.
Hence every line is supposed to be a valid JSON object, but the file as a whole is not. That's why your modified file works.
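For example, once the file is in the one-object-per-line form, a COPY along these lines should work with the jsonpaths file above (the table name, bucket, and IAM role here are placeholders, not from the original question):

COPY leagues
FROM 's3://my-bucket/leagues.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
JSON 's3://my-bucket/jsonpaths.json';

If the JSON keys already match the column names, JSON 'auto' can be used instead of pointing at a jsonpaths file.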
I am trying to analyze monthly wikimedia pageview statistics. Their daily dumps are OK but monthly reports like the one from June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem broken:
[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2
pageviews-202106-user.bz2: Par archive data
Any idea how to extract the data? What encoding is used here? Could it be a Parquet file from their Hive analytics cluster?
These files are not bzip2 archives. They are Parquet files. Parquet-tools can be used to inspect them.
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main schema /tmp/pageviews-202106-user.bz2 2>/dev/null
{
"type" : "record",
"name" : "hive_schema",
"fields" : [ {
"name" : "line",
"type" : [ "null", "string" ],
"default" : null
} ]
}
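To go beyond the schema and dump the rows themselves (each record here has just the single "line" column), the same CLI also has cat and head commands; something like the following is a reasonable sketch, though I have not run it against that particular dump:

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main cat /tmp/pageviews-202106-user.bz2 2>/dev/null

Each record should come out as a small JSON object with the line field, which can then be post-processed back into plain pageview lines.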
I want to write data-driven tests, passing dynamic values read from an external file (CSV).
I am able to pass dynamic values from the CSV for simple strings (account number and affiliate ID below). But, using embedded expressions, how can I pass dynamic values from the CSV file for the "DealerReportFormats" JSON array below?
Any help is highly appreciated!
Scenario Outline: Dealer dynamic requests
Given path '/dealer-reports/retrieval'
And request read('../DealerTemplate.json')
When method POST
Then status 200
Examples:
| read('../DealerData.csv') |
DealerTemplate.json is below:
{
"DealerId": "FIXED",
"DealerName": "FIXED",
"DealerType": "FIXED",
"DealerCredentials": {
"accountNumber": "#(DealerCredentials_AccountNumber)",
"affiliateId": "#(DealerCredentials_AffiliateId)"
},
"DealerReportFormats": [
{
"name": "SalesReport",
"format": "xml"
},
{
"name": "CustomerReport",
"format": "txt"
}
]
}
DealerData.csv:
DealerCredentials_AccountNumber,DealerCredentials_AffiliateId
testaccount1,123
testaccount2,12345
testaccount3,123456
CSV is only for "flat" structures, so trying to mix that with JSON is too ambitious in my honest opinion. Please look for another framework if needed :)
That said I see 2 options:
a) use proper quoting and escaping in the CSV
b) refer to JSON files
Here is an example:
Scenario Outline:
* json foo = foo
* print foo
Examples:
| read('test.csv') |
And test.csv is:
foo,bar
"{ a: 'a1', b: 'b1' }",test1
"{ a: 'a2', b: 'b2' }",test2
I leave it as an exercise to you if you want to escape double-quotes. It is possible.
Option (b) is to refer to stand-alone JSON files and read them:
foo,bar
j1.json,test1
j2.json,test2
And you can do * def foo = read(foo) in your feature.
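A minimal sketch of option (b), assuming the CSV above is saved as test.csv and that j1.json / j2.json sit next to the feature file (the variable name payload is just for illustration):

Scenario Outline:
* def payload = read(foo)
* print payload, bar

Examples:
| read('test.csv') |

Here foo resolves to the file name from the CSV row, and read() loads that JSON file relative to the feature file.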
Find below the input data:
[{"acc_id": 166211981, "archived": true, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T05:56:01.208069Z", "description": "Data Testing", "id": 11722990697, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T08:44:20.324237Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_003", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11722990697"}, {"acc_id": 166211981, "archived": false, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T08:46:32.535653Z", "description": "Data Testing", "id": 11724290732, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T10:11:13.167798Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_001", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11724290732"}]
I want the output file to contain the data below:
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
I am able to achieve this by taking one record at a time and processing it with awk, but I am also getting the field names.
Find below my attempt:
R=$(cat in.txt | awk -F '},' '{print $1}')
echo $R | awk -F , '{print $7 " " $11 " " $13}'
I want this done for the entire file, without the field names.
awk/sed are not the right tools for parsing JSON files. Use jq:
[root@localhost]# jq -r '.[] | "\(.id),\(.name),\(.s3_path)"' abc.json
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
If you don't want to install any other software, then you can use Python as well, which is found on most Linux machines:
[root@localhost]# cat parse_json.py
#!/usr/bin/env python
# Import the json module
import json

# Open the json file in read-only mode and load the json data.
# It will load the data as a Python list of dictionaries.
with open('abc.json') as fh:
    data = json.load(fh)

# To print the whole structure
# print(data)

# To print the name key from the first and second record
# print(data[0]["name"])
# print(data[1]["name"])

# Now to get both the records, use a for loop
for i in range(0, 2):
    print("%s,%s,%s" % (data[i]["id"], data[i]["name"], data[i]["s3_path"]))

[root@localhost]# ./parse_json.py
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
Assuming the input data is in a file called input.json, you can use a Python script to fetch the attributes. Put the following content in a file called fetch_attributes.py:
import json

with open("input.json") as fh:
    data = json.load(fh)

with open("output.json", "w") as of:
    for record in data:
        of.write("%s,%s,%s\n" % (record["id"], record["name"], record["s3_path"]))
Then, run the script as:
python fetch_attributes.py
Code Explanation
import json - Importing Python's json library to parse the JSON.
with open("input.json") as fh: - Opening the input file and getting the file handler in if.
data = json.load(fh) - Loading the JSON input file using load() method from the json library which will populate the data variable with a Python dictionary.
with open("output.json", "w") as of: - Opening the output file in write mode and getting the file handler in of.
for record in data: - Loop over the list of records in the JSON.
of.write("%s,%s,%s\n" % (record["id"],record["name"],record["s3_path"])) - Fetching the required attributes from each record and writing them in the file.
What steps will reproduce the problem?
1. I am running the command:
./bq load --source_format=NEWLINE_DELIMITED_JSON --schema=lifeSchema.json dataset_test1.table_test_3 lifeData.json
2. I have attached the data source file and schema file.
3. It throws an error:
JSON parsing error in row starting at position 0 at file: file-00000000.
Could not convert value to double. Field: computed_results_A; Value:
What is the expected output? What do you see instead?
I want the empty string converted to NULL or 0.
What version of the product are you using? On what operating system?
I am using Mac OS X Yosemite.
Source JSON (lifeData.json):
{"schema":{"vendor":"com.bd.snowplow","name":"in_life","format":"jsonschema","version":"1-0-2"},"data":{"step":0,"info_userId":"53493764","info_campaignCity":"","info_self_currentAge":45,"info_self_gender":"male","info_self_retirementAge":60,"info_self_married":false,"info_self_lifeExpectancy":0,"info_dependantChildren":0,"info_dependantAdults":0,"info_spouse_working":true,"info_spouse_currentAge":33,"info_spouse_retirementAge":60,"info_spouse_monthlyIncome":0,"info_spouse_incomeInflation":5,"info_spouse_lifeExpectancy":0,"info_finances_sumInsured":0,"info_finances_expectedReturns":6,"info_finances_loanAmount":0,"info_finances_liquidateSavings":true,"info_finances_savingsAmount":0,"info_finances_monthlyExpense":0,"info_finances_expenseInflation":6,"info_finances_expenseReduction":10,"info_finances_monthlyIncome":0,"info_finances_incomeInflation":5,"computed_results_A":"","computed_results_B":null,"computed_results_C":null,"computed_results_D":null,"uid_epoch":"53493764_1466504541604","state":"init","campaign_id":"","campaign_link":"","tool_version":"20150701-lfi-v1"},"hierarchy":{"rootId":"94583157-af34-4ecb-8024-b9af7c9e54fa","rootTstamp":"2016-06-21 10:22:24.000","refRoot":"events","refTree":["events","in_life"],"refParent":"events"}}
Schema JSON (lifeSchema.json):
{
"name": "computed_results_A",
"type": "float",
"mode": "nullable"
}
Try loading the JSON file as a one-column CSV file.
bq load --field_delimiter='|' proj:set.table file.json json:string
Once the file is loaded into BigQuery, you can use JSON_EXTRACT_SCALAR or a JavaScript UDF to parse the JSON with total freedom.
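As a sketch in standard SQL (the json column name comes from the load command above; the dataset and table names are placeholders), something like this should surface the field and turn the empty string into NULL rather than failing:

SELECT
  SAFE_CAST(JSON_EXTRACT_SCALAR(json, '$.data.computed_results_A') AS FLOAT64) AS computed_results_A
FROM `proj.set.table`

JSON_EXTRACT_SCALAR pulls the value out as a string, and SAFE_CAST returns NULL instead of raising an error when that string is empty or not a valid number.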
I have a HUGE file with a lot of HL7 segments. It must be split into 1000 (or so) smaller files.
Since it is HL7 data, there is a pattern (logic) to go by: each message starts with "MSH|" and ends where the next "MSH|" segment begins.
The script must be Windows (cmd) based or VBS, as I cannot install any software on that machine.
File structure:
MSH|abc|123|....
s2|sdsd|2323|
...
..
MSH|ns|43|...
...
..
..
MSH|sdfns|4343|...
...
..
asds|sds
MSH|sfns|3|...
...
..
as|ss
The file in the above example must be split into 2 or 3 files. Also, the files come from UNIX, so the newlines must remain as they are in the source file.
Any help?
This is a sample script that I used to parse large HL7 files into separate files, with the new file names based on the name of the data file. It uses REBOL, which does not require installation, i.e. the core version does not make any registry entries.
I have a more generalised version that scans an incoming directory, splits each file into single messages, and then waits for the next file to arrive.
Rebol [
file: %split-hl7.r
author: "Graham Chiu"
date: 17-Feb-2010
purpose: {split HL7 messages into single messages}
]
fn: %05112010_0730.dat
outdir: %05112010_0730/
if not exists? outdir [
make-dir outdir
]
data: read fn
cnt: 0
filename: join copy/part form fn -4 + length? form fn "-"
separator: rejoin [ newline "MSH"]
parse/all data [
some [
[ copy result to separator | copy result to end ]
(
write to-file rejoin [ outdir filename cnt ".txt" ] result
print "Got result"
?? result
cnt: cnt + 1
)
1 skip
]
]
HL7 has a lot of segments - I assume that you know that your file has only MSH segments. So, have you tried parsing the file for the string "(newline)MSH|"? Just keep a running buffer and dump that into an output file when it gets too big.
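If it really has to be plain cmd/VBS, a minimal VBScript sketch of that running-buffer idea might look like the following (the input and output file names are placeholders, the whole file is read into memory, and splitting and writing on Chr(10) keeps the UNIX newlines intact; treat it as a starting point rather than a tested solution):

Option Explicit
' split-hl7.vbs - start a new output file every time a line begins with "MSH|"
Dim fso, text, lines, line, outFile, cnt, i
Set fso = CreateObject("Scripting.FileSystemObject")
' Read the whole file at once and split on LF so the UNIX line endings are preserved
text = fso.OpenTextFile("input.dat", 1).ReadAll
lines = Split(text, Chr(10))
cnt = 0
Set outFile = Nothing
For i = 0 To UBound(lines)
    line = lines(i)
    If Left(line, 4) = "MSH|" Then
        ' A new message begins: close the current chunk and open the next one
        If Not outFile Is Nothing Then outFile.Close
        cnt = cnt + 1
        Set outFile = fso.CreateTextFile("chunk-" & cnt & ".txt", True)
    End If
    If Not outFile Is Nothing Then outFile.Write line & Chr(10)
Next
If Not outFile Is Nothing Then outFile.Close

To end up with roughly 1000 files rather than one file per message, count the messages first and only roll over to a new chunk every N messages.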