Text processing to fetch the attributes - awk

Find below the input data:
[{"acc_id": 166211981, "archived": true, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T05:56:01.208069Z", "description": "Data Testing", "id": 11722990697, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T08:44:20.324237Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_003", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11722990697"}, {"acc_id": 166211981, "archived": false, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T08:46:32.535653Z", "description": "Data Testing", "id": 11724290732, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T10:11:13.167798Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_001", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11724290732"}]
I want the output file to contain below data:
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
I am able to achieve this by taking one record at a time and processing it with awk, but I am also getting the field names.
Find below my attempt:
R=$(cat in.txt | awk -F '},' '{print $1}')
echo $R | awk -F , '{print $7 " " $11 " " $13}'
I want it to be done for the entire file, without the field names.

awk/sed are not the right tools for parsing JSON files. Use jq:
[root@localhost]# jq -r '.[] | "\(.id),\(.name),\(.s3_path)"' abc.json
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
If you don't want to install any other software, you can use Python as well, which is found on most Linux machines:
[root@localhost]# cat parse_json.py
#!/usr/bin/env python
# Import the json module
import json
# Open the json file in read only mode and load the json data. It will load the data in python dictionary
with open('abc.json') as fh:
    data = json.load(fh)
# To print the dictionary
# print(data)
# To print the name key from first and second record
# print(data[0]["name"])
# print(data[1]["name"])
# Now loop over all the records
for record in data:
    print("%s,%s,%s" % (record["id"], record["name"], record["s3_path"]))
[root@localhost]# ./parse_json.py
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
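To write the result to a file, as the question asks, just redirect the script's output:
[root@localhost]# ./parse_json.py > output.txt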

Assuming the input data is in a file called input.json, you can use a Python script to fetch the attributes. Put the following content in a file called fetch_attributes.py:
import json

with open("input.json") as fh:
    data = json.load(fh)

with open("output.json", "w") as of:
    for record in data:
        of.write("%s,%s,%s\n" % (record["id"], record["name"], record["s3_path"]))
Then, run the script as:
python fetch_attributes.py
Code Explanation
import json - Importing Python's json library to parse the JSON.
with open("input.json") as fh: - Opening the input file and getting the file handler in if.
data = json.load(fh) - Loading the JSON input file using load() method from the json library which will populate the data variable with a Python dictionary.
with open("output.json", "w") as of: - Opening the output file in write mode and getting the file handler in of.
for record in data: - Loop over the list of records in the JSON.
of.write("%s,%s,%s\n" % (record["id"],record["name"],record["s3_path"])) - Fetching the required attributes from each record and writing them in the file.
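If any of the values could themselves contain commas or quotes, Python's csv module will handle the quoting for you. Below is a minimal variant of the same script (same assumed input.json, writing to output.csv instead):
import csv
import json

with open("input.json") as fh:
    data = json.load(fh)

# csv.writer quotes fields automatically if they ever contain a comma
with open("output.csv", "w", newline="") as of:
    writer = csv.writer(of)
    for record in data:
        writer.writerow([record["id"], record["name"], record["s3_path"]])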

Related

How to stream and read from a .tar.gz in S3 with boto3?

On S3 there is a JSON file with the following format:
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
It is compressed, in .tar.gz format, and its unzipped size is ~30GB, therefore I would like to read it in a streaming fashion.
Using the aws cli, I managed to locally do so with the following command:
aws s3 cp s3://${BUCKET_NAME}/${FILE_NAME}.tar.gz - | gunzip -c -
However, I would like to do it natively in python 3.8.
Merging various solutions online, I tried the following strategies:
1. Uncompressing in-memory file [not working]
import boto3, gzip, json
from io import BytesIO
s3 = boto3.resource('s3')
key = 'FILE_NAME.tar.gz'
streaming_iterator = s3.Object('BUCKET_NAME', key).get()['Body'].iter_lines()
first_line = next(streaming_iterator)
gzipline = BytesIO(first_line)
gzipline = gzip.GzipFile(fileobj=gzipline)
print(gzipline.read())
Which raises
EOFError: Compressed file ended before the end-of-stream marker was reached
2. Using the external library smart_open [partially working]
import boto3
from smart_open import open  # smart_open's open() replaces the builtin here

for line in open(
    f's3://{BUCKET_NAME}/{FILE_NAME}.tar.gz',
    mode="rb",
    transport_params={"client": boto3.client('s3')},
    encoding="raw_unicode_escape",
    compression=".gz"
):
    print(line)
This second solution works reasonably well for ASCII characters, but for some reason it turns non-ASCII characters into garbage; e.g.,
input: \xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82
output: å\x9b¾æ\xa0\x87ã\x80\x82
expected output: 图标。
This leads me to think that the encoding I used is wrong, but I have literally tried every encoding listed on that page, and the only ones that don't lead to an exception are raw_unicode_escape, unicode_escape and palmos (?), yet they all produce garbage.
Any suggestion is welcome; thanks in advance.
The return from a call to get_object() is a StreamingBody object, which, as the name implies, lets you read from the object in a streaming fashion. However, boto3 does not support seeking on this file object.
While you can pass this object to a tarfile.open call, you need to be careful. There are two caveats: you need to tell tarfile that you're passing it a non-seekable streaming object by using the | character in the mode string, and you can't do anything that would trigger a seek, such as attempting to get a list of files first and then operating on those files.
Putting it all together is fairly straightforward: you just need to open the object using boto3, then process each file in the tar archive in turn:
import json
import tarfile

import boto3

# Use boto3 to read the object from S3
s3 = boto3.client('s3')
resp = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')
obj = resp['Body']

# Open the tar file; the "|" is important, as it instructs
# tarfile that the fileobj is non-seekable
with tarfile.open(fileobj=obj, mode='r|gz') as tar:
    # Enumerate the tar file members as we extract data
    for member in tar:
        with tar.extractfile(member) as f:
            # Read each row in turn and decode it
            for row in f:
                row = json.loads(row)
                # Just print out the filename and results in this demo
                print(member.name, row)
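As for the garbled non-ASCII output in the second strategy above: the escaped bytes shown in the question are plain UTF-8, so decoding each row as UTF-8 (rather than raw_unicode_escape) restores the original characters, and json.loads() will also accept the raw bytes directly on Python 3.6+. A quick check:
# The sample bytes from the question decode cleanly as UTF-8
raw = b'\xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82'
print(raw.decode('utf-8'))  # prints: 图标。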

Add file name and timestamp into each record in BigQuery using Dataflow

I have a few .txt files with JSON data to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the filename and the current timestamp for each row. This is in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the file path and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the file path to ReadFromText, and call a FileNameReadFunction ParDo.
(p
| "read from file" >> ReadFromText(known_args.input)
| "parse" >> beam.Map(json.loads)
| "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(known_args.output,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern? but I can't quite make it work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
To include the file and timestamp in your output json record, you could change your "parse" line to
| "parse" >> beam.map(lambda (file, line): {
**json.loads(line),
"filename": file,
"timestamp": datetime.now()})

How to import csv file to postgresql table without using copy command

I am planning to use an INSERT statement or something like a bulk insert, but not the COPY command. Please help!
You tagged your question with ruby, so a Ruby-ish way could be:
Install the smarter_csv gem (https://github.com/tilo/smarter_csv), which lets you parse each row into a hash where the column title is used as the key.
inserts = SmarterCSV.process('/path/to/file.csv')
# [
# { col_name: "value from row 1", ... },
# { col_name: "value from row 2", ... }
# ]
Then you might use whatever ORM or database connector you like, e.g. ActiveRecord:
MyModel.insert_all(inserts) # Rails 6+: inserts all rows in a single statement

Convert Empty string ("") to Double data type while importing data from JSON file using command line BQ command

What steps will reproduce the problem?
1. I am running the command:
./bq load --source_format=NEWLINE_DELIMITED_JSON --schema=lifeSchema.json dataset_test1.table_test_3 lifeData.json
2. I have attached the data source file and schema file.
3. It throws an error: JSON parsing error in row starting at position 0 at file: file-00000000. Could not convert value to double. Field: computed_results_A; Value:
What is the expected output? What do you see instead?
I want the empty string converted to NULL or 0.
What version of the product are you using? On what operating system?
I am using Mac OS X Yosemite.
Source JSON lifeData.json
{"schema":{"vendor":"com.bd.snowplow","name":"in_life","format":"jsonschema","version":"1-0-2"},"data":{"step":0,"info_userId":"53493764","info_campaignCity":"","info_self_currentAge":45,"info_self_gender":"male","info_self_retirementAge":60,"info_self_married":false,"info_self_lifeExpectancy":0,"info_dependantChildren":0,"info_dependantAdults":0,"info_spouse_working":true,"info_spouse_currentAge":33,"info_spouse_retirementAge":60,"info_spouse_monthlyIncome":0,"info_spouse_incomeInflation":5,"info_spouse_lifeExpectancy":0,"info_finances_sumInsured":0,"info_finances_expectedReturns":6,"info_finances_loanAmount":0,"info_finances_liquidateSavings":true,"info_finances_savingsAmount":0,"info_finances_monthlyExpense":0,"info_finances_expenseInflation":6,"info_finances_expenseReduction":10,"info_finances_monthlyIncome":0,"info_finances_incomeInflation":5,"computed_results_A":"","computed_results_B":null,"computed_results_C":null,"computed_results_D":null,"uid_epoch":"53493764_1466504541604","state":"init","campaign_id":"","campaign_link":"","tool_version":"20150701-lfi-v1"},"hierarchy":{"rootId":"94583157-af34-4ecb-8024-b9af7c9e54fa","rootTstamp":"2016-06-21 10:22:24.000","refRoot":"events","refTree":["events","in_life"],"refParent":"events"}}
Schema JSON lifeSchema.json
{
  "name": "computed_results_A",
  "type": "float",
  "mode": "nullable"
}
Try loading the JSON file as a one column CSV file.
bq load --field_delimiter='|' proj:set.table file.json json:string
Once the file is loaded into BigQuery, you can use JSON_EXTRACT_SCALAR or a JavaScript UDF to parse the JSON with total freedom.

JSON file not loading into redshift

I have issues using the COPY command in Redshift to load JSON objects. I am receiving a file in a JSON format that fails when attempting to use the COPY command; however, when I adjust the file to the first format shown below, it works. This is not an ideal solution, as I am not permitted to modify the JSON file.
this works fine :
{
"id": 1,
"name": "Major League Baseball"
}
{
"id": 2,
"name": "National Hockey League"
}
This does not work (notice the extra square brackets)
[
{"id":1,"name":"Major League Baseball"},
{"id":2,"name":"National Hockey League"}
]
This is my jsonpaths file:
{
"jsonpaths": [
"$['id']",
"$['name']"
]
}
The problem is that the COPY command does not accept a regular JSON document. Instead, it expects newline-delimited JSON (one object per line), which is shown in the documentation but not obviously called out.
Hence, every line is supposed to be valid JSON, but the file as a whole is not. That's why it works when you modify your file.
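If a one-off preprocessing step is acceptable (rewriting a copy of the file rather than the file you receive), a short Python script can turn the array into the line-per-object form COPY expects. A sketch, with hypothetical file names:
import json

# Hypothetical file names, for illustration only
with open("leagues_array.json") as src:
    records = json.load(src)  # the incoming file is one JSON array

with open("leagues_ndjson.json", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")  # one JSON object per line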