Text processing to fetch the attributes - awk

Find below the input data:
[{"acc_id": 166211981, "archived": true, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T05:56:01.208069Z", "description": "Data Testing", "id": 11722990697, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T08:44:20.324237Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_003", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11722990697"}, {"acc_id": 166211981, "archived": false, "access_key": "ALLLJNXXXXXXXPU4C7GA", "secret_key": "X12J6SixMaFHoXXXXZW707XXX24OXXX", "created": "2018-10-03T08:46:32.535653Z", "description": "Data Testing", "id": 11724290732, "key_field": "Ae_Appl_Number", "last_modified": "2018-10-03T10:11:13.167798Z", "list_type": "js_variables", "name": "TEST_AE_LI_KEYS_001", "project_id": 1045199007354, "s3_path": "opti-port/dcp/ue.1045199007354/11724290732"}]
I want the output file to contain below data:
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
I am able to achieve this by taking one record at a time and processing it with awk, but I am also getting the field names.
Find below my attempt:
R=$(cat in.txt | awk -F '},' '{print $1}')
echo $R | awk -F , '{print $7 " " $11 " " $13}'
I want it to be done for the entire file, without the field names.

awk/sed are not the right tools for parsing JSON files. Use jq:
[root@localhost]# jq -r '.[] | "\(.id),\(.name),\(.s3_path)"' abc.json
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
If you don't want to install any other software, you can use Python as well, which is found on most Linux machines:
[root@localhost]# cat parse_json.py
#!/usr/bin/env python
# Import the json module
import json
# Open the json file in read only mode and load the json data. It will load the data in python dictionary
with open('abc.json') as fh:
    data = json.load(fh)
# To print the dictionary
# print(data)
# To print the name key from first and second record
# print(data[0]["name"])
# print(data[1]["name"])
# Now loop over all the records
for record in data:
    print("%s,%s,%s" % (record["id"], record["name"], record["s3_path"]))
[root@localhost]# ./parse_json.py
11722990697,TEST_AE_LI_KEYS_003,opti-port/dcp/ue.1045199007354/11722990697
11724290732,TEST_AE_LI_KEYS_001,opti-port/dcp/ue.1045199007354/11724290732
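To write the result to a file, as the question asks, just redirect the script's output:
[root@localhost]# ./parse_json.py > output.txt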

Assuming the input data is in a file called input.json, you can use a Python script to fetch the attributes. Put the following content in a file called fetch_attributes.py:
import json

with open("input.json") as fh:
    data = json.load(fh)

with open("output.json", "w") as of:
    for record in data:
        of.write("%s,%s,%s\n" % (record["id"], record["name"], record["s3_path"]))
Then, run the script as:
python fetch_attributes.py
Code Explanation
import json - Importing Python's json library to parse the JSON.
with open("input.json") as fh: - Opening the input file and getting the file handler in if.
data = json.load(fh) - Loading the JSON input file using load() method from the json library which will populate the data variable with a Python dictionary.
with open("output.json", "w") as of: - Opening the output file in write mode and getting the file handler in of.
for record in data: - Loop over the list of records in the JSON.
of.write("%s,%s,%s\n" % (record["id"],record["name"],record["s3_path"])) - Fetching the required attributes from each record and writing them in the file.
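If any of the values could themselves contain commas or quotes, Python's csv module will handle the quoting for you. Below is a minimal variant of the same script (same assumed input.json, writing to output.csv instead):
import csv
import json

with open("input.json") as fh:
    data = json.load(fh)

# csv.writer quotes fields automatically if they ever contain a comma
with open("output.csv", "w", newline="") as of:
    writer = csv.writer(of)
    for record in data:
        writer.writerow([record["id"], record["name"], record["s3_path"]])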

Related

How to stream and read from a .tar.gz in S3 with boto3?

On S3 there is a JSON file with the following format:
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
It is compressed, in .tar.gz format, and its unzipped size is ~30GB, therefore I would like to read it in a streaming fashion.
Using the aws cli, I managed to locally do so with the following command:
aws s3 cp s3://${BUCKET_NAME}/${FILE_NAME}.tar.gz - | gunzip -c -
However, I would like to do it natively in python 3.8.
Merging various solutions online, I tried the following strategies:
1. Uncompressing in-memory file [not working]
import boto3, gzip, json
from io import BytesIO
s3 = boto3.resource('s3')
key = 'FILE_NAME.tar.gz'
streaming_iterator = s3.Object('BUCKET_NAME', key).get()['Body'].iter_lines()
first_line = next(streaming_iterator)
gzipline = BytesIO(first_line)
gzipline = gzip.GzipFile(fileobj=gzipline)
print(gzipline.read())
Which raises
EOFError: Compressed file ended before the end-of-stream marker was reached
2. Using the external library smart_open [partially working]
import boto3
from smart_open import open  # smart_open's open() replaces the builtin here

for line in open(
    f's3://{BUCKET_NAME}/{FILE_NAME}.tar.gz',
    mode="rb",
    transport_params={"client": boto3.client('s3')},
    encoding="raw_unicode_escape",
    compression=".gz"
):
    print(line)
This second solution works reasonably well for ASCII characters, but for some reason it turns non-ASCII characters into garbage; e.g.,
input: \xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82
output: å\x9b¾æ\xa0\x87ã\x80\x82
expected output: 图标。
This leads me to think that the encoding I used is wrong, but I have literally tried every encoding listed on that page, and the only ones that don't lead to an exception are raw_unicode_escape, unicode_escape and palmos (?), yet they all produce garbage.
Any suggestion is welcome; thanks in advance.
The return from a call to get_object() is a StreamingBody object, which, as the name implies, lets you read from the object in a streaming fashion. However, boto3 does not support seeking on this file object.
While you can pass this object to a tarfile.open call, you need to be careful. There are two caveats: you need to tell tarfile that you're passing it a non-seekable streaming object by using the | character in the mode string, and you can't do anything that would trigger a seek, such as attempting to get a list of files first and then operating on those files.
Putting it all together is fairly straightforward: you just need to open the object using boto3, then process each file in the tar archive in turn:
import json
import tarfile

import boto3

# Use boto3 to read the object from S3
s3 = boto3.client('s3')
resp = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')
obj = resp['Body']

# Open the tar file; the "|" is important, as it instructs
# tarfile that the fileobj is non-seekable
with tarfile.open(fileobj=obj, mode='r|gz') as tar:
    # Enumerate the tar file members as we extract data
    for member in tar:
        with tar.extractfile(member) as f:
            # Read each row in turn and decode it
            for row in f:
                row = json.loads(row)
                # Just print out the filename and results in this demo
                print(member.name, row)
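As for the garbled non-ASCII output in the second strategy above: the escaped bytes shown in the question are plain UTF-8, so decoding each row as UTF-8 (rather than raw_unicode_escape) restores the original characters, and json.loads() will also accept the raw bytes directly on Python 3.6+. A quick check:
# The sample bytes from the question decode cleanly as UTF-8
raw = b'\xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82'
print(raw.decode('utf-8'))  # prints: 图标。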

Add file name and timestamp into each record in BigQuery using Dataflow

I have a few .txt files with JSON data to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the filename and the current timestamp for each row. This is in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the file path and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the file path to ReadFromText, and call a FileNameReadFunction ParDo.
(p
| "read from file" >> ReadFromText(known_args.input)
| "parse" >> beam.Map(json.loads)
| "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(known_args.output,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern? but I can't quite make it work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
To include the file and timestamp in your output json record, you could change your "parse" line to
| "parse" >> beam.map(lambda (file, line): {
**json.loads(line),
"filename": file,
"timestamp": datetime.now()})

How to import csv file to postgresql table without using copy command

I am planning to use an INSERT statement or something like a bulk insert, but not the COPY command. Please help!
You tagged your question with ruby, so a Ruby-ish way could be:
Install the smarter_csv gem (https://github.com/tilo/smarter_csv), which lets you parse each row into a hash where the column title is used as the key.
inserts = SmarterCSV.process('/path/to/file.csv')
# [
# { col_name: "value from row 1", ... },
# { col_name: "value from row 2", ... }
# ]
Then you might use whatever ORM or database connector you like, e.g. ActiveRecord:
MyModel.insert_all(inserts) # Rails 6+: inserts all rows in a single statement

Convert Empty string ("") to Double data type while importing data from JSON file using command line BQ command

What steps will reproduce the problem?
1. I am running the command:
./bq load --source_format=NEWLINE_DELIMITED_JSON --schema=lifeSchema.json dataset_test1.table_test_3 lifeData.json
2. I have attached the data source file and schema file.
3. It throws an error: JSON parsing error in row starting at position 0 at file: file-00000000. Could not convert value to double. Field: computed_results_A; Value:
What is the expected output? What do you see instead?
I want the empty string converted to NULL or 0.
What version of the product are you using? On what operating system?
I am using Mac OS X Yosemite.
Source JSON lifeData.json
{"schema":{"vendor":"com.bd.snowplow","name":"in_life","format":"jsonschema","version":"1-0-2"},"data":{"step":0,"info_userId":"53493764","info_campaignCity":"","info_self_currentAge":45,"info_self_gender":"male","info_self_retirementAge":60,"info_self_married":false,"info_self_lifeExpectancy":0,"info_dependantChildren":0,"info_dependantAdults":0,"info_spouse_working":true,"info_spouse_currentAge":33,"info_spouse_retirementAge":60,"info_spouse_monthlyIncome":0,"info_spouse_incomeInflation":5,"info_spouse_lifeExpectancy":0,"info_finances_sumInsured":0,"info_finances_expectedReturns":6,"info_finances_loanAmount":0,"info_finances_liquidateSavings":true,"info_finances_savingsAmount":0,"info_finances_monthlyExpense":0,"info_finances_expenseInflation":6,"info_finances_expenseReduction":10,"info_finances_monthlyIncome":0,"info_finances_incomeInflation":5,"computed_results_A":"","computed_results_B":null,"computed_results_C":null,"computed_results_D":null,"uid_epoch":"53493764_1466504541604","state":"init","campaign_id":"","campaign_link":"","tool_version":"20150701-lfi-v1"},"hierarchy":{"rootId":"94583157-af34-4ecb-8024-b9af7c9e54fa","rootTstamp":"2016-06-21 10:22:24.000","refRoot":"events","refTree":["events","in_life"],"refParent":"events"}}
Schema JSON lifeSchema.json
{
  "name": "computed_results_A",
  "type": "float",
  "mode": "nullable"
}
Try loading the JSON file as a one column CSV file.
bq load --field_delimiter='|' proj:set.table file.json json:string
Once the file is loaded into BigQuery, you can use JSON_EXTRACT_SCALAR or a JavaScript UDF to parse the JSON with total freedom.

JSON file not loading into redshift

I have issues using the COPY command in Redshift to load JSON objects. I am receiving a file in a JSON format that fails when attempting to use the COPY command; however, when I adjust the file to the first format shown below, it works. This is not an ideal solution, as I am not permitted to modify the JSON file.
this works fine :
{
"id": 1,
"name": "Major League Baseball"
}
{
"id": 2,
"name": "National Hockey League"
}
This does not work (notice the extra square brackets)
[
{"id":1,"name":"Major League Baseball"},
{"id":2,"name":"National Hockey League"}
]
This is my jsonpaths file:
{
"jsonpaths": [
"$['id']",
"$['name']"
]
}
The problem is that the COPY command does not accept a regular JSON document. Instead, it expects newline-delimited JSON (one object per line), which is shown in the documentation but not obviously called out.
Hence, every line is supposed to be valid JSON, but the file as a whole is not. That's why it works when you modify your file.
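If a one-off preprocessing step is acceptable (rewriting a copy of the file rather than the file you receive), a short Python script can turn the array into the line-per-object form COPY expects. A sketch, with hypothetical file names:
import json

# Hypothetical file names, for illustration only
with open("leagues_array.json") as src:
    records = json.load(src)  # the incoming file is one JSON array

with open("leagues_ndjson.json", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")  # one JSON object per line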