Google Cloud Function -- Convert BigQuery Data to Gzip (Compressed) Json then Load to Cloud Storage - google-bigquery

For context, this script is largely based on the one found in this guide from Google: https://cloud.google.com/bigquery/docs/samples/bigquery-extract-table-json#bigquery_extract_table_json-nodejs
I have the below script, which is functioning. However, it writes a normal JSON file to Cloud Storage. To be a bit more optimized for file transfer and storage, I wanted to use const {pako} = require('pako'); to compress the files before loading.
I haven't been able to figure out how to accomplish this, unfortunately, after numerous attempts.
Anyone have any ideas?
I'm assuming it has something to do with the options in .extract(storage.bucket(bucketName).file(filename), options);, but again, I'm pretty lost as to how to figure this out, unfortunately...
Any help would be appreciated! :)
The intent of this function is:
It is a Google Cloud function
It gets data from BigQuery
It writes that data in JSON format to Cloud Storage
My goal is to integrate Pako (or another means of compression) to compress the JSON files to gzip format prior to moving into storage.
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');
const functions = require('@google-cloud/functions-framework');

const bigquery = new BigQuery();
const storage = new Storage();

functions.http('extractTableJSON', async (req, res) => {
  // Exports my_dataset:my_table to gs://my-bucket/my-file as JSON.
  // https://cloud.google.com/bigquery/docs/samples/bigquery-extract-table-json#bigquery_extract_table_json-nodejs
  const DateYYYYMMDD = new Date().toISOString().slice(0, 10).replace(/-/g, "");
  const datasetId = "dataset-1";
  const tableId = "example";
  const bucketName = "domain.appspot.com";
  const filename = `/cache/${DateYYYYMMDD}/example.json`;
  // Location must match that of the source table.
  const options = {
    format: 'json',
    location: 'US',
  };
  // Export data from the table into a Google Cloud Storage file
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .extract(storage.bucket(bucketName).file(filename), options);
  console.log(`Job ${job.id} created.`);
  // Check the job's status for errors, so the response is only sent once
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    res.send(errors);
    return;
  }
  res.send(`Job ${job.id} created.`);
});

If you want to gzip-compress the result, simply use the gzip option:
// Location must match that of the source table.
const options = {
  format: 'json',
  location: 'US',
  gzip: true,
};
Job done ;)
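For completeness, it's worth naming the destination object with a .gz extension too, so downstream consumers know it's compressed. A minimal sketch (the bucket and path placeholders are taken from the question; the extract call itself is unchanged):

```javascript
// Name the destination object with a .gz extension since BigQuery will
// write it gzip-compressed when options.gzip is true.
const DateYYYYMMDD = new Date().toISOString().slice(0, 10).replace(/-/g, "");
const filename = `/cache/${DateYYYYMMDD}/example.json.gz`;

const options = {
  format: 'json',
  location: 'US',
  gzip: true,
};

// Then pass both to the extract call exactly as in the question:
// await bigquery.dataset(datasetId).table(tableId)
//   .extract(storage.bucket(bucketName).file(filename), options);
```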

Ah, you are looking for an array of rows! OK, you can't have that out of the box. BigQuery exports JSONL files (JSON Lines: one valid JSON object per line, each representing a row in BQ). – guillaume blaquiere
Turns out I had a misunderstanding of the expected output: I was expecting a JSON array, whereas the output is individual JSON lines, as Guillaume mentioned above.
So, if you're looking for a JSON array output, you can still use the helper found below to convert the output; but it turns out that was in fact the expected output, and I was mistakenly thinking it was inaccurate (sorry - I'm new ...)
// step #1: Use the below options to export to compressed JSON (as per guillaume blaquiere's note)
const options = {
  format: 'json',
  location: 'US',
  gzip: true,
};
// step #2 (if you're looking for a JSON array): you can use the below helper function to convert the response.
function convertToJsonArray(text: string): any {
  // wrap in an array, replace each newline with a comma, and remove the trailing comma
  const wrappedText = `[${text.replace(/\r?\n|\r/g, ",").slice(0, -1)}]`;
  const jsonArray = JSON.parse(wrappedText);
  return jsonArray;
}
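To see what the helper does on a small JSONL input, here is a quick check (plain-JS version of the helper above; note it assumes the text ends with a trailing newline, which BigQuery's export produces):

```javascript
// Plain-JS version of the helper above, for a quick sanity check.
function convertToJsonArray(text) {
  // wrap in [], turn each newline into a comma, drop the trailing comma
  return JSON.parse(`[${text.replace(/\r?\n|\r/g, ",").slice(0, -1)}]`);
}

// BigQuery's JSON export is JSONL: one JSON object per line.
const jsonl = '{"id":1,"name":"a"}\n{"id":2,"name":"b"}\n';
const rows = convertToJsonArray(jsonl);
console.log(rows.length); // 2
console.log(rows[1].name); // b
```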
For reference / in case it's helpful, I created this function that'll handle both compressed and uncompressed JSON that's returned.
The application of this is that I'm writing the BigQuery table to JSON in Cloud Storage to act as a "cache", then requesting that file from a React app and using the below to parse the file in the React app for use on the frontend.
import pako from 'pako';

function convertToJsonArray(text: string): any {
  const wrappedText = `[${text.replace(/\r?\n|\r/g, ",").slice(0, -1)}]`;
  const jsonArray = JSON.parse(wrappedText);
  return jsonArray;
}

async function getJsonData(arrayBuffer: ArrayBuffer): Promise<any> {
  try {
    // try to decompress first, in case the file is gzipped
    const Uint8Arr = pako.inflate(arrayBuffer);
    const arrayBuf = new TextDecoder().decode(Uint8Arr);
    const jsonArray = convertToJsonArray(arrayBuf);
    return jsonArray;
  } catch (error) {
    // not compressed (or decompression failed): parse as plain text
    console.log("Error unzipping file, trying to parse as is.", error);
    const parsedBuffer = new TextDecoder().decode(arrayBuffer);
    const jsonArray = convertToJsonArray(parsedBuffer);
    return jsonArray;
  }
}

Related

Google Buckets / Read by line

I know that it is currently possible to download objects by byte range from Google Cloud Storage buckets.
const options = {
  destination: destFileName,
  start: startByte,
  end: endByte,
};
await storage.bucket(bucketName).file(fileName).download(options);
However, I would need to read by line, as the files I deal with are *.csv:
await storage
  .bucket(bucketName)
  .file(fileName)
  .download({ destination: '', lineStart: number, lineEnd: number });
I couldn't find any API for it, could anyone advise on how to achieve the desired behaviour?
You cannot read a file line by line directly from Cloud Storage, as it stores them as objects, as shown in this answer:
The string you read from Google Storage is a string representation of a multipart form. It contains not only the uploaded file contents but also some metadata.
To read the file line by line as desired, I suggest loading it into a variable and then parsing the variable as needed. You could use the sample code provided in this answer:
const { Storage } = require("@google-cloud/storage");
const storage = new Storage();

// Read file from Storage
var downloadedFile = storage
  .bucket(bucketName)
  .file(fileName)
  .createReadStream();

// Concat data
let fileBuffer = "";
downloadedFile
  .on("data", function (data) {
    fileBuffer += data;
  })
  .on("end", function () {
    // CSV file data
    // console.log(fileBuffer);
    // Parse data using the newline character as delimiter
    var rows;
    Papa.parse(fileBuffer, {
      header: false,
      delimiter: "\n",
      complete: function (results) {
        // Shows the parsed data on the console
        console.log("Finished:", results.data);
        rows = results.data;
      },
    });
  });
To parse the data, you could use a library like PapaParse, as shown in this tutorial.

AsyncStorage with React Native: JSON.parse doesn't work when getting the object back

Hi everyone!
I've stored a simple object in AsyncStorage in a React Native app.
But when I get it back, it isn't correctly parsed: all keys still have the quote marks (added by JSON.stringify() when storing it)...
I store data like that:
const storeData = () => {
  let myData = {
    title: 'Hummus',
    price: '6.90',
    id: '1'
  }
  AsyncStorage.setItem('@storage_Key', JSON.stringify(myData));
}
and then access data like that:
const getData = async () => {
  const jsonValue = await AsyncStorage.getItem('@storage_Key')
  console.log(jsonValue);
  return JSON.parse(jsonValue);
}
and my object after parsing looks like this:
{"title":"Hummus","price":"6.90","id":"1"}
Any idea why the quotes aren't removed from the keys?
That's because the JSON specification says keys must be strings. What you are looking at is the modern representation of JSON called JSON5 (https://json5.org/). JSON5 is a superset of the JSON specification, and it does not require keys to be surrounded by quotes in some cases. When you stringify, the result is returned in JSON format.
Both JSON and JSON5 are equally valid in modern browsers, so you should not be worried about breaking anything programmatically just because they look different.
You can use JSON5 as shown below and it will give you your desired Stringified result.
let myData = {
  title: 'Hummus',
  price: '6.90',
  id: '1'
}
console.log(JSON5.stringify(myData));
console.log(JSON.stringify(myData));
<script src="https://unpkg.com/json5@^2.0.0/dist/index.min.js"></script>
Like this:
// JSON5.stringify
{title:'Hummus',price:'6.90',id:'1'}
// JSON.stringify
{"title":"Hummus","price":"6.90","id":"1"}

How can I use the same value as written in the JSON during the same test execution in TestCafe

I have been trying to use the value from the JSON that I successfully wrote using the fs.writeFile() function.
There are two test cases in the same fixture: one to create an ID, and a 2nd to use that ID. I can write the id successfully to the JSON file using fs.writeFile(), and I'm trying to use that id by importing the JSON file like var myid = require('../../resources/id.json')
The JSON file stores the correct id of the current execution, but I get the id of the first test execution in the 2nd execution.
For example, id:1234 is stored during the first test execution and id:4567 is stored in the 2nd test execution. During the 2nd test execution I need id:4567, but I get 1234; this is weird, isn't it?
I use it like:
t.typeText(ele, myid.orid)
My JSON file contains only the id, like {"orid":"4567"}.
I am new to JavaScript and TestCafe; any help would really be appreciated.
Write File class
const fs = require('fs')
const baseClass = require('../component/base')

class WriteIntoFile {
  constructor(orderID) {
    const OID = {
      orderid: orderID
    }
    const jsonString = JSON.stringify(OID)
    fs.writeFile('resources/id.json', jsonString, err => {
      if (err) {
        console.log('Error writing file', err)
      } else {
        console.log('Successfully wrote file')
      }
    })
  }
}
export default WriteIntoFile
I created 2 different classes in order to separate the create & update operations, and I call the create & update order functions in a single fixture in the test file.
Create Order class
class CreateOrder {
  ----
  ----
  ----
  async createNewOrder() {
    // get text of created order and save the order id into the json file
    -----
    -----
    -----
    const orId = await baseclass.getOrderId();
    new WriteIntoFile(orId)
    console.log(orId)
    -----
    -----
    -----
  }
}
export default CreateOrder
Update Order class
var id = require('../../resources/id.json')

class UpdateOrder {
  async searchOrderToUpdate() {
    await t
      // Here, I get the old order id that was saved during the previous execution
      .typeText(baseClass.searchBox, id.orderid)
      .wait(2500)
      .click(baseClass.searchIcon)
      .doubleClick(baseClass.orderAGgrid)
    console.log(id.orderid)
    ----
    ----
  }
  async updateOrder() {
    this.searchOrderToUpdate()
      .typeText(baseClass.phNo, '1234567890')
      .click(baseClass.saveBtn)
  }
}
export default UpdateOrder
Test file
const newOrder = new CreateOrder();
const update = new UpdateOrder();

const role = Role(`siteurl`, async t => {
  await t
  login('id')
  await t
    .wait(1500)
}, { preserveUrl: true })

test('Should be able to create an Order', async t => {
  await newOrder.createNewOrder();
});

test('Should be able to update an order', async t => {
  await update.updateOrder();
});
I'll reply to this, but you probably won't be happy with my answer, because I wouldn't go down this same path as you proposed in your code.
I can see a couple of problems. Some of them might not be problems right now, but in a month, you could struggle with this.
1/ You are creating separate test cases that are dependent on each other.
This is a problem because of these reasons:
what if Should be able to create an Order doesn't run? Or what if it fails? Then Should be able to update an order fails as well, and this information is useless, because it wasn't the update operation that failed, but the fact that you didn't meet all the preconditions for the test case
how do you make sure Should be able to create an Order always runs before Should be able to update an order? There's no way! You can do it like this when one comes right before the other, and I think it will work, but at some point you'll decide to move one test somewhere else and you'll be in trouble and spend hours debugging it. You have prepared a trap for yourself. I wrote this answer on this very topic; you can read it.
you can't run the tests in parallel
when I read your test file, there's no visible hint that the tests are dependent on each other. Therefore as a stranger to your code, I could easily mess things up because I have no way of knowing about it without going deeper in the code. This is a big trap for anyone who might come to your code after you. Don't do this to your colleagues.
2/ Working with files when all you need to do is pass a value around is too cumbersome.
I really don't see a reason why you need to save the id into a file. A slightly better approach (still violating 1/) could be:
const newOrder = new CreateOrder();
const update = new UpdateOrder();

// use a variable to pass the orderId around
// it's also visible that the tests are dependent on each other
let orderId = undefined;

const role = Role(`siteurl`, async t => {
  // some steps, I omit this for better readability
}, { preserveUrl: true })

test('Should be able to create an Order', async t => {
  orderId = await newOrder.createNewOrder();
});

test('Should be able to update an order', async t => {
  await update.updateOrder(orderId);
});
Doing it like this also slightly remedies what I wrote in 1/, that is that it's not visible at first sight that the tests are dependent on each other. Now, this is a bit improved.
Some other approaches how you can pass data around are mentioned here and here.
Perhaps even a better approach is to use t.fixtureCtx object:
const newOrder = new CreateOrder();
const update = new UpdateOrder();

const role = Role(`siteurl`, async t => {
  // some steps, I omit this for better readability
}, { preserveUrl: true })

test('Should be able to create an Order', async t => {
  t.fixtureCtx.orderId = await newOrder.createNewOrder();
});

test('Should be able to update an order', async t => {
  await update.updateOrder(t.fixtureCtx.orderId);
});
Again, I can at least see the tests are dependent on each other. That's already a big victory.
Now back to your question:
During 2nd test execution I need the id:4567 but I get 1234 this is weird, isn't it?
No, it's not weird. You required the file:
var id = require('../../resources/id.json')
and so it's loaded once and if you write into the file later, you won't read the new content unless you read the file again. require() is a function in Node to load modules, and it makes sense to load them once.
This demonstrates the problem:
const idFile = require('./id.json');
const fs = require('fs');

console.log(idFile); // { id: 5 }

const newId = {
  'id': 7
};
fs.writeFileSync('id.json', JSON.stringify(newId));

// it's been loaded once, you won't get any other value here
console.log(idFile); // { id: 5 }
What can you do to solve the problem?
You can use fs.readFileSync():
const idFile = require('./id.json');
const fs = require('fs');

console.log(idFile); // { id: 5 }

const newId = {
  'id': 7
};
fs.writeFileSync('id.json', JSON.stringify(newId));

// you need to read the file again and parse its content
const newContent = JSON.parse(fs.readFileSync('id.json'));
console.log(newContent); // { id: 7 }
And this is what I warned you against in the comment section. That this is too cumbersome, inefficient, because you write to a file and then read from the file just to get one value.
What you created is not very readable either:
const fs = require('fs')
const baseClass = require('../component/base')

class WriteIntoFile {
  constructor(orderID) {
    const OID = {
      orderid: orderID
    }
    const jsonString = JSON.stringify(OID)
    fs.writeFile('resources/id.json', jsonString, err => {
      if (err) {
        console.log('Error writing file', err)
      } else {
        console.log('Successfully wrote file')
      }
    })
  }
}
export default WriteIntoFile
All these operations for writing into a file are in a constructor, but a constructor is not the best place for all this. Ideally, you'd have only variable assignments in it. I also don't see much reason to create a new class when you are doing only two operations that easily fit on one line of code:
fs.writeFileSync('orderId.json', JSON.stringify({ orderid: orderId }));
Keep it as simple as possible. It's more readable like this than having to go to a separate file with the class and decipher what it does there.

HapiJS reply with readable stream

For one call, I am replying with a huge JSON object, which sometimes causes the Node event loop to become blocked. As such, I'm using the Big Friendly JSON package to stream the JSON instead. My issue is I cannot figure out how to actually reply with the stream.
My original code was simply
let searchResults = (await s3Access.getSavedSearch(guid)).Body;
searchResults = JSON.parse(searchResults.toString());
return reply(searchResults);
Works great but bogs down on huge payloads
I've tried things like using the Big Friendly JSON package https://gitlab.com/philbooth/bfj:
const stream = bfj.streamify(searchResults);
return reply(stream); // according to docs it's a readable stream
But then my browser complained about an empty response. I then tried to add the below to the reply, same result.
.header('content-encoding', 'json')
.header('Content-Length', stream.length);
I also tried return reply(null, stream); but that produced a ton of node errors
Is there some other way I need to organize this? My understanding was I could just reply a readable stream and Hapi would take care of it, but the response keeps showing up as empty.
Did you try to use h.response? Here, h is the reply toolkit.
Example:
handler: async (request, h) => {
  const { limit, sortBy, order } = request.query;
  const queryString = {
    where: { status: 1 },
    limit,
    order: [[sortBy, order]],
  };
  let userList = {};
  try {
    userList = await _getList(User, queryString);
  } catch (e) {
    // throw new Boom(e);
    Boom.badRequest(i18n.__('controllers.user.fetchUser'), e);
  }
  return h.response(userList);
}

Search and move files in S3 bucket according to metadata

I currently have a setup where audio files are uploaded to a bucket with user-defined metadata. My next goal is to filter through the metadata and move the files to a different folder. Currently I have a Lambda function that converts the audio to mp3. So I need help adjusting the code so that the metadata persists through the encoding and is also stored in a database, and with creating another function that searches for a particular metadata value and moves the corresponding files to another bucket.
'use strict';
console.log('Loading function');

const aws = require('aws-sdk');
const s3 = new aws.S3({ apiVersion: '2006-03-01' });
const elastictranscoder = new aws.ElasticTranscoder();

// return basename without extension
function basename(path) {
  return path.split('/').reverse()[0].split('.')[0];
}

// return output file name with timestamp and extension
function outputKey(name, ext) {
  return name + '-' + Date.now().toString() + '.' + ext;
}

exports.handler = (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = event.Records[0].s3.object.key;
  var params = {
    Input: {
      Key: key
    },
    PipelineId: '1477166492757-jq7i0s',
    Outputs: [
      {
        Key: basename(key) + '.mp3',
        PresetId: '1351620000001-300040', // mp3-128
      }
    ]
  };
  elastictranscoder.createJob(params, function (err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
      context.fail();
      return;
    }
    context.succeed();
  });
};
I have also done some research and know that the metadata should be retrievable with
s3.head_object(Bucket=bucket, Key=key)
(note: that is the Python boto3 syntax; the JavaScript SDK equivalent is s3.headObject({ Bucket: bucket, Key: key })).
S3 does not provide a mechanism for searching metadata.
The only way to do what you're contemplating using only native S3 capabilities is to iterate through the list of objects and send a HEAD request for each object, but of course this does not scale well for large buckets, and each of those requests comes with a charge, albeit a small one.
There is also the S3 Inventory tool, which allows you to extract information about S3 objects, including metadata; the information can then be retrieved using, for instance, Athena queries.
Details can be found here.
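The HEAD-request iteration described above can be sketched as follows, with the filtering factored into a pure helper. The listObjectsV2, headObject, and copyObject calls are the real aws-sdk v2 methods, but the bucket names and the 'genre' metadata key are placeholders:

```javascript
// Pure helper: given [{ Key, Metadata }] pairs collected from headObject calls,
// return the keys whose user-defined metadata matches the given value.
function keysMatchingMetadata(heads, metaKey, metaValue) {
  return heads
    .filter(h => h.Metadata && h.Metadata[metaKey] === metaValue)
    .map(h => h.Key);
}

// Illustrative loop (aws-sdk v2; bucket names and 'genre' are placeholders):
// const objects = await s3.listObjectsV2({ Bucket: srcBucket }).promise();
// const heads = await Promise.all(objects.Contents.map(async o => ({
//   Key: o.Key,
//   Metadata: (await s3.headObject({ Bucket: srcBucket, Key: o.Key }).promise()).Metadata,
// })));
// for (const key of keysMatchingMetadata(heads, 'genre', 'jazz')) {
//   await s3.copyObject({ Bucket: dstBucket, CopySource: `${srcBucket}/${key}`, Key: key }).promise();
// }

const sample = [
  { Key: 'a.mp3', Metadata: { genre: 'jazz' } },
  { Key: 'b.mp3', Metadata: { genre: 'rock' } },
];
console.log(keysMatchingMetadata(sample, 'genre', 'jazz')); // [ 'a.mp3' ]
```

Remember that S3 copy does not move: after copyObject you would also delete the source object if you want true "move" semantics.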