Google Buckets / Read by line - gsutil

I know that it is currently possible to download objects by byte range from Google Cloud Storage buckets:
const options = {
  destination: destFileName,
  start: startByte,
  end: endByte,
};

await storage.bucket(bucketName).file(fileName).download(options);
However, I would need to read by line, as the files I deal with are *.csv:
await storage
  .bucket(bucketName)
  .file(fileName)
  .download({ destination: '', lineStart: number, lineEnd: number });
I couldn't find any API for it; could anyone advise on how to achieve the desired behaviour?

You cannot read a file line by line directly from Cloud Storage, as it stores files as objects, as noted in this answer:
The string you read from Google Storage is a string representation of a multipart form. It contains not only the uploaded file contents but also some metadata.
To read the file line by line as desired, I suggest loading it into a variable and then parsing that variable as needed. You could use the sample code provided in this answer:
const { Storage } = require("@google-cloud/storage");
const Papa = require("papaparse");

const storage = new Storage();

// Read the file from Cloud Storage as a stream
const downloadedFile = storage
  .bucket(bucketName)
  .file(fileName)
  .createReadStream();

// Concatenate the streamed chunks into a single string
let fileBuffer = "";
downloadedFile
  .on("data", function (data) {
    fileBuffer += data;
  })
  .on("end", function () {
    // CSV file data
    // console.log(fileBuffer);

    // Parse the data using the newline character as delimiter
    var rows;
    Papa.parse(fileBuffer, {
      header: false,
      delimiter: "\n",
      complete: function (results) {
        // Shows the parsed data on the console
        console.log("Finished:", results.data);
        rows = results.data;
      },
    });
  });
To parse the data, you could use a library like PapaParse, as shown in this tutorial.
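If the files are large and you only need a slice of lines, an alternative sketch (not an official lineStart/lineEnd API; readLines and its 1-based line numbering are my own naming) is to pipe the object's read stream through Node's readline module and stop once the last wanted line has been read:

const { Storage } = require("@google-cloud/storage");
const readline = require("readline");

const storage = new Storage();

// Reads lines lineStart..lineEnd (1-based, inclusive) without buffering the whole file
async function readLines(bucketName, fileName, lineStart, lineEnd) {
  const fileStream = storage.bucket(bucketName).file(fileName).createReadStream();
  const rl = readline.createInterface({ input: fileStream, crlfDelay: Infinity });

  const lines = [];
  let lineNumber = 0;
  for await (const line of rl) {
    lineNumber += 1;
    if (lineNumber < lineStart) continue;
    lines.push(line);
    if (lineNumber >= lineEnd) break; // stop once the last wanted line is in
  }

  rl.close();
  fileStream.destroy(); // stop downloading the rest of the object
  return lines;
}

This still has to download the object up to lineEnd, but it avoids holding the entire CSV in memory; the collected lines can then be fed to PapaParse as above.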


Google Cloud Function -- Convert BigQuery Data to Gzip (Compressed) Json then Load to Cloud Storage

*For context, this script is largely based on the one found in this guide from Google: https://cloud.google.com/bigquery/docs/samples/bigquery-extract-table-json#bigquery_extract_table_json-nodejs
I have the below script, which is functioning. However, it writes a normal JSON file to Cloud Storage. To be a bit more optimized for file transfer and storage, I wanted to use pako (const pako = require('pako');) to compress the files before loading.
I haven't been able to figure out how to accomplish this, unfortunately, after numerous attempts.
Anyone have any ideas?
**I'm assuming it has something to do with the options in .extract(storage.bucket(bucketName).file(filename), options);, but again, I'm pretty lost on how to figure this out, unfortunately...
Any help would be appreciated! :)
**The intent of this function is:
It is a Google Cloud function
It gets data from BigQuery
It writes that data in JSON format to Cloud Storage
My goal is to integrate Pako (or another means of compression) to compress the JSON files to gzip format prior to moving into storage.
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');
const functions = require('@google-cloud/functions-framework');

const bigquery = new BigQuery();
const storage = new Storage();

functions.http('extractTableJSON', async (req, res) => {
  // Exports my_dataset:my_table to gcs://my-bucket/my-file as JSON.
  // https://cloud.google.com/bigquery/docs/samples/bigquery-extract-table-json#bigquery_extract_table_json-nodejs
  const DateYYYYMMDD = new Date().toISOString().slice(0, 10).replace(/-/g, "");

  const datasetId = "dataset-1";
  const tableId = "example";
  const bucketName = "domain.appspot.com";
  const filename = `/cache/${DateYYYYMMDD}/example.json`;

  // Location must match that of the source table.
  const options = {
    format: 'json',
    location: 'US',
  };

  // Export data from the table into a Google Cloud Storage file
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .extract(storage.bucket(bucketName).file(filename), options);

  console.log(`Job ${job.id} created.`);
  res.send(`Job ${job.id} created.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    res.send(errors);
  }
});
If you want to gzip-compress the result, simply use the gzip option:
// Location must match that of the source table.
const options = {
  format: 'json',
  location: 'US',
  gzip: true,
};
Job done ;)
From guillaume blaquiere (comment): "Ah, you are looking for an array of rows! OK, you can't have that out of the box. BigQuery exports a JSONL file (JSON Lines, with one valid JSON object per line, each representing a row in BigQuery)."
It turns out that I had misunderstood the expected output: I was expecting a JSON array, whereas the output is individual JSON lines, as Guillaume mentioned above.
So, if you're looking for a JSON array as output, you can still use the helper below to convert the response; but the JSON Lines output was in fact the expected one, and I was mistaken in thinking it was inaccurate (sorry, I'm new to this).
// Step #1: Use the below options to export to compressed JSON (as per guillaume blaquiere's note)
const options = {
  format: 'json',
  location: 'US',
  gzip: true,
};
// Step #2 (if you're looking for a JSON array): use the below helper function to convert the response.
function convertToJsonArray(text: string): any {
  // Wrap in an array, replace each newline with a comma, and drop the final character
  // (which assumes the exported file ends with a trailing newline)
  const wrappedText = `[${text.replace(/\r?\n|\r/g, ",").slice(0, -1)}]`;
  const jsonArray = JSON.parse(wrappedText);
  return jsonArray;
}
For reference, in case it's helpful, I created this function that handles both the compressed and the uncompressed JSON that's returned.
The context for this is that I'm writing the BigQuery table to JSON in Cloud Storage to act as a "cache", then requesting that file from a React app and using the function below to parse the file in the React app for use on the frontend.
import pako from 'pako';

function convertToJsonArray(text: string): any {
  // Wrap in an array, replace each newline with a comma, and drop the trailing comma
  const wrappedText = `[${text.replace(/\r?\n|\r/g, ",").slice(0, -1)}]`;
  const jsonArray = JSON.parse(wrappedText);
  return jsonArray;
}

async function getJsonData(arrayBuffer: ArrayBuffer): Promise<any> {
  try {
    // Try to inflate first (covers the gzip: true export)
    const Uint8Arr = pako.inflate(arrayBuffer);
    const arrayBuf = new TextDecoder().decode(Uint8Arr);
    const jsonArray = convertToJsonArray(arrayBuf);
    return jsonArray;
  } catch (error) {
    // Not compressed (or inflation failed): decode and parse the buffer as-is
    console.log("Error unzipping file, trying to parse as is.", error);
    const parsedBuffer = new TextDecoder().decode(arrayBuffer);
    const jsonArray = convertToJsonArray(parsedBuffer);
    return jsonArray;
  }
}
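For completeness, a hypothetical usage sketch on the React side (the URL is a placeholder for however you expose the exported object, e.g. a public or signed URL):

// Hypothetical usage: fetch the cached export and parse it into a JSON array
async function loadCache() {
  const response = await fetch("https://storage.googleapis.com/domain.appspot.com/cache/20230101/example.json");
  const arrayBuffer = await response.arrayBuffer();
  return getJsonData(arrayBuffer);
}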

Search and move files in S3 bucket according to metadata

I currently have a setup where audio files are uploaded to a bucket with user-defined metadata. My next goal is to filter on the metadata and move the files to a different folder. Currently I have a Lambda function that converts the audio to MP3. So I need help adjusting the code so that the metadata persists through the encoding and is also stored in a database, and creating another function that searches for a particular metadata value and moves the corresponding files to another bucket.
'use strict';
console.log('Loading function');

const aws = require('aws-sdk');
const s3 = new aws.S3({ apiVersion: '2006-03-01' });
const elastictranscoder = new aws.ElasticTranscoder();

// return basename without extension
function basename(path) {
  return path.split('/').reverse()[0].split('.')[0];
}

// return output file name with timestamp and extension
function outputKey(name, ext) {
  return name + '-' + Date.now().toString() + '.' + ext;
}

exports.handler = (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = event.Records[0].s3.object.key;

  var params = {
    Input: {
      Key: key
    },
    PipelineId: '1477166492757-jq7i0s',
    Outputs: [
      {
        Key: basename(key) + '.mp3',
        PresetId: '1351620000001-300040', // mp3-128
      }
    ]
  };

  elastictranscoder.createJob(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
      context.fail();
      return;
    }
    context.succeed();
  });
};
I have also done some research and know that the metadata should be retrievable with a HEAD request on the object, e.g. (boto3 syntax):
s3.head_object(Bucket=bucket, Key=key)
S3 does not provide a mechanism for searching metadata.
The only way to do what you're contemplating using only native S3 capabilities is to iterate through the list of objects and send a HEAD request for each one. Of course, this does not scale well for large buckets, and each of those requests comes with a (small) charge.
There is now the S3 Inventory tool, which lets you extract information about S3 objects, including metadata, and that information can then be queried, for instance with Athena.
Details can be found here.
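For a small bucket, a brute-force sketch of the "iterate and send a HEAD request for each object" approach described above could look like this in Node.js (aws-sdk v2; moveByMetadata and its parameters are placeholders of my own, not an existing API):

const aws = require('aws-sdk');
const s3 = new aws.S3({ apiVersion: '2006-03-01' });

// Moves every object whose user-defined metadata key `metaKey` equals `metaValue`.
// Note: user-defined metadata comes back lower-cased and without the "x-amz-meta-" prefix.
async function moveByMetadata(sourceBucket, targetBucket, metaKey, metaValue) {
  let ContinuationToken;
  do {
    const page = await s3.listObjectsV2({ Bucket: sourceBucket, ContinuationToken }).promise();
    for (const obj of page.Contents) {
      const head = await s3.headObject({ Bucket: sourceBucket, Key: obj.Key }).promise();
      if (head.Metadata && head.Metadata[metaKey] === metaValue) {
        // copyObject keeps the object's metadata by default; URL-encode keys with special characters
        await s3.copyObject({
          Bucket: targetBucket,
          Key: obj.Key,
          CopySource: sourceBucket + '/' + obj.Key,
        }).promise();
        await s3.deleteObject({ Bucket: sourceBucket, Key: obj.Key }).promise();
      }
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
}

For anything larger, the S3 Inventory / Athena route, or recording the metadata in a database at upload time, will scale much better.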

Windows Azure Storage Blobs to zip file with Express

I am trying to use this plugin (express-zip). On the Azure Storage side we have getBlobToStream, which gives us the file in a specific stream. What I do now is get the image from the blob, save it on the server, and then res.zip it. Is it somehow possible to create a write stream which will write into a read stream?
Edit: The question has been edited to ask about doing this in Express from Node.js. I'm leaving the original answer below in case anyone is interested in a C# solution.
For Node, you could use a strategy similar to what express-zip uses, but instead of passing a file read stream in this line, pass in a blob read stream obtained using createReadStream.
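For illustration, a rough sketch of that Node approach (assuming the legacy azure-storage SDK's blobService.createReadStream and the zip-stream package that express-zip builds on; the route, container name, and blob list are placeholders):

var express = require('express');
var azure = require('azure-storage');
var ZipStream = require('zip-stream');

var app = express();
// Uses the AZURE_STORAGE_CONNECTION_STRING environment variable for credentials
var blobService = azure.createBlobService();

app.get('/images.zip', function (req, res) {
  res.set('Content-Type', 'application/zip');
  res.set('Content-Disposition', 'attachment; filename="images.zip"');

  var zip = new ZipStream();
  zip.pipe(res);

  var blobNames = ['a.png', 'b.png']; // placeholder list of blobs to include
  (function addNext(i) {
    if (i >= blobNames.length) return zip.finalize();
    // Stream the blob straight from Azure Storage into the zip entry
    var blobStream = blobService.createReadStream('containerName', blobNames[i]);
    zip.entry(blobStream, { name: blobNames[i] }, function (err) {
      if (err) return res.status(500).end();
      addNext(i + 1);
    });
  })(0);
});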
Solution using C#:
If you don't mind caching everything locally while you build the zip, the way you are doing it is fine. You can use a tool such as AzCopy to rapidly download an entire container from storage.
To avoid caching locally, you could use the ZipArchive class, as in the following C# code:
internal static void ArchiveBlobs(CloudBlockBlob destinationBlob, IEnumerable<CloudBlob> sourceBlobs)
{
    using (Stream blobWriteStream = destinationBlob.OpenWrite())
    {
        using (ZipArchive archive = new ZipArchive(blobWriteStream, ZipArchiveMode.Create))
        {
            foreach (CloudBlob sourceBlob in sourceBlobs)
            {
                ZipArchiveEntry archiveEntry = archive.CreateEntry(sourceBlob.Name);
                using (Stream archiveWriteStream = archiveEntry.Open())
                {
                    sourceBlob.DownloadToStream(archiveWriteStream);
                }
            }
        }
    }
}
This creates a zip archive in Azure storage that contains multiple blobs without writing anything to disk locally.
I'm the author of express-zip. What you are trying to do should be possible. If you look under the covers, you'll see I am in fact adding streams into the zip:
https://github.com/thrackle/express-zip/blob/master/lib/express-zip.js#L55
So something like this should work for you (prior to me adding support for this in the interface of the package itself):
// zipstream here is the streaming zip library that express-zip uses internally (zip-stream)
var zip = zipstream(exports.options);
zip.pipe(res); // res is a writable stream

var addFile = function(file, cb) {
  // Pass a readable stream of the blob contents here,
  // e.g. one obtained from blobService.createReadStream(containerName, file.name)
  zip.entry(getBlobToStream(), { name: file.name }, cb);
};

async.forEachSeries(files, addFile, function(err) {
  if (err) return cb(err);
  zip.finalize(function(bytesZipped) {
    cb(null, bytesZipped);
  });
});
Apologies if I've made horrible errors above; I haven't worked on this in a while.

Closing stream after using BitmapEncoder with WinJS Metro app

In a Windows 8 Metro application written in JS, I open a file, get the stream, and write some image data to it using the 'promise / .then' pattern. It works fine: the file is successfully saved to the file system. However, after using the BitmapEncoder to flush the stream to the file, the stream is still open; i.e. I can't access the file until I kill the application, and the 'stream' variable is out of scope so I can't close() it. Is there something comparable to the C# using statement that could be used?
...then(function (file) {
  return file.openAsync(Windows.Storage.FileAccessMode.readWrite);
})
.then(function (stream) {
  // Create image encoder object
  return Imaging.BitmapEncoder.createAsync(Imaging.BitmapEncoder.pngEncoderId, stream);
})
.then(function (encoder) {
  // Set the pixel data in the encoder ('canvasImage.data' is an existing image stream)
  encoder.setPixelData(Imaging.BitmapPixelFormat.rgba8, Imaging.BitmapAlphaMode.straight,
    canvasImage.width, canvasImage.height, 96, 96, canvasImage.data);
  // Go do the encoding
  return encoder.flushAsync();
  // file saved successfully,
  // but the stream is still open and the stream variable is out of scope.
});
This simple imaging sample from Microsoft might help. Copied below.
It looks like, in your case, you need to declare the stream before the chain of then calls, make sure you don't name-collide with your parameter to your function accepting the stream (note the part where they do _stream = stream), and add a then call to close the stream.
function scenario2GetImageRotationAsync(file) {
  var accessMode = Windows.Storage.FileAccessMode.read;

  // Keep data in-scope across multiple asynchronous methods
  var stream;
  var exifRotation;

  return file.openAsync(accessMode).then(function (_stream) {
    stream = _stream;
    return Imaging.BitmapDecoder.createAsync(stream);
  }).then(function (decoder) {
    // irrelevant stuff to this question
  }).then(function () {
    if (stream) {
      stream.close();
    }
    return exifRotation;
  });
}
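Applied to the code in the question, a sketch of the same pattern could look like this (filePromise stands in for whatever promise starts the original chain; the rest uses the same APIs as the snippets above):

// Keep the stream in scope across the chain so it can be closed at the end
var stream;

filePromise.then(function (file) {
  return file.openAsync(Windows.Storage.FileAccessMode.readWrite);
}).then(function (_stream) {
  stream = _stream;
  // Create the image encoder on the open stream
  return Imaging.BitmapEncoder.createAsync(Imaging.BitmapEncoder.pngEncoderId, stream);
}).then(function (encoder) {
  encoder.setPixelData(Imaging.BitmapPixelFormat.rgba8, Imaging.BitmapAlphaMode.straight,
    canvasImage.width, canvasImage.height, 96, 96, canvasImage.data);
  return encoder.flushAsync();
}).then(function () {
  // The encode has been flushed; now release the file by closing the stream
  if (stream) {
    stream.close();
  }
});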

when to check for file size/mimetype in node.js upload script?

I created an upload script in node.js using express/formidable. It basically works, but I am wondering where and when to check the uploaded file, e.g. for the maximum file size or whether the file's mimetype is actually allowed.
My program looks like this:
app.post('/', function(req, res, next) {
  req.form.on('progress', function(bytesReceived, bytesExpected) {
    // ... do stuff
  });

  req.form.complete(function(err, fields, files) {
    console.log('\nuploaded %s to %s', files.image.filename, files.image.path);
    // ... do stuff
  });
});
It seems to me that the only viable place for checking the mimetype/file size is the complete event, where I can reliably use the filesystem functions to get the size of the uploaded file in /tmp/. But that does not seem like a good idea, because:
the possibly malicious/too large file is already uploaded to my server
the user experience is poor: you watch the upload progress, only to be told afterwards that it didn't work
What's the best practice for implementing this? I found quite a few examples of file uploads in node.js, but none seemed to do the security checks I would need.
With help from some guys at the node IRC and the node mailing list, here is what I do:
I am using formidable to handle the file upload. Using the progress event I can check the maximum filesize like this:
form.on('progress', function(bytesReceived, bytesExpected) {
  if (bytesReceived > MAX_UPLOAD_SIZE) {
    console.log('### ERROR: FILE TOO LARGE');
  }
});
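Note that the snippet above only logs the problem; to actually stop the transfer, one option (a sketch, assuming the Express req object is still in scope in the same closure as form) is to destroy the underlying socket once the limit is exceeded:

form.on('progress', function(bytesReceived, bytesExpected) {
  if (bytesReceived > MAX_UPLOAD_SIZE) {
    console.log('### ERROR: FILE TOO LARGE');
    // Hard-abort the upload instead of letting the rest of the file arrive
    req.connection.destroy();
  }
});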
Reliably checking the mimetype is much more difficult. The basic idea is to use the progress event and then, once enough of the file has been uploaded, run a file --mime-type call and check the output of that external command. Simplified, it looks like this:
// child_process.exec is used to shell out to the `file` utility
var exec = require('child_process').exec;

// contains the path of the uploaded file,
// is grabbed in the fileBegin event below
var tmpPath;

form.on('progress', function validateMimetype(bytesReceived, bytesExpected) {
  var percent = (bytesReceived / bytesExpected * 100) | 0;

  // pretty basic check if enough bytes of the file are written to disk,
  // might be too naive if the file is small!
  if (tmpPath && percent > 25) {
    var child = exec('file --mime-type ' + tmpPath, function (err, stdout, stderr) {
      var mimetype = stdout.substring(stdout.lastIndexOf(':') + 2, stdout.lastIndexOf('\n'));
      console.log('### file CALL OUTPUT', err, stdout, stderr);

      if (err || stderr) {
        console.log('### ERROR: MIMETYPE COULD NOT BE DETECTED');
      } else if (!ALLOWED_MIME_TYPES[mimetype]) {
        console.log('### ERROR: INVALID MIMETYPE', mimetype);
      } else {
        console.log('### MIMETYPE VALIDATION COMPLETE');
      }
    });

    form.removeListener('progress', validateMimetype);
  }
});

form.on('fileBegin', function grabTmpPath(_, fileInfo) {
  if (fileInfo.path) {
    tmpPath = fileInfo.path;
    form.removeListener('fileBegin', grabTmpPath);
  }
});
The new version of Connect (2.x) has this already baked into the bodyParser via the limit middleware: https://github.com/senchalabs/connect/blob/master/lib/middleware/multipart.js#L44-61
I think it's much better this way as you just kill the request when it exceeds the maximum limit instead of just stopping the formidable parser (and letting the request "go on").
More about the limit middleware: http://www.senchalabs.org/connect/limit.html
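For reference, a rough usage sketch of that approach (Connect 2.x-era API; treat the exact middleware names and the '5mb' value as assumptions and check the linked docs for your version):

var connect = require('connect');

var app = connect()
  // Reject request bodies larger than 5mb before the multipart parser ever sees them
  .use(connect.limit('5mb'))
  .use(connect.multipart())
  .use(function (req, res) {
    // req.files is populated by the multipart middleware
    res.end('uploaded\n');
  });

app.listen(3000);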