I have a huge ORC object (> 50 GB) in S3. I would like to deserialize it in chunks (in a streaming manner), so that I can retry from the last offset if an S3 download fails.
I understand ORC stores its metadata in a footer, so I'm looking for a solution that reads the footer first and then deserializes the rest in chunks.
S3 supports requesting specific byte ranges of an object over its HTTP API. Assuming you know your stripe size ahead of time, you can use the API to get the file size, calculate the postscript offset, and download just that range as a chunk. With that metadata, you can then start pulling in the remainder of the file. It is probably best to issue several requests, one per stripe, and decode them concurrently.
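To make that concrete, here is a minimal boto3 sketch of the ranged-request approach. The bucket/key names and the 256 KB tail size are assumptions, and the actual postscript/footer parsing is left to an ORC library:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "huge-file.orc"   # hypothetical names

# 1. Get the total object size without downloading anything.
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

# 2. Pull only the tail of the file, which contains the postscript and footer
#    (the very last byte of an ORC file stores the postscript length).
tail_len = min(size, 256 * 1024)             # guess; enlarge if the footer is larger
tail = s3.get_object(
    Bucket=bucket, Key=key,
    Range=f"bytes={size - tail_len}-{size - 1}",
)["Body"].read()
# ... parse the postscript/footer from `tail` with an ORC reader to get stripe offsets ...

# 3. Fetch each stripe as its own ranged request, so any failure can be
#    retried for that stripe alone.
def fetch_stripe(offset, length):
    resp = s3.get_object(
        Bucket=bucket, Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()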
Related
I have a 10 GB CSV file. I can put the file in S3 in two ways.
1) Upload the entire file as a single CSV object.
2) Divide the file into multiple chunks (say 200 MB each) and upload them.
Now I need to get all the data in the object into a pandas DataFrame running on an EC2 instance.
1) One way is to make a single request, fetch the whole file (if it was uploaded as one big object), and put the data into the DataFrame.
2) The other way is to make multiple requests, one per object, and keep appending the data to the DataFrame.
Which is the better way of doing it?
With multiple files, you have the possibility of downloading them simultaneously in parallel threads (see the sketch after this list). But this has 2 drawbacks:
These operations are IO-heavy (mostly network), so depending on your instance type you might see worse performance overall.
Multithreaded apps carry some overhead for handling errors, aggregating results, and so on.
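As a rough illustration of option 2 (all names below are placeholders): fetch the chunks in a thread pool, parse each one independently, and concatenate once at the end rather than appending in a loop, which avoids repeatedly copying the growing dataframe.

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "my-bucket"                                  # hypothetical
keys = [f"data/part-{i:03d}.csv" for i in range(50)]  # hypothetical ~200 MB chunks

def read_chunk(key):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(BytesIO(body))

with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(read_chunk, keys))

# One concat at the end is cheaper than appending to the dataframe per chunk.
df = pd.concat(frames, ignore_index=True)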
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.
I am stuck with the following problem: I need to upload objects in small parts (512 KB), so I cannot use multipart upload (because of its 5 MB minimum part size). Because of that, I put my parts in a "partitions" bucket and run a Cron task to download the partitions and upload a single concatenated object into a "completed" bucket.
I would like to confirm, however, whether there is any more elegant way to do this than direct download and concatenation. The AWS CLI suggests one can copy objects as a whole, but I see no way to copy and concatenate several objects into one. Is there a way to do this via AWS S3 means?
UPD: I am not guaranteed a 512 KB chunk size (in fact it ranges from 512 KB to 16 MB), but it is usually 512 KB, and this limit comes from the vendor of my IP cameras, so I cannot really change it. I do know the resulting size beforehand; the camera tells my backend "I am going to upload 33 MB" in a separate call, but I have no control over the number of chunks or their size beyond the guaranteed bounds above.
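For what it's worth, the Cron task described above can stay quite small; here is a sketch along those lines, with the bucket names, key prefix, and lexicographic part ordering all assumed:

from io import BytesIO

import boto3

s3 = boto3.client("s3")

def concatenate(prefix, target_key):
    # List the uploaded chunks for this recording (assumes < 1000 parts;
    # otherwise paginate) and sort them into their original order.
    parts = s3.list_objects_v2(Bucket="partitions", Prefix=prefix).get("Contents", [])
    parts.sort(key=lambda obj: obj["Key"])

    # Download and concatenate the chunks in memory, then upload one object.
    buf = BytesIO()
    for obj in parts:
        buf.write(s3.get_object(Bucket="partitions", Key=obj["Key"])["Body"].read())
    buf.seek(0)
    s3.upload_fileobj(buf, "completed", target_key)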
I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into a S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have field 'date' but not field 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, in itself, valid JSON, which is baffling to me. Instead it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a Lambda function to 'fix' the Firehose file, triggered by its creation in the bucket, but that feels like a cheap workaround for two pieces that should fit together neatly.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?
cheers and thanks in advance
It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to its buffering (time and size) settings and then write the records to S3 as a single object. See Firehose Data Delivery.
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of 'JSON line' (newline delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in an instance where differences in the files being crawled forced it into a table-per-file mode, and it hit a limit on the number of tables. AWS Glue Limits
I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a lambda script as part of the Kinesis process that forced every document into the same structure, by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
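A simplified sketch of that normalization idea follows; the field list and the plain-function shape are assumptions, and in practice it ran as a Lambda transform in the Kinesis pipeline:

import json

EXPECTED_FIELDS = ["date", "name", "url", "user_id"]   # hypothetical schema

def normalize(raw_line: str) -> str:
    # Force every document into the same structure by filling missing
    # fields with None, which becomes NULL in the catalog table.
    doc = json.loads(raw_line)
    for field in EXPECTED_FIELDS:
        doc.setdefault(field, None)
    # Emit newline-delimited JSON so the crawler can parse it as JSON lines.
    return json.dumps(doc) + "\n"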
Can you please paste a few lines from the JSON file that Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when you create the Firehose job? There you can specify the JSONSerDe or Glue catalog to parse your data.
What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
msg_pkg = (str(json_response) + '\n').encode('utf-8')  # append the newline delimiter
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)
Apparently the Hive JSON SerDe is the default used to process JSON data, and it expects one JSON document per line. After doing this I was able to crawl the JSON data and read it in Athena as well.
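For context, put_firehose in that snippet is the answerer's own wrapper; a fuller boto3 sketch looks roughly like the following (json.dumps is used instead of str() so the payload stays valid JSON):

import json

import boto3

firehose = boto3.client("firehose")
json_response = {"event": "page_view"}   # placeholder payload

# Append the newline delimiter so each record lands on its own line in S3.
msg_pkg = (json.dumps(json_response) + "\n").encode("utf-8")
firehose.put_record(
    DeliveryStreamName="agg2-na-firehose",
    Record={"Data": msg_pkg},
)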
I need to load around 1 million rows into a BigQuery table. My approach is to write the data to Cloud Storage and then use the load API to load multiple files at once.
What's the most efficient way to do this? I can parallelize the write-to-GCS part. When I call the load API, I pass in all the URIs, so I only need to call it once. I'm not sure how this loading is carried out on the backend: if I pass in multiple file names, will the load run in multiple processes? How do I decide the size of each file to get the best performance?
Thanks
Put all the million rows in one file. If the file is not compressed, BigQuery can read it in parallel with many workers.
From https://cloud.google.com/bigquery/quota-policy
BigQuery can read compressed files (.gz) of up to 4GB.
BigQuery can read uncompressed files (.csv, .json, ...) of up to 5000GB. BigQuery figures out how to read it in parallel - you don't need to worry.
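As a minimal sketch of the load-API side (the project, dataset, table, and bucket names are placeholders), one job can point at a wildcard or a list of URIs and BigQuery parallelizes the read itself:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# A single load job; the source may be a wildcard URI or a list of GCS URIs.
load_job = client.load_table_from_uri(
    "gs://my-bucket/export/part-*.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()   # block until the load finishes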
Actually, the title was a question :)
Does AWS S3 support streaming a file that is not yet 100% uploaded? Client #1 splits a file into small chunks and starts uploading them using multipart upload. Client #2 starts downloading them from S3, so that client #2 doesn't need to wait until client #1 has uploaded the whole file.
Is it possible to do this without an additional streaming server?
This is not natively supported by S3.
S3 allows the individual parts of a multipart upload to be uploaded sequentially, or in parallel, or even out of their logical order, over an essentially unlimited period of time.
It is not until you send the CompleteMultipartUpload request that S3 verifies that all the parts are present and have the correct checksums, assembles the final object from the parts, and creates it in the bucket (overwriting the former object with the same key, if there was one). Until then, the object -- as an object at the designated key -- does not technically exist, so it can't be downloaded.
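For reference, here is a bare-bones multipart upload flow with boto3 (the names and the part source are placeholders); notice that the object at the key does not exist until complete_multipart_upload returns, which is exactly why client #2 cannot stream it earlier:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-upload.bin"   # hypothetical

def chunk_source():
    # Placeholder: yield the parts to upload (each >= 5 MB except the last).
    yield b"x" * (5 * 1024 * 1024)
    yield b"tail bytes"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunk_source(), start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key,
        UploadId=mpu["UploadId"],
        PartNumber=part_number,
        Body=chunk,
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

# Only now is the object assembled, verified, and made visible for download.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)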