TimescaleDB: how to ingest files from S3?

In Postgres, one way to ingest files directly from S3 is the aws_s3 extension, for example with its table_import_from_s3 function.
However, TimescaleDB does not support this directly as of now:
=> CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
ERROR: could not open extension control file "/usr/share/postgresql/14/extension/aws_s3.control": No such file or directory
Is there an alternative to ingest data from s3 files directly into TimescaleDB?

Related

Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead I'd like to pass just the prefix: "s3://bucket/my-data.parquet"
Also, what if the schema differs between row groups across the partition files?
Following instructions here I downloaded a jar file and ran
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.

How to add the files of s3bucket folder to a zipfile and download the zip file

I have a folder in an S3 bucket. I want to zip the files inside it and then download the zip file. Everything I found was related to Lambda. Is there a way I can do it without using Lambda? If not, what is the proper way to do it?
Thank you in advance.
S3 can't zip it on the fly for you, since it is only a storage service. You could use Lambda of course, but the simplest way to download a "folder" from S3 is to use the AWS CLI:
aws s3 sync s3://<bucket_name>/<folder_key> <local_dest_path>
You can then zip it on your local machine if needed.
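If you want to script the zipping step as well, here is a minimal Python sketch. It assumes the folder has already been synced to a local directory with the command above; the function name zip_folder and the example paths are my own, not part of the AWS CLI.

```python
import shutil
from pathlib import Path


def zip_folder(local_dir: str, archive_base: str) -> str:
    """Zip everything under local_dir into archive_base + '.zip'.

    Returns the path of the archive that was written.
    """
    source = Path(local_dir)
    if not source.is_dir():
        raise NotADirectoryError(local_dir)
    # make_archive appends the ".zip" suffix itself; entries are stored
    # relative to local_dir, so the archive has no leading directories.
    return shutil.make_archive(archive_base, "zip", root_dir=str(source))
```

For example, zip_folder("<local_dest_path>", "my-folder") would write my-folder.zip. Keep the archive outside the directory being zipped, otherwise the archive may try to include itself.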

get zip files from one s3 bucket unzip them to another s3 bucket

I have zip files in one S3 bucket.
I need to unzip them and copy the unzipped folder to another S3 bucket, keeping the source path.
For example, if the zip file in the source bucket is under
"s3://bucketname/foo/bar/file.zip"
then in the destination bucket it should be "s3://destbucketname/foo/bar/zipname/files.."
How can this be done?
I know it is somehow possible to do this with Lambda, so that I won't have to download anything locally, but I have no idea how.
Thanks!
If you want to trigger this process as soon as the Zip file is uploaded to the bucket, you could write an AWS Lambda function.
When the Lambda function is triggered, it will be passed the name of the bucket and object that was uploaded. The function should then:
Download the Zip file to /tmp
Unzip the file (beware: /tmp storage is limited, 512 MB by default, although the limit can be raised in the function configuration)
Loop through the unzipped files and upload them to the destination bucket
Delete all local files created (to free-up space for any future executions of the function)
For a general example, see: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda
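The steps above could be sketched as a Python handler like the one below. This is only a sketch under a few assumptions: the destination bucket name is a placeholder, "zipname" is taken to be the zip's key without its extension, boto3 is available (as it is in the Lambda Python runtimes), and the whole archive fits in /tmp. The helper destination_key is my own name, not an AWS API.

```python
import os
import shutil
import tempfile
import urllib.parse
import zipfile


def destination_key(zip_key: str, member_name: str) -> str:
    """Map foo/bar/file.zip + a/b.txt to foo/bar/file/a/b.txt."""
    prefix, _ = os.path.splitext(zip_key)
    return f"{prefix}/{member_name}"


def lambda_handler(event, context, dest_bucket="destbucketname"):
    # dest_bucket is a placeholder; in practice it might come from an
    # environment variable.
    import boto3  # imported here so destination_key works without boto3

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        workdir = tempfile.mkdtemp(dir="/tmp")
        try:
            # 1. Download the Zip file to /tmp.
            zip_path = os.path.join(workdir, "archive.zip")
            s3.download_file(bucket, key, zip_path)
            # 2. Unzip it.
            extract_dir = os.path.join(workdir, "unzipped")
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(extract_dir)
                members = [m for m in archive.namelist() if not m.endswith("/")]
            # 3. Upload every extracted file under the source path.
            for member in members:
                s3.upload_file(
                    os.path.join(extract_dir, member),
                    dest_bucket,
                    destination_key(key, member),
                )
        finally:
            # 4. Free /tmp for later invocations that reuse this container.
            shutil.rmtree(workdir, ignore_errors=True)
```

With this mapping, destination_key("foo/bar/file.zip", "a/b.txt") gives "foo/bar/file/a/b.txt", which matches the layout asked for in the question.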
You can use AWS Lambda for this. You can also set an event notification on your S3 bucket so that a Lambda function is triggered every time a new file arrives. You can write Python code that uses boto3 to connect to S3, read the zip file into a buffer, extract each member with zipfile, compress it with gzip, and then re-upload it to S3 under your desired folder/path:
import gzip
import io
import zipfile
# zipped is a zipfile.ZipFile opened on the downloaded archive, file is one
# of its member names, and destinationbucket is a boto3 Bucket resource.
with zipped.open(file, "r") as f_in:
    gzipped_content = gzip.compress(f_in.read())
destinationbucket.upload_fileobj(
    io.BytesIO(gzipped_content),
    final_file_path,
    ExtraArgs={"ContentType": "text/plain"},
)
There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
Arguably Python is simpler to use for your Lambda, but if you are considering Java, I've made a library that manages unzipping of data in AWS S3 utilising stream download and multipart upload.
Unzipping is achieved without keeping data in memory or writing to disk. That makes it suitable for large data files - it has been used to unzip files of size 100GB+.
It is available in Maven Central, here is the GitHub link: nejckorasa/s3-stream-unzip

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
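To make the three rules concrete, here is a small Python sketch of the same decision. It only mirrors the documentation quoted above and is not the CLI's actual implementation; FileInfo and needs_copy are names I made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    size: int                 # size in bytes
    last_modified: datetime   # use timezone-aware datetimes on both sides


def needs_copy(source: FileInfo, destination: Optional[FileInfo]) -> bool:
    """Return True if the source should be copied to the destination."""
    # Rule 3: the object does not exist at the destination.
    if destination is None:
        return True
    # Rule 1: the sizes differ.
    if source.size != destination.size:
        return True
    # Rule 2: the source is newer than the destination.
    return source.last_modified > destination.last_modified
```

For a local file the two fields can be filled from os.stat, and for an S3 object from the ContentLength and LastModified fields of a boto3 head_object response.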
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects to present S3 as a local filesystem. I haven't tried any of those, but it seems like the way to go.

Scrapy crawl appends locally, replaces on S3?

I implemented a Scrapy project that is now working fine locally. Using the crawl command, each spider appended its jsonlines to the same file if the file existed. When I changed the feed exporter to S3 using boto, it now overwrites the entire file with the data from the last spider run instead of appending to the file.
Is there any way to make Scrapy/boto/S3 append the jsonlines to the file like it does locally?
Thanks
There is no way to append to a file in S3. You could enable versioning on the S3 bucket and then each time the file was written to S3, it would create a new version of the file. Then you could retrieve all versions of the file using the list_versions method of the boto Bucket object.
From reading the feed exporter code (https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/feedexport.py), the file exporter opens the specified file in append mode whilst the S3 exporter calls set_contents_from_file which presumably overwrites the original file.
The boto S3 documentation (http://boto.readthedocs.org/en/latest/getting_started.html) doesn't mention being able to modify stored files, so the only solution would be to create a custom exporter that stores a local copy of results that can be appended to first before copying that file to S3.
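A rough Python sketch of that "append locally, then copy to S3" idea is below. It is not wired into Scrapy's feed exporter interface, and it uses boto3's upload_file rather than the older boto library discussed above; the function names, bucket and key are placeholders of mine.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def append_jsonlines(local_path: str, items: Iterable[Mapping]) -> int:
    """Append each item as one JSON line to local_path; return the count."""
    path = Path(local_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    # Mode "a" keeps whatever earlier spider runs already wrote.
    with path.open("a", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(dict(item)) + "\n")
            count += 1
    return count


def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Replace the S3 object with the full local file (assumes boto3)."""
    import boto3  # imported here so append_jsonlines works without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
```

Each spider run would call append_jsonlines with its scraped items and then upload_to_s3, so the object in S3 is still overwritten every time, but with the accumulated file rather than only the last run's output.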