When using Redshift Spectrum, it seems you can only import data by providing a location down to a folder, and it imports all the files inside that folder.
Is there a way to import only one file from a folder that contains many files? When I provide the full path including the filename, I think it treats the file as a manifest file and gives errors: manifest is too large or JSON not supported.
Is there any other way?
You inadvertently answered your own question: Use a manifest file
From CREATE EXTERNAL TABLE - Amazon Redshift:
LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' }
The path to the Amazon S3 bucket or folder that contains the data files or a manifest file that contains a list of Amazon S3 object paths. The buckets must be in the same AWS Region as the Amazon Redshift cluster.
If the path specifies a manifest file, the s3://bucket/manifest_file argument must explicitly reference a single file, for example, 's3://mybucket/manifest.txt'. It can't reference a key prefix.
The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3 and the size of the file, in bytes. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same AWS Region as the Amazon Redshift cluster.
I'm not sure why it requires the length of each file. It might be used to distribute the workload amongst multiple nodes.
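As a rough sketch of the manifest approach, the snippet below builds a manifest that lists a single file. The layout (an "entries" list whose items carry "url" and "meta" with "content_length") is how I understand the documented Redshift Spectrum format; the bucket name, key, and size are made-up placeholders, and build_spectrum_manifest is a hypothetical helper, not part of any AWS library.

```python
import json


def build_spectrum_manifest(files):
    """Return manifest JSON text for an iterable of (s3_url, size_in_bytes) pairs."""
    entries = [
        {"url": url, "meta": {"content_length": size}}
        for url, size in files
    ]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical bucket, key, and size: replace them with the real object's values.
# The size must be the object's actual size in bytes.
manifest_text = build_spectrum_manifest(
    [("s3://mybucket/folder/only_this_file.csv", 5956875)]
)
print(manifest_text)
```

You would then upload the resulting text as, say, s3://mybucket/manifest.txt and point LOCATION at that object instead of at the folder.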
Related
So we uploaded about 2000 files into a bucket, using boto3, where there's no versioning enabled.
However, some of those files were already in the bucket under the same name. Is there a way to check which of those files were already in the bucket before being uploaded? The issue is that the bucket has no versioning enabled.
We did not initially create a list of the file contents of the bucket.
I am trying to access a file that has been deleted from an S3 bucket using AWS Lambda.
I have set up a trigger for s3:ObjectRemoved:*. However, by the time I have extracted the bucket and file name of the deleted file, the file is already gone from S3, so I do not have access to its contents.
What approach should be taken with AWS Lambda to get the contents of a file after it is deleted from an S3 bucket?
The comment from #keithRozario was useful; however, even with versioning, a GET request on the deleted object results in a Not Found error, as per the S3 documentation.
#Ersoy suggested creating a 'bin' bucket or directory/prefix that holds a copy of the file under the same name, and working with that copy as your requirements dictate.
In my case, I copy each object to a bin directory when it is first created, and then read it from that directory when the file is deleted from the main upload directory.
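A minimal sketch of that bin-copy approach is below. It assumes two Lambda handlers, one wired to object-created events and one to object-removed events; the function names and the BIN_PREFIX value are hypothetical. The S3 client is passed in as a parameter (in a real Lambda you would pass boto3.client("s3")), and only the copy_object and get_object calls are used on it.

```python
from urllib.parse import unquote_plus

BIN_PREFIX = "bin/"  # hypothetical prefix that holds the backup copies


def _bucket_and_key(record):
    """Extract bucket and key from one S3 event record (keys arrive URL-encoded)."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def backup_on_create(event, s3_client):
    """On object creation, copy the new object under BIN_PREFIX."""
    for record in event["Records"]:
        bucket, key = _bucket_and_key(record)
        if key.startswith(BIN_PREFIX):
            continue  # skip events caused by our own backup copies
        s3_client.copy_object(
            Bucket=bucket,
            Key=BIN_PREFIX + key,
            CopySource={"Bucket": bucket, "Key": key},
        )


def read_backup_on_delete(event, s3_client):
    """On object removal, read the backup copy and return {key: bytes}."""
    contents = {}
    for record in event["Records"]:
        bucket, key = _bucket_and_key(record)
        response = s3_client.get_object(Bucket=bucket, Key=BIN_PREFIX + key)
        contents[key] = response["Body"].read()
    return contents
```

Scoping the delete trigger to the upload prefix only (rather than the whole bucket) would avoid the delete handler firing for objects removed from the bin itself.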
I have a working config to push files from a directory on my server to an S3 bucket. NiFi is running on a different server, so I use a GetSFTP processor. The source files are in subfolders, which my current PutS3Object config does not support: it jams all of the files at the root level of the S3 bucket. I know there's a way to get PutS3Object to create directories using defined folders. The Object Key is set to ${filename} by default. If it is set to, say, my/directory/${filename}, it creates two folders, my and the subfolder directory, and puts the files inside. However, I do NOT know what to set for the Object Key to replicate the source directories of the file(s).
Try ${path}/${filename} based on this in the documentation:
Keeping with the example of a file that is picked up from a local file system, the FlowFile would have an attribute called filename that reflected the name of the file on the file system. Additionally, the FlowFile will have a path attribute that reflects the directory on the file system that this file lived in.
I am looking for suggestions on loading CSV files from an S3 bucket into a Neo4j graph database. In the S3 bucket the files are in csv.gz format. I need to import them into my Neo4j graph database, which runs on an EC2 instance.
1. Is there any direct way to load csv.gz into the Neo4j database without unzipping it?
2. Can I set the S3 bucket path in the neo4j.conf file at neo4j.dbms.directory, which is neo4j/import by default?
Kindly suggest some ideas for loading files from S3.
Thank you
You can achieve both of these goals with APOC. The docs give you two approaches:
Load from the GZ file directly, assuming the file in the bucket has a public URL
Load the file from S3 directly, with an access token
Here's an example of the first approach - the section after the ! is the filename within the zip file to load, and this should work with .zip, .gz, .tar files etc.
CALL apoc.load.csv("https://pablissimo-so-test.s3-us-west-2.amazonaws.com/mycsv.zip!mycsv.csv")
I am using the ListS3 processor to get files from S3 and piping its output into the RouteOnAttribute processor. From there I am using Route to Property name as the Routing Strategy and assigning properties based on which files I am listing.
I am able to see all the files I want but can't do anything with them, because another processor down the line needs the full path of those files. I am using a Python script that takes file paths as command-line arguments.
How do I extract the full absolute path of the files from S3?
You can list, download, and save S3 files locally using a sequence of NiFi processors like the following:
ListS3 - to get references to S3 objects you can filter. Output from ListS3 contains only references to objects, not the content itself, in attributes:
s3.bucket - name of the bucket, like my-bucket
filename - key of the object, like path/to/file.txt
FetchS3Object - to download object content from S3 using the bucket and key from ListS3 above.
PutFile - to store the file locally. Set the Directory property to where you want the files to be placed, for example /path/to/directory. The filename attributes from S3 will contain relative paths from S3 keys, so these would be appended to the Directory by default.
You can then assemble local paths for your Python script using NiFi expression language:
/path/to/directory/${filename}
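On the Python side, a script that receives those assembled paths might look roughly like the sketch below. The function names are hypothetical, and local_path only mirrors what the expression above would produce for a given Directory and S3 key; it is not something NiFi itself provides.

```python
import argparse
import posixpath


def local_path(directory, s3_key):
    """Join the PutFile directory and the S3 key, as the NiFi expression would."""
    return posixpath.join(directory, s3_key)


def main(argv=None):
    """Parse file paths from the command line and return them as a list."""
    parser = argparse.ArgumentParser(description="Process files fetched from S3.")
    parser.add_argument("paths", nargs="+", help="absolute local file paths")
    args = parser.parse_args(argv)
    for path in args.paths:
        print(f"would process: {path}")
    return args.paths


# Example call with an explicit argument list instead of the real command line.
processed = main([local_path("/path/to/directory", "path/to/file.txt")])
```

In a real script you would call main() with no arguments so that it reads the paths NiFi passes on the command line.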