I am looking for suggestions on loading CSV files from an S3 bucket into a Neo4j graph database. The files in the S3 bucket are in csv.gz format, and I need to import them into my Neo4j database, which runs on an EC2 instance.
1. Is there any direct way to load csv.gz files into Neo4j without unzipping them first?
2. Can I point the import directory setting in neo4j.conf (which defaults to neo4j/import) at an S3 bucket path?
Any suggestions on how to load these files from S3 would be appreciated.
Thank you
You can achieve both of these goals with APOC. The docs give you two approaches:
Load from the GZ file directly, assuming the file in the bucket has a public URL
Load the file from S3 directly, with an access token
Here's an example of the first approach. The section after the ! is the filename within the archive to load; this should work with .zip, .gz, .tar files, etc.
CALL apoc.load.csv("https://pablissimo-so-test.s3-us-west-2.amazonaws.com/mycsv.zip!mycsv.csv")
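For the second approach, if you don't want to make the object public, one option is to generate a pre-signed HTTPS URL for it and pass that URL to apoc.load.csv. Below is a minimal sketch in Python, assuming the neo4j driver and boto3 are installed and APOC is enabled on the server; the bucket, key, connection details, and the :Record label are hypothetical placeholders.

import boto3
from neo4j import GraphDatabase

BUCKET = "my-bucket"            # hypothetical bucket name
KEY = "exports/mycsv.csv"       # hypothetical key (plain CSV here; see the archive syntax above for csv.gz)

# Generate a time-limited HTTPS URL for the private object.
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": BUCKET, "Key": KEY}, ExpiresIn=3600
)

# Connection details are placeholders; adjust to your EC2 instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Depending on your APOC version you may need apoc.import.file.enabled=true.
    session.run(
        "CALL apoc.load.csv($url) YIELD map CREATE (n:Record) SET n = map",
        url=url,
    )
driver.close()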
Related
I have built an object storage plugin to store Orthanc data in an S3 bucket in legacy mode. I am now trying to eliminate local storage of Orthanc files and move everything to S3. I also have the OHIF viewer integrated, which serves Orthanc data. How do I make it fetch from the S3 bucket? I have read that a JSON file describing the DICOM data can be used for this, since that JSON file contains the URL of each instance in the S3 bucket, but I don't know how to generate this JSON file, if that is indeed the way to do it.
I am saving some HTML content to Amazon S3 from my Flask API, using the boto3 module with the following code:
s3.Object(BUCKET_NAME, PREFIX + file_name+'.html').put(Body=html_content)
The file is stored in S3, but when I try to view it, it just gets downloaded instead of being displayed in the browser. I would rather view the file than download it. How can I fix this from boto3? Kindly help.
Go to the S3 bucket and browse to the file > Properties > Metadata. There is a key called Content-Type that tells AWS what kind of content it is; it's probably set to a binary type, which is why it only gets downloaded at the moment.
If you change this value to "text/plain", for example, the browser will attempt to display it.
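You can also set the content type at upload time rather than editing the metadata afterwards. A minimal sketch of the put call from the question (reusing its BUCKET_NAME, PREFIX, file_name, and html_content variables), assuming the content should be rendered as HTML:

import boto3

s3 = boto3.resource("s3")

# ContentType tells the browser how to render the object;
# ContentDisposition="inline" asks it to display the file rather than download it.
s3.Object(BUCKET_NAME, PREFIX + file_name + '.html').put(
    Body=html_content,
    ContentType='text/html',
    ContentDisposition='inline',
)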
Can CSV files from the AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere first and then get ingested using MLCP?
I'll assume you have CSV files in the S3 bucket and that each row of a CSV file is to be inserted as a single XML record; that wasn't clear in your question, but it is the most common use case. If your plan is instead to just pull the files in and persist them as CSV files, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in directly. Either way, the MLCP documentation is very helpful in understanding this versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket as a local filesystem on EC2, so that MLCP can reach the files without an explicit copy step. Google is your friend if that's important; I personally haven't seen a production-stable method for doing it, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can be run and can reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
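One common pattern is to stage the CSV files from S3 onto the filesystem of the machine where MLCP runs, then point MLCP's import mode at that directory. Here is a minimal staging sketch with boto3; the bucket, prefix, and target directory are hypothetical placeholders.

import os
import boto3

BUCKET = "my-bucket"          # hypothetical bucket name
PREFIX = "exports/csv/"       # hypothetical key prefix
STAGING_DIR = "/tmp/mlcp-staging"

os.makedirs(STAGING_DIR, exist_ok=True)
s3 = boto3.client("s3")

# Copy every CSV object under the prefix into the local staging directory.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(".csv"):
            local_path = os.path.join(STAGING_DIR, os.path.basename(key))
            s3.download_file(BUCKET, key, local_path)

# MLCP can then be run in import mode against STAGING_DIR, treating the files
# as delimited text (see the MLCP documentation for the exact options).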
I need to access some data during the map stage: a static file from which I need to read some reference data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop to read files from S3. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional access permissions. You can then read it using the URL and HTTP classes supported natively by Java.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
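If your map tasks happen to be Python (for example via Hadoop Streaming), a minimal sketch of reading the static S3 file once per mapper with boto3 could look like the following; the bucket and key are hypothetical, and the EMR instance profile needs read access to the bucket.

import sys
import boto3

# Hypothetical location of the static lookup file.
BUCKET = "my-bucket"
KEY = "reference/lookup.txt"

# Read the static file once when the mapper process starts.
s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")
lookup = set(body.splitlines())

# Standard streaming mapper: read input records from stdin, emit key/value pairs.
for line in sys.stdin:
    word = line.strip()
    if word in lookup:
        print(f"{word}\t1")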
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.
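For reference, here is a sketch of wiring such a bootstrap script into a cluster with boto3; the script location, instance types, release label, and role names are hypothetical placeholders, not values from the original setup.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# The bootstrap script (containing the hadoop fs -copyToLocal command above)
# is itself stored in S3 and runs on every node before the job steps start.
response = emr.run_job_flow(
    Name="my-emr-job",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    BootstrapActions=[
        {
            "Name": "copy-static-file",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/copy_file.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)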
I am trying to load a file into one of my S3 buckets.
The file I am trying to load is a huge tarball on the web, and I don't want to download it to my disk first and then upload it to the S3 bucket.
Is there any way I can directly specify this URL so that it gets added to S3?
You have to "put" objects to S3; it will not "get" them from a remote URL for you.
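That said, you can stream the tarball through your machine without ever writing it to disk. A minimal sketch using requests and boto3, where the URL, bucket, and key are hypothetical placeholders:

import boto3
import requests

URL = "https://example.com/huge-archive.tar.gz"   # hypothetical source URL
BUCKET = "my-bucket"
KEY = "archives/huge-archive.tar.gz"

s3 = boto3.client("s3")

# stream=True exposes the response body as a file-like object instead of buffering it;
# upload_fileobj reads from it in chunks (multipart upload), so nothing lands on disk.
with requests.get(URL, stream=True) as resp:
    resp.raise_for_status()
    s3.upload_fileobj(resp.raw, BUCKET, KEY)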