Mounting S3 to DBFS (Azure Databricks): integer is added to the mount path - amazon-s3

I had mounted an S3 bucket to DBFS. After unmounting, I tried to list the files in the directory,
e.g.: %fs ls /mnt/TmpS3SampleDB/
Output: java.io.FileNotFoundException: File/743456612344/mnt/TmpS3SampleDB/ does not exist.
In the output above, I don't understand where the integer 743456612344 is coming from.
Can anyone please explain? I am using Azure Databricks.

Note: Azure Databricks interacts with object storage using directory and file semantics instead of storage URLs.
"743456612344" is the directory ID associated with your Databricks workspace.
When you try listing files in WASB using dbutils.fs.ls or the Hadoop API, you get the following exception:
java.io.FileNotFoundException: File/ does not exist.
For more details, refer to "Databricks File System".
Hope this helps. Do let us know if you have any further queries.

It's very likely generated by the local file API. Try
%fs ls /dbfs/mnt/TmpS3SampleDB/
instead.
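For context, a file mounted at dbfs:/mnt/... is also visible to local file APIs on the Databricks driver under the /dbfs/ prefix. A toy helper (the function name is my own) showing how the two path forms map:

```python
def to_local_path(dbfs_path):
    """Translate a DBFS path like 'dbfs:/mnt/x' or '/mnt/x' into the
    '/dbfs/...' form used by local file APIs on a Databricks driver."""
    if dbfs_path.startswith("dbfs:/"):
        dbfs_path = dbfs_path[len("dbfs:"):]  # 'dbfs:/mnt/x' -> '/mnt/x'
    return "/dbfs" + dbfs_path                # '/mnt/x' -> '/dbfs/mnt/x'
```

So the mount from the question would be reachable at /dbfs/mnt/TmpS3SampleDB/ from Python or shell code running on the driver.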

Related

Kafka Connect SpooldirCsv connector

I am trying to use Confluent's SpoolDirCSVSource connector to read files from a directory and send them to MSK. It works fine locally, but now I need to source the files from an S3 bucket. Is there no way I can use this connector to do that? Or is there some other connector that does? The input.path parameter works only with local directories, I think. Any pointer in the right direction to the correct connector, or to modifying the SpoolDirCSV connector, would be appreciated.
I know this question is old, and something similar has been asked before, but I am curious whether this functionality is still absent (I could be wrong).
How to use Kafka Connect to source .csv files from S3 bucket?
This is the exact error when the connector is deployed to the cloud:
There is an issue with the connector
Code: InvalidInput.InvalidConnectorConfiguration
Message: The connector configuration is invalid. Message: Connector configuration is invalid and contains the following 2 error(s): Invalid value File 's3:/mytestbucketak/input' is not an absolute path. for configuration input.path Invalid value File 's3:/mytestbucketak/error' is not an absolute path. for configuration error.path
That connector can only read from the local filesystem, not S3.
Confluent has a dedicated S3 Source Connector; alternatively, as linked, the FilePulse connector also exists.

loading csv file from S3 in neo4j graphdb

I am seeking suggestions about loading CSV files from an S3 bucket into a Neo4j graph DB. In the S3 bucket the files are in csv.gz format, and I need to import them into my Neo4j database, which runs on an EC2 instance.
1. Is there any direct way to load csv.gz into Neo4j without unzipping it?
2. Can I set/add the S3 bucket path in neo4j.conf at neo4j.dbms.directory, which by default is neo4j/import?
Kindly suggest some ideas for loading files from S3.
Thank you.
You can achieve both of these goals with APOC. The docs give you two approaches:
Load from the GZ file directly, assuming the file in the bucket has a public URL
Load the file from S3 directly, with an access token
Here's an example of the first approach. The section after the ! is the filename within the archive to load; this should work with .zip, .gz, .tar files, etc.
CALL apoc.load.csv("https://pablissimo-so-test.s3-us-west-2.amazonaws.com/mycsv.zip!mycsv.csv")
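Outside of Neo4j/APOC, the same idea — streaming a compressed CSV without first unpacking it to disk — can be sketched in Python with only the standard library (the data here is made up):

```python
import csv
import gzip
import io

# Build a small csv.gz in memory to stand in for the S3 object.
raw = "name,age\nalice,30\nbob,25\n".encode("utf-8")
payload = gzip.compress(raw)

# Decompress and parse in a streaming fashion -- no temp file needed.
with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))

print(rows)
```

The same pattern works with a response body fetched from an S3 URL instead of the in-memory payload.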

Redshift COPY command failing to Load Data from S3

We are facing an error while trying to load a huge zip file from an S3 bucket into Redshift, from an EC2 instance and even from Aginity. What is the real issue here?
As far as we have checked, this could be because of the VPC NACL rules, but we are not sure.
Error :
ERROR: Connection timed out after 50000 milliseconds
I also got this error. If Enhanced VPC Routing is enabled, check the routing from your Redshift cluster to S3.
There are several ways to let the Redshift cluster reach S3; see the link below:
https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html
I solved this error by setting up a NAT for the private subnet used by my Redshift cluster.
I think you are correct; it might be because of bucket access rules or secret/access keys.
Here are some pointers for debugging further if the above doesn't work.
Create a small zip file and try again, to check whether the size is the problem (though I don't think that is a likely cause).
Split your zip file into multiple zip files and create a manifest file for loading, rather than using a single file.
I hope you will find this useful.
You should create an IAM role that authorizes Amazon Redshift to access other AWS services, such as S3, on your behalf; you must associate that role with an Amazon Redshift cluster before you can use it to load or unload data.
Check below link for setting up IAM role:
https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html
I got this error when the Redshift cluster had Enhanced VPC Routing enabled, but no route in the route table for S3. Adding the S3 endpoint fixed the issue. Link to docs.
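For reference, a COPY of a gzipped CSV from S3 using an attached IAM role looks roughly like this (the table, bucket, and role ARN are placeholders; also note COPY understands GZIP/BZIP2/ZSTD compression, not .zip archives, so a zip file would need repackaging first):

```sql
COPY my_table
FROM 's3://my-bucket/path/data.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
GZIP
CSV;
```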

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work ?
Thanks
The s3n:// URL is for Hadoop to read S3 files. If you want to read an S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then access it using the HTTP URL packages supported natively by Java.
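As an illustration of the HTTP approach in Python (the URL in the comment is a placeholder), fetching a public object needs nothing beyond the standard library:

```python
from urllib.request import urlopen

def fetch_text(url):
    """Download a (public) object and decode it as UTF-8 text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

# e.g. fetch_text("https://<bucket-name>.s3.amazonaws.com/path/to/object")
```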
Another good option is to use s3dist copy as a bootstrap step to copy the S3 file to HDFS before your Map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.

Access files in s3n://elasticmapreduce/samples/wordcount/input

How can I access the files sitting in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and are made available by Amazon as input to the sample word-count Hadoop program. The best way to fetch them is to:
Start a new Amazon Elastic MapReduce job flow (it doesn't matter which one) from the AWS console, and make sure that you keep the job alive with the Keep Alive option
Once the EC2 machines have started, find the instances in EC2 from the AWS console
SSH into one of the running EC2 instances as the hadoop user, for example
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
SFTP the files to your local system
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
The owner of the folder (most likely of the files in it) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... translates to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
where x is the name of the bucket and y... is the path within the bucket.
If you want to make sure a file exists, e.g. if you suspect its name was misspelled, you can open
http://s3.amazonaws.com/x
in your browser, and you'll see XML describing the "files", that is, the S3 objects available.
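The translation described above is purely mechanical; a small Python sketch (the function name is my own):

```python
def s3n_to_http(s3n_url):
    """Map 's3n://bucket/key' to its two equivalent public HTTP URLs."""
    assert s3n_url.startswith("s3n://")
    bucket, _, key = s3n_url[len("s3n://"):].partition("/")
    return (
        "http://s3.amazonaws.com/%s/%s" % (bucket, key),
        "http://%s.s3.amazonaws.com/%s" % (bucket, key),
    )
```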
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried it, and it seems that the path you want is not public.
The AWS EMR documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so the input might be available to EMR, but not over HTTP.
In Amazon S3 there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
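That listing URL returns a ListBucketResult XML document; extracting the object keys from it is straightforward. A sketch against a canned, trimmed-down response (the real call just needs an HTTP GET against the URL above):

```python
import xml.etree.ElementTree as ET

# Namespace S3 uses in its listing responses.
NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

def list_keys(xml_text):
    """Pull the object keys out of an S3 ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(NS + "Key")]

# A trimmed-down example of what the listing URL returns.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>elasticmapreduce</Name>
  <Contents><Key>samples/wordcount/input/0001</Key></Contents>
  <Contents><Key>samples/wordcount/input/0002</Key></Contents>
</ListBucketResult>"""

print(list_keys(sample))
```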