I need to copy a directory tree (/tmp/xxx_files/xxx/Output), including its sub-folders and files, from HDFS (Hadoop Distributed File System). I'm using the HDFS connector, but it does not seem to support this.
I always get an error like:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /tmp/xxx_files/xxx/Output/
I don't see any option in the HDFS connector for copying the files and directories inside the specified path. It always expects the names of individual files to copy.
Is it possible to copy a directory tree containing sub-folders and files using the HDFS connector from MuleSoft?
As the technical documentation of the HDFS connector on the official MuleSoft website states, the code is hosted at the GitHub site of the connector:
The Anypoint Connector for the Hadoop Distributed File System (HDFS)
is used as a bi-directional gateway between applications. Its source
is stored at the HDFS Connector GitHub site.
What it does not state is that more detailed technical documentation is also available on the GitHub site.
There you can also find several examples of how to use the connector for basic file-system operations.
The links seem to be broken in the official MuleSoft documentation.
You can find the repository here:
https://github.com/mulesoft/mule-hadoop-connector
The operations are implemented in the HdfsOperations Java class. (See also the FileSystemApiService class.)
As you can see, the functionality you expect is not implemented, so it is not supported out of the box.
You can't copy a directory tree containing sub-folders and files from HDFS with the HDFS connector without additional effort of your own.
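The "additional effort" usually amounts to walking the directory yourself and copying one file at a time, since the error above appears when a directory path is handed to an operation that expects a file. Below is a minimal sketch of that recursion in Python. The list_dir and copy_file callables are hypothetical stand-ins for whatever listing and single-file copy operations you have available; they are not part of the HDFS connector's API.

```python
import posixpath
from typing import Callable, Iterable, List, Tuple

# Hypothetical callables, supplied by the caller (not connector APIs):
#   list_dir(path)      -> iterable of (entry_name, is_directory)
#   copy_file(src, dst) -> copies exactly one file
ListDir = Callable[[str], Iterable[Tuple[str, bool]]]
CopyFile = Callable[[str, str], None]


def copy_tree(src: str, dst: str, list_dir: ListDir, copy_file: CopyFile) -> List[str]:
    """Copy every file under src to dst, descending into sub-folders.

    Returns the source paths of the files that were copied.
    """
    copied: List[str] = []
    for name, is_directory in list_dir(src):
        child_src = posixpath.join(src, name)
        child_dst = posixpath.join(dst, name)
        if is_directory:
            # Never pass a directory to the file copy; recurse instead.
            copied.extend(copy_tree(child_src, child_dst, list_dir, copy_file))
        else:
            copy_file(child_src, child_dst)
            copied.append(child_src)
    return copied
```

Only file paths ever reach copy_file, which is what avoids the "Path is not a file" condition.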
I have a Spring Cloud Config Server reading properties from multiple sources (Git and Vault). For a given path, even if it finds the resource in Git, it still queries Vault and reports a failure because the resource is not available in both sources. My requirement is to look for a resource and, if it's found, not query the other source. Please suggest whether this is possible. Thanks
Can CSV files from an AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere first and then be ingested using MLCP?
Assuming you have CSV files in the S3 Bucket and that one row in the CSV file is to be inserted as a single XML record...that wasn't clear in your question, but is the most common use case. If your plan is to just pull the files in and persist them as CSV files, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in off that. Anyway, the MLCP documents are very helpful in understanding this very versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket as a local filesystem on EC2, which would make the files accessible to MLCP without copying them first. Google's your friend if that's important. I personally haven't seen a production-stable method for that, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can be run and can reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
Is there a way to configure a File connector for use in cloudhub, specifically related to reading in a file over FTPS and putting it into a file before beginning the actual processing of the contents?
Clarification:
I'm in cloudhub, which does not provide a filesystem in the same sense that a local/on-prem Mule setup has. One standard practice when dealing with streams (FTPS or similar) in order to avoid processing over the open stream is to take the incoming stream and use the File connector (outbound in this case) to put the inbound stream into a file, and then use that file for your flow process. How is this managed in CloudHub?
The File connector reads files from paths on the server itself. It cannot be used to read from remote servers.
In case you want a file to start your flow with, try the following.
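<flow name="ftp_reader_flow">
    <!-- Read from the remote directory -->
    <ftp:inbound-endpoint host="..." port="..." path="..." user="..." password="..."/>
    <!-- Write to a local directory -->
    <file:outbound-endpoint path="..."/>
</flow>
<flow name="actual_processing_flow">
    <!-- Read from the local directory -->
    <file:inbound-endpoint path="..."/>
    <!-- Continue with the processing -->
</flow>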
Hope this helps.
You can use the connector for temporary data with the tmp directory.
From the MuleSoft Documentation:
Disk Persistence
CloudHub does not guarantee that writing to disk survives hardware
failures. Instead, you must use an external storage mechanism to store
information. For small amounts of data, you can use the Object Store.
For applications that have large data storage requirements, we
recommend use of a cloud service such as Amazon S3. For temporary
storage, the File connector is still available and can be used with
the /tmp directory.
You can use the File connector in CloudHub as well, but make sure you are reading or writing the file only from the project classpath: src/main/resources or any other folder on the project classpath.
I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is what Hadoop itself uses to read S3 files. If you want to read the S3 file inside your map program, you either need a library that handles the s3:// URL format, such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or you need to access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL classes that Java supports natively.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.
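For illustration, the script from step 1) could also be written in Python rather than as a shell one-liner. This is only a sketch: the function names and the example bucket, key, and destination are made up, and it assumes the hadoop command is on the PATH of the node where the bootstrap step runs.

```python
import subprocess
from typing import List


def build_copy_command(bucket: str, key: str, destination_dir: str) -> List[str]:
    """Build the same 'hadoop fs -copyToLocal' call used in step 1)."""
    return ["hadoop", "fs", "-copyToLocal", f"s3n://{bucket}/{key}", destination_dir]


def copy_from_s3(bucket: str, key: str, destination_dir: str) -> None:
    """Run the copy; raises CalledProcessError if hadoop exits non-zero."""
    subprocess.run(build_copy_command(bucket, key, destination_dir), check=True)


if __name__ == "__main__":
    # Hypothetical values; replace them with your own bucket, key and directory.
    print(" ".join(build_copy_command("my-bucket", "path/file.txt", "/home/hadoop")))
```

Once the file is on the local disk of each node, the mapper can open it like any other local file.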
How can I access the files sitting in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and Amazon makes them available as input to the sample word count Hadoop program. The best way to fetch them is to:
1) Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option.
2) Once the EC2 machines have started, find the instances on EC2 in the Amazon Web Services console.
3) ssh into one of the running EC2 instances as the hadoop user, for example:
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
4) Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
5) sftp the files to your local system.
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
The owner of the folder (or, more likely, of a file in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
x is the name of the bucket.
y... is the path within the bucket.
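The translation described above can be sketched in a few lines of Python. The function name s3n_to_http is made up for this example; it only rewrites the string and does not check whether the object is actually public.

```python
from typing import Tuple
from urllib.parse import urlsplit


def s3n_to_http(url: str) -> Tuple[str, str]:
    """Return the two HTTP forms of an s3n://x/y... URL.

    The first form puts the bucket in the path, the second in the host name.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3n" or not parts.netloc:
        raise ValueError(f"not an s3n:// URL: {url!r}")
    bucket = parts.netloc          # x, the name of the bucket
    key = parts.path.lstrip("/")   # y..., the path within the bucket
    return (
        f"http://s3.amazonaws.com/{bucket}/{key}",
        f"http://{bucket}.s3.amazonaws.com/{key}",
    )


if __name__ == "__main__":
    for http_url in s3n_to_http("s3n://elasticmapreduce/samples/wordcount/input/0001"):
        print(http_url)
```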
If you want to make sure the file exists, e.g. if you suspect the name was misspelled, you can open the following in your browser:
http://s3.amazonaws.com/x
and you'll see XML describing the available "files", that is, the S3 objects.
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
The AWS EMR documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so the input might be available to EMR but not over HTTP.
In Amazon S3 there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
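The same two steps (list by prefix, then fetch by full name) can be sketched in Python. The function names are made up for this example, and the XML namespace is an assumption about the listing format the bucket returns; adjust it if your response differs. The sketch only builds the URL and parses a listing that you have already downloaded, so it makes no network calls itself.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Assumed namespace of the bucket listing XML.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket: str, prefix: str) -> str:
    """Build the URL that lists the objects whose names start with prefix."""
    query = urlencode({"prefix": prefix}, safe="/")  # keep the slashes readable
    return f"http://s3.amazonaws.com/{bucket}?{query}"


def object_keys(listing_xml: str) -> List[str]:
    """Extract the full object names (the Key elements) from a listing."""
    root = ET.fromstring(listing_xml)
    return [element.text for element in root.iter(f"{S3_NS}Key") if element.text]


if __name__ == "__main__":
    print(listing_url("elasticmapreduce", "samples/wordcount/input/"))
```

Each name returned by object_keys can then be appended to s3.amazonaws.com/<bucket>/ to download that object, as in the last URL above.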