Kafka Connect SpooldirCsv connector - amazon-s3

I am trying to use Confluent's SpoolDirCSVSource connector to read files from a directory and send them to MSK. It works fine locally, but now I need to read the files from an S3 bucket. Is there a way to use this connector for that, or is there another connector that does it? As far as I can tell, the input.path parameter only works with local directories. Any pointer to the correct connector, or to a way of modifying the SpoolDirCSV connector, would be appreciated.
I know the question below is old and asks something similar, but I am curious whether this functionality is still absent (I may be wrong).
How to use Kafka Connect to source .csv files from S3 bucket?
This is the exact error when the connector is deployed to the cloud:
There is an issue with the connector
Code: InvalidInput.InvalidConnectorConfiguration
Message: The connector configuration is invalid. Message: Connector configuration is invalid and contains the following 2 error(s):
Invalid value File 's3:/mytestbucketak/input' is not an absolute path. for configuration input.path
Invalid value File 's3:/mytestbucketak/error' is not an absolute path. for configuration error.path

That connector can only read from the local filesystem, not from S3.
Confluent has a dedicated S3 Source Connector, and, as linked, the FilePulse connector also exists.
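To see why the error message complains about an "absolute path", here is a rough illustration. It is not the connector's actual validation code, only an analogous check under POSIX path rules, and /data/input is a made-up local directory. A value such as s3:/mytestbucketak/input does not start with "/", so it cannot pass as an absolute local path:

```python
import posixpath


def is_absolute_local_path(path: str) -> bool:
    """Return True if `path` is an absolute POSIX filesystem path."""
    return posixpath.isabs(path)


# A local directory (hypothetical) passes the check ...
print(is_absolute_local_path("/data/input"))               # True
# ... while the S3-style value from the error message does not.
print(is_absolute_local_path("s3:/mytestbucketak/input"))  # False
```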

Related

Camel Kafka S3 Source Connector with multiple connectors for same bucket

I am trying to define Camel S3 Source connectors. I have searched quite a bit without finding answers to the questions below.
How can I set up my connector so that the file in the S3 bucket is not deleted after it is processed, but is instead moved to another folder that I specify?
Is it possible to define separate connectors with different value converters for the same bucket, maybe by folders under the bucket? The connectors will use different Kafka topics based on the file type. How can I specify the bucket together with a folder when defining the connector properties?
Thank you

Redshift COPY command failing to Load Data from S3

We get an error when trying to load a huge zip file from an S3 bucket into Redshift, both from an EC2 instance and from Aginity. What is the real issue here?
As far as we have checked, this could be caused by the VPC NACL rules, but we are not sure.
Error:
ERROR: Connection timed out after 50000 milliseconds
I also got this error, and Enhanced VPC Routing was enabled. Check the routing from your Redshift cluster to S3.
There are several ways to let the Redshift cluster reach S3; see the link below:
https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html
I solved this error by setting up a NAT for the private subnet used by my Redshift cluster.
I think you are correct; it might be because of bucket access rules or secret/access keys.
Here are some pointers for debugging further if the above doesn't help.
Create a small zip file and try again, in case the problem is related to size (though I don't think that is likely).
Split your zip file into multiple zip files and create a manifest file for loading, rather than loading a single file.
I hope you find this useful.
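For the manifest suggestion, a minimal sketch of generating the manifest might look like the following. The "entries" / "url" / "mandatory" layout is the JSON format that COPY manifests use; the bucket name and part file names are placeholders I made up, and the function name is just for illustration.

```python
import json
from typing import Iterable


def build_manifest(bucket: str, keys: Iterable[str]) -> str:
    """Build the JSON text of a COPY manifest listing one entry per split file."""
    entries = [
        # "mandatory": True makes the load fail if a listed file is missing.
        {"url": f"s3://{bucket}/{key}", "mandatory": True}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical bucket and part names, for illustration only.
print(build_manifest("my-example-bucket", ["load/part-0000.gz", "load/part-0001.gz"]))
```

The resulting text would be uploaded to S3, and the COPY command would point at the manifest object instead of at a single data file.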
You should create an IAM role that authorizes Amazon Redshift to access other AWS services such as S3 on your behalf. You must associate that role with an Amazon Redshift cluster before you can use it to load or unload data.
Check the link below for setting up the IAM role:
https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html
I got this error when the Redshift cluster had Enhanced VPC Routing enabled, but no route in the route table for S3. Adding the S3 endpoint fixed the issue. Link to docs.
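If you want a quick way to check whether a route to S3 exists at all, a small TCP reachability probe can help. This is only a sketch: it has to be run from a host in the same subnet as the cluster (for example the EC2 instance), so it only approximates the cluster's own network path, and the region in the commented example is an assumption.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Example call (region is a placeholder; use the one your bucket lives in):
# can_reach("s3.us-east-1.amazonaws.com")
```

A False result from inside the private subnet, combined with the timeout in COPY, would point at missing routing (NAT or S3 endpoint) rather than at credentials.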

Mule - Copy the directory from HDFS

I need to copy a directory (/tmp/xxx_files/xxx/Output), including its subfolders and files, from HDFS (Hadoop Distributed File System). I'm using the HDFS connector, but it does not seem to support this.
I always get an error like:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /tmp/xxx_files/xxx/Output/
I don't see any option in the HDFS connector for copying the files/directories inside the specified path. It always expects the names of the files to be copied.
Is it possible to copy a directory, including its subfolders and files, using the HDFS connector from MuleSoft?
As the technical documentation of the HDFS connector on the official MuleSoft website states, the code is hosted at the GitHub site of the connector:
The Anypoint Connector for the Hadoop Distributed File System (HDFS)
is used as a bi-directional gateway between applications. Its source
is stored at the HDFS Connector GitHub site.
What it does not state is that more detailed technical documentation is also available on the GitHub site.
There you can also find various examples of how to use the connector for basic file-system operations.
The links seem to be broken in the official MuleSoft documentation.
You can find the repository here:
https://github.com/mulesoft/mule-hadoop-connector
The operations are implemented in the HdfsOperations java class. (See also the FileSystemApiService class)
As you can see, the functionality you expect is not implemented. It is not supported out-of-the-box.
You can't copy a directory, including its subfolders and files, from HDFS using the HDFS connector without further effort.
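The "further effort" usually means walking the directory tree yourself and copying one file at a time, since only per-file reads are accepted. The sketch below shows that idea in Python against the local filesystem; it is a stand-in for the pattern, not Mule or HDFS connector code, and the function name is made up.

```python
import os
import shutil


def copy_tree_file_by_file(src_root: str, dst_root: str) -> list:
    """Copy every file under src_root into dst_root, one file at a time."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the sub-folder structure below the destination root.
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            shutil.copyfile(src, dst)  # stand-in for a per-file read from HDFS
            copied.append(dst)
    return copied
```

In a Mule flow the equivalent would be listing the directory contents first and then invoking the file read once per entry, recursing into sub-directories.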

Can I use the File connector in Mule Cloudhub for FTPS

Is there a way to configure a File connector for use in CloudHub, specifically for reading in a file over FTPS and writing it to a file before beginning the actual processing of the contents?
Clarification:
I'm in CloudHub, which does not provide a filesystem in the same sense that a local/on-prem Mule setup does. One standard practice when dealing with streams (FTPS or similar), in order to avoid processing over the open stream, is to take the incoming stream and use the File connector (outbound in this case) to write it to a file, and then use that file for the flow's processing. How is this managed in CloudHub?
The File connector reads files from paths on the server itself. It cannot be used to read from remote servers.
In case you want a file to start your flow with, try the following.
<flow name="ftp_reader_flow">
    <!-- Read from the remote directory -->
    <ftp:inbound-endpoint ... />
    ...
    <!-- Write to a local directory -->
    <file:outbound-endpoint ... />
</flow>
<flow name="actual_processing_flow">
    <!-- Read from the local directory -->
    <file:inbound-endpoint ... />
    <!-- Continue with the processing -->
    ...
</flow>
Hope this helps.
You can use the connector for temporary data with the tmp directory.
From the MuleSoft Documentation:
Disk Persistence
CloudHub does not guarantee that writing to disk survives hardware
failures. Instead, you must use an external storage mechanism to store
information. For small amounts of data, you can use the Object Store.
For applications that have large data storage requirements, we
recommend use of a cloud service such as Amazon S3. For temporary
storage, the File connector is still available and can be used with
the /tmp directory.
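The "stage to a temporary file, then process the file" pattern described above can be sketched like this. It is plain Python rather than Mule configuration, the in-memory stream stands in for the FTPS download, and the function names are made up for the example.

```python
import io
import os
import shutil
import tempfile


def stage_stream_to_tmp(stream, tmp_dir=None) -> str:
    """Write `stream` to a temporary file and return the file's path.

    With tmp_dir=None the system temp directory is used (typically /tmp on Linux).
    """
    with tempfile.NamedTemporaryFile(dir=tmp_dir, suffix=".staged", delete=False) as out:
        shutil.copyfileobj(stream, out)
        return out.name


def process_file(path: str) -> int:
    """Placeholder processing step: count the lines in the staged file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


incoming = io.BytesIO(b"a,b\n1,2\n3,4\n")  # stands in for the FTPS stream
staged = stage_stream_to_tmp(incoming)
try:
    print(process_file(staged))  # 3
finally:
    os.remove(staged)  # temporary storage only, so clean up afterwards
```

Because the staged file is only temporary, anything that must survive a restart still belongs in external storage, as the quoted documentation says.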
You can use the File connector in CloudHub as well, but make sure you are reading or writing the file from the classpath (src/main/resources) or from another folder on the project classpath only.

Access files in s3n://elasticmapreduce/samples/wordcount/input

How can I access the files in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and Amazon makes them available as input for the sample word count Hadoop program. The best way to fetch them is to:
1. Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option.
2. Once the EC2 machines have started, find the instances on EC2 in the Amazon Web Services console.
3. ssh into one of the running EC2 instances as the hadoop user, for example:
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
4. Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
5. sftp the files to your local system.
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
The owner of the folder (most likely of a file in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
x is the name of the bucket.
y... is the path within the bucket.
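The translation described above can be written down as a small helper. This is just a sketch of the mapping (the function name is made up), and it says nothing about whether the resulting URL is actually readable anonymously:

```python
from urllib.parse import urlparse


def s3n_to_http_urls(s3n_url: str) -> tuple:
    """Translate s3n://x/y... into the two HTTP forms http://s3.amazonaws.com/x/y...
    and http://x.s3.amazonaws.com/y..."""
    parsed = urlparse(s3n_url)
    if parsed.scheme not in ("s3", "s3n"):
        raise ValueError(f"not an s3/s3n URL: {s3n_url!r}")
    bucket = parsed.netloc          # x: the name of the bucket
    key = parsed.path.lstrip("/")   # y...: the path within the bucket
    path_style = f"http://s3.amazonaws.com/{bucket}/{key}"
    host_style = f"http://{bucket}.s3.amazonaws.com/{key}"
    return path_style, host_style


for url in s3n_to_http_urls("s3n://elasticmapreduce/samples/wordcount/input"):
    print(url)
```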
If you want to make sure the file exists, e.g. if you suspect the name was misspelled, you can open in your browser
http://s3.amazonaws.com/x
and you'll see XML describing the available "files", that is, S3 objects.
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
AWS EBS documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so input might be available to EMR, but not to HTTP access.
In Amazon S3 there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
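If you would rather script this than click through a browser, a sketch along these lines builds the listing URL and pulls the object names out of the XML that comes back. The sample XML is a hand-written fragment in the shape of a bucket listing, not a real response, and the function names are made up; actually fetching the URL (e.g. with urllib.request) is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by S3 bucket listings.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket: str, prefix: str) -> str:
    """Build the URL that lists the objects in `bucket` whose names start with `prefix`."""
    return f"http://s3.amazonaws.com/{bucket}?{urlencode({'prefix': prefix}, safe='/')}"


def object_keys(listing_xml) -> list:
    """Extract the object names (keys) from a bucket listing document."""
    root = ET.fromstring(listing_xml)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


# Hand-written sample in the shape of a listing, for illustration only.
SAMPLE = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>elasticmapreduce</Name>"
    "<Contents><Key>samples/wordcount/input/0001</Key></Contents>"
    "<Contents><Key>samples/wordcount/input/0002</Key></Contents>"
    "</ListBucketResult>"
)

print(list_url("elasticmapreduce", "samples/wordcount/input/"))
for key in object_keys(SAMPLE):
    print(f"s3.amazonaws.com/elasticmapreduce/{key}")
```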