This is one of my first attempts at MapReduce on AWS using its Management Console.
I have uploaded to AWS S3 my runnable jar, developed on Hadoop 0.18, which works on my local machine.
As described in the documentation, I passed the S3 paths for input and output as arguments to the jar: so far so good. The problem is the third argument, which is another path (as a string) to a file that I need to load while the job is running. That file also resides in an S3 bucket, but my jar doesn't seem to recognize the path and I get a FileNotFoundException when it tries to load it. That is strange, because the path looks exactly like the other two...
Anyone have any idea?
Thank you
Luca
This is a problem with AWS; please check Lesson 2 at http://meghsoft.com/blog/. See if you can use FileSystem.get(uri, conf) to obtain a file system supporting your path.
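For illustration, here is a minimal sketch of what I mean, assuming an s3n:// path (the bucket and file name below are placeholders, not your actual paths):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path; use whatever scheme your bucket is reachable under.
        Path path = new Path("s3n://my-bucket/lookup/data.txt");
        // Ask for the file system matching the URI's scheme instead of the default one.
        FileSystem fs = FileSystem.get(URI.create(path.toString()), conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}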
Hope this helps.
Sonal
Sonal,
thank you for your suggestion.
In the end the solution was to use the DistributedCache.
By loading the file before running the job, I can access everything I need inside the Map class by overriding the configure method and reading the file from the distributed cache (which has already been populated with it).
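Roughly, the approach looks like this with the old mapred API (Hadoop 0.18/0.20 era); the paths and class names below are just placeholders, not exactly my code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Driver side, before submitting the job:
//   DistributedCache.addCacheFile(new java.net.URI("s3n://my-bucket/lookup.txt"), jobConf);

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String lookupData;

    @Override
    public void configure(JobConf job) {
        try {
            // Files added to the DistributedCache are copied to the local disk of each node.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            if (cached != null && cached.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                StringBuilder content = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
                reader.close();
                lookupData = content.toString();
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to load cached file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ... use lookupData here ...
        output.collect(new Text("line"), value);
    }
}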
Thank you,
Luca
I'm quite confused: I don't know how to perform MongoDB CDC with the WSO2 Streaming Integrator. I set up a MongoDB replica set following this doc. I configured the CDC source as below,
but it doesn't work, and I get these error logs. Can anyone help me fix this? Thanks in advance.
It seems like an issue with the extension installer script of the WSO2 SI. The mongo_java_driver is actually a bundled jar and due to that it should not be converted again into a bundle.
To fix your problem, follow the steps below.
Step 1 - Uninstall the installed MongoDB jar.
Step 2 - Go to the WSO2SI_HOME/wso2/server/resources/extensionsInstaller folder and open the extensionDependencies.json file.
Step 3 - Search for "name": "mongo-java-driver" and, under its configurations, change the usage type from "JAR" to "BUNDLE".
Step 4 - Reinstall the MongoDB extension via the extension installer.
This will solve your problem.
Have you copied the mongo-java-driver to the <PRODUCT_HOME>/lib directory? It seems like the CDC extension couldn't locate the MongoDB drivers.
I have a problem with checkpointing via s3p in Flink on EMR.
When creating the EMR cluster, I ticked Presto and added the jar file as instructed at https://ci.apache.org/projects/flink/flink-docs-stable/ops/plugins.html.
But when checkpointing via s3p in Flink, it still reports:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3p'. The scheme is directly supported by Flink through the following plugin: flink-s3-fs-presto. Please ensure that each plugin resides within its own subfolder within the plugins directory. See https://ci.apache.org/projects/flink/flink-docs-stable/ops/plugins.html for more information. If you want to use a Hadoop file system for that scheme, please add the scheme to the configuration fs.allowed-fallback-filesystems. For a full list of supported file systems, please see https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/.
Can you help me get s3p checkpointing working on Flink on EMR?
Thanks.
Presto in EMR has nothing to do with the flink-s3-fs-presto plugin in Flink. You can leave it unticked in the future (leaving it ticked doesn't hurt either, beyond the extra bloat).
The most likely reason is that you forgot to create a subfolder in the plugins folder. Could you give me an ls of your Flink distribution?
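For reference, each plugin jar must sit in its own subfolder under plugins/; the layout should look roughly like this (the folder name and jar version are just illustrative):

flink-dist/
  plugins/
    s3-fs-presto/
      flink-s3-fs-presto-<flink-version>.jar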
Background
I was planning to use S3 to store the Flink's checkpoints using the FsStateBackend. But somehow I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0.
I have found the solution to the above issue, so here are the required steps.
Steps
We need to add some configuration to the flink-conf.yaml file, which I have listed below.
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink distribution.
E.g.:
1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar // Recommended for StreamingFileSink
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar // Recommended for checkpointing
Add this to the checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
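For context, here is a minimal sketch of where that line fits in a Flink 1.10 Java job; the bucket name, checkpoint interval, and placeholder pipeline are assumptions for illustration only:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToS3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60000);
        // Checkpoints go to the S3 path configured in flink-conf.yaml above.
        env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"));
        // Tiny placeholder pipeline so the job has something to run.
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-to-s3-example");
    }
}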
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both plugins (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then use s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop instead of s3://.
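For example, if both plugins are installed and you want checkpoints to go through flink-s3-fs-presto, the checkpoint directory from the first step would become:
state.checkpoints.dir: s3p://s3-bucket/checkpoints/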
I am trying to create an S3 sink for my streaming output. I figured a BucketingSink would be fine, since it is used for HDFS. But it seems that an S3 URL is not recognized as HDFS. I get the following error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: java.lang.RuntimeException: Error while creating FileSystem when initializing the state of the BucketingSink.
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Cannot support file system for 's3' via Hadoop, because Hadoop is not in the classpath, or some classes are missing from the classpath.
Is there a way to make S3 work with BucketingSink, or is there another option besides BucketingSink that I can use? I am currently running 1.5.2. I'd be happy to provide any additional information.
Thank you!
Edit:
My sink creation/usage looks like the following:
val s3Sink = new BucketingSink[String]("s3://s3bucket/sessions")
s3Sink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HHmm"))
s3Sink.setWriter(new StringWriter[String]())
s3Sink.setBatchSize(200)
s3Sink.setPendingPrefix("sessions-")
s3Sink.setPendingSuffix(".csv")
// Create stream and do stuff here
stream.addSink(s3Sink)
You probably have to include the hadoop-aws jar in your Flink job. This link should help: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/deployment/aws.html#provide-s3-filesystem-dependency
I am running a fresh install of Pentaho Data Integration 5.0.1.A Stable from:
http://community.pentaho.com/projects/data-integration/
on my MacBook Pro, Java 1.7.0_25, and I keep seeing this error in the console:
Attempting to load ESAPI.properties via file I/O.
Attempting to load ESAPI.properties as resource file via file I/O.
Not found in 'org.owasp.esapi.resources' directory or file not readable:
/Applications/pdi-ce-5.0.1.A/data-integration/ESAPI.properties
Not found in SystemResource Directory/resourceDirectory: .esapi/ESAPI.properties
What are the ESAPI.properties used for? What should they be set to by default?
thanks, -John
This is a known bug (PDI-10568) that should be fixed in an upcoming release. As a workaround, try putting the default ESAPI and validation properties in your $HOME/.esapi/ folder; create it if it doesn't already exist.
Background: ESAPI is an enterprise-level security library used by Pentaho web services to properly encode URLs and HTML content; read more at https://www.owasp.org/index.php/ESAPI