Apache Hudi throwing Dataset not found exception when storing to S3 - apache-spark-sql

I am trying to write a simple DataFrame to S3 as a Hudi dataset and I am having trouble doing that. I am new to Apache Hudi and I am running the code locally on my Windows machine. The Maven dependencies I am using, the code, and the exception are all listed below.
inputDF.write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, tablename)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "GameId")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "operatorShortName")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "HandledTimestamp")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save("s3a://s3_buket/Games2")
<!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk -->
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
<version>1.11.623</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>3.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>com.uber.hoodie</groupId>
<artifactId>hoodie</artifactId>
<version>0.4.7</version>
<type>pom</type>
</dependency>
<!-- https://mvnrepository.com/artifact/com.uber.hoodie/hoodie-spark -->
<dependency>
<groupId>com.uber.hoodie</groupId>
<artifactId>hoodie-spark</artifactId>
<version>0.4.7</version>
</dependency>
Exception in thread "main" com.uber.hoodie.exception.DatasetNotFoundException: Hoodie dataset not found in path s3a://gat-datalake-raw-dev/Games2\.hoodie
at com.uber.hoodie.exception.DatasetNotFoundException.checkValidDataset(DatasetNotFoundException.java:45)
at com.uber.hoodie.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:91)
at com.uber.hoodie.HoodieWriteClient.rollbackInflightCommits(HoodieWriteClient.java:1172)
at com.uber.hoodie.HoodieWriteClient.startCommitWithTime(HoodieWriteClient.java:1044)
at com.uber.hoodie.HoodieWriteClient.startCommit(HoodieWriteClient.java:1037)
at com.uber.hoodie.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:91)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at com.playngoplatform.scala.dao.DataAccessS3.writeDataToRefinedS3(DataAccessS3.scala:26)
at com.playngoplatform.scala.controller.GameAndProviderDataTransform.processData(GameAndProviderDataTransform.scala:29)
at com.playngoplatform.scala.action.GameAndProviderData$.main(GameAndProviderData.scala:10)
at com.playngoplatform.scala.action.GameAndProviderData.main(GameAndProviderData.scala)
I am not doing anything else apart from this; I am just creating a Hudi dataset directly from my Spark data source code. I can see the folder getting created at the S3 path, but nothing beyond that.
The .hoodie.properties file is shown below:
hoodie.compaction.payload.class=com.uber.hoodie.common.model.HoodieAvroPayload
hoodie.table.name=hoodie.games
hoodie.archivelog.folder=archived
hoodie.table.type=MERGE_ON_READ

Hudi is not yet mature enough to fully support Windows.
The issue was fixed by changing the file separator character so that the code runs correctly on a Windows machine.
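For illustration only, here is a minimal sketch (not an official Hudi fix; the bucket name is just the placeholder from the question) of the separator difference: Hadoop's Path always builds paths with forward slashes, while java.io.File.separator on Windows produces the backslash seen in the exception (Games2\.hoodie).
import org.apache.hadoop.fs.Path

// Hadoop's Path normalizes to forward slashes on every OS
val metaPath = new Path("s3a://s3_buket/Games2", ".hoodie").toString
// => "s3a://s3_buket/Games2/.hoodie"

// java.io.File.separator is "\" on Windows, which yields the broken path from the stack trace
val brokenPath = "s3a://s3_buket/Games2" + java.io.File.separator + ".hoodie"
// => "s3a://s3_buket/Games2\.hoodie" on Windows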

Related

Azure Keyvault library to Atlassian Confluence plugin pom.xml

I am trying to combine these 2 tutorials - Confluence Hello World Macro & Azure keyvault quick start:
https://developer.atlassian.com/server/framework/atlassian-sdk/create-a-confluence-hello-world-macro/
https://learn.microsoft.com/en-us/azure/key-vault/secrets/quick-create-java?tabs=azure-cli
After adding the two Azure dependencies to the pom.xml of the Maven project and running atlas-mvn clean package, I receive an error message about 3 banned dependencies.
I looked for the newest Azure packages on the Maven portal; after updating them, the banned dependencies were reduced to one:
Found Banned Dependency: org.slf4j:slf4j-api:jar:1.7.25
Then I added exclusions to the dependency section (the pom snippet is shown below).
With the exclusions in place the build ran successfully; however, the Confluence plugin now produces a runtime error:
java.lang.NoClassDefFoundError
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at com.azure.security.keyvault.secrets.SecretClientBuilder.&lt;init&gt;(SecretClientBuilder.java:110)
Can you please help? How can I achieve this?
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-security-keyvault-secrets</artifactId>
<version>4.3.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-identity</artifactId>
<version>1.4.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
The error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at com.azure.security.keyvault.secrets.SecretClientBuilder.&lt;init&gt;(SecretClientBuilder.java:110)
The above error indicates that the JVM is not able to find the org/slf4j/Logger class on your application's classpath. The simplest cause of this error is a missing slf4j JAR.
If the problem is caused by the missing slf4j JAR, you can fix it by adding a suitable version of it to your classpath.
Which version of the JAR you should add depends on the application; in general, use the latest compatible version.
In Maven, you can also add the following dependency to your pom.xml to pull in slf4j-api:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>
Reference:
java.lang.NoClassDefFoundError: org.slf4j.LoggerFactory - Stack Overflow

Spring doesn't see the H2 database and hence complains about the database not being available

I'm building a simple reactive web application (following Josh Long's tech talk). Simply put, I have reactive web, R2DBC, and H2 as dependencies.
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-r2dbc</artifactId>
</dependency>
<dependency>
<groupId>com.h2database</groupId>
<artifactId>h2</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.projectreactor</groupId>
<artifactId>reactor-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
So I expect Spring to configure everything for me (it does for Josh). But I get an error saying it cannot connect to a database, along with a suggestion to include H2 (which I already have). What am I doing wrong here?
Description:
Failed to configure a ConnectionFactory: 'url' attribute is not specified and no embedded database could be configured.
Reason: Failed to determine a suitable R2DBC Connection URL
Action:
Consider the following:
If you want an embedded database (H2), please put it on the classpath.
If you have database settings to be loaded from a particular profile you may need to activate it (no profiles are currently active).
OK, it was the missing r2dbc-h2 dependency. This happened because I didn't add R2DBC when I created the project with start.spring.io; I added it afterwards and inspected the generated pom, but only copied spring-boot-starter-data-r2dbc.
<dependency>
<groupId>io.r2dbc</groupId>
<artifactId>r2dbc-h2</artifactId>
<scope>runtime</scope>
</dependency>
Still a bit confusing, though. Spring Boot says it looks at the classpath and auto-configures accordingly, but it seems that sometimes it needs a particular combination of dependencies.

Amazon EMR: getting an error with the Hadoop recoverable writer while submitting an Apache Flink job

Added dependency POM details:
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.7.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime_2.11</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
<version>1.7.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.10_2.11</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_2.11</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-hadoop-compatibility_2.11</artifactId>
<version>1.7.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-s3-fs-hadoop</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-shaded-hadoop</artifactId>
<version>1.7.1</version>
<type>pom</type>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>2.8.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.8.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.5</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.11.529</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-connectors</artifactId>
<version>1.1.5</version>
<type>pom</type>
</dependency>
</dependencies>
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer
at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.&lt;init&gt;(HadoopRecoverableWriter.java:57)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.&lt;init&gt;(Buckets.java:112)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
Flink uses something called a ServiceLoader to load components needed to interface with pluggable File Systems. If you care to see where Flink does this in code, head over to org.apache.flink.core.fs.FileSystem. Take note of the initialize function, which makes use of the RAW_FACTORIES variable. RAW_FACTORIES is created by the function loadFileSystems, which you can see makes use of Java's ServiceLoader.
The file system components need to be setup prior to your application starting on Flink. This implies that your Flink application does not need to bundle these components, they should be provided for your application.
EMR does not provide the S3 file system components that Flink needs to use S3 as a streaming file sink out of the box. This exception is being thrown not because the Hadoop version isn't high enough, but because Flink loaded the HadoopFileSystem in the absence of a FileSystem that matched the s3 scheme (see the loadFileSystems logic in the Flink source).
You can see whether your file systems are loading by enabling the DEBUG logging level for your Flink application, which EMR lets you do in configurations:
{
  "Classification": "flink-log4j",
  "Properties": {
    "log4j.rootLogger": "DEBUG,file"
  }
},
{
  "Classification": "flink-log4j-yarn-session",
  "Properties": {
    "log4j.rootLogger": "DEBUG,stdout"
  }
}
The relevant logs are available in the YARN ResourceManager, under the logs for an individual node. Searching for the string "Added file system" should help you locate all successfully loaded file systems.
Also handy in this investigation was to SSH to the master node and use the flink-scala REPL, where I could see what FileSystem Flink decided to load given a file URI.
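For example, a quick check in the flink-scala REPL might look roughly like the sketch below (the bucket and key are placeholders); if it prints HadoopFileSystem rather than an S3-specific file system, no s3 FileSystem factory was loaded and you will hit the recoverable-writer error.
import org.apache.flink.core.fs.{FileSystem, Path}

// Ask Flink which FileSystem implementation it resolves for an s3:// URI
val fs = FileSystem.get(new Path("s3://my-bucket/some/key").toUri)
println(fs.getClass.getName)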
The solution is to drop the JAR for the S3 file system implementation into /usr/lib/flink/lib/ prior to starting your Flink application. This can be done with a bootstrap action that grabs the flink-s3-fs-hadoop or flink-s3-fs-presto (depending on which implementation you are using). My bootstrap action script looks something like this:
sudo mkdir -p /usr/lib/flink/lib
cd /usr/lib/flink/lib
sudo curl -O https://search.maven.org/remotecontent?filepath=org/apache/flink/flink-s3-fs-hadoop/1.8.1/flink-s3-fs-hadoop-1.8.1.jar
In order to use Flink's StreamingFileSink with exactly once guarantees, you need to use Hadoop >= 2.7. Versions below 2.7 are not supported. Hence, please make sure that you are running an up to date Hadoop version on EMR.

Spark structured streaming Elasticsearch integration issue. Data source es does not support streamed writing

I am writing a Spark Structured Streaming application in which data processed with Spark needs to be sunk to Elasticsearch.
This is my development environment, so I have a standalone Elasticsearch instance.
I have tried the following two ways to sink the data in the Dataset to ES:
1.ds.writeStream().format("org.elasticsearch.spark.sql").start("spark/orders");
2.ds.writeStream().format("es").start("spark/orders");
In both cases I am getting the following error:
Caused by:
java.lang.UnsupportedOperationException: Data source es does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:272) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:213) ~[spark-sql_2.11-2.1.1.jar:2.1.1]
pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.mongodb.spark</groupId>
<artifactId>mongo-spark-connector_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.6.1</version>
</dependency>
Appreciate any help in resolving this issue.
You can try:
ds.write
  .format("org.elasticsearch.spark.sql")
  .option("es.resource", ES_INDEX + "/" + ES_TYPE)
  .option("es.mapping.id", ES_ID)
  .mode("overwrite")
  .save()
The Elasticsearch sink does not support streamed writing, which means you can't stream output to Elasticsearch.
You could write the streaming output to Kafka and use Logstash to read from Kafka into Elasticsearch.
Update:
Streamed writing is now supported as of Elasticsearch 6.x when using Spark 2.2.0.
Dependency:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>6.2.4</version>
</dependency>
writeStream code:
ds
.writeStream
.outputMode(OutputMode.Append()) // only append mode is currently supported
.format("es")
.option("checkpointLocation", "/my/checkpointLocation")
.option("es.mapping.id", "MY_OPTIONAL_ID_ATTRIBUTE")
.trigger(Trigger.ProcessingTime(5, TimeUnit.SECONDS))
.start("index/type")

Do I need to install GlassFish server to use it as an embedded server in an application?

I am trying to use GlassFish as an embedded server in my EJB 3.1 project.
Below are my Maven dependencies.
But when I run my tests it fails to deploy the EJB modules.
Do I need to set javaee.home or some other variable?
<dependency>
<groupId>org.glassfish.extras</groupId>
<artifactId>glassfish-embedded-all</artifactId>
<version>3.1-SNAPSHOT</version>
<scope>test</scope>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.glassfish.extras</groupId>
<artifactId>glassfish-embedded-static-shell</artifactId>
<version>3.1-SNAPSHOT</version>
<scope>test</scope>
<type>jar</type>
</dependency>
<dependency>
<groupId>javax</groupId>
<artifactId>javaee-api</artifactId>
<version>6.0</version>
<scope>provided</scope>
</dependency>
The exception is:
Caused by: org.omg.CORBA.DATA_CONVERSION: vmcid: SUN minor code: 214 completed: No
.
.
.
Caused by: java.lang.IllegalStateException: java.lang.RuntimeException: java.util.MissingResourceException: Can't find resource for bundle java.util.PropertyResourceBundle, key iiop.cannot_find_keyalias
No. You don't even need glassfish-embedded-static-shell.jar.
If you only want to use EJB 3.1, the glassfish-embedded-all jar is enough.
If you want to access JPA data sources from EJB 3, then you also need a domain.xml file on the classpath.
You will need to pass the property "org.glassfish.ejb.embedded.glassfish.installation.root" while creating the EJB container in client code (i.e. EJBContainer.createEJBContainer(prop)). The value of this property should be a folder name (e.g. glassfish).
The folder should contain the domains\domain1\config\domain.xml file.
You can download and install GlassFish v3 and copy this file from that installation.
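As a rough sketch (assuming a local folder named glassfish that contains domains/domain1/config/domain.xml), creating the embedded container with that property looks something like this:
import java.util.{HashMap => JHashMap}
import javax.ejb.embeddable.EJBContainer

// "glassfish" is an example installation-root folder containing domains/domain1/config/domain.xml
val props = new JHashMap[String, AnyRef]()
props.put("org.glassfish.ejb.embedded.glassfish.installation.root", "glassfish")

val container = EJBContainer.createEJBContainer(props)
try {
  // look up and exercise your EJBs via container.getContext here
} finally {
  container.close()
}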