Spark structured stream to kudu context - apache-spark-sql

I want to read a Kafka topic and then write it to a Kudu table with Spark Streaming.
My first approach
// sessions and contexts
val conf = new SparkConf().setMaster("local[2]").setAppName("TestMain")
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
val sparkContext = sparkSession.sparkContext
val kuduContext = new KuduContext("...", sparkContext)

// structure
val schema: StructType = StructType(
  StructField("userNo", IntegerType, true) ::
  StructField("bandNo", IntegerType, false) ::
  StructField("ipv4", StringType, false) :: Nil)

// kudu - prepare table
kuduContext.deleteTable("test_table")
kuduContext.createTable("test_table", schema, Seq("userNo"), new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("userNo").asJava, 3))

// get stream from kafka
val parsed = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("startingOffsets", "latest")
  .option("subscribe", "feed_api_band_get_popular_post_list")
  .load()
  .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

// write it to kudu
kuduContext.insertRows(parsed.toDF(), "test_table")
Now it complains
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
My second approach
It seems I have to change my code to use the traditional KafkaUtils.createDirectStream:
KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
).foreachRDD(rdd => {
  rdd.foreach(record => {
    // write to kudu.............
    println(record.value());
  })
});

ssc.start();
ssc.awaitTermination();
So, which one is the right approach? Or is there any way to make the first approach work?
Spark version is 2.2.0.

Both approaches seem right. The first uses the Spark Structured Streaming way of doing things, where the data is treated as a continuously appended table. The second does it the traditional DStream way.
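For illustration, the "write to kudu" placeholder in the second approach could be filled in roughly as below. This is only a sketch reusing the question's sparkSession, schema and kuduContext; kuduContext.insertRows works fine here because foreachRDD hands you ordinary batch data.
import org.apache.spark.sql.functions.{col, from_json}

KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
).foreachRDD { rdd =>
  import sparkSession.implicits._
  // Parse each record's JSON value, flatten the struct, and insert the batch DataFrame into Kudu.
  val df = rdd.map(_.value()).toDF("value")
    .select(from_json(col("value"), schema).alias("parsed"))
    .select("parsed.*")
  kuduContext.insertRows(df, "test_table")
}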

I believe that, at present, there is no Kudu support for using the KuduContext with Spark Structured Streaming. I had a similar issue and had to fall back on using the traditional Kudu client and implementing a ForeachWriter[Row] class. I used the examples here and was able to achieve a solution.

The first approach is incorrect, as you can already see from the error, which is very clear: Queries with streaming sources must be executed with writeStream.start(). kuduContext.insertRows only works on batch data.
The second one uses DStream, so not structured streaming.
There are also a third and a fourth approach.
Starting with Kudu 1.9.0, Structured Streaming is supported (this issue has been fixed) and works as expected:
parsed
  .writeStream
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .option("kudu.operation", operation)
  .start()
Note that if you are using Cloudera, this method will only work with CDH 6.2.0 and above:
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>1.9.0-cdh6.2.0</version>
    <scope>test</scope>
</dependency>
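One practical detail: the parsed DataFrame in the question still holds a single struct column (parsed_value), while the kudu sink expects top-level columns matching the table schema. A hedged sketch of flattening it first (the checkpoint location is an assumption you would replace):
parsed
  .select("parsed_value.*")                               // expose userNo, bandNo, ipv4 as columns
  .writeStream
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .option("kudu.operation", operation)
  .option("checkpointLocation", "/tmp/kudu-checkpoint")   // hypothetical path
  .start()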
My own solution was to look at the KuduContext code to see what kuduContext.insertRows(df, table) and the other methods do, and then create a ForeachWriter[Row]:
val kuduContext = new KuduContext(master, sparkContext)

parsed
  .toDF()
  .writeStream
  .foreach(new ForeachWriter[Row] {

    override def open(partitionId: Long, version: Long): Boolean =
      kuduContext.tableExists(table)

    override def process(value: Row): Unit = {
      val kuduClient = kuduContext.syncClient
      val kuduSession = kuduClient.newSession()
      kuduSession.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)
      kuduSession.setIgnoreAllDuplicateRows(ignoreDuplicates)
      val kuduTable = kuduClient.openTable(table)
      val operation = getOperationFunction(kuduTable) // get kuduTable.newInsert(), newUpsert(), etc.
      val row = operation.getRow
      row.addInt("userNo", value.getAs[Int]("userNo"))
      row.addInt("bandNo", value.getAs[Int]("bandNo"))
      row.addString("ipv4", value.getAs[String]("ipv4"))
      kuduSession.apply(operation)
      kuduSession.flush()
      kuduSession.close()
    }

    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
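A note on the design: opening and closing a Kudu session for every row keeps the example simple but is relatively expensive. If throughput matters, the usual refinement is to open the session in open and close it in close, so that each partition reuses a single session.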

We can also load Structured Streaming data into a Kudu table using Spark 2.2.0 and Cloudera CDH 5.14. You just need to download the kudu-spark2_2.11 jar built for CDH 6 (link below) and pass it with --jars in your spark-submit command. That jar provides the kudu format used in the statement below, so the streaming DataFrame loads into Kudu easily.
parsed
  .writeStream
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .option("kudu.operation", operation)
  .start()
The JAR can be downloaded from https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2_2.11/1.10.0-cdh6.3.2
Spark-submit statement:
spark2-submit --master local[*] --deploy-mode client --jars spark-sql-kafka-0-10_2.11-2.2.0.jar,kafka-clients-0.10.0.0.jar,spark-streaming-kafka-0-10_2.11-2.2.0.jar,kudu-spark2_2.11-1.10.0-cdh6.3.2.jar,kudu-client-1.10.0-cdh6.3.2.jar /path_of_python_code/rdd-stream-read.py
Note: kudu-client is optional; it might be needed with cluster deploy mode.
writeStream statement used:
query = dfCols.writeStream \
    .format("kudu") \
    .option("kudu.master", "host:7051,host:7051,host:7051") \
    .option("kudu.table", "impala::db.kudu_table_name") \
    .option("kudu.operation", "upsert") \
    .option("checkpointLocation", "file:///path_of_dir/checkpoint/") \
    .start()

Related

How do I get the data of one row of a Structured Streaming Dataframe in pyspark?

I have a Kafka broker with a topic connected to Spark Structured Streaming. My topic sends data to my streaming dataframe, and I'd like to get information on each row for this topic (because I need to compare each row with another database).
If I could transform my batches into an RDD I could get each row easily.
I also saw something about DStreams, but I don't know if it still works with the latest version of Spark.
Is DStream the answer to my problem, or is there any other solution to get my data row by row?
Read the data from Kafka with Spark Structured Streaming and write your custom row comparison in the foreach writer of the streaming query, e.g.:
streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
      true
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }).start()
This is supported in Python, Scala, and Java since Spark 2.4.
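If you prefer to work on whole micro-batches rather than single rows, foreachBatch (also available since Spark 2.4) is an alternative. The sketch below assumes a lookupAndCompare function standing in for whatever comparison against your other database you need:
import org.apache.spark.sql.Dataset

// Each micro-batch arrives as a plain (non-streaming) Dataset, so regular batch APIs work here.
val handleBatch: (Dataset[String], Long) => Unit = (batch, batchId) => {
  val compareRow: String => Unit = record => lookupAndCompare(record) // hypothetical comparison
  batch.foreach(compareRow)
}

streamingDatasetOfString.writeStream
  .foreachBatch(handleBatch)
  .start()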

How to avoid duplicates in BigQuery by streaming with Apache Beam IO?

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres) and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was more or less fixed with a unique constraint and "ON CONFLICT... DO NOTHING").
At first we trusted the "insertId" UUID that Apache Beam/BigQuery supposedly creates.
Then we added a "unique_label" attribute to each message before queueing them into PubSub, using data from the JSON itself to give them uniqueness (a device_id + a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LATEST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)
var mappings = ""
// Value only available at runtime
if (options.schemaFile.isAccessible){
mappings = readCloudFile(options.schemaFile.get())
}
val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)
val pubsubMessages =
pipeline
.apply("ReadPubSubMessages",
PubsubIO
.readMessagesWithAttributes()
.withIdAttribute("id_label")
.fromTopic(options.pubSubInput))
pubsubMessages
.apply("AckPubSubMessages", ParDo.of(object: DoFn<PubsubMessage, String>() {
@ProcessElement
fun processElement(context: ProcessContext) {
LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
context.output("")
}
}))
val disarmedMessages =
pubsubMessages
.apply("DisarmedPubSubMessages",
DisarmPubsubMessage(tableRowMapper, postgresMapper)
)
disarmedMessages
.get(TupleTags.readingErrorTag)
.apply("LogDisarmedErrors", ParDo.of(object: DoFn<String, String>() {
@ProcessElement
fun processElement(context: ProcessContext) {
LOG.info(context.element())
context.output("")
}
}))
disarmedMessages
.get(TupleTags.tableRowTag)
.apply("WriteToBigQuery",
BigQueryIO
.writeTableRows()
.withoutValidation()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
.to(options.bigQueryOutput)
)
pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to get TableRow and ReadingsInputFlatten (our own class for Postgres).
We expect zero duplicates, or at least a "best effort" (plus some cleaning cron job we could add); we paid for these products to run statistics and big-data analysis...
[UPDATE 1]
I even appended a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
new flow with AckPubSubMessages step
Thanks!!
Looks like you are using the global window. One technique would be to window this into N-minute windows, then process the keys in each window and drop any items with duplicate keys.
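To make that concrete, here is a hedged sketch of the windowed de-duplication against the Beam Java SDK (written in Scala only for brevity; the id_label attribute is the one the question already sets):
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage
import org.apache.beam.sdk.transforms.{Distinct, SerializableFunction}
import org.apache.beam.sdk.transforms.windowing.{FixedWindows, Window}
import org.apache.beam.sdk.values.{PCollection, TypeDescriptors}
import org.joda.time.Duration

// Window into fixed 1-minute windows, then keep a single element per id_label value.
def dedupById(messages: PCollection[PubsubMessage]): PCollection[PubsubMessage] =
  messages
    .apply("FixedWindows", Window.into[PubsubMessage](FixedWindows.of(Duration.standardMinutes(1))))
    .apply("DedupById", Distinct
      .withRepresentativeValueFn(new SerializableFunction[PubsubMessage, String] {
        override def apply(msg: PubsubMessage): String = msg.getAttribute("id_label")
      })
      .withRepresentativeType(TypeDescriptors.strings()))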
The supported programming languages are Python and Java; your code seems to be Scala, and as far as I know that is not supported. I strongly recommend using Java to avoid unsupported features in the language you use.
In addition, I would recommend the following approaches to deal with the duplicates; option 2 could meet your need for near-real-time:
message_id. You probably already read the FAQ on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated if not set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
BigQuery Streaming. To validate duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section Example sink: Google BigQuery.
Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo.
How can I do this using the Apache Beam BigQueryIO API?
This is possible using a feature recently added to BigQueryIO in Apache Beam.
PCollection<Foo> foos = ...;
foos.apply(BigQueryIO.write().to(new SerializableFunction<ValueInSingleWindow<Foo>, TableDestination>() {
      @Override
      public TableDestination apply(ValueInSingleWindow<Foo> value) {
        Foo foo = value.getValue();
        // Also available: value.getWindow(), getTimestamp(), getPane()
        String tableSpec = ...;
        String tableDescription = ...;
        return new TableDestination(tableSpec, tableDescription);
      }
    })
    .withFormatFunction(new SerializableFunction<Foo, TableRow>() {
      @Override
      public TableRow apply(Foo foo) {
        return ...;
      }
    })
    .withSchema(...));
Depending on whether the input PCollection<Foo> is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table depending on amount of data), or it will use the BigQuery streaming inserts API.
The most flexible version of the API uses DynamicDestinations, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.
Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can yourself combine to implement more complex use cases - see files in the source directory.
This feature will be included in the first stable release of Apache Beam and into the next release of Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from github.
As of Beam 2.12.0, this feature is available in the Python SDK as well. It is marked as experimental, so you will have to pass --experiments use_beam_bq_sink to enable it. You'd do something like so:
def get_table_name(element):
  if meets_some_condition(element):
    return 'mytablename1'
  else:
    return 'mytablename2'

p = beam.Pipeline(...)
my_input_pcoll = p | ReadInMyPCollection()
my_input_pcoll | beam.io.gcp.bigquery.WriteToBigQuery(table=get_table_name)
The new sink supports a number of other options, which you can review in the pydoc.

Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job). Then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is because we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
To begin, as noted in this question the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example on how to read data from BigQuery into Spark. In this example, we will read data from BigQuery to perform a word count.
You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD. The Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
  "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

val conf = sc.hadoopConfiguration

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
  fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
The above example doesn't show how to write data to an output table. You need to do this:
.saveAsNewAPIHadoopFile(
  hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
  classOf[String],
  classOf[JsonObject],
  classOf[BigQueryOutputFormat[String, JsonObject]],
  hadoopConf)
where the key: String is actually ignored
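To make that more concrete, here is a hedged sketch of the full write path, continuing the word count example above. It reuses conf and tableData from the read example and assumes the temporary GCS path key has already been set in that configuration; the exact output-format class can vary by connector version.
import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat

// Count words, then turn each (word, count) pair into a JsonObject row
// matching outputTableSchema; the String key is ignored by the output format.
val wordCounts = tableData
  .map(entry => (entry._2.get("word").getAsString.toLowerCase, 1))
  .reduceByKey(_ + _)

val outputRows = wordCounts.map { case (word, count) =>
  val json = new JsonObject()
  json.addProperty("Word", word)
  json.addProperty("Count", count)
  (word, json)
}

outputRows.saveAsNewAPIHadoopFile(
  conf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
  classOf[String],
  classOf[JsonObject],
  classOf[BigQueryOutputFormat[String, JsonObject]],
  conf)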

Why is "error: not found: value StructType" at sql schema create?

I have the CDH5 build of Spark 1.0.0 installed on CentOS 6.2 and it runs without error.
When trying to run some Spark SQL I encounter an error. I start my Spark shell fine ...
spark-shell --master spark://mysparkserver:7077
then I run one of the example Scala scripts from the Spark SQL Programming Guide.
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val vehicle = sc.textFile("/tmp/scala.csv")
scala> val schemaString = "year manufacturer model class engine cylinders fuel consumption clkm hlkm cmpg hmpg co2lyr co2gkm"
scala> import org.apache.spark.sql._
scala> val schema =
  StructType(
    schemaString.split(" ").map(fieldName =>
      StructField(fieldName, StringType, true))
  )
But the import statement doesn't seem to have worked, because the last line gives this error:
scala> StructType
<console>:14: error: not found: value StructType
StructType
^
I do know that StructType is org.apache.spark.sql.api.java.StructType, and if I replace StructType in the schema line with the full name, the error changes.
Has anyone else encountered this error? Is there an extra step required that I am missing?
Your problem is that you are reading the programming guide for the latest version of Spark, and trying it out on Spark 1.0.0. Alas, org.apache.spark.sql.api.java.StructType was introduced in Spark 1.1.0, as was the section on "Programmatically Specifying the Schema".
So, without upgrading, you're not going to be able to do this -- unless you're able to make use of the techniques in the 1.0.0 guide section "Running SQL on RDDs", which in 1.1.0 is called "Inferring the Schema Using Reflection". (Basically, if you can tolerate a fixed Schema.)
If you look at the various documentation URLs, you want to replace latest with 1.0.0. When in doubt, I like to bring up multiple versions of the API docs and search. I notice that, like Javadoc, Scaladoc has a @since annotation for making this information clearer in API docs, but it isn't being used in the Spark API docs.
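In case it helps, here is a minimal sketch of that reflection-based route under the 1.0.0 API, using just a few of the columns from the question's schemaString (the case class fields and CSV layout are assumptions):
// Spark 1.0.0-era "Running SQL on RDDs": infer the schema from a case class.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Vehicle(year: String, manufacturer: String, model: String)

val vehicles = sc.textFile("/tmp/scala.csv")
  .map(_.split(","))
  .map(v => Vehicle(v(0), v(1), v(2)))

vehicles.registerAsTable("vehicles")
sqlContext.sql("SELECT manufacturer, COUNT(*) FROM vehicles GROUP BY manufacturer").collect()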
I have encountered this issue even in Spark 3.0.0.
Please use the import below:
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val schema = StructType(Array(
  StructField("language", StringType, true),
  StructField("language", StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(language,StringType,true), StructField(language,StringType,true))