Spark Streaming + Spark SQL - apache-spark-sql

I am trying to process logs via Spark Streaming and Spark SQL. The main idea is to keep a "compacted" dataset in Parquet format for "old" data, converted to a DataFrame as needed for queries. The compacted dataset is loaded with:
SQLContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame compact = null;
compact = sqlContext.parquetFile("hdfs://auto-ha/tmp/data/logs");
As the uncompacted dataset (I compact it daily) is composed of many files, I would like to keep the current day's data in a DStream so that those queries stay fast.
I have tried the DataFrame approach, without success:
DataFrame df = JavaSQLContextSingleton.getInstance(sc.sc()).createDataFrame(lastData, schema);
df.registerTempTable("lastData");
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
@Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
DataFrame df = JavaSQLContextSingleton.getInstance(v1.context()).createDataFrame(v1, schema);
// ... drop old data from the lastData table
df.insertInto("lastData");
}
});
With this approach I do not get any results if, for example, I query the temp table from a different thread.
I have also tried to use the RDD transform method. More specifically, I tried to follow the Spark example where I create an empty RDD and then union the DStream's RDD contents with it:
JavaRDD<Row> lastData = sc.emptyRDD();
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
@Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
lastData.union(v1).filter(/* keep only recent data... */);
}
});
This approach does not work either, as lastData never gets any contents.
Could I use windowed computations or updateStateByKey for this purpose?
Any suggestions?
Thanks for your help!

Well I finally got it.
I use the updateStateByKey function and clear the state (return Optional.absent()) if the timestamp is older than 24 hours, like this:
final static Function2<List<Long>, Optional<Long>, Optional<Long>> RETAIN_RECENT_DATA
= (List<Long> values, Optional<Long> state) -> {
Long newSum = state.or(0L);
for (Long value : values) {
newSum += value;
}
// System.currentTimeMillis() is UTC-based epoch millis
if (System.currentTimeMillis() - newSum > 86400000L) {
return Optional.absent();
} else {
return Optional.of(newSum);
}
};
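For reference, this is roughly how the function above gets plugged in (a minimal sketch; pairStream is an assumed JavaPairDStream<String, Long> of key/event-timestamp pairs, named here only for illustration):
// State older than 24 hours disappears because RETAIN_RECENT_DATA returns Optional.absent() for it.
JavaPairDStream<String, Long> recentState = pairStream.updateStateByKey(RETAIN_RECENT_DATA);
// recentState is then mapped to Rows to produce the finalsum stream used below.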
Then on each batch I register the DataFrame as temp table:
finalsum.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
if (!rdd.isEmpty()) {
HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
if (sqlContext1.cacheManager().isCached("alarm_recent")) {
sqlContext1.uncacheTable("alarm_recent");
}
DataFrame wordsDataFrame = sqlContext1.createDataFrame(rdd, schema);
wordsDataFrame.registerTempTable("alarm_recent");
wordsDataFrame.cache();
wordsDataFrame.first();
}
return null;
});
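Once a batch has registered and cached the table, another part of the application can query it through the same singleton context, for example (a sketch, assuming the same JavaSQLContextSingleton and JavaSparkContext sc):
HiveContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame recentAlarms = sqlContext.sql("SELECT * FROM alarm_recent");
recentAlarms.show();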

You could use mapWithState with Spark 1.6.
The mapWithState function is much more efficient and easier to implement.
Take a look at this link.
mapWithState supports cool functionality like state timeout and initialRDD, which come in handy while maintaining a stateful DStream.
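For example, a rough sketch with the Spark 1.6 Java API (pairStream and the String/Long key and value types are assumptions for illustration, not taken from the question):
import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

// Keep the most recent timestamp per key; state that stays idle for 24 hours times out.
Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> mappingFunc =
    (key, value, state) -> {
        long latest = Math.max(value.or(0L), state.exists() ? state.get() : 0L);
        if (!state.isTimingOut()) {
            state.update(latest);
        }
        return new Tuple2<>(key, latest);
    };

JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> stateStream =
    pairStream.mapWithState(StateSpec.function(mappingFunc).timeout(Durations.minutes(24 * 60)));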
Thanks
Manas

Related

How to add a column to query results in Google Dataflow

I am trying to read a query from BigQuery with Apache Beam / Dataflow in Kotlin, and then add a column with the current date as a timestamp. I don't want to do it inside the query itself because I want to reuse this code for a large number of queries, and it looks like a better design.
This is the pipeline code I wrote:
val pipeline = Pipeline.create(options)
.apply("Retrieve query", BigQueryIO.readTableRows().fromQuery(query).usingStandardSql())
.apply("Add date", ParDo.of(AddDate()))
.apply("Store data", BigQueryIO.writeTableRows().withSchema(tableSchema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.to(TableReference().setProjectId(gcpProject).setDatasetId(datasetId).setTableId(tableId)))
For some reason it does not advance from the Add date transformation.
This is the code which is most likely to have the bug / error:
class AddDate : DoFn<TableRow, TableRow>() {
@ProcessElement
fun processElement(context: ProcessContext) {
val tableRow = context.element() as TableRow
tableRow.set("process_date", Instant.now())
context.output(tableRow)
}
}
I also tried this code inside processElement instead, but it still does not work.
context.outputWithTimestamp(context.element(), Instant.now())
The error is the following:
Input values must not be mutated in any way.
The problem is solved by using a new object and being careful with the type used for the date (for DATE or TIMESTAMP columns):
@ProcessElement
fun processElement(context: ProcessContext) {
val tableRow = TableRow()
tableRow.set("process_date", Instant.now().toString())
val input = context.element() as TableRow
input.keys.forEach { tableRow.set(it, input[it]) }
context.output(tableRow)
}

BigQuery: Split table based on a column

Short question: I would like to split a BQ table into multiple small tables, based on the distinct values of a column. So, if column country has 10 distinct values, it should split the table into 10 individual tables, each holding the respective country's data. Ideally this would be done from within a BQ query (using INSERT, MERGE, etc.).
What I am doing right now is importing data to gstorage -> local storage -> doing the splits locally and then pushing them into tables (which is a very time-consuming process).
Thanks.
If the data has the same schema, just leave it in one table and use the clustering feature: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_clustered_table
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY dateCol
CLUSTER BY country
OPTIONS (
description="a table clustered by country"
) AS (
SELECT ....
)
https://cloud.google.com/bigquery/docs/clustered-tables
The feature is in beta though.
You can use Dataflow for this. This answer gives an example of a pipeline that queries a BigQuery table, splits the rows based on a column and then outputs them to different PubSub topics (which could be different BigQuery tables instead):
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<TableRow> weatherData = p.apply(
BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));
final TupleTag<String> readings2010 = new TupleTag<String>() {
};
final TupleTag<String> readings2000plus = new TupleTag<String>() {
};
final TupleTag<String> readingsOld = new TupleTag<String>() {
};
PCollectionTuple collectionTuple = weatherData.apply(ParDo.named("tablerow2string")
.withOutputTags(readings2010, TupleTagList.of(readings2000plus).and(readingsOld))
.of(new DoFn<TableRow, String>() {
@Override
public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
if (c.element().getF().get(2).getV().equals("2010")) {
c.output(c.element().toString());
} else if (Integer.parseInt(c.element().getF().get(2).getV().toString()) > 2000) {
c.sideOutput(readings2000plus, c.element().toString());
} else {
c.sideOutput(readingsOld, c.element().toString());
}
}
}));
collectionTuple.get(readings2010)
.apply(PubsubIO.Write.named("WriteToPubsub1").topic("projects/fh-dataflow/topics/bq2pubsub-topic1"));
collectionTuple.get(readings2000plus)
.apply(PubsubIO.Write.named("WriteToPubsub2").topic("projects/fh-dataflow/topics/bq2pubsub-topic2"));
collectionTuple.get(readingsOld)
.apply(PubsubIO.Write.named("WriteToPubsub3").topic("projects/fh-dataflow/topics/bq2pubsub-topic3"));
p.run();
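If the three outputs should land in separate BigQuery tables rather than Pub/Sub topics, each tagged collection can be converted back to TableRows and written with BigQueryIO. A rough sketch for one of them (the destination table, field name and rowSchema are placeholders, not part of the original answer):
collectionTuple.get(readings2010)
    .apply(ParDo.named("string2tablerow").of(new DoFn<String, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
            c.output(new TableRow().set("reading", c.element()));
        }
    }))
    .apply(BigQueryIO.Write.named("WriteToBigQuery1")
        .to("my-project:mydataset.readings_2010")
        .withSchema(rowSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));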

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

I wanted to take advantage of the new BigQuery functionality of time partitioned tables, but am unsure this is currently possible in the 1.6 version of the Dataflow SDK.
Looking at the BigQuery JSON API, to create a day partitioned table one needs to pass in a
"timePartitioning": { "type": "DAY" }
option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.
I thought that maybe I could pre-create the table, and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda..? Is anyone else having success with creating/writing partitioned tables via Dataflow?
This seems like a similar issue to setting the table expiration time which isn't currently available either.
As Pavan says, it is definitely possible to write to partition tables with Dataflow. Are you using the DataflowPipelineRunner operating in streaming mode or batch mode?
The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:
/**
* A Joda-time formatter that prints a date in a format like {@code "20160101"}.
* Threadsafe.
*/
private static final DateTimeFormatter FORMATTER =
DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);
// This code generates a valid BigQuery partition name:
Instant instant = Instant.now(); // any Joda instant in a reasonable time range
String baseTableName = "project:dataset.table"; // a valid BigQuery table name
String partitionName =
String.format("%s$%s", baseTableName, FORMATTER.print(instant));
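A minimal sketch of how that partition name could then be handed to the sink (rows and schema stand for an existing PCollection<TableRow> and its TableSchema, so they are assumptions here):
rows.apply(BigQueryIO.Write
    .named("WriteToDayPartition")
    .to(partitionName) // e.g. "project:dataset.table$20160101"
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));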
The approach I took (works in the streaming mode, too):
Define a custom window for the incoming record
Convert the window into the table/partition name
p.apply(PubsubIO.Read
.subscription(subscription)
.withCoder(TableRowJsonCoder.of())
)
.apply(Window.into(new TablePartitionWindowFn()) )
.apply(BigQueryIO.Write
.to(new DayPartitionFunc(dataset, table))
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
The window is set based on the incoming data; the end Instant can be ignored, as only the start value is used for setting the partition:
public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

    private IntervalWindow assignWindow(AssignContext context) {
        TableRow source = (TableRow) context.element();
        String dttm_str = (String) source.get("DTTM");
        DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();
        Instant start_point = Instant.parse(dttm_str, formatter);
        Instant end_point = start_point.withDurationAdded(1000, 1);
        return new IntervalWindow(start_point, end_point);
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }

    @Override
    public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
        return Arrays.asList(assignWindow(c));
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return false;
    }

    @Override
    public IntervalWindow getSideInputWindow(BoundedWindow window) {
        if (window instanceof GlobalWindow) {
            throw new IllegalArgumentException(
                    "Attempted to get side input window for GlobalWindow from non-global WindowFn");
        }
        return null;
    }
}
Setting the table partition dynamically:
public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {
String destination = "";
public DayPartitionFunc(String dataset, String table) {
this.destination = dataset + "." + table+ "$";
}
@Override
public String apply(BoundedWindow boundedWindow) {
// The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
String dayString = DateTimeFormat.forPattern("yyyyMMdd")
.withZone(DateTimeZone.UTC)
.print(((IntervalWindow) boundedWindow).start());
return destination + dayString;
}}
Is there a better way of achieving the same outcome?
I believe it should be possible to use the partition decorator when you are not using streaming. We are actively working on supporting partition decorators through streaming. Please let us know if you are seeing any errors today with non-streaming mode.
Apache Beam version 2.0 supports sharding BigQuery output tables out of the box.
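For instance, with Beam 2.x the destination (including the day partition) can be computed per element. A sketch, assuming a PCollection<TableRow> named rows whose "DTTM" field is a "yyyy-MM-dd..." string, plus an existing schema:
rows.apply(BigQueryIO.writeTableRows()
    .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
        // Derive the day partition decorator from the element itself.
        String day = ((String) input.getValue().get("DTTM")).substring(0, 10).replace("-", "");
        return new TableDestination("project:dataset.table$" + day, null);
    })
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));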
I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic, as in: if data already exists in that partition I can either append to it or overwrite it.
I have written the code in Python. It is a batch-mode write operation into BigQuery.
client = bigquery.Client(project=projectName)
dataset_ref = client.dataset(datasetName)
table_ref = dataset_ref.table(bqTableName)
job_config = bigquery.LoadJobConfig()
job_config.skip_leading_rows = skipLeadingRows
job_config.source_format = bigquery.SourceFormat.CSV
if tableExists(client, table_ref):
    job_config.autodetect = autoDetect
    previous_rows = client.get_table(table_ref).num_rows
    # assert previous_rows > 0
    if allowJaggedRows is True:
        job_config.allowJaggedRows = True
    if allowFieldAddition is True:
        job_config._properties['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
    if isPartitioned is True:
        job_config._properties['load']['timePartitioning'] = {"type": "DAY"}
    if schemaList is not None:
        job_config.schema = schemaList
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
else:
    job_config.autodetect = autoDetect
    job_config._properties['createDisposition'] = 'CREATE_IF_NEEDED'
    job_config.schema = schemaList
    if isPartitioned is True:
        job_config._properties['load']['timePartitioning'] = {"type": "DAY"}
    if schemaList is not None:
        table = bigquery.Table(table_ref, schema=schemaList)
load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
assert load_job.job_type == 'load'
load_job.result()
assert load_job.state == 'DONE'
It works fine.
If you pass the table name in table_name_YYYYMMDD format, then BigQuery will treat it as a sharded table, which can simulate partitioned-table features.
Refer to the documentation: https://cloud.google.com/bigquery/docs/partitioned-tables

generating DataFrames in for loop in Scala Spark cause out of memory

I'm generating small DataFrames in a for loop. At each iteration, I pass the generated DataFrame to a function which returns a double. This simple process (which I thought the garbage collector could easily take care of) blows up my memory. When I look at the Spark UI, each iteration of the loop adds a new "SQL{1-500}" entry (my loop runs 500 times). My question is how to drop this SQL object before generating a new one?
My code is something like this:
Seq.fill(500){
val data = (1 to 1000).map(_=>Random.nextInt(1000))
val dataframe = createDataFrame(data)
myFunction(dataframe)
dataframe.unpersist()
}
def myFunction(df: DataFrame)={
df.count()
}
I tried to solve this problem with dataframe.unpersist() and sqlContext.clearCache(), but neither of them worked.
You have two places where I suspect something fishy is happening:
In the definition of myFunction: you really need to put the = before the body of the definition. I have had typos like that compile but produce really weird errors (note I changed your myFunction for debugging purposes).
It is better to fill your Seq with something you know and then apply foreach or some such.
(You also need to replace random.nexInt with Random.nextInt; also, you can only create a DataFrame from a Seq of a type that is a subtype of Product, such as a tuple, and you need to go through sqlContext to call createDataFrame.)
This code works with no memory issues:
Seq.fill(500)(0).foreach{ i =>
val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
val dataframe = sqlContext.createDataFrame(data)
myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
println(df.count())
}
Edit: parallelizing the computation (across 10 cores) and returning the RDD of counts:
sc.parallelize(Seq.fill(500)(0), 10).map{ i =>
val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
val dataframe = sqlContext.createDataFrame(data)
myFunction(dataframe)
}
def myFunction(df: DataFrame) = {
df.count()
}
Edit 2: the difference between declaring myFunction with = and without = is that the first is a (usual) function definition, while the other is a procedure definition, used only for methods that return Unit. See explanation. Here is this point illustrated in the Spark shell:
scala> def myf(df:DataFrame) = df.count()
myf: (df: org.apache.spark.sql.DataFrame)Long
scala> def myf2(df:DataFrame) { df.count() }
myf2: (df: org.apache.spark.sql.DataFrame)Unit