I am trying to read the results of a query from BigQuery with Apache Beam / Dataflow in Kotlin and add a column containing the current date as a timestamp. I don't want to do it inside the query itself because I want to reuse this code for a large number of queries, and it seems like a better design.
This is the pipeline code I wrote:
val pipeline = Pipeline.create(options)
    .apply("Retrieve query", BigQueryIO.readTableRows().fromQuery(query).usingStandardSql())
    .apply("Add date", ParDo.of(AddDate()))
    .apply("Store data", BigQueryIO.writeTableRows()
        .withSchema(tableSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .to(TableReference().setProjectId(gcpProject).setDatasetId(datasetId).setTableId(tableId)))
For some reason it does not advance past the "Add date" transformation.
This is the code which is most likely to have the bug / error:
class AddDate : DoFn<TableRow, TableRow>() {
    @ProcessElement
    fun processElement(context: ProcessContext) {
        val tableRow = context.element() as TableRow
        tableRow.set("process_date", Instant.now())
        context.output(tableRow)
    }
}
I also tried this code inside processElement instead, but it still does not work.
context.outputWithTimestamp(context.element(), Instant.now())
The error is the following:
Input values must not be mutated in any way.
The problem is solved by using a new object and being careful with the type used for the date (for DATE or TIMESTAMP columns):
@ProcessElement
fun processElement(context: ProcessContext) {
    val tableRow = TableRow()
    tableRow.set("process_date", Instant.now().toString())
    val input = context.element() as TableRow
    input.keys.forEach { tableRow.set(it, input[it]) }
    context.output(tableRow)
}
Related
I've refactored some legacy code within a Spring Boot (2.1.2) system and migrated from java.util.Date to java.time-based classes (JSR-310). The system expects dates as ISO 8601 formatted strings; some are complete timestamps with time information (e.g. "2019-01-29T15:29:34+01:00"), while others are only dates with an offset (e.g. "2019-01-29+01:00"). Here is the DTO (as a Kotlin data class):
data class Dto(
    // ...
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-dd'T'HH:mm:ssXXX")
    @JsonProperty("processingTimestamp")
    val processingTimestamp: OffsetDateTime,
    // ...
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-ddXXX")
    @JsonProperty("orderDate")
    val orderDate: OffsetDateTime,
    // ...
)
While Jackson perfectly deserializes processingTimestamp, it fails with orderDate:
Caused by: java.time.DateTimeException: Unable to obtain OffsetDateTime from TemporalAccessor: {OffsetSeconds=32400},ISO resolved to 2018-10-23 of type java.time.format.Parsed
at java.time.OffsetDateTime.from(OffsetDateTime.java:370) ~[na:1.8.0_152]
at com.fasterxml.jackson.datatype.jsr310.deser.InstantDeserializer.deserialize(InstantDeserializer.java:207) ~[jackson-datatype-jsr310-2.9.8.jar:2.9.8]
This makes sense to me, since OffsetDateTime cannot find any time information necessary to construct the instant. If I change the field to val orderDate: LocalDate, Jackson deserializes it successfully, but then the offset information is gone (which I need later to convert to an Instant).
Question
My current workaround is to use OffsetDateTime, in combination with a custom deserializer (see below). But I'm wondering, if there is a better solution for this?
Also, I wish there were a more appropriate data type like OffsetDate, but I cannot find one in java.time.
PS
I was asking myself whether "2019-01-29+01:00" is valid ISO 8601. However, since java.time.DateTimeFormatter.ISO_DATE can parse it correctly and I cannot change the format in which the clients send data, I put this question aside.
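A quick check of that claim (my own snippet, assuming the usual java.time imports): ISO_DATE accepts a date with an offset and exposes both parts.

val parsed = DateTimeFormatter.ISO_DATE.parse("2019-01-29+01:00")
println(LocalDate.from(parsed))  // 2019-01-29
println(ZoneOffset.from(parsed)) // +01:00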
Workaround
data class Dto(
    // ...
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-ddXXX")
    @JsonProperty("catchDate")
    @JsonDeserialize(using = OffsetDateDeserializer::class)
    val orderDate: OffsetDateTime,
    // ...
)
class OffsetDateDeserializer(
    private val formatter: DateTimeFormatter = DateTimeFormatter.ISO_DATE
) : JSR310DateTimeDeserializerBase<OffsetDateTime>(OffsetDateTime::class.java, formatter) {

    override fun deserialize(parser: JsonParser, context: DeserializationContext): OffsetDateTime? {
        if (parser.hasToken(JsonToken.VALUE_STRING)) {
            val string = parser.text.trim()
            if (string.isEmpty()) {
                return null
            }
            val parsed: TemporalAccessor = formatter.parse(string)
            val offset = if (parsed.isSupported(ChronoField.OFFSET_SECONDS)) ZoneOffset.from(parsed) else ZoneOffset.UTC
            val localDate = LocalDate.from(parsed)
            return OffsetDateTime.of(localDate.atStartOfDay(), offset)
        }
        throw context.wrongTokenException(parser, _valueClass, parser.currentToken, "date with offset must be contained in string")
    }

    override fun withDateFormat(otherFormatter: DateTimeFormatter?): JsonDeserializer<OffsetDateTime> = OffsetDateDeserializer(formatter)
}
As @JodaStephen explained in the comments, OffsetDate was not included in java.time in order to keep the set of classes minimal. So OffsetDateTime is the best option.
He also suggested using DateTimeFormatterBuilder and parseDefaulting to build a DateTimeFormatter, so that an OffsetDateTime can be created directly from the formatter's parsing result (a TemporalAccessor). AFAIK, I still need a custom deserializer to use that formatter. Here is the code that solved my problem:
class OffsetDateDeserializer : JsonDeserializer<OffsetDateTime>() {

    private val formatter = DateTimeFormatterBuilder()
        .append(DateTimeFormatter.ISO_DATE)
        .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
        .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
        .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
        .parseDefaulting(ChronoField.MILLI_OF_SECOND, 0)
        .parseDefaulting(ChronoField.OFFSET_SECONDS, 0)
        .toFormatter()

    override fun deserialize(parser: JsonParser, context: DeserializationContext): OffsetDateTime? {
        if (parser.hasToken(JsonToken.VALUE_STRING)) {
            val string = parser.text.trim()
            if (string.isEmpty()) {
                return null
            }
            try {
                return OffsetDateTime.from(formatter.parse(string))
            } catch (e: DateTimeException) {
                throw context.wrongTokenException(parser, OffsetDateTime::class.java, parser.currentToken, "error while parsing date: ${e.message}")
            }
        }
        throw context.wrongTokenException(parser, OffsetDateTime::class.java, parser.currentToken, "date with offset must be contained in string")
    }
}
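A quick sanity check of this formatter (my own snippet, run outside the deserializer), showing how the parseDefaulting calls fill in the missing time and, when no offset is present, the offset as well:

val formatter = DateTimeFormatterBuilder()
    .append(DateTimeFormatter.ISO_DATE)
    .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
    .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
    .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
    .parseDefaulting(ChronoField.MILLI_OF_SECOND, 0)
    .parseDefaulting(ChronoField.OFFSET_SECONDS, 0)
    .toFormatter()

println(OffsetDateTime.from(formatter.parse("2019-01-29+01:00"))) // 2019-01-29T00:00+01:00
println(OffsetDateTime.from(formatter.parse("2019-01-29")))       // 2019-01-29T00:00Z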
I have a list of table column names and their values, which are determined at run time. Right now I am using the following approach, which requires casting Field to TableField for every single column name. Is there a better way?
override fun updateFields(job: Job, jsonObject: JsonObject, handler: Handler<AsyncResult<Job?>>): JobQService {
    val updateFieldsDsl = dslContext.update(JOB)
    var feildSetDsl: UpdateSetMoreStep<*>? = null
    jsonObject.map.keys.forEach { column ->
        feildSetDsl = if (feildSetDsl == null) {
            updateFieldsDsl.set(JOB.field(column) as TableField<Record, Any>, jsonObject.getValue(column))
        } else {
            feildSetDsl!!.set(JOB.field(column) as TableField<Record, Any>, jsonObject.getValue(column))
        }
    }
    val queryDsl = feildSetDsl!!.where(JOB.ID.eq(job.id))
    jdbcClient.rxUpdateWithParams(queryDsl.sql, JsonArray(queryDsl.bindValues)).subscribeBy(
        onSuccess = { handler.handle(Future.succeededFuture(job)) },
        onError = { handler.handle(Future.failedFuture(it)) }
    )
    return this
}
I'm not sure what you mean by "better", but there is a method UpdateSetStep.set(Map) which seems helpful for what you're trying to do. See the Javadoc:
UpdateSetMoreStep set(Map<?,?> map)
Set a value for a field in the UPDATE statement.
Keys can either be of type String, Name, or Field.
Values can either be of type <T> or Field<T>. jOOQ will attempt to convert values to their corresponding field's type.
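For reference, a minimal sketch (mine, not part of the original answer) of how set(Map) could replace the loop in the question, assuming the same JOB table, jsonObject, and job as above; jOOQ resolves the String keys against the table's fields and converts the values:

val queryDsl = dslContext.update(JOB)
    .set(jsonObject.map)          // Map<String, Any?> of column name -> new value
    .where(JOB.ID.eq(job.id))
// the jdbcClient.rxUpdateWithParams(...) call from the question stays unchanged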
I wanted to take advantage of the new BigQuery functionality of time partitioned tables, but am unsure this is currently possible in the 1.6 version of the Dataflow SDK.
Looking at the BigQuery JSON API, to create a day partitioned table one needs to pass in a
"timePartitioning": { "type": "DAY" }
option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.
I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda...? Is anyone else having success with creating/writing partitioned tables via Dataflow?
This seems like a similar issue to setting the table expiration time which isn't currently available either.
As Pavan says, it is definitely possible to write to partitioned tables with Dataflow. Are you using the DataflowPipelineRunner operating in streaming mode or batch mode?
The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:
/**
 * A Joda-Time formatter that prints a date in a format like {@code "20160101"}.
 * Thread-safe.
 */
private static final DateTimeFormatter FORMATTER =
    DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);

// This code generates a valid BigQuery partition name:
Instant instant = Instant.now(); // any Joda instant in a reasonable time range
String baseTableName = "project:dataset.table"; // a valid BigQuery table name
String partitionName =
    String.format("%s$%s", baseTableName, FORMATTER.print(instant));
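To make that concrete, here is a hedged sketch (mine) of feeding the generated partition name into the write, written in Kotlin to match the question; "rows" stands for the PCollection<TableRow> being written and "schema" for its TableSchema:

rows.apply(
    BigQueryIO.Write
        .to(partitionName) // e.g. "project:dataset.table$20160101"
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
)

For routing each element to its own day partition instead of a single fixed day, the SerializableFunction-based approach in the next answer applies.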
The approach I took (it works in streaming mode, too):
Define a custom window for the incoming record
Convert the window into the table/partition name
p.apply(PubsubIO.Read
        .subscription(subscription)
        .withCoder(TableRowJsonCoder.of())
    )
    .apply(Window.into(new TablePartitionWindowFn()))
    .apply(BigQueryIO.Write
        .to(new DayPartitionFunc(dataset, table))
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    );
The window is set based on the incoming data; the end Instant can be ignored, as only the start value is used for setting the partition:
public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

    private IntervalWindow assignWindow(AssignContext context) {
        TableRow source = (TableRow) context.element();
        String dttm_str = (String) source.get("DTTM");
        DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();
        Instant start_point = Instant.parse(dttm_str, formatter);
        Instant end_point = start_point.withDurationAdded(1000, 1);
        return new IntervalWindow(start_point, end_point);
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }

    @Override
    public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
        return Arrays.asList(assignWindow(c));
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return false;
    }

    @Override
    public IntervalWindow getSideInputWindow(BoundedWindow window) {
        if (window instanceof GlobalWindow) {
            throw new IllegalArgumentException(
                "Attempted to get side input window for GlobalWindow from non-global WindowFn");
        }
        return null;
    }
}
Setting the table partition dynamically:
public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {

    String destination = "";

    public DayPartitionFunc(String dataset, String table) {
        this.destination = dataset + "." + table + "$";
    }

    @Override
    public String apply(BoundedWindow boundedWindow) {
        // The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
        String dayString = DateTimeFormat.forPattern("yyyyMMdd")
            .withZone(DateTimeZone.UTC)
            .print(((IntervalWindow) boundedWindow).start());
        return destination + dayString;
    }
}
Is there a better way of achieving the same outcome?
I believe it should be possible to use the partition decorator when you are not using streaming. We are actively working on supporting partition decorators through streaming. Please let us know if you are seeing any errors today with non-streaming mode.
Apache Beam version 2.0 supports sharding BigQuery output tables out of the box.
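As a hedged illustration of the Beam 2.x way (my own sketch, not from the original answer): BigQueryIO.Write.to can take a SerializableFunction that routes each element to a TableDestination, so a day partition can be picked per element via the $-decorator. The field name "event_ts", the table spec, and the names "rows"/"tableSchema" are assumptions:

rows.apply(
    "Write to day partition",
    BigQueryIO.writeTableRows()
        .to(SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> { elem ->
            // e.g. "2017-06-01T12:34:56Z" -> "20170601"
            val day = (elem.value["event_ts"] as String).substring(0, 10).replace("-", "")
            TableDestination("my-project:my_dataset.my_table\$$day", null)
        })
        .withSchema(tableSchema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
)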
I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic: if data already exists in the partition, I can either append to it or overwrite it.
I have written the code in Python. It is a batch-mode write operation into BigQuery.
client = bigquery.Client(project=projectName)
dataset_ref = client.dataset(datasetName)
table_ref = dataset_ref.table(bqTableName)
job_config = bigquery.LoadJobConfig()
job_config.skip_leading_rows = skipLeadingRows
job_config.source_format = bigquery.SourceFormat.CSV
if tableExists(client, table_ref):
    job_config.autodetect = autoDetect
    previous_rows = client.get_table(table_ref).num_rows
    # assert previous_rows > 0
    if allowJaggedRows is True:
        job_config.allowJaggedRows = True
    if allowFieldAddition is True:
        job_config._properties['load']['schemaUpdateOptions'] = ['ALLOW_FIELD_ADDITION']
    if isPartitioned is True:
        job_config._properties['load']['timePartitioning'] = {"type": "DAY"}
    if schemaList is not None:
        job_config.schema = schemaList
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
else:
    job_config.autodetect = autoDetect
    job_config._properties['createDisposition'] = 'CREATE_IF_NEEDED'
    job_config.schema = schemaList
    if isPartitioned is True:
        job_config._properties['load']['timePartitioning'] = {"type": "DAY"}
    if schemaList is not None:
        table = bigquery.Table(table_ref, schema=schemaList)

load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
assert load_job.job_type == 'load'
load_job.result()
assert load_job.state == 'DONE'
It works fine.
If you pass the table name in table_name_YYYYMMDD format, then BigQuery will treat it as a sharded table, which can simulate partition table features.
Refer to the documentation: https://cloud.google.com/bigquery/docs/partitioned-tables
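A minimal sketch of that idea (mine, with assumed names): the write simply targets a table whose name carries the yyyyMMdd suffix, and BigQuery groups all table_name_* shards together:

val shard = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE) // e.g. "20170601"
rows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.table_name_$shard")
        .withSchema(tableSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
)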
I am trying to process logs via Spark Streaming and Spark SQL. The main idea is to have a "compacted" dataset in Parquet format for "old" data, converted to a DataFrame as needed for queries; the compacted dataset is loaded with:
SQLContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame compact = null;
compact = sqlContext.parquetFile("hdfs://auto-ha/tmp/data/logs");
As the uncompacted dataset (I compact the dataset daily) is composed of many files, I would like to have the current day's data within a DStream in order to make those queries fast.
I have tried the DataFrame approach, without results:
DataFrame df = JavaSQLContextSingleton.getInstance(sc.sc()).createDataFrame(lastData, schema);
df.registerTempTable("lastData");

JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
    @Override
    public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
        DataFrame df = JavaSQLContextSingleton.getInstance(v1.context()).createDataFrame(v1, schema);
        // ... drop old data from the lastData table
        df.insertInto("lastData");
    }
});
Using this approach I do not get any results if I query the temp table from a different thread, for example.
I have also tried to use the RDD transform method; more specifically, I tried to follow the Spark example where I create an empty RDD and then union the DStream RDD contents with it:
JavaRDD<Row> lastData = sc.emptyRDD();

JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
    @Override
    public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
        lastData.union(v1).filter(/* keep only recent data */);
    }
});
This approach does not work either, as I do not get any contents in lastData.
Could I use windowed computations or updateStateByKey for this purpose?
Any suggestions?
Thanks for your help!
Well, I finally got it.
I use an updateStateByKey function and drop the state if the timestamp is older than 24 hours, like this:
final static Function2<List<Long>, Optional<Long>, Optional<Long>> RETAIN_RECENT_DATA
    = (List<Long> values, Optional<Long> state) -> {
        Long newSum = state.or(0L);
        for (Long value : values) {
            newSum += value;
        }
        // currentTimeMillis uses UTC
        if (System.currentTimeMillis() - newSum > 86400000L) {
            return Optional.absent();
        } else {
            return Optional.of(newSum);
        }
    };
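A hedged wiring sketch (mine, not from the original answer): "alarms" is an assumed JavaPairDStream<String, Long> of (alarmId, eventTimeMillis), "ssc" an assumed streaming context, and the checkpoint path is illustrative; updateStateByKey requires checkpointing to be enabled. The later map to Row objects that produces the "finalsum" DStream below is not shown.

ssc.checkpoint("hdfs://auto-ha/tmp/checkpoints")
// keys whose accumulated state is older than 24 hours are dropped by RETAIN_RECENT_DATA
val recentState = alarms.updateStateByKey(RETAIN_RECENT_DATA)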
Then on each batch I register the DataFrame as a temp table:
finalsum.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
    if (!rdd.isEmpty()) {
        HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
        if (sqlContext1.cacheManager().isCached("alarm_recent")) {
            sqlContext1.uncacheTable("alarm_recent");
        }
        DataFrame wordsDataFrame = sqlContext1.createDataFrame(rdd, schema);
        wordsDataFrame.registerTempTable("alarm_recent");
        wordsDataFrame.cache();
        wordsDataFrame.first();
    }
    return null;
});
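(My own illustrative snippet.) Any other thread that shares the same singleton context can then query the table that was just refreshed:

val recentAlarms = JavaSQLContextSingleton.getInstance(sc.sc()).sql("SELECT COUNT(*) FROM alarm_recent")
recentAlarms.show()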
You could use mapWithState with Spark 1.6.
The mapWithState function is much more efficient and easier to implement.
Take a look at this link.
mapWithState supports useful functionality like state timeouts and initialRDD, which come in handy while maintaining a stateful DStream.
Thanks
Manas
I have been looking into Scala, primarily at how to build a DSL similar to C# LINQ/SQL. Having worked with the C# LINQ query provider, it was easy to introduce our own custom query provider which translated LINQ queries into our own proprietary data store scripts. I am looking for something similar in Scala, e.g.:
val query = select Min(Close), Max(Close)
from StockPrices
where open > 0
First of all, is this even possible to achieve in Scala using an internal DSL?
Any thoughts/ideas in this regard are highly appreciated.
I am still new to the Scala space, but I have started looking into Scala metaprogramming and Slick. My complaint with Slick is that I want to align my DSL closely with the SQL query, similar to the above syntax.
There is no way to have an internal DSL (with the current release) that looks exactly like the example you provided.
Using a macro I still had from this answer, the closest I could get (relatively fast) was:
select(Min(StockPrices.Open), Max(StockPrices.Open))
.from(StockPrices)
A real solution would take quite some time to create. If you are willing to do that, you could get quite far using macros (not a simple topic).
If you really want the exact same syntax, I recommend something like Xtext, which allows you to create a DSL with an Eclipse-based editor for 'free'.
The code required for the above example (I did not include the mentioned macro):
trait SqlElement {
  def toString(): String
}

trait SqlMethod extends SqlElement {
  protected val methodName: String
  protected val arguments: Seq[String]

  override def toString() = {
    val argumentsString = arguments mkString ","
    s"$methodName($argumentsString)"
  }
}

case class Select(elements: Seq[SqlElement]) extends SqlElement {
  override def toString() = s"SELECT ${elements mkString ", "}"
}

case class From(table: Metadata) extends SqlElement {
  private val tableName = table.name
  override def toString() = s"FROM $tableName"
}

case class Min(element: Metadata) extends SqlMethod {
  val methodName = "Min"
  val arguments = Seq(element.name)
}

case class Max(element: Metadata) extends SqlMethod {
  val methodName = "Max"
  val arguments = Seq(element.name)
}

class QueryBuilder(elements: Seq[SqlElement]) {
  def this(element: SqlElement) = this(Seq(element))

  def from(o: Metadata) = new QueryBuilder(elements :+ From(o))
  def where(element: SqlElement) = new QueryBuilder(elements :+ element)

  override def toString() = elements mkString "\n"
}

def select(args: SqlElement*) = new QueryBuilder(Select(args))

trait Column
object Column extends Column

object tables {
  object StockPrices$ {
    val Open: Column = Column
    val Close: Column = Column
  }
  val StockPrices = StockPrices$
}
And then to use it:
import tables._
import StockPrices._
select(Min(StockPrices.Open), Max(StockPrices.Open))
.from(StockPrices)
.toString
That is an admirable project, but one that has been embarked upon and which is available in general release.
I'm talking about Slick, of course.
If Scala / Java interoperability is not too much of an issue for you, and if you're willing to use an internal DSL with a couple of syntax quirks compared to the syntax you have suggested, then jOOQ is growing to be a popular alternative to Slick. An example from the jOOQ manual:
for (r <- e
       select (
         T_BOOK.ID * T_BOOK.AUTHOR_ID,
         T_BOOK.ID + T_BOOK.AUTHOR_ID * 3 + 4,
         T_BOOK.TITLE || " abc" || " xy"
       )
       from T_BOOK
       leftOuterJoin (
         select (x.ID, x.YEAR_OF_BIRTH)
         from x
         limit 1
         asTable x.getName()
       )
       on T_BOOK.AUTHOR_ID === x.ID
       where (T_BOOK.ID <> 2)
       or (T_BOOK.TITLE in ("O Alquimista", "Brida"))
       fetch
    ) {
  println(r)
}