Stream BigQuery table into Google Pub/Sub - google-bigquery

I have a Google bigQuery Table and I want to stream the entire table into pub-sub Topic
what should be the easy/fast way to do it?
Thank you in advance,

2019 update:
Now it's really easy with a click-to-bigquery option in Pub/Sub:
Find it on: https://console.cloud.google.com/cloudpubsub/topicList
The easiest way I know of is going through Google Cloud Dataflow, which natively knows how to access BigQuery and Pub/Sub.
In theory it should be as easy as the following Python lines:
p = beam.Pipeline(options=pipeline_options)
tablerows = p | 'read' >> beam.io.Read(
beam.io.BigQuerySource('clouddataflow-readonly:samples.weather_stations'))
tablerows | 'write' >> beam.io.Write(
beam.io.PubSubSink('projects/fh-dataflow/topics/bq2pubsub-topic'))
This combination of Python/Dataflow/BigQuery/PubSub doesn't work today (Python Dataflow is in beta, but keep an eye on the changelog).
We can do the same with Java, and it works well - I just tested it. It runs either locally, and also in the hosted Dataflow runner:
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<TableRow> weatherData = p.apply(
BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));
weatherData.apply(ParDo.named("tableRow2string").of(new DoFn<TableRow, String>() {
#Override
public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
c.output(c.element().toString());
}
})).apply(PubsubIO.Write.named("WriteToPubsub").topic("projects/myproject/topics/bq2pubsub-topic"));
p.run();
Test if the messages are there with:
gcloud --project myproject beta pubsub subscriptions pull --auto-ack sub1
Hosted Dataflow screenshot:

That really depends on the size of the table.
If it's a small table (a few thousand records, a couple doze columns) then you could setup a process to query the entire table, convert the response into a JSON array, and push to pub-sub.
If it's a big table (millions/billions of records, hundreds of columns) you'd have to export to file, and then prepare/ship to pub-sub
It also depends on your partitioning policy - if your tables are set up to partition by date you might be able to, again, query instead of export.
Last but not least, it also depends on the frequency - is this a one time deal (then export) or a continuous process (then use table decorators to query only the latest data)?
Need some more information if you want a truly helpful answer.
Edit
Based on your comments for the size of the table, I think the best way would be to have a script that would:
Export the table to GCS as newline delimited JSON
Process the file (read line by line) and send to pub-sub
There are client libraries for most programming languages. I've done similar things with Python, and it's fairly straight forward.

Related

Dataflow read from a topic PubSub and write to Bigquery (multiples tables)

there is someone who has used DynamicDestination in Dataflow who has a simple and described example. I got bored of seeing the example teleport in git (https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/DLPTextToBigQueryStreaming.java) , it hurts me to be a novice in apache Beam. By the way, what I need to do is read a message from Pubsub and through a Dataflow job write to different destinations (tables) in BigQuery dataset. I have a custom project that works perfect for a Bigquery table but the Pubsub topic will contain multiple destinations from the same dataset. Also, the message is in JSON format and contains a field with the name of the destination table.
This is my most representative code
TopicToBigQueryOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(TopicToBigQueryOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(Constants.READ_PUBSUB, PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
.apply(Constants.LINE_TO_CHAMP, new PubSubToTableRowTransform())
.apply(Constants.WRITE_CHAMPBAN, BigQueryIO.writeTableRows()
.to(options.getTableStagingFileLines())
.withSchema(AmplaChangeLogSchema.getTableSchema())
.withCreateDisposition(CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
Any suggest?
Best regards
As I mentioned in the comment, the original author (#Ryan McDowell) is explaining pretty the same user scenario, consuming the JSON payloads from within GCP Pub/Sub messaging queue, performing dynamic routing to Bigquery tables, extracting certain table name throughout a specific attribute from Pub/Sub message.
In the pipeline from the example, we see getTableDestination() method, inherited from DynamicDestinations class, that is used to extract particular attribute(tableNameAttr) from within the message which contains Bigquery table name, finally identifying destination object TableDestination().

gcp NL api sentiment analysis - How to store the results in BigQuery

I am using gcp bigquery to store news streaming by google function and save it in bigquery.
How can I run a python script that is using the data from bigquery and finally writes back the result for score and magnitude to the related dataset?
I could not find anything in the google documentation about it, just how to run the sentiment analysis but not, how to get data from bigquery in and results out back to bigquery.
Thanks a lot for your support.
You didn't give us enough specifics for a specific answer, so let me give you my general way of trying this:
First, let's get the sentiment analysis of one arbitrary sentence with the gcloud CLI:
gcloud --format json ml language analyze-entity-sentiment --content "It's time we just let this thing go - it was a prett
y good bad idea, wasn't it though? -- Bad Idea, Sara Bareilles" | jq -c . > sentiments.json
Please notice that I removed the formatting of the output JSON with jq and stored the results in one file.
To load this file with maybe multiple JSON lines for each sentence into BigQuery:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON temp.sentiments sentiments.json
The question asks for "stream into BigQuery", but it might make more sense to batch the load like shown here.
Now we have a table with the results in BigQuery:
SELECT * FROM `fh-bigquery.temp.sentiments` LIMIT 1000
Btw, I added Sara Bareilles to the sentence to make sure that BigQuery got a full schema for auto-detection when creating the table the first time.
If you want to stream data into BigQuery, then look at the streaming into BigQuery docs. I wanted to isolate in this answer the basics of getting and looking at Cloud NLP data into BigQuery - the rest is just the basics of working with it.

Python Apache Beam: BigQuery streaming deduplication by row_id

According to BigQuery docs, you can ensure data consistency providing an insertId (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency). If it's not provided, BQ will try to ensure consistency based on internals Ids and best-effort.
Using the BQ API you can do that with the row_ids param (https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.insert_rows_json.html#google.cloud.bigquery.client.Client.insert_rows_json) but I can't find the same for the Apache Beam Python SDK.
Looking into the SDK I have noticed that a 'unique_row_id' property exist, but I really don't know how to pass my param to WriteToBigQuery()
How can I write into BQ (streaming) providing a row Id for deduplication?
Update:
If you use WriteToBigQuery then it will automatically create and
insert a unique row id called insertId for you, which will be inserted to bigquery. It's handled for you, you don't need to worry about it. :)
WriteToBigQuery is a PTransform, and in it's expand method calls BigQueryWriteFn
BigQueryWriteFn is a DoFn, and in it's process method calls _flush_batch
_flush_batch is a method that then calls the BigQueryWrapper.insert_rows method
BigQueryWrspper.insert_rows creates a list of bigquery.TableDataInsertAllRequest.RowsValueListEntry objects which contain the insertId and the row data as a json object
The insertId is generated by calling the unique_row_id method which returns a value consisting of UUID4 concatenated with _ and with an auto-incremented number.
In the current 2.7.0 code, there is this happy comment; I've also verified it is true :)
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1182
# Prepare rows for insertion. Of special note is the row ID that we add to
# each row in order to help BigQuery avoid inserting a row multiple times.
# BigQuery will do a best-effort if unique IDs are provided. This situation
# can happen during retries on failures.
* Don't use BigQuerySink
At least, not in it's current form as it doesn't support streaming. I guess that might change.
Original (non)answer
Great question, I also looked and couldn't find a certain answer.
Apache Beam doesn't appear to use that google.cloud.bigquery client sdk you've linked to, it has some internal generated api client, but it appears to be up-to-date.
I looked at the source:
The insertall method is there https://github.com/apache/beam/blob/18d2168ee71a1b1b04976717f0f955199bb00961/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py#L476
I also found the insertid mentioned
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_messages.py#L1707
So if you can make an InsertAll call it will use a TableDataInsertAllRequest and pass a RowsValueListEntry
class TableDataInsertAllRequest(_messages.Message):
"""A TableDataInsertAllRequest object.
Messages:
RowsValueListEntry: A RowsValueListEntry object.
The RowsValueListEntry message is where the insertid is.
Here's the API docs for insert all
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
I will look some more at this because I don't see the WriteToBigQuery() exposing this.
I suspect that the 'bigquery will remember this for at least one minute` is a pretty loose guarantee for de-duping. The docs suggest using datastore if you need transactions. Otherwise you might need to run SQL with window functions to de-dupe at runtime, or run some other de-duping jobs on bigquery.
Perhaps using batch_size parameter of WriteToBigQuery(), and running a combine (or at worst a GroupByKey) step in dataflow is a more stable way to de-dupe prior to writing.

Writing to a BigQuery table with date in table name from a DataFlow streaming pipeline

My table name format: tableName_YYYYMMDD. I am trying to write to this table from a streaming dataflow pipeline. The reason I want to write to a new table everyday is because I want to expire tables after 30 days and only want to keep a window of 30 tables at a time.
Current code:
tableRow.apply(BigQueryIO.Write
.named("WriteBQTable")
.to(String.format("%1$s:%2$s.%3$s",projectId, bqDataSet, bqTable))
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
I do realize above code will not roll over to new day and start writing there.
As this answer suggests I can partition table and expire partitions, but writing to a partitioned tables seems like is not supported from a streaming pipeline.
Any ideas how can I work around this?
In the Dataflow 2.0 SDK there is a way to specify DynamicDestinations
See to(DynamicDestinations<T,?> dynamicDestinations) in BigQuery Dynamic Destionations.
Also, see the TableDestination version, which should be simpler and less code. Though unfortunately there is no example in the javadoc.
to(SerializableFunction<ValueInSingleWindow<T>,TableDestination> tableFunction)
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/
This is an open source pipeline you can use to connect pub/sub to big query. I think google has also added support for streaming pipelines to date partitioned tables. Details here.

Validating rows before inserting into BigQuery from Dataflow

According to
How do we set maximum_bad_records when loading a Bigquery table from dataflow? there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.
If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?
There must be an easier way of doing this than iterating over the fields in the schema, looking at their type and looking at the class of the value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.
Update:
My use case is an ETL job that at first will run on JSON (one object per line) logs on Cloud Storage and write to BigQuery in batch, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery and also contains parts that aren't even possible to describe in a schema (basically free form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.
In theory it's not a very difficult process, it takes an object, extracts a few properties (50-100), formats some of them and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
My problem is that data is messy. With a couple of hundred million objects there are some that don't look as expected, it's rare, but with these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice-versa. Sometimes there's an array or an object where there should be a string.
Ideally I would like to take my TableRow and pass it by TableSchema and ask "does this work?".
Since this isn't possible what I do instead is I look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING I run value.toString() before adding it to the TableRow. If it's an INTEGER I check that it's a Integer, Long or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
In my experience, which is very limited, the whole work pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere that can see a more detailed error message that could tell me which property it was and what the value was? Without that information it could just as well have said "bad data".
From what I can tell, at least when running in batch mode Dataflow will write the TableRow objects to the staging area in Cloud Storage and then start a load once everything is there. This means that there is nowhere for me to catch any errors, my code is no longer running when BigQuery is loaded. I haven't run any job in streaming mode yet, but I'm not sure how it would be different there, from my (admittedly limited) understanding the basic principle is the same, it's just the batch size that's smaller.
People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?
I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
I'd recommend an iterative approach to developing your schema validation, with the following two steps.
Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- require that JSON parsed input as Integer directly, for instance, when a BigQuery INTEGER schema was found -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling.
Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.
Note that the TableSchema object is not Serializable. We've worked around by converting the TableSchema in a DoFn or PTransform constructor to a JSON String and back. See the code in BigQueryIO.java that uses the jsonTableSchema variable.
Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner runs the pipeline on your computer, rather than on Google Cloud Dataflow service, and it uses the BigQuery streaming writes. I believe when those writes fail you will get better error messages.
(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)
Finally, in terms of logging information:
Definitely check Cloud Logging (by following the Worker Logs link on the logs panel.
You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.