Airflow Operator BigQueryTablePartitionExistenceSensor Question - google-bigquery

I'm trying to use the BigQueryTablePartitionExistenceSensor operator in Airflow, and I was wondering whether it checks that the partition is fully loaded or whether it can be marked as successful even if the data isn't complete yet.
For example, if my table is partitioned by DAY and the load for 20220420 has started but isn't complete, would this sensor trigger? Or would it wait until that load step has completed before being marked as successful?
Thanks

The operator will not wait until your data has loaded. It only checks whether the partition exists at that moment in time, so if a single row gets inserted into that partition, this sensor would return True. See the sensor code that gets called by this operator.
An idea I've used in the past for similar problems is to put a sentinel label on the partitioned table to mark a load as "in-progress" or "done".
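A minimal sketch of the sentinel-label check, assuming a naming scheme that is not in the original answer: one label per partition, keyed `load_<partition_id>`, with the value "in-progress" or "done". The function name, the prefix and the values are all hypothetical.

```python
def partition_load_done(labels, partition_id, prefix="load_"):
    """Return True only if the table carries a 'done' sentinel label
    for the given partition.

    `labels` is a plain dict of table labels. The 'load_<partition_id>'
    key scheme and the 'done' value are assumptions for illustration.
    """
    return labels.get(f"{prefix}{partition_id}") == "done"
```

The labels dict could come from the table metadata (for example the `labels` attribute of a table fetched with the google-cloud-bigquery client), and the check could run inside a Python-based sensor instead of the partition existence sensor. The load job would have to set the label itself when it starts and finishes.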

As has already been answered, it does not await anything except the existence of the partition.
If your data is streamed into partitions, and you have ordered delivery, you can probably add a sensor for the next-day partition — on the assumption that the previous day is complete when events have started streaming into the next.
If the load is managed by the same Airflow instance, I'd suggest using an ExternalTaskSensor on the load job. If not, you might be able to use the more generic SqlSensor and run a custom SQL query on metadata tables to determine whether a partition is complete. Perhaps the load job can add a label or something similar that you then query for.

Related

How to find the jobId of the first run of a cached query in BigQuery?

When we run a query in BigQuery, the results are cached in a temporary table. For the next 24 hours (with some exceptions), subsequent runs of the same query fetch their results from that cache. My use case: on a subsequent run, I want to know which jobId the cached results came from, i.e. the job of the first, uncached run of the query.
I have checked all the Java docs related to queries and didn't find that info. There is a cacheHit variable, which tells you whether the query was served from the cache. I want to go one step further and know which jobId the results were fetched from. I expected that this method might give me the info, but I always get a null value from it. I also want to know what parentJob means in the BigQuery context.
It's unclear why you'd even care about this other than as a technical exercise. If you want to build your own application caching layer that's a different concern. More details about query caching can be found on https://cloud.google.com/bigquery/docs/cached-results.
The easiest way to do this would probably be to traverse jobs.list until you find a job that has the same destination table (it'll have an anon prefix) and whose cacheHit stat is false or not present.
Your inquiry about parentJob is unrelated to this exercise. It's for finding all the child jobs created as part of a script or multi-statement execution. More information about this can be found on https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting-concepts.
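A rough sketch of the jobs.list traversal, written against plain dicts shaped like the REST Job resource (`jobReference.jobId`, `configuration.query.destinationTable`, `statistics.query.cacheHit`, `statistics.creationTime`). It assumes the jobs were listed with full projection so those fields are populated. The function names are hypothetical, and picking the most recent uncached job is my assumption about which run filled the cache.

```python
def _creation_time(job):
    # creationTime is a string of epoch milliseconds in the REST resource.
    return int(job.get("statistics", {}).get("creationTime", 0))


def find_cache_origin(jobs, destination_table):
    """Return the jobId of the most recent job that wrote to
    `destination_table` without a cache hit, or None if there is none.

    `jobs` is an iterable of dicts shaped like REST Job resources.
    `destination_table` is a dict with projectId, datasetId and tableId.
    """
    candidates = []
    for job in jobs:
        query_cfg = job.get("configuration", {}).get("query", {})
        if query_cfg.get("destinationTable") != destination_table:
            continue  # different query result table
        if job.get("statistics", {}).get("query", {}).get("cacheHit"):
            continue  # this run was itself served from the cache
        candidates.append(job)
    if not candidates:
        return None
    origin = max(candidates, key=_creation_time)
    return origin["jobReference"]["jobId"]
```

The destination table to search for can be read from the cached job itself, since a cache hit reports the same anonymous table as the run that produced it.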

Python Apache Beam: BigQuery streaming deduplication by row_id

According to the BigQuery docs, you can ensure data consistency by providing an insertId (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency). If it's not provided, BQ will try to ensure consistency based on internal IDs, on a best-effort basis.
Using the BQ API you can do that with the row_ids param (https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.insert_rows_json.html#google.cloud.bigquery.client.Client.insert_rows_json) but I can't find the same for the Apache Beam Python SDK.
Looking into the SDK, I noticed that a 'unique_row_id' property exists, but I really don't know how to pass my param to WriteToBigQuery().
How can I write into BQ (streaming) providing a row Id for deduplication?
Update:
If you use WriteToBigQuery, it will automatically create a unique row id called insertId for you and send it to BigQuery along with each row. It's handled for you, so you don't need to worry about it. :)
WriteToBigQuery is a PTransform, and its expand method calls BigQueryWriteFn
BigQueryWriteFn is a DoFn, and its process method calls _flush_batch
_flush_batch is a method that then calls the BigQueryWrapper.insert_rows method
BigQueryWrapper.insert_rows creates a list of bigquery.TableDataInsertAllRequest.RowsValueListEntry objects which contain the insertId and the row data as a json object
The insertId is generated by calling the unique_row_id method which returns a value consisting of UUID4 concatenated with _ and with an auto-incremented number.
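To make the ID format concrete, here is a small sketch of a generator that produces IDs of the shape described above (a UUID4, an underscore, then an incrementing number). It is not Beam's code: the class name is made up, and starting the counter at 1 is an assumption.

```python
import itertools
import uuid


class RowIdGenerator:
    """Illustrative generator of '<uuid4>_<n>' row IDs (not Beam's code)."""

    def __init__(self):
        # One random prefix per generator instance.
        self._prefix = str(uuid.uuid4())
        # Start value of 1 is an assumption for illustration.
        self._counter = itertools.count(1)

    def next_id(self):
        return f"{self._prefix}_{next(self._counter)}"
```

Because the prefix is random per generator, these IDs only help BigQuery de-duplicate retries of the same request. They say nothing about whether two rows carry the same business data.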
In the current 2.7.0 code, there is this happy comment; I've also verified it is true :)
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1182
# Prepare rows for insertion. Of special note is the row ID that we add to
# each row in order to help BigQuery avoid inserting a row multiple times.
# BigQuery will do a best-effort if unique IDs are provided. This situation
# can happen during retries on failures.
* Don't use BigQuerySink
At least, not in its current form, as it doesn't support streaming. I guess that might change.
Original (non)answer
Great question, I also looked and couldn't find a certain answer.
Apache Beam doesn't appear to use the google.cloud.bigquery client SDK you've linked to. It has its own internal generated API client, but that client appears to be up-to-date.
I looked at the source:
The insertall method is there https://github.com/apache/beam/blob/18d2168ee71a1b1b04976717f0f955199bb00961/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py#L476
I also found the insertid mentioned
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_messages.py#L1707
So if you can make an InsertAll call it will use a TableDataInsertAllRequest and pass a RowsValueListEntry
class TableDataInsertAllRequest(_messages.Message):
"""A TableDataInsertAllRequest object.
Messages:
RowsValueListEntry: A RowsValueListEntry object.
The RowsValueListEntry message is where the insertid is.
Here's the API docs for insert all
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
I will look some more at this because I don't see the WriteToBigQuery() exposing this.
I suspect that 'bigquery will remember this for at least one minute' is a pretty loose guarantee for de-duping. The docs suggest using Datastore if you need transactions. Otherwise you might need to run SQL with window functions to de-dupe at query time, or run some other de-duping jobs on BigQuery.
Perhaps using batch_size parameter of WriteToBigQuery(), and running a combine (or at worst a GroupByKey) step in dataflow is a more stable way to de-dupe prior to writing.
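The de-dupe-before-writing idea boils down to keeping one row per key. Below is that logic in plain Python, not Beam code; in a pipeline the same thing would be done per key by the combine or GroupByKey step. The function name and `key_field` are hypothetical, and keeping the first row seen is an arbitrary choice.

```python
def dedupe_rows(rows, key_field):
    """Keep the first row seen for each value of `key_field`.

    `rows` is an iterable of dicts. Order of first appearance is preserved.
    """
    seen = {}
    for row in rows:
        # setdefault only stores the row if the key is not present yet.
        seen.setdefault(row[key_field], row)
    return list(seen.values())
```

This only removes duplicates within the collection it is handed, so in a streaming pipeline it would cover one window or batch at a time, not the whole table.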

How can I trigger an email or other notification based on a BigQuery query?

I would like to receive a notification, ideally via email, when some threshold is met in Google BigQuery. For example, if the query is:
SELECT name, count(id) FROM terrible_things
WHERE date(terrible_thing) < -1d
Then I would want to get an alert when there were greater than 0 results, and I would want that alert to contain the name of each object and how many there were.
BigQuery does not provide the kinds of services you'd need to build this without involving other technologies. However, you should be able to use something like App Engine (which does have a task scheduling mechanism) to periodically issue your monitoring query, check the results of the job, and alert if there are any rows in the results. Alternatively, you could do this locally with some scripting and the BQ command line tool.
You could also refine things by using BQ's table decorators to only scan the data that's arrived since you last ran your monitoring query, if you retain knowledge of the last probe's execution in the calling system.
In short: Something else needs to issue the queries and react based on the outcome, but BQ can certainly evaluate the data.
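As a sketch of the "something else" that reacts to the result, here is a small probe in plain Python. The query runner and the notifier are passed in as callables, so nothing here depends on a particular BigQuery client or mail library. All names are hypothetical, and rows are assumed to be (name, count) pairs as in the question.

```python
def format_alert(rows):
    """Build an alert body from (name, count) rows, or None if no rows."""
    if not rows:
        return None
    lines = [f"{name}: {count}" for name, count in rows]
    return "Threshold exceeded for {} object(s):\n{}".format(
        len(rows), "\n".join(lines)
    )


def probe(run_query, send_alert):
    """Run the monitoring query once and alert if it returns any rows.

    `run_query` returns an iterable of (name, count) rows.
    `send_alert` takes the message body (e.g. sends an email).
    Returns True if an alert was sent.
    """
    body = format_alert(list(run_query()))
    if body is None:
        return False
    send_alert(body)
    return True
```

A scheduler (cron, App Engine tasks, or similar) would call `probe` periodically, with `run_query` wrapping whatever issues the BigQuery query.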

Lambda Architecture Modelling Issue

I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short:
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from 0 to 10.
An event_id of 0 indicates START & an event_id of 10 indicates END
All the events between START & END should be grouped into one single group (event_group).
This will produce tuples of event_groups, e.g. (0,2,2,2,5,10), (0,4,2,7,...5,10), (0,10)
An event_group might be short, e.g. 10 minutes, or very long, say 3 hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the Streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event_id=0 arrives.
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups
(0,2,2,3) previous computation (file)
(4,3,) previous computation (file)
(5,6,7,8,10) current computation (file)
so that I need to merge them based on device_id into (0,2,2,3,4,3,5,6,7,8,10) (multiple files)
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase or HDFS itself? Won't this affect the overall latency?
As far as I understand your process, I don't see any issue, as the principle of the Lambda Architecture is to regularly re-process all your data in batch mode.
(by the way, not all your data, but a time frame, usually larger than the speed layer window)
If you choose a large enough time window for your batch mode (let's say your aggregation window + 3 hours, in order to include even the longest event groups), your map reduce program will be able to compute all your event groups for the desired aggregation window, whichever files the distinct events are stored in (Hadoop shuffle magic!)
The underlying files are not part of the problem, but the time windows used to select data to process are.
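To illustrate why the file boundaries don't matter, here is a plain-Python sketch of what the batch job does logically: take every event in the selected time window, whichever file it came from, group by device_id, order by time, and cut groups at START (0) and END (10). The (device_id, timestamp, event_id) tuple layout is an assumption, since the question doesn't say what an event record looks like.

```python
from collections import defaultdict

START, END = 0, 10


def build_event_groups(events):
    """Reassemble complete event groups from events read from any files.

    `events` is an iterable of (device_id, timestamp, event_id) tuples
    (an assumed layout). Returns a list of (device_id, group) pairs where
    each group is a tuple of event_ids from START to END.
    Events seen before a START, and groups that never reach END inside
    the window, are left out.
    """
    per_device = defaultdict(list)
    for device_id, ts, event_id in events:
        per_device[device_id].append((ts, event_id))

    complete = []
    for device_id, items in per_device.items():
        current = None
        for _, event_id in sorted(items):  # order by timestamp
            if event_id == START:
                current = [event_id]  # a new START opens a fresh group
            elif current is not None:
                current.append(event_id)
                if event_id == END:
                    complete.append((device_id, tuple(current)))
                    current = None
    return complete
```

With the three "broken" pieces from the question fed in together, the output is the single merged group. In MapReduce terms the grouping by device_id is the shuffle and the loop over sorted events is the reducer.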

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record the timestamps of successful summary runs somewhere and start the input filter at rows dated later than the last successful summary. You'll save some significant scanning time.
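The pipeline above can be sketched in plain Python to show the bookkeeping, with dicts standing in for the input table and the stored map outputs. This is only an illustration of the idea, not Hadoop or HBase code, and every name in it is hypothetical.

```python
def incremental_map(documents, map_fn, map_outputs):
    """Apply `map_fn` only to documents that have no stored output yet.

    `documents` maps doc_id -> document; `map_outputs` maps doc_id ->
    stored map result and is updated in place. Returns how many
    documents were newly mapped.
    """
    newly_mapped = 0
    for doc_id, doc in documents.items():
        if doc_id not in map_outputs:  # the "!processed" filter
            map_outputs[doc_id] = map_fn(doc)
            newly_mapped += 1
    return newly_mapped


def reduce_all(map_outputs, reduce_fn, initial):
    """Fold `reduce_fn` over every stored map output."""
    result = initial
    for value in map_outputs.values():
        result = reduce_fn(result, value)
    return result
```

This relies on the question's assumptions that entries are immutable and that the map function is deterministic. If a document could change, its stored output would go stale.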
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability