Dynamic partitioning in google cloud dataflow? - google-bigquery

I'm using Dataflow to process files stored in GCS and write to BigQuery tables. Below are my requirements:
input files contain event records, and each record pertains to one eventType;
records need to be partitioned by eventType;
for each eventType, records are written to a corresponding BigQuery table, one table per eventType;
the events in each batch of input files vary.
I'm thinking of applying transforms such as "groupByKey" and "partition", but it seems I have to know the number (and types) of events at development time in order to determine the partitions.
Does anyone have a good idea for doing the partitioning dynamically, meaning the partitions are determined at run time?

Why not load everything into a single "raw" BigQuery table, then use the BigQuery API to determine the distinct event types and export each event type to its own table (e.g., via https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery or an API call)?
If your input format is simple, you can do that without using Dataflow at all, and it will probably be more cost efficient.
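A minimal sketch of the "raw table, then one query per event type" idea, assuming the distinct event types have already been fetched (for example with a SELECT DISTINCT on the raw table). The dataset name, the raw table name and the `events_` prefix are hypothetical; the function only builds the SQL text and does not call BigQuery.

```python
import re

# Only allow event types that are safe to embed in a table name and a string literal.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def build_split_queries(dataset, raw_table, event_types, column="eventType"):
    """Return {event_type: SQL} with one CREATE TABLE ... AS SELECT per event type."""
    queries = {}
    for event_type in sorted(set(event_types)):
        if not _SAFE_NAME.fullmatch(event_type):
            raise ValueError(f"unsafe event type for a table name: {event_type!r}")
        target = f"{dataset}.events_{event_type}"  # hypothetical naming scheme
        queries[event_type] = (
            f"CREATE OR REPLACE TABLE `{target}` AS "
            f"SELECT * FROM `{dataset}.{raw_table}` "
            f"WHERE {column} = '{event_type}'"
        )
    return queries


if __name__ == "__main__":
    for sql in build_split_queries("mydataset", "raw_events", ["click", "view"]).values():
        print(sql)
```

Each statement scans the raw table once, so the cost grows with the number of event types.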

Related

Google Dataflow store to specific Partition using BigQuery Storage Write API

I want to store data to BigQuery by using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning over two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures because it does not recognise the destination as an alias, but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
Direct writes to ingestion-time partitions are not supported by the Write API.
Using the streaming API is also not supported once a window of 31 days has passed.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert the data. They can handle this scenario.
Because this operation involves a lot of I/O (files get created on GCS), it can be lengthy, costly and resource intensive, depending on the data.
One approach is to create table shards, splitting the big table into smaller ones so that the Storage Read and Write APIs can be used. Load jobs from the sharded tables into the partitioned table then require fewer resources, and the problem is already divided.
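To make the window rule concrete, here is a small sketch that builds the partition decorator and picks a write method per target date. The 31 and 16 day limits come from the documentation quoted above; treating the bounds as inclusive is an assumption, and the function names are made up for illustration.

```python
from datetime import date, timedelta

STREAM_PAST_DAYS = 31    # from the quoted documentation
STREAM_FUTURE_DAYS = 16  # from the quoted documentation


def partition_decorator(table, day):
    """Return the table name with a $YYYYMMDD partition decorator appended."""
    return f"{table}${day:%Y%m%d}"


def write_method(day, today):
    """Return 'streaming' inside the allowed window, otherwise 'load_job'.

    `today` should be the current date in UTC. Inclusive bounds are an assumption.
    """
    earliest = today - timedelta(days=STREAM_PAST_DAYS)
    latest = today + timedelta(days=STREAM_FUTURE_DAYS)
    return "streaming" if earliest <= day <= latest else "load_job"


if __name__ == "__main__":
    today = date(2024, 3, 1)
    for day in (date(2024, 2, 20), date(2022, 3, 1)):
        print(partition_decorator("project-id:data-set.table-id", day), write_method(day, today))
```

For a two-year backfill almost every date falls outside the window, which is why load jobs end up being the practical route.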

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Batch jobs that we have running every 10 minutes would then perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared: you lay out several options in your question.
You could go with the JSON table, and to keep costs low
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns (BigQuery allows at most four). You could eventually even use yearly suffixed tables. This way you have several dimensions along which to limit the number of rows scanned for rematerialization.
The other option would be to change your model and add an event-processing middle layer. You could first route all your events to Pub/Sub or Dataflow, process them there, and write them to BigQuery under a new schema. That processing code would be able to create tables on the fly with the schema you define in your engine.
By the way, you can remove columns: that's rematerialization, meaning you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
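As a rough illustration of rematerializing to drop columns, the sketch below builds a query that rewrites a table onto itself without the deprecated top-level columns. The table and column names are hypothetical, and the sketch ignores partitioning and clustering clauses, which a real table would likely need restated in the statement.

```python
def build_rematerialize_query(table, drop_columns):
    """Build SQL that rewrites `table` in place without the given top-level columns."""
    if not drop_columns:
        raise ValueError("nothing to drop")
    excluded = ", ".join(f"`{col}`" for col in drop_columns)
    return (
        f"CREATE OR REPLACE TABLE `{table}` AS "
        f"SELECT * EXCEPT ({excluded}) FROM `{table}`"
    )


if __name__ == "__main__":
    # Hypothetical table and deprecated columns.
    print(build_rematerialize_query("mydataset.events", ["old_form", "legacy_page"]))
```

The rewrite scans the whole table, so it is something to run occasionally rather than on every schema change.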
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow job would be:
read the event/JSON from Pub/Sub
flatten the events and filter down to the columns you want to insert into the BQ table
with Dynamic Destinations, insert the data into the respective tables (if you have events of various types); Dynamic Destinations also lets you specify the schema on the fly based on the fields in your JSON
get the failed insert records from the Dynamic Destinations write and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues)
read the file, update the schema once, and load the file into that BQ table
I have implemented this logic in my use case and it is working perfectly fine.
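The flatten, schema-on-the-fly and routing steps above could look roughly like this in plain Python, before being wired into a Beam pipeline. This is a sketch under assumptions: the `event_type` field, the dataset name and the type mapping are invented for illustration, and repeated (list) fields are not handled.

```python
def flatten(event, parent="", sep="_"):
    """Flatten nested dicts: {'user': {'id': 7}} becomes {'user_id': 7}."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# bool must come before int because bool is a subclass of int in Python.
_BQ_TYPES = [(bool, "BOOLEAN"), (int, "INTEGER"), (float, "FLOAT"), (str, "STRING")]


def infer_schema(flat):
    """Derive a list of BigQuery-style field definitions from a flat record."""
    fields = []
    for name, value in flat.items():
        # Anything that is not a known scalar falls back to STRING in this sketch.
        bq_type = next((t for py, t in _BQ_TYPES if isinstance(value, py)), "STRING")
        fields.append({"name": name, "type": bq_type, "mode": "NULLABLE"})
    return fields


def destination_table(event, dataset="my_dataset"):
    """Pick the target table from the (hypothetical) event_type field."""
    return f"{dataset}.{event['event_type']}"


if __name__ == "__main__":
    event = {"event_type": "signup", "user": {"id": 7, "premium": True}, "score": 1.5}
    flat = flatten(event)
    print(destination_table(flat), infer_schema(flat))
```

In a pipeline, the routing function is what would be handed to the dynamic destination, and the inferred schema is what would be compared against the existing table when an insert fails.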

Can BigQuery table extracted rows be randomized

I am currently extracting a BigQuery table into sharded .csv's in Google Cloud Storage -- is there any way to shuffle/randomize the rows for the extract? The GCS .csv's will be used as training data for a GCMLE model, and the current exports are in a non-random order as they are bunched up by similar "labels".
This causes issues when training a GCMLE model, as you must hand the model random examples of all labels within each batch. GCMLE/TF can randomize the order of rows WITHIN individual .csv's, but there is not (to my knowledge) any way to randomize rows across multiple .csv's. So I am looking for a way to ensure that the rows being output to the .csv's are indeed random.
Can BigQuery table extracted rows be randomized?
No. Extract Job API (thus any client built on top of it) has nothing that would allow you to do so.
I am looking for a way to ensure that the rows being output to the .csv are indeed random.
You should first create tables corresponding to your csv files and then extract them one by one into separate csv's. That way you control what goes into which csv.
If your concern is the cost of processing (you will need to scan the table as many times as the number of csv files you need), you can check the partitioning approaches in "Migrating from non-partitioned to Partitioned tables". This still involves cost, but a substantially reduced one.
Finally, a zero-cost option is to use the Tabledata.list API with paging while distributing the response across your csv files. You can do this easily in the client of your choice.
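The "distribute the response across your csv files" step might be sketched as below: each row coming back from the paged listing is assigned to a random shard, and every shard is shuffled at the end. The function name and the `seed` parameter are illustrative, and writing the shards out as .csv files is left out.

```python
import random


def distribute_rows(rows, n_shards, seed=None):
    """Assign each row to a random shard, then shuffle within each shard."""
    rng = random.Random(seed)
    shards = [[] for _ in range(n_shards)]
    for row in rows:
        shards[rng.randrange(n_shards)].append(row)
    # Random assignment alone keeps the original relative order inside a shard,
    # so shuffle each shard as well.
    for shard in shards:
        rng.shuffle(shard)
    return shards


if __name__ == "__main__":
    # Rows grouped by label, as in the non-random export described in the question.
    rows = [("a", i) for i in range(5)] + [("b", i) for i in range(5)]
    for shard in distribute_rows(rows, 2, seed=42):
        print(shard)
```

This holds all rows in memory, which is fine for a sketch; for large tables the shards would be appended to files page by page instead.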

Dynamic bigquery table names in dataflow

Basically we want to split a big (billions of rows) BigQuery table into a large number (could be around 100k) of smaller tables based on the value of a particular column (not a date). I can't figure out how to do it efficiently in BigQuery itself, so I am thinking of using Dataflow.
With Dataflow, we can first load the data from BigQuery, then create a key-value pair for each record, where the key is the value of the column we want to split the table on, and then group the records by key. After this operation we have a PCollection of (key, [records]). We would then need to write the PCollection back to BigQuery tables, where the table name can be key_table.
So the operation would be: p | beam.io.Read(beam.io.BigQuerySource()) | beam.Map(lambda record : (record['splitcol'], record)) | beam.GroupByKey() | beam.io.Write(beam.io.BigQuerySink)
The key question now is how to write to different tables in the last step based on the value in each element of the PCollection.
This question is somewhat related to another question:
Writing different values to different BigQuery tables in Apache Beam. But I am a Python guy, and I'm not sure whether the same solution is possible in the Python SDK.
Currently this feature (value-dependent BigQueryIO.write()) is only supported in Beam Java. Unfortunately I can't think of an easy way to mimic it using Beam Python, short of reimplementing the respective Java code. Please feel free to open a JIRA feature request.
I guess the simplest thing that comes to mind is writing a DoFn to manually write your rows to the respective tables, using the BigQuery streaming insert API (rather than the Beam BigQuery connector), however keep in mind that streaming inserts are more expensive and subject to more strict quota policies than bulk imports (which are used by the Java BigQuery connector when writing a bounded PCollection).
There is also work happening in Beam on allowing reuse of transforms across languages - a design is being discussed at https://s.apache.org/beam-mixed-language-pipelines. When that work is completed, you would be able to use the Java BigQuery connector from a Python pipeline.
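Since this answer was written, the Beam Python SDK's `WriteToBigQuery` has gained support for passing a callable as `table`, which covers the value-dependent case. A minimal sketch, assuming a reasonably recent Beam release; the dataset name, the `splitcol` default and the `_table` suffix are placeholders taken from or modelled on the question, and `build_pipeline` is untested here.

```python
import re


def table_for_row(row, dataset="my_project:my_dataset", column="splitcol"):
    """Map a row to '<dataset>.<key>_table', replacing characters unsafe in table names.

    Note: two different keys can collapse to the same name after sanitising.
    """
    key = re.sub(r"[^A-Za-z0-9_]", "_", str(row[column]))
    return f"{dataset}.{key}_table"


def build_pipeline(pipeline, source_table, schema):
    """Attach read and per-row-routed write steps to an existing Beam pipeline."""
    # Imported lazily so table_for_row can be used without Beam installed.
    import apache_beam as beam

    return (
        pipeline
        | "Read" >> beam.io.ReadFromBigQuery(table=source_table)
        | "Write" >> beam.io.WriteToBigQuery(
            table=table_for_row,  # called once per element to pick the destination
            schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )


if __name__ == "__main__":
    print(table_for_row({"splitcol": "abc"}))
```

No GroupByKey is needed in this version because the routing happens per element. Whether around 100k destination tables stays within BigQuery's quotas is something to check before relying on it.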

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know, there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
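For the streaming alternative mentioned above, the timestamp can be attached on the client side just before the rows are posted. A hedged sketch: the `ingested_at` column name is made up, `stream_rows` assumes the google-cloud-bigquery client library and default credentials, and only `stamp_rows` is exercised here.

```python
from datetime import datetime, timezone


def stamp_rows(rows, ts=None, column="ingested_at"):
    """Return copies of the rows with an ISO-8601 UTC timestamp column added."""
    ts = ts or datetime.now(timezone.utc)
    stamp = ts.isoformat()
    return [{**row, column: stamp} for row in rows]


def stream_rows(table_id, rows):
    """Stamp the rows and send them through the streaming insert API."""
    # Imported lazily so stamp_rows works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    errors = client.insert_rows_json(table_id, stamp_rows(rows))
    if errors:
        raise RuntimeError(f"streaming insert failed: {errors}")


if __name__ == "__main__":
    fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    print(stamp_rows([{"source": "sensor-1", "value": 10}], fixed))
```

The destination table would need a TIMESTAMP column with the same name for the insert to succeed.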