We have been facing a hot key issue in our Dataflow pipeline (a streaming pipeline that batch-loads into BigQuery; we use batch loads to keep costs down):
We ingest data into different tables based on the decoder value. For example, data with the http decoder goes to the http table, and data with the ssl decoder goes to the ssl table.
So the BigQuery ingestion is using dynamic destinations.
The key is the destination table spec for the data.
An example error log:
A hot key
'key: tableSpec: ace-prod-300923:ml_dataset_us_central1.ssl tableDescription: Table for ssl shard: 1'
was detected in step
'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream' with age of '1116.266s'.
This is a symptom of key distribution being skewed.
To fix, please inspect your data and pipeline to ensure that elements are evenly distributed across your key space.
Error is detected in this step: 'insertTableRowsToBigQuery/BatchLoads/SinglePartitionsReshuffle/GroupByKey/ReadStream'
The hot key comes from the nature of the data: some decoders have disproportionately many records. Our pipeline is also a streaming pipeline.
We have read the documentation provided by Google but are still not sure how to fix it. It suggests:
Dataflow Shuffle. Our project is already using Streaming Engine.
Rekey. This doesn't seem to apply to our case, as the key is the destination table spec. For the ingestion to work, the key has to match the existing table spec in BigQuery.
Combine.PerKey.withHotKeyFanout(). I don't know how to apply this, because the key is generated inside the insertTableRowsToBigQuery step, where we use BigQueryIO to write to BigQuery. The key comes from dynamically generating BigQuery table names based on the current window or the current value (see "Sharding BigQuery output tables").
Attached the code where the hot key is detected:
toBq.apply("insertTableRowsToBigQuery",
    BigQueryIO
        .<DataIntoBQ>write()
        .to((ValueInSingleWindow<DataIntoBQ> dataIntoBQ) -> {
                try {
                    String decoder = dataIntoBQ.getValue().getDecoder(); // getter function
                    String destination = String.format("%s:%s.%s",
                        PROJECT_ID, DATASET, decoder);
                    if (!listOfProtocols.contains(decoder)) {
                        LOG.error("wrong bigquery table decoder destination: " + decoder);
                    }
                    return new TableDestination(
                        destination, // Table spec
                        "Table for " + decoder // Table description
                    );
                } catch (Exception e) {
                    LOG.error("insertTableRowsToBigQuery error", e);
                    return null;
                }
            }
        )
        .withFormatFunction(
            (DataIntoBQ elem) ->
                new DataIntoBQ().toTableRow(elem)
        )
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(3))
        .withAutoSharding()
        .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(options.getGcpTempLocation()))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
You can still try the rekey strategy. For example, you can apply a transformation before apply("insertTableRowsToBigQuery") such that:
elements with the key "http" become "randomVal_http", where randomVal is a value in a specific range (say 0 to 9). The width of the range depends on how many groups you want the elements with key "http" to be split into. For example, if you have 10 million elements with key "http" and you want to split them into 10 groups of approximately 1 million elements each, you can generate uniform random numbers between 0 and 9.
You can apply the same mapping to every element that belongs to a hot key; elements with non-hot keys don't need to be rekeyed.
Now, in your "insertTableRowsToBigQuery" step, you know how to go from a key like "someVal_http" back to "http": split the string.
That should help.
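The key mapping described above could look roughly like the following. This is a plain-Python sketch of the salting and unsalting logic only, not Beam code; the function names, the HOT_KEYS set and the shard count are assumptions made for illustration.

```python
import random

def salt_key(key, hot_keys, num_shards, rng=random):
    """Prefix a hot key with a random shard number; leave other keys unchanged."""
    if key in hot_keys:
        return f"{rng.randrange(num_shards)}_{key}"
    return key

def unsalt_key(salted_key, hot_keys):
    """Strip the shard prefix again, but only if the remainder is a known hot key."""
    prefix, sep, rest = salted_key.partition("_")
    if sep and prefix.isdigit() and rest in hot_keys:
        return rest
    return salted_key

HOT_KEYS = {"http", "ssl"}  # assumed: the decoders known to be hot
NUM_SHARDS = 10             # assumed: how many groups each hot key is split into

salted = [salt_key("http", HOT_KEYS, NUM_SHARDS) for _ in range(1000)]
print(sorted(set(salted)))          # at most 10 distinct salted keys
print(unsalt_key(salted[0], HOT_KEYS))
```

Checking that the remainder is a known hot key before stripping the prefix avoids mangling a non-hot key that happens to contain an underscore.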
Regarding Combine.PerKey.withHotKeyFanout(), I am not sure how to do this for IO transforms. If it were an intermediate transform, I could have helped.
I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picks up GCS files matching a pattern and loads them into BQ. I like the built-in options to delete source files after copy and to send email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could perhaps have lived with a 10-minute delay.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected (never mind the error).
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
With this you can set a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution and here come into play your follow-up questions. I think these might be too broad to address in a single StackOverflow question but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
You can use wildcards to match a file pattern when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates
You can now easily trigger a manual BigQuery transfer run through the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
For the {parent=projects/*/locations/*/transferConfigs/*} part, check the CONFIGURATION section of your transfer and look for the resource name in that form.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
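As a rough illustration of the REST call above, the sketch below only builds the URL and JSON body with the standard library and does not send anything. The requestedRunTime field name follows my reading of the reference page linked above; the project, location and config ID values are placeholders, and a real request would also need an OAuth bearer token in the Authorization header.

```python
import json
from datetime import datetime, timezone

BASE_URL = "https://bigquerydatatransfer.googleapis.com/v1"

def build_start_manual_runs_request(project_id, location, transfer_config_id, run_time):
    """Return (url, json_body) for a startManualRuns call; nothing is sent."""
    parent = (
        f"projects/{project_id}/locations/{location}"
        f"/transferConfigs/{transfer_config_id}"
    )
    url = f"{BASE_URL}/{parent}:startManualRuns"
    # Timestamps in the JSON body are RFC 3339 strings in UTC.
    body = {
        "requestedRunTime": run_time.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    }
    return url, json.dumps(body)

# Placeholder values, for illustration only.
url, body = build_start_manual_runs_request(
    "my-project", "us", "my-config-id",
    datetime(2020, 1, 18, 14, 43, 1, tzinfo=timezone.utc),
)
print(url)
print(body)
```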
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
{ "parent": parent, "requested_run_time": start_time }
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
# Put your project ID here
PROJECT_ID = 'PROJECT_ID'
from google.cloud import bigquery_datatransfer_v1
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)
# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
    # Print Display Name for each Scheduled Query
    print(f'[Schedule Query Name]:\t{element.display_name}')
    # Print name of all elements (it contains the ID)
    print(f'[Name]:\t\t{element.name}')
    # Extract the IDs:
    TRANSFER_CONFIG_ID= element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
    # You can print the entire element for debug purposes
    print(element)
We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres), and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was more or less fixed with a unique constraint and an "ON CONFLICT... DO NOTHING").
At first we trusted the supposed "insertId" UUID that Apache Beam/BigQuery creates.
Then we added a "unique_label" attribute to each message before queueing it into PubSub, built from data in the JSON itself that makes it unique (a device_id plus the reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want, since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LAST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)
var mappings = ""
// Value only available at runtime
if (options.schemaFile.isAccessible){
    mappings = readCloudFile(options.schemaFile.get())
}
val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)
val pubsubMessages =
    pipeline
        .apply("ReadPubSubMessages",
            PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id_label")
                .fromTopic(options.pubSubInput))
pubsubMessages
    .apply("AckPubSubMessages", ParDo.of(object: DoFn<PubsubMessage, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
            context.output("")
        }
    }))
val disarmedMessages =
    pubsubMessages
        .apply("DisarmedPubSubMessages",
            DisarmPubsubMessage(tableRowMapper, postgresMapper)
        )
disarmedMessages
    .get(TupleTags.readingErrorTag)
    .apply("LogDisarmedErrors", ParDo.of(object: DoFn<String, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info(context.element())
            context.output("")
        }
    }))
disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("WriteToBigQuery",
        BigQueryIO
            .writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
            .to(options.bigQueryOutput)
    )
pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to produce TableRow and ReadingsInputFlatten (our own class for Postgres) elements.
We expect zero duplicates, or at least "best effort" (and we would add a cleanup cron job). We pay for these products to run statistics and big data analysis...
[UPDATE 1]
I even added a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
new flow with AckPubSubMessages step
Thanks!!
Looks like you are using the global window. One technique would be to window the data into N-minute windows, then process the keys in each window and drop any items with duplicate keys.
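To make the idea concrete, here is a minimal plain-Python sketch of per-window deduplication. It is not Beam code: the function name, the (key, timestamp, payload) tuple layout and the 5-minute window are assumptions for illustration only.

```python
def dedup_per_window(events, window_seconds):
    """Keep only the first event seen for each (window, key) pair.

    events: iterable of (key, timestamp_seconds, payload) tuples.
    """
    seen = set()
    kept = []
    for key, ts, payload in events:
        # Fixed windows: every timestamp maps to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        if (window_start, key) not in seen:
            seen.add((window_start, key))
            kept.append((key, ts, payload))
    return kept

events = [
    ("dev1-100", 100, "a"),
    ("dev1-100", 130, "a"),  # duplicate inside the same 5-minute window: dropped
    ("dev2-200", 200, "b"),
    ("dev1-100", 400, "a"),  # same key, but in the next window: kept
]
result = dedup_per_window(events, 300)
print(result)
```

The last event shows the limit of this approach: a duplicate that arrives in a later window is not caught, so the window size bounds how late a duplicate can be and still be dropped.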
The officially supported SDK languages are Java and Python; your code appears to be Kotlin, which as far as I know is not officially supported. I strongly recommend using Java to avoid running into unsupported features of the language you use.
In addition, I would recommend the following approaches for dealing with duplicates; option 2 could meet your need for near-real-time data:
1. message_id. You have probably already read the FAQ entry on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it is populated by the server rather than set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
2. BigQuery Streaming. To guard against duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section "Example sink: Google BigQuery".
3. Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.
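One possible variation on the UUID idea, sketched here as an assumption rather than as what the referenced documentation prescribes: derive the ID deterministically from the fields that make a reading unique (the question mentions a device_id plus the reading's timestamp), so that a redelivered message yields the same ID. The namespace string and function name below are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "readings.example.com")

def reading_id(device_id, reading_timestamp):
    """Return the same UUID string every time for the same device and timestamp."""
    return str(uuid.uuid5(NAMESPACE, f"{device_id}|{reading_timestamp}"))

first = reading_id("device-42", "2020-01-18T14:43:01Z")
again = reading_id("device-42", "2020-01-18T14:43:01Z")
other = reading_id("device-42", "2020-01-18T14:44:01Z")
print(first == again, first == other)
```

A random UUID generated at insert time differs on every redelivery, so it can only help with retries of the same insert; a deterministic one also lines up duplicates that were created upstream.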
Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo.
How can I do this using the Apache Beam BigQueryIO API?
This is possible using a feature recently added to BigQueryIO in Apache Beam.
PCollection<Foo> foos = ...;
foos.apply(BigQueryIO.write().to(new SerializableFunction<ValueInSingleWindow<Foo>, TableDestination>() {
    @Override
    public TableDestination apply(ValueInSingleWindow<Foo> value) {
        Foo foo = value.getValue();
        // Also available: value.getWindow(), getTimestamp(), getPane()
        String tableSpec = ...;
        String tableDescription = ...;
        return new TableDestination(tableSpec, tableDescription);
    }
}).withFormatFunction(new SerializableFunction<Foo, TableRow>() {
    @Override
    public TableRow apply(Foo foo) {
        return ...;
    }
}).withSchema(...));
Depending on whether the input PCollection<Foo> is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table depending on amount of data), or it will use the BigQuery streaming inserts API.
The most flexible version of the API uses DynamicDestinations, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.
Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can yourself combine to implement more complex use cases - see files in the source directory.
This feature will be included in the first stable release of Apache Beam and into the next release of Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from github.
As of Beam 2.12.0, this feature is available in the Python SDK as well. It is marked as experimental, so you will have to pass --experiments use_beam_bq_sink to enable it. You'd do something like so:
def get_table_name(element):
    if meets_some_condition(element):
        return 'mytablename1'
    else:
        return 'mytablename2'
p = beam.Pipeline(...)
my_input_pcoll = p | ReadInMyPCollection()
my_input_pcoll | beam.io.gcp.bigquery.WriteToBigQuery(table=get_table_name)
The new sink supports a number of other options, which you can review in the pydoc
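As a slightly fuller illustration of such a table-name callable, the sketch below routes dictionary elements by a field value and falls back to a catch-all table for unexpected values. The project, dataset, field name and table names are all placeholders, and the function is plain Python that can be checked without running a pipeline.

```python
def get_table_spec(element, project="my-project", dataset="my_dataset",
                   known_tables=("http", "ssl"), fallback="unknown"):
    """Map an element (a dict) to a 'project:dataset.table' string."""
    decoder = element.get("decoder")
    # Route unexpected values to a fallback table instead of failing.
    table = decoder if decoder in known_tables else fallback
    return f"{project}:{dataset}.{table}"

print(get_table_spec({"decoder": "http"}))
print(get_table_spec({"decoder": "smtp"}))
```

In a pipeline this would be passed the same way as get_table_name above, as the table argument of WriteToBigQuery.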
I have a PCollection as the result of a pipeline after doing BigQuery processing, and now I want to use part of that data outside the pipeline. How do I convert a PCollection to a List so that I can iterate through it and use the content?
Am I doing something wrong conceptually?
Once you are done with data processing inside your Dataflow pipeline, you'd likely want to write the data into a persistent storage, such as files in Cloud Storage (GCS), a table in BigQuery, etc.
You can then consume the data outside Dataflow, for example, to read it into a List. Obviously, it would need to fit into memory for that specific action.
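For instance, if the pipeline wrote newline-delimited JSON shards and those files are available locally, reading them back into a list could look like the sketch below. The file naming pattern and the JSON-lines format are assumptions; the demo writes two small shards into a temporary directory just to have something to read.

```python
import glob
import json
import os
import tempfile

def read_json_lines(pattern):
    """Read every non-empty line of every matching file into one list of dicts."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    rows.append(json.loads(line))
    return rows

# Demo: create two fake output shards, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    shards = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
    for index, records in enumerate(shards):
        path = os.path.join(tmp, f"output-{index:05d}.json")
        with open(path, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
    rows = read_json_lines(os.path.join(tmp, "output-*.json"))

print(rows)
```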
What I would do is create a "side output" (https://cloud.google.com/dataflow/model/par-do), that is, another PCollection produced alongside your main output, so that in the end you have two PCollections as the result of your BQ process.
Just make sure your process function has a condition for adding elements to the side output collection. Something like this:
public final void processElement(final ProcessContext context) throws Exception {
    context.output(bqProcessResult);
    if (condition) {
        context.sideOutput(myFilterTag, bqProcessResult);
    }
}
The result of that process is not a PCollection but a PCollectionTuple so you just have to do the following:
PCollectionTuple myTuples = previous process using the function above...;
PCollection<MyType> bqCollection = myTuples.get(bqTag);
PCollection<MyType> filteredCollection = myTuples.get(myFilterTag);
When using SetEntry, it automatically generates a set with the key "ids:" + objectName in the Redis db.
For example:
typedClient.SetEntry("famyly:username:jhon",new Family {FatherName="Jhon",...});
a set with the key "ids:Family" and a member like "2343443" is automatically created in the Redis db,
and each time I update or modify the same key with SetEntry, a new auto-generated member is added to the "ids:Family" set. This set will grow extremely large if I update the key frequently.
How can I disable the auto-generated set? It seems useless in the current circumstances.
Thanks
I ran into this same problem - I discovered that our database contained a couple dozen of these "ids:XXX" sets, each containing tens of millions of items, which were consuming significant amounts of memory.
The solution is to switch to untyped clients. You can still use typed methods on the client, so you're really not giving up any type safety or automatic serialization at all. There are a couple of ways to create clients; we tend to use the get-in-get-out Exec shortcuts on RedisClientsManager. You should be able to adapt this to the way you do it.
Typed client - creates "ids" sets:
// set:
redis.ExecAs<T>(c => c.SetEntry(key, value));
// get:
T value = redis.ExecAs<T>(c => c.GetValue(key));
Untyped client - no "ids" sets created:
// set:
redis.Exec(c => c.Set(key, value));
// get:
using (var cli = _redis.GetClient())
{
T value = cli.Get<T>(key);
}
The inferred auto-generated ids are created when you use the high-level Redis Typed Client. Use IRedisClient.SetEntry on the string-based RedisClient API instead.