We are trying to write to Big Query using Apache Beam and avro.
The following seems to work ok:-
p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));
We then tried to use it in the following manner to get data from the Google Pub/Sub
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
When we do this it always pushes it to the buffer and Big Query seems to take a long time to read from the buffer. Can anyone tell me why the above won't write the records directly to the Big Query tables?
UPDATE:-
It looks like I need add the following settings but this throws an java.lang.IllegalArgumentException.
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
The answer is you need to include "withNumFileShards" like so (Can be 1 to 1000).
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
I can't find this documented anywhere to say that withNumFileShards is mandatory however there is a Jira ticket for this which I found after the fix.
https://issues.apache.org/jira/browse/BEAM-3198
Related
I want to read data from Bigquery periodically in Beam, and the test codes as below
pipeline.apply("Generate Sequence",
GenerateSequence.from(0).withRate(1, Duration.standardMinutes(2)))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply("Read from BQ", new ReadBQ())
.apply("Convert Row",
MapElements.into(TypeDescriptor.of(MyData.class)).via(MyData::fromTableRow))
.apply("Map TableRow", ParDo.of(new MapTableRowV1()))
;
static class ReadBQ extends PTransform<PCollection<Long>, PCollection<TableRow>> {
#Override
public PCollection<TableRow> expand(PCollection<Long> input) {
BigQueryIO.TypedRead<TableRow> rows = BigQueryIO.readTableRows()
.fromQuery("select * from project.dataset.table limit 10")
.usingStandardSql();
return rows.expand(input.getPipeline().begin());
}
}
static class MapTableRowV1 extends DoFn<AdUnitECPM, Void> {
#ProcessElement
public void processElement(ProcessContext pc) {
LOG.info("String of mydata is " + pc.element().toString());
}
}
Since BigQueryIO.TypedRead is related to PBegin, one trick is done in ReadBQ through rows.expand(input.getPipeline().begin()). However, this job does NOT run every two minutes. How to read data from bigquery periodically?
Look at using Looping Timers. That provides the right pattern.
As written your code would only fire once after sequence is built. For fixed windows you would need an input value coming into the Window for it to trigger. For example, have the pipeline read a Pub/Sub input and then have an agent push events every 2 minutes into the topic/sub.
I am going to assume that you are running in streaming mode here; however, another way to do this would be to use a batch job and then run it every 2 mins from Composer. Reason being if your job is idle for effectively 90 secs (2 min - processing time) your streaming job wasting some resources.
One other note: Look at thinning down you column selection in your BigQuery SQL (to save time and money). Look at using some filter on your SQL to pick up a partition or cluster. Look at using #timestamp filter to only scan records that are new in last N. This could give you better control over how you deal with latency and variability at the db level.
As you have mentioned in the question, BigQueryIO read transforms start with PBegin, which puts it at the start of the Graph. In order to achieve what you are looking for, you will need to make use of the BigQuery client libraries directly within a DoFn.
For an example of this have a look at this
transform
Using a normal DoFn for this will be ok for small amounts of data, but for a large amount of data, you will want to look at implementing that logic in a SDF.
I'm reading data from BigQuery and writing to Redis using RedisIO from Apache Beam API. Below is the code snippet.
pipeline.apply("Read Data From BigQuery",
BigQueryIO.readTableRows().withoutValidation()
.fromQuery(""))
.apply("Convert Table rows into Redis Entity",
ParDo.of(new RedisEntity()))
.apply("Write to Redis",
RedisIO.write().withEndpoint("localhost", 6379));
When trying to execute the code, I get 2,000 records written in redis and after that getting the below error.
redis.clients.jedis.exceptions.JedisDataException: EXEC without MULTI
at redis.clients.jedis.Pipeline.exec(Pipeline.java:139)
at org.apache.beam.sdk.io.redis.RedisIO$Write$WriteFn.processElement(RedisIO.java:419)
Kindly advice if I'm missing something or if there is a better way to do it.
Seem like a bug in RedisIO, I have submitted an issue to Beam, and have done a pull request to fix it. See if I guess it correctly. issues.apache.org/jira/browse/BEAM-5714
I have a Beam job running on Google Cloud DataFlow that reads data from BigQuery. When I run the job it takes minutes for the job to start reading data from the (tiny) table. It turns out the dataflow job sends of a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
String executingProject,
String jobId,
TableReference destinationTable,
JobService jobService) throws IOException, InterruptedException {
JobReference jobRef = new JobReference()
.setProjectId(executingProject)
.setJobId(jobId);
JobConfigurationQuery queryConfig = createBasicQueryConfig()
.setAllowLargeResults(true)
.setCreateDisposition("CREATE_IF_NEEDED")
.setDestinationTable(destinationTable)
.setPriority("BATCH") <-- NOT EXPOSED
.setWriteDisposition("WRITE_EMPTY");
jobService.startQueryJob(jobRef, queryConfig);
Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
if (parseStatus(job) != Status.SUCCEEDED) {
throw new IOException(String.format(
"Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
}
}
If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:
Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
Write the results of step 1 to a temp table
In your Beam pipeline, read the temp table using BigQueryIO.Read.from()
You can configure to run the queries with "Interactive" priority by passing a priority parameter. Check this Github example for details.
Please note that you might be reaching one of the BigQuery limits and quotas as when you use batch, if you ever hit a rate limit, the query will be queued and retried later. As opposed to the interactive ones, when if these limits are hit, the query will fail immediately. This is because BigQuery assumes that an interactive query is something you need run immediately.
I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.
I was trying to load 1.4Gb gZIP data in to my BigQuery table and i am getting the error Unexpected. Please try again consistently
job_7f1aa8d29ae641459c82243530eb1c65
I was trying to load a structure Row ID,Order Priority,Discount,Unit Price,Shipping Cost,Customer ID,Customer Name,Ship Mode,Product Category,Product Sub-Category,Product Base Margin,Region,State or Province,City,Postal Code,Order Date,Ship Date,Profit,Quantity ordered new,Sales,Order ID
the error is not clear on whats going wrong.
anyone else encountered this error?
Thanks.
It looks like your job ran out of time-- a 53 GB CSV file is a lot to process in one thread. Best practice is to either split your data in multiple chunks, or upload uncompressed data which can be processed in parallel.
I'm in the process of raising the allowed time somewhat, and we'll work on improving the error message when this happens.