I am trying to get the predicted results from the Recommendations AI and store those predictions back into the big query.
After fetching a few rows, the Cloud Function gives timeout exception. Is there any way I can increase the timeout of Cloud Function? or if I can push the predicted results directly to the Big Query without any interaction of Cloud Function?
As mentioned in the Answer:
The maximum run time of 540 seconds applies to all Cloud Functions, no
matter how they're triggered. If you want to run something longer you
will have to either chop it into multiple parts, or run it on a
different platform, such as Compute Engine or App Engine.
Default timeout time can be changed here.
Follow this : select function >test function > edit > timeout
If you want to directly load your data to the BigQuery, you can follow this blog.
you can also look for the documentation on export prediction data to BigQuery as this is not the recommended method you can go for Remote Config personalization.
For more information related to BigQuery ML support model you can refer to the answer and blog.
Related
Previously we had BQ Mate and SuperQuery that are no longer working in new UI to check the cost of query.
Is there any free or relatively inexpensive solution that would show cost estimation before running query?
Google Cloud Pricing Calculator will be the best option to check cost estimation for any google cloud service
https://cloud.google.com/bigquery/docs/estimate-costs
According to the doc
To estimate costs before running a query, you can use one of the following methods:
Query validator in the Google Cloud console
--dry_run flag in the bq command-line tool
dryRun parameter when submitting a query job using the API
The Google Cloud Pricing Calculator
Client libraries
In your case you have to use the Query validator from BQ Console with Pricing calculator .When you enter any query in BQ console the Query validator validates the query as well as show how much data will be read at runtime.
You have to enter those details in the Pricing Calculator and then you will be able to calculate estimated cost.
I'm going to build a web app and use BigQuery as a part of backend database, and I want to show the query cost information (ex. 1.8 sec elapsed, 264.9 MB processed) in the app.
I know we can check the BigQuery's query information inside GCP, but how do we I get that information from BigQuery API?
The information you are interested in is present in the job statistics.
See jobs.get for more details: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
The dry-run sample may be of interest as well, though you can get the stats from a real invocation as well (dry run is for estimating costs without executing the query):
https://cloud.google.com/bigquery/docs/samples/bigquery-query-dry-run
I have an app that sending me data from an API. The data is semi-structured (json data)
I would like to send this data to Google Big Query in order to stock all the information.
However, I'm not able to find how can I do it properly.
So far I have used Node JS on my own server to get the data using POST request.
Could you please help me ? Thnak.
You can use bigquery API to do streaming inserts.
You can also write the data to PubSub or Google Cloud Storage and use dataflow pipelines to load them into bigquery (you can either use streaming inserts (incur costs) or batch load jobs (free))
You can also log in stackdriver and from there you can select and send to bigquery (there already exists direct options for it in GCP, note that under the hood it performs streaming inserts)
If you feel that setting up dataflow is complicated, you can store your files and perform batch load jobs by directly calling bigquery API. Note that there are limits on number of batch loads you can make in a day over a particular table (1000 per day)
There is a page in the official documentation that lists all the possibilities of loading data to BigQuery.
For the simplicity, you can just send data from your local data soruce. You should use the Google Cloud client libraries for Big Query. Here you have a guide on how to do that as well as a relevant code example.
But my honest recommendation is to send data to Google Cloud Storage and from there, to load it to BigQuery. This way the whole process will be more stable.
You can check all the options from the first link that I've posted and choose what you think that will fit best with your workflow.
Keep in mind the limitations of this process.
In gcp, I need to update a bigquery table whenever a file (multiple formats such as json,xml) gets uploaded into a bucket. I have two options but not sure what are the pros/cons of each of them. Can someone suggest which is a better solution and why?
Approach 1 :
File uploaded to bucket --> Trigger Cloud Function (which updates the bigquery table) -->Bigquery
Approach 2:
File uploaded to bucket --> Trigger Cloud Function (which triggers a dataflow job) -->Dataflow-->Bigquery.
In production env, which approach is better suited and why? If there are alternative approaches,pls let me know.
This is quite a broad question, so I wouldn't be surprised if it gets voted to be closed. That said however, I'd always go #2 (GCS -> CF -> Dataflow -> BigQuery).
Remember, with Cloud Funtions there is a max execution time. If you kick off a load job from the Cloud Function, you'll need to bake logic into it to poll and check the status (load jobs in BigQuery are async). If it fails, you'll need to handle it. But, what if it's still running and you hit the max execution of your Cloud Function?
At least by using Dataflow, you don't have the problem of max execution times and you can simply rerun your pipeline if it fails for some transient reason e.g. network issues.
I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but now it's possible to use a micro-batching tactic using a Stream Pipeline. If you decide one way or another you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERT, if you use the first one you need to include the withTriggeringFrequency and withNumFileShards. For the first one, from my experience, is better to use minutes and the number will depend on the amount of throughput data. If you receive quite a lot try to keep it small, I have seen "stuck errors" when you increase it too much. The shards can affect mostly your GCS billing, if you add to much shards it will create more files per table per x amount of minutes.
If your input data size is not so big the streaming insert can work really well and the cost shouldn't be a big deal. In that scenario you can use STREAMING_INSERT method and remove the withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy like InsertRetryPolicy.retryTransientErrors() so no rows are being lost (keep in mind that idempotency is not guaranteed with STREAM_INSERTS, so duplication is possible)
You can check your Jobs in BigQuery and validate that everything is working! Keep in mind the policies for jobs with BigQuery (I think is 1000 jobs per table) when you are trying to define triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).