How to save PageSpeed Insights results to Google BigQuery?

I have a table of URLs in BigQuery on which I would like to run a PageSpeed Insights check (or even include the whole API response in the BigQuery table). I tried to use UDFs for this purpose, but so far with no luck. Is there a way of getting the response from:
https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url=https://google.com/&strategy=mobile&key=yourAPIKey
to BigQuery table?

You cannot make API calls from BigQuery UDFs for several reasons. See here for more details about that.
Although there are a few ways to achieve what you want, I'd recommend using a Cloud Dataflow pipeline (a rough sketch follows the steps below):
Read your URLs from BigQuery using the BigQueryIO.Read source
In a ParDo, call the PageSpeed Insights API for each URL
Write your results back to BigQuery using the BigQueryIO.Write sink
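A minimal sketch of that pipeline using the Apache Beam Python SDK (ReadFromBigQuery/WriteToBigQuery are the Python equivalents of the BigQueryIO.Read/Write transforms named above). The table names, API key, and output schema are placeholders, and the endpoint is the v2 URL from the question:

import json
import urllib.request
from urllib.parse import quote_plus

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class FetchPageSpeed(beam.DoFn):
    """Calls the PageSpeed Insights API for each URL read from BigQuery."""

    def __init__(self, api_key):
        self.api_key = api_key

    def process(self, row):
        url = row['url']
        endpoint = (
            'https://www.googleapis.com/pagespeedonline/v2/runPagespeed'
            f'?url={quote_plus(url)}&strategy=mobile&key={self.api_key}')
        with urllib.request.urlopen(endpoint) as resp:
            result = json.loads(resp.read())
        # Store the raw API response next to the URL; extract a score here if preferred.
        yield {'url': url, 'response': json.dumps(result)}


def run():
    options = PipelineOptions()  # pass --project, --runner=DataflowRunner, --temp_location, etc.
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadUrls' >> beam.io.ReadFromBigQuery(
               query='SELECT url FROM `my_project.my_dataset.urls`',  # placeholder table
               use_standard_sql=True)
         | 'CallPageSpeed' >> beam.ParDo(FetchPageSpeed(api_key='YOUR_API_KEY'))
         | 'WriteResults' >> beam.io.WriteToBigQuery(
               'my_project:my_dataset.pagespeed_results',  # placeholder table
               schema='url:STRING,response:STRING',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()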

Related

What is the best way to add new data to BigQuery through BigQuery API?

I'm using Django as my backend framework to connect my web app with BigQuery, calling the BigQuery API in views.py to fetch data from BQ. So far, from my research, I've found 2 ways I can add data to BQ from my Django:
Using the insert_rows_json() method, where I would just need to have the data in JSON format and it would append the rows to the table in BQ.
Using the to_gbq() method, where I would need the data in a Pandas DataFrame and could include the parameter if_exists="replace" to update existing tables in BQ.
Currently, for adding new data I use method 1, and for other operations such as updating and deleting I use method 2 (both sketched below).
My question: is it better to use method 2 for all of my operations, or should I stick with method 1 for adding new data and method 2 for the other operations?
Or perhaps is there another way that would be more efficient and make the web app run even faster?
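For context, here is a minimal sketch of the two methods described above, with placeholder project, dataset, table, and column names (not the actual app code):

from google.cloud import bigquery
import pandas as pd
import pandas_gbq  # backs DataFrame.to_gbq

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder

# Method 1: stream JSON-shaped rows into the table (append only).
rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")

# Method 2: load a Pandas DataFrame, replacing the existing table contents.
df = pd.DataFrame({"id": [1, 2], "name": ["alice", "carol"]})
pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my_project",
                  if_exists="replace")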
Quoted from this doc:
For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. The tabledata.insertAll method is still fully supported.
You can try the BigQuery Storage Write API instead of the legacy insert_rows_json() streaming method: it has lower pricing and more robust features, including exactly-once delivery semantics. If you still need the legacy insert_rows_json() streaming method, you can keep using it; it is still fully supported and maintained by Google Cloud.
You can also UPDATE and DELETE table data using DML queries via the BigQuery client libraries. But there are some limitations in BigQuery when running UPDATE and DELETE queries immediately after streaming inserts:
Rows that were written to a table recently by using streaming (the tabledata.insertall method or the Storage Write API) cannot be modified with UPDATE, DELETE, or MERGE statements. The recent writes are those that occur within the last 30 minutes. All other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements. The streamed data can take up to 90 minutes to become available for copy operations.
If you still want to use the to_gbq() method for updating and deleting tables, you can. Refer here for the differences between the pandas-gbq and google-cloud-bigquery libraries.
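A hedged sketch of the DML route suggested above, i.e. updating and deleting rows through the google-cloud-bigquery client instead of replacing the whole table with to_gbq. Table and column names are placeholders, and the 30/90-minute restrictions on freshly streamed rows quoted above still apply:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder

# Parameterized UPDATE statement run as a query job.
update_job = client.query(
    f"UPDATE `{table_id}` SET name = @name WHERE id = @id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("name", "STRING", "carol"),
            bigquery.ScalarQueryParameter("id", "INT64", 1),
        ]),
)
update_job.result()  # blocks until the DML statement finishes

# DELETE works the same way.
client.query(
    f"DELETE FROM `{table_id}` WHERE id = @id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("id", "INT64", 2)]),
).result()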

BigQuery cost for a project

How can I find out the BigQuery cost for a project programmatically? Is there an API to do that?
Also, is it possible to get user-level cost details for the queries that were made?
To track individual costs for a BigQuery project, you can redirect all logs back to BigQuery - and then you can run queries over these logs.
https://cloud.google.com/bigquery/docs/reference/auditlogs
These logs include who ran the query, and how much data was scanned.
Another way is to use the INFORMATION_SCHEMA views; check this post:
https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
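A hedged sketch of the INFORMATION_SCHEMA approach: estimating per-user query cost from bytes billed over the last 30 days. The region qualifier and the on-demand price per TiB are assumptions to adjust for your project:

from google.cloud import bigquery

client = bigquery.Client()
PRICE_PER_TIB_USD = 6.25  # assumed on-demand rate; check current pricing

sql = """
    SELECT
      user_email,
      SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
      SUM(total_bytes_billed) / POW(1024, 4) * @price AS approx_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
    GROUP BY user_email
    ORDER BY approx_cost_usd DESC
"""
job = client.query(sql, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter(
        "price", "FLOAT64", PRICE_PER_TIB_USD)]))
for row in job.result():
    print(row.user_email, round(row.approx_cost_usd, 2))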
You can use the Cloud Billing API, for example:
[GET] https://cloudbilling.googleapis.com/v1/projects/{projectsId}/billingInfo
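A small sketch of calling that endpoint from Python with application default credentials; note that billingInfo returns the project's billing account association rather than an itemized cost breakdown, and the project ID is a placeholder:

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-billing.readonly"])
session = AuthorizedSession(credentials)

resp = session.get(
    "https://cloudbilling.googleapis.com/v1/projects/my-project/billingInfo")
resp.raise_for_status()
print(resp.json())  # e.g. {"billingAccountName": "...", "billingEnabled": true}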

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON data).
I would like to send this data to Google BigQuery in order to store all the information.
However, I haven't been able to find out how to do it properly.
So far I have used Node.js on my own server to get the data using POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use a Dataflow pipeline to load it into BigQuery (either with streaming inserts, which incur costs, or with batch load jobs, which are free).
You can also log to Stackdriver and from there select and send to BigQuery (there are already direct options for this in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by calling the BigQuery API directly. Note that there are limits on the number of batch loads you can make in a day against a particular table (1,000 per day).
There is a page in the official documentation that lists all the possibilities of loading data to BigQuery.
For simplicity, you can just send data from your local data source using the Google Cloud client libraries for BigQuery. Here you have a guide on how to do that, as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and load it into BigQuery from there (see the sketch after this answer). This way the whole process will be more stable.
You can check all the options in the first link that I've posted and choose what you think will fit best with your workflow.
Keep in mind the limitations of this process.
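A minimal sketch of the Cloud-Storage-then-load route recommended above; the bucket URI and destination table are placeholders, and the JSON files are assumed to be newline-delimited:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the semi-structured JSON
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/incoming/events-*.json",  # placeholder URI
    "my_project.my_dataset.events",           # placeholder destination table
    job_config=job_config,
)
load_job.result()  # batch load jobs are free, but limited per table per day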

Google DataFlow API for ingesting HLL_COUNT.INIT into BigQuery

I am sending data to PubSub from where I am trying to create a DataFlow job to put data into BigQuery.
I have a column in the data, intended for unique counting, on which I want to do HLL_COUNT.INIT.
Is there an equivalent method on the DataFlow side so that I can directly store the HLL version of the column in BigQuery?
No, Dataflow doesn't support the BigQuery HLL sketch format, but it is clearly something that would be useful. I created a feature request for it in the Dataflow issue tracker: https://issuetracker.google.com/62153424.
Update: A BigQuery-compatible implementation of HyperLogLog++ has been open-sourced to github.com/google/zetasketch, and a design doc (docs.google.com/document/d/…) about integrating it into Apache Beam has been sent out to dev@beam.apache.org.
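As an aside (not part of the original answer): if the raw values are loaded into BigQuery first, the sketch can be built inside BigQuery itself with HLL_COUNT.INIT and later combined with HLL_COUNT.MERGE. A hedged sketch with placeholder table and column names:

from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE TABLE `my_project.my_dataset.daily_sketches` AS
    SELECT
      DATE(event_timestamp) AS day,
      HLL_COUNT.INIT(user_id) AS user_sketch  -- BYTES column holding the HLL++ sketch
    FROM `my_project.my_dataset.raw_events`
    GROUP BY day
""").result()

# Approximate distinct users across all days by merging the stored sketches:
# SELECT HLL_COUNT.MERGE(user_sketch) FROM `my_project.my_dataset.daily_sketches`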

Using BigQuery for logs analysis

I'm trying to do log analysis with BigQuery. Specifically, I have an App Engine app and a JavaScript client that will be sending log data to BigQuery. In BigQuery, I'll store the full log text in one column but also extract important fields into other columns. I then want to be able to run ad hoc queries over those columns.
Two questions:
1) Is BigQuery particularly good or particularly bad at this use case?
2) How do I set up revolving logs? I.e. I want to keep only the last N logs or the last X GB of log data. I see that delete is not supported.
Just so you know, there is an excellent demo of moving App Engine Log data to BigQuery via App Engine MapReduce called log2bq (http://code.google.com/p/log2bq/)
Re: "use case" - Stack Overflow is not a good place for judgements about best or worst, but BigQuery is used internally at Google to analyse really really big log data.
I don't see the advantage of storing full log text in a single column. If you decide that you must set up revolving "logs," you could ingest daily log dumps by creating separate BigQuery tables, perhaps one per day, and then delete the tables when they become old. See https://developers.google.com/bigquery/docs/reference/v2/tables/delete for more information on the Table.delete method.
After implementing this - we decided to open source the framework we built for it. You can see the details of the framework here: http://blog.streak.com/2012/07/export-your-google-app-engine-logs-to.html
If you want your Google App Engine (Google Cloud) project's logs to be in BigQuery, Google has added this functionality built in to the new Cloud Logging system. It is a beta feature known as "Logs Export"
https://cloud.google.com/logging/docs/install/logs_export
They summarize it as:
Export your Google Compute Engine logs and your Google App Engine logs to a Google Cloud Storage bucket, a Google BigQuery dataset, a Google Cloud Pub/Sub topic, or any combination of the three.
We use the "Stream App Engine Logs to BigQuery" feature in our Python GAE projects. This sends our app's logs directly to BigQuery as they occur, providing near real-time log records in a BigQuery dataset.
There is also a page describing how to use the exported logs.
https://cloud.google.com/logging/docs/export/using_exported_logs
To query logs exported to BigQuery over multiple days (e.g. the last week), you can use a legacy SQL query with a FROM clause like this:
FROM
  (TABLE_DATE_RANGE(my_bq_dataset.myapplog_,
                    DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
                    CURRENT_TIMESTAMP()))
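For reference, a hedged sketch of the same last-7-days pattern in standard SQL via the Python client, using a wildcard table and _TABLE_SUFFIX instead of the legacy-SQL-only TABLE_DATE_RANGE; it assumes the exported tables share the myapplog_ prefix with a YYYYMMDD date suffix:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT *
    FROM `my_bq_dataset.myapplog_*`
    WHERE _TABLE_SUFFIX >= FORMAT_DATE(
        '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
"""
for row in client.query(sql).result():
    ...  # process each log row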