I have some Excel files that I want to append to a BigQuery table. After doing that, I want to append data to the same BigQuery table from Cloud Storage automatically, through a scheduler that runs 4 times a day. How can I do this? Please keep in mind I am not a developer; I just know SQL and BigQuery.
I suggest writing a simple script, hosting it on Google App Engine, registering a bucket notification, and hooking it up to that GAE endpoint.
Whenever a file is uploaded to the bucket, GAE will be notified and invoke your script, which then fetches the file, extracts the data from it, and appends it to the table you specify in BigQuery.
https://cloud.google.com/storage/docs/object-change-notification#_Watching
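As a rough illustration of that idea only, here is a minimal sketch of such an endpoint in Python (Flask on App Engine), assuming the notification body carries the object's bucket and name, and assuming the files are CSV (BigQuery load jobs cannot read .xlsx directly, so Excel files would first need to be saved as CSV). The endpoint path and table name are placeholders, not part of any real setup:

# Hypothetical App Engine handler: receives a Cloud Storage object notification
# and appends the new file to a BigQuery table. All names are illustrative.
from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq_client = bigquery.Client()

@app.route("/gcs-notify", methods=["POST"])
def gcs_notify():
    payload = request.get_json(silent=True) or {}
    bucket = payload.get("bucket")
    name = payload.get("name")
    if not bucket or not name:
        return "ignored", 200
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bq_client.load_table_from_uri(
        f"gs://{bucket}/{name}",
        "myproject.mydataset.mytable",  # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish
    return "ok", 200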
I would like to run the bq load command once every day at 00:00 UTC. Can I use Google Cloud Scheduler for scheduling this command?
As mentioned by @Daniel, there is no direct way to use Cloud Scheduler to execute scheduled queries; however, there are options you can consider to run queries on a schedule.
Use scheduled queries directly in BQ
Since your source is GCS, you can load data from GCS to BQ and then execute scheduled queries as mentioned here
Use a scheduled Cloud Function to run queries (a minimal sketch follows this list)
Schedule using Data Transfer
You can also try what @Graham Polley has mentioned in this blog, which requires an architecture combining Cloud Scheduler, Cloud Source Repositories and Cloud Build
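For the second option, a hedged sketch of what such a function might look like in Python: a Pub/Sub-triggered Cloud Function that runs a query into a destination table, with Cloud Scheduler publishing to the topic on the cron schedule 0 0 * * * (00:00 UTC). The project, dataset and table names below are placeholders, and the query is only an example:

# Sketch of a Pub/Sub-triggered Cloud Function that runs a query on a schedule.
# All resource names below are placeholders.
from google.cloud import bigquery

def run_scheduled_query(event, context):
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination="myproject.mydataset.daily_table",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    sql = "SELECT * FROM `myproject.dataset_id.external_table_name`"
    client.query(sql, job_config=job_config).result()  # wait so errors surface in the logs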
Assuming you have a file that is being loaded into Cloud Storage every day before 7am, you may consider a more resilient design: when the file is created in Cloud Storage, trigger a notification that starts the process of loading it. This design gets the information into BigQuery earlier and keeps working even if the file's creation is delayed.
When the file is created in Cloud Storage, get a message in Pub/Sub: https://cloud.google.com/storage/docs/pubsub-notifications
Then, a Cloud Function is invoked that will execute the bq load command.
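As a hedged sketch only, such a Cloud Function (Python, triggered by the bucket's object-finalize event) could run the load with the BigQuery client instead of shelling out to bq; the destination table is a placeholder and the source format is assumed to be CSV:

# Sketch of a Cloud Function triggered by a new object in the bucket.
# Adjust the source format and destination table (placeholder) to your case.
from google.cloud import bigquery

def on_file_created(event, context):
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(
        uri, "myproject.mydataset.mytable", job_config=job_config
    ).result()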
BTW, if you have many files or some dependencies between them, consider using Cloud Composer as an orchestrator to keep the complexity under control.
You would not be able to do it directly with Cloud Scheduler; you would need an intermediary such as a Cloud Function to execute the command. Alternatively, you could try scheduling a data transfer, depending on the requirements of your load job.
Here is an example from the documentation:
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
Based on your update that you want to shard the table by date, try scheduled queries in the following manner.
Create an external table pointing to the desired path in GCS as described here
Define your query; I recommend a query that names the columns explicitly and applies appropriate casting.
SELECT *
FROM `myproject.dataset_id.external_table_name`
-- INCLUDE FILTERING ON _FILE_NAME IF NEEDED LIKE FOLLOWING:
-- WHERE _FILE_NAME LIKE SOME_VALUE
Create a scheduled query with the run_date parameter in the destination table name, e.g. new_table_{run_date}
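If you prefer to set the schedule up programmatically rather than in the console, a sketch along these lines with the BigQuery Data Transfer Python client may help; the project, dataset, display name and schedule are placeholders, and the exact parameters may need adjusting for your case:

# Sketch: create a scheduled query whose destination table is sharded by run_date.
# Project, dataset, query and schedule are placeholders.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="dataset_id",
    display_name="Daily snapshot of GCS external table",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM `myproject.dataset_id.external_table_name`",
        "destination_table_name_template": "new_table_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("myproject"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")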
I'm looking to load CSV data from Google Cloud Storage to a BigQuery table (see docs) as a batch (see docs) using wildcards, and was wondering whether:
The data in the table will only be available once all CSVs have been loaded (i.e. the files get collated in some way before being loaded into BigQuery)
The data in the table will be updated incrementally with each CSV that's loaded (i.e. each CSV is loaded separately, as a separate job)
For some context, I'm trying to work out if it will be possible for a user to view incomplete table data if they access the table before the job to load the batch of CSVs has finished.
A similar question has been asked here before, but I don't have enough reputation to comment :'(
Thanks for the help!
The data are viewable only when the job has completed, i.e. after all the files have been ingested.
Indeed, when you define a load job you can specify a WRITE_TRUNCATE disposition, which means all the current data will be replaced by the new data. If the job fails, the current data stay unchanged. That behaviour would not be possible with an incremental load.
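To make that concrete, here is a hedged sketch of a single wildcard load job with that disposition using the Python client (bucket and table names are placeholders). Because it is one job, readers see either the old data or the complete new data, never a partial mix:

# Sketch: one load job over many CSVs; the table changes only if the whole job succeeds.
# Bucket and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_uri(
    "gs://my-bucket/daily/*.csv",       # wildcard: all CSVs, one atomic job
    "myproject.mydataset.mytable",
    job_config=job_config,
)
job.result()  # old data is replaced only if the job succeeds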
If you do want to offer incremental loading, you can use a less efficient process: read the files with Dataflow and stream the content into BigQuery. This time, users will be able to see the incremental data by querying it (though not with the preview feature in the console, because streamed data stays for a while in BigQuery's streaming buffer, up to about 90 minutes).
I want to send data from BigQuery (about 500K rows) to a custom endpoint via a POST request; how can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and it hits the maximum execution time).
I was looking for Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send the data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data for a single POST request to be a practical option.
Dataflow is a product oriented towards pipeline development, intended to run several data transformations within its jobs. You can use BigQueryIO (with Python sample code), but if you just need to move the data to a certain machine/endpoint, creating a Dataflow job will add unnecessary complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data into multiple files using a wildcard URI as your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
And finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION] you can download the data from your bucket.
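If you would rather script the export than run the CLI by hand, a rough Python-client equivalent of the bq extract command above could look like this (bucket and table names are placeholders):

# Sketch: export the table to GCS as gzipped CSV, roughly equivalent to the
# bq extract command above. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
extract_config = bigquery.ExtractJobConfig(
    compression=bigquery.Compression.GZIP,
    destination_format=bigquery.DestinationFormat.CSV,
)
extract_job = client.extract_table(
    "mydataset.mytable",
    "gs://example-bucket/file-name-*.csv",  # wildcard for exports over 1 GB
    job_config=extract_config,
)
extract_job.result()  # wait for the export to finish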
Note: there are more ways to do this in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
I have some simple weekly aggregates from Google Analytics that I'd like to store somewhere. The reason for storing them is that if I run a query against too much data in Google Analytics, it becomes sampled, and I want it to be totally accurate.
What is the best way to solve this?
My thoughts are:
1) Write a process in BigQuery to append the data each week to a permanent dataset
2) Use an API that gets the data each week and stores it in a Google spreadsheet (appending a line each time)
What is the best recommendation for my problem - and how do I go about executing it?
Checking your previous questions, we see that you already use BigQuery.
When you run a query against the Google Analytics tables in BigQuery, the result is not sampled, as those tables contain all the data. There is no need to store anything extra, as you can query them every time you need.
If you do want to store the results, and pay for the additional table, you can go ahead and write them to a destination table.
If you want quick access, try creating a view.
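If you go the destination-table route, the weekly append can be a single query job with a destination table and WRITE_APPEND. A hedged sketch with the Python client follows; the dataset, table names and the aggregate query itself are placeholders you would replace with your own:

# Sketch: append this week's aggregates to a permanent table.
# Table names and the aggregation query are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="myproject.reporting.weekly_aggregates",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
sql = """
    SELECT CURRENT_DATE() AS snapshot_date, COUNT(*) AS sessions
    FROM `myproject.analytics_dataset.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN
          FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
      AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
"""
client.query(sql, job_config=job_config).result()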
I suggest the following:
1) Make a roll-up table for your weekly data - you can do that either by writing a query for it and running it manually, or with a script in a Google Spreadsheet that uses the same query (via the API) and is scheduled to run every week. I tried a bunch of the tutorials out there and this one is the simplest to implement.
2) Depending on the data points you want, you can even use the Google Analytics API without having to go through BigQuery for this request; try pulling this report of yours from here. If it works, there are a bunch of Google Sheets extensions that can make it a lot quicker to set up a weekly report. Or you can just code it yourself.
Would that work for you?
Thanks!
I have a big table containing domain data in Google BigQuery and I would like to create a web app similar to http://whois.domaintools.com/browse/a/
A page with a list of sorted results that I can dig into.
Is it possible without running a query every time the page is opened or refreshed, which is the most obvious way?
Thanks in advance!
Querying BigQuery directly introduces latency that will affect frontend performance; for some users and queries it will be several seconds. Hence it's not recommended for a live website; the most suitable approach is to run the queries asynchronously in the background.
You need to build your website so that it reads the data from a cache or a local database.
You then need to build a background process (message queue or cron) that periodically runs the BigQuery job, processes the results, and writes them to your local database. You can choose to run it only every hour or so.
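As a hedged sketch of that background job, assuming Python, a local SQLite cache, and a placeholder BigQuery table and column set (none of this comes from your actual schema), something like the following could be run from cron every hour:

# Sketch: periodic job that refreshes a local cache from BigQuery.
# The query, columns and database path are placeholders.
import sqlite3
from google.cloud import bigquery

def refresh_cache(db_path="cache.db"):
    bq = bigquery.Client()
    rows = bq.query("""
        SELECT domain, registrar, created_date
        FROM `myproject.whois.domains`
        ORDER BY domain
        LIMIT 10000
    """).result()

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS domains "
                 "(domain TEXT, registrar TEXT, created_date TEXT)")
    conn.execute("DELETE FROM domains")  # replace the cache wholesale
    conn.executemany(
        "INSERT INTO domains VALUES (?, ?, ?)",
        ((r["domain"], r["registrar"], str(r["created_date"])) for r in rows),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    refresh_cache()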
See what you can do with BigQuery
https://github.com/everythingme/redash frontend available at http://demo.redash.io/
http://bigqueri.es/