Using Google Cloud Platform for scheduled recurring API data pull that then loads the data to BigQuery - google-bigquery

I need to set up a scheduled daily job that pulls data using a REST API call and then inserts that data into BigQuery.
I traditionally have done these types of tasks using Node.js running on Heroku. My current boss wants me to achieve this using the Google Cloud Platform.
What are some ways to achieve this using Google Cloud Platform?

A few options on GCP:
Spin up a GCE instance and use cron (a little old school, but it will work).
Use Google App Engine and schedule your job(s) that way.
Unfortunately, Google Cloud Functions don't yet support schedulers. Otherwise, that would be perfect.
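Whichever scheduler you choose, the job itself can be small. Here is a minimal Python sketch of the pull-and-load step, assuming a hypothetical REST endpoint and BigQuery table (the URL, project, dataset, and table names are placeholders); cron on a GCE instance or an App Engine handler would invoke it daily:

    import requests
    from google.cloud import bigquery

    def pull_and_load():
        # Hypothetical source API that returns a JSON array of row dicts.
        rows = requests.get("https://api.example.com/daily-data").json()
        client = bigquery.Client()
        # Hypothetical destination table; insert_rows_json streams the rows in.
        errors = client.insert_rows_json("my_project.my_dataset.daily_data", rows)
        if errors:
            raise RuntimeError("BigQuery insert failed: %s" % errors)

    if __name__ == "__main__":
        pull_and_load()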

Related

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files into Cloud Storage, have Dataflow enrich, transform, and aggregate the data, and then finally store it in BigQuery or Cloud SQL.
I have a couple of questions to help me better understand.
Suppose you are building a big data product using Google's services.
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing: for example, one might parse a file, another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Datastore?
Have you looked at Cloud Pub/Sub (a topic subscription/publication service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, Cloud Dataflow can publish its progress and send the status to a topic. Your front end (App Engine) just needs to subscribe to that topic and receive updates.
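As a rough sketch of the publishing side, assuming a hypothetical project and topic (both names are placeholders), a processing step can emit a status message with the file ID as an attribute, and the front end's backend subscribes to that topic:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "file-status")

    # Called from the processing side whenever a file finishes a stage.
    def publish_status(file_id, status):
        future = publisher.publish(
            topic_path, status.encode("utf-8"), file_id=file_id)
        future.result()  # block until the message is accepted

    publish_status("upload-123", "PROCESSING-COMPLETE")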
1)
Dataflow does not offer inspection of intermediary results. If a frontend wants more progress detail about an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting rows to it at various stages of the pipeline. I.e., have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. A frontend can then query for those results. (I would personally avoid overwriting old rows, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step.
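A minimal Apache Beam (Python) sketch of that branching, with a hypothetical progress table and placeholder parse/next-step transforms standing in for your real ones:

    import apache_beam as beam

    def run():
        with beam.Pipeline() as p:
            parsed = (p
                      | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
                      | "Parse" >> beam.Map(lambda line: {"element_id": line.split(",")[0]}))
            # Branch 1: write a progress row for each element to a BigQuery sink.
            (parsed
             | "ToProgressRow" >> beam.Map(
                 lambda e: {"element_id": e["element_id"], "status": "PHASE-1 DONE"})
             | "WriteProgress" >> beam.io.WriteToBigQuery(
                 "my_project:my_dataset.progress",
                 schema="element_id:STRING,status:STRING"))
            # Branch 2: the same PCollection feeds the pipeline's next step.
            parsed | "NextStep" >> beam.Map(lambda e: e)  # placeholder transform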
2)
Is your microservice architecture using a "pipes and filters" pipeline-style approach? I.e., each service reads from a source (Kafka/RabbitMQ), writes data out, and the next one consumes it?
Probably the best approach is to set up a few different Dataflow pipelines, output their results to a Pub/Sub or Kafka sink, and have the next pipeline consume from that Pub/Sub topic. You may also wish to sink the results to another location like BigQuery/GCS, so that you can query them again if you need to.
There is also the option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers; a microservice system can be set up with several Cloud Functions.
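For instance, a background Cloud Function (Python) with a Pub/Sub trigger could act as one hypothetical "filter" in such a system:

    # main.py, deployed with a Pub/Sub trigger, e.g.:
    #   gcloud functions deploy handle_message --trigger-topic my-topic --runtime python37
    import base64

    def handle_message(event, context):
        """Triggered by a message on the subscribed Pub/Sub topic."""
        payload = base64.b64decode(event["data"]).decode("utf-8")
        # ... do this service's processing, then publish to the next topic ...
        print("Processed message %s: %s" % (context.event_id, payload))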

Dynamic scheduler on GCP

Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
The Google cron service is configured in a static file, and it seems like their answer to this is to use that to poke a roll-your-own service backed by Pub/Sub and a data store. I am looking for Quartz-like functionality, consumable by App Engine, which can be managed and invoked via API, as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like), or rolling a custom solution. It should support 50 million simultaneous jobs per day with retry/recoverability and dynamic per-tenant scheduling capabilities.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing App Engine based project: the cron-poked, Pub/Sub backed service you described.
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
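As a rough sketch of that custom pattern, assuming a hypothetical Datastore kind Job and a hypothetical Pub/Sub topic: a static cron entry hits this handler every minute, and it fans out whatever dynamically stored jobs are due:

    from datetime import datetime, timedelta, timezone

    from google.cloud import datastore, pubsub_v1

    ds = datastore.Client()
    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "due-jobs")

    def run_due_jobs():
        """Called by the static cron endpoint; fans out due jobs via Pub/Sub."""
        now = datetime.now(timezone.utc)
        query = ds.query(kind="Job")  # hypothetical kind holding job definitions
        query.add_filter("next_run", "<=", now)
        for job in query.fetch():
            publisher.publish(topic_path, job["payload"].encode("utf-8"))
            # Naive reschedule; real retry/recovery logic would go here.
            job["next_run"] = now + timedelta(seconds=job["interval_seconds"])
            ds.put(job)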
Depending on the context, even simpler solutions are possible. For a GAE context check out, for example, How to schedule repeated jobs or tasks from user parameters in Google App Engine?.

Loading data to BigQuery using python API vs bq load

Got a new requirement: a GCS bucket has around 130+ files, and these files need to be loaded into different tables in BigQuery on a daily basis.
After researching, I found two options.
1) Use "bq load" command to load (Shell Script/Python Script)
2) Create a Python API to load the data to BigQuery
Which option is best. If I go with Python API, I need use APPENGINE to schedule it.
is there any better option other than this?
Thanks,
However you do it, you'll be creating load jobs. So from the BigQuery side of things, it doesn't really matter which option you choose.
As far as scheduling goes, you do have some options on Google Cloud Platform:
App Engine standard environment cron service.
See this example for using this to reliably schedule tasks via Pub/Sub.
Your operating system's cron or systemd timers on a Compute Engine instance.
A cron job on a Kubernetes cluster, using Container Engine.
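For the "bq load" route, for example, a minimal Python wrapper around the CLI that a daily cron entry could invoke (the bucket, dataset, and table names are hypothetical):

    import subprocess

    # Hypothetical mapping of the GCS files to their destination tables.
    FILES_TO_TABLES = {
        "gs://my-bucket/customers.csv": "my_dataset.customers",
        "gs://my-bucket/orders.csv": "my_dataset.orders",
    }

    for uri, table in FILES_TO_TABLES.items():
        subprocess.run(
            ["bq", "load", "--source_format=CSV", "--skip_leading_rows=1",
             table, uri],
            check=True,  # raise if the load job fails
        )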
There are a few differences:
a) bq load:
- You can have some issues using special characters as delimiters, like ^ and |.
- You don't need a service account (you can use a user account).
- You can't use it in Google Cloud Functions.
b) API:
- You don't have the special-character trouble.
- You can use it in Google Cloud Functions.
- And if you create a Python script, you can schedule it with Scheduled Tasks (on Windows).
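For comparison, here is a sketch of the equivalent load through the Python client library, which also lets you use delimiters like | without CLI quoting issues (the project, dataset, table, and file names are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter="|",  # special characters are fine via the API
        skip_leading_rows=1,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/file1.csv",
        "my_project.my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete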

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run this once or twice a day. Is the purpose of Import.io as an API to let me build logic such as data storage and scheduled tasks (running queries multiple times a day) in my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set a Cron job to run when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. This will consume one of your requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
Unfortunately the script will not be a very simple one, since most websites have very different response structures in import.io, and, as you may already know, the premium version of the tool now provides scheduling capabilities.
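A minimal sketch of calling that endpoint from Python (the extractor ID, API key, and target URL are placeholders you would substitute):

    import requests

    EXTRACTOR_ID = "your-extractor-id"
    API_KEY = "your-api-key"
    TARGET_URL = "https://example.com/page-to-extract"

    resp = requests.get(
        "https://extraction.import.io/query/extractor/%s" % EXTRACTOR_ID,
        params={"_apikey": API_KEY, "url": TARGET_URL},
    )
    resp.raise_for_status()
    print(resp.json())  # the extracted fields, as returned by the extractor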

BigQuery API vs BigQuery Tools

I am looking at extracting data from BigQuery and I have found that it can be extracted using the API or tools. Does anyone know the advantages of using the API over the tools?
One API advantage I can think of is that data extraction can be scheduled for fixed time intervals. Are there any other advantages of using the API?
Basically, I want to know when to use the API vs. the tools.
To state it explicitly, the BigQuery tools including the BQ CLI, the Web UI, and even third party tools are leveraging the BigQuery API to enable whatever functionality they expose. Google also provides client libraries for many popular programming languages that make working with the API more straightforward.
Your question then becomes whether your particular needs are best served by using one of these tools or building your own integration with the API. If you're simply loading data into tables once an hour, perhaps a local cron job that calls the BQ CLI tool is sufficient. If you're streaming event records into a table as they happen, the API route may be more appropriate, as you're integrating more deeply into your own software stack.
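For example, the same query the Web UI or the bq CLI would run can be issued through the Python client library that wraps the API (the table name is hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT COUNT(*) AS n FROM `my_project.my_dataset.my_table`"
    for row in client.query(query).result():  # starts the job and waits
        print(row.n)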