BigQuery API vs BigQuery Tools - google-bigquery

I am looking at extracting data from BigQuery and I have found out that it can be extracted using the API or the tools. Does anyone know the advantages of using the API over the tools?
One advantage of the API I can think of is that data extraction can be scheduled at fixed time intervals. Are there any other advantages of using the API?
Basically, I want to know when to use the API vs. the tools.

To state it explicitly, the BigQuery tools, including the bq CLI, the web UI, and even third-party tools, all leverage the BigQuery API to provide whatever functionality they expose. Google also provides client libraries for many popular programming languages that make working with the API more straightforward.
Your question then becomes whether your particular needs are best served by using one of these tools or by building your own integration with the API. If you're simply loading data into tables once an hour, a local cron job that calls the bq CLI tool may be sufficient. If you're streaming event records into a table as they happen, the API route may be more appropriate, as you're integrating more deeply into your own software stack.
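To make the contrast concrete, here is a minimal sketch of the "build your own integration" route using the Python client library (google-cloud-bigquery); the bucket, dataset, and table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # let BigQuery infer the schema for this sketch
    )

    load_job = client.load_table_from_uri(
        "gs://example-bucket/events.csv",   # hypothetical source file
        "my_project.my_dataset.events",     # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes (raises on failure)

The bq CLI's load command drives this same API under the hood, so the choice is really about where you want that orchestration logic to live.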

Related

How to architect scheduled API to API integration

My organization moves data between systems for customers. These integrations are built in BizTalk and are mostly file-based, sometimes to/from APIs. More and more customers are switching to APIs, so we are facing more and more API-to-API integrations.
I'm mostly a backend developer, but I have been tasked with finding a more generic pattern or system for building these integrations; we are talking close to a thousand integrations.
That's not thousands of different APIs, though: many customers use the same sort of systems.
What I want is a solution that:
Fetches data from the source API
Transforms the data into the format for the target API
Sends the data to the target API
Another requirement is that it should be possible to set a schedule for when these jobs should run.
This is easily done in BizTalk, but as mentioned there will be thousands of integrations, and if we need to change something in one of the steps it will be a lot of work.
My vision is something that holds interfaces to all the APIs we communicate with and also contains the scheduled jobs we want to run between them, preferably with logging/tracking.
There must be something out there that does this?
Suggestions?
NOTE: No cloud-based solutions since they are not allowed in our organization.
You can easily implement this using the temporal.io open source project. You code your integrations in a general-purpose programming language, and Temporal ensures that each integration runs to completion in the presence of all sorts of intermittent failures. Scheduling is also supported out of the box.
Disclaimer: I'm a founder of the Temporal project.
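For illustration, a rough sketch of one such integration using Temporal's Python SDK (temporalio) might look like the following; the activity names and the customer_id parameter are placeholders, and the worker and schedule registration are omitted:

    from datetime import timedelta
    from temporalio import activity, workflow

    # Placeholder activities: each wraps one step of the integration.
    @activity.defn
    async def fetch_from_source(customer_id: str) -> list:
        ...  # call the source API

    @activity.defn
    async def transform(records: list) -> list:
        ...  # map the source format to the target format

    @activity.defn
    async def send_to_target(records: list) -> None:
        ...  # call the target API

    @workflow.defn
    class IntegrationWorkflow:
        @workflow.run
        async def run(self, customer_id: str) -> None:
            # Temporal retries each activity on intermittent failures and
            # persists progress, so the workflow runs to completion.
            records = await workflow.execute_activity(
                fetch_from_source, customer_id,
                start_to_close_timeout=timedelta(minutes=5))
            mapped = await workflow.execute_activity(
                transform, records,
                start_to_close_timeout=timedelta(minutes=5))
            await workflow.execute_activity(
                send_to_target, mapped,
                start_to_close_timeout=timedelta(minutes=5))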

Is there a way to get Splunk Data to BigQuery?

I have some app data which is currently stored in Splunk, but I am looking for a way to feed the Splunk data directly into BigQuery. My goal is to analyze the app data in BigQuery and perhaps create Data Studio dashboards on top of it.
I know there are a lot of third-party connectors that can help me with this, but I am looking for a solution where I can use features of Splunk or BigQuery to connect the two together without relying on third-party connectors.
Based on your comment indicating that you're interested in resources to egress data from Splunk into BigQuery with custom software, I would suggest using the REST API that each product exposes on its side of the transfer.
You don't indicate whether this is a one-time or a recurring need; that may affect where you want the software that performs this operation to run. If it's a one-time thing and you have a decent internet connection, you may just want to write a console application and run the migration from your own machine. If it's a recurring operation, you might instead look at any of the various "serverless" hosting options out there (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). In addition to development experience, note that you may have to pay an egress bandwidth cost on top of the normal service charges.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk to some external file that you load into Google Drive and then import into BigQuery, or to download the records as paged data via HTTPS so you can perform some ETL on them along the way (e.g. replace nulls with empty strings, update datetime values to match Google's exacting standards, etc.). If you go this route, it looks as though this is the documentation you'd use from Splunk, and on the BigQuery side you can ingest the data with either Google's newer, higher-performance Storage Write API or their legacy streaming API. Either option has SDKs for a variety of languages (e.g. C#, Go, Ruby, Node.js, Python, etc.), though only the legacy streaming API supports plain HTTP REST calls.
Finally, don't forget your OAuth2 concerns for authenticating on either side of the operation, though this is typically abstracted away by the various SDKs offered by either party and isn't something you'll usually have to deal with in detail.
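If you go the paged-HTTPS route, a rough sketch of the moving parts in Python might look like the following; the Splunk host, credentials, search string, and BigQuery table are placeholders, and the exact shape of Splunk's export response can vary by deployment:

    import json

    import requests
    from google.cloud import bigquery

    SPLUNK_EXPORT = "https://splunk.example.com:8089/services/search/jobs/export"

    resp = requests.post(
        SPLUNK_EXPORT,
        auth=("splunk_user", "splunk_password"),
        data={"search": "search index=app_events", "output_mode": "json"},
        stream=True,
    )
    resp.raise_for_status()

    rows = []
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        # The export endpoint emits one JSON object per line; the event fields
        # are typically under "result" (adjust to your deployment).
        result = event.get("result")
        if result:
            # Trivial ETL step: store the whole event as a single string column.
            rows.append({"raw": json.dumps(result)})

    client = bigquery.Client()
    errors = client.insert_rows_json("my_project.my_dataset.splunk_events", rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected some rows: {errors}")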

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files to Cloud Storage, have Dataflow enrich, transform, and aggregate the data, and then finally store it in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
Suppose you are building a big data product using the Google services. When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing. For example, one might parse a file and another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3. If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Store?
Have you looked at Cloud Pub/Sub (a topic subscription/publication service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, Cloud Dataflow can publish its progress and send status messages to a topic. Your front end (App Engine) just needs to subscribe to that topic and receive updates.
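As a rough sketch of that idea with the Pub/Sub client library (the project, topic, and subscription names are placeholders):

    from google.cloud import pubsub_v1

    project_id = "my-project"

    # Publisher side: a pipeline step (or any worker) reports progress.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, "file-status")
    publisher.publish(topic_path, b"PROCESSED", file_id="upload-123").result()

    # Subscriber side: the backend your front end talks to listens for updates.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, "file-status-sub")

    def callback(message):
        print(message.attributes.get("file_id"), message.data.decode())
        message.ack()

    # Returns a StreamingPullFuture; cancel it when shutting down.
    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)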
1)
Dataflow does not offer inspection of intermediate results. If a frontend wants more progress information about an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting rows to it at various points in the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. A frontend can then query for those results. (I would personally avoid overwriting old rows, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step, as sketched below.
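A minimal Apache Beam (Python) sketch of that branching, assuming a made-up status table and schema:

    import apache_beam as beam

    def to_status_row(element):
        return {"element_id": element["id"], "status": "PHASE-1 DONE"}

    with beam.Pipeline() as pipeline:
        phase1 = (
            pipeline
            | "Read" >> beam.Create([{"id": "element_idX"}])
            | "Phase1" >> beam.Map(lambda e: e)  # real processing goes here
        )

        # Branch 1: write progress rows the frontend can query.
        _ = (
            phase1
            | "ToStatusRow" >> beam.Map(to_status_row)
            | "WriteStatus" >> beam.io.WriteToBigQuery(
                "my_project:my_dataset.pipeline_status",
                schema="element_id:STRING,status:STRING")
        )

        # Branch 2: the same PCollection feeds the next stage of the pipeline.
        _ = phase1 | "Phase2" >> beam.Map(lambda e: e)

Depending on the runner and write method, WriteToBigQuery may also need a temp/staging location configured on the pipeline.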
2)
Is your microservice architecture using a "pipes and filters" pipeline-style approach? I.e. each service reads from a source (Kafka/RabbitMQ), writes its data out, and the next service consumes it?
Probably the best way to do this is to set up a few different Dataflow pipelines, output their results using a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub topic. You may also wish to sink the results to another location such as BigQuery/GCS, so that you can query them again later if you need to.
There is also the option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions, for example:
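As an illustration, one "filter" might look like the following background Cloud Function with a GCS trigger (1st-gen event signature); the project, bucket, and topic names are placeholders:

    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    TOPIC = publisher.topic_path("my-project", "parsed-files")

    def handle_upload(event, context):
        """Triggered when an object is finalized in the watched bucket."""
        bucket = event["bucket"]
        name = event["name"]
        # ... parse the uploaded file here ...
        # Hand the result to the next "filter" via Pub/Sub.
        payload = json.dumps({"bucket": bucket, "name": name}).encode("utf-8")
        publisher.publish(TOPIC, payload).result()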

Using Google Cloud Platform for scheduled recurring API data pull that then loads the data to BigQuery

I need to set up a scheduled daily job that pulls data using a REST API call and then inserts that data into BigQuery.
I traditionally have done these types of tasks using Node.js running on Heroku. My current boss wants me to achieve this using the Google Cloud Platform.
What are some ways to achieve this using Google Cloud Platform?
A few options on GCP:
Spin up a GCE instance and use cron (a little old school, but it will work) to run a small script like the one sketched below.
Use Google App Engine and schedule your job(s) that way.
Unfortunately, Google Cloud Functions don't yet support schedulers. Otherwise, that would be perfect.
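Whichever scheduler you choose, the job itself can stay small. A rough sketch in Python, assuming a hypothetical REST endpoint and BigQuery table:

    import requests
    from google.cloud import bigquery

    API_URL = "https://api.example.com/v1/daily-report"  # hypothetical endpoint
    TABLE_ID = "my_project.my_dataset.daily_report"      # hypothetical table

    def run():
        resp = requests.get(API_URL, timeout=60)
        resp.raise_for_status()
        rows = resp.json()  # assumed to be a list of flat JSON objects

        client = bigquery.Client()
        # Batch load (no streaming-insert cost); schema is auto-detected here.
        job = client.load_table_from_json(rows, TABLE_ID)
        job.result()  # raises if the load job failed

    if __name__ == "__main__":
        run()

Call it from cron on a GCE instance, or wrap run() in a handler invoked by an App Engine cron entry.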

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run it once or twice a day. Is the purpose of the Import.io API to let me build logic such as data storage and scheduled tasks (running queries multiple times a day) into my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they already provide the means to address this, then that would be a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set up a cron job to run when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. This will consume one of the requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
Unfortunately the script will not be a very simple one, since most websites return very different response structures to import.io, and as you may already know, the premium version of the tool now provides scheduling capabilities.
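As a minimal sketch, calling the live extraction endpoint above from Python might look like this; the extractor ID, API key, and target URL are placeholders:

    import requests

    EXTRACTOR_ID = "extractor_id"  # placeholder
    API_KEY = "apikey"             # placeholder
    TARGET_URL = "https://example.com/page-to-extract"

    resp = requests.get(
        f"https://extraction.import.io/query/extractor/{EXTRACTOR_ID}",
        params={"_apikey": API_KEY, "url": TARGET_URL},
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()  # the structure varies by extractor and target site
    print(data)

Each call like this counts as one request against your plan's limit.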