When using a real-time pipeline, unable to feed data to BigQuery from GCS

I have developed a real-time pipeline in Data Fusion to fetch data from Pub/Sub, feed it into GCS, and then into BQ. However, after GCS (which is available as a sink), I am not able to feed the data into BQ, because GCS is only available as a sink and hence does not expose any output schema. Is there any way I can create a pipeline to take the data from GCS to BQ?

To provide a possible solution: it is not possible to connect a sink to another sink. Based on the question, my guess is that you are trying to connect the GCS sink plugin to the BQ sink and have data flow from one sink to the other. That is not possible by design with Data Fusion pipelines.
Instead of pushing the data to one sink after the other, you can push it from the Pub/Sub source to both the BQ and GCS sinks simultaneously, by fanning the source out into the two sinks in the same pipeline.
Hope this helps.

Related

Load batch CSV files from Cloud Storage to BigQuery and append to the same table

I am new to GCP and recently created a bucket on Google Cloud Storage. Raw files are dumped into the GCS bucket every hour in CSV format.
I would like to load all the CSV files from Cloud Storage into BigQuery, with a scheduled job that loads the recent files from Cloud Storage and appends the data to the same table in BigQuery.
Please help me set this up.
There are many options, but I will present only two:
First option: you can do nothing and use an external table in BigQuery. That means you leave the data in Cloud Storage and ask BigQuery to query it directly from Cloud Storage. You don't duplicate the data (and pay less for storage), but queries are slower (the data has to be read from less performant storage and the CSV parsed on the fly) and every query processes all the files. You also can't use advanced BigQuery features such as partitioning, clustering and others...
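A minimal sketch of this first option with the BigQuery Python client library, assuming CSV files with a header row; the project, dataset, table and bucket names are placeholders:

    # Define an external (federated) table whose data stays in Cloud Storage.
    # All names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/raw/*.csv"]
    external_config.autodetect = True               # or supply an explicit schema
    external_config.options.skip_leading_rows = 1   # assumes a header row

    table = bigquery.Table("my-project.my_dataset.raw_events_external")
    table.external_data_configuration = external_config
    client.create_table(table)  # queries on this table read the CSVs directly from GCS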
Second option: perform a BigQuery load operation to load all the existing files into a BigQuery table (I recommend partitioning the table if you can). For new files, forget the old-school scheduled ingestion process. With the cloud, you can be event driven: catch the event that notifies you of a new file in Cloud Storage and load it directly into BigQuery, as in the sketch after this answer. You have to write a small Cloud Function for that, but it's the most efficient and most recommended pattern. You can find code samples here
Just a warning on the latter solution: you can perform "only" 1,500 load jobs per day and per table (about 1 per minute)
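A minimal sketch of that event-driven load, assuming a Cloud Function (Python runtime) triggered by the object finalize event on the landing bucket; the table ID and CSV options are placeholders:

    # main.py - Cloud Function triggered by a google.storage.object.finalize
    # event on the landing bucket. The table ID below is a placeholder.
    from google.cloud import bigquery

    bq_client = bigquery.Client()
    TABLE_ID = "my-project.my_dataset.raw_events"  # hypothetical target table

    def load_csv_to_bq(event, context):
        """Appends the newly finalized CSV object to the BigQuery table."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,      # assumes a header row
            autodetect=True,          # or supply an explicit schema
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = bq_client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
        load_job.result()  # wait for completion; raises on error
        print(f"Loaded {uri} into {TABLE_ID}")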

Send BigQuery data to a REST endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via a POST request. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and hits the max execution time).
I was looking at Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send the data via POST.
Do you know another option?
As mentioned in the comments, 500k rows for a POST method is far too much data to be considered as an option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations during its jobs. You can use BigQueryIO (with Python sample code), but if you just need to move the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like this: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data into multiple files, using a wildcard URI for the bucket destination as indicated ('gs://my-bucket/file-name-*.json').
Finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you can download the data from your bucket.
Note: there are more ways to do this in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
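If you prefer the client libraries over the CLI, a rough Python equivalent of the bq extract and gsutil cp steps could look like the sketch below; the dataset, table and bucket names are the same placeholders as above:

    # Rough equivalent of the bq extract / gsutil cp steps with the Python
    # client libraries; dataset, table and bucket names are placeholders.
    from google.cloud import bigquery, storage

    bq_client = bigquery.Client()

    # Export the table to GCS; the wildcard URI lets large tables split
    # into multiple compressed shards.
    extract_job = bq_client.extract_table(
        "mydataset.mytable",
        "gs://example-bucket/myfile-*.csv.gz",
        job_config=bigquery.ExtractJobConfig(
            destination_format=bigquery.DestinationFormat.CSV,
            compression=bigquery.Compression.GZIP,
        ),
    )
    extract_job.result()  # wait for the export to finish

    # Download the exported shards locally (the gsutil cp step).
    storage_client = storage.Client()
    for blob in storage_client.list_blobs("example-bucket", prefix="myfile-"):
        blob.download_to_filename(blob.name)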

Automatically detect changes in GCS for BigQuery

I have a BigQuery table whose data source is a bucket in GCS (Google Cloud Storage).
The bucket changes constantly, with new files added all the time. Is there any mechanism for BigQuery to automatically detect the changes in GCS and sync with the latest data?
Thanks!
There is a very cool beta feature you can use to do that. Check out the BigQuery Data Transfer Service for Cloud Storage. You can schedule transfers, run backfills, and much more.
Read the "limitations" section to see if it can work for you.

How to read from BigQuery as a stream

I'm using Java + Apache Beam SDK for Java 2.0.1-SNAPSHOT
Scenario:
Read Data from BigQuery(BQ) -> ETL Process in Dataflow -> Write Data in BQ tables
The problem is that the pipeline is trying to process all data before performing the insertion in BQ.
Is there a way to execute stream inserts in this case? I've already tried to set a timestamp to the elements when extracting from BQ, but it didn't work.
Or is it possible to configure BatchLoads so that it inserts batches of data from time to time?
I would take a look at this link to Google's solution. That being said, BigQuery sounds like it is being treated as a bounded source, but that shouldn't be a problem for writing the data back to BQ from Dataflow; see here.
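One way to get per-record inserts rather than a single bulk write is to request the streaming-inserts method on the BigQuery write. A minimal sketch with Beam's Python SDK is below (the question uses the Java SDK, where recent versions expose the analogous BigQueryIO.Write.Method.STREAMING_INSERTS option); the query and table names are placeholders:

    # Sketch with the Apache Beam Python SDK; the query and table names are
    # placeholders, and the ETL step is only an identity placeholder.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `my-project.my_dataset.source_table`",
                use_standard_sql=True,
            )
            | "Transform" >> beam.Map(lambda row: row)  # your ETL logic here
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.target_table",
                method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )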

Export BigQuery data into an in-house Hadoop cluster

We have GA data in BigQuery, and some of my users want to join it to in-house data in Hadoop, which we cannot move to BigQuery.
Please let me know what is the best way to do this.
See BigQuery to Hadoop Cluster - How to transfer data?:
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class.
Write a query to select the appropriate BigQuery objects.
Split the results of the query evenly among the Hadoop nodes.
Parse the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes)
You could follow the route of the Hadoop connector as Felipe Hoffa suggested, or build your own application to transfer data from BigQuery to your Hadoop cluster. Either way, you will be able to make the required joins on the Hadoop cluster using Pig, Hive, etc.
In case you want to try the application method, let me take you through a process flow which your application may need to follow:
Query BQ tables (flatten any nested or repeated fields)
If your query response is too large, you can divert this response into a destination table. Your destination table is simply another table in BigQuery.
You can then export this destination table to a GCS bucket; this is another API request. You will have options to choose an export format and compression type, and to split the data into multiple files, etc. (a sketch of these steps follows at the end of this answer).
From the GCS bucket, using a tool called gsutil, you can copy the files to your cluster gateway machine.
From your cluster gateway machine, you can use the hadoop command 'copyFromLocal' to copy this data to your HDFS directory.
Once it is in an HDFS directory, you can create a Hive external table pointing to that directory. Your data will now be available in the Hive table, ready to be joined with the in-house data on your cluster.
Let me know if you need any more details or clarifications. I went down this route because I found the connector alternative a little too complex, but that is a subjective opinion that varies from person to person.
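For the first few steps of that flow, a rough sketch with the BigQuery Python client is below; the project, dataset, table and bucket names are placeholders, and the GA field names are only illustrative:

    # Rough sketch of steps 1-3 above; all names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Steps 1-2: run the query, flattening repeated fields, and divert the
    # (potentially large) result into a destination table.
    query_job = client.query(
        "SELECT visitId, hit.page.pagePath AS page_path "
        "FROM `my-project.ga_dataset.sessions`, UNNEST(hits) AS hit",
        job_config=bigquery.QueryJobConfig(
            destination="my-project.export_staging.flattened_sessions",
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        ),
    )
    query_job.result()

    # Step 3: export the destination table to a GCS bucket, split into
    # multiple compressed files via the wildcard URI.
    extract_job = client.extract_table(
        "my-project.export_staging.flattened_sessions",
        "gs://my-export-bucket/flattened-*.csv.gz",
        job_config=bigquery.ExtractJobConfig(
            destination_format=bigquery.DestinationFormat.CSV,
            compression=bigquery.Compression.GZIP,
        ),
    )
    extract_job.result()

    # Steps 4-5 are shell commands run on the cluster gateway machine, e.g.:
    #   gsutil cp gs://my-export-bucket/flattened-*.csv.gz /tmp/ga_export/
    #   hadoop fs -copyFromLocal /tmp/ga_export/flattened-*.csv.gz /data/ga/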