Export BigQuery data into an in-house Hadoop cluster - google-bigquery

We have GA data in BigQuery, and some of my users want to join it to in-house data in Hadoop, which we cannot move to BigQuery.
Please let me know the best way to do this.

See BigQuery to Hadoop Cluster - How to transfer data?:
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which you use to:
Write a query to select the appropriate BigQuery objects.
Split the results of the query evenly among the Hadoop nodes.
Parse the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(The connector uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)

You could follow the Hadoop connector route as Felipe Hoffa suggested, or build your own application that transfers data from BigQuery to your Hadoop cluster. Either way, you will be able to make the required joins on the Hadoop cluster using Pig, Hive, etc.
If you want to try the application approach, let me take you through a process flow your application may need to follow:
Query BQ tables (flatten any nested or repeated fields)
If your query response is too large, you can divert the response into a destination table. The destination table is simply another table in BigQuery.
You can then export this destination table to a GCS bucket; this is a separate extract job. You will have options to choose an export format, a compression type, whether to split the data into multiple files, etc.
From the GCS bucket, using a tool called gsutil, you can copy the files to your cluster gateway machine.
From your cluster gateway machine, you can use the Hadoop command 'hadoop fs -copyFromLocal' to copy this data into your HDFS directory.
Once it is in an HDFS directory, you can create a Hive external table pointing to that HDFS directory. Your data will now be available in the Hive table, ready to be joined with the in-house data on your cluster.
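To make this flow concrete, here is a hedged sketch using the BigQuery Python client plus shell commands for the copy steps. Every project, dataset, table, bucket, and path name below is a placeholder, not anything from the original question.

```python
# Hedged sketch of the flow above; all names and paths are placeholders.
import subprocess

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1) Run the query and divert the (possibly large) result into a destination table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.staging.ga_flattened",
    write_disposition="WRITE_TRUNCATE",
)
sql = "SELECT fullVisitorId, visitStartTime FROM `my-project.ga_dataset.sessions`"
client.query(sql, job_config=job_config).result()

# 2) Export the destination table to a GCS bucket as compressed CSV shards.
extract_config = bigquery.ExtractJobConfig()
extract_config.destination_format = "CSV"
extract_config.compression = "GZIP"
client.extract_table(
    "my-project.staging.ga_flattened",
    "gs://my-export-bucket/ga_flattened/part-*.csv.gz",
    job_config=extract_config,
).result()

# 3) On the cluster gateway machine: pull the files down, then push them into HDFS.
subprocess.run(
    ["gsutil", "-m", "cp", "gs://my-export-bucket/ga_flattened/*", "/data/ga_flattened/"],
    check=True,
)
subprocess.run(
    ["hadoop", "fs", "-copyFromLocal", "/data/ga_flattened", "/user/data/ga_flattened"],
    check=True,
)
# 4) Finally, create a Hive external table with LOCATION '/user/data/ga_flattened'
#    so the exported data can be joined with the in-house tables.
```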
Let me know if you need any more details or clarification. I went down this route because I found the connector alternative a little too complex, but that is a subjective opinion that varies from person to person.

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the following two points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, but slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage, so you don't have to define it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to make that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for dev-to-prod migrations as well.
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.
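For illustration, here is a hedged sketch (via the Python connector; the account, stage, table, and credential values are all placeholders) of creating an external stage with a file format attached and then loading from it:

```python
# Hedged sketch; account, credentials, bucket, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# The external stage bundles the S3 location, credentials, and file format,
# so COPY INTO statements stay short and the bucket can move without code changes.
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/exports/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
""")

# Load everything under the stage path into the target table.
cur.execute("COPY INTO my_table FROM @my_s3_stage")

# Without a named stage you would instead repeat the location, credentials,
# and file format on every COPY INTO, e.g.:
# COPY INTO my_table FROM 's3://my-bucket/exports/'
#   CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
#   FILE_FORMAT = (TYPE = CSV)

cur.close()
conn.close()
```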

Exporting data from the Ignite cache

I see multiple examples of loading and processing data with Apache Ignite, but how do I export data from the Ignite cache after it's been processed?

I'm planning to implement processing of some large CSV files on a cluster. Say it's a simple transformation that preprocesses data in a specific column. After I'm finished with it, how do I get it off the cache and into an S3 bucket or some other location? My data will be partitioned across the nodes for speed of loading and loaded as a key-value cache.
Is there a standard mechanism to export data from a cache (CSV in / CSV out)? I've found that ML models can leverage the Exporter APIs, but that's not my use case.
Are scan queries a standard way to achieve what I want?
If you want to export the entire data set, then yes: a ScanQuery in combination with affinityRun for every partition is probably the most efficient way to iterate over all cache entries and export them.
With affinityRun, every node exports its own part of the data instead of pulling all the data to a single node for export.
The sqlline utility that ships with Apache Ignite can also write CSV files with !outputFormat csv.
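As a concrete (if simplified) illustration, here is a hedged sketch using the Python thin client's cache scan rather than the Java ScanQuery/affinityRun approach described above; the cache name, endpoint, and CSV layout are all assumptions:

```python
# Hedged sketch using the pyignite thin client; cache name, endpoint, and the
# assumption that values are simple scalars are placeholders. Note this streams
# entries through the client instead of exporting on each node via affinityRun.
import csv

from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

cache = client.get_cache("myProcessedCache")

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["key", "value"])
    # scan() iterates over all entries in the cache, page by page.
    for key, value in cache.scan():
        writer.writerow([key, value])

client.close()
```

The resulting CSV can then be pushed to S3 or wherever the processed data needs to land.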

Hive partitioning layout table format in BigQuery

I have several questions wrapped up in this situation, so here goes:
Has anyone ever written Kafka's output to a Google Cloud Storage (GCS) bucket, such that the data in that bucket is partitioned using the "default hive partitioning layout"?
The intent behind doing that is that this external table needs to be queryable in BigQuery.
Google's documentation on that is here, but I wanted to see if someone has an example: https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs
For example, the documentation says "files follow the default layout, with the key/value pairs laid out as directories with an = sign as a separator, and the partition keys are always in the same order."
What's not clear is
a) Does Kafka create these directories on the fly, or do I have to pre-create them? Let's say I want to have Kafka write to directories based on date in GCS:
gs://bucket/table/dt=2020-04-07/
Tonight, after midnight, do I have to pre-create the new directory gs://bucket/table/dt=2020-04-08/, or can Kafka create it for me? And in all this, how does the Hive partitioning layout help me?
Does my table's data, which I am trying to put in these directories every day, need to have "dt" (from gs://bucket/table/dt=2020-04-07/) as a column in it?
The goal in all this is to have BigQuery query this external table, which under the hood references all the data in this bucket, i.e.
gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/
Just trying to see if this would be the right approach for it.
Kafka itself is a messaging system that lets you exchange data between processes, applications, and servers, but it requires producers and consumers (here is an example) that move the data. For instance:
The Producer needs to send the data in a format that BigQuery can read.
And the Consumer needs to write the data with a valid Hive Layout.
The Consumer should write to GCS, so you would need to find the proper connector for your application (e.g. this Java connector or the Confluent connector). And when writing the messages to GCS, you need to take care to use a valid 'default hive partitioning layout'.
For example, in gs://bucket/table/dt=2020-04-07/, dt is a column the table is partitioned on and 2020-04-07 is one of its values, so take care with this. Once you have a valid Hive layout in GCS, you need to create a table in BigQuery. I recommend creating it from the UI, selecting Google Cloud Storage as the source and enabling source data partitioning, but you can also use --hive_partitioning_source_uri_prefix and --hive_partitioning_mode to link the GCS data to a BigQuery table.
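As a rough illustration (not a production-grade connector), here is a hedged Python sketch of a consumer that writes each batch of messages under a dt=YYYY-MM-DD prefix in GCS; the topic, broker, bucket, and object names are assumptions. Note that GCS "directories" are just object-name prefixes, so neither you nor Kafka has to pre-create tomorrow's dt=... directory: it appears as soon as the first object with that prefix is written.

```python
# Hedged sketch; topic, broker, bucket, and naming are placeholders. GCS has no
# real directories: writing an object named table/dt=2020-04-08/batch-....json
# "creates" that partition path implicitly.
import datetime
import uuid

from google.cloud import storage
from kafka import KafkaConsumer

consumer = KafkaConsumer("my-topic", bootstrap_servers="broker:9092",
                         value_deserializer=lambda v: v.decode("utf-8"))
bucket = storage.Client().bucket("bucket")

batch, batch_size = [], 1000
for message in consumer:
    batch.append(message.value)
    if len(batch) >= batch_size:
        dt = datetime.date.today().isoformat()  # partition value, e.g. 2020-04-08
        blob_name = f"table/dt={dt}/batch-{uuid.uuid4().hex}.json"
        bucket.blob(blob_name).upload_from_string("\n".join(batch))
        batch = []
```

Also note that the dt value lives in the object path rather than inside the files, so it does not have to be a column in the data itself; BigQuery infers the partition column from the path prefix.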
As all of this implies different layers of development and configuration, if this process makes sense for you, I recommend opening new questions for any specific errors you run into.
Last but not least, the Kafka to BigQuery connector and other connectors for ingesting from Kafka into GCP may serve you better if the Hive layout is not mandatory for your use case.
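And if you do go the Hive-layout route, here is a hedged sketch of defining the hive-partitioned external table with the BigQuery Python client, equivalent in spirit to the UI and bq-flag options mentioned above (project, dataset, table, and bucket names are placeholders):

```python
# Hedged sketch; project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://bucket/table/*"]
external_config.autodetect = True

hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"  # infer dt and its type from the dt=... path segments
hive_options.source_uri_prefix = "gs://bucket/table"
external_config.hive_partitioning = hive_options

table = bigquery.Table("my-project.my_dataset.kafka_events")
table.external_data_configuration = external_config
client.create_table(table)
```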

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this one, but it is too slow and hits the max execution time).
I was looking for Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data for a POST request to be a workable option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations during its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data to multiple files using a wildcard URI as your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
Finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you can download the data from your bucket.
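The same extract-and-download steps can also be scripted with the Python clients; a hedged sketch, with placeholder project, dataset, table, and bucket names:

```python
# Hedged sketch; project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery, storage

bq = bigquery.Client()

# Export to multiple compressed files via a wildcard URI (required above ~1 GB).
extract_config = bigquery.ExtractJobConfig()
extract_config.destination_format = "NEWLINE_DELIMITED_JSON"
extract_config.compression = "GZIP"
bq.extract_table(
    "my-project.mydataset.mytable",
    "gs://my-bucket/file-name-*.json.gz",
    job_config=extract_config,
).result()

# Download the exported files from the bucket (the equivalent of gsutil cp).
gcs = storage.Client()
for blob in gcs.list_blobs("my-bucket", prefix="file-name-"):
    blob.download_to_filename(blob.name)
```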
Note: there are other ways to do this covered in the Cloud documentation, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.

Can we run ETL jobs on AWS EFS?

I would like to know if we can run ETL jobs on files in an EFS mount.
If so, how? Is it using Hive or any other service?
Our target is to reduce all the files in one mount point to one file, and store that one file in S3 for better processing.
EFS in itself does not include a particular data warehouse product. For data warehousing and ETL you can choose whatever you want that operates in the AWS environment.
On to your problem:
You want to concatenate or in some way combine all of the files currently in your EFS mount into a single file and store that in S3, if I understand it correctly.
You do not mention what type of data you have or what type of files you want to combine, and that makes a huge difference in how you would do this, so I will have to give general suggestions. If you have different types of data (SQL tables from different databases, documents, non-SQL data), then you need to determine how to combine that data. For that you would be looking at a data integration solution that can accommodate raw data.
AWS has a few different products that may assist the process, such as Redshift, Athena, and its ETL solution Glue; third-party warehouses like Snowflake are also an option. Which products you add depends on your company's needs and budget.
So, a more flexible data integration approach would be to use ELT (extract, load, transform) instead of ETL. Basically, you would create an appropriate file in your S3 bucket, then extract each file on EFS one at a time and load it into that S3 file. When you query the data in your S3 file, you would perform any transformations needed before seeing the query results. Here's an article that explains the differences in more detail: https://blog.panoply.io/etl-vs-elt-the-difference-is-in-the-how.
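For the concrete "many EFS files into one S3 object" step, here is a hedged sketch; the mount path, bucket, and key are placeholders, and it assumes plain text/CSV files that share a schema (otherwise simple concatenation does not make sense):

```python
# Hedged sketch; mount path, bucket, and key are placeholders, and the files
# are assumed to be CSVs with the same header and schema.
import glob

import boto3

MOUNT = "/mnt/efs/data"          # EFS mount point
COMBINED = "/tmp/combined.csv"   # single output file

with open(COMBINED, "w") as out:
    for i, path in enumerate(sorted(glob.glob(f"{MOUNT}/*.csv"))):
        with open(path) as src:
            lines = src.readlines()
            # keep the header only from the first file
            out.writelines(lines if i == 0 else lines[1:])

boto3.client("s3").upload_file(COMBINED, "my-bucket", "combined/combined.csv")
```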
There are some vendors supporting the ELT process, such as Talend, Hadoop/Hive/Spark, Teradata, and Informatica, should you want to investigate options.