How to execute a BigQuery SQL script using Apache Beam? - google-bigquery

BigQuery provides scripting, which I want to use to create TEMP TABLEs so that a complex query does not have to compute some intermediate CTEs twice.
Scripting prohibits setting destination_table in the job configuration.
At the same time, beam.io.BigQuerySource, which I use, internally relies on destination_table.
Is there any workaround to use BQ scripting in Apache Beam?
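One possible workaround (a rough sketch only, not an official pattern; the project, dataset and table names are placeholders): run the multi-statement script outside the pipeline with the google-cloud-bigquery client, have the script materialize its final result into a regular staging table, and then read that table with Beam.

    # Sketch: execute the BQ script first, then read the materialized table in Beam.
    # Assumes google-cloud-bigquery and apache-beam[gcp]; all names are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import bigquery

    script = """
    DECLARE cutoff DATE DEFAULT DATE '2020-01-01';

    CREATE TEMP TABLE intermediate AS
    SELECT user_id, COUNT(*) AS cnt
    FROM `my-project.my_dataset.events`
    WHERE event_date >= cutoff
    GROUP BY user_id;

    -- Materialize the final result into a normal table that Beam can read.
    CREATE OR REPLACE TABLE `my-project.my_dataset.staging_result` AS
    SELECT * FROM intermediate WHERE cnt > 10;
    """

    bigquery.Client(project="my-project").query(script).result()  # wait for the script

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadStaging" >> beam.io.ReadFromBigQuery(
               table="my-project:my_dataset.staging_result")
         | "Print" >> beam.Map(print))

The staging table can be dropped afterwards; the point is simply that the scripting (with its TEMP TABLE) runs as a plain query job, while Beam only ever reads an ordinary table.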

Related

Replicate data from Cloud SQL Postgres to BigQuery

I am looking for the recommended way of streaming database changes from Cloud SQL (Postgres) to BigQuery. I see that CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres into BigQuery. You can do that in many ways, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves using the COPY command to migrate data from PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement, with SQL functions or PL/pgSQL procedures, which gives a lot of flexibility to extract data as a full dump or incrementally (a rough sketch of this flow is shown below). Be aware that it is a time-consuming process and would need you to invest engineering bandwidth!
Also, you could try different tech stacks to implement the above; the one I'd recommend is Java Spring Data Flow.
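For illustration, a minimal sketch of that COPY-plus-load flow (using psycopg2 and the google-cloud-bigquery client; the connection settings, table names and local file path are all placeholders):

    # Sketch of Way 1: export from Postgres with COPY, then load the CSV into BigQuery.
    # psycopg2 and google-cloud-bigquery are assumed; every name here is a placeholder.
    import psycopg2
    from google.cloud import bigquery

    # 1. Extract: COPY the query result out of Postgres as CSV.
    pg = psycopg2.connect(host="localhost", dbname="appdb", user="etl", password="secret")
    with pg, pg.cursor() as cur, open("/tmp/orders.csv", "w") as f:
        cur.copy_expert(
            "COPY (SELECT * FROM orders WHERE updated_at >= now() - interval '1 day') "
            "TO STDOUT WITH CSV HEADER",
            f,
        )

    # 2. Load: push the CSV into BigQuery, letting it autodetect the schema.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    with open("/tmp/orders.csv", "rb") as f:
        bq.load_table_from_file(f, "my_dataset.orders", job_config=job_config).result()

Scheduling this (cron, Cloud Scheduler, Airflow, etc.) and making the WHERE clause incremental is where the engineering bandwidth mentioned above goes.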
Way 2:
Using DataFlow
You can automate the ETL process using GCP's Dataflow without coding your own solution. It is faster, but it comes at a cost, of course.
Dataflow is unified stream and batch data processing that's serverless, fast, and cost-effective.
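For a rough idea of what this looks like with Beam's Python SDK (a sketch only: ReadFromJdbc is a cross-language transform that needs Java available for its expansion service, parameter names may vary slightly by Beam version, and every connection detail, query and schema below is a placeholder):

    # Sketch of Way 2: a Beam pipeline (runnable on Dataflow) that reads Postgres
    # over JDBC and writes to BigQuery. All names and credentials are placeholders.
    import apache_beam as beam
    from apache_beam.io.jdbc import ReadFromJdbc
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadPostgres" >> ReadFromJdbc(
               table_name="orders",
               driver_class_name="org.postgresql.Driver",
               jdbc_url="jdbc:postgresql://10.0.0.5:5432/appdb",
               username="etl",
               password="secret",
               query="SELECT id, customer_id, amount FROM orders")
         | "ToDict" >> beam.Map(lambda row: {
               "id": row.id, "customer_id": row.customer_id, "amount": row.amount})
         | "WriteBQ" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.orders",
               schema="id:INTEGER,customer_id:INTEGER,amount:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

Run it with --runner=DataflowRunner plus project/region/temp_location options to execute on Dataflow instead of locally.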
Check the Dataflow documentation for more details.

How to execute a macro in BigQuery

I have a requirement to move some of the existing frontend applications running Teradata as the backend to Google BigQuery. One of the common patterns used in these frontend applications is to call a macro in Teradata, based on different inputs selected by users. Considering BigQuery doesn't have a way to create a macro entity, how can I replace this and have the frontend call BigQuery to execute something similar? The connection to BigQuery is through ODBC/JDBC or Java services.
A macro in Teradata is just a way to execute multiple SQL statements as a single request, which is in turn treated as a single transaction. It also allows you to parameterize your query.
If your new DB backend supports it, you can convert the macros into stored procedures / functions. Otherwise, you can pull out the individual SQL statements from the macro and try to run them together as a single transaction.
These links may be helpful: Functions, DML
Glancing at the documentation, it looks like writing a function may be your best bet: "There is no support for multi-statement transactions."
You can look at BigQuery scripting, which is in beta (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#bigquery-scripting), for migrating your macros from Teradata. With this release you can write procedures in which you define all your business logic and then execute the procedure using a CALL statement.
Thanks,
Jayadeep
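To make the procedure route concrete, a minimal sketch (every project, dataset, table, procedure and parameter name here is made up): the macro body becomes a stored procedure defined with scripting, and the frontend issues a CALL per request, much as it previously invoked the macro.

    # Sketch: define a parameterized procedure with BigQuery scripting, then CALL it
    # the way the frontend used to call a Teradata macro. Names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # One-time setup: the procedure plays the role of the old macro.
    client.query("""
    CREATE OR REPLACE PROCEDURE `my-project.my_dataset.report_sales`(IN region STRING)
    BEGIN
      SELECT product, SUM(amount) AS total
      FROM `my-project.my_dataset.sales`
      WHERE sales_region = region
      GROUP BY product;
    END;
    """).result()

    # Per request from the frontend: execute the "macro" with the user's input.
    job = client.query("CALL `my-project.my_dataset.report_sales`('EMEA')")
    for row in job.result():  # for a procedure ending in a SELECT, these are its rows
        print(row.product, row.total)

The same CALL statement can be issued over the ODBC/JDBC drivers or the Java client, so the frontend change is largely limited to swapping the macro invocation for a CALL.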
As mentioned above:
A macro in Teradata is just a way to execute multiple SQL statements
as a single request, which is in turn treated as a single transaction.
It also allows you to parameterize your query.
Having said that, you just need to handle the migration from Teradata (Google publishes a Teradata-to-BigQuery migration guide for this). Answering your question: the connection is made through JDBC, whose drivers are tdgssconfig.jar and terajdbc4.jar.
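For completeness, a rough sketch of what using those drivers from Python could look like (jaydebeapi is just one option; the host, credentials, database and query are placeholders):

    # Sketch: connect to Teradata over JDBC with the terajdbc4/tdgssconfig drivers
    # mentioned above, e.g. to pull data out during migration. Details are placeholders.
    import jaydebeapi

    conn = jaydebeapi.connect(
        "com.teradata.jdbc.TeraDriver",
        "jdbc:teradata://td-host/DATABASE=sales,TMODE=ANSI",
        ["etl_user", "etl_password"],
        jars=["terajdbc4.jar", "tdgssconfig.jar"],
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT product, amount FROM sales.orders SAMPLE 10")
        for product, amount in cur.fetchall():
            print(product, amount)
    finally:
        conn.close()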

Impala OR hive with SPARK as execution engine?

I want to design a Web UI which fetches data from HDFS. I want to generate some reports using this data, which is stored in HDFS, and I have my own custom report format. I am writing REST APIs to fetch the data. But running Hive queries gives latency issues, hence I want a different approach. I could think of two:
Using Impala to create tables, but I am not sure about REST support for Impala.
Using Hive, but with Spark as the execution engine instead of MR.
spark-job-server provides REST support, and I can fetch data with Spark SQL.
Which of the approaches will be suitable, or is there any better approach for this?
Please can anyone help, as I am very new to this.
I'd prefer Impala if latency is the main consideration. It's dedicated to SQL processing on HDFS and does it well. As for the REST API and the application logic you are trying to achieve, a thin service that queries Impala directly is one good example of the pattern (a sketch follows below).
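For illustration only, a minimal sketch of such a REST layer using impyla and Flask (host, port, table and endpoint are all made up; add auth, connection pooling and error handling for anything real):

    # Sketch: a tiny REST endpoint that runs a query against Impala and returns JSON.
    # impyla and Flask are assumed installed; host/port/table names are placeholders.
    from flask import Flask, jsonify
    from impala.dbapi import connect

    app = Flask(__name__)

    @app.route("/reports/daily")
    def daily_report():
        conn = connect(host="impala-host", port=21050)  # Impala's HiveServer2 port
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT report_date, SUM(amount) AS total "
                "FROM warehouse.sales GROUP BY report_date"
            )
            cols = [d[0] for d in cur.description]
            return jsonify([dict(zip(cols, row)) for row in cur.fetchall()])
        finally:
            conn.close()

    if __name__ == "__main__":
        app.run(port=8080)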

Presto and Hive

I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table configurations in Presto, or whether I have to create the tables in Hive first.
The documentation makes it seem that you can use Presto without having to configure Hive, while still using Hive syntax. Is that accurate? In my experience, I have not been able to get the connection to AWS S3 working.
Presto syntax is similar to Hive syntax. For most simple queries, the identical syntax would function in both. However, there are some key differences that make Presto and Hive not entirely the same thing. For example, in Hive, you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such examples of nuanced syntactical differences between the two.
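As a tiny illustration of that particular difference (the table and column names are hypothetical, and the queries are just shown as strings):

    # Flattening an array column: the same logical query, once per engine.
    # Hypothetical table: orders(id INT, items ARRAY<STRING>).

    hive_query = """
    SELECT id, item
    FROM orders
    LATERAL VIEW explode(items) t AS item
    """

    presto_query = """
    SELECT id, item
    FROM orders
    CROSS JOIN UNNEST(items) AS t (item)
    """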
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only the distributed execution engine; it lacks metadata information about tables. Thus, the Presto coordinator needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option: the recent 0.198 release of Presto adds the capability to connect to AWS Glue and retrieve table metadata for files in S3.
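To show roughly what the Hive-metastore route looks like once it is wired up, here is a sketch (it assumes the presto-python-client package, a Hive catalog named hive, a Presto version recent enough to support the CSV storage format, and placeholder hosts, paths and columns; the CSV format requires all columns to be VARCHAR):

    # Sketch: register an external table over CSV files in S3 through Presto's
    # Hive connector, then query it. Hosts, names, schema and S3 path are placeholders.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto-coordinator", port=8080, user="analyst",
        catalog="hive", schema="default",
    )
    cur = conn.cursor()

    # The DDL goes through the Hive connector, so the Hive metastore must be reachable.
    cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.default.events_csv (
        event_time VARCHAR,
        user_id    VARCHAR,
        amount     VARCHAR
    )
    WITH (
        format = 'CSV',
        external_location = 's3://my-bucket/events/'
    )
    """)
    cur.fetchall()  # the client is lazy; fetching forces the DDL to run

    cur.execute("SELECT count(*) FROM hive.default.events_csv")
    print(cur.fetchall())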
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query/process data living in S3/CSV formats.
Also, I'm curious: what solution did you end up implementing to resolve your issue?
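If the Spark route is appealing, a minimal sketch (the bucket, path and header option are assumptions; it also assumes the s3a connector and AWS credentials are already configured on the cluster):

    # Sketch: query CSV files sitting in S3 with Spark SQL.
    # Assumes hadoop-aws/s3a and AWS credentials are configured; paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-csv-sql").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://my-bucket/exports/"))

    df.createOrReplaceTempView("exports")
    spark.sql("SELECT count(*) AS row_count FROM exports").show()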

Spark SQL - SQL scripts processing

I'm new to Spark and would like to know if there is any possibility to pass Spark an SQL script for processing.
My goal is to bring data from both MySQL (through JDBC) and Cassandra into Spark and pass it an SQL script file with no, or only minimal, modifications. The reason I'm saying minimal modifications is that I have a lot of SQL scripts (similar in structure to stored procedures) which I don't want to convert to RDDs manually.
The main purpose is to process the data (execute these SQL scripts) through Spark, thus taking advantage of its capabilities and speed.
This guy found a pretty general way to run SQL scripts; just pass in the connection to your database:
https://github.com/syncany/syncany/blob/15dc5344696a800061e8b363f94986e821a0b362/syncany-lib/src/main/java/org/syncany/util/SqlRunner.java
One limitation is that each of the statements in your SQL script has to be delimited with a semi-colon. It basically just parses the script like a text document and executes each statement as it goes. You could probably modify it to take advantage of Spark's SQLContext, instead of using a Connection.
In terms of performance, it probably won't be as fast as a stored procedure because you're bottlenecking on the InputStream. But it is a workaround.
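Along those lines, here is a minimal sketch of the same idea on top of Spark (the script path is a placeholder, and the naive semicolon split is an assumption that breaks on statements containing literal semicolons):

    # Sketch: run a semicolon-delimited .sql file statement by statement through
    # Spark SQL, mirroring what SqlRunner does with a JDBC Connection.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-script-runner").getOrCreate()

    with open("/path/to/script.sql") as f:
        script = f.read()

    result = None
    for statement in script.split(";"):
        statement = statement.strip()
        if statement:
            result = spark.sql(statement)  # each statement runs against Spark SQL

    if result is not None:
        result.show()  # result of the last statement in the script

Tables coming from MySQL (via the JDBC data source) and Cassandra (via the spark-cassandra-connector) would need to be registered as temp views first, so the scripts can refer to them by name.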