Data Movement from generic-ODBC database to BigQuery - google-bigquery

I would like to save the result of a query run on an external database to BigQuery.
I am using pyodbc to manage the ODBC connection.
What is the most efficient way to perform such an operation?
Should I fetchone() each cursor row and then insert it into BigQuery?

Does the result contain a large amount of data?
If the result is small, you can just read all the rows and insert them into BigQuery. The benefit is that the result is immediately available to BigQuery queries. However, for large results, streaming inserts can be expensive (see https://cloud.google.com/bigquery/pricing).
For large results I would save the result to a file (commonly CSV), upload it to Google Cloud Storage, and run a load job.
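Both paths can be sketched with standard-library helpers; the table, dataset, and column names are placeholders, and the BigQuery calls named in the comments (insert_rows, load_table_from_file) are from the google-cloud-bigquery client:

```python
import csv
import io


def chunked(rows, size=500):
    """Yield fixed-size batches for streaming inserts (the streaming API
    has per-request limits, so rows are sent in batches)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rows_to_csv(rows, columns):
    """Serialize fetched ODBC rows to an in-memory CSV file, ready to be
    uploaded to Cloud Storage for a BigQuery load job."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    buf.seek(0)
    return buf


# Small result (hypothetical names): stream each batch with
#     for batch in chunked(cursor.fetchall()):
#         client.insert_rows(table, batch)      # google-cloud-bigquery
# Large result: write rows_to_csv(...) out (or upload it to GCS) and start
# a load job, e.g. client.load_table_from_file(buf, table_id, job_config=...)
```

Batching avoids both the per-row round trip of calling fetchone() and insert for every record, and oversized streaming requests.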

Related

AWS Athena: can we write query results in a distributed manner?

I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data is the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). Is there a way to write the result of a regular SELECT in a distributed manner, instead of into a single file? That is, without bucketing or partitioning.
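For reference, the distributed write the question contrasts against is what CTAS does; a hedged sketch (database, table, and S3 locations are placeholders), with the statement held as a string as it would be submitted through an Athena client:

```python
# Hypothetical CTAS statement: Athena writes the result in parallel as
# multiple Parquet files under the given S3 prefix.
CTAS_SQL = """
CREATE TABLE my_db.query_result
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/query_result/'
) AS
SELECT col1, col2
FROM my_db.source_table
"""

# A plain SELECT, by contrast, is written by a single writer as one CSV
# object under the workgroup's S3 output location.
```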

Small single parquet file on Data Lake, or relational SQL DB?

I am designing a Data Lake in Azure Synapse and, in my model, there is a table that will store a small amount of data (around 5,000 rows).
The single parquet file that stores this data will surely be smaller than the smallest recommended size for a parquet file (128 MB) and I know that Spark is not optimized to handle small files. This table will be linked to a delta table, and I will insert/update new data by using the MERGE command.
In this scenario, regarding performance, is it better to stick with a delta table, or should I create a SQL relational table in another DB and store this data there?
It depends on multiple factors, like the types of queries you will be running and how often you want to run the MERGE command to upsert data into the Delta table.
But even if you do perform analytical queries, given the size of the data I would go with a relational DB.

Insert BigQuery query result into MySQL

In one of my PHP applications, I need to show a report based on aggregate data fetched from BigQuery. I am planning to execute the queries using a PHP cron job and then insert the data into a MySQL table from which the report will read. Is there a better way of doing this, such as inserting the data directly into MySQL without an application layer in between?
Also, I am interested in real-time data, but the daily cron only updates the data once a day, so there will be some mismatch between the reported counts and the actual data if I check later. If I run hourly cron jobs, I am afraid the query charges will be high, as I am processing a dataset that is 20 GB. Also, my report cannot read from BigQuery itself; it needs to get its data from the MySQL database.
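If the cron-job design stays, the BigQuery-to-MySQL hop is a read-then-upsert; a minimal sketch of the upsert statement such a job would issue (the table and column names are hypothetical):

```python
def upsert_sql(table, columns, key_columns):
    """Build a MySQL INSERT ... ON DUPLICATE KEY UPDATE statement so the
    periodic job can refresh existing report rows in place instead of
    duplicating them on each run."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{c} = VALUES({c})" for c in columns if c not in key_columns
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


# With a MySQL driver (e.g. PyMySQL or mysql-connector) the job would run:
#     cursor.executemany(upsert_sql("report", ["day", "count"], ["day"]), rows)
# where `rows` are the aggregate tuples fetched from BigQuery.
```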

Google BigQuery query execution using Google Cloud Dataflow

Is it possible to execute a BigQuery query directly from Google Cloud Dataflow and fetch its result, instead of reading a whole table and then applying conditions?
For example (pseudocode): PCollection res = p.apply(BigQueryIO.execute("SELECT col1, col2 FROM publicdata:samples.shakespeare WHERE ..."))
Instead of reinventing, with an iterative method, what BigQuery queries already implement, we could use them directly.
Thanks and Regards
Ajay K N
BigQueryIO currently only supports reading from a Table, not from a Query or View (FAQ).
One way to work around this is to have your main program create a permanent BigQuery table by issuing the query before you run your Dataflow job. After your job runs, you can delete the table.
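That workaround can be sketched with the google-cloud-bigquery Python client (the client import is deferred inside the function so the naming helper stays importable without it; the project, dataset, and query are placeholders):

```python
import datetime


def staging_table_id(project, dataset, prefix="dataflow_stage"):
    """Build a unique name for the permanent table that temporarily holds
    the query result for BigQueryIO to read."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{project}.{dataset}.{prefix}_{stamp}"


def materialize_query(sql, table_id):
    """Run the query with a permanent destination table, wait for it to
    finish, and return the table id for the Dataflow job to read from.
    Assumes google-cloud-bigquery and default credentials are available."""
    from google.cloud import bigquery  # deferred: requires the client library

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination=table_id,
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(sql, job_config=job_config).result()  # block until done
    return table_id
    # After the Dataflow job completes: client.delete_table(table_id)
```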

Can I denormalize data in Google Cloud SQL in prep for BigQuery?

Given that BigQuery is not meant as a platform to denormalize data, can I denormalize the data in Google Cloud SQL prior to importing it into BigQuery?
I have the following tables:
Table1: 500M rows, Table2: 2M rows, Table3: 800K rows.
I can't denormalize in our existing relational database for various reasons, so I'd like to do a SQL dump of the database, load it into Google Cloud SQL, and then use SQL JOIN scripts to create one large flat table to be imported into BigQuery.
Thanks.
That should work. You should be able to dump the generated flat table to CSV and import it into BigQuery. Currently, however, there is no direct Cloud SQL-to-BigQuery loading mechanism.
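The join-then-load step might look like the following; every table, column, bucket, and dataset name is a placeholder, and the load-job calls shown in comments assume the google-cloud-bigquery client:

```python
# Hypothetical denormalizing join run inside Cloud SQL (MySQL) to build
# the flat table before export:
FLATTEN_SQL = """
CREATE TABLE flat AS
SELECT t1.*, t2.attr_a, t3.attr_b
FROM table1 t1
JOIN table2 t2 ON t1.t2_id = t2.id
JOIN table3 t3 ON t1.t3_id = t3.id
"""

# After exporting `flat` as CSV to Cloud Storage, a BigQuery load job
# (names are placeholders) would run roughly:
#     job_config = bigquery.LoadJobConfig(
#         source_format=bigquery.SourceFormat.CSV,
#         skip_leading_rows=1,
#         autodetect=True,
#     )
#     client.load_table_from_uri(
#         "gs://my-bucket/flat.csv",
#         "my-project.my_dataset.flat",
#         job_config=job_config,
#     ).result()
```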