Insert bigquery query result to mysql - google-bigquery

In one of my PHP application, I need to show a report based on the aggregate data, which is fetched from BigQuery. I am planning to execute the queries using a PHP cron job then insert data to MySQL table from which the report will fetch data. Is there any better way of doing this like directly insert the data to MySQL without an application layer in between ?
Also I am interested in real time data, but the daily cron only update data once and there will be some mismatch of the counts with actual data if I check it after some time. If I run hourly cron jobs, I am afraid the data reading charges will be high as I am processing a dataset which is 20GB. Also my report cannot be fetched fro Bigquery itself and it needs to have data from MySQL database.

Related

SSIS Incremental Load-15 mins

I have 2 tables. The source table being from a linked server and destination table being from the other server.
I want my data load to happen in the following manner:
Everyday at night I have scheduled a job to do a full dump i.e. truncate the table and load all the data from the source to the destination.
Every 15 minutes to do incremental load as data gets ingested into the source on second basis. I need to replicate the same on the destination too.
For incremental load as of now I have created scripts which are stored in a stored procedure but for future purposes we would like to implement SSIS for this case.
The scripts run in the below manner:
I have an Inserted_Date column, on the basis of this column I take the max of that column and delete all the rows that are greater than or equal to the Max(Inserted_Date) and insert all the similar values from the source to the destination. This job runs evert 15 minutes.
How to implement similar scenario in SSIS?
I have worked on SSIS using the lookup and conditional split using ID columns, but these tables I am working with have a lot of rows so lookup takes up a lot of the time and this is not the right solution to be implemented for my scenario.
Is there any way I can get Max(Inserted_Date) logic into SSIS solution too. My end goal is to remove the approach using scripts and replicate the same approach using SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL and so on.

How can I write the results of a SQL query to azure cloud storage?

Our current data set is not friendly in terms of looking at historic records. I can see what a value for an account is at the time of execution but if I want to look up last month's counts and values that's often lost. To fix this I want to take a "snapshot" of our data by running it at specific times and storing the results in the cloud. We're looking at just over 30,000 records and I'd only run it at the end of the month keeping 12 separate months at a time so the count doesn't get too high.
I can't seem to find anything about how I could do this so I'm hoping someone has experience or knowledge and would like to share.
FYI we're using an on premise oracle DB.
Thanks!
You can use Azure Data Factory (ADF) to schedule a monthly run for a pipe that execute a query/stored procedure against your Azure SQL and writes the data to Azure Storage.

AWS Equivalent For Incremental SQL Query

I can create a materialised view in RDS (postgreSQL) to keep track of the 'latest' data output from a SQL query, and then visualise this in QuickSight. This process is also very 'quick' as it doesn't result in calling additional AWS services and/or re-processing all data again (through the SQL query). My assumption is how this works is it runs a SQL, re-runs the SQL but not for the whole data again, so that if you structure the query correctly, you can end up having a 'real time running total' metric for example.
The issue is, creating materialised views (per 5 seconds) for 100's of queries, and having them all stored in a database is not scalable. Imagine a DB with 1TB data, creating an incremental/materialised view seems much less painful than using other AWS services, but eventually won't be optimal for processing time/cost etc.
I have explored various AWS services, none of which seem to solve this problem.
I tried using AWS Glue. You would need to create 1 script per query and output it to a DB. The lag between reading and writing the incremental data is larger than creating a materialised view; because you can incrementally process data, but then to append it to the current 'total' metric is another process.
I explored using AWS Kinesis followed by a Lambda to run a SQL on the 'new' data in the stream, and store the value in S3 or RDS. Again, this adds latency and doesn't work as well as a materialised view.
I read that AWS Redshift does not have materialised views therefore stuck to RDS (PostgreSQL).
Any thoughts?
[A similar issue: incremental SQL query - except I want to avoid running the SQL on "all" data to avoid massive processing costs.]
Edit (example):
table1 has schema (datetime, customer_id, revenue)
I run this query: select sum(revenue) from table1.
This would scan the whole table to come up with a metric per customer_id.
table1 now gets updated with new data as the datetime progresses e.g. 1 hour extra data.
If I run select sum(revenue) from table1 again, it scans all the data again.
A more efficient way is to just compute the query on the new data, and append the result.
Also, I want the query to actively run where there is a change in data, not have to 'run it with a schedule' so that my front end dashboards basically 'auto update' without the customer doing much.

Data Movement from generic-ODBC database to BigQuery

I would save the result of a query on an external db to Bigquery.
I am using pyodbc to manage the odbc connection.
What is the most efficient way to perform such operation?
Should I fetchOne each cursor row and then insert in BigQuery?
Does the result have large amount of data?
If the result is small, you can just read all rows and insert into BigQuery. The benefit is the result is immediately available to BigQuery queries. However, for large results, the streaming insert might be expensive (see https://cloud.google.com/bigquery/pricing).
For large results I would just save the result to a file (commonly CSV), upload it to GCP and run load job.

Result of Bigquery job running on a table in which data is loaded via streamingAPI

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here the query engine will look at both the Columnar Storage and the streaming buffer, so potentially the query should see the streamed data.
It depends what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.