Service that does advanced queries on a data set, and automatically returns relevant updated results every time new data is added to the set? - sql

I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kind of votes like picking a favorite, rating 1-5, say yes/no etc. on various topics.
We also want to show "live" statistics to the user, showing the popularity of a person etc. This will be generated by a rather complex SQL where we are calculating the average number of times a person was picked as favorite, divided by total number of votes and the number of games in which the person has been participating etc. And the score for the latest X games should count higher than the overall score for all games. This is just an example, there are several other SQL queries with similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQLs). Like setting up views in MySQL
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge about the defined queries, and the content of the pushed vote data, determine which statistics that are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this in the market? Or can it be built by combining available cloud services? And what is the official term for such a product, if I should search for it myself?
I know that I can probably build a solution like this myself, and run it on a cloud hosted database server, which can scale as our need grows - but I believe that I'm not the first developer with a need of this, so I hope that someone has solved it before me :)

You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery : Here you can define schema for the data on which you're going to run the SQL queries. BigQuery supports Standard and legacy SQL queries.
2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.
3) Every vote pushed can trigger the recalculation service which calculates the statistics by executing the defined SQL queries and the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: Here you can store the live statistics of the user. This is a real time database, so you'll be able to configure listeners for the modifications to the statistics and show the modifications as soon as the statistics are recalculated.
5) In the same service which inserts every vote, create a new record with a "syncId" in an another table. The idea is to group a number of votes cast in a particular interval to a its corresponding syncId. The syncId can be suffixed with a timestamp. According to your requirement a particular time interval can be set so that the recalculation can be triggered using CRON jobs service which invokes the recalculation service within the interval. Once the recalculation related to a particular syncId is completed the record corresponding to the syncId should be marked as completed.
We are leveraging the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded on Google Firestore and then stream-inserted to Google BigQuery. The data stored in BigQuery is queried after 30 sec of each update using SQL queries and the query results are stored in Google Cloud Firestore to serve dashboards which are automatically updated using listeners configured for the collection in which the dashboard information is stored.

Related

Log Analytics retention policy and querying on logs

I would like to know how can we address this scenario in Azure Log Analytics where I need to generate Kube-audit logs of different cluster every week and also retain these logs for approx 400 days. Now storing it over Log Analytics will cost me more and its not an optimized architecture as I will not be require that so often. So I would like to know from experts whats the best way to design the architecture, where we get the kube audit logs which can be retained for 400 days and be available for querying when required without incurring too much cost.
PS: I also heard in my team that querying 400 days logs always times out in KQL.
Log analytics offerings:
Log analytics now provides the capability to manage several service tiers at table scope. Setting your data as archive, with no query capabilities at a much lower cost. offering spans for up to 7 years.
when needed, you can choose to elevate a subset of your data into the Analytics offering, providing you the capability to query it. The action of elevating your data is denoted as - "Search jobs"
Another option is to elevate an entire period in time to the Analytic offering, they call it - "Restore logs".
Table's different service tiers -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/data-retention-archive?tabs=api-1%2Capi-2
Search job offering -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/search-jobs?tabs=api-1%2Capi-2%2Capi-3
Restore logs -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/restore?tabs=api-1%2Capi-2
all are under public preview.
both offerings - Search jobs and Restore logs provides you the capability to engage your data on demand, can't comment or suggest regarding the actual cost.
Azure data explorer solution:
Another option is to use Azure storage to hold your data (as an example), Azure data explorer provides the capability to create an external table, that table is a logical view on top of your data, the data itself is kept outside of the ADX cluster. you can query your data by using ADX, expect degradation in query performance.
ADX external table offering -
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/externaltables

Usecase for BIgQuery as a database backend for website thoughts

members,
Currently we synchronise salesdata into BigQuery, and it allows us to make fast, detailed, practically realtime reports of all kinds of stats that we otherwise would not have available. We want to have a website that is able to use these reports and present this information to website-users.
Some specs:
Users are using the data as 'readonly'
We want to do the analysis 'on request', so as soon as a user opens the page, we would query BigQuery and the user would see their stats depending on the query
The stats could change for external sources but often the result will be equal, I take into my mind that BigQuery would cache the query
The average query processes about 100Mb of data, it takes >2 seconds for the whole backend to respond (so user request, query, return resultset) so performance is what we want
Why I doubt:
BigQuery would not be adviced
Could it run 'out of hand'
Dataset will grow bigger, but we will need to keep using all historical data in any case
I would be an option to get aggregated data into another database for doing the main calls, but that would give me not a 'realtime' experience.
I would love to hear your thoughts.
As per your requirement, you can consider Bigquery as an option since Bigquery is fully managed and supports analytics over petabyte-scale data, it will be able to handle large amounts of data. Bigquery is specially designed for performing OLAP transactions so analysis can be performed on requests. Bigquery uses cached query results through which you can cache the query and fetch results quickly.
If your dataset is very large and grows then you can create partitioned tables to store and manage your data and easily query the tables. Since your data can go out of hand, Bigquery being a fully managed service will automatically handle that load. Historical data can be stored and accessed but for that you can set the expiration time of the table and also check the optimized storage according to your requirement.

Dashboard that updates real time, how to structure

I am trying to create a dashboard app for my company that displays data from a few different sources that they use. I am starting with an in house system that stores data in MSSQL. I'm struggling to decide how I can display real time (or at least updated regularly) data based on this database.
I was thinking of writing a node server to poll the company database and check for updates and then store a copy of the relevant tables in my own database. Then creating another node server that computes metrics (average delivery time, Turnover, etc.) from my database and then a frontend (probably react) to display these metrics nicely and trigger the logic in the backend whenever the page is loaded by a user.
This is my first project so just need some guidance on whether this is the right way to go about this or if I'm over complicating it.
Thanks
One of solutions is to implement a CRON job in nodejs or in you frontEnd side, then you can retrieve new Data inserted to your Database.
You can reffer to this link for more information about the CROB job :
https://www.npmjs.com/package/cron
if you are using MySQL, you can use the mysql-events listner, it is a MySQL database and runs callbacks on matched events.
https://www.npmjs.com/package/mysql-events

How to be notified for high costs of queries in BigQuery?

I have a project in BigQuery where many people update/add Views.
Other access Views/Tables from 3rd party softwares like Tableau.
I have no control for example if the Analysit who wrote the query in Tableau used the Partition of the table or not.
Is it possible somehow to ask BigQuery to send email for each query that passes threshold? For example 20GB. Then I can check this specific query and user to see if it's OK or not (I'm not forcing partition as it's not always what we need)
I know that it's possible to use the Stackdriver Logging export to download logs into BigQuery tables / storage but I don't see anything there that can tell me if query passed this specific criteria.
There are different solutions available but the best is using Cloud Pub/Sub topics and piece of Cloud Function:
Enable programmatic notifications to receive Cloud Pub/Sub messages with the current status of your budget
Programmatic Budgets Notification Examples

Google CloudSQL or BigQuery for Big Data Actively Update Every Second

So now I'm currently using Google CloudSQL for my needs.
I'm collecting data from user activities. Every day the number of rows in my table will increase around 9-15 million rows and always updated every second. The data including several main parameters like user locations (latitude longitude), timestamp, user activities and conversations and more.
I need to constantly access a lot of insight from this user activities, like "how many users between latitude-longitude A and latitude-longitude B who use my app per hour since 30 days ago?".
Because my table become bigger every day, it's hard to manage the performance of select query in my table. (I already implemented the indexing method in my table especially for most common use parameter)
All my data insert, select, update and more is executed from API that I code in PHP.
So my question is can I get much more better benefit if I use Google BigQuery for my needs?
If yes, how can I do this? Because is Google BigQuery (forgive my if I'm wrong) designed to be used for static data? (Not a constantly update data)? How can I connect my CloudSQL data into BigQuery in real time?
Which one is better: optimizing my table in CloudSQL to maximize the select process or use BigQuery (if possible)
I also open for another alterntive or sugget to optimize my CloudSQL performance :)
Thank you
Sounds like BigQuery would be far better suited your use case. I can think of a good solution:
Migrate existing data from CloudSQL to BigQuery.
Stream events directly to BigQuery (using a async queue).
Use time partitioned table in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.