I have a Firestore-based app where users can vote on various questions.
The app saves each vote to a Firestore collection.
At the same time, we want to display real-time (live-updating) voting results to the users. These results should be saved in a Firestore document that the clients can subscribe to.
These voting results can be based on complex queries that we need to run across all collected votes.
BigQuery seems to be ideal for these queries.
Therefore we want a trigger ensuring that every time a vote document is created in Firestore, it is stream-inserted into BigQuery.
After inserting into BigQuery, we want to run the specific query related to the category of the vote, and save the result into the corresponding voting result document, so the user clients will be updated.
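For context, here is roughly what we have in mind for the trigger. This is a minimal sketch assuming Node.js Cloud Functions and the @google-cloud/bigquery client; the collection and table names are placeholders:

import * as functions from "firebase-functions";
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Fires whenever a vote document is created and streams the vote into
// BigQuery. "votes" and "analytics.votes" are placeholder names.
export const onVoteCreated = functions.firestore
  .document("votes/{voteId}")
  .onCreate(async (snap, context) => {
    const vote = snap.data();
    await bigquery
      .dataset("analytics")
      .table("votes")
      .insert([{ voteId: context.params.voteId, ...vote }]);
    // Next step would be to run the category's statistics query and
    // update the voting result document -- see the caveat below.
  });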
But from what I can read, you cannot count on immediate results from BigQuery after a streaming insert is accepted; it can take several seconds before the row shows up in a query.
We can live with that delay, but we need a way to know when the row has actually landed in BigQuery, so that we can then run the query and save the result in Firestore.
What is the recommended approach for this, and are there any other tools that can help us out?
Currently we synchronise sales data into BigQuery, which allows us to make fast, detailed, practically real-time reports of all kinds of stats that we would otherwise not have available. We want to have a website that is able to use these reports and present the information to website users.
Some specs:
Users use the data read-only
We want to do the analysis 'on request': as soon as a user opens the page, we query BigQuery and the user sees their stats, depending on the query
The stats can change due to external sources, but often the result will be the same; I assume BigQuery would cache the query results
The average query processes about 100 MB of data and it takes >2 seconds for the whole backend to respond (user request, query, return result set), so performance is what we are after
Why I have doubts:
BigQuery might not be advisable for this kind of use
Could it run 'out of hand' (in cost or load)?
The dataset will keep growing, but we will need to keep using all historical data in any case
It would be an option to get aggregated data into another database for the main calls, but that would not give me a 'real-time' experience.
I would love to hear your thoughts.
As per your requirements, you can consider BigQuery as an option. BigQuery is fully managed and supports analytics over petabyte-scale data, so it will be able to handle large amounts of data. It is specifically designed for OLAP workloads, so analysis can be performed on request. BigQuery also caches query results, which lets you fetch results for repeated queries quickly.
If your dataset is very large and keeps growing, you can create partitioned tables to store and manage your data and query the tables efficiently. As for the concern about things running out of hand: BigQuery is a fully managed service and will handle that load automatically. Historical data can be stored and accessed as long as you need it; where parts of it become disposable, you can set an expiration time on the table, and you should also check which storage options are optimal for your requirements.
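For illustration, a minimal sketch of both ideas with the Node.js client; the dataset, table and column names are assumptions:

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Create a table partitioned by day on the sale_date column, so that
// queries filtering on it scan (and bill) only the partitions touched.
async function createPartitionedTable(): Promise<void> {
  await bigquery.dataset("sales").createTable("sales_facts", {
    schema: "sale_date:DATE, amount:NUMERIC, region:STRING",
    timePartitioning: { type: "DAY", field: "sale_date" },
  });
}

// Cached results: useQueryCache is on by default, so identical queries
// over unchanged tables are answered from cache at no extra charge.
async function runReport(): Promise<void> {
  const [rows] = await bigquery.query({
    query: `SELECT region, SUM(amount) AS total
            FROM \`sales.sales_facts\`
            WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
            GROUP BY region`,
    useQueryCache: true, // the default, shown here for clarity
  });
  console.log(rows);
}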
I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kinds of votes, like picking a favorite, rating 1-5, saying yes/no etc., on various topics.
We also want to show "live" statistics to the users, showing the popularity of a person etc. These will be generated by rather complex SQL, in which we calculate, for instance, the average number of times a person was picked as favorite, divided by the total number of votes and the number of games the person has participated in, with the score for the latest X games counting more than the overall score for all games. This is just one example; there are several other SQL queries of similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQLs). Like setting up views in MySQL
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge of the defined queries and the content of the pushed vote data, determine which statistics are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this on the market? Or can it be built by combining available cloud services? And what is the official term for such a product, in case I want to search for it myself?
I know that I can probably build a solution like this myself and run it on a cloud-hosted database server that scales as our needs grow - but I believe I'm not the first developer with this need, so I hope that someone has solved it before me :)
You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery: here you can define the schema for the data on which you're going to run the SQL queries. BigQuery supports both standard and legacy SQL.
2) Every vote can be pushed to the defined BigQuery tables using its streaming-insert service.
3) Every vote pushed can trigger the recalculation service, which calculates the statistics by executing the defined SQL queries; the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: here you can store the live statistics of the users. This is a real-time database, so you'll be able to configure listeners for modifications to the statistics and show them as soon as the statistics are recalculated.
5) In the same service that inserts every vote, create a record with a "syncId" in another table. The idea is to group the votes cast in a particular interval under a corresponding syncId; the syncId can be suffixed with a timestamp. A time interval can be set according to your requirements so that recalculation is triggered by the CRON jobs service, which invokes the recalculation service per interval. Once the recalculation for a particular syncId is completed, the record corresponding to that syncId should be marked as completed, as in the sketch below.
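A hedged sketch of the syncId batching idea in Node.js; the collection/table names, the one-minute window and the helper names are all assumptions, and the batch markers are kept in Firestore so the completed flag can be updated without running DML against recently streamed BigQuery rows:

import { BigQuery } from "@google-cloud/bigquery";
import { Firestore } from "@google-cloud/firestore";

const bigquery = new BigQuery();
const firestore = new Firestore();

// Called for every vote: stream the vote into BigQuery and upsert a
// batch marker. The syncId groups all votes cast in the same
// one-minute window.
async function recordVote(vote: Record<string, unknown>): Promise<void> {
  const windowStart = Math.floor(Date.now() / 60_000) * 60_000;
  const syncId = `batch-${windowStart}`;
  await bigquery.dataset("votes_ds").table("votes").insert([vote]);
  await firestore.collection("vote_batches").doc(syncId).set(
    { createdAt: windowStart, completed: false },
    { merge: true }
  );
}

// Invoked by the CRON job every minute: recalculate statistics for
// batches not yet marked completed, then mark them done so each batch
// is processed exactly once.
async function recalculatePendingBatches(): Promise<void> {
  const pending = await firestore
    .collection("vote_batches")
    .where("completed", "==", false)
    .get();
  for (const doc of pending.docs) {
    // ... run the relevant statistics queries against BigQuery and
    // write the results into the corresponding Firestore documents ...
    await doc.ref.update({ completed: true });
  }
}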
We are using the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded in Google Cloud Firestore and then stream-inserted into Google BigQuery. The data stored in BigQuery is queried 30 seconds after each update using SQL queries, and the query results are stored in Google Cloud Firestore to serve dashboards, which are updated automatically via listeners configured on the collection holding the dashboard information.
Our Data Warehouse team is evaluating BigQuery as a Data Warehouse column-store solution and has some questions regarding its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to occasionally replay several hours or days of events, to correct for errors and data outages, with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
1) Is there a common implementation for de-duplication of real-time streaming beyond the use of the insertId?
2) If we attempt a delsert (a delete followed by an insert using the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
3) Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column-store ETL technologies, but we have seen no documentation suggesting its use in BigQuery.
We let duplication happen, and write our logic and queries in such a way that every entity is streamed data. E.g. a user profile is streamed data, so there are many rows placed over time, and when we need the latest data, we use the most recent row.
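The read-time dedup looks like this in practice; a sketch using the Node.js client, with assumed table and column names:

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Number the rows per entity by recency and keep only the newest one,
// so duplicates from re-streams are simply ignored at query time.
async function latestProfiles(): Promise<void> {
  const [rows] = await bigquery.query(`
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY user_id
                                ORDER BY updated_at DESC) AS rn
      FROM \`mydataset.user_profiles\`
    )
    WHERE rn = 1
  `);
  console.log(rows.length, "current profiles");
}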
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table. That means you would need to stage batches in a temp table and later issue a single DML statement that handles a whole batch of rows and updates the live table from the temp table.
If you are considering delsert, it may be easier to write your queries to read only the most recent row instead.
Streaming followed by a scheduled merge is possible. You can even rewrite data in the same table, e.g. to remove dupes, or run a scheduled query that batches content from a temp table and writes it to the live table. This is essentially the same as letting duplication happen and dealing with it later in a query; when you write the result back to the same table, it is also called re-materialization.
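A sketch of such a scheduled merge, assuming standard SQL, a staging table fed by streaming inserts, and hypothetical table/column names:

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Fold freshly streamed rows from the staging table into the live
// table: update existing keys, insert new ones, then clear staging.
// Note: TRUNCATE (like other DML) fails while the staging table still
// has an active streaming buffer, so the schedule must trail it.
async function mergeStagingIntoLive(): Promise<void> {
  await bigquery.query(`
    MERGE \`warehouse.events_live\` AS live
    USING (
      SELECT * EXCEPT(rn) FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id
                                     ORDER BY ingested_at DESC) AS rn
        FROM \`warehouse.events_staging\`
      ) WHERE rn = 1
    ) AS staged
    ON live.event_id = staged.event_id
    WHEN MATCHED THEN UPDATE SET payload = staged.payload,
                                 ingested_at = staged.ingested_at
    WHEN NOT MATCHED THEN INSERT ROW
  `);
  await bigquery.query(`TRUNCATE TABLE \`warehouse.events_staging\``);
}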
I'm currently using Google Cloud SQL for my needs.
I'm collecting data from user activities. Every day my table grows by around 9-15 million rows, and it is updated every second. The data includes several main parameters like user location (latitude/longitude), timestamp, user activities, conversations and more.
I constantly need a lot of insights from these user activities, like "how many users between latitude/longitude A and latitude/longitude B used my app per hour over the last 30 days?".
Because my table gets bigger every day, it's hard to keep the SELECT queries performing well. (I have already added indexes, especially on the most commonly used parameters.)
All my inserts, selects, updates and so on are executed through an API that I wrote in PHP.
So my question is: would I get a much greater benefit if I used Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed for static data rather than constantly updated data? How can I get my Cloud SQL data into BigQuery in real time?
Which one is better: optimizing my table in Cloud SQL to speed up the SELECTs, or using BigQuery (if possible)?
I'm also open to any alternative or suggestion to optimize my Cloud SQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate your existing data from Cloud SQL to BigQuery.
Stream new events directly into BigQuery (using an async queue).
Use time-partitioned tables in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
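As an illustration, the kind of insight query mentioned above (hourly users within a bounding box over the last 30 days) becomes a simple aggregate; a sketch with assumed table and column names, where ts is the partitioning column:

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Hourly unique users within a lat/long bounding box over the last 30
// days. Filtering on the partition column (ts) limits the scan -- and
// the bill -- to 30 days of partitions.
async function hourlyUsersInBox(
  minLat: number, maxLat: number,
  minLng: number, maxLng: number
): Promise<void> {
  const [rows] = await bigquery.query({
    query: `
      SELECT TIMESTAMP_TRUNC(ts, HOUR) AS hour,
             COUNT(DISTINCT user_id) AS users
      FROM \`analytics.user_activity\`
      WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
        AND latitude BETWEEN @minLat AND @maxLat
        AND longitude BETWEEN @minLng AND @maxLng
      GROUP BY hour
      ORDER BY hour`,
    params: { minLat, maxLat, minLng, maxLng },
  });
  console.log(rows);
}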
I have run a query on Google BigQuery several hours ago, and the query is still running. I clicked "abandon", but it appears there is no way to stop a query. What can I do? Can I contact Google somehow, so they stop the query?
I've been working on a project for a company which analyzes Google Analytics data with BigQuery, so I don't want to run up a big bill for them or anything.
(Maybe StackOverflow is not the right place to ask this question, but I've tried to find another place, and I couldn't. On the BigQuery support page, it is said that questions should be asked here, with the google-bigquery tag, so I'm doing that).
I've written a query (which I don't want to paste or describe here, as someone might abuse it to overload BigQuery or something, I don't know). Let's just say it includes inner joins. After writing it, and before running it, the console message was something like "This will analyze 674KB of data", which looked OK, given the fact that the table only has 10,000 rows. I got a similar message after clicking "abandon": something like "You can abandon this, but you will still be billed for 674KB of data".
I try very hard to make sure what I do doesn't cause problems to someone, so I've actually run that query on a local PostgreSQL database (with the exact same data - 10,000 rows) as in BigQuery, and the query there finishes in a second or two.
How can I cancel this query, and can I (the company I've worked for) be billed for something more than 674KB of data?
At the time of writing, there is no way to stop a BigQuery job once it has started, neither via the web interface nor via API calls.
According to this, this feature may be added in the future.
Since BigQuery shards the query across multiple machines, even a large query (terabyte level) will not have a large impact on any individual machine, let alone a query over 674KB. However, according to this, that is the amount you will be charged.
Here are some tips to save money in BigQuery.
The first thing to know is that, unlike a traditional RDBMS, BigQuery stores data column by column, and you are charged for the amount of data in the columns you reference, not for whole rows.
That means: don't reference columns that you do not need in the query. This may sound trivial, but people coming from an RDBMS sometimes write queries like this:
SELECT
  *
FROM
  [Dataset.Table]
The query is absolutely correct, but because it references every column, Google will bill the whole table for it, instead of only the columns you actually need (say, just user_id). Therefore it's a good idea to explicitly specify the column names.
Break the tables into smaller chunks. Instead of having a single table that contains all the data, it's a good idea to split the table by date and use table wildcard functions to stitch the tables together at query time. That way, you won't be billed for rows that you don't need.
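For example, with date-sharded tables named like events_20240101, events_20240102 and so on (the names and dates here are made up), a standard SQL wildcard query scans and bills only the shards in the requested range:

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Wildcard query over date-sharded tables: only shards whose
// _TABLE_SUFFIX falls in the range are scanned and billed.
async function weeklyActiveUsers(): Promise<void> {
  const [rows] = await bigquery.query(`
    SELECT COUNT(DISTINCT user_id) AS active_users
    FROM \`mydataset.events_*\`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107'
  `);
  console.log(rows[0].active_users);
}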
BigQuery supports canceling query jobs.
You can do this via the bq command line utility:
bq cancel <job_id>
or from the API via the jobs.cancel method (documented here)
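The same thing from Node.js, as a sketch; jobId is whatever ID your original query job got, and cancellation is best-effort (the job may finish before the cancel request is processed):

import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Cancel a running query job by its ID and report its state afterwards.
async function cancelJob(jobId: string): Promise<void> {
  const job = bigquery.job(jobId);
  await job.cancel();
  const [metadata] = await job.getMetadata();
  console.log("job status:", metadata.status.state);
}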