Google BigQuery for realtime call records data - google-bigquery

I am thinking of using Google BigQuery to store real-time call records: around 3 million rows per day, inserted and never updated.
I have signed up for a trial account and run some tests.
I have a few concerns before I can go ahead with development:
When streaming data via PHP, it sometimes takes around 10-20 minutes for the data to appear in my tables. This is a showstopper for us, because network support engineers need this data updated in real time to troubleshoot quality issues.
Partitions: we can store data in daily partitions, but each partition is then about 2.5 GB on any given day, and that pushes my query costs into the range of thousands per month. Is there any other way to bring down the cost here? We could store data partitioned per hour, but there is no such support available.
If not BigQuery, what other solutions on the market can deliver similar performance and solve these problems?

You have the "streaming insert" option, which makes records searchable within a few seconds (it has its price).
See: streaming-data-into-bigquery
See also table-decorators for limiting how much data a query scans.
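
To see why limiting the scan matters for the bill, here is a rough back-of-the-envelope sketch. It assumes on-demand billing proportional to bytes scanned; the price per TB and the number of queries per day are placeholders I made up, not quoted figures, so substitute your own. Only the 2.5 GB daily partition size comes from the question.

```python
def monthly_scan_cost(bytes_per_query: float, queries_per_day: int,
                      price_per_tb: float, days: int = 30) -> float:
    """Estimate monthly query cost, assuming billing is proportional to bytes scanned."""
    tb_scanned = bytes_per_query * queries_per_day * days / 1e12
    return tb_scanned * price_per_tb


DAY_PARTITION_BYTES = 2.5e9  # from the question: about 2.5 GB per daily partition
PRICE_PER_TB = 5.0           # placeholder price, NOT a quoted BigQuery rate
QUERIES_PER_DAY = 1000       # hypothetical troubleshooting workload

# Every query scans the whole daily partition.
full_day = monthly_scan_cost(DAY_PARTITION_BYTES, QUERIES_PER_DAY, PRICE_PER_TB)

# Every query is restricted to roughly the last hour of data (1/24 of the day).
last_hour = monthly_scan_cost(DAY_PARTITION_BYTES / 24, QUERIES_PER_DAY, PRICE_PER_TB)

print(f"full-day scans:  {full_day:.2f} per month")
print(f"last-hour scans: {last_hour:.2f} per month")
```

Selecting only the columns you need should reduce the scanned bytes further, since storage is columnar, but this sketch ignores that.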

Related

Storing vast amounts of "uptime" data for a website monitoring service

This is more of a general discussion than a code question.
I have a website monitoring platform where users can enter their website URL and we check it every X minutes, based on the customer's interval. At each interval, an entry is stored as an UptimeCheck model in the Laravel 8 project, with the status being either down or up.
If a customer has 20 monitors and each checks every minute, then over a 30-day period that one customer would accumulate about 864,000 rows (20 × 1,440 × 30), which is close to 1 million.
My question is: do I really need to keep this number of rows?
The reason this number of rows is kept is so that we can present a graph showing the average website uptime.
My thinking is that if I generated some kind of SVG programmatically for each day and stored it in the table, I wouldn't need to store as many entries. My concern is how I would then merge the SVG models into one to present a daily graph.
What kind of libraries could I use, and how else might I approach this?
Unlike performance, the trick for storing uptime data is simple. You don't store it. ;)
You need to store DOWNTIME data instead. Register only unavailability events and extrapolate uptime when displaying reports.
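
A minimal sketch of that extrapolation, assuming each outage is stored as a (start, end) pair and that outages do not overlap. The function name and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def uptime_percent(downtimes, window_start, window_end):
    """Derive uptime % for a reporting window from recorded downtime intervals.

    downtimes: iterable of (start, end) datetimes, assumed non-overlapping.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in downtimes:
        # Clip each outage to the reporting window before counting it.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (window - down) / window


day = datetime(2024, 1, 1)
outages = [
    # 36 minutes down in the early morning.
    (day + timedelta(hours=3), day + timedelta(hours=3, minutes=36)),
    # An outage that runs past midnight; only 30 minutes fall inside this day.
    (day + timedelta(hours=23, minutes=30), day + timedelta(days=1, minutes=30)),
]
daily_uptime = uptime_percent(outages, day, day + timedelta(days=1))
print(f"{daily_uptime:.2f}% uptime")
```

With this layout a monitor that never goes down stores no rows at all, and the daily graph is computed at report time.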

Google big query backfill takes very long

I am new to Stack Overflow. I use Google BigQuery to bring data from multiple sources together. I have set up a connection to Google Ads (using the BigQuery Data Transfer Service) and this works well. But when I run a backfill of older data, it takes more than 3 days to load 180 days of data into BigQuery. Google advises 180 days as the maximum, but it takes so long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do it in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request and simultaneous backfill requests are not supported [1].
BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis [2] and other BigQuery tasks in the project may be limiting the amount of resources used by the Transfer. Load jobs created by transfers are included in BigQuery's quotas on load jobs. It's important to consider how many transfers you enable in each project to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create other projects.
If you want to speed up the transfers for all your clients, you could split them across several projects, since it sounds like you will be running a significant number of transfers.
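
For the chunking itself, here is a small sketch that splits a date range into windows of at most 180 days. It assumes the limit counts both the start and end date of a window; adjust `max_days` if your backfill requests are counted differently. It only computes the windows and does not call the Data Transfer Service.

```python
from datetime import date, timedelta

MAX_BACKFILL_DAYS = 180  # per-request limit cited in the answer above


def backfill_windows(start, end, max_days=MAX_BACKFILL_DAYS):
    """Split [start, end] (inclusive dates) into windows of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    windows = []
    cursor = start
    while cursor <= end:
        # A window of max_days days that includes both of its endpoints.
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


# Two years of history for one client.
windows = backfill_windows(date(2022, 1, 1), date(2023, 12, 31))
for window_start, window_end in windows:
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Since simultaneous backfill requests are not supported, the windows for a given transfer would be submitted one after another.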

Google CloudSQL or BigQuery for Big Data Actively Update Every Second

I'm currently using Google CloudSQL for my needs.
I'm collecting data from user activities. Every day my table grows by around 9-15 million rows, and it is updated every second. The data includes several main parameters such as user location (latitude and longitude), timestamp, user activities, conversations and more.
I need to constantly extract insights from these user activities, such as "how many users between latitude-longitude A and latitude-longitude B used my app per hour over the last 30 days?".
Because my table gets bigger every day, it's hard to maintain the performance of SELECT queries on it. (I have already added indexes to the table, especially on the most commonly used parameters.)
All my inserts, selects, updates and so on are executed from an API that I wrote in PHP.
So my question is: would I get much more benefit from using Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed for static data, not constantly updated data? How can I connect my CloudSQL data to BigQuery in real time?
Which one is better: optimizing my table in CloudSQL to speed up selects, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions for optimizing my CloudSQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate existing data from CloudSQL to BigQuery.
Stream events directly to BigQuery (using an async queue).
Use a time-partitioned table in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
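
As a rough illustration, the "users per hour inside a bounding box over the last 30 days" question could look like the query built below. The table name and the columns `event_ts`, `user_id`, `latitude` and `longitude` are assumptions about your schema, not something BigQuery provides. The helper only builds the SQL text and a parameter dict; running it through a client library is left out.

```python
def users_per_hour_query(table, lat_min, lat_max, lng_min, lng_max, days=30):
    """Return (sql, params) for distinct users per hour inside a bounding box.

    `table` is interpolated as-is, so pass a trusted, hypothetical name like
    "my-project.analytics.user_events". Boxes that cross the antimeridian
    are not handled.
    """
    if lat_min > lat_max or lng_min > lng_max:
        raise ValueError("bounding box minimums must not exceed maximums")
    if days < 1:
        raise ValueError("days must be at least 1")
    sql = f"""
SELECT
  TIMESTAMP_TRUNC(event_ts, HOUR) AS hour_start,
  COUNT(DISTINCT user_id) AS active_users
FROM `{table}`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
  AND latitude BETWEEN @lat_min AND @lat_max
  AND longitude BETWEEN @lng_min AND @lng_max
GROUP BY hour_start
ORDER BY hour_start
""".strip()
    # Named query parameters, to be bound by whichever client runs the query.
    params = {
        "lat_min": lat_min,
        "lat_max": lat_max,
        "lng_min": lng_min,
        "lng_max": lng_max,
    }
    return sql, params


sql, params = users_per_hour_query(
    "my-project.analytics.user_events", -6.3, -6.1, 106.7, 106.9
)
print(sql)
```

If the table is partitioned on `event_ts`, the time filter should let BigQuery skip partitions older than the window rather than scan the whole table.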

Tableau Data Limits

I've been hearing conflicting statements about how many records, or how much data, Tableau can handle.
In the last week, two people have told me they have dashboards with 100m and 600m records. They do incremental refreshes.
If I have a dashboard with xxx million records, do clients only receive the data that is in their aggregated view?
So, say I have a source with 200 million records, and the dashboard shows the aggregated total per week per product. Let's say this is 400 cells (underneath, it's millions of records). Is the client only receiving 400 data points?
If I then add filters down to sub-product or user-level data, would that mean all of this data is imported because of the filters? If so, how does this affect speed?
Ultimately, Tableau can handle as much data as your datasource can handle. If you are set up so Tableau connects to a datasource directly, only the results of a query are transmitted to the user. I've got billion row datasources in BigQuery that return reasonably fast aggregated numbers to Tableau.
If your datasource is not fast then this won't give good results in Tableau.
If you are using extracts, where, in effect, Tableau pulls all the data locally, things will usually be faster, but you will have local drive and memory limits on the size of the dataset. Each user will also need an extract, unless you are using Tableau Server, in which case the extract can live on the server.
Dashboards built on big datasources sometimes get slow when there are a lot of filters because populating each filter requires a datasource query (which may be triggered every time you use a filter). There are strategies to speed up dashboards with this problem by using partial extracts that generate all the values used for filtering (you can sometimes use parameters for a similar speed gain). Or even just designing the filters intelligently. But speed is usually the limiting factor not the size of the source table.
The only real limit on how much Tableau can handle is how many points are displayed, and that depends on RAM. In my experience a 4GB machine will choke on a chart with a couple of million points (e.g. a map plotting every postcode in the UK). But on a 16GB RAM machine I have never found a limit other than how fast the points are drawn.

Google BigQuery basic questions

These may be a few basic questions.
When I load data into BQ tables, where exactly is the data stored (assuming billing is already enabled)? If it is a data center, what is the data center's capacity? Does our data co-exist with other users' data?
When we run queries, how are they processed? What is the default compute engine used for this?
How can we increase query processing capacity?
Thanks
CP
BigQuery datacenter capacity is practically unlimited. If you plan to upload petabytes in a very short time frame you might need to contact support first just to make sure, but for normal big loads everything should be fine.
BigQuery doesn't use Compute Engine; queries run on a series of very large clusters instead. That's the secret to a low cost per query, without ongoing costs per hour like other alternatives.
BigQuery elastically increases the number of CPUs involved in your query as the query needs them. You don't need to manage storage or processing capacity.