I am considering Google BigQuery as my data warehouse option. I have data in Google Cloud SQL and Google Cloud Bigtable, with REST APIs exposed on top of them so the data can be consumed by any UI. I am planning to use the same APIs as the source for my ETL job, which will append data into BigQuery.
From this API, I can get daily data. Let's take an example: total entities - 10,000; measurement types associated with each entity - 1,000. So per year (a single entry of each measurement per day): 365 (number of days) * 10,000 (total entities) * 1,000 (total measurements) = 3,650,000,000 rows (about 3.65 billion).
Right now, I have two choices of schema design:
Create a single table with one entity ID column and 1,000 measurement columns
Shard into a separate table per year and later use UNION queries to fetch data (see the sketch below)
Please let me know which option would be best in terms of cost and scalability. I understand the second option would be more cost effective, as it requires smaller table scans.
Are there any better choices available?
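For reference, a minimal sketch of what the second option's UNION query could look like in BigQuery standard SQL (the dataset, table, and column names here are hypothetical):
-- Combine the per-year shards at query time (hypothetical names),
-- selecting only the columns that are actually needed.
SELECT entity_id, measurement_001
FROM mydataset.measurements_2016
UNION ALL
SELECT entity_id, measurement_001
FROM mydataset.measurements_2017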
I'm a little new to BQ. I'm running a very simple query over a view to get a quick look at the data, but when I add, say, LIMIT 100 to see just the first 100 rows, I don't get a reduction in the data scanned and hence the cost. If I simply want to do this, what is an inexpensive way to get the data?
For example:
SELECT * FROM table
uses exactly the same projected data as
SELECT * FROM table LIMIT 100
Is there no simplification under the hood? Is BQ scanning all rows and then taking the top 100?
BigQuery charges based on the data queried, and unfortunately LIMIT does not reduce the volume of data queried.
The following can help:
using the table preview in the console (this is free, if I recall correctly), but it does not work on views or some types of attached tables
reducing the number of columns that are queried
if your data is partitioned, you can query a specific partition - https://cloud.google.com/bigquery/docs/querying-partitioned-tables
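As a rough illustration of the last two points, a query restricted to specific columns and a single partition might look like this (assuming an ingestion-time partitioned table; the table and column names are hypothetical):
-- Query only the needed columns from one day's partition (hypothetical names)
SELECT event_type, user_id
FROM mydataset.events
WHERE _PARTITIONTIME = TIMESTAMP('2017-06-01')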
There is information from Google on this page https://cloud.google.com/bigquery/docs/best-practices-performance-input
We have a requirement to store website analytical data (think: views on a page, interactions, etc). Note: this is separate from Google Analytics data, as we want to own the data and enrich it as we see fit.
Storage requirements:
each 'event' will have a timestamp, event type and some other metadata (user id, etc)
the storage is append only. No updates or deletes
writes are consistent, but not IoT scale - maybe 50/sec
estimating growth of about 100 million rows a year
Query requirements:
graphing data cumulatively over a period of time
slice/filter data by all the metadata as well as day/week/month/year slices
will likely need to be integrated into a larger data warehouse
Question: Is this a no-brainer for a time-series DB like InfluxDB, or can I get away with a well-tuned SQL Server table?
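For context, a minimal sketch of the kind of append-only SQL Server table described above (the column names and types are assumptions, not a final design):
-- One row per event; append only, no updates or deletes (hypothetical schema)
CREATE TABLE dbo.PageEvents (
    EventId   BIGINT IDENTITY(1,1) NOT NULL,
    EventTime DATETIME2 NOT NULL,
    EventType VARCHAR(50) NOT NULL,
    UserId    VARCHAR(64) NULL,
    Metadata  NVARCHAR(MAX) NULL  -- other metadata, e.g. stored as JSON
);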
I have a SQL DW which is about 30 GB. I want to use PowerBI to visualize this data, but I noticed PowerBI Desktop only supports file sizes up to 250 MB. What is the best way to connect PowerBI to visualize this data?
You have a couple of choices depending on your use case:
Direct query of the source data
View based aggregations of the source data
Direct Query
For smaller datasets (think in the thousands of rows), you can simply connect PowerBI directly to Azure SQL Data Warehouse and use the table view to pull in the data as necessary.
View Based Aggregations
For larger datasets (think millions, billions, even trillions of rows) you're better served by running the aggregations within SQL Data Warehouse. This can take the shape of a view that creates the aggregations (think sales by hour instead of every individual sale), or you can create a permanent table at data-loading time through a CTAS operation that contains the aggregations your users commonly query against. The latter CTAS model leaves the user with a simple select-with-filter operation (say, aggregated sales greater than today minus 90 days). Once the view or reporting table is created, you can simply connect to PowerBI as you normally would.
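A rough sketch of what such a CTAS-built reporting table could look like (the table and column names are hypothetical):
-- Pre-aggregate sales by hour at load time so PowerBI queries stay small
-- (hypothetical table and column names)
CREATE TABLE dbo.SalesByHour
WITH (DISTRIBUTION = HASH(StoreId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT StoreId,
       DATEADD(hour, DATEDIFF(hour, 0, SaleTime), 0) AS SaleHour,
       SUM(Amount) AS TotalSales,
       COUNT(*)    AS SaleCount
FROM dbo.FactSales
GROUP BY StoreId, DATEADD(hour, DATEDIFF(hour, 0, SaleTime), 0);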
The PowerBI team has a blog post - Exploring Azure SQL Data Warehouse with PowerBI - that covers this as well.
You could also create a query (Power Query / M) that retrieves only the required level of data (i.e. groups, joins, filters, etc). If done right, the queries are translated to T-SQL and only a limited amount of data is downloaded into the Power BI designer.
I'm a PhD student from Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project which needs the historical events from GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.
I just found that the price Google BigQuery shows on the console is not updated in real time... After I had been running the program for a few hours, the fee was only a little over 4 dollars, so I thought the price was reasonable and I kept the program running. After 1-2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the Google BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...
It is my fault that I didn't realize I would need to pay this large an amount of money for executing queries and obtaining data from Google BigQuery.
This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.
Thank you very much & Best Regards,
Lisa
A year later update:
Please note some big developments since this situation:
Query prices are down 85%.
GitHub Archive is publishing daily and yearly tables now - so while developing your queries, always test them on smaller datasets.
BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in seconds.
Linear pricing scaling is a feature: most (or all?) other databases I know of would require exponentially more expensive resources, or are simply not able to handle these amounts of data - at least not in a reasonable time frame.
That said, linear scaling means that a query over a terabyte is 1,000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers the "dry run" flag, which lets you see exactly how much data will be queried before running the query - and adjust accordingly.
In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries will quickly amount to a terabyte of data, and so on.
There are ways to make these same queries consume much less data:
Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you query, so unnecessary columns add unnecessary costs.
For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making this query more than 10 times less expensive.
Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.
The GitHub Archive table contains data for all time. Partition it if you only want to see one month of data.
For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than the big one!
SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at >= '2014-01-01' AND created_at < '2014-02-01'
-- save the result into a new table 'timeline_201401'
Combining these methods, you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.
(I'm working with GitHub Archive's owner to get them to store monthly data instead of one monolithic table, to make this even easier.)
I am designing a system that should calculate several KPIs. There is a relatively simple transactional database, and data from this database is transferred to a SQL Server data warehouse (1 fact table, 5 dimension tables). The fact table contains data about call-center phone sessions.
Nearly 1,000 new fact table rows per second are being added to the data warehouse.
I need to build KPIs (e.g. average call length per call-center operator for the last 2 hours), and the time from a data update (a new fact row) to the KPI update should be less than 2 seconds.
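For clarity, the kind of KPI described above corresponds roughly to a query like this (the table and column names are hypothetical):
-- Average call length per operator over the last 2 hours (hypothetical names)
SELECT OperatorId,
       AVG(CAST(CallLengthSeconds AS FLOAT)) AS AvgCallLengthSeconds
FROM dbo.FactCallSessions
WHERE CallStart >= DATEADD(hour, -2, GETDATE())
GROUP BY OperatorId;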
Is it possible to build such a fast solution using SQL Server Analysis Services?