Using PowerBI to visualize large amounts of data on a SQL Data Warehouse

I have a SQL DW which is about 30 GB. I want to use PowerBI to visualize this data, but I noticed PowerBI Desktop only supports file sizes up to 250 MB. What is the best way to connect PowerBI to visualize this data?

You have a couple of choices depending on your use case:
Direct query of the source data
View based aggregations of the source data
Direct Query
For smaller datasets (think in the thousands of rows), you can simply connect PowerBI directly to Azure SQL Data Warehouse and use the table view to pull in the data as necessary.
View Based Aggregations
For larger datasets (think millions, billions, even trillions of rows) you're better served by running the aggregations within SQL Data Warehouse. This can take the shape of a view that creates the aggregations (think sales by hour instead of every individual sale), or you can create a permanent table at data-loading time through a CTAS (CREATE TABLE AS SELECT) operation that contains the aggregations your users commonly query against. With the latter CTAS model, the user's query becomes a simple select-with-filter operation (say, aggregated sales greater than today minus 90 days). Once the view or reporting table is created, you can connect to it from PowerBI as you normally would.
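For illustration, a minimal CTAS sketch of such an aggregated reporting table, followed by the kind of filtered select a user (or PowerBI) would then run. The table and column names (FactSales, SalesByHour, SaleDate, SaleAmount) are assumptions, not from the original post:

-- Build the hourly aggregate once, at data-loading time (Azure SQL Data Warehouse CTAS).
CREATE TABLE dbo.SalesByHour
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT
    DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleDate), 0) AS SaleHour,
    SUM(SaleAmount)                               AS TotalSales,
    COUNT(*)                                      AS SaleCount
FROM dbo.FactSales
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleDate), 0);

-- The user-facing query is then a simple select with a filter, e.g. the last 90 days:
SELECT SaleHour, TotalSales, SaleCount
FROM dbo.SalesByHour
WHERE SaleHour >= DATEADD(DAY, -90, GETDATE());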
The PowerBI team has a blog post - Exploring Azure SQL Data Warehouse with PowerBI - that covers this as well.

You could also create a query (Power Query / M) that retrieves only the required level of data (i.e. groups, joins, filters, etc.). If done right, the queries are translated to T-SQL and only a limited amount of data is downloaded into the Power BI designer.
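To illustrate the query folding that makes this work: a grouped-and-filtered Power Query step can fold into a single T-SQL statement sent to the warehouse, roughly along these lines (the table and column names are hypothetical, and the SQL Power BI actually generates will differ):

-- Only the grouped, filtered result travels to Power BI, not the underlying rows.
SELECT Region, SUM(SaleAmount) AS TotalSales
FROM dbo.FactSales
WHERE SaleDate >= DATEADD(DAY, -90, GETDATE())
GROUP BY Region;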

Related

In Google Bigquery, how to denormalize tables when the data is from different 3rd party source?

I have data about contacts in Salesforce. I also have data about the contacts in Intercom/Zendesk. I want to create a denormalized table where the data from Salesforce and Intercom is merged into a single table so I can query everything about a contact in one place. Imagine I dumped the Salesforce data into a BigQuery table. The problem is that we might not dump Intercom/Zendesk until later. So we may only add Salesforce data to a BigQuery table now, and later we may add Intercom data. My question is how to merge these (existing data in the Salesforce BQ table and new data from Intercom)? Assume that Email is the primary key in both 3rd party sources and we can join on it.
Do we need to take the Salesforce data out of the BQ table and run it through some tool to merge both tables and create a new table in BQ?
What will happen if we keep getting new data in both Salesforce and Intercom?
Your case seems to be a good use case for Views.
A view is basically a virtual table that points to a query. You can define a view based on a query (let's call it query_1) and then you will be able to see that view as a table. However, every time you run a query (let's call it query_2) using that view as the source, internally BigQuery will execute query_1 and then execute your query_2 against the results of query_1.
In your case, you could create a query that uses a JOIN to merge your tables and save this query as a view. You can create a view by clicking Save view in the BigQuery console and then filling in a few required fields before saving.
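A minimal sketch of such a view, assuming the two tables are named salesforce_contacts and intercom_contacts and share an Email column (the project, dataset, table and column names are all assumptions):

-- Hypothetical names throughout; Email is assumed to be the join key in both sources.
CREATE OR REPLACE VIEW `my_project.my_dataset.contacts_denormalized` AS
SELECT
  s.Email,
  s.AccountName,   -- example Salesforce field
  i.LastSeen       -- example Intercom field
FROM `my_project.my_dataset.salesforce_contacts` AS s
LEFT JOIN `my_project.my_dataset.intercom_contacts` AS i
  ON s.Email = i.Email;

The LEFT JOIN keeps Salesforce contacts visible even while they have no matching Intercom row yet, and because a view is evaluated at query time, new rows arriving in either source table show up automatically the next time you query it.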
In BigQuery there are also materialized views, which implement some caching technology in order to make the view behave more like a table.
Some benefits of materialized views are:
Reduction in the execution time and cost for queries with aggregate functions. The largest benefit is gained when a query's computation cost is high and the resulting data set is small.
Automatic and transparent BigQuery optimization, because the optimizer uses a materialized view, if available, to improve the query execution plan. This optimization does not require any changes to the queries.
The same resilience and high availability as BigQuery tables.
To create a materialized view you have to run the below command:
CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
AS <my-query>
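For example, a materialized view that pre-aggregates a hypothetical clicks table might look like this (the dataset, table and column names are assumptions):

-- Pre-aggregate clicks per product so aggregate queries hit the cached result.
CREATE MATERIALIZED VIEW `project-id.my_dataset.product_clicks_mv`
AS
SELECT product_id, SUM(clicks) AS total_clicks, COUNT(*) AS row_count
FROM `project-id.my_dataset.clicks_table`
GROUP BY product_id;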
Finally, I would like to paste here the reference links for both views and materialized views in BigQuery. I suggest that you take a look at them and decide which one fits your use case.
You can read more about querying Google Cloud Storage at https://cloud.google.com/bigquery/external-data-cloud-storage.
You can take the extracts and place them into Google Cloud Storage under buckets i.e. Salesforce bucket and Zendesk bucket.
Once the files are available, you can create external tables on those buckets (one table for each bucket) so that you can query them independently.
Once you can query them, you can perform joins as with normal tables.
You can replace the files in buckets when new data comes.
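A sketch of that setup, assuming the exports land under gs://salesforce-bucket/ and gs://zendesk-bucket/ (all bucket, dataset, table and column names here are assumptions):

-- Hypothetical external tables over the CSV exports in each bucket.
CREATE EXTERNAL TABLE `my_project.my_dataset.salesforce_ext` (
  Email       STRING,
  AccountName STRING
)
OPTIONS (format = 'CSV', uris = ['gs://salesforce-bucket/*.csv'], skip_leading_rows = 1);

CREATE EXTERNAL TABLE `my_project.my_dataset.zendesk_ext` (
  Email       STRING,
  TicketCount INT64
)
OPTIONS (format = 'CSV', uris = ['gs://zendesk-bucket/*.csv'], skip_leading_rows = 1);

-- Join them like normal tables on the shared Email key.
SELECT s.Email, s.AccountName, z.TicketCount
FROM `my_project.my_dataset.salesforce_ext` AS s
JOIN `my_project.my_dataset.zendesk_ext` AS z
  ON s.Email = z.Email;

Because the external tables read the files at query time, replacing the files in a bucket is enough to refresh the data.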

Storage of website analytical data - relational or time series?

We have a requirement to store website analytical data (think: views on a page, interactions, etc). Note: this is separate from Google Analytics data, as we want to own the data and enrich it as we see fit.
Storage requirements:
each 'event' will have a timestamp, event type and some other metadata (user id, etc)
the storage is append only. No updates or deletes
writes are consistent, but not IoT scale. Maybe 50/sec
estimating growth of about 100 million rows a year
Query requirements:
graphing data cumulatively over a period of time
slice/filter data by all the metadata as well as day/week/month/year slices
will likely need to be integrated into a larger data warehouse
Question: Is this a no-brainer for a time-series DB like InfluxDB, or can I get away with a well-tuned SQL Server table?
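For reference, a minimal sketch of what such a tuned, append-only SQL Server table and a typical rollup query might look like (the table, index and column names are assumptions):

-- Append-only event table, clustered on time so date-range scans and
-- day/week/month rollups stay cheap.
CREATE TABLE dbo.WebEvents (
    EventId   BIGINT IDENTITY(1,1) NOT NULL,
    EventTime DATETIME2(3)         NOT NULL,
    EventType VARCHAR(50)          NOT NULL,
    UserId    BIGINT               NULL,
    Metadata  NVARCHAR(MAX)        NULL
);

CREATE CLUSTERED INDEX CIX_WebEvents_EventTime
    ON dbo.WebEvents (EventTime, EventId);

-- Example rollup: daily counts per event type over the last month.
SELECT CAST(EventTime AS DATE) AS EventDate, EventType, COUNT(*) AS Events
FROM dbo.WebEvents
WHERE EventTime >= DATEADD(MONTH, -1, SYSUTCDATETIME())
GROUP BY CAST(EventTime AS DATE), EventType;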

Simple Question about how Tableau Desktop talks to a very large database

I am just curious as to how Tableau talks to a large data source. For example, if I have a data source that has 1.4 million records and I make a simple table with this data, maybe a graph, etc., how does Tableau get this data? Does it query the data source, ask the data source how much it has, pull in the first 10,000 rows, then go back and retrieve the next 10k, etc.? Or does it do it in one go? Also, I want to know where Tableau stores the data it receives.
Hope my question makes sense - Just trying to understand the underlying mechanisms.
Thank you!
Tableau can work with external data sources in more than one way. You can extract the entire DB content to a local file (called an extract) or you can have a live connection to the database.
If the connection is live, then Tableau sends the DB queries designed to return the data you want not the entire content of the DB. So if you have 1.4m records containing, say, a full year's sales information and you want monthly totals, Tableau will send a query asking the DB to return the monthly totals. This will result in just 12 numbers being returned to Tableau: the DB itself will do the work and Tableau doesn't need to pull 1.4m numbers and add them up. This is how most data sources work: the user requests a result (using SQL queries) and the DB works out how to return that result. This means you don't need to copy the entire database every time you want to add some numbers up.
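For example, against a hypothetical Sales table a live connection would send something along these lines rather than fetching the 1.4m rows (the table and column names are assumptions, and the SQL Tableau actually generates will differ):

-- The database computes the 12 monthly totals and returns only those.
SELECT DATEPART(YEAR, SaleDate)  AS SaleYear,
       DATEPART(MONTH, SaleDate) AS SaleMonth,
       SUM(SaleAmount)           AS MonthlyTotal
FROM dbo.Sales
GROUP BY DATEPART(YEAR, SaleDate), DATEPART(MONTH, SaleDate);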
Live queries won't sample the database: the answers you get will usually be the correct totals (though some sources like Google's BigQuery will use sampling for some statistical aggregates unless told otherwise).
Both Tableau and many databases will cache the results of queries done recently so the results will be faster. Tableau's results will be held locally.

Use case of using BigQuery or Bigtable for querying aggregate values?

I have a use case for designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the below would be a better option for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing the HBase shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery. Also, Bigtable supports CSV imports and querying. The BigQuery quotas also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro based on the documentation, which means I could run multiple load jobs if loading more than 15 TB, I assume.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above usecase?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive improvements in cost for the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and cost (see the sketch after this list).
Instead of CREATE TABLE, you can do imports via the API; those are free (instead of incurring the query cost of CREATE TABLE).
15 TB can be handled easily by BigQuery.
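A minimal sketch of that flow, assuming the CSVs sit under gs://my-bucket/csv/ and have an event_date column to partition on (the bucket, dataset, table and column names are all assumptions):

-- 1. Read the CSVs straight from GCS through an external table.
CREATE EXTERNAL TABLE `my_project.raw.events_csv` (
  event_date DATE,
  user_id    STRING,
  amount     NUMERIC
)
OPTIONS (format = 'CSV', uris = ['gs://my-bucket/csv/*.csv'], skip_leading_rows = 1);

-- 2. Materialize into a native table, partitioned and clustered so the most
--    common queries scan less data and cost less.
CREATE TABLE `my_project.analytics.events`
PARTITION BY event_date
CLUSTER BY user_id
AS
SELECT * FROM `my_project.raw.events_csv`;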

Quickly Pivoting Large Data

We are developing a product which can be used for developing predictive models and the slicing and dicing of the data in order to provide BI.
We are having two kind of data access requirements.
For predictive modeling, we need to read data on a daily basis and process it row by row. For this, a normal SQL Server database is sufficient and we are not running into any issues.
For slicing and dicing data of huge sizes, like 1 GB of data with, let us say, 300M rows, we want to pivot that data easily with minimal response time.
The current SQL database has response-time issues with this.
We would like our product to run on any normal client machine with 2 GB of RAM and a Core 2 Duo processor.
I would like to know how I should store this data and how I can create a pivoting experience for each of the dimensions.
Ideally we will have data of, let us say, daily sales by salesperson, by region, by product for a large corporation. We would then like to slice and dice it by any dimension and also be able to compute aggregations, unique values, maximums, minimums, averages and some other statistical functions.
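For concreteness, each such slice boils down to an aggregate query over the chosen dimensions; a minimal T-SQL sketch against a hypothetical daily sales table (all names are assumptions), which is exactly the kind of work the cube or pivot engine discussed in the answers has to do quickly:

-- Slice by region and month with several aggregates over a hypothetical fact table.
SELECT
    Region,
    DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1) AS SaleMonth,
    SUM(Amount)                 AS TotalSales,
    COUNT(DISTINCT SalesPerson) AS UniqueSalesPeople,
    MAX(Amount)                 AS MaxSale,
    MIN(Amount)                 AS MinSale,
    AVG(Amount)                 AS AvgSale
FROM dbo.DailySales
GROUP BY Region, DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1);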
I would build an in-memory cube on top of that data. To give you an example, icCube has sub-second response times for 3-4 measures over 50M rows on a single-core i5, without any cache or pre-aggregation (i.e., this response time is constant across all the dimensions).
Contact us directly for more details about how to integrate it into your product.
You could also use PowerPivot to do this. It is a free add-in for Excel 2010 that allows large data sets to be handled, sliced and diced, etc.
If you want to code around it, you can connect to the PowerPivot database (effectively an SSAS cube) using the SSAS database connector.
Hope that is of some use.