How do we use BigQuery HLL (HyperLogLog) functions in Looker?

I have a quick question on how we can use the BigQuery HLL functions in Looker.
For example, there is a BigQuery table with the following structure (see the "Sample BigQuery Table" image).
In Looker, do I need to define this field respondents_hll as a dimension or a measure?
If I use it as a measure, how can I extract the value of this HLL field at a different grouping level (for example, country only) and use it in dashboards without losing its meaning?
If I bring it in as a dimension, I still want to be able to show the extract of this HLL field at different levels. How will Looker understand the grouping?
Best Regards,
Sam
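For context, the BigQuery-side operation being asked about (re-aggregating stored sketches at a coarser grouping level) looks roughly like the sketch below; the table name is a placeholder and respondents_hll is the sketch column from the question. In Looker, such a merge expression would typically be wrapped in a measure, but the modeling details depend on the project.

```sql
SELECT
  country,
  -- Merge the stored HLL sketches and return the approximate distinct count
  -- of respondents per country, without double-counting across rows.
  HLL_COUNT.MERGE(respondents_hll) AS approx_unique_respondents
FROM `my_project.my_dataset.survey_responses_hll`  -- placeholder table name
GROUP BY country;
```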

Related

Can you create pre-aggregated Dimensions/Measurements like OLAP in BigQuery with Tableau?

During the cloud migration of an on-premises Microsoft SQL database, the OLAP cube that is part of it should also be replaced (but not migrated directly). There is a business requirement to keep the Tableau functionality of selecting different measurements and dimensions with their corresponding aggregations, as is currently possible when connecting Tableau to the OLAP cube.
The underlying Data Source View includes roughly 10 tables (e.g. customer, sales, payment method, customer segmentation, time). Via OLAP, an analysis such as "give me the average sales per payment method per customer segment for every week" is a couple of clicks; in pure SQL it already takes some effort.
How can you offer predefined aggregations over some BigQuery tables without users having to write the joins and aggregations themselves, mainly because that takes much more time than simple drag & drop (SQL skills and query execution time are not the issue)?
The answer turns out to be pretty straightforward:
Join all source data together and write it into one flat table in BigQuery that includes the same information as the Data Source View in the OLAP cube. Tableau then connects to this table. The "measurements" logic from the cube is implemented as calculations in Tableau, and the table columns are the dimensions.
Some caution is needed when replicating the measurements, because 1:n relations in the Data Source View result in multiplied rows in the flat table. This can be solved with the correct use of distinct functions (e.g. "Distinct Count") in the measurement definition.
The table ends up quite large, but queries on it are very fast, resulting in a performance increase compared to the OLAP cube with the same user experience as using a cube in Tableau.
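A hypothetical sketch of the "one flat table" step; all project, dataset, table, and column names below are placeholders, not the asker's actual schema:

```sql
-- Flatten the joined sources into one wide reporting table in BigQuery.
CREATE OR REPLACE TABLE `my_project.reporting.sales_flat` AS
SELECT
  s.sale_id,
  s.sale_date,
  s.amount,
  c.customer_id,
  c.customer_segment,
  p.payment_method
FROM `my_project.source.sales` AS s
JOIN `my_project.source.customer` AS c
  ON s.customer_id = c.customer_id
JOIN `my_project.source.payment_method` AS p
  ON s.payment_method_id = p.payment_method_id;
```

In Tableau, a measure such as average sales per payment method per customer segment then becomes a calculation over this table, and distinct aggregations (e.g. a distinct count on customer_id) guard against the row multiplication introduced by the 1:n joins.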

How to get table/col stats from BigQuery tables

Does BigQuery provide any SQL commands for retrieving table cardinality?
For example, some RDBMS providers have sql commands like:
show table_stats schemaname tablename
for getting table cardinality.
Also, what about column stats? Such as the number of distinct values in a column, and MIN, MAX, etc.
I saw that the BigQuery console provides both table and column stats, but I wonder whether this information is accessible through SQL statements.
Thanks!
The features you would like to use are more a property of the query language than of the tool or service itself.
To get stats about a table, see the Getting table metadata documentation, which explains how to retrieve metadata for tables and columns. Some of the information you get when running the queries from that doc:
For tables: the name of the dataset that contains the table, the default table lifetime in days, and other TABLE_OPTIONS view results.
For columns: the name of the project that contains the dataset, the column's standard SQL data type, and whether the value is updatable, stored, or hidden. See the COLUMNS view for more results.
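As a hedged illustration of the queries that doc describes (the project and dataset names here are placeholders), the metadata comes from the INFORMATION_SCHEMA views:

```sql
-- Table-level metadata: one row per table option (e.g. expiration) in the dataset.
SELECT table_name, option_name, option_value
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`;

-- Column-level metadata: name, standard SQL data type, nullability, and so on.
SELECT table_name, column_name, data_type, is_nullable
FROM `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`;
```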
To get stats about columns, you can use the COUNT DISTINCT function, which in legacy SQL returns a statistical approximation of the number of unique values in a column (in standard SQL, COUNT(DISTINCT ...) is exact and APPROX_COUNT_DISTINCT provides the approximation).
I found this Community blog, where they show different examples and ways to get unique values. It even explains how to increase the approximation threshold.
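For column-level stats on a specific table, a minimal sketch (table and column names are placeholders) looks like this:

```sql
SELECT
  COUNT(DISTINCT user_id)        AS distinct_users,   -- exact in standard SQL
  APPROX_COUNT_DISTINCT(user_id) AS approx_distinct,  -- HLL-based approximation
  MIN(created_at)                AS min_created_at,
  MAX(created_at)                AS max_created_at
FROM `my_project.my_dataset.events`;
```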
EDIT
It seems that BigQuery does not offer a count of unique fields. However, you can always take a look at the Schema and Details tabs in the BigQuery UI, where the field names are shown along with their type and description.
Example from the Public Datasets:
Hope this is helpful.

Youtube Data Studio, how to create calculated fields from two different data sources (such as two BigQuery tables)

https://support.google.com/datastudio/answer/6390659
The instruction video ("Create new calculated dimensions and metrics, which you can then use in your charts and controls") shows how to create a calculated field from one data source.
But I need a calculated field that sums up the values of fields from two different data sources (to be specific, two BigQuery tables).
I could not find any way to do that, except by creating a BigQuery view that sums up the values from the two tables and then using this view as my new data source. Not sure if this is the right and only way to do so.
Thanks for any advice.
You cannot reference more than one data source to create a calculated field in Data Studio. Pushing the calculation down to SQL/BigQuery is the correct approach. At least, that's how we do it too.
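For example, a view along these lines can serve as the single data source (all names are placeholders, assuming both tables share the same date and value columns):

```sql
CREATE OR REPLACE VIEW `my_project.reporting.combined_daily_totals` AS
SELECT
  event_date,
  SUM(value) AS total_value  -- combined total across both source tables
FROM (
  SELECT event_date, value FROM `my_project.my_dataset.table_a`
  UNION ALL
  SELECT event_date, value FROM `my_project.my_dataset.table_b`
)
GROUP BY event_date;
```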

BigQuery best practice for segmenting tables by dates

I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what the available range for querying is. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I might not know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query gets slower as the number of tables grows.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that with daily tables you won't be able to query 3 years of data in a single query (3 * 365 > 1000).
Remember that unions in BigQuery's legacy SQL don't use the UNION keyword; instead, tables are listed with the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
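For example, in legacy SQL (the dialect this answer refers to; table names are placeholders), listing two daily tables separated by a comma unions them:

```sql
-- Legacy SQL: the comma performs a UNION ALL over the listed tables.
SELECT COUNT(*) AS rows_in_range
FROM [mydataset.logs_20131224], [mydataset.logs_20131225];
```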
Table discovery
API: tables.list will list all tables in a dataset.
SQL: to query the list of tables within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
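For illustration, a minimal sketch with today's standard SQL (project, dataset, and column names are placeholders):

```sql
-- One logical "logs" table, partitioned by date; queries that filter on the
-- partitioning column only scan the matching partitions.
CREATE TABLE `my_project.my_dataset.logs`
(
  log_time TIMESTAMP,
  machine  STRING,
  message  STRING
)
PARTITION BY DATE(log_time);

-- Scans only the 2013-12-25 partition.
SELECT COUNT(*) AS row_count
FROM `my_project.my_dataset.logs`
WHERE DATE(log_time) = '2013-12-25';
```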

OLAP dimension for Age

We have a client table with a field DateOfBirth.
I'm new to MS Analysis Services, OLAP, and data cubes. I'm trying to report on client metrics by age categories (18-25, 26-35, 35-50, 50-65, 66+).
I don't see a way to accomplish this. (Note: I'm not concerned with age at the time of a sale. I'm interested in knowing the age distribution of my current active customers).
You can create a calculation, either in TSQL or as a Named Calculation in the Data Source View, that computes CurrentAge based on the DateOfBirth field.
You will likely also want to implement another similarly derived field that assigns the CurrentAge value to a bucket in your age ranges; this is a simple TSQL CASE statement, as sketched below.
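A minimal sketch of such a derived bucket field (the bucket boundaries are adjusted here to be non-overlapping, and DATEDIFF(YEAR, ...) counts year boundaries, so it only approximates age unless you adjust for whether the birthday has passed):

```sql
-- T-SQL expression for a derived AgeBucket field in the Data Source View.
CASE
  WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 18 AND 25 THEN '18-25'
  WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 26 AND 35 THEN '26-35'
  WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 36 AND 50 THEN '36-50'
  WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 51 AND 65 THEN '51-65'
  WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) >= 66 THEN '66+'
END
```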
Depending on how large the client table is (and the analytical purpose), you may want to make this into a fact table or at least use snowflaking to separate this from the other relatively static attribute fields in the client table.