How to unnest Google Analytics custom dimensions in Google Data Prep - google-bigquery

Background story:
We use Google Analytics to track user behaviour on our website. The data is exported daily into BigQuery. Our implementation is quite complex and we use a lot of custom dimensions.
Requirements:
1. The data needs to be imported into our internal databases to enable better and more strategic insights.
2. The process needs to run without requiring human interaction.
The problem:
Google Analytics data needs to be in a flat format so that we can import it into our database.
Question: How can I unnest custom dimensions data using Google Data Prep?
What it looks like:
----------------
customDimensions
----------------
[{"index":10,"value":"56483799"},{"index":16,"value":"·|·"},{"index":17,"value":"N/A"}]
What I need it to look like:
----------------------------------------------------------
customDimension10 | customDimension16 | customDimension17
----------------------------------------------------------
56483799 | ·|· | N/A
I know how to achieve this using a standard SQL query in the BigQuery interface, but I really want to have a Google Data Prep flow that does it automatically.

Define the flat format and create it in BigQuery first.
You could
create one big table and repeat several values using CROSS JOINs on all the arrays in the table
create multiple tables (per array) and use ids to connect them, e.g.
for session custom dimensions concatenate fullvisitorid / visitstarttime
for hits concatenate fullvisitorid / visitstarttime / hitnumber
for products concatenate fullvisitorid / visitstarttime / hitnumber / productSku
The second option is a bit more effort, but you save storage because you're not repeating all the information for everything.
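For reference, a minimal Standard SQL sketch of the session-level flattening (the table path is a placeholder, and the dimension indexes are taken from the example above):
-- Pivot session-level custom dimensions into flat columns (Standard SQL).
-- `your-project.your_dataset.ga_sessions_*` is a placeholder for the GA export tables.
SELECT
  fullVisitorId,
  visitStartTime,
  -- concatenated key, as suggested above, to join this table to hit-level tables
  CONCAT(fullVisitorId, '-', CAST(visitStartTime AS STRING)) AS session_id,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 10) AS customDimension10,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 16) AS customDimension16,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 17) AS customDimension17
FROM `your-project.your_dataset.ga_sessions_*`
This defines the flat target format that the downstream import expects.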

Related

How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to get the information of how many active users I have in my app separated by the 2 or 3 latest versions of the app.
I've read some documentation and other Stack Overflow questions, but none of them solved my problem (and some had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users - this solution is probably the best, but even after changing the dataset name correctly and removing the _TABLE_SUFFIX conditions, it kept returning a single column n_day_active_users_count = 0)
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this is not returning any values for me; I don't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts in Data Studio, so using the REST API would make it harder to join my two solutions - one from BigQuery and the other from the REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So, I started to write the solution from scratch, and this is what I have so far:
SELECT
user_pseudo_id,
app_info.version,
ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version) / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: user_of_X_version/users_of_all_versions.
(And the results are fairly close to the ones shown on the Google Analytics web platform - I believe the difference is due to the date on which I turned on the BigQuery integration. But I'd like some confirmation that my logic is correct.)
The biggest problem in my code right now is that I cannot write it without grouping by user_pseudo_id (because when I don't, BigQuery says: "SELECT list expression references column user_pseudo_id which is neither grouped nor aggregated at [2:3]"), and that's why I have duplicated rows in the query result.
Also, about the first link in the examples: is there any possibility of a record having an engagement_time_msec param with a value < 0? If not, why is that condition in the WHERE clause?
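For what it's worth, a minimal sketch of the ratio described above, grouping only by version so that no per-user rows are duplicated (the table path is a placeholder; whether this matches the GA dashboard's active-user definition is a separate question, and a user active on more than one version is counted once per version here):
-- Distinct users per version, divided by the total of per-version distinct users.
SELECT
  app_info.version AS version,
  COUNT(DISTINCT user_pseudo_id) AS users,
  ROUND(COUNT(DISTINCT user_pseudo_id) / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `your-project.analytics_XXXXXX.events_*`
WHERE platform = 'ANDROID'
GROUP BY version
ORDER BY version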

Improve performance of subtracting values in the same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: there is one Meterpoint which is a submeter of another Meterpoint. I'd be interested in the delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I found a solution using a subquery, but it does not appear to be very efficient.
SELECT
A.dat_Time,
(A.int_Counts - (SELECT B.int_Counts FROM tbl_Metering AS B WHERE B.fk_MetPoint = 2 AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint=1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach (using a GROUP BY) instead of an inline sub-query for each row, but it is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
However, if this is a metering project, then we can expect a high volume of records - if not now, certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably will only ever occur once for each record, whereas your SELECT will be executed many times.
This can be done using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading, which greatly simplified many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
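As a rough sketch of that write-time approach (tbl_MeteringDelta, its extra columns, and the SQL Server-style DATEDIFF are assumptions for illustration; the same statement could run from a trigger, from the business logic, or as a periodic batch):
-- Populate a per-reading table that already carries the previous reading and the deltas.
-- Only tbl_Metering and its columns come from the question; everything else is a hypothetical name.
INSERT INTO tbl_MeteringDelta (ID, dat_Time, fk_MetPoint, int_Counts, prev_ID, prev_dat_Time, value_delta, time_delta_s)
SELECT ID,
       dat_Time,
       fk_MetPoint,
       int_Counts,
       LAG(ID)       OVER (PARTITION BY fk_MetPoint ORDER BY dat_Time) AS prev_ID,
       LAG(dat_Time) OVER (PARTITION BY fk_MetPoint ORDER BY dat_Time) AS prev_dat_Time,
       int_Counts - LAG(int_Counts) OVER (PARTITION BY fk_MetPoint ORDER BY dat_Time) AS value_delta,
       DATEDIFF(SECOND, LAG(dat_Time) OVER (PARTITION BY fk_MetPoint ORDER BY dat_Time), dat_Time) AS time_delta_s
FROM tbl_Metering;
Analysis queries can then read tbl_MeteringDelta directly instead of self-joining the raw feed.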

Dynamic query using parameters in Tableau

I am trying to use Tableau to create a web dashboard that interacts with a Postgres database with a fair number of rows.
The key here is that the relevant data falls within latitude/longitude boundaries, so I'm using Tableau parameters in a custom SQL statement to get what I need, like so:
SELECT id, lat, lng... FROM my_table
WHERE lat >= <Parameters.MIN_LAT> AND lat <= <Parameters.MAX_LAT>
AND lng >= <Parameters.MIN_LNG> AND lng <= <Parameters.MAX_LNG>
LIMIT 10000
I'm setting these parameters using the Tableau JavaScript API based on the boundaries of a Google Maps widget. When the map is moved, I refresh the parameters and the data needs to update as well. This refresh is not done constantly, but frequently enough that long wait times are not acceptable.
Because the lat/lng boundaries are dynamic and the full unfiltered table is very big (~1GB) I presumed it is impractical to create a data extract. Am I wrong?
Furthermore, when I change some of the in-Tableau filters I'm applying, there is a very long wait, as if it is re-executing the query every time, even if the MIN_LAT, MAX_LAT, .. parameters are unchanged.
What's the best way of resolving this? I'm new to Tableau so sorry if I'm missing something super obvious!
Thanks.
The best way to resolve this is to query less information at a time (1 GB is too much; an extract can help by grouping data so that dimensions render very fast, but that's it - if there is nothing to group, it will still be very large), which allows a drill-down to present more information in subsequent steps or dashboard levels.
I am thinking of a field in the database that indicates the zoom level at which the information should be presented.
If you are navigating Google Maps, first you see the countries, then the capital cities, then the cities, then the small towns, then the local stores...
The key is on the zoom level you are on each time.
See the Tableau documentation on drill-downs.
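A minimal sketch of what such a zoom-dependent query might look like (region_code is a hypothetical column marking the coarser aggregation level; the parameters are the ones from the question):
-- Coarse zoom level: return one aggregated point per region instead of up to 10000 raw rows.
SELECT region_code,
       COUNT(*) AS point_count,
       AVG(lat) AS lat,
       AVG(lng) AS lng
FROM my_table
WHERE lat BETWEEN <Parameters.MIN_LAT> AND <Parameters.MAX_LAT>
  AND lng BETWEEN <Parameters.MIN_LNG> AND <Parameters.MAX_LNG>
GROUP BY region_code
Only when the user zooms in past some threshold would the dashboard switch (drill down) to the row-level query from the question.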

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space of my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (but not running) the query below, changing <column_name> to the field of interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important - you do not need to run it; just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as you already know, is the size of the respective column.
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying e.g. the first 1000 values, and use this for your storage calculations.
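For example, a quick approximation for one string column (the column and table names are placeholders):
-- Average string length over a sample of 1000 values; multiply by row_count for a rough size estimate.
SELECT AVG(LENGTH(my_string_column)) AS avg_length
FROM (
  SELECT my_string_column
  FROM `your-project.your_dataset.your_table`
  LIMIT 1000
)
Note that LIMIT does not reduce the bytes scanned in BigQuery, so this is a convenience for estimation rather than a way to save query cost.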

How much storage capacity does my dataset or table consume?

I have multiple datasets, each with hundreds of tables, in Google BigQuery. I'd like to remove some old, legacy data, and I am looking for the most convenient way to know how much storage space each of my datasets and tables is occupying, so I can make an educated decision on which datasets/tables I may remove.
I tried to use bq command-line tool but couldn't find a way to display table storage and entire dataset storage related information.
You can access metadata about the tables in a dataset by using the __TABLES__ meta-table. For example:
select * from [publicdata:samples.__TABLES__]
returns
project_id dataset_id table_id creation_time last_modified_time row_count size_bytes type
publicdata samples github_nested 1348782587310 1348782587310 2541639 1694950811 1
publicdata samples github_timeline 1335915950690 1335915950690 6219749 3801936185 1
publicdata samples gsod 1335916040125 1440625349328 14420316 17290009238 1
publicdata samples natality 1335916045005 1440625330604 37826763 23562717384 1
publicdata samples shakespeare 1335916045099 1440625429551 164656 6432064 1
publicdata samples trigrams 1335916127449 1445684180324 68051509 277168458677 1
publicdata samples wikipedia 1335916132870 1445689914564 13797035 38324173849 1
More documentation here: https://cloud.google.com/bigquery/querying-data
Below is an example of how to combine the use of metadata (as in the answer by @Moshapasumansky) with visualization (as in the recommendation by @DoITInternational), all without leaving the BigQuery Web UI, but you will need the BigQuery Mate Chrome extension.
Assuming you have the extension, follow the steps below:
Step 1 - Run Query against tables metadata in publicdata:samples dataset
SELECT
table_id,
DATE(TIMESTAMP(creation_time/1000)) AS Created,
DATE(TIMESTAMP(last_modified_time/1000)) AS Modified,
row_count AS Rows,
ROUND(size_bytes/POW(1024, 3)) AS GB
FROM [publicdata:samples.__TABLES__]
Step 2 - Move to JSON View
Step 3 - Expand Result Panel by Clicking on + Button
This is for two reasons:
To bring up to 500 records at a time into the result panel (which should cover your case, as you mentioned you have hundreds of tables), vs. the relatively limited number of rows at a time currently supported by the native UI
To release more real estate for chart
Step 4 - Close the Query Editor (optional) – more real estate for the chart
Step 5 - Click Show Pivot to bring up the Pivot/Chart tool with the data from the result, and then design your pivot chart the way you like.
It might not be the best way, but at least it allows you to do what you want here without leaving the Web UI. In some cases it can be the preferred option, I think.
Rather than using the BigQuery API (the Tables: get method specifically) and looking at numBytes in the response, I can suggest using BQdu, the BigQuery Disk Usage web application. It will scan your project for datasets and tables and display a nice visualization showing how much storage each table (or entire dataset) is consuming.