Dynamically including all available custom dimensions in BigQuery select statement

We are using a query similar to the one below for a report:
SELECT
visitId AS visitId,
hits.hitNumber AS hits_hitNumber,
hits.time AS hits_time,
hits.page.pagePath AS hits_page_pagePath,
-- hit scope custom dimensions
(SELECT value from hits.customDimensions where index=1) AS CD1,
(SELECT value from hits.customDimensions where index=2) AS CD2,
-- user and session level custom dimensions
(SELECT value from sessions.customDimensions where index=3) AS CD3
FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) as hits
ORDER BY visitId, hits_hitNumber
LIMIT 50
The query uses un-nesting to flatten some of the custom dimensions. However, the index values are hard-coded in the query, so every time a new custom dimension is defined, the query needs to be updated. Is it possible to use a subquery to select all available distinct index values and add them to the query dynamically?
EDIT:
The following queries provide the distinct index values. Is there a way to link them into the first query?
(hit scope )
SELECT
DISTINCT cds.index as hit_cd_index
FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) as hits, UNNEST(hits.customDimensions) as cds
ORDER BY hit_cd_index
(user and session scope )
SELECT
DISTINCT session_cds.index as session_cd_index
FROM `ga_sessions_20191031`, UNNEST(customDimensions) as session_cds
ORDER BY session_cd_index asc

The most robust solution would be to add a table to your BigQuery dataset containing data from the Management API, so you'll be able to construct your select based on values from the most recent custom dimensions list: https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/customDimensions/list
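Alternatively, BigQuery scripting can build the column list at run time from the data itself, without the Management API. A minimal sketch, assuming the same ga_sessions_20191031 table (EXECUTE IMMEDIATE and FORMAT are standard BigQuery scripting features; the script simply splices one subselect per distinct hit-scope index into the report query):
DECLARE cd_cols STRING;
-- Build one "(SELECT value ...) AS CDn" expression per distinct hit-level index.
SET cd_cols = (
  SELECT STRING_AGG(FORMAT(
           "(SELECT value FROM hits.customDimensions WHERE index=%d) AS CD%d",
           index, index), ", " ORDER BY index)
  FROM (
    SELECT DISTINCT cds.index
    FROM `ga_sessions_20191031`, UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS cds
  )
);
-- Splice the generated expressions into the report query and run it.
EXECUTE IMMEDIATE FORMAT("""
SELECT visitId, hits.hitNumber AS hits_hitNumber, %s
FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) AS hits
ORDER BY visitId, hits_hitNumber
LIMIT 50
""", cd_cols);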

Related

Why does SQLite return the wrong value from a subquery?

Given a schema and data in SQLite 3.7.17 (I'm stuck with this version):
CREATE TABLE reservations (id INTEGER NOT NULL PRIMARY KEY,NodeID INTEGER,ifIndex INTEGER,dt TEXT,action TEXT,user TEXT,p TEXT);
INSERT INTO "reservations" VALUES(1,584,436211200,'2022-03-12 10:10:00','R','s','x');
INSERT INTO "reservations" VALUES(2,584,436211200,'2022-03-12 10:10:01','R','s','x');
INSERT INTO "reservations" VALUES(3,584,436211200,'2022-03-12 10:10:05','U','s','x');
INSERT INTO "reservations" VALUES(4,584,436211200,'2022-03-12 10:09:01','R','s','x');
I'm trying to get the most recent action for each pair of (NodeID,ifIndex).
Running SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex; I get:
MAX(dt)|action
2022-03-12 10:10:05|U
Perfect.
Now I want to select just the action from this query (dropping the MAX(dt)): SELECT t.action FROM (SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex) AS t; and I get:
t.action
R
This I don't understand. Also: SELECT t.* FROM (SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex) AS t;
MAX(dt)|action
2022-03-12 10:10:05|U
gives the correct value. So why does the query not seem to be querying against the subquery?
Perhaps it's a bug in this version of SQLite, as SQLFiddle works fine (http://sqlfiddle.com/#!7/f7619a/4)
In an attempt to work around this issue I use this query: SELECT t2.action FROM (SELECT MAX(dt),* FROM reservations GROUP BY NodeId,ifIndex) AS t1 INNER JOIN reservations AS t2 ON t1.id = t2.id which seems to work:
action
U
You are right, this seems to be a bug in your SQLite version.
To get into more detail, you are using SQLite's GROUP BY extension "Bare columns in an aggregate query".
In standard SQL and almost all RDBMS your query
SELECT MAX(dt), action FROM reservations GROUP BY NodeId, ifIndex;
is invalid. Why is that? You group by NodeId and ifIndex, thus aggregating your data down to one result row per NodeId and ifIndex. In each such row you want to show the group's maximum date and the group's action. But while there is one maximum date for a group, there is no one action for it, but several. Your query is considered invalid in standard SQL, because you don't tell the DBMS which of the group's actions you want to see. This could be the minimum action for example (i.e. the first in alphabetical order). That means there must be an aggregation function invoked on that column.
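For illustration, a standard-conforming variant would aggregate the action column too, e.g. taking the alphabetically first action per group (not necessarily the one belonging to the maximum dt):
SELECT MAX(dt), MIN(action)
FROM reservations
GROUP BY NodeId, ifIndex;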
Not so in SQLite. When SQLite finds a "bare column" in a GROUP BY query that is meant to find the MAX or MIN of a column, it takes the bare column's value from the row where that minimum or maximum is found. This is an extension to the SQL standard, and SQLite is the only DBMS I know of to feature it. You can read about this in the SQLite docs: search for "Bare columns in an aggregate query" in https://www.sqlite.org/lang_select.html#resultset.
SELECT MAX(dt), action FROM reservations GROUP BY NodeId, ifIndex;
hence finds the action in the row with the maximum dt. If you selected MIN(dt) instead, it would get you the action of the row with the minimum dt.
And of course a query selecting from a subquery result should still get the same value. It seems, however, that in your version SQLite gets confused with its bare column detection. It doesn't throw an error telling you it doesn't know which action to select, but it doesn't select the maximum dt's action either. Obviously a bug.
In standard SQL (and almost any RDBMS) your original query would be written like this:
SELECT dt, action
FROM reservations r
WHERE dt =
(
SELECT MAX(dt)
FROM reservations mr
WHERE mr.NodeId = r.NodeId AND mr.ifIndex = r.ifIndex
);
or like this:
SELECT dt, action
FROM reservations r
WHERE NOT EXISTS
(
SELECT NULL
FROM reservations gr
WHERE gr.NodeId = r.NodeId
AND gr.ifIndex = r.ifIndex
AND gr.dt > r.dt
);
or like this:
SELECT dt, action
FROM
(
SELECT dt, action, MAX(dt) OVER (PARTITION BY NodeId, ifIndex) AS max_dt
FROM reservations
) with_max_dt
WHERE dt = max_dt;
And there are still other ways to get the top row(s) per group.
In any of these proper SQL queries, you can remove dt from the select list and still get the maximum dt's action.
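For example, if window functions are available (SQLite 3.25+ or most other DBMS), a ROW_NUMBER() variant numbers each group's rows newest-first and keeps only the first one:
SELECT dt, action
FROM
(
SELECT dt, action,
ROW_NUMBER() OVER (PARTITION BY NodeId, ifIndex ORDER BY dt DESC) AS rn
FROM reservations
) ranked
WHERE rn = 1;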

Execute Subquery refactoring first before any other SQL

I have a very complex view which is of the below form:
create or replace view loan_vw as
select * from (
  with loan_info as (
    select loan_table.*, commission_table.*
    from loan_table, commission_table
    where contract_id = commission_id
  )
  select /*complex transformations */ from loan_info
  where type <> 'PRINCIPAL'
  union all
  select /*complex transformations */ from loan_info
  where type = 'PRINCIPAL'
)
Now if I run the below select, the query hangs:
select * from loan_vw where contract_id='HA001234TY56';
But if I hard-code the contract_id inside the subquery factoring clause, or use a package-level variable in the same session, the query returns in a second:
create or replace view loan_vw as
select * from (
  with loan_info as (
    select loan_table.*, commission_table.*
    from loan_table, commission_table
    where contract_id = commission_id
    and contract_id = 'HA001234TY56'
  )
  select /*complex transformations */ from loan_info
  where type <> 'PRINCIPAL'
  union all
  select /*complex transformations */ from loan_info
  where type = 'PRINCIPAL'
)
Since I use Business Objects, I cannot use a package-level variable.
So my question: is there a hint in Oracle to tell the optimizer to first apply the contract_id filter inside the subquery factoring clause of loan_vw?
As requested, the analytic function used is the one below:
select value_date, item, credit_entry, item_paid
from (
select value_date, item, credit_entry, debit_entry,
greatest(0, least(credit_entry, nvl(sum(debit_entry) over (), 0)
- nvl(sum(credit_entry) over (order by value_date
rows between unbounded preceding and 1 preceding), 0))) as item_paid
from your_table
)
where item is not null;
After following the advice given by Boneist and MarcinJ, I removed the subquery factoring (CTE) and wrote one long query like the one below, which improved the performance from 3 minutes to 0.156 seconds:
create or replace view loan_vw as
select /*complex transformations */
from loan_table, commission_table
where contract_id = commission_id
and loan_table.type <> 'PRINCIPAL'
union all
select /*complex transformations */
from loan_table, commission_table
where contract_id = commission_id
and loan_table.type = 'PRINCIPAL'
Are these transformations really so complex that you have to use UNION ALL? It's really hard to optimize something you can't see, but have you maybe tried getting rid of the CTE and implementing your calculations inline?
CREATE OR REPLACE VIEW loan_vw AS
SELECT loan.contract_id
, CASE commission.type -- or wherever this comes from
WHEN 'PRINCIPAL'
THEN SUM(whatever) OVER (PARTITION BY loan.contract_id, loan.type) -- total_whatever
ELSE SUM(something_else) OVER (PARTITION BY loan.contract_id, loan.type) -- total_something_else
END AS whatever_something
FROM loan_table loan
INNER
JOIN commission_table commission
ON loan.contract_id = commission.commission_id
Note that if your analytic functions don't have PARTITION BY contract_id you won't be able to use an index on that contract_id column at all.
Take a look at this db fiddle (you'll have to click on ... on the last result table to expand the results). Here, the loan table has an indexed (PK) contract_id column, but also a some_other_id column that is also unique but not indexed, and the predicate on the outer query is still on contract_id. If you compare the plans for partition by contract and partition by other id, you'll see that the index is not used at all in the partition by other id plan: there's a TABLE ACCESS with FULL options on the loan table, as compared to INDEX - UNIQUE SCAN in partition by contract. That's obviously because the optimizer cannot resolve the relation between contract_id and some_other_id on its own, and so it needs to run SUM or AVG over the entire window instead of limiting window row counts through index usage.
What you can also try - if you have a dimension table with those contracts - is to join it to your results and expose the contract_id from the dimension table instead of the most likely huge loan fact table. Sometimes this can lead to an improvement in cardinality estimates through the usage of a unique index on the dimension table.
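A sketch of that approach, where contract_dim stands in for a hypothetical dimension table with a unique index on contract_id:
SELECT d.contract_id, v.*
FROM contract_dim d
INNER JOIN loan_vw v
   ON v.contract_id = d.contract_id
WHERE d.contract_id = 'HA001234TY56';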
Again, it's really hard to optimize a black box, without a query or even a plan, so we don't know what's going on. CTE or a subquery can get materialized unnecessarily for example.
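If materialization of the CTE is indeed what blocks predicate pushing here, one thing worth trying, as a sketch only, is Oracle's undocumented but widely used INLINE hint, which asks the optimizer to merge the WITH subquery instead of materializing it:
create or replace view loan_vw as
select * from (
  with loan_info as (
    select /*+ INLINE */ loan_table.*, commission_table.*
    from loan_table, commission_table
    where contract_id = commission_id
  )
  select /*complex transformations */ from loan_info
  where type <> 'PRINCIPAL'
  union all
  select /*complex transformations */ from loan_info
  where type = 'PRINCIPAL'
)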
Thanks for the update to include an example of the column list.
Given your updated query, I would suggest changing your view (or possibly creating a second view for querying single contract_ids, if your original view could be used to query for multiple contract_ids - unless, of course, the results of the original view only make sense for individual contract_ids!) to something like:
CREATE OR REPLACE VIEW loan_vw AS
WITH loan_info AS (SELECT l.*, c.* -- for future-proofing, you should list the column names explicitly; if this statement is rerun and there's a column with the same name in both tables, it'll fail.
FROM loan_table l
INNER JOIN commission_table c ON l.contract_id = c.commission_id -- you should always alias the join condition columns for ease of maintenance.
)
SELECT value_date,
item,
credit_entry,
debit_entry,
GREATEST(0,
LEAST(credit_entry,
NVL(SUM(debit_entry) OVER (PARTITION BY contract_id), 0)
- NVL(SUM(credit_entry) OVER (PARTITION BY contract_id ORDER BY value_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0))) AS item_paid
FROM loan_info
WHERE TYPE <> 'PRINCIPAL'
UNION ALL
SELECT ...
FROM loan_info
WHERE TYPE = 'PRINCIPAL';
Note that I've converted your join into ANSI syntax, because it's easier to understand than the old style joins (easier to separate join conditions from predicates, for a start!).

Postgres - aggregate information once and add multiple properties as columns

I have a table called project and a view called downtime_report_overview. The downtime_report_overview consists of the table downtimeReport (id, startTime, stopTime, downTimeCauseId, employeeId, ...) and the joined downtimeCause.name.
Thanks to Gorden's reply (postgres - select one specfic row of another table and store it as column), I am able to include an active downtime (stopTime = null) as a column in the project query, via an array aggregate with a filter. Since I might need to add more properties to the downtime_report_overview (e.g. metadata like username) in the near future, I was wondering if there is a way to extract the correct downtimeReport row only once.
In the example below I am using the array aggregation three times: once each for id, startTime, and causeName. On the one hand this seems verbose, and on the other I'm not even certain that it will select the correct downtime row for all three columns.
SELECT
COUNT(downtime_report_overview."downtimeReportId") AS "downtimeReportsTotalCount",
FLOOR(date_part('epoch'::text, sum(downtime_report_overview."stopTime" - downtime_report_overview."startTime")))::integer AS "downtimeReportsTotalDurationInSeconds",
(array_agg(downtime_report_overview."downtimeReportId" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportId",
(array_agg(downtime_report_overview."startTime" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportStartTime",
(array_agg(downtime_report_overview."downtimeCauseName" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportCauseName"
...
There are several ways to approach this. Obviously, you can write a separate expression for each column. Or, you can play around with manipulating an entire row as a record.
In this case, perhaps the simplest approach is to separate the aggregation and getting the row of interest. Based on the original question, the code would look like:
SELECT p.*, tt.*
FROM (SELECT p."projectId",
             count(t."timeTrackId") as "timeTracksTotalCount",
             floor(date_part('epoch'::text, sum(t."stopTime" - t."startTime")))::integer AS "timeTracksTotalDurationInSeconds"
      FROM project p LEFT JOIN
           time_track t
           ON t."fkProjectId" = p."projectId"
      GROUP BY p."projectId"
     ) p LEFT JOIN
     (SELECT DISTINCT ON (tt."fkProjectId") tt.*
      FROM time_track tt
      WHERE tt."stopTime" IS NULL
      ORDER BY tt."fkProjectId", tt."startTime" DESC
     ) tt
     ON tt."fkProjectId" = p."projectId";
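An alternative that fetches the active row only once, under the same assumed project/time_track schema, is a LATERAL join, which picks the newest open time_track row per project directly:
SELECT p.*, active.*
FROM project p LEFT JOIN LATERAL
     (SELECT tt.*
      FROM time_track tt
      WHERE tt."fkProjectId" = p."projectId"
        AND tt."stopTime" IS NULL
      ORDER BY tt."startTime" DESC
      LIMIT 1
     ) active
     ON true;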

Migrating from Legacy SQL: options for "WITHIN RECORD" with Standard SQL

I am trying to migrate to Standard SQL from BigQuery Legacy SQL. The Legacy product offered the ability to query "WITHIN RECORD" which came in handy on numerous occasions.
I am looking for an efficient alternative to WITHIN RECORD. I could always just use a few subqueries and join them, but I am wondering if there may be a more efficient way using ARRAY + ORDINAL.
EXAMPLE: Consider the following Standard SQL
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber, h.page.pagePath,
# This would previously use WITHIN RECORD in Legacy SQL:
ARRAY( SELECT eventInfo.eventAction FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay"
ORDER BY hitNumber DESC
)[ORDINAL(1)] AS lastVideoSeen
FROM
`proj.ga_sessions`, UNNEST(hits) as h
GROUP BY fullVisitorId, visitNumber, h.page.pagePath, lastVideoSeen
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM
(SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays
GROUP BY
pagePath, lastVideoSeen
)
Resulting output: (screenshot of the results table omitted)
Questions:
1) I would like to see the last video play event on a given page, which is what I used to accomplish using WITHIN RECORD but am attempting the ARRAY + ORDINAL approach shown above. However, for this to work, I'm thinking the SELECT statement within ARRAY() must get synchronized to the outer record since it is now flattened? Is that accurate?
2) I would also like to get a COUNT of DISTINCT videos played on a given page, and am wondering if a more efficient approach would be joining to a separate query OR inserting another inline aggregate function, like done with ARRAY above.
Any suggestions would be appreciated.
1) I would like to see the last video play event on a given page,
which is what I used to accomplish using WITHIN RECORD but am attempting
the ARRAY + ORDINAL approach shown above. However, for this to work,
I'm thinking the SELECT statement within ARRAY() must get synchronized
to the outer record since it is now flattened? Is that accurate?
I think that is correct. With your query, the UNNEST(hits) in the inner query would be independent from the outer UNNEST, which is probably not what you wanted.
I think maybe one way to write it is this:
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber,
ARRAY(
SELECT AS STRUCT pagePath, lastVideoSeen FROM (
SELECT
page.pagePath,
eventInfo.eventAction AS lastVideoSeen,
ROW_NUMBER() OVER (PARTITION BY page.pagePath ORDER BY hitNumber DESC) AS rank
FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay")
WHERE rank = 1
) AS lastVideoSeenOnPage
FROM
`proj.ga_sessions`
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM (
SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays, UNNEST(lastVideoSeenOnPage)
GROUP BY
pagePath, lastVideoSeen
)
2) I would also like to get a COUNT of DISTINCT videos played on a given
page, and am wondering if a more efficient approach would be joining to a
separate query OR inserting another inline aggregate function, like
done with ARRAY above.
I think both are OK, but inserting another inline aggregate function would evaluate them closer together, so it might be a bit easier for the query engine to optimize if there is an opportunity.
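For reference, a minimal sketch of the inline variant, assuming the same schema as above: a correlated aggregate over the hits array sits next to the ARRAY expression in the same SELECT:
SELECT fullVisitorId, visitNumber,
  (SELECT COUNT(DISTINCT eventInfo.eventAction)
   FROM UNNEST(hits)
   WHERE eventInfo.eventCategory="videoPlay") AS distinctVideosPlayed
FROM
  `proj.ga_sessions`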

Bug or new behavior in BigQuery?

Since two days ago (August 10th 2016), a query which used to work (using tables of the BQ Export for Google Analytics Premium) has stopped working. It returns the following error:
Error: Cannot union tables : Incompatible types.
'hits.latencyTracking.userTimingVariable' : TYPE_INT64
'hits.latencyTracking.userTimingVariable' : TYPE_STRING
After some investigation, it seems to be a problem with using IN in a WHERE clause when I query tables from before and after August 10th (table ga_sessions_20160810).
I've simplified my original query to provide a dummy one which has the same basic structure. The following query works (querying data from 2016-08-08 and 2016-08-09):
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09'))
)
GROUP BY fullVisitorId
But this other one (just changing the dates, in this case to 2016-08-09 and 2016-08-10) returns the error:
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
)
GROUP BY fullVisitorId
This last query works fine either if I delete the WHERE clause or if I just try the query within the IN, so I guess the problem is with the structure WHERE field IN(...). Furthermore, querying only data from 2016-08-10 does work. Also, the same happens using a field different to fullVisitorId and running the same queries in different BQ projects.
Looking at the error description, it should be a problem with variable types, but I don't know what hits.latencyTracking.userTimingVariable is. My query used to work properly, so I can't figure out what has changed to produce the error. Have some fields changed their type, or what happened?
Has anyone experienced this? Is this a bug or a new behavior in BigQuery? How can this error be solved?
As you are using * in the select clause, it might be causing a problem when the union happens: it is trying to combine two different column types (as the schema changed from INT64 to STRING).
I have two approaches:
1) Use only the fields you require, rather than * in the select clause:
SELECT fullVisitorId, sum(totals.visits)
FROM (select fullVisitorId,totals.visits from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
) GROUP BY fullVisitorId
2) Use views to split out the inner queries, and use the views later in the query (even in the views you need to use only those fields which are required):
SELECT fullVisitorId, sum(totals.visits)
FROM [view.innertable2]
WHERE fullVisitorId in(
SELECT fullVisitorId from [view.innertable1] ) GROUP BY fullVisitorId
This will exclude the hits.latencyTracking.userTimingVariable so there will be no error.
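For example, view.innertable1 (the name is just a placeholder) would be saved with only the one field that is needed:
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))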
If the fields that you are querying are compatible, you may try using Standard SQL wildcard tables (you'll have to uncheck the "Use Legacy SQL" box if you are doing this from the UI). Something like this:
SELECT fullVisitorId, sum(totals.visits)
FROM `xxxxxxxx.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160808' and '20160810'
GROUP BY fullVisitorId;