How to build a "Top Conversion Paths" report in BigQuery - SQL

I have a problem building a report like "Top Conversion Paths" in Google Analytics. Any ideas how I can create this?
I found something like this, but it doesn't work (https://lastclick.city/top-conversion-paths-in-ga-and-bigquery.html):
SELECT
  REGEXP_REPLACE(touchpointPath, 'Conversion >.*', 'Conversion') AS touchpointPath,
  COUNT(touchpointPath) AS TOP
FROM (
  SELECT
    GROUP_CONCAT(touchpoint, ' > ') AS touchpointPath
  FROM (
    SELECT *
    FROM (
      SELECT
        fullVisitorId,
        'Conversion' AS touchpoint,
        (visitStartTime + hits.time) AS timestamp
      FROM
        TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
      WHERE
        hits.eventInfo.eventAction = "Email Submission success"),
      (SELECT
        fullVisitorId,
        CONCAT(trafficSource.source, '/', trafficSource.medium) AS touchpoint,
        (visitStartTime + hits.time) AS timestamp
      FROM
        TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
      WHERE
        hits.hitNumber = 1)
    ORDER BY
      timestamp)
  GROUP BY
    fullVisitorId
  HAVING
    touchpointPath LIKE '%Conversion%')
GROUP BY
  touchpointPath
ORDER BY
  TOP DESC

It doesn't work because you have to modify the query to your needs.
This line needs to be changed to match your specific event action:
hits.eventInfo.eventAction="YOUR EVENT ACTION HERE")
The table reference and the dates need to be changed too:
TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
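For example, with a hypothetical GA export dataset named 123456789 (the GA export names the dataset after the view ID) and a January 2019 window, that line would become:
TABLE_DATE_RANGE([123456789.ga_sessions_], TIMESTAMP('2019-01-01'), TIMESTAMP('2019-01-31'))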

The shared article links to documentation on the FLATTEN function in BigQuery Legacy SQL.
As far as I know, queries in the new BigQuery UI run as Standard SQL by default; however, you can set the SQL variant by including a prefix in your query in the web UI, in a REST API call, or when using the Cloud Client Library.
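For example, a one-line prefix on the query itself selects the dialect in the web UI (use #standardSQL to force the standard dialect instead):
#legacySQL
SELECT COUNT(*) FROM [publicdata:samples.wikipedia]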

Related

Query Snowflake Jobs [duplicate]

Is there any way within Snowflake/SQL to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users and I'm not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
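A table-level variant of the same query (one fewer FLATTEN) surfaces the most-queried tables:
select obj.value:objectName::string objName
, count(*) uses
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
group by 1
order by uses desc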
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution would be to combine the queries from the query_history() with the SYSTEM$EXPLAIN_PLAN_JSON outside (to make the strings constant), and then you will be able to find out the most queried tables.
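One hedged way to do that outer combination: use SQL to generate the EXPLAIN statements as constant strings, then run the generated statements in a second pass. A sketch (the REPLACE-based quote escaping is naive and may not survive every query text):
select 'SELECT "objects" FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('''
       || replace(query_text, '''', '''''')
       || '''))) WHERE "operation" = ''TableScan'';' as explain_stmt
from table(information_schema.query_history())
where query_type = 'SELECT';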

Google BigQuery: Trying to run a TABLE_DATE_RANGE

I am building a partition-based set of tables in a dataset and I am trying to query those partitions using a date range.
Here is an example of the data:
Dataset:
logs
Tables:
logs_20170501
logs_20170502
logs_20170503
First I tried TABLE_DATE_RANGE:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
I keep getting: "ERROR: Error evaluating subsidiary query"
I tried these options as well.
Single quotes:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
Adding the project ID:
SELECT count(*) FROM TABLE_DATE_RANGE([main_sys_logs:logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
Neither worked.
So I tried to use _TABLE_SUFFIX:
SELECT
count(*)
FROM [main_sys_logs:logs.logs_*]
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'
And I got this error:
Invalid table name:'main_sys_logs:logs.logs_*
I have been switching the SQL dialect between legacy SQL on/off and just got different errors on the table name part.
Are there any tips or help for this matter?
Maybe my table name is built wrong with the "_" at the end and this is causing the problem? Thanks for any help.
Then I tried this query and it worked:
SELECT count(*) FROM TABLE_DATE_RANGE(logs.logs_,
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
It started to work after I ran this query; I don't know if that is the reason, but I had just queried the __TABLES__ metadata for the dataset:
SELECT *
FROM logs.__TABLES__
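For what it's worth, the _TABLE_SUFFIX attempt most likely failed because it mixes the legacy [project:dataset.table] syntax with standard SQL: in standard SQL a wildcard table needs backticks and a dot between project and dataset, e.g.:
#standardSQL
SELECT COUNT(*) AS logs_count
FROM `main_sys_logs.logs.logs_*`
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'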

GitHub Archive Google Big Query repositories language information for 2015

I have a problem retrieving language information from GitHub Archive on Google BigQuery, since the structure of the tables changed at the beginning of 2015.
When querying github_timeline table I have a field named repository_language. It allows me to get my language statistics.
Unfortunately, the structure changed for 2015 and the table doesn't contain any events after 2014.
For example the following query doesn't return any data:
select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')
Events for 2015 are in the githubarchive:month & githubarchive:day tables. None of them has language information though (or at least a repository_language column).
Can anyone help me?
Look at the payload field.
It is a string that, I think, actually holds JSON with all the "missing" attributes.
You can process it using JSON functions.
Added query. Try it as below:
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
What Mikhail said + you can use a query like this:
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC

NTH(n, SPLIT()) in BigQuery

I am running the following query and keep getting the error message:
SELECT NTH(2,split(Web_Address_,'.')) +'.'+NTH(3,split(Web_Address_,'.')) as D , Web_Address_
FROM [Domains.domain]
LIMIT 10
Error message:
Error: (L1:110): (L1:119): SELECT clause has mix of aggregations 'D' and fields 'Web_Address_' without GROUP BY clause
Job ID: symmetric-aura-572:job_axsxEyfYpXbe2gpmlYzH6bKGdtI
I tried to use a GROUP BY clause on field D and/or Web_Address_, but am still getting errors about GROUP BY.
Does anyone know why this is the case? I have had success with similar query before.
You probably want to use WITHIN RECORD aggregation here, not GROUP BY
SELECT CONCAT(p1, '.', p2), Web_Address_ FROM
  (SELECT
    NTH(2, SPLIT(Web_Address_, '.')) WITHIN RECORD p1,
    NTH(3, SPLIT(Web_Address_, '.')) WITHIN RECORD p2,
    Web_Address_
  FROM (SELECT 'a.b.c' AS Web_Address_))
P.S. If you are just trying to cut off the first part of the web address, it is easier to do with the RIGHT and INSTR functions.
You can also consider using URL functions: HOST, DOMAIN and TLD
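A minimal sketch of the URL-functions approach in legacy SQL (the expected outputs in the comments assume the documented behavior of these functions):
SELECT
  HOST('http://www.example.co.uk/path') AS host,      -- www.example.co.uk
  DOMAIN('http://www.example.co.uk/path') AS domain,  -- example.co.uk
  TLD('http://www.example.co.uk/path') AS tld         -- .co.uk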

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK, but when running on a larger data set, I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title editing by different contributors. The WHERE condition is just there to limit the response size: with r'^[A]' we get results, but adding the "B" brings back the "use EACH" recommendation.
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM (
  SELECT
    title,
    contributor_id,
    LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A,B]') = true)
GROUP BY title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as these operations are not parallelizable. As the answer for r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we could use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
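A hedged sketch of that workaround: run the OVER() once per initial letter and stitch the shards back together with legacy SQL's comma (which is UNION ALL, not a join) before the final aggregation. Since each title starts with exactly one letter, no PARTITION BY title window is split across shards:
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM
  -- each subquery gives OVER() only a fraction of the data
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^[A]')),
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^[B]'))
GROUP BY title
-- if the outer GROUP BY still exceeds resources, wrap it in another
-- SELECT and switch to GROUP EACH BY, as noted above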
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)