I am trying to write a BigQuery UDF to compare a list of strings with other lists of strings.
Basically, I would like to know how many new users we have per week and how many of these new users kept visiting our website in the following weeks. For that I created a query which gives me a string of all emails per week (with GROUP_CONCAT) and saved it as a table. Now I need to know how I can compare each of these with the other collections of emails per week.
In the end, I would like to have a table like this:
+--------+--------+--------+--------+--------+-----+
|        | week 1 | week 2 | week 3 | week 4 | ... |
+--------+--------+--------+--------+--------+-----+
| week1  |     17 |      7 |      5 |      9 | ... |
+--------+--------+--------+--------+--------+-----+
| week2  |        |     19 |     13 |      8 | ... |
+--------+--------+--------+--------+--------+-----+
| week3  |        |        |     24 |     15 | ... |
+--------+--------+--------+--------+--------+-----+
Just to give you an idea to play with:
SELECT
  CONCAT('week', STRING(prev)) AS WEEK,
  SUM(IF(next=19, authors, 0)) AS week19,
  SUM(IF(next=20, authors, 0)) AS week20,
  SUM(IF(next=21, authors, 0)) AS week21,
  SUM(IF(next=22, authors, 0)) AS week22,
  SUM(IF(next=23, authors, 0)) AS week23
FROM (
  SELECT prev, next, COUNT(author) AS authors
  FROM (
    SELECT
      prev_week.week_created AS prev,
      next_week.week_created AS next,
      prev_week.author AS author
    FROM (
      SELECT
        WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created,
        author
      FROM [fh-bigquery:reddit_posts.2016_05]
      GROUP BY 1,2
    ) next_week
    LEFT JOIN (
      SELECT
        WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created,
        author
      FROM [fh-bigquery:reddit_posts.2016_05]
      GROUP BY 1,2
    ) AS prev_week
    ON prev_week.author = next_week.author
    HAVING prev <= next
  )
  GROUP BY 1,2
)
GROUP BY 1
ORDER BY 1
This is the closest to what you asked that I can think of.
In the meantime, please note that BigQuery is tailored less for report design and more for data crunching. So I think that building the matrix/pivot within BigQuery (the outer SELECT) is not the best fit; that part can be done in your reporting tool. But calculating all the prev|next|count pairs (the inner SELECT) is definitely a good fit for BigQuery.
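For reference, the piece I would keep in BigQuery and feed to the reporting tool is just that inner pairs query on its own (unchanged from the full example above, same public reddit table):
SELECT prev, next, COUNT(author) AS authors  -- one row per (prev week, next week) pair
FROM (
  SELECT
    prev_week.week_created AS prev,
    next_week.week_created AS next,
    prev_week.author AS author
  FROM (
    SELECT WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created, author
    FROM [fh-bigquery:reddit_posts.2016_05]
    GROUP BY 1,2
  ) next_week
  LEFT JOIN (
    SELECT WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created, author
    FROM [fh-bigquery:reddit_posts.2016_05]
    GROUP BY 1,2
  ) AS prev_week
  ON prev_week.author = next_week.author
  HAVING prev <= next
)
GROUP BY 1,2
Each output row is one prev/next week pair with the number of authors active in both, which a pivot in your reporting tool can turn into the matrix shown above.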
You may be aware of rolling up the results of an aggregate over a specific number of preceding rows, e.g. how many hot dogs did I eat over the last 7 days:
SELECT HotDogCount,
       DateKey,
       SUM(HotDogCount) OVER (ORDER BY DateKey ROWS 6 PRECEDING) AS HotDogsLast7Days
FROM dbo.HotDogConsumption
Results:
+-------------+------------+------------------+
| HotDogCount | DateKey    | HotDogsLast7Days |
+-------------+------------+------------------+
|           3 | 09/21/2020 |                3 |
|           2 | 09/22/2020 |                5 |
|           1 | 09/23/2020 |                6 |
|           1 | 09/24/2020 |                7 |
|           1 | 09/25/2020 |                8 |
|           4 | 09/26/2020 |               12 |
|           1 | 09/27/2020 |               13 |
|           3 | 09/28/2020 |               13 |
|           2 | 09/29/2020 |               13 |
|           1 | 09/30/2020 |               13 |
+-------------+------------+------------------+
Now, the problem I am having is when there are gaps in the dates. So, basically, one day my intestines and circulatory system are screaming at me: "What the heck are you doing, you're going to kill us all!!!" So, I decide to give my body a break for a day and now there is no record for that day. When I use the "ROWS 6 PRECEDING" method, I will now be reaching back 8 days, rather than 7, because one day was missed.
So, the question is, do any of you know how I could use the OVER clause to truly use a date value (something like "DATEADD(day,-7,DateKey)") to determine how many previous rows should be summed up for a true 7 day rolling sum, regardless of whether I only ate hot dogs on one day or on all 7 days?
Side note: having a record of 0 for the days I didn't eat any hot dogs is not an option. I understand that I could use an array of dates and left join to it and do a
CASE WHEN Datekey IS NULL THEN 0 END
type of deal, but I would like to find out if there is a different way where the number of preceding rows can somehow be determined dynamically based on the date.
Window functions are the right approach in theory. But to look back at the 7 preceding days (not rows), we need a range frame specification with a date offset, which, unfortunately, SQL Server does not support.
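(For comparison only, and not valid T-SQL: in a database that does support date-offset range frames, such as PostgreSQL 11+, the frame would look like the sketch below, assuming a comparable HotDogConsumption table with a date column DateKey.)
-- PostgreSQL 11+ sketch: a frame defined by dates, not row counts
SELECT HotDogCount,
       DateKey,
       SUM(HotDogCount) OVER (
           ORDER BY DateKey
           RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
       ) AS HotDogsLast7Days
FROM HotDogConsumption;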
I am going to recommend a subquery, or a lateral join:
select hdc.*, hdc1.*
from dbo.HotDogConsumption hdc
cross apply (
    select coalesce(sum(HotDogCount), 0) HotDogsLast7Days
    from dbo.HotDogConsumption hdc1
    where hdc1.datekey >= dateadd(day, -7, hdc.datekey)
      and hdc1.datekey < hdc.datekey
) hdc1
You might want to adjust the conditions in the where clause of the subquery to the precise frame that you want. The above code sums over the preceding 7 days, not including today. Something equivalent to your current attempt (the current day plus the 6 preceding days) would be:
where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
and hdc1.datekey <= hdc.datekey
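Spelled out in full, that equivalent variant is just the same query with the adjusted frame:
select hdc.*, hdc1.*
from dbo.HotDogConsumption hdc
cross apply (
    -- current day plus the 6 preceding calendar days = a true 7-day window
    select coalesce(sum(HotDogCount), 0) HotDogsLast7Days
    from dbo.HotDogConsumption hdc1
    where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
      and hdc1.datekey <= hdc.datekey
) hdc1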
I'm kind of old school, but this is how I'd go about it:
SELECT
HDC1.HotDogCount
,HDC1.DateKey
,( SELECT SUM( HDC2.HotDogCount )
FROM HotDogConsumption HDC2
WHERE HDC2.DateKey BETWEEN DATEADD( DD, -6, HDC1.DateKey )
AND HDC1.DateKey ) AS 'HotDogsLast7Days'
FROM
HotDogConsumption HDC1
;
Someone younger might use an OUTER APPLY or something.
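For completeness, an OUTER APPLY version would look something like the sketch below (along the lines of the CROSS APPLY answer above; OUTER APPLY keeps the outer row even if the applied subquery returns no rows):
SELECT hdc.HotDogCount,
       hdc.DateKey,
       COALESCE(x.HotDogsLast7Days, 0) AS HotDogsLast7Days
FROM dbo.HotDogConsumption AS hdc
OUTER APPLY (
    SELECT SUM(hdc2.HotDogCount) AS HotDogsLast7Days
    FROM dbo.HotDogConsumption AS hdc2
    WHERE hdc2.DateKey >= DATEADD(DAY, -6, hdc.DateKey)   -- date-based 7-day window
      AND hdc2.DateKey <= hdc.DateKey
) AS x;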
I have a table that has the following schema:
ID | FirstName | Surname | TransmissionID | CaptureDateTime
1  | Billy     | Goat    | ABCDEF         | 2018-09-20 13:45:01.098
2  | Jonny     | Cash    | ABCDEF         | 2018-09-20 13:45:01.108
3  | Sally     | Sue     | ABCDEF         | 2018-09-20 13:45:01.298
4  | Jermaine  | Cole    | PQRSTU         | 2018-09-20 13:45:01.398
5  | Mike      | Smith   | PQRSTU         | 2018-09-20 13:45:01.498
There are well over 70,000 records, and they store logs of transmissions to a web service. What I'd like to know is how I would go about writing a script that selects the distinct TransmissionID values and also shows the timespan between the earliest CaptureDateTime record and the latest record. Essentially I'd like to see the rate at which the web service is reading and writing records.
Is it even possible to do so in a single SELECT statement or should I just create a stored procedure or report in code? I don't know where to start aside from SELECT DISTINCT TransmissionID for this sort of query.
Here's what I have so far (I'm stuck on the time calculation)
SELECT DISTINCT [TransmissionID],
COUNT(*) as 'Number of records'
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
I'm not sure how to get the difference between the first and last record with the same TransmissionID. I would like to get a result set like:
TransmissionID | TimeToCompletion | Number of records |
ABCDEF         | 2.001            | 5000              |
Simply GROUP BY and use the MIN / MAX functions to find the min/max date in each group, then subtract them:
SELECT
TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime))
FROM yourdata
GROUP BY TransmissionID
HAVING COUNT(*) > 1
Use min and max to calculate timespan
SELECT [TransmissionID],
COUNT(*) as 'Number of records',datediff(s,min(CaptureDateTime),max(CaptureDateTime)) as timespan
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
A method that returns a row for every TransmissionID, even those with only 1 record (for which the average is NULL), and computes the average time between records:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime)) * 1.0 / NULLIF(COUNT(*) - 1, 0)
FROM yourdata
GROUP BY TransmissionID;
Note that you may not actually want the maximum of the capture date for a given transmissionId. You might want the overall maximum in the table -- so you can consider the final period after the most recent record.
If so, this looks like:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second,
MIN(CaptureDateTime),
MAX(MAX(CaptureDateTime)) OVER ()
) * 1.0 / COUNT(*)
FROM yourdata
GROUP BY TransmissionID;
I am looking to calculate cumulative sum across columns in Google Big Query.
Assume there are five columns (NAME,A,B,C,D) with two rows of integers, for example:
NAME | A | B | C | D
-----+---+---+---+---
Bob  | 1 | 2 | 3 | 4
Carl | 5 | 6 | 7 | 8
I am looking for a windowing function or UDF to calculate the cumulative sum across the columns of each row, to generate this output:
NAME | A | B  | C  | D
-----+---+----+----+---
Bob  | 1 | 3  | 6  | 10
Carl | 5 | 11 | 18 | 26
Any thoughts or suggestions greatly appreciated!
I think there are a number of reasonable workarounds for your requirements, mostly in the area of designing your table better. It all really depends on how you input your data and, most importantly, how you then consume it.
Still, staying with the requirements as presented: the query below is not exactly the output you expect in your question, but it might be useful as an example:
SELECT name, GROUP_CONCAT(STRING(cum)) AS all
FROM (
  SELECT name,
    SUM(INTEGER(num)) OVER(
      PARTITION BY name
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cum
  FROM (
    SELECT name, SPLIT(all) AS num
    FROM (
      SELECT name,
        CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d)) AS all
      FROM yourtable
    )
  )
)
GROUP BY name
Output is:
name all
Bob 1,3,6,10
Carl 5,11,18,26
Depending on how you then consume this data, it can still work for you.
Note: you now avoid writing something like col1 + col2 + .. + col89 + col90, but you still need to mention each column explicitly, just once.
In case you have the "luxury" of implementing your requirements outside of the BigQuery UI, in some client, you can use the BigQuery API to programmatically acquire the table schema, build your logic/query on the fly, and then execute it.
Take a look at below APIs to start with:
To get table schema - https://cloud.google.com/bigquery/docs/reference/v2/tables/get
To issue query job - https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
There's no need for a UDF:
SELECT name, a, a + b AS b, a + b + c AS c, a + b + c + d AS d
FROM tab
I have a table car that looks like this:
| mileage | carid |
-------------------
|      30 |     1 |
|      50 |     1 |
|     100 |     1 |
|       0 |     2 |
|      70 |     2 |
I would like to get the average difference for each car. So for example for car 1 I would like to get ((50-30)+(100-50))/2 = 35. So I created the following query
SELECT AVG(diff),carid FROM (
SELECT (mileage-
(SELECT Max(mileage) FROM car Where mileage<mileage AND carid=carid GROUP BY carid))
AS diff,carid
FROM car GROUP BY carid)
But this doesn't work, as I'm not able to reference the current row's values inside the subquery. And I'm quite clueless about how to actually solve this in a different way.
So how would I be able to obtain the value of the next row somehow?
The average difference is the maximum minus the minimum, divided by one less than the count: the consecutive differences telescope, so their sum is just max - min (for car 1, (50-30) + (100-50) = 100 - 30 = 70), and there are count - 1 of them, giving 70 / 2 = 35.
Hence:
select carid,
( (max(mileage) - min(mileage)) / nullif(count(*) - 1, 0)) as avg_diff
from car
group by carid;
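If your database supports window functions, a LAG-based version makes the per-row differences explicit and gives the same result; a sketch, assuming the car table above and ordering the readings by mileage:
-- Difference between each mileage reading and the previous one, then averaged per car
SELECT carid, AVG(diff * 1.0) AS avg_diff   -- * 1.0 avoids integer division in some databases
FROM (
    SELECT carid,
           mileage - LAG(mileage) OVER (PARTITION BY carid ORDER BY mileage) AS diff
    FROM car
) d
WHERE diff IS NOT NULL   -- the first reading of each car has no previous row
GROUP BY carid;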
For an assignment I have to write several SQL queries for a database stored on a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house. The query is about associating with a given spectator the other spectators that attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date         | tickets_presented   | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+---------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2
Spectators table
id_spectator | last_name | first_name | email                    | create_time         | age
-------------+-----------+------------+--------------------------+---------------------+----
           1 | gonzalez  | colin      | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 |  22
           2 | bequet    | camille    | bequet.camille#gmail.com | 2014-12-10 15:22:31 |  22
Shows table
id_show | name             | kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
--------+------------------+-------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+-----------
      1 | madama butterfly | opera | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
      2 | don giovanni     | opera | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending; the query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and hint me towards a solution? Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
           1 | 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: I wanted to make clear that this answer may be of limited use, as it was written in the context of SQL Server (the tag was present at the time).
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values requires a workaround (they need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show],
       STUFF((SELECT ',' + CAST(A.[id_spectator] AS NVARCHAR(10))
              FROM reservations A
              WHERE A.[id_show] = B.[id_show]
                AND A.[id_spectator] != B.[id_spectator]
              FOR XML PATH('')), 1, 1, '') AS [other_id_spectators]
FROM reservations B
GROUP BY [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.
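A minimal sketch of that constraint and the extra index (the object names here are mine, hypothetical):
-- Unique constraint doubles as the (id_spectator, id_show) index
ALTER TABLE reservations
  ADD CONSTRAINT reservations_spectator_show_uni UNIQUE (id_spectator, id_show);

-- Extra index with leading id_show for queries 2 and 3
CREATE INDEX reservations_show_spectator_idx ON reservations (id_show, id_spectator);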
1. Brute force
The primitive approach would be to form a sorted array of the shows the given user has seen and compare it to the same array for the other spectators:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM  (
   SELECT id_spectator
   FROM   reservations r
   WHERE  id_spectator <> 1
   GROUP  BY 1
   HAVING array_agg(id_show ORDER BY id_show)
       @> (SELECT array_agg(id_show ORDER BY id_show)
           FROM   reservations
           WHERE  id_spectator = 1)
   ) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine the relevant shows, then only consider those:
WITH shows AS (  -- all shows of id 1; 1 row per show
   SELECT id_spectator, id_show
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   )
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM  (
   SELECT s.id_spectator, r.id_spectator AS other
   FROM   shows s
   JOIN   reservations r USING (id_show)
   WHERE  r.id_spectator <> s.id_spectator
   GROUP  BY 1, 2
   HAVING count(*) = (SELECT count(*) FROM shows)
   ) sub
GROUP BY 1;
@> is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
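A quick illustration of the operator (nothing specific to this schema):
SELECT ARRAY[1,2,3] @> ARRAY[2,3];   -- true: the left array contains all elements of the right one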
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify from the query early, use a recursive CTE:
WITH RECURSIVE shows AS (  -- produces exactly 1 row
   SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   GROUP  BY 1
   )
, cte AS (
   SELECT r.id_spectator, 1 AS idx
   FROM   shows s
   JOIN   reservations r ON r.id_show = s.shows[1]
   WHERE  r.id_spectator <> s.id_spectator
   UNION ALL
   SELECT r.id_spectator, idx + 1
   FROM   cte c
   JOIN   reservations r USING (id_spectator)
   JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
   )
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM   shows s
JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
GROUP  BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indexes I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
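A sketch of that second half against the Reservations table above (the literal 1 is just an example id_show):
-- All spectators that attended a given show
SELECT id_spectator
FROM Reservations
WHERE id_show = 1;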
So the final answer I got looks like this:
SELECT id_spectator, id_show,
       (SELECT string_agg(to_char(A.id_spectator, '999'), ',')
        FROM Reservations A
        WHERE A.id_show = B.id_show
       ) AS other_id_spectators
FROM Reservations B
GROUP BY id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
           1 |       1 | 1, 2, 9
           1 |      14 | 1, 2
This suits my needs; however, if you have any improvements to offer, please share :) Thanks again, everybody!