Can't understand a deduplication code in BigQuery - google-bigquery

I'm new to SQL, and i got this piece of code, inspired by this post Delete duplicate rows from a BigQuery table
I ran the code, and understood that it deduplicates my table, but i'm not sure how.
The code is the one that follows
WITH product_query AS(
SELECT
DISTINCT
v2ProductName,
productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE v2ProductName IS NOT NULL
)
SELECT k.* FROM (
#aggregate the products into an array and only take 1 result
SELECT ARRAY_AGG (x LIMIT 1)[OFFSET(0)] k
FROM product_query x
GROUP BY productSKU
)
I think i got the first part, but, starting from "SELECT k.*"... i'm not sure what am i looking at. What the "k" and the "x" are supposed to mean? And what is the OFFSET function?

Related

SQL count (*) function query for a baseball data table explanation

This line of code is simple and I understand what the query will return. I am trying to understand the third line of this code.
select First
, sum(rbi) as 'rbis'
**, count(*) as 'num_players'**
from #batter_stats
group by First
having count(*) >= 5
order by [rbis] desc
The count(*) as 'num_players' is confusing me as to what this does in this query.
Thanks you. I provided both the line of code and the data table.
In your code the aggregation functions
, sum(rbi) as rbis
, count(*) as num_players
agregated by column First
count(*) retunr the number of rows for the corresponding groped value
so in your case retunr the number of rows in your table for the corresponding value of olumn First

Using Multiple aggregate functions in the where clause

We have a select statement in production that takes quite a lot of time.
The current query uses row number - window function.
I am trying to rewrite the query and test the same. assuming its orc table fetching aggregate values instead of using row number may help to reduce the execution time, is my assumption
Is something like this possible. Let me know if i am missing anything.
Sorry i am trying to learn, so please bear with my mistakes, if any.
I tried to rewrite the query as mentioned below.
Original query
SELECT
Q.id,
Q.crt_ts,
Q.upd_ts,
Q.exp_ts,
Q.biz_effdt
(
SELECT u.id, u.crt_ts, u.upd_ts, u.exp_ts, u.biz_effdt, ROW_NUMBER() OVER (PARTITION BY u.id ORDER BY u.crt_ts DESC) AS ROW_N
FROM ( SELECT cust_prd.id, cust_prd.crt_ts, cust_prd.upd_ts, cust_prd.exp_ts, cust_prd.biz_effdt FROM MSTR_CORE.cust_prd
WHERE biz_effdt IN ( SELECT MAX(cust_prd.biz_effdt) FROM MSTR_CORE.cust_prd )
) U
)Q WHERE Q.row_n = 1
My attempt:
SELECT cust_prd.id, cust_prd.crt_ts, cust_prd.upd_ts, cust_prd.exp_ts, cust_prd.biz_effdt FROM MSTR_CORE.cust_prd
WHERE biz_effdt IN ( SELECT MAX(cust_prd.biz_effdt) FROM MSTR_CORE.cust_prd )
having cust_prd.crt_ts = max (cust_prd.crt_ts)

How to implement lag function in teradata.

Input :
Output :
I want the output as shown in the image below.
In the output image, 4 in 'behind' is evaluated as tot_cnt-tot and the subsequent numbers in 'behind', for eg: 2 is evaluated as lag(behind)-tot & as long as the 'rank' remains same, even 'behind' should remain same.
Can anyone please help me implement this in teradata?
You appears to want :
select *, (select count(*)
from table t1
where t1.rank > t.rank
) as behind
from table t;
I would summarize the data and do:
select id, max(tot_cnt), max(tot),
(max(tot_cnt) -
sum(max(tot)) over (order by id rows between unbounded preceding and current row)
) as diff
from t
group by id;
This provides one row per id, which makes a lot more sense to me. If you want the original data rows (which are all duplicates anyway), you can join this back to your table.

Access 2013 - Query not returning correct Number of Results

I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned as there are two different RoundID's that are tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of these two it returns as I just need to know what the number is but for reporting I ideally need to know the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) <3
order by PlayerID, PlayedTo, RoundID;
For performance create an index supporting the join condition. Something like:
create index P_7to8Calcs__PlayerID_RoundID on P_7to8Calcs(PlayerId, PlayedTo);
Note: I used your SQLFiddle as I do not have Acess available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions (which avoids nested sub-query). It works in your SQLFiddle but I am not sure if this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE table (a integer, b integer, c integer, d integer);
INSERT INTO table (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,37,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many entries that I select for, have three be grouped together like this.
I'm running postgresql 8.3
This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM (
SELECT rn, 0 AS rk, (x[rn]).*
FROM (
SELECT x, generate_series(1, array_upper(x, 1)) AS rn
FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
) y
UNION ALL
SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
a, b, c, d
From
table
Union Select Top 1
'', '', '', ''
From
table
Union Select Top 3 Skip 3
a, b, c, d
From
table
Please, don't actually try do that.
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (SELECT CAST(CAST(:interval as REAL) /
(:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER),
dataColumn
FROM dataTable),
blankIncluder(rowNum, dataColumn) as (SELECT rowNum, dataColumn
FROM dataList
UNION ALL
SELECT rowNum - 1, :blankDataColumn
FROM dataList
WHERE MOD(rowNum - 1, :interval) = 0
AND rowNum > :interval)
SELECT *
FROM dataList
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.