BigQuery "Schrödingers Row" or why ROW_NUMBER() is not a good identifier - google-bigquery

Situation
We have fairly complex internal logic to allocate marketing spend to various channels and had recently started to rework some of our queries to simplify the setup. We came across a really puzzling case where using ROW_NUMBER() OVER() to identify unique rows led to very strange results.
Problem
In essence, using ROW_NUMBER() OVER() resulted in what I call Schrödinger's rows, as they appear to be matched and unmatched at the same time (please find a reproducible query below). In the attached screenshot (a result of the query) it can clearly be seen that
german_spend + non_german_spend > total_spend
which should not be the case.
Query
Please note that the query will give you different results each time you run it, as it relies on RAND() to generate dummy data. Also be aware that the query is a heavily simplified version of what we are doing; for reasons beyond the scope of this post, we needed to uniquely identify the buckets.
###################
# CREATE Dummy Data
###################
DECLARE NUMBER_OF_DUMMY_RECORDS DEFAULT 1000000;
WITH data AS (
SELECT
num as campaign_id,
RAND() as rand_1,
RAND() as rand_2
FROM
UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
SELECT
campaign_id,
CASE
WHEN rand_1 < 0.25 THEN 'DE'
WHEN rand_1 < 0.5 THEN 'AT'
WHEN rand_1 < 0.75 THEN 'CH'
ELSE 'IT'
END AS country,
CASE
WHEN rand_2 < 0.25 THEN 'SMALL'
WHEN rand_2 < 0.5 THEN 'MEDIUM'
WHEN rand_2 < 0.75 THEN 'BIG'
ELSE 'MEGA'
END AS city_size,
CAST(RAND() * 1000000 AS INT64) as marketing_spend
FROM
data
),
###################
# END Dummy Data
###################
spend_buckets AS (
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
#MD5(CONCAT(country, city_size)) AS identifier, (this works)
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),
german_spend AS (
SELECT
country,
ARRAY_AGG(identifier) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_buckets
WHERE
country = 'DE'
GROUP BY
country
),
german_identifiers AS (
SELECT id AS identifier FROM german_spend, UNNEST(identifier) as id
),
non_german_spend AS (
SELECT SUM(marketing_spend) AS marketing_spend FROM spend_buckets WHERE identifier NOT IN (SELECT identifier FROM german_identifiers)
)
(SELECT "german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM german_spend
UNION ALL
SELECT "non_german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM non_german_spend
UNION ALL
SELECT "total_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM spend_buckets)
Solution
We were actually able to solve the problem by using a hash of the key instead of the ROW_NUMBER() OVER() identifier, but out of curiosity I would still love to understand what causes this.
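For illustration, the hash-based bucket CTE looks roughly like this (a minimal sketch; the separator is an addition not in the original query, to guard against collisions such as CONCAT('A', 'BC') = CONCAT('AB', 'C')):
spend_buckets AS (
SELECT
country,
city_size,
# Deterministic: the same (country, city_size) pair always hashes to the
# same identifier, no matter how often the CTE is re-evaluated.
TO_HEX(MD5(CONCAT(country, "|", city_size))) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
)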
Additional Notes
Using GENERATE_UUID() AS identifier instead of CONCAT("row_", ROW_NUMBER() OVER()) AS identifier leads to almost zero matches, i.e. the entire spend is classified as non-German.
Writing spend_buckets to a table also solves the problem, which leads me to believe that maybe ROW_NUMBER() OVER() is lazily evaluated?
Using a small number of dummy records also produces non-matching results, regardless of the method used to generate a "unique" id.

Hash functions are a much better way to mark rows than generating a row number, which changes on every run.
CTEs (WITH tables) are not persistent; they are recalculated each time they are referenced in your query.
Referencing the same non-deterministic CTE several times within a query therefore yields different results:
With test as (Select rand() as x)
Select * from test
union all Select * from test
union all Select * from test
A good solution is to use a temp table. A workaround is to search for any CTE that creates a row number or generates random values and is referenced more than once later on. Rename such a CTE, wrap it in a recursive CTE under the original name, and use that wrapper from then on. In your example it is spend_buckets:
WITH recursive
...
spend_buckets_ as (
...),
spend_buckets as
(select * from spend_buckets_
union all select * from spend_buckets_
where false
),
Then the values will match.
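For comparison, a minimal sketch of the temp-table variant in BigQuery scripting (the inline rows here are only a stand-in for the dummy-data CTEs from the question):
CREATE TEMP TABLE spend_buckets AS
WITH spend_with_categories AS (
# Stand-in for the dummy-data CTEs from the question
SELECT 'DE' AS country, 'SMALL' AS city_size, CAST(RAND() * 1000000 AS INT64) AS marketing_spend
UNION ALL
SELECT 'AT', 'BIG', CAST(RAND() * 1000000 AS INT64)
)
SELECT
country,
city_size,
CONCAT("row_", CAST(ROW_NUMBER() OVER() AS STRING)) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM spend_with_categories
GROUP BY 1,2;
# Every later SELECT against spend_buckets now sees the same identifiers,
# because the non-deterministic expression was evaluated exactly once.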

Related

Finding Duplicate Records on Different Criteria

I am trying to write a group of queries that find instances of duplicate records with matches across several columns, some of them being exact matches and some being similar matches. I believe all of these queries should be very similar with just a few changes to the filters for the different variations, but I may be mistaken.
Two examples:
A query to find all instances of records with identical matches in columns A, B, D, and E with C and F being shown in the results but not needing to have any duplication.
A query to find all instances of records with identical matches in Columns A, B, and D, while E just has to be similar to other like records. (again C and F being shown in the results but not needing to have any duplication)
SQL is a language I have some familiarity with, but on the whole I am a novice. So far I have tried to use GROUP BY and a HAVING clause to find instances where the count of the group is > 1, but I can only get this to work with the identical matches, and I can only see the columns used in the GROUP BY (no C and F). I have figured out that I can use Levenshtein distance to find the similar instances cleanly, but I have not been able to work out how to integrate this into the GROUP BY formula.
Any advice on this would be helpful, feel free to ask any questions.
Thanks
This block is as far as I have got, in an example with three identical columns and one similar column. It produces some accurate results, but also a handful of records that have only one of the necessary matches, and I am unsure why. I am not married to having the query work this way; this is just what I came up with. The E and R keys are unique identifiers necessary for later processing.
WITH A AS (
SELECT
ROW_NUMBER() OVER(ORDER BY "Tickets"."id", "Tickets"."vid" ASC) AS RowNo,
to_hex(MD5( TO_UTF8( CONCAT( "Tickets"."vid" ,"Tickets"."id" , "Tickets"."num", cast("Tickets"."date"as varchar) )))) AS R_KEY,
"Tickets"."vid" AS "vid",
"Tickets"."id" AS "id",
"Tickets"."num" AS "num",
"Tickets"."date" AS "date",
"Tickets"."amount" AS "amount",
ROW_NUMBER() OVER(ORDER BY "TicketsB"."vid", "TicketsB"."id" ASC) AS RowNo2,
to_hex(MD5( TO_UTF8( CONCAT( "TicketsB"."vid" ,"TicketsB"."id" , "TicketsB"."num", cast("TicketsB"."date"as varchar) )))) AS R_KEY2,
"TicketsB"."vid" AS "vid2",
"TicketsB"."id" AS "id2",
"TicketsB"."num" AS "num2",
"TicketsB"."date" AS "date2",
"TicketsB"."amount" AS "amount2"
FROM
TicketTable AS "Tickets",
TicketTable AS "TicketsB"
WHERE "Tickets".vid = "TicketsB".vid
AND "Tickets".date = "TicketsB".date
AND "Tickets".amount = "TicketsB".amount
AND "Tickets".num != "TicketsB".num
AND levenshtein_distance("Tickets".num,"TicketsB".num) < 2
AND length("TicketsB".num) - length("Tickets".num) = 1
LIMIT 100000
)
SELECT to_hex(MD5( TO_UTF8( CONCAT(VID , cast(DATE as varchar), cast(AMOUNT as varchar) )))) AS E_KEY,
R_KEY, VID, ID, NUM, DATE, AMOUNT FROM A
UNION
SELECT to_hex(MD5( TO_UTF8( CONCAT(VID , cast(DATE as varchar), cast(AMOUNT as varchar) )))) AS E_KEY,
R_KEY2, VID2, ID2, NUM2, DATE2, AMOUNT2 FROM A
ORDER BY VID, DATE, AMOUNT
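For the exact-match variants, a window count may help where GROUP BY/HAVING falls short, since it keeps the non-grouped columns visible (a sketch only, not from the original attempt; A through F are the placeholder column names from the question):
SELECT *
FROM (
SELECT
t.*,
-- how many rows share the same A, B, D, E values; > 1 means duplicate
COUNT(*) OVER (PARTITION BY A, B, D, E) AS dup_count
FROM TicketTable t
) x
WHERE dup_count > 1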

SQL to show one result calculated by the other values?

It seems we can use an SQL statement like:
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
);
but we can't do
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(
select
(c_foos / c_bars) as the_ratio
);
or
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(c_foos / c_bars) as the_ratio;
Is there a way to do that showing all 3 numbers? Is there a more definite rule as to what can be done and what can't?
You can try this:
You define two CTEs in a WITH clause, so you can use your results in the main query built on the two CTE tables (cte_num and cte_den):
WITH
cte_num AS (
SELECT count(*) as c_foos
FROM foos
),
cte_den AS (
SELECT count(*) as c_bars
FROM bars
)
SELECT
cte_num.c_foos,
cte_den.c_bars,
cte_num.c_foos / cte_den.c_bars as the_ratio
from cte_num, cte_den;
There is a small number of simple rules... but SQL seems so easy that most programmers prefer to cut to the chase, and later complain they didn't get the plot :)
You can think of a query as a description of a flow: columns in a select share inputs (defined in from), but are evaluated "in parallel", without seeing each other. Your complex example boils down to the fact that you cannot do this:
select 1 as a, 2 as b, a + b;
Fields a and b are defined as outputs from the query, but there are no inputs called a and b. All you have to do is modify the query so that a and b are inputs:
select a + b from (select 1 as a, 2 as b) as inputs
And this will work (this is, btw., the solution for your queries).
Addendum:
The confusion comes from the fact that in most SQL 101 cases outputs are created directly from inputs (data just passes through).
This flow model is useful because it makes things easier to reason about in more complex cases. Also, we avoid ambiguities and loops. You can think about it in the context of a query like: select name as last_name, last_name as name, name || ' ' || last_name from person;
Move the conditions to the FROM clause:
select f.c_foos, b.c_bars, f.c_foos / b.c_bars
from (select count(*) as c_foos from foos
) f cross join
(select count(*) as c_bars from bars
) b;
Ironically, your first version will work in MySQL. I don't actually think this is intentional; I think it is an artifact of their parser -- meaning that it happens to work but might stop working in future versions.
The simplest way is to use a CTE that returns the 2 columns:
with cte as (
select
(select count(*) from foos) as c_foos,
(select count(*) from bars) as c_bars
)
select c_foos, c_bars, (c_foos / c_bars) as the_ratio
from cte
Note that the aliases of the two columns must be set outside each subquery's parentheses, not inside them.

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remits for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row per claim. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

DB2 getting QDT Array List maximum exceeded using CTE and sql recursion

I am using CTEs to create a recursive query that merges data from multiple columns into one.
I have about 9 working CTEs (I need to merge columns a few times in one row per request, so I have the CTE helpers). When I add the 10th, I get an error. I am running the query from Visual Studio 2010, and here is the error:
On the AS400 system, using the WRKOBJLCK MyUserProfile *USRPRF command, I see:
I can't find any information on this.
I am using DB2 running on an AS400 system (operating system i5/OS, version V5R4M0).
I repeat these same 3 CTEs, but with different conditions to compare against:
t1A (ROWNUM, PARTNO, LOCNAM, LOCCODE, QTY) AS
(
SELECT rownumber() over(partition by s2.LOCPART), s2.LOCPART, s2.LOCNAM, s2.LOCCODE, s2.LOCQTY
FROM (
SELECT distinct s1.LOCPART, L.LOCNAM, L.LOCCODE, L.LOCQTY
FROM(
SELECT COUNT(LOCPART) AS counts, LOCPART
FROM LOCATIONS
WHERE LOCCODE = 'A'
GROUP BY LOCPART) S1, LOCATIONS L
WHERE S1.COUNTS > 1 AND S1.LOCPART = L.LOCPART AND L.LOCCODE = 'A'
)s2
),
t2A(PARTNO, LIST, QTY, CODE, CNT) AS
(
select PARTNO, LOCNAM, QTY, LOCCODE, 1
from t1A
where ROWNUM = 1
UNION ALL
select t2A.PARTNO, t2A.LIST || ', ' || t1A.LOCNAM, t1A.QTY, t1A.LOCCODE, t2A.CNT + 1
FROM t2A, t1A
where t2A.PARTNO = t1A.PARTNO
AND t2A.CNT + 1 = t1A.ROWNUM
),
t3A(PARTNO, LIST, QTY, CODE, CNT) AS
(
select t2.PARTNO, t2.LIST, q.SQTY, t2.CODE, t2.CNT
from(
select SUM(QTY) as SQTY, PARTNO
FROM t1A
GROUP BY PARTNO
) q, t2A t2
where t2.PARTNO = q.PARTNO
)
Using these, I just call a simple select on one of the CTEs for testing, and I get the error each time I have more than 9 CTEs (even if only one is being called).
In the AS400 error (green screen snapshot) what does QDT stand for, and when am I using an Array here?
This was a mess. Error after error. The only way I could get around this was to create views and piece them together.
When creating the view, I could only get it to work with one CTE, not multiple; what worked fine as one recursive CTE would not work when defined as a view. I had to break the subquery apart into views. I also could not create a view out of SELECT rownumber() over(partition by COL1, COL2) that contained a subquery; I had to break it down into two views. If I used a view as the subquery of SELECT rownumber() over(partition by COL1, COL2) and threw that into the CTE, it would not work. I had to put the SELECT rownumber() over(partition by COL1, COL2) with its inner view into another view; then I was able to use it in the CTE, and create a main view out of all of that.
Also, each error I got was a system error, not an SQL error.
So in conclusion: I relied heavily on views to fix my issue, in case anyone ever runs across this same problem.
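As an illustration, the t1A logic from the question might be split into views along the lines described (a sketch only; the view names are made up):
-- Innermost view: part numbers appearing more than once for code 'A'
CREATE VIEW T1A_COUNTS AS
SELECT COUNT(LOCPART) AS COUNTS, LOCPART
FROM LOCATIONS
WHERE LOCCODE = 'A'
GROUP BY LOCPART;
-- Middle view: the distinct location rows for those parts
CREATE VIEW T1A_DETAIL AS
SELECT DISTINCT S1.LOCPART, L.LOCNAM, L.LOCCODE, L.LOCQTY
FROM T1A_COUNTS S1, LOCATIONS L
WHERE S1.COUNTS > 1 AND S1.LOCPART = L.LOCPART AND L.LOCCODE = 'A';
-- Outer view: the row numbering kept separate, per the workaround above
CREATE VIEW T1A_NUMBERED AS
SELECT ROWNUMBER() OVER(PARTITION BY LOCPART) AS ROWNUM,
LOCPART, LOCNAM, LOCCODE, LOCQTY
FROM T1A_DETAIL;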

Weighted average in T-SQL (like Excel's SUMPRODUCT)

I am looking for a way to derive a weighted average from two rows of data with the same number of columns, where the average is as follows (borrowing Excel notation):
((A1*B1)+(A2*B2)+...+(An*Bn))/SUM(A1:An)
The first part reflects the same functionality as Excel's SUMPRODUCT() function.
My catch is that I need to dynamically specify which row gets averaged with weights, and which row the weights come from, and a date range.
EDIT: This is easier than I thought, because Excel was making me think I required some kind of pivot. My solution so far is thus:
select sum(baseSeries.Actual * weightSeries.Actual) / sum(weightSeries.Actual)
from (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Weighty'
) baseSeries inner join (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Tons Milled'
) weightSeries on baseSeries.RecordDate = weightSeries.RecordDate
Quassnoi's answer shows how to do the SumProduct, and using a WHERE clause would allow you to restrict by a Date field...
SELECT
SUM([tbl].data * [tbl].weight) / SUM([tbl].weight)
FROM
[tbl]
WHERE
[tbl].date >= '2009 Jan 01'
AND [tbl].date < '2010 Jan 01'
The more complex part is where you want to "dynamically specify" which field is [data] and which field is [weight]. The short answer is that realistically you'd have to make use of Dynamic SQL. Something along the lines of:
- Create a string template
- Replace all instances of [tbl].data with the appropriate data field
- Replace all instances of [tbl].weight with the appropriate weight field
- Execute the string
Dynamic SQL, however, carries its own overhead. If the queries are relatively infrequent, or the execution time of the query itself is relatively long, this may not matter. If they are common and short, however, you may notice that using dynamic SQL introduces a noticeable overhead. (Not to mention having to be careful about SQL injection attacks, etc.)
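A sketch of that templating approach using sp_executesql (the column names here are assumed inputs, not from the original post; QUOTENAME limits the injection risk):
-- Assumed inputs: the names of the data and weight columns
DECLARE @dataField sysname = N'Actual';
DECLARE @weightField sysname = N'Weight';
DECLARE @sql nvarchar(max) =
    N'SELECT SUM(t.' + QUOTENAME(@dataField) + N' * t.' + QUOTENAME(@weightField) + N')'
  + N' / SUM(t.' + QUOTENAME(@weightField) + N')'
  + N' FROM [tbl] AS t'
  + N' WHERE t.[date] >= @from AND t.[date] < @to;';
-- The date range stays a proper parameter; only identifiers are spliced in
EXEC sp_executesql @sql,
    N'@from datetime, @to datetime',
    @from = '20090101', @to = '20100101';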
EDIT:
In your latest example you highlight three fields:
RecordDate
KPI
Actual
When the [KPI] is "Weighty", then [Actual] is the weighting factor to use.
When the [KPI] is "Tons Milled", then [Actual] is the data you want to aggregate.
Some questions I have are:
Are there any other fields?
Is there only ever ONE actual per date per KPI?
The reason I ask is that you want to ensure the JOIN you do is only ever 1:1. (You don't want 5 actuals joining with 5 weights, giving 25 resulting records.)
Regardless, a slight simplification of your query is certainly possible...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
CalcProductionRecords AS [baseSeries]
INNER JOIN
CalcProductionRecords AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
-- AND [weightSeries].someOtherID = [baseSeries].someOtherID
WHERE
[baseSeries].KPI = 'Tons Milled'
AND [weightSeries].KPI = 'Weighty'
The commented-out line is only needed if you require additional predicates to ensure a 1:1 relationship between your data and the weights.
If you can't guarantee just one value per date, and don't have any other fields to join on, you can modify your subquery-based version slightly...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
(
SELECT
RecordDate,
SUM(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Tons Milled'
GROUP BY
RecordDate
)
AS [baseSeries]
INNER JOIN
(
SELECT
RecordDate,
AVG(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Weighty'
GROUP BY
RecordDate
)
AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
This assumes the AVG of the weight is valid if there are multiple weights for the same day.
EDIT: Someone just voted for this, so I thought I'd improve the final answer :)
SELECT
SUM(Actual * Weight) / SUM(Weight)
FROM
(
SELECT
RecordDate,
SUM(CASE WHEN KPI = 'Tons Milled' THEN Actual ELSE NULL END) AS Actual,
AVG(CASE WHEN KPI = 'Weighty' THEN Actual ELSE NULL END) AS Weight
FROM
CalcProductionRecords
WHERE
KPI IN ('Tons Milled', 'Weighty')
GROUP BY
RecordDate
)
AS pivotAggregate
This avoids the JOIN and also only scans the table once.
It relies on the fact that NULL values are ignored when calculating the AVG().
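That behaviour is easy to check in isolation (a throwaway example):
-- Returns 15, not 10: the NULL row is excluded from AVG entirely.
SELECT AVG(x) AS avg_x
FROM (VALUES (10), (20), (CAST(NULL AS int))) AS v(x);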
SELECT SUM(A * B) / SUM(A)
FROM mytable
If I have understood the problem, then try this:
SET DATEFORMAT dmy
declare @tbl table(A int, B int, recorddate datetime, KPI varchar(50))
insert into @tbl
select 1, 10, '21/01/2009', 'Weighty' union all
select 2, 20, '10/01/2009', 'Tons Milled' union all
select 3, 30, '03/02/2009', 'xyz' union all
select 4, 40, '10/01/2009', 'Weighty' union all
select 5, 50, '05/01/2009', 'Tons Milled' union all
select 6, 60, '04/01/2009', 'abc' union all
select 7, 70, '05/01/2009', 'Weighty' union all
select 8, 80, '09/01/2009', 'xyz' union all
select 9, 90, '05/01/2009', 'kws' union all
select 10, 100, '05/01/2009', 'Tons Milled'
select SUM(t1.A * t2.A) / SUM(t2.A) as Result from
(select RecordDate, A, B, KPI from @tbl) t1
inner join (select RecordDate, A, B, KPI from @tbl) t2
on t1.RecordDate = t2.RecordDate
and t1.KPI = t2.KPI