DB2 getting QDT Array List maximum exceeded using CTE and sql recursion - sql

I am using CTEs to build a recursive query that merges data from multiple columns into one.
I have about 9 working CTEs (I need to merge columns several times in one row per request, hence the helper CTEs). When I add the 10th, I get an error. I am running the query from Visual Studio 2010, and this is the error:
On the AS400 system, using the WRKOBJLCK MyUserProfile *USRPRF command, I see:
I can't find any information on this.
I am using DB2 running on an AS400 system (operating system: i5/OS, version: V5R4M0).
I repeat these same 3 CTEs, but with different comparison conditions:
t1A (ROWNUM, PARTNO, LOCNAM, LOCCODE, QTY) AS
(
    SELECT rownumber() over(partition by s2.LOCPART), s2.LOCPART, s2.LOCNAM, s2.LOCCODE, s2.LOCQTY
    FROM (
        SELECT DISTINCT s1.LOCPART, L.LOCNAM, L.LOCCODE, L.LOCQTY
        FROM (
            SELECT COUNT(LOCPART) AS COUNTS, LOCPART
            FROM LOCATIONS
            WHERE LOCCODE = 'A'
            GROUP BY LOCPART
        ) S1, LOCATIONS L
        WHERE S1.COUNTS > 1 AND S1.LOCPART = L.LOCPART AND L.LOCCODE = 'A'
    ) s2
),
t2A (PARTNO, LIST, QTY, CODE, CNT) AS
(
    SELECT PARTNO, LOCNAM, QTY, LOCCODE, 1
    FROM t1A
    WHERE ROWNUM = 1
    UNION ALL
    SELECT t2A.PARTNO, t2A.LIST || ', ' || t1A.LOCNAM, t1A.QTY, t1A.LOCCODE, t2A.CNT + 1
    FROM t2A, t1A
    WHERE t2A.PARTNO = t1A.PARTNO
      AND t2A.CNT + 1 = t1A.ROWNUM
),
t3A (PARTNO, LIST, QTY, CODE, CNT) AS
(
    SELECT t2.PARTNO, t2.LIST, q.SQTY, t2.CODE, t2.CNT
    FROM (
        SELECT SUM(QTY) AS SQTY, PARTNO
        FROM t1A
        GROUP BY PARTNO
    ) q, t2A t2
    WHERE t2.PARTNO = q.PARTNO
)
Using these, I just call a simple SELECT on one of the CTEs for testing, and I get the error every time I have more than 9 CTEs (even if only one of them is actually referenced).
In the AS400 error (green-screen snapshot), what does QDT stand for, and where am I using an array here?

This was a mess, error after error. The only way I could get around it was to create views and piece them together.
When creating the views, I could only get them to work with a single CTE each, not multiple; what worked fine as one recursive CTE would not work when defined as a view. I had to break the subquery apart into views. I also couldn't create a view out of a SELECT rownumber() over(partition by COL1, COL2) that contained a subquery; I had to break that down into two views. If I called SELECT rownumber() over(partition by COL1, COL2) using a view as its subquery and threw that into the CTE, it wouldn't work. I had to put the rownumber() select, together with its inner view, into yet another view; then I was able to use it in the CTE, and finally create a main view out of all of that.
Also, each error I got was a system error, not an SQL error.
In conclusion, I relied heavily on views to fix my issue, in case anyone ever runs across this same problem.
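A rough sketch of that layering, using the t1A query from the question (the view names here are only placeholders, not the ones actually used):
-- Hypothetical view names; each level wraps the previous one.
CREATE VIEW LOC_A_COUNTS AS              -- innermost grouped subquery
    SELECT COUNT(LOCPART) AS COUNTS, LOCPART
    FROM LOCATIONS
    WHERE LOCCODE = 'A'
    GROUP BY LOCPART;

CREATE VIEW LOC_A_DETAIL AS              -- the DISTINCT join, built on the first view
    SELECT DISTINCT S1.LOCPART, L.LOCNAM, L.LOCCODE, L.LOCQTY
    FROM LOC_A_COUNTS S1, LOCATIONS L
    WHERE S1.COUNTS > 1 AND S1.LOCPART = L.LOCPART AND L.LOCCODE = 'A';

CREATE VIEW LOC_A_NUMBERED AS            -- rownumber() over(partition by ...) gets its own view
    SELECT rownumber() OVER(PARTITION BY LOCPART) AS ROWNUM,
           LOCPART AS PARTNO, LOCNAM, LOCCODE, LOCQTY AS QTY
    FROM LOC_A_DETAIL;
-- The recursive CTE (t2A above) then selects from LOC_A_NUMBERED instead of the
-- inline subqueries, and the whole statement is finally wrapped in one main view.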

BigQuery "Schrödingers Row" or why ROW_NUMBER() is not a good identifier

Situation
We have fairly complex internal logic to allocate marketing spend to various channels and had recently started reworking some of our queries to simplify the setup. We came across a really puzzling case where using ROW_NUMBER() OVER() to identify unique rows led to very strange results.
Problem
In essence, using ROW_NUMBER() OVER() resulted in what I call Schrödinger's rows, as they appear to be matched and unmatched at the same time (a reproducible query is below). In the attached screenshot (a result of the query) it can clearly be seen that
german_spend + non_german_spend > total_spend
Which should not be the case.
Query
Please note that the query will give you different results each time you run it, as it relies on RAND() to generate dummy data. Also be aware that the query is a very stripped-down version of what we are doing; for reasons beyond the scope of this post, we need to uniquely identify the buckets.
###################
# CREATE Dummy Data
###################
DECLARE NUMBER_OF_DUMMY_RECORDS DEFAULT 1000000;
WITH data AS (
SELECT
num as campaign_id,
RAND() as rand_1,
RAND() as rand_2
FROM
UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
SELECT
campaign_id,
CASE
WHEN rand_1 < 0.25 THEN 'DE'
WHEN rand_1 < 0.5 THEN 'AT'
WHEN rand_1 < 0.75 THEN 'CH'
ELSE 'IT'
END AS country,
CASE
WHEN rand_2 < 0.25 THEN 'SMALL'
WHEN rand_2 < 0.5 THEN 'MEDIUM'
WHEN rand_2 < 0.75 THEN 'BIG'
ELSE 'MEGA'
END AS city_size,
CAST(RAND() * 1000000 AS INT64) as marketing_spend
FROM
data
),
###################
# END Dummy Data
###################
spend_buckets AS (
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
#MD5(CONCAT(country, city_size)) AS identifier, (this works)
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),
german_spend AS (
SELECT
country,
ARRAY_AGG(identifier) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_buckets
WHERE
country = 'DE'
GROUP BY
country
),
german_identifiers AS (
SELECT id AS identifier FROM german_spend, UNNEST(identifier) as id
),
non_german_spend AS (
SELECT SUM(marketing_spend) AS marketing_spend FROM spend_buckets WHERE identifier NOT IN (SELECT identifier FROM german_identifiers)
)
(SELECT "german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM german_spend
UNION ALL
SELECT "non_german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM non_german_spend
UNION ALL
SELECT "total_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM spend_buckets)
Solution
We were actually able to solve the problem by using a hash of the key instead of the ROW_NUMBER() OVER() identifier, but out of curiosity I would still love to understand what causes this.
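For reference, that fix is the commented-out line in spend_buckets above; the working version of the CTE simply swaps the ROW_NUMBER() identifier for a hash of the grouping key:
spend_buckets AS (
  SELECT
    country,
    city_size,
    MD5(CONCAT(country, city_size)) AS identifier, # deterministic per (country, city_size) bucket
    SUM(marketing_spend) AS marketing_spend
  FROM
    spend_with_categories
  GROUP BY 1, 2
),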
Additional Notes
Using GENERATE_UUID() AS identifier instead of CONCAT("row_", ROW_NUMBER() OVER()) AS identifier leads to almost 0 matches, i.e. the entire spend is classified as non-German.
Writing spend_buckets to a table also solves the problem, which leads me to believe that ROW_NUMBER() OVER() may be lazily evaluated?
Using a small number of dummy records also produces non-matching results, regardless of the method used to generate a "unique" id.
Hash functions are much better for marking rows than generating a row number, which can change from run to run.
CTEs (WITH tables) are not persistent; they are recalculated each time they are referenced in your query.
Referencing the same CTE several times within a query can therefore produce different results:
With test as (Select rand() as x)
Select * from test
union all Select * from test
union all Select * from test
A good solution is to use a temp table. A workaround is to look for CTEs that create a row_number or generate random numbers and are referenced more than once later in the query. Rename such a CTE, wrap it in a recursive-style CTE under the original name, and let the later references use that wrapper. In your example it is spend_buckets:
WITH recursive
...
spend_buckets_ as (
...),
spend_buckets as
(select * from spend_buckets_
union all select * from spend_buckets_
where false
),
Then the values will match.
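For completeness, a minimal sketch of the temp-table route mentioned above, using BigQuery scripting (the dummy-data CTEs from the question are repeated inside the CREATE statement, reusing the declared NUMBER_OF_DUMMY_RECORDS variable, so the buckets are materialized exactly once):
# Materialize the buckets once; ROW_NUMBER() is evaluated a single time here,
# so every later reference sees the same identifiers.
CREATE TEMP TABLE spend_buckets AS
WITH data AS (
  SELECT num AS campaign_id, RAND() AS rand_1, RAND() AS rand_2
  FROM UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
  SELECT
    campaign_id,
    CASE WHEN rand_1 < 0.25 THEN 'DE' WHEN rand_1 < 0.5 THEN 'AT'
         WHEN rand_1 < 0.75 THEN 'CH' ELSE 'IT' END AS country,
    CASE WHEN rand_2 < 0.25 THEN 'SMALL' WHEN rand_2 < 0.5 THEN 'MEDIUM'
         WHEN rand_2 < 0.75 THEN 'BIG' ELSE 'MEGA' END AS city_size,
    CAST(RAND() * 1000000 AS INT64) AS marketing_spend
  FROM data
)
SELECT
  country,
  city_size,
  CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
  SUM(marketing_spend) AS marketing_spend
FROM spend_with_categories
GROUP BY 1, 2;
# The german_spend / non_german_spend / total_spend queries can then select
# FROM spend_buckets unchanged.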

SQL to show one result calculated by the other values?

It seems we can use a SQL statement such as:
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
);
but we can't do
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(
select
(c_foos / c_bars) as the_ratio
);
or
select
(
select
count(*) as c_foos
from
foos
),
(
select
count(*) as c_bars
from
bars
),
(c_foos / c_bars) as the_ratio;
Is there a way to do that showing all 3 numbers? Is there a more definite rule as to what can be done and what can't?
You can try this:
You define two CTEs in a WITH clause, so you can use the results in the main query built on the two CTE tables (cte_num and cte_den):
WITH
cte_num AS (
    SELECT count(*) AS c_foos
    FROM foos
),
cte_den AS (
    SELECT count(*) AS c_bars
    FROM bars
)
SELECT
    cte_num.c_foos,
    cte_den.c_bars,
    cte_num.c_foos / cte_den.c_bars AS the_ratio
FROM cte_num, cte_den;
There is a small number of simple rules... but SQL seems so easy that most programmers prefer to cut to the chase, and later complain they didn't get the plot :)
You can think of a query as a description of a flow: columns in a select share inputs (defined in from), but are evaluated "in parallel", without seeing each other. Your complex example boils down to the fact that you cannot do this:
select 1 as a, 2 as b, a + b;
Fields a and b are defined as outputs of the query, but there are no inputs called a and b. All you have to do is modify the query so that a and b are inputs:
select a + b from (select 1 as a, 2 as b) as inputs
And this will work (this is, btw., the solution for your queries).
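Applied to the foos/bars question, that pattern looks like this (same idea, with the two counts as the inputs):
select c_foos, c_bars, c_foos / c_bars as the_ratio
from (
    select
        (select count(*) from foos) as c_foos,
        (select count(*) from bars) as c_bars
) as inputs;
-- Depending on the DBMS, c_foos / c_bars may be integer division; cast if a fractional ratio is needed.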
Addendum:
The confusion comes from the fact that in most SQL 101 cases outputs are created directly from inputs (data just passes through).
This flow model is useful because it makes more complex cases easier to reason about, and it avoids ambiguities and loops. You can think about it in the context of a query like: select name as last_name, last_name as name, name || ' ' || last_name from person;
Move the subqueries to the FROM clause:
select f.c_foos, b.c_bars, f.c_foos / b.c_bars
from (select count(*) as c_foos from foos
     ) f cross join
     (select count(*) as c_bars from bars
     ) b;
Ironically, your first version will work in MySQL (see here). I don't actually think this is intentional. I think it is an artifact of their parser -- meaning that it happens to work but might stop working in future versions.
The simplest way is to use a CTE that returns the 2 columns:
with cte as (
select
(select count(*) from foos) as c_foos,
(select count(*) from bars) as c_bars
)
select c_foos, c_bars, (c_foos / c_bars) as the_ratio
from cte
Note that the aliases of the 2 columns must be set outside each subquery's parentheses, not inside them.

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
       MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT > 0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem. They are both remittances for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
[screenshot: two remittance rows for the same claim with the same timestamp]
You can change your query like this:
SELECT
    p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
    Remittance AS p
    INNER JOIN
    (SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
            claim_id
     FROM Claims_Group2
     WHERE BILLED_AMOUNT > 0
     GROUP BY Claim_ID) AS latest_remit
    ON latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without more information on the structure of your database (especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them), it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all; it has been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
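For example, enumerating the remittance columns in the outer select would look something like this (the column list below is a guess; substitute whichever columns the table actually has):
select t.remittance_uuid, t.billed_amount, t.active   -- hypothetical column list; rn itself is not returned
from (
    select
        r.*,
        row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
    from
        remittance r
        join claims_group2 cg2
            on r.remittance_uuid = cg2.remittance_uuid
    where
        r.active = 0
        and r.billed_amount > 0
        and cg2.active = 0
        and cg2.billed_amount > 0
) t
where t.rn = 1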
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

SQL Logic: Finding Non-Duplicates with Similar Rows

I'll do my best to summarize what I am having trouble with. I never used much SQL until recently.
Currently I am using SQL Server 2012 at work and have been tasked with trying to find oddities in SQL tables. Specifically, the tables contain similar information regarding servers. Kind of meta, I know. So they each share a column called "DB_NAME". After that, there are no similar columns. So I need to compare Table A and Table B and produce a list of records (servers) where a server is NOT listed in BOTH Table A and B. Additionally, this query is being ran against an exception list. I'm not 100% sure of the logic to best handle this. And while I would love to get something "extremely efficient", I am more-so looking at something that just plain works at the time being.
SELECT *
FROM (SELECT
UPPER(ta.DB_NAME) AS [DB_Name]
FROM
[CMS].[dbo].[TABLE_A] AS ta
UNION
SELECT
UPPER(tb.DB_NAME) AS [DB_Name]
FROM
[CMS].[dbo].[TABLE_B] as tb
) AS SQLresults
WHERE NOT EXISTS (
SELECT *
FROM
[CMS].[dbo].[TABLE_C_EXCEPTIONS] as tc
WHERE
SQLresults.[DB_Name] = tc.DB_NAME)
ORDER BY SQLresults.[DB_Name]
One method uses union all and aggregation:
select ab.db_name
from ((select upper(DB_NAME) as db_name, 'A' as which
       from CMS.dbo.TABLE_A
      ) union all
      (select upper(DB_NAME) as db_name, 'B' as which
       from CMS.dbo.TABLE_B
      )
     ) ab
where not exists (select 1
                  from CMS.dbo.TABLE_C_EXCEPTIONS e
                  where upper(e.DB_NAME) = ab.db_name
                 )
group by ab.db_name
having count(distinct which) <> 2;
SQL Server is case-insensitive by default. I left the upper()s in the query in case your installation is case sensitive.
Here is another option using EXCEPT. I added a group by in each half of the union because it was not clear in your original post if DB_NAME is unique in your tables.
select DatabaseName
from
(
    SELECT UPPER(ta.DB_NAME) AS DatabaseName
    FROM [CMS].[dbo].[TABLE_A] AS ta
    GROUP BY UPPER(ta.DB_NAME)
    UNION ALL
    SELECT UPPER(tb.DB_NAME) AS DatabaseName
    FROM [CMS].[dbo].[TABLE_B] as tb
    GROUP BY UPPER(tb.DB_NAME)
) x
group by DatabaseName
having count(*) < 2
EXCEPT
(
    select DB_NAME
    from CMS.dbo.TABLE_C_EXCEPTIONS
)

Logic to identify new line item for current month

I am trying to write a query to spot new line items appearing in my data set. So for example I have the following table structure.
The logic needs to identify whether the line item is new since the previous BilledMonth.
TableA
So if I were to write it in English:
Select IF 'CLI' & 'Description' & 'UnitCost' doesn't exist for BilledMonth -1
I have managed to create a join showing if it exists for the previous billing month.
But I am really struggling with the negative logic (i.e. the line item is new for this month).
Any help greatly appreciated.
SELECT t.CLI, t.Description
FROM yourTable t
LEFT JOIN yourTable t2
ON t.CLI = t2.CLI
AND t.Description = t2.Description
AND t.UnitCost = t2.UnitCost
AND t.BilledMonth - 1 = t2.BilledMonth
WHERE t2.CLI is null
I think SQL Server supports analytic functions, so something like this should work:
select CLI, Description, UnitCost, BilledMonth
from (
    select CLI, Description, UnitCost, BilledMonth,
           count(*) over (partition by CLI, Description, UnitCost order by BilledMonth) as cnt
    from mytable
) t
where cnt = 1
If this works, it is very likely to be more efficient and faster than a join-based select statement.