Teradata SQL stack rows per user

Is there a way to stack/group string/text values per user?
The data I have:
USER STATES
1 CA
1 AR
1 IN
2 CA
3 CA
3 NY
4 CA
4 AL
4 SD
4 TX
What I need is
USER STATES
1 CA / AR / IN
2 CA
3 CA / NY
4 CA / AL / SD / TX
I tried a cross join and then another cross join, but the data spools out. Thanks!

If Teradata's XML services are installed there's a function named XMLAGG, which can produce a similar result: CA, AR, IN
SELECT user,
TRIM(TRAILING ',' FROM (XMLAGG(TRIM(states)|| ',' /* optionally ORDER BY ...*/) (VARCHAR(10000))))
FROM tab
GROUP BY 1
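To control the ordering and use the slash separator from the desired output, the same pattern can be adapted. A rough sketch only, using the question's names (tab, user, states); a plain '/' is used as the separator because TRIM strips only a single trailing character, and note that USER is a Teradata keyword, so the column may need quoting or renaming in practice:
SELECT user,
TRIM(TRAILING '/' FROM (XMLAGG(TRIM(states) || '/' ORDER BY states) (VARCHAR(10000))))
FROM tab
GROUP BY 1;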
Btw, using recursion will result in huge spool usage, because you keep all the intermediate rows in spool before returning the final row.

I am not an expert but this should work. You may need to modify it a bit per your exact requirement. Hope this helps!
CREATE VOLATILE TABLE temp AS (
SELECT
USER
,STATES
,ROW_NUMBER() OVER (PARTITION BY USER ORDER BY STATES) AS rn
FROM yourtable
) WITH DATA PRIMARY INDEX(USER) ON COMMIT PRESERVE ROWS;
WITH RECURSIVE rec_test(US,ST, LVL)
AS
(
SELECT USER, STATES (VARCHAR(1000)), 1 -- size the seed column to hold the full concatenated list
FROM temp
WHERE rn = 1
UNION ALL
SELECT USER, TRIM(STATES) || ', ' || ST,LVL+1
FROM temp INNER JOIN rec_test
ON USER = US
AND temp.rn = rec_test.lvl+1
)
SELECT US,ST, LVL
FROM rec_test
QUALIFY RANK() OVER(PARTITION BY US ORDER BY LVL DESC) = 1;

Unfortunately there is no GROUP_CONCAT or any other string aggregate function in Teradata (at least none that I'm aware of), so one way to achieve your result would be to use recursion, since you don't know the maximum number of states per user.
For recursion you should use a Volatile Table, as OLAP functions are not allowed in the recursive part. This is untested code (I've got no way of testing it, unfortunately), so there might be bugs, but it should give you the concept and, with some troubleshooting if needed, the expected result.
Replace yourtable in definition of Volatile Table with your real table name.
CREATE VOLATILE TABLE vt AS (
SELECT
user
, states
, ROW_NUMBER() OVER (PARTITION BY user ORDER BY states) AS rn
, COUNT(*) OVER (PARTITION BY user) AS cnt
FROM yourtable
) WITH DATA
UNIQUE PRIMARY INDEX(user, rn)
ON COMMIT PRESERVE ROWS;
WITH RECURSIVE cte (user, list, rn) AS (
SELECT
user
, CAST(states AS VARCHAR(1000)) -- maximum size based on maximum number of rows * length of states
, rn
FROM vt
WHERE rn = cnt -- start with last states row
UNION ALL
SELECT
vt.user
, cte.list || ',' || vt.states
, vt.rn
FROM vt
JOIN cte ON vt.user = cte.user AND vt.rn = cte.rn - 1 -- append a row that is rn-1 of your rows for a given user
)
SELECT user, list
FROM cte
WHERE rn = 1; -- going from last to first, in this condition there should be entire list
This solution isn't perfect - it forces the engine to store intermediate results in a temporary area during query processing. You may encounter a "No more spool space" error.

Related

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem: they're both remittances for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
(screenshot of the two duplicate rows omitted)
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without more information on the structure of your database (especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them), it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
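If you do want to avoid that extra column, here is a sketch of the same query with the outer select enumerating columns instead of using *; the column list below is only a guess based on the columns mentioned in the question (remittance_uuid, billed_amount, active), so extend it with whatever remittance actually contains:
select
    t.remittance_uuid,
    t.billed_amount,
    t.active
    -- list any further remittance columns you need here
from (
    select
        r.*,
        row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
    from
        remittance r
        join claims_group2 cg2
            on r.remittance_uuid = cg2.remittance_uuid
    where
        r.active = 0
        and r.billed_amount > 0
        and cg2.active = 0
        and cg2.billed_amount > 0
) t
where t.rn = 1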
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Teradata SQL Reverse Parent Child Hierarchy

I know how to build a hierarchy starting with the root node (i.e. where parent_id is null or something like that), but I can't find anything on how to build a hierarchy upward from the final child/edge node. I'd like to start with a child and build all the way back up to the top. Assume I don't know how many levels, or who the parent is, and we'll have to use SQL to figure it out.
Here is my base table:
old_entity_key,new_entity_key
1,2
2,3
3,4
4,5
5,6
Desired output:
new_entity_key,path
2,1/2
3,1/2/3
4,1/2/3/4
5,1/2/3/4/5
6,1/2/3/4/5/6
This is also acceptable:
new_entity_key,path
2,2/1
3,3/2/1
4,4/3/2/1
5,5/4/3/2/1
6,6/5/4/3/2/1
Here is the CTE I've started with:
with recursive history as (
select
old_entity_key,
new_entity_key,
cast(old_entity_key||'/'||new_entity_key as varchar(1000)) as path
from table
where new_entity_key not in (select old_entity_key from table)
and cast(start_time as date) between current_date - interval '3' day and current_date
union all
select
c.old_entity_key,
c.new_entity_key,
p.new_entity_key||'/'||c.path
from history c
join table p on p.new_entity_key = c.old_entity_key
)
select new_entity_key, old_entity_key, substr(path, 1, instr(path, '/') - 1) as original_entity_key, path
from history s;
The problem with the above query is that it runs forever. I think I've created an infinite loop. I've also tried using the below where filter in the bottom query of the union to try to find the root node, but Teradata gives me an error:
where p.new_entity_key in (select old_entity_key from table)
Any help would be greatly appreciated.
You'll need some sort of counter, and I think your join logic in your CTE doesn't make sense. I threw together a very simple volatile table example:
create volatile table tb
(old_entity_key char(1),
new_entity_key char(1),
rn integer)
on commit preserve rows;
insert into tb values ('1','2',1);
insert into tb values ('2','3',2);
insert into tb values ('3','4',3);
Now we can put together a recursive CTE:
with recursive history as (
select
old_entity_key,
new_entity_key,
cast(old_entity_key||'/'||new_entity_key as varchar(1000)) as path,
rn
from tb t
where
rn = 1
union all
select
t.old_entity_key,
t.new_entity_key,
h.path || '/' || t.new_entity_key,
t.rn
from
tb t
join history h
on t.rn = h.rn + 1
)
select * from history order by rn
The important things here are:
Limit your first pass (accomplished here by rn=1).
The second pass needs to pick up the "next" row, based on the previous row (t.rn = h.rn + 1)
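If your real table doesn't already carry a sequence column like rn, you can generate one while building the volatile table, since OLAP functions are only disallowed inside the recursive part, not in the table definition. A rough sketch, assuming the base table is named entity_map (substitute your real name) and that ordering by old_entity_key reflects the chain order, as it does in your sample data; with several independent chains you would add a PARTITION BY on whatever identifies the chain:
CREATE VOLATILE TABLE tb AS (
SELECT
old_entity_key
, new_entity_key
, ROW_NUMBER() OVER (ORDER BY old_entity_key) AS rn -- sequence within the chain
FROM entity_map
) WITH DATA
PRIMARY INDEX (rn)
ON COMMIT PRESERVE ROWS;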

SQL - Returning CTE with Top 1

I am trying to return a set of results and decided to try my luck with a CTE. The first table, "Vendor", has a list of references; the second table, "TVView", has ticket numbers that were created using a reference from the "Vendor" table. There may be one or more tickets using the same ticket number depending on the state of that ticket, and I want to return the last entry for each ticket found in "TVView" that matches a selected reference from "Vendor". Also, the "TVView" table has a seed field that is incremented.
I got this to return the right number of entries (meaning duplicate tickets show only once), but I cannot figure out how to add an additional layer to go back through, select the last entry for each ticket, and return some other fields. I can figure out how to sum, which is actually easy, but I really need the top 1 of each ticket entry in "TVView", regardless of whether it's a duplicate, while returning all references from "Vendor". It would be nice if SQL supported "Last".
How do you do that?
Here is what I have done so far:
with cteTickets as (
Select s.Mth2, c.Ticket, c.PyRt from Vendor s
Inner join
TVView c on c.Mth1 = s.Mth1 and c.Vendor = s.Vendor
)
Select Mth2, Ticket, PayRt from cteTickets
Where cteTickets.Vendor >='20'
and cteTickets.Vendor <='40'
and cteTickets.Mth2 ='8/15/2014'
Group by cteTickets.Ticket
order by cteTickets.Ticket
Several RDBMSs that support Common Table Expressions (CTEs) also support analytic functions, including the very useful ROW_NUMBER(), so the following should work in Oracle, T-SQL (MSSQL/Sybase), DB2, and PostgreSQL.
In both suggestions the intention is to return just the most recent entry for each ticket found in TVView. This is done using ROW_NUMBER(), partitioned by Ticket, which instructs row_number to recommence numbering for each change of the Ticket value. The subsequent ORDER BY Mth1 DESC determines which record within each partition is assigned 1; here it will be the most recent date.
The output of row_number() needs to be referenced by a column alias, so using it in a CTE or derived table permits selection of just the most recent records by RN = 1, which you will see used in both options below:
-- using a CTE
WITH
TVLatest
AS (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
)
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN TVLatest l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v.Vendor <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
l.Ticket
;
-- using a derived table instead
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
) AS l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v.Vendor <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
l.Ticket
;
Please note: "SELECT *" is used here as a convenience/abbreviation because the full column details are unknown. The queries above may not run without the field list correctly specified (e.g. 'as is' they would fail in Oracle).
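For example, here is a sketch of the derived-table version with the field list spelled out; the column names are assumptions taken from the question (Mth1, Vendor, Ticket, PyRt - the question uses both PyRt and PayRt, so adjust to whichever is real):
SELECT
  v.Mth2
, l.Ticket
, l.PyRt
FROM Vendor v
INNER JOIN (
    SELECT
      Mth1
    , Vendor
    , Ticket
    , PyRt
    , ROW_NUMBER() OVER (PARTITION BY Ticket
                         ORDER BY Mth1 DESC) AS RN
    FROM TVView
    ) AS l ON v.Mth1 = l.Mth1
          AND v.Vendor = l.Vendor
          AND l.RN = 1
WHERE v.Vendor >= '20'
  AND v.Vendor <= '40'
  AND v.Mth2 = '2014-08-15'
ORDER BY
  l.Ticket
;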

How do I SELECT TOP X where it INCLUDES records based on a criteria?

I have a table, with multiple columns, including a column named "PolicyNumber"
Here's a sample:
PolicyNumber
NYH1111
NYD2222
SCH3333
SCS4444
LUH5555
LUS6666
ALH7777
ALW8888
VAH9999
AKH0000
...
NYH1010
NYD2318
There are 1,000+ records in this table, and the records contain several of each policy number type. For example, multiple policies start with "NYH" and multiple policies start with "VAH."
The possible policy types are here:
NYH
NYD
SCH
SCS
LUH
LUS
ALH
ALW
VAH
AKH
How do I do a SELECT TOP 300 where it'll INCLUDE at least one of each Policy Type? Remember, a policy type is the first 3 letters of a policy number.
Is this even possible? The purpose of this is that I have to grab 300 records from production to dump into a test environment and I need to include at least 1 of each policy. After I have at least one of each, it can be completely randomized.
You can try this:
In this solution, first there is NEWID(), with which you can generate a random order on each run.
To achieve the "at least one from each policy" goal, I made the AtLeastOne column. This selects the first row from the randomized CTE table for each unique three-letter prefix. If the current Policy equals this first selected value, it gets 1, else 0. So with this logic, you can select a randomized first row for each unique three-letter prefix.
Note: You can put this logic directly into the ORDER BY part too if you only need the Policy field. (I wrote the example this way to make the logic behind it visible.)
In the last step you just have to order by AtLeastOne DESC and then by the random ID.
WITH CTE_Policy
AS
(
SELECT newid() as ID, Policy
FROM Code
)
SELECT TOP 300
Policy,
CASE WHEN Policy = (SELECT TOP 1 Policy FROM CTE_Policy c
WHERE SUBSTRING(c.Policy,1,3) =
SUBSTRING(CTE_Policy.Policy,1,3)
ORDER BY c.ID)
THEN 1 ELSE 0 END AS AtLeastOne
FROM CTE_Policy
ORDER BY AtLeastOne DESC, ID
Off the top of my head, you could do:
SELECT TOP 30 Column1, Column2, Column3, PolicyNumber
FROM YourTable
WHERE PolicyNumber LIKE 'NYH%'
UNION
SELECT TOP 30 Column1, Column2, Column3, PolicyNumber
FROM YourTable
WHERE PolicyNumber LIKE 'NYD%'
UNION
/* ... remaining eight policy types go here */
ORDER BY PolicyNumber /* Or whatever sort order you want */
It will give you 30 of each type every time, however, instead of X of one type and Y of another.
One quick way that comes to mind: the query below will grab just 1 record per policy type.
SELECT TOP 300 *
FROM ( SELECT *, rank1 = ROW_NUMBER() OVER (PARTITION BY LEFT(PolicyNo, 3) ORDER BY GETDATE())
       FROM MyTable
     ) AS t1
WHERE t1.rank1 = 1
Try this for SQL Server 2005+:
;WITH CTE AS
(
SELECT LEFT(PolicyNumber, 3) PolicyType, PolicyNumber,
ROW_NUMBER() OVER(PARTITION BY LEFT(PolicyNumber, 3) ORDER BY NEWID()) RN
FROM YourTable
)
SELECT TOP 300 PolicyNumber
FROM CTE
ORDER BY RN, NEWID()
Borrowed from ClearLogic (+1). Please give the check to ClearLogic if this works.
The problem with WHERE t1.rank1 = 1 is that it will stop short of 300 rows if there are fewer than 300 unique policy types.
SELECT TOP 300 t1.PolicyNo
FROM ( SELECT PolicyNo, rank1= ROW_NUMBER ()
OVER (PARTITION BY LEFT (PolicyNo,3) ORDER BY NEWID())
FROM MyTable
) AS t1
order by t1.rank1, t1.PolicyNo

How can I select adjacent rows to an arbitrary row (in sql or postgresql)?

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table, (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key > (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus less susceptible to mathematical locating (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key > (34-5)
and active=1 order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example because there are many inactive items), I'm no longer getting the closest 5 above and 5 below; instead I'm getting the closest 1 below and the closest 9 above.
There's a lot of ways to do it if you run two queries with a programming language, but here's one way to do it in one SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
Here's another way to do it with analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause. So instead you need to use subqueries or CTE's. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.
For a similar query I use analytic functions without a CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
Such a query runs faster for me than the CTE with a join (answer from Scott Bailey), but it is of course less elegant.
You could do this utilizing row_number(), available as of PostgreSQL 8.4. This may not be the correct syntax (I'm not familiar with PostgreSQL), but hopefully the idea is illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE r BETWEEN 25 AND 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.
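To deal with the point raised earlier (you don't know in advance which row number the center row ends up with), you can join the numbered set back to itself. A sketch only, assuming PostgreSQL, a table named my_table, and a target primary_key of 34:
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
    FROM my_table
    WHERE active = 1
)
SELECT n.*
FROM numbered n
JOIN numbered target
  ON target.primary_key = 34  -- the arbitrary center row
 AND n.r BETWEEN target.r - 5 AND target.r + 5
ORDER BY n.r;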
If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.