I have a table transactions with columns account_id, device_id and card_id. I'd like to create a new column token that acts as a user ID. This new column will have the same value across transactions that share a value in at least one of the 3 columns mentioned above.
The idea is to create a user identifier that is a mix of account, device and card. This new ID is "stronger" than any of the individual ones, in the sense that, for instance, if someone changes their account or device but not their card, their transactions will still be linked to the previous ones by this new token ID.
What would be a simple way to do this in Redshift SQL?
Example:
transaction_id  account_id  device_id  card_id  token
1               1           1          1        1
2               1           2          2        1
3               2           2          3        1
4               3           3          3        1
5               4           4          4        2
6               5           4          5        2
7               6           8          2        1
8               7           7          7        3
In the above example:
T1 has token 1 since it is the first transaction.
T2 has token 1 since linked to T1 by account.
T3 has token 1 since linked to T2 by device.
T4 has token 1 since linked to T3 by card.
T5 has token 2 since none of the 3 fields shares a value with any previous transaction (new user).
T6 has token 2 since linked to T5 by device.
T7 has token 1 since linked to T2 by card.
T8 has token 3 since none of the 3 fields shares a value with any previous transaction (new user).
UPDATE
I managed to get a solution based on 2 steps:
For each transaction, I update the token column with the transaction_id of the oldest transaction linked to it by any of the 3 IDs (account, device or card). This step doesn't solve the problem on its own because, for instance, T4 should have token 1, but it gets token 2 since the oldest transaction linked to T4 is T2, not T1.
Once each transaction points at its oldest linked transaction, I update all the transactions (only those whose token is not their own transaction ID, just for performance since the rest don't need it), replacing each token with the token of the linked transaction. By doing this incrementally, one row at a time and in separate SQL transactions, I get the expected output.
Here is the code:
create table sandbox.test(
transaction_id integer,
account_id integer,
device_id integer,
card_id integer
);
insert into sandbox.test values
(1, 1, 1, 1),
(2, 1, 2, 2),
(3, 2, 2, 3),
(4, 3, 3, 3),
(5, 4, 4, 4),
(6, 5, 4, 5),
(7, 6, 8, 2),
(8, 7, 7, 7),
(9, 3, 9, 9),
(10, 10, 9, 9);
alter table sandbox.test
add column token int
default NULL;
-- Step 1
update sandbox.test
set token = A.token
from (
with links as (
select transaction_id,
first_value(transaction_id)
over(partition by account_id order by transaction_id rows unbounded preceding) as account_link,
first_value(transaction_id)
over(partition by device_id order by transaction_id rows unbounded preceding) device_link,
first_value(transaction_id)
over(partition by card_id order by transaction_id rows unbounded preceding) card_link
from sandbox.test
)
select l1.transaction_id, least(l2.account_link, l2.device_link, l2.card_link) token
from links l2
inner join links l1
on l2.transaction_id=least(l1.account_link, l1.device_link, l1.card_link)
) A
where sandbox.test.transaction_id=A.transaction_id;
-- Step 2
CREATE OR REPLACE PROCEDURE refresh_transactions() AS $$
DECLARE
transactions RECORD;
BEGIN
FOR transactions IN SELECT transaction_id FROM sandbox.test ORDER BY transaction_id LOOP
RAISE INFO '%', transactions.transaction_id;
EXECUTE 'update
sandbox.test
set token = (select A.token from sandbox.test A where
A.transaction_id = sandbox.test.token)
where transaction_id='||transactions.transaction_id||';';
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql;
CALL refresh_transactions();
However, the second part of this solution (the refresh_transactions() call) takes too long to run on my DB with millions of transactions, so it's impossible to implement. Maybe there is a way to do it more efficiently?
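One idea I haven't validated yet: keep step 2, but make each pass set-based and simply repeat a whole-table update until no token differs from its parent's token, so long chains collapse in a few passes instead of one statement per row. This is only a sketch; it reuses the correlated-subquery pattern from step 2 and assumes Redshift stored procedures accept SELECT ... INTO and this kind of LOOP/EXIT:
CREATE OR REPLACE PROCEDURE collapse_tokens() AS $$
DECLARE
    remaining INTEGER;
BEGIN
    LOOP
        -- One set-based pass: point every token at its parent's token
        -- (same correlated subquery as step 2, but for all rows at once).
        UPDATE sandbox.test
        SET token = (SELECT A.token FROM sandbox.test A
                     WHERE A.transaction_id = sandbox.test.token)
        WHERE token <> (SELECT A.token FROM sandbox.test A
                        WHERE A.transaction_id = sandbox.test.token);
        -- Count rows whose token still differs from their parent's token;
        -- once there are none, every chain has collapsed to its root.
        SELECT COUNT(*) INTO remaining
        FROM sandbox.test t
        JOIN sandbox.test p ON p.transaction_id = t.token
        WHERE t.token <> p.token;
        EXIT WHEN remaining = 0;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
CALL collapse_tokens();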
Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a SQL query to produce the output attached with this question, along with sample data.
There are two tables: one with distinct IDs (PK) and their current flag,
and another with an Active ID (FK to the PK from the first table) and an Inactive ID (FK to the PK from the first table).
The final output should return two columns: the first containing all distinct IDs from the first table, and the second containing the Active ID from the second table.
Below is the SQL:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
[current]
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This SQL partially works. It doesn't work where the ID was once active and then becomes inactive.
Please note:
the active ID returned should be the most recent active ID
an ID which doesn't have any active ID should either be null or the ID itself
for IDs where current = 0, the active ID should be the ID that is current in tb_ID
IDs may get interchanged. For example, with two IDs 6 and 7, when 6 is active 7 is inactive and vice versa; the only way to know the most recent active state is by the update date
The attached sample might make this easier to understand.
Looks like I might have to use a recursive CTE to achieve the results. Can someone please help?
Thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
declare #tb_id table (id bigint, [current] bit);
declare #tb_merges table (active_id bigint, inactive_id bigint, update_dt datetime2);
insert #tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert #tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
#tb_merges M
inner join #tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
#tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6
Context:
I have two tables: markettypewagerlimitgroups (mtwlg) and stakedistributionindicators (sdi). When an mtwlg is created, 2 rows are created in the sdi table which are linked to the mtwlg - each row with the same values except two: the id, and another field (let's call it column X) which must contain 0 for one row and 1 for the other.
There was a bug in our codebase which prevented this from happening automatically, so any mtwlgs created while that bug was present do not have the related sdis, causing NPEs in various places.
To fix this, a patch needs to be written to loop through the mtwlg table and, for each ID, search the sdi table for the 2 related rows. If both rows are present, do nothing; if there is only 1 row, check whether column X is a 0 or a 1, and insert a row with the other value; if neither row is present, insert them both. This needs to be done for every mtwlg, and a unique ID needs to be inserted too.
Pseudocode:
For each market type wager limit group ID
Check if there are 2 rows with that id in the stake distributions table, 1 where column X = 0 and one where column X = 1
if none
create 2 rows in the stake distributions table with unique id's; 1 for each X value
if one
create the missing row in the stake distributions table with a unique id
if 2
do nothing
If it helps at all, the patch will be applied using Liquibase.
Does anyone have any advice or thoughts as to whether and how this could be written in SQL / a Liquibase patch?
Thanks in advance, let me know of any other information you need.
EDIT:
I've actually just been advised to do this using PL/SQL; do you have any thoughts/suggestions regarding this?
Thanks again.
Oooooh, an excellent job for MERGE.
Here's your pseudo code again:
For each market type wager limit group ID
Check if there are 2 rows with that id in the stake distributions table,
1 where column X = 0 and one where column X = 1
if none
create 2 rows in the stake distributions table with unique id's;
1 for each X value
if one
create the missing row in the stake distributions table with a unique id
if 2
do nothing
Here's the MERGE variant (still pseudo-code'ish as I don't know how your data really looks):
MERGE INTO stake_distributions d
USING (
SELECT limit_group_id, 0 AS x
FROM market_type_wagers
UNION ALL
SELECT limit_group_id, 1 AS x
FROM market_type_wagers
) t
ON (
d.limit_group_id = t.limit_group_id AND d.x = t.x
)
WHEN NOT MATCHED THEN INSERT (d.limit_group_id, d.x)
VALUES (t.limit_group_id, t.x);
No loops, no PL/SQL, no conditional statements, just plain beautiful SQL.
Nice alternative suggested by Boneist in the comments uses a CROSS JOIN rather than UNION ALL in the USING clause, which is likely to perform better (unverified):
MERGE INTO stake_distributions d
USING (
SELECT w.limit_group_id, x.x
FROM market_type_wagers w
CROSS JOIN (
SELECT 0 AS x FROM DUAL
UNION ALL
SELECT 1 AS x FROM DUAL
) x
) t
ON (
d.limit_group_id = t.limit_group_id AND d.x = t.x
)
WHEN NOT MATCHED THEN INSERT (d.limit_group_id, d.x)
VALUES (t.limit_group_id, t.x);
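If you also need the unique id mentioned in the question, either variant can fill it from a sequence in the insert clause. Here is a sketch of the first variant, assuming a sequence named stake_distributions_seq (substitute whatever your schema actually uses to generate the ids):
MERGE INTO stake_distributions d
USING (
  SELECT limit_group_id, 0 AS x
  FROM market_type_wagers
  UNION ALL
  SELECT limit_group_id, 1 AS x
  FROM market_type_wagers
) t
ON (
  d.limit_group_id = t.limit_group_id AND d.x = t.x
)
-- The id column is populated from the (assumed) sequence for each inserted row.
WHEN NOT MATCHED THEN INSERT (d.id, d.limit_group_id, d.x)
VALUES (stake_distributions_seq.NEXTVAL, t.limit_group_id, t.x);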
Answer: you don't. There is absolutely no need to loop through anything - you can do it in a single insert. All you need to do is identify the rows that are missing, and then you just need to add them in.
Here is an example:
drop table t1;
drop table t2;
drop sequence t2_seq;
create table t1 (cola number,
colb number,
colc number);
create table t2 (id number,
cola number,
colb number,
colc number,
colx number);
create sequence t2_seq
START WITH 1
INCREMENT BY 1
MAXVALUE 99999999
MINVALUE 1
NOCYCLE
CACHE 20
NOORDER;
insert into t1 values (1, 10, 100);
insert into t2 values (t2_seq.nextval, 1, 10, 100, 0);
insert into t2 values (t2_seq.nextval, 1, 10, 100, 1);
insert into t1 values (2, 20, 200);
insert into t2 values (t2_seq.nextval, 2, 20, 200, 0);
insert into t1 values (3, 30, 300);
insert into t2 values (t2_seq.nextval, 3, 30, 300, 1);
insert into t1 values (4, 40, 400);
commit;
insert into t2 (id, cola, colb, colc, colx)
with dummy as (select 1 id from dual union all
select 0 id from dual)
select t2_seq.nextval,
t1.cola,
t1.colb,
t1.colc,
d.id
from t1
cross join dummy d
left outer join t2 on (t2.cola = t1.cola and d.id = t2.colx)
where t2.id is null;
commit;
select * from t2
order by t2.cola;
ID COLA COLB COLC COLX
---------- ---------- ---------- ---------- ----------
1 1 10 100 0
2 1 10 100 1
3 2 20 200 0
5 2 20 200 1
7 3 30 300 0
4 3 30 300 1
6 4 40 400 0
8 4 40 400 1
If the processing logic is too gnarly to be encapsulated in a single SQL statement, you may need to resort to cursor for loops and row types, which basically allow you to do things like the following:
DECLARE
r_mtwlg markettypewagerlimitgroups%ROWTYPE;
BEGIN
FOR r_mtwlg IN (
SELECT mtwlg.*
FROM markettypewagerlimitgroups mtwlg
)
LOOP
-- do stuff here
-- refer to elements of the current row like this
DBMS_OUTPUT.PUT_LINE(r_mtwlg.id);
END LOOP;
END;
/
You can obviously nest another loop inside this one that hits the stakedistributionindicators table, but I'll leave that as an exercise for you. You could also left join to stakedistributionindicators a couple of times in this first cursor so that you only return rows that don't already have both an x=1 and an x=0 row; you can probably work that bit out for yourself, but a rough sketch follows.
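This is only a sketch, with guessed names since I don't know your actual schema (I'm assuming sdi.mtwlg_id points back at the group, sdi.x is your column X, and sdi_seq is whatever sequence generates the unique ids; a real version would also copy whatever other columns the new rows need):
BEGIN
  FOR r IN (
    SELECT mtwlg.id,
           sdi0.id AS sdi_x0,   -- NULL when the X = 0 row is missing
           sdi1.id AS sdi_x1    -- NULL when the X = 1 row is missing
    FROM   markettypewagerlimitgroups mtwlg
    LEFT JOIN stakedistributionindicators sdi0
           ON sdi0.mtwlg_id = mtwlg.id AND sdi0.x = 0
    LEFT JOIN stakedistributionindicators sdi1
           ON sdi1.mtwlg_id = mtwlg.id AND sdi1.x = 1
    WHERE  sdi0.id IS NULL OR sdi1.id IS NULL
  )
  LOOP
    IF r.sdi_x0 IS NULL THEN
      INSERT INTO stakedistributionindicators (id, mtwlg_id, x)
      VALUES (sdi_seq.NEXTVAL, r.id, 0);
    END IF;
    IF r.sdi_x1 IS NULL THEN
      INSERT INTO stakedistributionindicators (id, mtwlg_id, x)
      VALUES (sdi_seq.NEXTVAL, r.id, 1);
    END IF;
  END LOOP;
  COMMIT;
END;
/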
If you would rather write your logic in Java vs. PL/SQL, Liquibase allows you to create custom changes. The custom change points to a Java class you write that can do whatever logic you need. A simple example can be found here
How do I execute an additional query (UPDATE) on each row from a SELECT?
I have to take the amount from each row of the SELECT and send it to the user's balance table.
Example:
status 0 - open
status 1 - processed
status 2 - closed
My select statement:
select id, user_id, sell_amount, sell_currency_id
from (select id, user_id, sell_amount, sell_currency_id,
sum(sell_amount)
over (order by buy_amount/sell_amount ASC, date_add ASC) as cumsell
from market t
where (status = 0 or status = 1) and type = 0
) t
where 0 <= cumsell and 7 > cumsell - sell_amount;
SELECT result from the market table:
id;user_id;amount;status
4;1;1.00000000;0
6;2;2.60000000;0
5;3;2.00000000;0
7;4;4.00000000;0
We take a total amount of 7 and send it to the user balance table.
id;user_id;amount;status
4;1;0.00000000;2 -- took 1, sum 1, status changed to 2
6;2;0.00000000;2 -- took 2.6, sum=3.6, status changed to 2
5;3;0.00000000;2 -- took 2, sum 5.6, status changed to 2
7;4;2.60000000;1 -- took 1.4, sum 7.0, status changed to 1 (because 2.6 is left to close)
User's balance table
user_id;balance
5;7 -- added 7 from previous operation
Postgres version 9.3
The general principle is to use UPDATE ... FROM over a subquery. Your example is too hard to turn into useful CREATE TABLE and SELECT statements, so I've made up a quick dummy dataset:
CREATE TABLE balances (user_id integer, balance numeric);
INSERT INTO balances (user_id, balance) VALUES (1,0), (2, 2.1), (3, 99);
CREATE TABLE transactions (user_id integer, amount numeric, applied boolean default 'f');
INSERT INTO transactions (user_id, amount) VALUES (1, 22), (1, 10), (2, -10), (4, 1000000);
If you wanted to apply the transactions to the balances you would do something like:
BEGIN;
LOCK TABLE balances IN EXCLUSIVE MODE;
LOCK TABLE transactions IN EXCLUSIVE MODE;
UPDATE balances SET balance = balance + t.amount
FROM (
SELECT t2.user_id, sum(t2.amount) AS amount
FROM transactions t2
GROUP BY t2.user_id
) t
WHERE balances.user_id = t.user_id;
UPDATE transactions
SET applied = true
FROM balances b
WHERE transactions.user_id = b.user_id;
COMMIT;
The LOCK statements are important for correctness in the presence of concurrent inserts/updates.
The second UPDATE marks the transactions as applied; you might not need something like that in your design.
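To hint at how the same pattern might map onto the market table from the question: the cumulative-sum subquery from the original SELECT can be reused as the FROM of the UPDATE. This is a rough, untested sketch; it hard-codes the 7 from the example and guesses that sell_amount should be reduced to the unfilled remainder:
UPDATE market m
SET sell_amount = GREATEST(t.cumsell - 7, 0),                 -- 0 for fully consumed rows, remainder for the partial one
    status      = CASE WHEN t.cumsell <= 7 THEN 2 ELSE 1 END  -- 2 = closed, 1 = processed
FROM (
    SELECT id,
           sell_amount,
           SUM(sell_amount) OVER (ORDER BY buy_amount / sell_amount ASC, date_add ASC) AS cumsell
    FROM market
    WHERE status IN (0, 1) AND type = 0
) t
WHERE m.id = t.id
  AND t.cumsell - t.sell_amount < 7;   -- only rows at least partially consumed by the 7
Crediting the 7 to the user's balance would then be a separate statement in the same transaction, along the lines of the balances example above.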
I am trying to run the 2 queries below on the same table and hoping to get the results in 2 different columns.
Query 1: select ID as M from table where field = 1
returns:
1
2
3
Query 2: select ID as N from table where field = 2
returns:
4
5
6
My goal is to get
Column1 - Column2
-----------------
1 4
2 5
3 6
Any suggestions? I am using SQL Server 2008 R2
Thanks
There has to be a primary key to foreign key relationship to JOIN data between two tables.
That is the idea behind relational algebra and normalization. Otherwise, the correlation of the data is meaningless.
http://en.wikipedia.org/wiki/Database_normalization
A CROSS JOIN will give you all possibilities: (1,4), (1,5), (1,6) ... (3,6). I do not think that is what you want.
You can always use a ROW_NUMBER() OVER () function to generate a surrogate key in both tables. Order the data the way you want inside the OVER () clause. However, this is still not in any Normal form.
In short. Why do this?
Quick test database. Stores products from sporting goods and home goods using non-normal form.
The results of the SELECT do not mean anything.
-- Just play
use tempdb;
go
-- Drop table
if object_id('abnormal_form') > 0
drop table abnormal_form
go
-- Create table
create table abnormal_form
(
Id int,
Category int,
Name varchar(50)
);
-- Load store products
insert into abnormal_form values
(1, 1, 'Bike'),
(2, 1, 'Bat'),
(3, 1, 'Ball'),
(4, 2, 'Pot'),
(5, 2, 'Pan'),
(6, 2, 'Spoon');
-- Sporting Goods
select * from abnormal_form where Category = 1
-- Home Goods
select * from abnormal_form where Category = 2
-- Does not mean anything to me
select Id1, Id2 from
(select ROW_NUMBER () OVER (ORDER BY ID) AS Rid1, Id as Id1
from abnormal_form where Category = 1) as s
join
(select ROW_NUMBER () OVER (ORDER BY ID) AS Rid2, Id as Id2
from abnormal_form where Category = 2) as h
on s.Rid1 = h.Rid2
We definitely need more information from the user.
This one's a bit of a mess, and there's probably some far superior way of doing this but we just need the information for some reports we're working on.
So, we have a bunch of projects; each project has a bunch of tasks and each task has a document type ID associated with it. A project can belong to one or more workgroups.
We want to analyze projects that have at least one task of doc type x, and then see how many workgroups each has. I can do that with:
select distinct T.PROJECTID,
(select COUNT(*) from TPM_PROJECTWORKGROUPS where PROJECTID=T.PROJECTID) as NumWorkgroups
from TPM_TASK T
where T.DOCUMENTTYPEID=17
Now, we want to see the average number of workgroups across these projects. So I can do:
select AVG(NumWorkgroups) FROM (
select distinct T.PROJECTID,
(select COUNT(*) from TPM_PROJECTWORKGROUPS where PROJECTID=T.PROJECTID) as NumWorkgroups
from TPM_TASK T
where T.DOCUMENTTYPEID=17
)
However, we want to run this same query across all the document types (there are about 200 of them). I can't find a way to do this without copying and pasting the query 200 times. I've tried:
select DOCUMENTTYPEID,
(select AVG(NumWorkgroups) FROM (
select distinct T.PROJECTID,
(select COUNT(*) from TPM_PROJECTWORKGROUPS where PROJECTID=T.PROJECTID) as NumWorkgroups
from TPM_TASK T
where T.DOCUMENTTYPEID=DT.DOCUMENTTYPEID
))
from TPM_DOCUMENTTYPE DT
However, I get the error:
ORA-00904: "TPM_DOCUMENTTYPE"."DOCUMENTTYPEID": invalid identifier
I believe this is because DT is out of scope more than one level down in a nested query. Is there a better way to write this query?
Update for Justin:
Here's a sample schema:
create table Test_Projects (
id number primary key
);
create table Test_Tasks (
id number primary key,
project number,
doctype number
);
create table Test_Workgroups (
id number primary key,
workgroup number,
project number
);
With some sample data:
insert into Test_Projects VALUES (1); --Create projects 1 and 2
insert into Test_Projects VALUES (2);
insert into Test_Tasks VALUES (1, 1, 5); --Project 1 has two tasks, doc types 5 and 6
insert into Test_Tasks VALUES (2, 1, 6);
insert into Test_Tasks VALUES (3, 2, 6); --Project 2 has one task, doc type 6
insert into Test_Workgroups VALUES (1, 1, 1); --Project 1 belongs to workgroups 1 and 2
insert into Test_Workgroups VALUES (2, 2, 1);
insert into Test_Workgroups VALUES (3, 2, 2); --Project 2 belongs to workgroup 2
We need to know the average number of workgroups that a project with a task of type x belongs to.
For example, doc type 5 appears only in project 1, which has 2 workgroups, so the average is 2. Doc type 6 appears in 2 projects (1 and 2): project 1 has 2 workgroups and project 2 has one, so the average is 1.5.
We need to list all doc types and the average number of workgroups in each.
I'd expect this query to return:
DOCTYPE AverageWorkgroups
------- -----------------
5 2
6 1.5
Thanks for the sample data. That makes it much clearer.
I believe this does what you want (I'm including the calculations for the number of projects and the number of workgroups in the output as well, just because that made my testing easier):
select t.doctype,
       count(distinct p.id) numProjects,
       count(*) numWorkgroups,
       count(*) / count(distinct p.id) avgNumWorkgroups
from test_projects p,
     test_tasks t,
     test_workgroups w
where p.id = t.project
  and p.id = w.project
group by t.doctype;
DOCTYPE NUMPROJECTS NUMWORKGROUPS AVGNUMWORKGROUPS
---------- ----------- ------------- ----------------
6 2 3 1.5
5 1 2 2