PostgreSQL - How to select the first consecutive group having same value - sql

I have a table with pk and dept columns:
pk dept
-------
27 A
29 A
30 B
31 B
33 A
I need to select the first consecutive group, that is the first successive set of rows all having the same dept value when the table is ordered by pk, i.e. the expected result is:
pk dept
-------
27 A
29 A
In my example there are 3 consecutive groups (AA, BB and A). The size of a group is unlimited (can be more than 2).

The following query should do what you want (I named your table tx):
SELECT *
FROM tx t1
WHERE NOT EXISTS (
SELECT *
FROM tx t2
WHERE t2.dept <> t1.dept
AND t2.pk < t1.pk);
The idea is to look for tuples such that no tuple with a lesser pk and a different department exists.
The first two A tuples are kept;
The B tuples are dropped because of the first two A tuples;
The last A tuple is dropped because of the B tuples.
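To check this reasoning against the sample data, here is a minimal sketch that runs the same query from Python. It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in for PostgreSQL (an assumption made for illustration); the table name tx is the one used in the answer above.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (pk INTEGER PRIMARY KEY, dept TEXT)")
conn.executemany(
    "INSERT INTO tx (pk, dept) VALUES (?, ?)",
    [(27, "A"), (29, "A"), (30, "B"), (31, "B"), (33, "A")],
)

# Keep a row only if no earlier row (smaller pk) has a different dept.
first_group = conn.execute(
    """
    SELECT pk, dept
    FROM tx t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM tx t2
        WHERE t2.dept <> t1.dept
          AND t2.pk < t1.pk)
    ORDER BY pk
    """
).fetchall()
conn.close()

print(first_group)  # [(27, 'A'), (29, 'A')]
```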

Don't forget about stored functions. Unlike window functions, a function lets you avoid reading the whole table, because the loop can stop at the first row whose value differs:
--drop function if exists foo();
--drop table if exists t;
create table t(pk int, dep text);
insert into t values(27,'A'),(29,'A'),(30,'B'),(31,'B'),(33,'A');
create function foo() returns setof t language plpgsql as $$
declare
r t;
p t;
begin
for r in (select * from t order by pk) loop
if p is null then
p := r;
end if;
exit when p.dep is distinct from r.dep;
return next r;
end loop;
return;
end $$;
select * from foo();
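The early-exit idea of the function above can be sketched outside the database as well. The following is only an illustration in Python, assuming the rows arrive already ordered by pk as (pk, dep) tuples; the helper name first_consecutive_group is hypothetical.

```python
from itertools import takewhile


def first_consecutive_group(rows):
    """Yield rows until the dep value changes; rows must already be ordered by pk."""
    iterator = iter(rows)
    first = next(iterator, None)
    if first is None:
        return  # empty input: nothing to yield
    yield first
    # Stop consuming the input as soon as a row with a different dep shows up.
    yield from takewhile(lambda row: row[1] == first[1], iterator)


rows = [(27, "A"), (29, "A"), (30, "B"), (31, "B"), (33, "A")]
result = list(first_consecutive_group(rows))
print(result)  # [(27, 'A'), (29, 'A')]
```

Because the input is consumed lazily, rows after the first change of value are never inspected, which mirrors the exit when statement in the PL/pgSQL loop.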

It's a little complex and the performance is probably poor, but you can achieve what you want with the code below. There are four operations:
The first one obtains the base order and base group ids for the next operation.
The second one does the trick of computing a unique group id for each group.
The third one spreads the unique group id over the rows of each group.
Finally, we compute a consecutive group id for each group to allow the discretionary selection of groups, so we only have to filter by the group number we want to obtain.
Note that NVL is Oracle syntax; the PostgreSQL equivalent is COALESCE.
Hope this helps.
SELECT fourthOperation.pk,
fourthOperation.dept
FROM (SELECT thirdOperation.pk,
thirdOperation.dept,
DENSE_RANK() OVER (ORDER BY thirdOperation.spreadedIdGroup) denseIdGroup
FROM (SELECT secondOperation.*,
NVL(idGroup, LAG(secondOperation.idGroup IGNORE NULLS) OVER (ORDER BY secondOperation.numRow)) spreadedIdGroup
FROM (SELECT firstOperation.*,
CASE WHEN LAG(firstOperation.rankRow) OVER (ORDER BY firstOperation.numRow) = firstOperation.rankRow
THEN NULL
ELSE firstOperation.numRow
END idGroup
FROM (SELECT yourTable.*,
ROW_NUMBER() OVER (ORDER BY PK) AS numRow,
DENSE_RANK() OVER (ORDER BY DEPT) AS rankRow
FROM ABORRAR yourTable) firstOperation) secondOperation ) thirdOperation) fourthOperation
WHERE fourthOperation.denseIdGroup = 1

I'm not sure if I understand your question, but for the first pk of each dept you can try this:
select min(pk) as pk,
dept
from your_table
group by dept

Related

SQL ranking over two tables

I have two tables with user rankings.
Table rankingA and rankingB.
Each table has the columns:
user_id
points
group_id
The higher the points, the higher the rank of the user/group.
Now I am trying to get the group ranking, i.e. which rank my group has.
So far i have this SQL:
select sum(ra.points) as rapoints, sum(rb.points) as rbpoints from public.rankinga ra
LEFT JOIN public.rankingb rb ON ra.group_id=rb.group_id and ra.user_id=rb.user_id where
ra.group_id=200;
It returns the points from rankinga and rankingb for group 200.
How can I get the rankings of the group? I tried it with:
row_number() OVER (ORDER BY sum(rb.points) DESC) AS rankb
but got a wrong result.
My expected result for group_id 200 is:
rapoints,rbpoints,rarank, rbrank
420, 10, 3, same points as group_id 300 so rbrank 2 or 3
How can I get this?
Setup
CREATE TABLE rankinga
(
user_id bigint,
group_id bigint,
points integer
);
CREATE TABLE rankingb
(
user_id bigint,
group_id bigint,
points integer
);
insert into public.rankinga (user_id,group_id,points) values (1,100,120),(2,100,300), (3,100,20),(4,200,300),(5,200,120),(6,300,600);
insert into public.rankingb (user_id,group_id,points) values (1,100,5),(2,100,3),(3,100,10),(4,200,2),(5,200,8),(6,300,10);
I think you want to do this with union all, aggregation, and the window function. Joining the tables is likely to miss rows (if users are in one table but not the other) or over count (if you join on group). So this may do what you want:
select group_id, sum(rapoints) as rapoints, sum(rbpoints) as rbpoints,
sum(rapoints) + sum(rbpoints) as points,
dense_rank() over (order by sum(rapoints) + sum(rbpoints) desc) as ranking
from ((select ra.group_id, sum(ra.points) as rapoints, 0 as rbpoints
from public.rankinga ra
group by ra.group_id
) union all
(select rb.group_id, 0, sum(rb.points) as rbpoints
from public.rankingb rb
group by rb.group_id
)
) ab
group by group_id;
If you want to select just one group, then put this in a subquery (or CTE) and then select the group.
EDIT:
If you want just the result for one group, you still need to calculate the values for all groups. So:
select ab.*
from (<above query here>) ab
where group_id = 200;
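The point that every group has to be aggregated and ranked before one group can be picked out can be illustrated with a small Python sketch over the setup data. Note the assumption: this computes a separate dense rank per table, as in the expected result of the question, rather than the combined ranking of the query above. The helper names are hypothetical.

```python
from collections import defaultdict

# (user_id, group_id, points) rows copied from the setup in the question.
ranking_a = [(1, 100, 120), (2, 100, 300), (3, 100, 20),
             (4, 200, 300), (5, 200, 120), (6, 300, 600)]
ranking_b = [(1, 100, 5), (2, 100, 3), (3, 100, 10),
             (4, 200, 2), (5, 200, 8), (6, 300, 10)]


def group_totals(rows):
    """Sum points per group_id."""
    totals = defaultdict(int)
    for _user_id, group_id, points in rows:
        totals[group_id] += points
    return dict(totals)


def dense_ranks(totals):
    """Map group_id -> dense rank, highest total first; ties share a rank."""
    distinct = sorted(set(totals.values()), reverse=True)
    rank_of = {value: position for position, value in enumerate(distinct, start=1)}
    return {group_id: rank_of[total] for group_id, total in totals.items()}


# Totals and ranks are computed for ALL groups first ...
totals_a, totals_b = group_totals(ranking_a), group_totals(ranking_b)
ranks_a, ranks_b = dense_ranks(totals_a), dense_ranks(totals_b)


def summary(group_id):
    """... and only then is a single group selected."""
    return (totals_a.get(group_id, 0), totals_b.get(group_id, 0),
            ranks_a.get(group_id), ranks_b.get(group_id))


print(summary(200))  # (420, 10, 3, 2)
```

With dense ranking, groups 200 and 300 share rank 2 in rankingb because both total 10 points.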

How can I get a random cartesian product in PostgreSQL?

I have two tables, custassets and tags. To generate some test data I'd like to do an INSERT INTO a many-to-many table with a SELECT that gets random rows from each (so that a random primary key from one table is paired with a random primary key from the second). To my surprise this isn't as easy as I first thought, so I'm persisting with this to teach myself.
Here's my first attempt. I select 10 custassets and 3 tags, but both are the same in each case. I'd be fine with the first table being fixed, but I'd like to randomise the tags assigned.
SELECT
custassets_rand.id custassets_id,
tags_rand.id tags_rand_id
FROM
(
SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 10
) AS custassets_rand
,
(
SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 3
) AS tags_rand
This produces:
custassets_id | tags_rand_id
---------------+--------------
9849 | 3322 }
9849 | 4871 } this pattern of tag PKs is repeated
9849 | 5188 }
12145 | 3322
12145 | 4871
12145 | 5188
17837 | 3322
17837 | 4871
17837 | 5188
....
I then tried the following approach: doing the second RANDOM() call in the SELECT column list. However this one was worse, as it chooses a single tag PK and sticks with it.
SELECT
custassets_rand.id custassets_id,
(SELECT id FROM tags WHERE defunct = false ORDER BY RANDOM() LIMIT 1) tags_rand_id
FROM
(
SELECT id FROM custassets WHERE defunct = false ORDER BY RANDOM() LIMIT 30
) AS custassets_rand
Result:
custassets_id | tags_rand_id
---------------+--------------
16694 | 1537
14204 | 1537
23823 | 1537
34799 | 1537
36388 | 1537
....
This would be easy in a scripting language, and I'm sure can be done quite easily with a stored procedure or temporary table. But can I do it just with an INSERT INTO SELECT?
I did think of choosing integer primary keys using a random function, but unfortunately the primary keys for both tables have gaps in the increment sequences (and so an empty row might be chosen in each table). That would have been fine otherwise!
Note that what you are looking for is not a Cartesian product, which would produce n*m rows; rather a random 1:1 association, which produces GREATEST(n,m) rows.
To produce truly random combinations, it's enough to randomize rn for the bigger set:
SELECT c_id, t_id
FROM (
SELECT id AS c_id, row_number() OVER (ORDER BY random()) AS rn
FROM custassets
) x
JOIN (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);
If arbitrary combinations are good enough, this is faster (especially for big tables):
SELECT c_id, t_id
FROM (SELECT id AS c_id, row_number() OVER () AS rn FROM custassets) x
JOIN (SELECT id AS t_id, row_number() OVER () AS rn FROM tags) y USING (rn);
If the number of rows in the two tables does not match and you do not want to lose rows from the bigger table, use the modulo operator % to join rows from the smaller table multiple times:
SELECT c_id, t_id
FROM (
SELECT id AS c_id, row_number() OVER () AS rn
FROM custassets -- table with fewer rows
) x
JOIN (
SELECT id AS t_id, (row_number() OVER () % small.ct) + 1 AS rn
FROM tags
, (SELECT count(*) AS ct FROM custassets) AS small
) y USING (rn);
Window functions were added with PostgreSQL 8.4.
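The pairing idea (shuffle the bigger set, then reuse rows of the smaller set in a cycle, as the modulo join does) can be sketched in Python. This is only an illustration: the function name random_pairs and the sample ids are hypothetical, and the id lists stand in for the primary keys of the two tables.

```python
import random
from itertools import cycle


def random_pairs(bigger_ids, smaller_ids, rng=random):
    """Pair every id of the bigger set with an id of the smaller set.

    The bigger set is shuffled; the smaller set is reused cyclically,
    which plays the role of the modulo join.
    """
    if not smaller_ids:
        return []
    shuffled = list(bigger_ids)
    rng.shuffle(shuffled)
    # zip stops when the shuffled (bigger) list is exhausted.
    return list(zip(shuffled, cycle(smaller_ids)))


# Hypothetical sample ids with gaps, as in the question.
custasset_ids = [9849, 12145, 17837, 16694, 14204, 23823, 34799]
tag_ids = [3322, 4871, 5188]
pairs = random_pairs(custasset_ids, tag_ids, random.Random(42))
for custasset_id, tag_id in pairs:
    print(custasset_id, tag_id)
```

Each id of the bigger set appears exactly once, so no rows from it are lost, while ids of the smaller set repeat as often as needed.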
WITH a_ttl AS (
SELECT count(*) AS ttl FROM custassets c),
b_ttl AS (
SELECT count(*) AS ttl FROM tags),
rows AS (
SELECT gs.*
FROM generate_series(1,
(SELECT max(ttl) AS ttl FROM
(SELECT ttl FROM a_ttl UNION SELECT ttl FROM b_ttl) AS m))
AS gs(row)),
tab_a_rand AS (
SELECT custassets_id, row_number() OVER (order by random()) as row
FROM custassets),
tab_b_rand AS (
SELECT id, row_number() OVER (order by random()) as row
FROM tags)
SELECT a.custassets_id, b.id
FROM rows r
JOIN a_ttl ON 1=1 JOIN b_ttl ON 1=1
LEFT JOIN tab_a_rand a ON a.row = (r.row % a_ttl.ttl)+1
LEFT JOIN tab_b_rand b ON b.row = (r.row % b_ttl.ttl)+1
ORDER BY 1,2;
Here is a different approach to pick a single combination from 2 tables at random, assuming two tables a and b, both with primary key id. The tables needn't be of the same size, and the second row is chosen independently of the first, which might not be that important for test data.
SELECT * FROM a, b
WHERE a.id = (
SELECT id
FROM a
OFFSET (
SELECT random () * (SELECT count(*) FROM a)
)
LIMIT 1)
AND b.id = (
SELECT id
FROM b
OFFSET (
SELECT random () * (SELECT count(*) FROM b)
)
LIMIT 1);
Tested with two tables, one with 7000 rows and one with 100k rows; the result came back immediately. For more than one result, you have to call the query repeatedly: increasing the LIMIT and changing x.id = to x.id IN would produce (aA, aB, bA, bB) result patterns.
It bugs me that after all these years of relational databases, there doesn't seem to be very good cross database ways of doing things like this. The MSDN article http://msdn.microsoft.com/en-us/library/cc441928.aspx seems to have some interesting ideas, but of course that's not PostgreSQL. And even then, their solution requires a single pass, when I'd think it ought to be able to be done without the scan.
I can imagine a few ways that might work without a pass (in selection), but it would involve creating another table that maps your table's primary keys to random numbers (or to linear sequences that you later randomly select, which in some ways may actually be better), and of course, that may have issues as well.
I realize this is probably a non-useful comment, I just felt I needed to rant a bit.
If you just want to get a random set of rows from each side, use a pseudo-random number generator. I would use something like:
select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
from a
) a cross join
(select b.*, row_number() over (order by NULL) as rownum
from b
) b
where a.rownum <= 30 and b.rownum <= 30
This is doing a Cartesian product, which returns 900 rows assuming a and b each have at least 30 rows.
However, I interpreted your question as getting random combinations. Once again, I'd go for the pseudo-random approach.
select *
from (select a.*, row_number() over (order by NULL) as rownum -- NULL may not work, "(SELECT NULL)" works in MSSQL
from a
) a cross join
(select b.*, row_number() over (order by NULL) as rownum
from b
) b
where mod(a.rownum*107+b.rownum*257+17, 101) < <some value>
This lets you get combinations among arbitrary rows.
Just a plain Cartesian product ON random() appears to work reasonably well. As simple as that...
-- Cartesian product
-- EXPLAIN ANALYZE
INSERT INTO dirgraph(point_from,point_to,costs)
SELECT p1.the_point , p2.the_point, (1000*random() ) +1
FROM allpoints p1
JOIN allpoints p2 ON random() < 0.002
;
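What the join condition random() < 0.002 does can be sketched in Python: every pair of the Cartesian product is kept independently with a fixed probability. This is an illustration only; the function name sample_product and the sample points are hypothetical.

```python
import random
from itertools import product


def sample_product(left, right, probability, rng=random):
    """Keep each pair of the Cartesian product independently with the given probability."""
    return [pair for pair in product(left, right) if rng.random() < probability]


points = list(range(1, 51))  # hypothetical stand-in for allpoints
edges = sample_product(points, points, 0.002, random.Random(7))
print(len(edges), "of", len(points) ** 2, "possible pairs kept")
```

The number of pairs kept varies from run to run around probability * n * m, and the sketch still visits every pair, so the cost grows with the size of the full product.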

how to Update duplicate rows without primary key

I have a table with no primary key.
name age sex
a 12 m
b 61 m
c 23 f
d 12 m
a 12 m
a 12 m
f 14 f
I have exactly 3 identical rows: row 1, row 5 and row 6.
I want to update row 5 without affecting row 1 and row 6.
Please help me out: how can I achieve this?
Your real problem is that you have no row numbers, and no way to distinguish identical rows. If your data is still in the order in which it was inserted, you have simply been lucky so far. SQL Server gives no guarantees of row ordering and could randomize the order without notice. To preserve your ordering you can add an identity column to the table.
ALTER TABLE TableWithNoPrimaryKey
ADD RowNum int IDENTITY(1,1) PRIMARY KEY
It's not possible to use the ROW_NUMBER function here, as the duplicates can be spread over thousands of records.
ROW_NUMBER is used with an OVER clause to get the row number, and since the data is stored without a clustered index, it's not possible to delete a specific row that way.
The only option is to add an identity or unique column to the table and then delete the specific record; if you don't want to keep the new index or column, you can drop that column from the table afterwards.
There is a way to do what you wish. It is not recommended though.
;WITH cte AS
(
SELECT *, RowNum = ROW_NUMBER() OVER (ORDER BY GETDATE())
FROM [table]
)
UPDATE cte
SET age = age + 1
WHERE (RowNum = 5)
AND (name = 'a' AND age = 12 AND sex = 'm');
I have assumed your table name is T1 and updated one of the rows. I don't want to order the result set but still want to generate row numbers, so I used a dummy subquery, SELECT 0. The query below works in Oracle and IBM Netezza. In SQL Server you can try rowid if it exists, or any equivalent of rowid.
UPDATE T1 SET name = 'a1', age = 21, sex = 'm'
FROM (SELECT name, age, sex, rowid,
row_number() over(partition by name, age, sex ORDER BY(SELECT 0)) as rn
FROM T1)A
WHERE T1.name = A.name
AND T1.age = A.age
AND T1.sex = A.sex
AND T1.rowid = A.rowid
AND A.rn = 1;

Get row count including column values in sql server

I need to get the row count of a query, and also get the query's columns in one single query. The count should be a part of the result's columns (It should be the same for all rows, since it's the total).
for example, if I do this:
select count(1) from table
I can have the total number of rows.
If I do this:
select a,b,c from table
I'll get the column's values for the query.
What I need is to get the count and the column values in one query, in a very efficient way.
For example:
select Count(1), a,b,c from table
with no group by, since I want the total.
The only way I've found is to do a temp table (using variables), insert the query's result, then count, then returning the join of both. But if the result gets thousands of records, that wouldn't be very efficient.
Any ideas?
@Jim H is almost right, but chooses the wrong ranking function:
create table #T (ID int)
insert into #T (ID)
select 1 union all
select 2 union all
select 3
select ID,COUNT(*) OVER (PARTITION BY 1) as RowCnt from #T
drop table #T
Results:
ID RowCnt
1 3
2 3
3 3
Partitioning by a constant makes it count over the whole resultset.
Using CROSS JOIN:
SELECT a.*, b.numRows
FROM YOUR_TABLE a
CROSS JOIN (SELECT COUNT(*) AS numRows
FROM YOUR_TABLE) b
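As a quick check of the CROSS JOIN approach, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite is only a stand-in for SQL Server here (an assumption for illustration), and the table contents are made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (a INTEGER, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO your_table (a, b, c) VALUES (?, ?, ?)",
    [(1, "x", "p"), (2, "y", "q"), (3, "z", "r")],
)

# The one-row subquery with the total is attached to every row of the table.
rows_with_count = conn.execute(
    """
    SELECT t.a, t.b, t.c, n.num_rows
    FROM your_table AS t
    CROSS JOIN (SELECT COUNT(*) AS num_rows FROM your_table) AS n
    ORDER BY t.a
    """
).fetchall()
conn.close()

print(rows_with_count)  # [(1, 'x', 'p', 3), (2, 'y', 'q', 3), (3, 'z', 'r', 3)]
```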
Look at the Ranking functions of SQL Server.
SELECT ROW_NUMBER() OVER (ORDER BY a) AS 'RowNumber', a, b, c
FROM table;
You could do it like this:
SELECT x.total, a, b, c
FROM
table
JOIN (SELECT total = COUNT(*) FROM table) AS x ON 1=1
which will return the total number of records in the first column, followed by fields a,b & c

Simulate row number using numbers table

How would I simulate row number for a table using a numbers table WITHOUT using ROW_NUMBER() function.
sample table:
create table accounts
(
account_num VARCHAR(25),
primary key (account_num)
)
The numbers table has 1mil rows.
In case you mean when it's not available (e.g. in MySQL), try something like this:
select @rownum := @rownum + 1 rownum,
t.*
from (select * from table t order by col) t,
(select @rownum := 0) r
It'll yield the same as:
select row_number() over (order by col)
from table
order by col
A Numbers table does not help you here because you have no means to associate a value in your table with a number in the Numbers table. However, if you are asking whether it is possible to create a sequence without using ROW_NUMBER() or a variable, you can do it like so:
Select A1.Account_Num, Count( A2.Account_Num ) + 1 As Num
From Accounts As A1
Left Join Accounts As A2
On A2.Account_Num < A1.Account_Num
Group By A1.Account_Num
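The self-join numbers each account by counting how many accounts sort before it. Here is a minimal sketch that runs the query from Python, using an in-memory SQLite database as a stand-in (an assumption for illustration) and made-up account numbers; an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_num VARCHAR(25) PRIMARY KEY)")
conn.executemany(
    "INSERT INTO accounts (account_num) VALUES (?)",
    [("C300",), ("A100",), ("B200",)],
)

# Row number = (count of accounts with a smaller account_num) + 1.
numbered = conn.execute(
    """
    SELECT a1.account_num, COUNT(a2.account_num) + 1 AS num
    FROM accounts AS a1
    LEFT JOIN accounts AS a2
           ON a2.account_num < a1.account_num
    GROUP BY a1.account_num
    ORDER BY num
    """
).fetchall()
conn.close()

print(numbered)  # [('A100', 1), ('B200', 2), ('C300', 3)]
```

Since every row is compared with all rows that sort before it, the work grows roughly quadratically with the table size, so this is best suited to small tables.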