Retrieving consecutive rows (and the counts) with the same values - sql

I've got a table with almost 10 million rows and would like to run this query on just the latest million or hundred thousand or so.
Here's a SQL fiddle with example data and input/output: http://sqlfiddle.com/#!9/340a41
Is this even possible?
CREATE TABLE object (`id` int, `name` varchar(7), `value` int);
INSERT INTO object (`id`, `name`, `value`)
VALUES
(1, 'a', 1),
(2, 'b', 2),
(3, 'c', 100),
(4, 'a', 1),
(5, 'b', 2),
(6, 'c', 200),
(7, 'a', 2),
(8, 'b', 2),
(9, 'c', 300),
(10, 'a', 2),
(11, 'b', 2),
(12, 'a', 2),
(13, 'b', 2),
(14, 'c', 400)
;
-- Want:
-- name, max(id), count(id)
-- 'a', 4, 2
-- 'b', 14, 5
-- 'a', 12, 3

If you want the latest rows and the id is assigned sequentially, then you can do this using limit or top. In SQL Server:
select top 100000 o.*
from object o
order by id desc;
In MySQL, you would use limit:
select o.*
from object o
order by id desc
limit 100000

select name, count(id) cnt, max(id) max_id, max(value) max_v
from
(select
top 1000000 -- SQL Server only
id, name, value
from myTable
order by id desc
limit 1000000 -- MySQL only
) t
group by name
Remove the TOP or LIMIT line that doesn't match your server; note that the derived table needs an alias (t here), and in MySQL the LIMIT goes after the ORDER BY.
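For the consecutive-runs output the question actually asks for, here is a gaps-and-islands sketch of my own (not from the answers above; it assumes MySQL 8+ or SQL Server window functions, and that "consecutive" means consecutive occurrences of the same value within each name):
select name, max(id) as max_id, count(*) as cnt
from
(select id, name, value,
        row_number() over (partition by name order by id)
      - row_number() over (partition by name, value order by id) as grp
 from object) runs
group by name, value, grp
-- having count(*) > 1  -- uncomment to drop single-row runs, as in the desired output
order by min(id);
The difference of the two row numbers is constant within each run of equal values per name, so grouping on it collapses each run to one row.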

Related

How to concat all values associated with a key?

I have the following schema:
CREATE TABLE table1
(
user int,
phoneType varchar(1), -- from ('A','B','C','D','E'); a user can have any number of any type
uniquePhoneID varchar(10), -- unique string identifying the phone
day_id int -- date; a record does not necessarily exist for every seen user + phoneType on every day; represented as a number in this example
);
INSERT INTO table1
VALUES (1, 'A', 'xyz', 1),
(1, 'A', 'abc', 1),
(1, 'B', 'def', 2),
(1, 'A', 'xyz', 2),
(1, 'C', 'hij', 4),
(1, 'A', 'xyz', 5),
(2, 'C', 'w', 9),
(2, 'D', 'z', 10),
(2, 'A', 'p', 10),
(2, 'E', 'c', 11),
(3, 'A', 'r', 19),
(3, 'B', 'q', 19),
(3, 'B', 'q', 20),
(3, 'B', 'f', 20),
(3, 'B', 'y', 21);
A given (user, uniquePhoneID, day_id) combination will show up at most once, but not necessarily at all on any given day.
I am looking to concatenate, for each user in the table, the phoneTypes of their distinct phones in alphabetical order, so the result is as follows:
1 | AABC
2 | ACDE
3 | ABBB
I have tried a few different ways of doing this but I am unsure how to get the answer I am looking for.
I think user is a reserved word, so you will have to resolve that. Otherwise, I think something like this will work for you:
select user, string_agg (phonetype, '' order by phonetype)
from table1
group by user
-- EDIT 4/21/2022 --
Aah, okay. I did not glean that from the original question.
What if you used the distinct on the original table before the aggregation?
select userid, string_agg (phonetype, '' order by phonetype)
from (select distinct userid, phonetype, uniquephoneid from table1) x
group by userid
I got these results from this version:
1 AABC
2 ACDE
3 ABBB
If that logic still doesn't work, can you alter the sample data to find an example where it fails?
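If the target database is MySQL rather than Postgres or SQL Server 2017+ (an assumption; the question doesn't name one), the same dedupe-then-aggregate shape works with GROUP_CONCAT, keeping the userid rename suggested above:
select userid,
       group_concat(phonetype order by phonetype separator '') as phonetypes
from (select distinct userid, phonetype, uniquephoneid from table1) x
group by userid;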

Can I improve this query for use in large tables?

How can I improve this query for use in large tables?
I use a table ('DataValues') to store a collection of values ('Value') per visit ('Visit_id'), i.e. it records certain values for each visit.
I use a table ('MatchItems') to store dynamic match sets ('MatchSet') of values ('Value'); a set can contain any number of values. The table also has an IsNeg field to indicate that the match requires a value to be absent from the visit's collection.
This allows me to dynamically match visits that conform to certain criteria, such as: must contain values A, B and C but not D; or must contain C and B but not A.
i.e. (Value = A and Value = B and Value = C and Value <> D)
or (Value = C and Value = B and Value <> A)
I have a query that delivers a reasonable solution (fiddle):
CREATE TABLE DataValues (
id NUMBER(5) CONSTRAINT DataValues_pk PRIMARY KEY,
Visit_id Number(5) ,
Value varchar(5)
);
INSERT INTO DataValues VALUES (1, 1, 'M');
INSERT INTO DataValues VALUES (2, 1, 'I');
INSERT INTO DataValues VALUES (3, 1, 'C');
INSERT INTO DataValues VALUES (4, 1, 'K');
INSERT INTO DataValues VALUES (5, 1, 'E');
INSERT INTO DataValues VALUES (6, 1, 'Y');
INSERT INTO DataValues VALUES (7, 2, 'M');
INSERT INTO DataValues VALUES (8, 2, 'O');
INSERT INTO DataValues VALUES (9, 2, 'U');
INSERT INTO DataValues VALUES (10, 2, 'S');
INSERT INTO DataValues VALUES (11, 2, 'E');
INSERT INTO DataValues VALUES (12, 3, 'C');
INSERT INTO DataValues VALUES (13, 3, 'A');
INSERT INTO DataValues VALUES (14, 3, 'T');
INSERT INTO DataValues VALUES (15, 4, 'S');
INSERT INTO DataValues VALUES (16, 4, 'A');
INSERT INTO DataValues VALUES (17, 4, 'T');
INSERT INTO DataValues VALUES (18, 5, 'M');
INSERT INTO DataValues VALUES (19, 5, 'A');
INSERT INTO DataValues VALUES (20, 5, 'T');
CREATE TABLE MatchItems (
id NUMBER(5) CONSTRAINT MatchItems_pk PRIMARY KEY,
MatchSet Number(5),
Value VARCHAR(5),
IsNeg NUMBER(1) NOT NULL CHECK (IsNeg in (0,1))
);
INSERT INTO MatchItems VALUES (1, 1, 'M', 0);
INSERT INTO MatchItems VALUES (2, 1, 'I', 0);
INSERT INTO MatchItems VALUES (3, 1, 'C', 0);
INSERT INTO MatchItems VALUES (4, 1, 'K', 0);
INSERT INTO MatchItems VALUES (5, 1, 'E', 0);
INSERT INTO MatchItems VALUES (6, 1, 'Y', 0);
INSERT INTO MatchItems VALUES (7, 2, 'C', 0);
INSERT INTO MatchItems VALUES (8, 2, 'A', 0);
INSERT INTO MatchItems VALUES (9, 3, 'A', 0);
INSERT INTO MatchItems VALUES (10, 3, 'T', 0);
INSERT INTO MatchItems VALUES (11, 4, 'S', 1);
INSERT INTO MatchItems VALUES (12, 4, 'A', 0);
INSERT INTO MatchItems VALUES (13, 4, 'K', 1);
INSERT INTO MatchItems VALUES (14, 5, 'A', 0);
INSERT INTO MatchItems VALUES (15, 5, 'T', 0);
SELECT
MatchItems.MatchSet,
DataValues.Visit_id,
GpMatchItems.Count TgtCount,
Count(MatchItems.Id),
sum(MatchItems.IsNeg)
FROM DataValues
LEFT JOIN MatchItems ON MatchItems.Value = DataValues.Value
--AND MatchItems.MatchSet = 4
LEFT JOIN (SELECT
MatchItems.MatchSet,
count(*) Count
FROM MatchItems
WHERE
MatchItems.IsNeg = 0
GROUP BY
MatchItems.MatchSet) GpMatchItems ON GpMatchItems.MatchSet = MatchItems.MatchSet
GROUP BY
MatchItems.MatchSet,
DataValues.Visit_id,
GpMatchItems.Count
HAVING
Count(MatchItems.Id) = GpMatchItems.Count
AND sum(MatchItems.IsNeg) = 0
How can I improve the performance of this query where the DataValues table contains 100m records, and MatchItems may include a collection of 50 sets each of 2 - 20 values?
You can try this version using Analytic functions and see if it performs any better. This query removes the subquery GpMatchItems that you are joining with.
SELECT DISTINCT matchset,
visit_id,
tgtcount,
match_visit_count,
isneg_sum
FROM (SELECT MatchItems.MatchSet,
DataValues.Visit_id,
COUNT (DISTINCT CASE MatchItems.IsNeg WHEN 0 THEN MatchItems.id ELSE NULL END)
OVER (PARTITION BY MatchItems.MatchSet)
AS tgtcount,
COUNT (*) OVER (PARTITION BY MatchItems.MatchSet, DataValues.Visit_id)
AS match_visit_count,
SUM (MatchItems.IsNeg) OVER (PARTITION BY MatchItems.MatchSet, DataValues.Visit_id)
AS isneg_sum
FROM DataValues LEFT JOIN MatchItems ON MatchItems.VALUE = DataValues.VALUE)
WHERE tgtcount = match_visit_count AND isneg_sum = 0;
I have adjusted EJ's suggestion to include a LEFT JOIN to collect the tgtCount to identify the total number of good matches required in each MatchSet:
SELECT DISTINCT matchset,
                visit_id,
                tgtcount,
                match_visit_count,
                isneg_sum
FROM (SELECT MatchItems.MatchSet,
             DataValues.Visit_id,
             GpMatchItems.Count AS tgtcount,
             COUNT (*) OVER (PARTITION BY MatchItems.MatchSet, DataValues.Visit_id)
                AS match_visit_count,
             SUM (MatchItems.IsNeg) OVER (PARTITION BY MatchItems.MatchSet, DataValues.Visit_id)
                AS isneg_sum
      FROM DataValues
      LEFT JOIN MatchItems ON MatchItems.VALUE = DataValues.VALUE
      LEFT JOIN (SELECT MatchItems.MatchSet,
                        count(*) Count
                 FROM MatchItems
                 WHERE MatchItems.IsNeg = 0
                 GROUP BY MatchItems.MatchSet) GpMatchItems
           ON GpMatchItems.MatchSet = MatchItems.MatchSet)
WHERE tgtcount = match_visit_count
  AND isneg_sum = 0;
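A further hedged suggestion, not from the original answers: at 100m rows the join on Value dominates, so covering indexes that let the optimizer satisfy the join and grouping from the index alone are worth testing (the index names here are illustrative):
CREATE INDEX datavalues_value_ix ON DataValues (Value, Visit_id);
CREATE INDEX matchitems_value_ix ON MatchItems (Value, MatchSet, IsNeg, Id);
Measure against the real data distribution before committing; on Oracle, EXPLAIN PLAN will show whether the indexes are actually picked up.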

SQL query to get all records with sum less than 10000

I have data in a table which looks like the below data set.
I want to get a group of those items whose price sum is less than 10000.
CREATE TABLE Table1
(slno int, item varchar(10), price int);
INSERT INTO Table1
(slno, item, price)
VALUES
(1, 'item1', 1000),
(2, 'item2', 2000),
(3, 'item3', 3000),
(4, 'item4', 4000),
(5, 'item5', 5000),
(6, 'item6', 6000),
(7, 'item7', 10000),
(8, 'item8', 2000),
(9, 'item9', 8000),
(10, 'item10', 2500),
(11, 'item11', 9000),
(12, 'item12', 1000),
(13, 'item13', 2500),
(14, 'item14', 2500),
(15, 'item15', 2500);
My SQL query looks like this:
SELECT slno, item, price
FROM
(
SELECT slno, item, price,
(
SELECT SUM(price)
FROM Table1
WHERE slno <= t.slno
) total
FROM Table1 t
) q
WHERE total <= 10000
ORDER BY item
It's not giving the expected result, though; it's giving only one set of records:
(1, 'item1', 1000),
(2, 'item2', 2000),
(3, 'item3', 3000),
(4, 'item4', 4000)
whereas I need it to give me something like this:
1ST SET
(1, 'item1', 1000),
(2, 'item2', 2000),
(3, 'item3', 3000),
(4, 'item4', 4000)
2ND SET
(7, 'item7', 10000),
@GordonLinoff:
This type of operation requires a recursive CTE. In this case, you can assign a group to each row using the logic below. The following assumes that slno has no gaps, as in your example data:
with cte as (
select slno, item, price, 1 as grp, price as running_price
from table1
where slno = 1
union all
select t1.slno, t1.item, t1.price,
(case when t1.price + cte.running_price > 10000 then grp + 1 else grp end),
(case when t1.price + cte.running_price > 10000 then t1.price else cte.running_price + t1.price end)
from cte join
table1 t1
on t1.slno = cte.slno + 1
)
select *
from cte
order by slno;
Here is a db<>fiddle.
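One portability note: the query above is written in the SQL Server form; MySQL 8+ and Postgres spell the same thing with an explicit RECURSIVE keyword, otherwise unchanged:
with recursive cte as (
      select slno, item, price, 1 as grp, price as running_price
      from table1
      where slno = 1
      union all
      select t1.slno, t1.item, t1.price,
             (case when t1.price + cte.running_price > 10000 then grp + 1 else grp end),
             (case when t1.price + cte.running_price > 10000 then t1.price else cte.running_price + t1.price end)
      from cte join
           table1 t1
           on t1.slno = cte.slno + 1
)
select *
from cte
order by slno;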

Select duplicate persons with duplicate memberships

SQL Fiddle with schema and my initial attempt.
CREATE TABLE person
([firstname] varchar(10), [surname] varchar(10), [dob] date, [personid] int);
INSERT INTO person
([firstname], [surname], [dob] ,[personid])
VALUES
('Alice', 'AA', '1/1/1990', 1),
('Alice', 'AA', '1/1/1990', 2),
('Bob' , 'BB', '1/1/1990', 3),
('Carol', 'CC', '1/1/1990', 4),
('Alice', 'AA', '1/1/1990', 5),
('Kate' , 'KK', '1/1/1990', 6),
('Kate' , 'KK', '1/1/1990', 7)
;
CREATE TABLE person_membership
([personid] int, [personstatus] varchar(1), [memberid] int);
INSERT INTO person_membership
([personid], [personstatus], [memberid])
VALUES
(1, 'A', 10),
(2, 'A', 20),
(3, 'A', 30),
(3, 'A', 40),
(4, 'A', 50),
(4, 'A', 60),
(5, 'T', 70),
(6, 'A', 80),
(7, 'A', 90);
CREATE TABLE membership
([membershipid] int, [memstatus] varchar(1));
INSERT INTO membership
([membershipid], [memstatus])
VALUES
(10, 'A'),
(20, 'A'),
(30, 'A'),
(40, 'A'),
(50, 'T'),
(60, 'A'),
(70, 'A'),
(80, 'A'),
(90, 'T');
There are three tables (as per the fiddle above). The person table contains duplicates: the same people are entered more than once, and for the purpose of this exercise we assume that the combination of first name, surname and DoB is enough to uniquely identify a person.
I am trying to build a query which will show duplicated people (first name + surname + DoB) with two or more active entries in the person table (person_membership.personstatus = 'A') AND two or more active memberships (membership.memstatus = 'A').
Using the example from the SQL Fiddle, the result of the query should be just Alice (two active person IDs, two active membership IDs).
I think I'm making progress with the following effort, but it looks rather cumbersome and I need to remove Kate from the final result - she doesn't have a duplicate membership.
SELECT q.firstname, q.surname, q.dob, p1.personid, m.membershipid
FROM
(SELECT
p.firstname,p.surname,p.dob, count(*) as cnt
FROM
person p
GROUP BY
p.firstname,p.surname,p.dob
HAVING COUNT(1) > 1) as q
INNER JOIN person p1 ON q.firstname=p1.firstname AND q.surname=p1.surname AND q.dob=p1.dob
INNER JOIN person_membership pm ON p1.personid=pm.personid
INNER JOIN membership m ON pm.memberid = m.membershipid
WHERE pm.personstatus = 'A' AND m.memstatus = 'A'
Since you are using SQL Server, window functions will be handy for this scenario. The following will give you the expected output.
SELECT firstname,surname,dob,personid,memberid
from(
SELECT firstname,surname,dob,p.personid,memberid
,Rank() over(partition by p.firstname,p.surname,p.dob order by p.personid) rnasc
,Rank() over(partition by p.firstname,p.surname,p.dob order by p.personid desc) rndesc
FROM [StagingGRG].[dbo].[person] p
INNER JOIN person_membership pm ON p.personid=pm.personid
INNER JOIN membership m ON pm.memberid = m.membershipid
where personstatus='A' and memstatus='A')a
where a.rnasc+rndesc>2
You have to add GROUP BY and HAVING clauses to return duplicate items only:
SELECT
person.firstname, person.surname, person.dob
FROM
person
INNER JOIN person_membership ON person.personid = person_membership.personid
INNER JOIN membership ON person_membership.memberid = membership.membershipid
WHERE
person_membership.personstatus = 'A' AND membership.memstatus = 'A'
GROUP BY
person.firstname, person.surname, person.dob
HAVING COUNT(1) > 1
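A hedged alternative sketch of my own, combining both requirements: count distinct active person rows and distinct active memberships separately, so a single person with two active memberships (like Bob in the sample data) does not slip through a plain row count:
SELECT p.firstname, p.surname, p.dob
FROM person p
INNER JOIN person_membership pm ON p.personid = pm.personid
INNER JOIN membership m ON pm.memberid = m.membershipid
WHERE pm.personstatus = 'A' AND m.memstatus = 'A'
GROUP BY p.firstname, p.surname, p.dob
HAVING COUNT(DISTINCT p.personid) > 1      -- two or more active person entries
   AND COUNT(DISTINCT m.membershipid) > 1; -- two or more active memberships
On the sample data this returns just Alice.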

Delete all rows with specified parameters but newest in Postgresql

I have a hash-table:
CREATE TABLE hash_table ( hash_id bigserial,
user_name varchar(80),
hash varchar(80),
exp_time bigint,
PRIMARY KEY (hash_id));
INSERT INTO hash_table (hash_id, user_name, exp_time) VALUES
(1, 'one', 10),
(2, 'two', 20),
(3, 'three', 31),
(4, 'three', 32),
(5, 'three', 33),
(6, 'three', 33),
(7, 'three', 35),
(8, 'three', 36),
(9, 'two', 40),
(10, 'two', 50),
(11, 'one', 60),
(12, 'three', 70);
exp_time is the expiration time of the hash: exp_time = now() + delta_time when the row is created.
I need a result:
(1, 'one', 10),
(2, 'two', 20),
(7, 'three', 35),
(8, 'three', 36),
(9, 'two', 40),
(10, 'two', 50),
(11, 'one', 60),
(12, 'three', 70);
It contains a lot of user_name/hash pairs; a user_name may be duplicated many times.
How do I delete all rows but the (several, e.g. 10) newest for a specified user_name?
I found this solution (for MySQL, but I hope it works here too), but it keeps only the newest rows overall and removes all other user_names:
DELETE FROM `hash_table`
WHERE user_name NOT IN (
SELECT user_name
FROM (
SELECT user_name
FROM `hash_table`
ORDER BY exp_time DESC
LIMIT 10 -- keep this many records
)
);
This will keep the 3 newest records for each user_name:
DELETE FROM hash_table h
USING
(
SELECT
user_name,
exp_time,
row_number() OVER (PARTITION BY user_name ORDER BY exp_time DESC) AS r
FROM
hash_table
) AS s
WHERE
h.user_name = s.user_name
AND h.exp_time<s.exp_time
AND s.r=3
You can use row_number to number each user's rows by exp_time in descending order, and then delete the rows that fall outside each user's 10 newest entries.
Fiddle with sample data
delete from hash_table as h
using
(
SELECT user_name, exp_time,
row_number() over(partition by user_name order by exp_time desc) as rn
FROM hash_table
) as t
where h.user_name = t.user_name and h.exp_time <= t.exp_time and t.rn > 10;
DELETE FROM hash_table
WHERE user_name = 'three'
  AND hash_id NOT IN (
    SELECT hash_id FROM hash_table
    WHERE user_name = 'three'
    ORDER BY exp_time DESC
    LIMIT 3);
My own answer: this query considers rows only for the wanted user_name, not for all user_names.