SQL: using COUNT with the same "LIKE" match appearing multiple times in the same cell

I'm trying to count how many times each BNxxxx value has been commented in the comments cell. So far I can only get each cell counted once, but a single cell may contain multiple comments containing a BNxxxx value.
For example, this:
-------
BN0012
-------
BN0012
-------
BN0012
BN0123
-------
should show an output of BN0012 3 times and BN0123 once. Instead, I get BN0012 3 times only.
Here's my code:
select COMMENTS, count(*) as TOTAL
from NOTE
Where COMMENTS like '%BN%' AND CREATE_DATE between '01/1/2015' AND '11/03/2015'
group by COMMENTS
order by Total desc;
Any ideas?
Edit:
My code now looks like this:
select BRIDGE_NO, count(*)
from IACD_ASSET b join
IACD_NOTE c
on c.COMMENTS like concat(concat('BN',b.BRIDGE_NO),'%')
Where c.CREATE_DATE between '01/1/2015' AND '11/03/2015' AND length(b.BRIDGE_NO) > 1
group by b.BRIDGE_NO
order by count(*);
The problem with this is that BN44 matches BN4455 as well. I have tried concat(concat('BN',b.BRIDGE_NO),'_'), but that comes back with nothing. Any ideas how I can get exact LIKE matches?

You have a problem. Let me assume that you have a table of all known BN values that you care about. Then you can do something like:
select bn.fullbn, count(*)
from tableBN bn join
comments c
on c.comment like ('%' || bn.fullbn || '%')
group by bn.fullbn;
The performance of this might be quite poor.
If you happen to be storing lists of things in the comment field, then this is a very bad idea. You should not store lists in strings; you should use a junction table.
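For reference, a minimal junction-table sketch (all table and column names here are hypothetical), after which the count becomes a plain aggregate:
-- Hypothetical schema: one row per BN reference found in a comment,
-- instead of a delimited list inside the comment text.
CREATE TABLE comment_bn (
    comment_id NUMBER       NOT NULL,  -- FK to the comments table
    bn_code    VARCHAR2(10) NOT NULL,  -- e.g. 'BN0012'
    CONSTRAINT comment_bn_pk PRIMARY KEY (comment_id, bn_code)
);

SELECT bn_code, COUNT(*) AS total
FROM comment_bn
GROUP BY bn_code
ORDER BY total DESC;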

I'm going to assume that your COMMENTS table has a primary key column (such as comment_id) or at least that comments isn't a CLOB. If it is a CLOB then you're not going to be able to use GROUP BY on that column.
You can accomplish this as follows without even a lookup table of BN.... values. No guarantees as to the performance:
WITH d1 AS (
    SELECT 1 AS comment_id
         , 'BN0123 is a terrible thing BN0121 also BN0000' AS comments
         , date'2015-01-03' AS create_date
    FROM dual
    UNION ALL
    SELECT 2 AS comment_id
         , 'BN0125 is a terrible thing BN0120 also BN1000' AS comments
         , date'2015-02-03' AS create_date
    FROM dual
)
SELECT comment_id, comments, COUNT(*) AS total
FROM (
    SELECT comment_id, comments
         , TRIM(REGEXP_SUBSTR(comments, '(^|\s)BN\d+(\s|$)', 1, LEVEL, 'i')) AS bn
    FROM d1
    WHERE create_date >= date'2015-01-01'
      AND create_date < date'2015-11-04'
    CONNECT BY REGEXP_SUBSTR(comments, '(^|\s)BN\d+(\s|$)', 1, LEVEL, 'i') IS NOT NULL
          AND PRIOR comment_id = comment_id
          AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL
)
GROUP BY comment_id, comments;
Note that I corrected your filter:
CREATE_DATE between '01/1/2015' AND '11/03/2015'
First, you should be using ANSI date literals (e.g., date'2015-01-01'); second, using BETWEEN for dates is often a bad idea as Oracle DATE values contain a time portion. So this should be rewritten as:
create_date >= date'2015-01-01'
AND create_date < date'2015-11-04'
Note that the later date is November 4, to make sure we capture all possible comments that were made on November 3.
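To see why the upper bound matters, here is a small check against dual (the 14:30 time below is made up for illustration): a comment written during the day on November 3 has a time portion, so it falls outside a BETWEEN that ends at midnight but inside the half-open range.
-- 03-Nov-2015 14:30 is NOT between 01-Jan-2015 00:00 and 03-Nov-2015 00:00,
-- but it IS >= 01-Jan-2015 and < 04-Nov-2015.
SELECT CASE WHEN d BETWEEN date'2015-01-01' AND date'2015-11-03'
            THEN 'included' ELSE 'excluded' END AS with_between,
       CASE WHEN d >= date'2015-01-01' AND d < date'2015-11-04'
            THEN 'included' ELSE 'excluded' END AS with_half_open
FROM (SELECT to_date('03-11-2015 14:30', 'dd-mm-yyyy hh24:mi') AS d FROM dual);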
If you want to see the matched comments without aggregating the counts, then do the following (taking out the outer query, basically):
WITH d1 AS (
    SELECT 1 AS comment_id
         , 'BN0123 is a terrible thing BN0121 also BN0000' AS comments
         , date'2015-01-03' AS create_date
    FROM dual
    UNION ALL
    SELECT 2 AS comment_id
         , 'BN0125 is a terrible thing BN0120 also BN1000' AS comments
         , date'2015-02-03' AS create_date
    FROM dual
)
SELECT comment_id, comments
     , TRIM(REGEXP_SUBSTR(comments, '(^|\s)BN\d+(\s|$)', 1, LEVEL, 'i')) AS bn
FROM d1
WHERE create_date >= date'2015-01-01'
  AND create_date < date'2015-11-04'
CONNECT BY REGEXP_SUBSTR(comments, '(^|\s)BN\d+(\s|$)', 1, LEVEL, 'i') IS NOT NULL
      AND PRIOR comment_id = comment_id
      AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
Given the edits to your question, I think you want something like the following:
SELECT b.bridge_no, COUNT(*) AS comment_cnt
FROM iacd_asset b INNER JOIN iacd_note c
ON REGEXP_LIKE(c.comments, '(^|\W)BN' || b.bridge_no || '(\W|$)', 'i')
WHERE c.create_dt >= date'2015-01-01'
AND c.create_dt < date'2015-03-12' -- It just struck me that your dates are dd/mm/yyyy
AND length(b.bridge_no) > 1
GROUP BY b.bridge_no
ORDER BY comment_cnt;
Note that I am using \W in the regex above instead of \s as I did earlier to make sure that it captures things like BN1234/BN6547.
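As a quick check of that difference (the sample string is made up), both variants can be compared against dual:
-- With '\W' the slash counts as a boundary, so BN6547 is found;
-- with '\s' it is not, because the slash is not whitespace.
SELECT CASE WHEN REGEXP_LIKE('BN1234/BN6547', '(^|\W)BN6547(\W|$)', 'i')
            THEN 'match' ELSE 'no match' END AS with_w,
       CASE WHEN REGEXP_LIKE('BN1234/BN6547', '(^|\s)BN6547(\s|$)', 'i')
            THEN 'match' ELSE 'no match' END AS with_s
FROM dual;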

Try using the DISTINCT keyword in your SELECT statement to pull in unique values for the comments, like this:
select distinct COMMENTS, count(*) as TOTAL
from NOTE
Where COMMENTS like '%BN%' AND CREATE_DATE between '01/1/2015' AND '11/03/2015'
group by COMMENTS
order by Total desc;

Related

More than one row returned by a subquery used as an expression when UPDATE on multiple rows

I'm trying to update rows in a single table by splitting them into two "sets" of rows.
The top part of the set should have its status set to X and the bottom part should have its status set to Y.
I've tried putting together a query that looks like this:
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
((SELECT id from x_status), 'X'),
((SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (folks.ids);
When I run this query I get the following error:
pq: more than one row returned by a subquery used as an expression
This makes sense: folks.ids is expected to return a list of IDs, hence the IN clause in the UPDATE statement. But I suspect the problem is that I cannot return the list in the VALUES statement in the FROM clause, as it turns into something like this:
(1, 2, 3, 4, 5, 5)
(6, 7, 8, 9, 1)
Is there a way this UPDATE can be done using a CTE query at all? I could split this into two separate UPDATE queries, but a single CTE query would be better and in theory faster.
I think I understand now... if I get your problem right, you want to set the status to 'X' for the five most recently registered records and 'Y' for everything else?
In that case I think the row_number() analytic function would work -- and it should do it in a single pass, with two scans, and eliminate one ORDER BY. Let me know if something like this does what you seek.
with ranked as (
select
id, row_number() over (order by date_registered desc) as rn
from people
)
update people p
set
status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where
p.id = r.id
Any time you do an update from another data set, it's helpful to have a WHERE clause that defines the relationship between the two data sets (the non-ANSI join syntax). This makes it iron-clad which rows you are updating.
Also, I believe this code is pretty readable, so it will be easier to build on if you need to make tweaks.
Let me know if I missed the boat.
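One detail worth noting: the original query restricted both CTEs to surname = 'foo'. If that filter still applies, it belongs inside the ranked CTE so that both the numbering and the update are limited to those rows. A sketch, assuming the same people table as above:
with ranked as (
    select
        id, row_number() over (order by date_registered desc) as rn
    from people
    where surname = 'foo'  -- keep the original filter on the ranked set
)
update people p
set
    status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where
    p.id = r.id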
So after more tinkering, I've come up with a solution.
The reason the previous query fails is that we are not grouping the IDs in the subqueries into arrays, so the result expands into a huge list, as I suspected.
The solution is to group the IDs in the subqueries into an ARRAY -- that way they are returned as a single value in the ids column.
This is the query that does the job. Note that we must unnest the IDs in the WHERE clause:
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
(ARRAY(SELECT id from x_status), 'X'),
(ARRAY(SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (SELECT * from unnest(folks.ids));
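A small footnote on the last line: in PostgreSQL an array can also be tested directly with = ANY, so (assuming the same CTEs as above) the final UPDATE of the query could equally be written without the unnest subquery:
UPDATE people
SET status = folks.status
FROM (values
  (ARRAY(SELECT id from x_status), 'X'),
  (ARRAY(SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id = ANY (folks.ids);  -- equivalent to IN (SELECT * from unnest(folks.ids))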

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from Postgres. But perhaps it would be better if I could just tell PostgreSQL to include the zeros.
I suspect this isn't easy at all, because Postgres has no obvious way of knowing what I'm up to. But in the interest of learning more about Postgres and SQL, I thought I'd have a try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it takes some fun twists through constructing SQL queries.
#a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become an inner join (I think), and the date sequence drops its zeros again. For what it's worth, TT is a derived table here because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1s instead of 0s in the places that should be 0.
The reason, however, is that I've now selected every date, so of course there's one row for each. I need to include an additional field (which will be NULL for the missing dates) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count a (nullable) field from the joined table, which I'd simply missed on my first several readings.
You will need to join to a list of dates. This can be done using generate_series(), for example:
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).
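A self-contained way to see this in action (the three sample timestamps below are made up; no real table needed):
-- The missing 2022-03-16 comes back with a count of 0.
WITH my_sample(my_timestamp) AS (
    VALUES (timestamp '2022-03-15 10:00:00'),
           (timestamp '2022-03-15 11:00:00'),
           (timestamp '2022-03-17 09:00:00')
)
SELECT g.dt::date AS the_date,
       count(t.my_timestamp) AS count
FROM generate_series(date '2022-03-15',
                     date '2022-03-17',
                     interval '1 day') AS g(dt)
LEFT JOIN my_sample AS t
       ON t.my_timestamp::date = g.dt::date
GROUP BY the_date
ORDER BY the_date;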

SQL group by selecting top rows with possible nulls

The example table:
id   name   create_time           group_id
---  -----  --------------------  ---------
1    a      2022-01-01 12:00:00   group1
2    b      2022-01-01 13:00:00   group1
3    c      2022-01-01 12:00:00   NULL
4    d      2022-01-01 13:00:00   NULL
5    e      NULL                  group2
I need to get the top 1 row (the one with the minimal create_time) per group_id, with these conditions:
create_time can be null - it should be treated as the minimal value
group_id can be null - all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something like that, assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination to the query (so a join can be a problem)
the query should be as universal as possible (so no vendor-specific features). Again, if that's not possible, it should at least work in MySQL 5 & 8, PostgreSQL 9+ and H2
The expected output for the example:
id   name   create_time           group_id
---  -----  --------------------  ---------
1    a      2022-01-01 12:00:00   group1
3    c      2022-01-01 12:00:00   NULL
4    d      2022-01-01 13:00:00   NULL
5    e      NULL                  group2
I've already read similar questions on SO, but 90% of the answers rely on specific keywords (numerous answers using PARTITION BY, like https://stackoverflow.com/a/6841644/5572007), and the others don't honor null values in the grouping columns, and probably not pagination either (like https://stackoverflow.com/a/14346780/5572007).
You can combine two queries with UNION ALL. E.g.:
select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
select null
from mytable older
where older.group_id = mytable.group_id
and older.create_time < mytable.create_time
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;
This is standard SQL, and very basic at that. It should work in just about every RDBMS.
As to pagination: this is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part, so an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to both branches and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.
If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.
And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).
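To make the keyset idea above concrete, here is a sketch of the paginated form (the :last_biggest_id placeholder and the page size of 20 are purely illustrative; the fetch clause would need to be swapped for LIMIT or TOP on DBMS that don't support it):
select id, name, create_time, group_id
from mytable
where group_id is not null
  and id > :last_biggest_id   -- largest id seen on the previous page (0 for the first page)
  and not exists
  (
    select null
    from mytable older
    where older.group_id = mytable.group_id
      and older.create_time < mytable.create_time
  )
union all
select id, name, create_time, group_id
from mytable
where group_id is null
  and id > :last_biggest_id
order by id
fetch next 20 rows only;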
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
Not sure how you imagine "pagination" should work. Here's one way:
and (
select count(distinct coalesce(t2.group_id, t2.id)) from T t2
where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)
I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.
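For example, a variant of the same core filter with an explicit sentinel timestamp (the 1900-01-01 value is arbitrary, just assumed to be lower than any real create_time; whether the literal syntax works everywhere you need it should be verified):
-- explicit low sentinel instead of relying on 0 casting to a date
select * from T t1
where coalesce(t1.create_time, timestamp '1900-01-01 00:00:00') = (
    select min(coalesce(t2.create_time, timestamp '1900-01-01 00:00:00'))
    from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)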
You've also got a potential complication with collisions between the group_id and id spaces. Yours don't appear to have that problem, though having mixed data types creates its own issues.
This all gets more difficult when you want to order by other columns like name:
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
select count(*) from (
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
) t3
where t3.name < t1.name or t3.name = t1.name
and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5
order by t1.name;
That does handle ties, but it also makes the simplifying assumption that name can't be null, which would add yet another small twist. At least you can see that it's possible without CTEs and window functions, but expect these queries to also be a lot less efficient to run.
https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691
I would guess
SELECT id, name, MAX(create_time), group_id
FROM tb GROUP BY group_id
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name
I should point out that 'name' is a reserved word.

Substring in a column

I have a column that holds several items, and I need to count how many times each item appears. My table looks something like this:
Table Example
Id_TR Triggered
-------------- ------------------
A1_6547 R1:23;R2:0;R4:9000
A2_1235 R2:0;R2:100;R3:-100
A3_5436 R1:23;R2:100;R4:9000
A4_1245 R2:0;R5:150
And I would like the result to be like this:
Expected Results
Triggered Count(1)
--------------- --------
R1:23 2
R2:0 3
R2:100 2
R3:-100 1
R4:9000 2
R5:150 1
I've tried to do some substring work, but I can't seem to find a way to solve this problem. Can anyone help?
This solution is about three times faster than the CONNECT BY solution.
Performance: 15K records per second
with cte (token,suffix)
as
(
select substr(triggered||';',1,instr(triggered,';')-1) as token
,substr(triggered||';',instr(triggered,';')+1) as suffix
from t
union all
select substr(suffix,1,instr(suffix,';')-1) as token
,substr(suffix,instr(suffix,';')+1) as suffix
from cte
where suffix is not null
)
select token,count(*)
from cte
group by token
;
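To try it against the sample rows from the question, the data can be stood up inline (the t name and the rows below just mirror the question's example; the WITH column lists assume Oracle 11.2+):
-- Sample rows from the question, provided as an inline view named t.
with t (id_tr, triggered) as (
    select 'A1_6547', 'R1:23;R2:0;R4:9000'   from dual union all
    select 'A2_1235', 'R2:0;R2:100;R3:-100'  from dual union all
    select 'A3_5436', 'R1:23;R2:100;R4:9000' from dual union all
    select 'A4_1245', 'R2:0;R5:150'          from dual
),
cte (token, suffix) as (
    select substr(triggered || ';', 1, instr(triggered, ';') - 1)
         , substr(triggered || ';', instr(triggered, ';') + 1)
    from t
    union all
    select substr(suffix, 1, instr(suffix, ';') - 1)
         , substr(suffix, instr(suffix, ';') + 1)
    from cte
    where suffix is not null
)
select token, count(*) as cnt
from cte
group by token
order by token;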
with x as (
select listagg(Triggered, ';') within group (order by Id_TR) str from table
)
select regexp_substr(str,'[^;]+',1,level) element, count(*)
from x
connect by level <= length(regexp_replace(str,'[^;]+')) + 1
group by regexp_substr(str,'[^;]+',1,level);
First concatenate all the values of triggered into one list using listagg, then parse it and do the group by.
Other methods of parsing a list can be found here or here.
This is a fair solution.
Performance: 5K records per second
select triggered
,count(*) as cnt
from (select id_tr
,regexp_substr(triggered,'[^;]+',1,level) as triggered
from t
connect by id_tr = prior id_tr
and level <= regexp_count(triggered,';')+1
and prior sys_guid() is not null
) t
group by triggered
;
This one is just for learning purposes; check my other solutions.
Performance: 1K records per second
select x.triggered
,count(*)
from t
,xmltable
(
'/r/x'
passing xmltype('<r><x>' || replace(triggered,';', '</x><x>') || '</x></r>')
columns triggered varchar(100) path '.'
) x
group by x.triggered
;

SQL Show Dates Per Status

I have a table that looks like so:
id animal_id transfer_date status_from status_to
-----------------------------------------------------------------
100 5265 01-Jul-2016 NULL P
101 5265 22-Jul-2016 P A
102 5265 26-Jul-2016 A B
103 5265 06-Aug-2016 B A
I want to create a view to show me the movement of the animal with start and end dates like the following:
animal_id status start_date end_date
---------------------------------------------------------
5265 NULL NULL 30-Jun-2016
5265 P 01-Jul-2016 21-Jul-2016
5265 A 22-Jul-2016 25-Jul-2016
5265 B 26-Jul-2016 05-Aug-2016
5265 A 06-Aug-2016 SYSDATE OR NULL (current status)
As much as I want to provide a query that I've tried, I have none. I don't even know what to search for.
Something like this may be more efficient than a join. Alas, I didn't see a way to avoid scanning the table twice.
NOTE: I didn't use an ORDER BY clause (and indeed, if I had, the ordering would be weird, since I used to_char on the dates to format them). If you need this for further processing, it is best NOT to wrap the dates in to_char.
with
input_data ( id, animal_id, transfer_date, status_from, status_to) as (
select 100, 5265, to_date('01-Jul-2016', 'dd-Mon-yyyy'), null, 'P' from dual union all
select 101, 5265, to_date('22-Jul-2016', 'dd-Mon-yyyy'), 'P' , 'A' from dual union all
select 102, 5265, to_date('26-Jul-2016', 'dd-Mon-yyyy'), 'A' , 'B' from dual union all
select 103, 5265, to_date('06-Aug-2016', 'dd-Mon-yyyy'), 'B' , 'A' from dual
)
select animal_id,
lag (status_to) over (partition by animal_id order by transfer_date) as status,
to_char(lag (transfer_date) over (partition by animal_id order by transfer_date),
'dd-Mon-yyyy') as start_date,
to_char(transfer_date - 1, 'dd-Mon-yyyy') as end_date
from input_data
union all
select animal_id,
max(status_to) keep (dense_rank last order by transfer_date),
to_char(max(transfer_date), 'dd-Mon-yyyy'),
null
from input_data
group by animal_id
;
ANIMAL_ID STATUS START_DATE END_DATE
---------- ------ -------------------- --------------------
5265 30-Jun-2016
5265 P 01-Jul-2016 21-Jul-2016
5265 A 22-Jul-2016 25-Jul-2016
5265 B 26-Jul-2016 05-Aug-2016
5265 A 06-Aug-2016
Added: an explanation of how this works. First, there is a WITH clause to create the input data from the OP's message; this is a standard technique, and anyone who is not familiar with factored subqueries (CTEs, the WITH clause) - introduced in Oracle 11.1 - will do themselves (and the rest of us!) a lot of good by reading about them.
The query unions together rows from two sources. In one branch, I use the lag() analytic function; it orders the rows within each group (defined by the columns in the PARTITION BY clause) according to the column in the ORDER BY clause. So, for example, lag(status_to) looks at all the rows with the same animal_id, orders them by transfer_date, and for each row picks the status_to from the PREVIOUS row (hence "lag"). The rest of that part of the union works similarly.
There is a second part to the union... as you can see in the original post, there are four rows, but the output must have five. In general that suggests a union of some sort will be needed somewhere in the solution (either directly and obviously, as in my solution, or via a self-join, or in some other way). Here I just add another row for the last status (which is still "current"). I use dense_rank last, which, within each group (as defined by the GROUP BY), selects just the last row by transfer_date.
To understand how the query works, it may help to first comment out the lines union all and select ... group by animal_id and run what's left. That will show what the first part of the query does. Then un-comment those lines, and instead comment out the first part, from the first select animal_id to union all (comment out those two lines and everything in between). Run the query again; this will show just the last row for each animal_id.
Of course, in the sample the OP provided there is only one animal_id; if you like, you can add a few more rows (for example in the WITH clause) with a different animal_id. Only then do the partition by animal_id and the group by animal_id become important; with only one animal_id they wouldn't be needed (for example, if all the rows are already filtered by WHERE animal_id = 5265 somewhere else in a subquery).
ADDED #2 - The OP has requested one more version of this: what if the first row is not needed? Then the query is much easier to write and read. Below I won't copy the CTE (WITH clause), and I no longer wrap the dates in to_char(). No GROUP BY is needed, and I didn't order the rows (the OP can do so if needed).
select animal_id,
status_to as status,
transfer_date as start_date,
lead(transfer_date) over (partition by animal_id order by transfer_date) - 1
as end_date
from input_data
;