PostgreSQL search lists of substrings in string column - sql

I have the following table in a PostgreSQL database (simplified for clarity):
| serverdate | name | value
|-------------------------------------
0 | 2019-12-01 | A LOC 123 DISP | 1
1 | 2019-12-01 | B LOC 456 DISP | 2
2 | 2019-12-01 | C LOC 777 DISP | 0
3 | 2019-12-01 | D LOC 000 DISP | 10
4 | 2019-12-01 | A LOC 700 DISP | 123
5 | 2019-12-01 | F LOC 777 DISP | 8
The name column is of type string. The substrings LOC and DISP can take other values of different lengths, but they are not of interest in this question.
The problem: I want to SELECT the rows that only contain a certain substring. There are several substrings, passed as an ARRAY, in the following format:
['A_123', 'F_777'] # this is an example only
I want to select all the rows that contain both the first part of a substring (splitting it at the underscore '_') and the second part. In this example, with the array above, I should obtain rows 0 and 5, as these are the only ones that match both parts of one of the substrings:
| serverdate | name | value
|-------------------------------------
0 | 2019-12-01 | A LOC 123 DISP | 1
5 | 2019-12-01 | F LOC 777 DISP | 8
Row 4 has the first part of the substring correct, but not the other one, so it shouldn't be returned. Same thing with row 2 (only second part matches).
How could this query be done? I'm relatively new to SQL.
This query is part of a process in Python, so I can adjust the input parameter (the substring array) if needed, but the behaviour must be the same as the one described.
Thanks!

Have you tried with regexp_replace and a subquery?
SELECT * FROM
(SELECT serverdate, substring(name from 1 for 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name, value
FROM t) j
WHERE name IN('A_123', 'F_777');
Or using a CTE
WITH j AS (
SELECT serverdate, substring(name from 1 for 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name2,
value,name
FROM t
) SELECT serverdate,name,value FROM j
WHERE name2 IN('A_123', 'F_777');
serverdate | name | value
------------+----------------+-------
2019-12-01 | A LOC 123 DISP | 1
2019-12-01 | F LOC 777 DISP | 8
(2 rows)

Just unnest the array and join the table using a like clause
select
*
from
Table1
join
(
select
'%'||replace(unnest, '_', '%')||'%' pat
from
unnest(array['A_123', 'F_777'])
) pat_table on "name" like "pat"
Just replace unnest(array['A_123', 'F_777']) with unnest(string_to_array(str_variable, ','))
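Since the query is driven from Python, the same LIKE patterns can also be built on the client. The following is only a sketch under stated assumptions: it uses an in-memory SQLite table named t as a stand-in for PostgreSQL, ORs the patterns together instead of joining against unnest, and the helper names to_like_pattern and select_matching are made up for illustration. Note that a pattern such as '%A%123%' is loose: it would also accept a name like 'BA LOC 9123 DISP'.

```python
import sqlite3


def to_like_pattern(token):
    # 'A_123' -> '%A%123%': both parts must appear, in this order.
    return "%" + token.replace("_", "%") + "%"


def select_matching(conn, tokens):
    # One "name LIKE ?" per token; the patterns are passed as bound parameters.
    patterns = [to_like_pattern(t) for t in tokens]
    if not patterns:
        return []
    where = " OR ".join(["name LIKE ?"] * len(patterns))
    sql = "SELECT serverdate, name, value FROM t WHERE " + where + " ORDER BY name"
    return conn.execute(sql, patterns).fetchall()


# Sample data copied from the question (SQLite stands in for PostgreSQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (serverdate TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2019-12-01", "A LOC 123 DISP", 1),
        ("2019-12-01", "B LOC 456 DISP", 2),
        ("2019-12-01", "C LOC 777 DISP", 0),
        ("2019-12-01", "D LOC 000 DISP", 10),
        ("2019-12-01", "A LOC 700 DISP", 123),
        ("2019-12-01", "F LOC 777 DISP", 8),
    ],
)

matches = select_matching(conn, ["A_123", "F_777"])
print(matches)
```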

Thanks for your answers! Solution by Larry B got me an error, but it was caused by external factors (I run the queries using an internal tool developed by my company, and it threw errors when using the % wildcard. Strange behaviour; I have already contacted the support team), so I could not test it properly.
Solution by Jim Jones seemed an alternative, but I found that, in some cases, the values in the name field would look like this (I didn't notice it when writing the question, as it is a rare case):
ABC LOC 123 DISP
So I modified the solution a little bit so as to grab the first part of the name when splitting it by the ' ' character.
(TLDR: 1st substring of name could be of arbitrary length, but is always at the start)
My solution is this one:
SELECT * FROM
(SELECT serverdate, split_part(name, ' ', 1)||'_'||
regexp_replace(name, '\D*', '', 'g') AS name, value
FROM t) j
WHERE name IN('A_123', 'F_777');
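To make the derived key easier to reason about, here is a rough Python equivalent of the expression above, assuming the same rule: first space-separated token, an underscore, then every digit found in the name. The function name name_key is hypothetical and only mirrors the SQL for illustration.

```python
import re


def name_key(name):
    # Mirrors split_part(name, ' ', 1) || '_' || regexp_replace(name, '\D*', '', 'g').
    first = name.split(" ", 1)[0]
    digits = re.sub(r"\D", "", name)
    return first + "_" + digits


wanted = {"A_123", "F_777", "ABC_123"}
names = [
    "A LOC 123 DISP",
    "C LOC 777 DISP",
    "A LOC 700 DISP",
    "F LOC 777 DISP",
    "ABC LOC 123 DISP",
]
selected = [n for n in names if name_key(n) in wanted]
print(selected)
```

One thing this makes visible: if the first token itself contained digits, they would end up in the numeric part of the key as well.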

split_part(name, ' ', 1) || '_' || split_part(name, ' ', 3) AS name
This is the breakdown of the expression: A || _ || 123 = A_123

Related

how to loop an array in string in a where clause

I have an information table with a column of an array in string format. The length is unknown starting from 0. How can I put it in a where clause of PostgreSQL?
* hospital_information_table
| ID | main_name | alternative_name |
| --- | ---------- | ----------------- |
| 111 | 'abc' | 'abe, abx' |
| 222 | 'bbc' | '' |
| 333 | 'cbc' | 'cbe,cbd,cbf,cbg' |
* record
| ID | name | hospital_id |
| --- | ------- | ------------ |
| 1 | 'abc-1' | |
| 2 | 'bbe+2' | |
| 3 | 'cbf*3' | |
For example, the alternative_name column holds the alternative names of a hospital, say 'abc,abd,abe,abf' for the hospital with ID '111'. I have a record with the hospital name 'cbf*3' ('3' is the department name) and I would like to find its hospital ID. How can I check all the names in 'cbe,cbd,cbf,cbg' one by one and get the ID '333'?
--update--
In the example, in the record table, I used '-', '*', '+', meaning that I couldn't split the name in the record table under a certain pattern. But I can make sure that some of the alternative names may appear in the record name (as a substring). something similar to e.g. 'cbf' in 'cbf*3'. I would like to check all names, if 'abe' in 'cbf*3'? no, if 'abx' in 'cbf*3'? no, then the next row etc.
--update--
Thanks for the answers! They are great!
For more detail: the original dataset is not in an alphabetic language, and the text in the record name is not separable. It is really hard to find one separator, or even several. Therefore, solutions that rely on a regex like '[-*+]' would not work here.
Thanks in advance!
You could use regexp_split_to_array to convert the comma-delimited string to a proper array, and then use the any operator to search inside it:
SELECT r.*, h.id
FROM record r
JOIN hospital_information h ON
SPLIT_PART(r.name, '-', 1) = ANY(REGEXP_SPLIT_TO_ARRAY(h.alternative_name, ','))
Substring can be used with a regular expression to get the hospital name from the record's name.
And String_to_array can transform a CSV string to an array.
SELECT
r.id as record_id
, r.name as record_name
, h.id as hospital_id
FROM record r
LEFT JOIN hospital_information h
ON SUBSTRING(r.name from '^(.*)[+*\-]\w+$') = ANY(STRING_TO_ARRAY(h.alternative_name,',')||h.main_name)
WHERE r.hospital_id IS NULL;
record_id | record_name | hospital_id
----------+-------------+------------
1 | abc-1 | 111
2 | bbe+2 | 222
3 | cbf*3 | 333
Btw, text [] can be used as a datatype in a table.
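The question's update asks for a plain substring test ("is 'cbf' in 'cbf*3'?") rather than splitting on a separator. A minimal Python sketch of that check, using the sample rows from the question, could look like this. The function names are hypothetical, and empty alternative names are dropped on purpose because an empty string is a substring of everything. In PostgreSQL a comparable test could presumably be written with strpos(r.name, alt) > 0 over the unnested names, but that variant is not shown or tested here.

```python
def candidate_names(main_name, alternative_name):
    # Main name plus each comma-separated alternative, trimmed, without empties.
    parts = [main_name] + alternative_name.split(",")
    return [p.strip() for p in parts if p.strip()]


def find_hospital_id(record_name, hospitals):
    # Return the ID of the first hospital with a name contained in record_name.
    for hospital_id, main_name, alternative_name in hospitals:
        names = candidate_names(main_name, alternative_name)
        if any(n in record_name for n in names):
            return hospital_id
    return None


hospitals = [
    (111, "abc", "abe, abx"),
    (222, "bbc", ""),
    (333, "cbc", "cbe,cbd,cbf,cbg"),
]
records = ["abc-1", "bbe+2", "cbf*3"]

resolved = {name: find_hospital_id(name, hospitals) for name in records}
print(resolved)
```

With this sample data 'bbe+2' resolves to no hospital at all, since none of the listed names occurs in it.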

Pull NULL if column not present in table while UNION SQL Server

I am currently building a dynamic SQL query. The tables and columns are sent as parameters. So the columns may not be present in the table. Is there a way to pull NULL data in the result set when the column is not present in the table?
ex:
SELECT * FROM Table1
Output:
created date | Name | Salary | Married
-------------+-------+--------+----------
25-Jan-2016 | Chris | 2500 | Y
27-Jan-2016 | John | 4576 | N
30-Jan-2016 | June | 3401 | N
So when I run the query below
SELECT Created_date, Name, Age, Married
FROM Table1
I need to get
created date | Name | AGE | Married
-------------+-------+--------+----------
25-Jan-2016 | Chris | NULL | Y
27-Jan-2016 | John | NULL | N
30-Jan-2016 | June | NULL | N
Does anything like IF NOT EXISTS or ISNULL work in this?
I can't use extensive T-SQL in this segment and need to be simple since I am creating a UNION query to more than 50 tables (requirement :| ) . Any advice would be of great help to me.
I can't think of an easy solution. Since you're using dynamic sql, instead of
(previous dynamic string part)+' fieldname '+(next dynamic string part)
you could use
(previous dynamic string part)
+ case when exists (
select 1
from sys.tables t
inner join sys.columns c on t.object_id=c.object_id
where c.name=your_field_name and t.name=your_table_name
) then ' fieldname ' else ' NULL ' end
+(next dynamic string part)
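If the dynamic statement is assembled in client code instead, the same idea (look the column up in the catalog, emit NULL when it is missing) can be sketched as follows. This is an illustration only: it runs against SQLite and reads the column list with PRAGMA table_info, whereas on SQL Server the lookup would go through sys.columns as in the snippet above. The helper names quote_ident and build_select are made up.

```python
import sqlite3


def quote_ident(name):
    # Double-quote an identifier, doubling any embedded quotes.
    return '"' + name.replace('"', '""') + '"'


def build_select(conn, table, wanted):
    # Column names that really exist in the table (compared case-insensitively).
    info = conn.execute("PRAGMA table_info(" + quote_ident(table) + ")")
    existing = {row[1].lower() for row in info}
    parts = []
    for col in wanted:
        if col.lower() in existing:
            parts.append(quote_ident(col))
        else:
            # Missing column: select NULL under the requested name instead.
            parts.append("NULL AS " + quote_ident(col))
    return "SELECT " + ", ".join(parts) + " FROM " + quote_ident(table)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (Created_date TEXT, Name TEXT, Salary INTEGER, Married TEXT)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [
        ("25-Jan-2016", "Chris", 2500, "Y"),
        ("27-Jan-2016", "John", 4576, "N"),
        ("30-Jan-2016", "June", 3401, "N"),
    ],
)

sql = build_select(conn, "Table1", ["Created_date", "Name", "Age", "Married"])
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Each per-table SELECT built this way has the same column list, so the pieces can then be combined with UNION ALL.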

Verifying a cycle of values in table

I have a table which looks like this (the real table has dates and time in place of the Letters):
| assigned | start | end
| xyz | A | B
| xyz | B | C
| xyz | C | D
| xyz | D | E
| xyz | E | F
| fgh | A | B
| fgh | B | C
etc.
There is a rotation with each assigned code (xyz,fgh and so on) where 'end' is congruent with the next 'start' up to a value indicating a defined end (here 'F').
I am looking for a statement which scans/verifies that this rotation is indeed occurring, that it starts at A and ends with F and did every step up until then.
Any help is greatly appreciated.
edit: The rotation always uses 5 rows (or 4 steps), even if the interval length can change in between.
This is really a hack that works because the dates are replaced by characters, but it might give you ideas on how to make it work for real.
select * from (
select a_code, min(a_start) as thestart, max(a_end) as theend,
substring(group_concat(a_start order by a_start), 3) as starts,
substring(group_concat(a_end order by a_end), 1, length(group_concat(a_end))-2) as ends
from so_test
group by a_code ) as grpSelect
where thestart = 'a'
and theend = 'f'
and starts = ends
The group_concat of a_start for xyz produces the string 'a,b,c,d,e', while the group_concat of a_end produces 'b,c,d,e,f'. The substring calls remove the a from the start of the first string and the f from the end of the second, so that the outer query can compare 'b,c,d,e' in both strings.
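For real dates and times, where string trimming no longer applies, the check itself is easy to state procedurally: sort each code's steps, require that every end equals the next start, and require the expected first start and last end. The sketch below assumes the rows have already been fetched into Python as (assigned, start, end) tuples with sortable values; verify_rotations is a hypothetical helper, not part of the answer above.

```python
from collections import defaultdict


def verify_rotations(rows, first="A", last="F"):
    # Group the (start, end) steps by their assigned code.
    by_code = defaultdict(list)
    for assigned, start, end in rows:
        by_code[assigned].append((start, end))

    result = {}
    for code, steps in by_code.items():
        steps.sort()
        # Every step must begin where the previous one ended.
        linked = all(prev[1] == nxt[0] for prev, nxt in zip(steps, steps[1:]))
        result[code] = steps[0][0] == first and steps[-1][1] == last and linked
    return result


rows = [
    ("xyz", "A", "B"),
    ("xyz", "B", "C"),
    ("xyz", "C", "D"),
    ("xyz", "D", "E"),
    ("xyz", "E", "F"),
    ("fgh", "A", "B"),
    ("fgh", "B", "C"),
]

status = verify_rotations(rows)
print(status)
```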

Splitting a string column in BigQuery

Let's say I have a table in BigQuery containing 2 columns. The first column represents a name, and the second is a delimited list of values, of arbitrary length. Example:
Name | Scores
-----+-------
Bob |10;20;20
Sue |14;12;19;90
Joe |30;15
I want to transform into columns where the first is the name, and the second is a single score value, like so:
Name,Score
Bob,10
Bob,20
Bob,20
Sue,14
Sue,12
Sue,19
Sue,90
Joe,30
Joe,15
Can this be done in BigQuery alone?
Good news everyone! BigQuery can now SPLIT()!
Look at "find all two word phrases that appear in more than one row in a dataset".
There is no current way to split() a value in BigQuery to generate multiple rows from a string, but you could use a regular expression to look for the commas and find the first value. Then run a similar query to find the 2nd value, and so on. They can all be merged into only one query, using the pattern presented in the above example (UNION through commas).
Rewriting Elad Ben Akoune's answer in Standard SQL, the query becomes:
WITH name_score AS (
SELECT Name, split(Scores,';') AS Score
FROM (
(SELECT * FROM (SELECT 'Bob' AS Name ,'10;20;20' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Sue' AS Name ,'14;12;19;90' AS Scores))
UNION ALL
(SELECT * FROM (SELECT 'Joe' AS Name ,'30;15' AS Scores))
))
SELECT name, score
FROM name_score
CROSS JOIN UNNEST(name_score.score) AS score;
And this outputs:
+------+-------+
| name | score |
+------+-------+
| Bob | 10 |
| Bob | 20 |
| Bob | 20 |
| Sue | 14 |
| Sue | 12 |
| Sue | 19 |
| Sue | 90 |
| Joe | 30 |
| Joe | 15 |
+------+-------+
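For readers unfamiliar with SPLIT plus UNNEST, the transformation amounts to "one output row per list element". A small Python sketch of the same reshaping, run on the sample data from the question, is shown below. The function name explode_scores is made up, and the scores are kept as strings, as SPLIT also returns strings.

```python
def explode_scores(rows, sep=";"):
    # One (name, score) pair per element of the delimited list.
    return [(name, score) for name, scores in rows for score in scores.split(sep)]


table = [
    ("Bob", "10;20;20"),
    ("Sue", "14;12;19;90"),
    ("Joe", "30;15"),
]

flat = explode_scores(table)
for name, score in flat:
    print(name + "," + score)
```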
If someone is still looking for an answer
select Name,split(Scores,';') as Score
from (
# replace the inner custom select with your source table
select *
from
(select 'Bob' as Name ,'10;20;20' as Scores),
(select 'Sue' as Name ,'14;12;19;90' as Scores),
(select 'Joe' as Name ,'30;15' as Scores)
);

Finding the difference between two sets of data from the same table

My data looks like:
run | line | checksum | group
-----------------------------
1 | 3 | 123 | 1
1 | 7 | 123 | 1
1 | 4 | 123 | 2
1 | 5 | 124 | 2
2 | 3 | 123 | 1
2 | 7 | 123 | 1
2 | 4 | 124 | 2
2 | 4 | 124 | 2
and I need a query that returns me the new entries in run 2
run | line | checksum | group
-----------------------------
2 | 4 | 124 | 2
2 | 4 | 124 | 2
I tried several things, but I never got to a satisfying answer.
In this case I'm using H2, but of course I'm interested in a general explanation that would help me to wrap my head around the concept.
EDIT:
OK, it's my first post here so please forgive if I didn't state the question precisely enough.
Basically, given two run values (r1, r2, with r2 > r1), I want to determine which rows having run = r2 have a different line, checksum or group from any row where run = r1.
select * from yourtable
where run = 2 and checksum = (select max(checksum)
from yourtable)
Assuming your last run will have the higher run value than others, below SQL will help
select * from table1 t1
where t1.run in
(select max(t2.run) from table1 t2)
Update:
Above SQL may not give you the right rows because your requirement is not so clear. But the overall idea is to fetch the rows based on the latest run parameters.
SELECT line, checksum, group
FROM TableX
WHERE run = 2
EXCEPT
SELECT line, checksum, group
FROM TableX
WHERE run = 1
or (with slightly different result):
SELECT *
FROM TableX x
WHERE run = 2
AND NOT EXISTS
( SELECT *
FROM TableX x2
WHERE run = 1
AND x2.line = x.line
AND x2.checksum = x.checksum
AND x2.group = x.group
)
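The "slightly different result" can be seen by running both queries on the sample data: EXCEPT works on sets and collapses the duplicate row, while NOT EXISTS keeps both copies and also returns the run column. The sketch below uses an in-memory SQLite database as a stand-in for H2, and renames the group column to grp (an assumption made here to avoid quoting the reserved word).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablex (run INTEGER, line INTEGER, checksum INTEGER, grp INTEGER)"
)
conn.executemany(
    "INSERT INTO tablex VALUES (?, ?, ?, ?)",
    [
        (1, 3, 123, 1), (1, 7, 123, 1), (1, 4, 123, 2), (1, 5, 124, 2),
        (2, 3, 123, 1), (2, 7, 123, 1), (2, 4, 124, 2), (2, 4, 124, 2),
    ],
)

# Set difference: duplicates in run 2 are collapsed into one row.
except_rows = conn.execute(
    """
    SELECT line, checksum, grp FROM tablex WHERE run = 2
    EXCEPT
    SELECT line, checksum, grp FROM tablex WHERE run = 1
    """
).fetchall()

# Anti-join: every run 2 row without a counterpart in run 1 is kept.
not_exists_rows = conn.execute(
    """
    SELECT * FROM tablex x
    WHERE x.run = 2
      AND NOT EXISTS (
        SELECT 1 FROM tablex x2
        WHERE x2.run = 1
          AND x2.line = x.line
          AND x2.checksum = x.checksum
          AND x2.grp = x.grp
      )
    """
).fetchall()

print(except_rows)
print(not_exists_rows)
```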
A slightly different approach:
select min(run) run, line, checksum, group
from mytable
where run in (1,2)
group by line, checksum, group
having count(*)=1 and min(run)=2
Incidentally, I assume that the "group" column in your table isn't actually called group - this is a reserved word in SQL and would need to be enclosed in double quotes (or backticks or square brackets, depending on which RDBMS you are using).