Match similar column values in different rows - sql

I have a table with an ID column (String) and I need to be able to find IDs that are similar between different rows. What is the SQL that will allow me to flag a row as similar? Note: There can be one-to-many rows like shown below (i.e. 12345, 12345RED, etc.)
Update: The IDs are "similar" in that they typically have leading numeric characters followed either directly by alpha characters, or by a space " ", hyphen "-", or forward slash "/" and then alpha characters: ####[a-zA-Z], #### [a-zA-Z], ####-[a-zA-Z], or ####/[a-zA-Z]. (I'm not sure how to indicate one-to-many numeric characters.)
ID | Similar
12345 | Yes
12345RED (could also be 12345-RED, 12345/RED, or 12345 RED) | Yes
12345BLU (could also be 12345-BLU, 12345/BLU, or 12345 BLU) | Yes
12345GRN (could also be 12345-GRN, 12345/GRN, or 12345 GRN) | Yes
12345BLK (could also be 12345-BLK, 12345/BLK, or 12345 BLK) | Yes
123456 | No
123457 | No

Assuming "similar" means "have the same leading numerals"...
First, extract the leading numerals, for example with a regular expression. Then use a window function to count how many other rows have the same leading numerals.
WITH
  extract_numerals AS
(
  SELECT
    *,
    REGEXP_EXTRACT(id, r'^\d+') AS leading_numerals
  FROM
    your_table
)
SELECT
  *,
  COUNT(*) OVER (PARTITION BY leading_numerals) - 1 AS similar_rows
FROM
  extract_numerals
ORDER BY
  leading_numerals
Any row where the count is zero (after having deducted one from the window function) has no "similar" rows.
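To make the counting logic concrete, here is a minimal Python sketch of the same idea outside the database. The sample IDs are taken from the question; the helper name leading_numerals and the variable names are only illustrative.

```python
import re
from collections import Counter

ids = ["12345", "12345RED", "12345-BLU", "12345/GRN", "12345 BLK", "123456", "123457"]

def leading_numerals(value):
    """Return the run of digits at the start of value, or None if there is none."""
    match = re.match(r"\d+", value)
    return match.group(0) if match else None

# Equivalent of COUNT(*) OVER (PARTITION BY leading_numerals)
group_sizes = Counter(leading_numerals(value) for value in ids)

# Equivalent of the "- 1": how many OTHER rows share the same leading numerals
similar_rows = {value: group_sizes[leading_numerals(value)] - 1 for value in ids}

for value, count in similar_rows.items():
    print(value, "Yes" if count > 0 else "No")
```

As in the query, a count of zero means the row has no "similar" rows.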

Related

Find all the rows where column is letter case postgresql

I have a table in postgres database where I need to find all the rows -
Between two dates where fromTo is date column.
And also only those rows where column data contains mix of lower and upper case letters. for eg: eCTiWkAohbQAlmHHAemK
I can filter between two dates as shown below, but I'm confused about how to do the second point.
SELECT * FROM test where fromTo BETWEEN '2022-09-08' AND '2022-09-23';
Data type for fromTo column is shown below -
fromTo | timestamp without time zone | | not null | CURRENT_TIMESTAMP
You can use a regular expression to check that the value contains only alphabetical characters and at least one uppercase character.
select *
from foo
where data ~ '[[:upper:]]' and data ~ '^[[:alpha:]]+$'
and fromTo BETWEEN '2022-09-08' AND '2022-09-23';
The character classes will match all alphabetical characters, including those with accents.
Note that this may not be able to make use of an index. If your table is large, you may need to reconsider how you're storing the data.
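As a rough illustration of the check itself, here is a small Python sketch. It assumes "mix of lower and upper case" means letters only, with at least one uppercase and at least one lowercase letter, so it adds a lowercase test on top of the two conditions in the query. The function name is hypothetical.

```python
def is_mixed_case_alpha(value):
    """True if value is letters only and has both upper- and lowercase letters."""
    return (
        value.isalpha()                          # letters only (accented letters count)
        and any(ch.isupper() for ch in value)    # at least one uppercase letter
        and any(ch.islower() for ch in value)    # at least one lowercase letter (assumed requirement)
    )

print(is_mixed_case_alpha("eCTiWkAohbQAlmHHAemK"))
```

If the lowercase requirement is wanted in the database as well, a third condition using the [[:lower:]] class would be the analogous test.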

SQL: Change query output to have two separate columns from having rows with 2 values

For some reason my query output is grouping together 2 columns into 1, and putting the 2 values in the same row like this:
PATIENT_NAME
--------------------------------------------------------
INSURANCE
-------------------------
Aimie Pepsodent
Manulife
Aka Fresh
Blue Cross
Apple Addaye
Blue Cross
But I want them to appear in two separate columns like my teacher's output:
PATIENT_NAME INSURANCE
-------------- ----------------
Apple Addaye Blue Cross
Roy Alflush No Insurance
Shane Cane No Insurance
Is there a way I can change it to this?
Right now my sql query looks like this:
select (fname||' '||lname) patient_name,
(nvl(l4_insurance_cos.company_name, 'No Insurance')) insurance
from l4_patients
left join l4_insurance_cos
on l4_patients.ins_id = l4_insurance_cos.id
order by l4_patients.lname;
This is purely a SQL*Plus display issue. The line size is too small for the two columns to fit, so SQL*Plus splits each result row across two lines.
You need to adjust the linesize of your session and/or the display width of each column. By default, a column's display width is the maximum length of the result-set column (if you concatenate two columns in the query, that is the sum of the lengths of the two columns, with a limit of 4000 bytes for varchars).
The actual values will depend on your terminal and table definition, but here is an example:
set linesize 140 -- allow a total of 140 characters per line
column patient_name format a80 -- 80 characters for column "patient_name"
column insurance format a60 -- 60 characters for column "insurance"
Then, you can run your query.

SQLite3 Order by highest/lowest numerical value

I am trying to do a query in SQLite3 to order a column by numerical value. Instead of getting the rows ordered by the numerical value of the column, the rows are ordered alphabetically by the first digit's numerical value.
For example in the query below 110 appears before 2 because the first digit (1) is less than two. However the entire number 110 is greater than 2 and I need that to appear after 2.
sqlite> SELECT digit,text FROM test ORDER BY digit;
1|one
110|One Hundred Ten
2|TWO
3|Three
sqlite>
Is there a way to make 110 appear after 2?
It seems like digit is stored as a string, not as a number. You need to convert it to a number to get the proper ordering. A simple approach uses:
SELECT digit, text
FROM test
ORDER BY digit + 0
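The difference can be reproduced with Python's built-in sqlite3 module. This is a sketch that assumes digit was declared as a TEXT column, which the question does not state explicitly; the helper ordered_digits is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: digit is a TEXT column, which would explain the ordering in the question
conn.execute('CREATE TABLE test (digit TEXT, "text" TEXT)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("1", "one"), ("110", "One Hundred Ten"), ("2", "TWO"), ("3", "Three")],
)

def ordered_digits(order_by):
    # order_by is a hard-coded expression in this sketch, never user input
    rows = conn.execute(f"SELECT digit FROM test ORDER BY {order_by}")
    return [row[0] for row in rows]

text_order = ordered_digits("digit")                    # string comparison
numeric_order = ordered_digits("digit + 0")             # forces numeric comparison
cast_order = ordered_digits("CAST(digit AS INTEGER)")   # explicit alternative

print(text_order)
print(numeric_order)
```

An explicit CAST gives the same ordering as adding zero and may read more clearly.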

Extracting number of specific length from a string in Postgres

I am trying to extract a set of numbers from comments like
"on april-17 transactions numbers are 12345 / 56789"
"on april-18 transactions numbers are 56789"
"on may-19 no transactions"
Which are stored in a column called "com" in table comments
My requirement is to get the numbers of a specific length. In this case the length is 5, so 12345 and 56789 from the above string, separately. A comment may contain zero five-digit numbers, or more than two.
I tried using regexp_replace with the following result. I am trying to find an efficient regex or other method to achieve it.
select regexp_replace(com, '[^0-9]',' ', 'g') from comments;
regexp_replace
----------------------------------------------------
17 12345 56789
I expect the result to get only
column1 | column2
12345 56789
There is no easy way to write a query that returns an arbitrary number of columns: the same query cannot return one column for one comment and two columns for the next.
For fixed two columns:
SELECT
  matches[1] AS col1,
  matches[2] AS col2
FROM (
  SELECT
    array_agg(regexp_matches[1]) AS matches
  FROM
    regexp_matches(
      'on april-17 transactions numbers are 12345 / 56789',
      '\d{5}',
      'g'
    )
) s
regexp_matches() returns every match, one row per match.
array_agg() collects all of those rows into a single array.
The array elements can then be returned as separate columns.
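For comparison, the extraction step can be sketched in Python. Unlike the plain \d{5} pattern, this version assumes a "five-digit number" must not be part of a longer run of digits, so it adds lookarounds; the function name is hypothetical.

```python
import re

# Exactly five digits, not preceded or followed by another digit (assumed requirement)
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def five_digit_numbers(comment):
    """Return every standalone five-digit number in comment, in order."""
    return FIVE_DIGITS.findall(comment)

comments = [
    "on april-17 transactions numbers are 12345 / 56789",
    "on april-18 transactions numbers are 56789",
    "on may-19 no transactions",
]
for comment in comments:
    print(five_digit_numbers(comment))
```

The result is a list of variable length per comment, which is the same reason the SQL version has to settle on a fixed number of output columns.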

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac") we sort its letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
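A minimal sketch of that pre-processing and lookup in Python, using a tiny stand-in word list instead of the real dictionary. The names signature and build_index are illustrative only.

```python
from collections import defaultdict

def signature(word):
    """The word's letters in sorted order, e.g. 'cat' -> 'act'."""
    return "".join(sorted(word.lower()))

def build_index(words):
    """Map each sorted-letter signature to the words that share it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index

# Stand-in for the real word list
index = build_index(["act", "cat", "tom", "moot", "mote"])

# Sort the input's letters, then look them up
matches = index.get(signature("tac"), [])
print(matches)
```

Because the signature keeps repeated letters, "tom" and "moot" land in different buckets, which handles the duplicate-letter requirement. Note that this finds words that use all of the input letters exactly once each, not shorter words made from a subset of them.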
If you use the solution you propose, you'll need to add an order column to the WordLetters table. Without it, there's no guarantee that the rows you retrieve come back in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
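The same bitmask idea can be sketched in a few lines of Python. It assumes lowercase a to z only, and the function name letter_mask is hypothetical.

```python
def letter_mask(word):
    """Combine one bit per distinct letter: a -> 1, b -> 2, c -> 4, and so on."""
    mask = 0
    for ch in set(word.lower()):
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

for word in ["act", "cat", "tom", "moot", "mote"]:
    print(word, letter_mask(word))
```

Using a set of distinct letters plays the same role as Sum(distinct LetterValue) in the query: repeated letters do not change the value, so this approach deliberately ignores how many times each letter occurs.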
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    -- one row per (word, letter): how often the letter occurs in the word
    select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
    from word w
    cross join wordletters wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    select w.word,
        (case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= wl.maxcount
              then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
              else wl.maxcount end) as LettersInWord
    from word w
    cross join
    (
        -- how many times each letter is available
        select letter, count(*) as maxcount
        from wordletters
        group by letter
    ) wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
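The counting logic may be easier to follow in a short Python sketch. It assumes the input letters are given as a plain string, and both function names are illustrative.

```python
from collections import Counter

def can_build(word, letters):
    """True if word uses each letter at most as often as it appears in letters."""
    available = Counter(letters)
    needed = Counter(word)
    return all(available[ch] >= count for ch, count in needed.items())

def uses_exactly(word, letters):
    """True if word uses every supplied letter exactly as often as given."""
    return Counter(word) == Counter(letters)

print(can_build("cat", "tac"))
print(can_build("moot", "tom"))
print(uses_exactly("act", "tac"))
```

Here can_build corresponds to capping each letter's count at the number available, and uses_exactly corresponds to the exact-match variant.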
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.