SQL replace list of strings with element prefixes - sql

In Postgres, I have a table with a column that holds a list of text values:
devdb=> \d txyz
Table "public.txyz"
Column | Type | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
status | text | | |
lstcol | text[] | | |
and lstcol contains
devdb=> select lstcol from txyz limit 1 ;
lstcol
----------------------------------------------------------------------
{"ABCD - Company One Ltd","EFG - Second Corp."}
I want to replace each element contained in the list with the word that precedes the " - ", obtaining
{"ABCD","EFG"}
How can I achieve that?
It is fine to create another column, and then replace the original one.
My SQL isn't stellar and this project has a lot of it. Any help is deeply appreciated.
Many thanks

You can update the existing table (i.e. transform the existing column contents) like this:
update txyz
set lstcol = (select array_agg(trim(split_part(s, '-', 1))) from unnest(lstcol) s);
It would also be good to run VACUUM on table txyz afterwards, since the update rewrites every row.
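One thing to watch when splitting on a bare '-' and trimming: a code that itself contains a hyphen gets cut short, whereas splitting on ' - ' keeps it whole. The unnest / split_part / array_agg pipeline is just a per-element map, so here is a minimal sketch of it in Python (not Postgres). The prefixes helper and the "AB-CD" element are made up for illustration.

```python
def prefixes(items, sep=" - "):
    """Keep the text before the first separator in each element."""
    return [item.split(sep, 1)[0].strip(" ") for item in items]


# The sample row from the question
row = ["ABCD - Company One Ltd", "EFG - Second Corp."]
result = prefixes(row)

# Hypothetical element whose code itself contains a hyphen
tricky = ["AB-CD - Hyphen Ltd"]
bare = prefixes(tricky, sep="-")      # splits inside the code
spaced = prefixes(tricky, sep=" - ")  # splits only on the spaced separator
```

If your codes can never contain a hyphen, both separators give the same result.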

One method is a lateral join which pulls the array apart, picks out the piece you want, and then reaggregates:
select t.*, x.ar
from txyz t cross join lateral
(select array_agg(split_part(col, ' - ', 1)) as ar
from unnest(t.lstcol) col
) x;
Here is a db<>fiddle.

You should read the official Postgres documentation.
https://www.postgresql.org/docs/13/arrays.html - this part of the manual describes arrays and the operations on them.

Related

Spark SQL regex to extract date, file name and brand

Currently I have several files and I want to upload them to a DB, creating new columns with some metadata on them. An example of the files I have is the following:
MYBRAND-GOOD_20210202.tab
MYBRAND-BAD_20210202.tab
MYBRAND_20210202.tab
Each file has x, y, z columns, and additionally I want to create 3 new columns with metadata based on some properties of the files. What I would like to have as a result is the following:
Table MYBRAND-GOOD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c GOOD 20210202 tab MYBRAND-GOOD_20210202
Table MYBRAND-BAD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c BAD 20210202 tab MYBRAND-BAD_20210202
Table MYBRAND
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c MYBRAND 20210202 tab MYBRAND_20210202
What I'm currently doing is the following :
SELECT x,y,z,
split(INPUT_FILE_NAME(),'- | _')[1] AS brand,
regexp_extract(INPUT_FILE_NAME(), '.*/modified_dttm=(.*)/.+', 1) AS FILE_DATE,
regexp_extract(regexp_replace(INPUT_FILE_NAME()\\,'%20'\\,'')\\, '.*/.*-([0-9]{4}-[0-9]{2}-[0-9]{2}).tab'\\, 1)) AS SOURCE_DETAILS
regexp_extract(INPUT_FILE_NAME(), '^([^\.]+)\.?', 0) AS NAME
However, I'm facing several problems (I'm not very proficient with regex):
brand fails if the name doesn't have a '-' separator (as in 'MYBRAND')
I'm not sure whether FILE_DATE is doing what it's supposed to do
SOURCE_DETAILS is giving me empty results
NAME is OK, but I would like to exclude the '.'
If someone could guide me through these regex rules, which I don't completely follow, I would appreciate any correction.
We can write one pattern for the whole file name and vary the index argument of regexp_extract() for each desired element.
(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)
Using that pattern each time, you select which capture group to return:
SELECT x, y, z,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 3) AS Brand,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 4) AS FileDate,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 5) AS SourceDetails,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 1) AS Name
You parenthesize each subpattern you want to capture, so we start with a parenthesis pair right at the beginning to capture the name. Then we match the literal MYBRAND (the match is case-sensitive, so it must be written as it appears in the file names), then start a second group, made optional with ?, because the hyphen and suffix may be absent. Inside it, the third group captures the alphanumerics [A-Za-z0-9]* which make up the brand. When the optional group takes no part in the match, as for MYBRAND_20210202.tab, regexp_extract returns an empty string for group 3. Next comes an underscore followed by a fourth group capturing the eight digits of the date, \d{8}. We close the first group here to end the file name capture, then match a literal dot, and the final group (\w+) captures the file type. Depending on how the query is submitted, Spark's SQL parser may unescape string literals, in which case the backslashes need to be doubled (\\d and \\.).
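As a quick sanity check of the pattern outside Spark, here is a minimal sketch using Python's re module. It assumes Python's engine behaves like Spark's Java regex engine for this particular pattern, and it writes the brand prefix in upper case to match the sample file names. The /data/ path and the parse helper are made up for illustration.

```python
import re

# Same capture-group layout as the answer: 1 = name, 3 = brand, 4 = date, 5 = extension
PATTERN = re.compile(r"(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)")


def parse(path):
    """Return the four metadata fields for one file path, or None if it does not match."""
    m = PATTERN.search(path)  # search, not match: the file name sits at the end of a path
    if m is None:
        return None
    return {
        "name": m.group(1),
        "brand": m.group(3) or "",  # optional group: absent for plain MYBRAND files
        "file_date": m.group(4),
        "source_details": m.group(5),
    }


good = parse("/data/MYBRAND-GOOD_20210202.tab")
plain = parse("/data/MYBRAND_20210202.tab")
```

For the plain MYBRAND file the brand comes back empty, so if you want 'MYBRAND' there as in the question's expected output, you would still need a fallback on top of the pattern.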

SQL Server Full Text Search to find containing characters

I have a table with a column Document that has a full-text index.
Let's say the table contains this:
| ID | Document |
| 1 | WINTER SUMMER SPRING OTHER |
My requirement is to find rows that contain 'ER'.
For this I am querying like this:
SELECT TOP 100
[FullTextSearch].[Document], [FullTextSearch].[ID]
FROM
[FullTextSearch]
WHERE
CONTAINS(Document, '"*ER*"')
But this is not working.
Please suggest the best way to do this using full-text search.
I expect the row with ID 1 to be returned.
You can use the LIKE operator to find the value.
The LIKE operator is used in a WHERE clause to search for a specified pattern in a column.
There are two wildcards used in conjunction with the LIKE operator:
% - The percent sign represents zero, one, or multiple characters
_ - The underscore represents a single character
Syntax:
SELECT column1, column2, ...
FROM table_name
WHERE columnN LIKE pattern;
This query can help to find the result.
SELECT Document,ID FROM FullTextSearch
WHERE Document LIKE '%ER%';
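To see the two wildcards in action, here is a minimal sketch that runs the same LIKE query against an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server here (an assumption: % and _ behave the same way in both), and the second row and the find helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FullTextSearch (ID INTEGER, Document TEXT)")
conn.executemany(
    "INSERT INTO FullTextSearch VALUES (?, ?)",
    [
        (1, "WINTER SUMMER SPRING OTHER"),  # row from the question
        (2, "AUTUMN"),                      # hypothetical extra row
    ],
)


def find(pattern):
    """Return the IDs of rows whose Document matches the LIKE pattern."""
    cur = conn.execute(
        "SELECT ID FROM FullTextSearch WHERE Document LIKE ? ORDER BY ID",
        (pattern,),
    )
    return [row[0] for row in cur]


infix = find("%ER%")     # 'ER' anywhere in the text
prefix = find("ER%")     # text must start with 'ER'
single = find("_UTUMN")  # exactly one character, then 'UTUMN'
```

Keep in mind that a pattern with a leading % cannot use an ordinary index, so on a large table this turns into a full scan.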
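This is a wildcard query, so LIKE should work:
SELECT TOP 100
[FullTextSearch].[Document], [FullTextSearch].[ID]
FROM
[FullTextSearch]
WHERE
Document LIKE '%ER%'
Note that CONTAINS does not accept % wildcards. Full-text search only supports prefix terms, written as '"ER*"', which match words that start with ER. It cannot match text in the middle or at the end of a word, so the following would not return row 1:
SELECT TOP 100
[FullTextSearch].[Document], [FullTextSearch].[ID]
FROM
[FullTextSearch]
WHERE
CONTAINS(Document, '"ER*"')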

PostgreSQL Reverse LIKE

I need to test if any part of a column value is in a given string, instead of whether the string is part of a column value.
For instance:
This way, I can find if any of the rows in my table contains the string 'bricks' in column:
SELECT column FROM table
WHERE column ILIKE '%bricks%';
But what I'm looking for, is to find out if any part of the sentence "The ships hung in the sky in much the same way that bricks don’t" is in any of the rows.
Something like:
SELECT column FROM table
WHERE 'The ships hung in the sky in much the same way that bricks don’t' ILIKE '%' || column || '%';
So the row from the first example, where the column contains 'bricks', will show up as result.
I've looked through some suggestions here and some other forums but none of them worked.
Your simple case can be solved with a simple query using the ANY construct and ~*:
SELECT *
FROM tbl
WHERE col ~* ANY (string_to_array('The ships hung in the sky ... bricks don’t', ' '));
~* is the case insensitive regular expression match operator. I use that instead of ILIKE so we can use original words in your string without the need to pad % for ILIKE. The result is the same - except for words containing special characters: %_\ for ILIKE and !$()*+.:<=>?[\]^{|}- for regular expression patterns. You may need to escape special characters either way to avoid surprises. Here is a function for regular expressions:
Escape function for regular expression or LIKE patterns
But I have nagging doubts that will be all you need. See my comment. I suspect you need Full Text Search with a matching dictionary for your natural language to provide useful word stemming ...
Related:
IN vs ANY operator in PostgreSQL
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
This query:
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' );
gives the following result:
| regexp_split_to_table |
|-----------------------|
| The |
| ships |
| hung |
| in |
| the |
| sky |
| in |
| much |
| the |
| same |
| way |
| that |
| bricks |
| don’t |
Now just do a semijoin against the result of this query to get the desired rows:
SELECT * FROM table t
WHERE EXISTS (
SELECT * FROM (
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' ) x
) x
WHERE t.column LIKE '%'|| x.x || '%'
)
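The semijoin idea can be tried end to end with a minimal sketch in Python using an in-memory SQLite database. SQLite stands in for PostgreSQL here (an assumption: it has no regexp_split_to_table, so the sentence is split in Python and loaded into a temporary table, and its LIKE is case-insensitive for ASCII, roughly like ILIKE). The tbl rows and the words table are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [("red bricks",), ("glass",), ("SKY blue",)],  # hypothetical rows
)

sentence = "The ships hung in the sky in much the same way that bricks don’t"

# One row per word, playing the role of regexp_split_to_table(...)
conn.execute("CREATE TEMP TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in sentence.split()])

# Semijoin: keep a row if at least one word of the sentence occurs in its column
rows = [
    r[0]
    for r in conn.execute(
        "SELECT col FROM tbl t WHERE EXISTS ("
        "  SELECT 1 FROM words WHERE t.col LIKE '%' || words.w || '%'"
        ") ORDER BY col"
    )
]
```

Short words such as 'in' match inside longer words, so you may want to filter out very short words or stop words before joining.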

Moving data by length

I have to move the data in the words table to other tables.
For example, the words table is:
------------------------------
| word | type |
|------------------|---------|
| car | NA |
| home | NA |
| question | PR |
------------------------------
I have to move this data by length. For example, the length of car is 3, so car will move to the 3-char table (with its type column), and question will move to the 8-char table.
How can I do this with SQL commands?
Sort of an incomplete question, but something like this might help point you in the right direction:
INSERT INTO words_3char SELECT word FROM all_words WHERE LENGTH(word)=3;
DELETE FROM all_words WHERE LENGTH(word)=3;
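The insert-then-delete approach can be generalised to every length at once. Below is a minimal sketch in Python against an in-memory SQLite database, as a stand-in for whatever engine you use. The words_Nchar naming scheme and the move_words_by_length helper are assumptions, and the type column is copied along as the question asks.

```python
import sqlite3


def move_words_by_length(conn):
    """Copy each word into a per-length table, then empty the source table."""
    lengths = [
        row[0]
        for row in conn.execute("SELECT DISTINCT LENGTH(word) FROM words ORDER BY 1")
    ]
    with conn:  # the inserts and the final delete are committed together
        for n in lengths:
            target = f"words_{n}char"  # hypothetical naming scheme; n is an integer
            conn.execute(f"CREATE TABLE IF NOT EXISTS {target} (word TEXT, type TEXT)")
            conn.execute(
                f"INSERT INTO {target} (word, type) "
                "SELECT word, type FROM words WHERE LENGTH(word) = ?",
                (n,),
            )
        conn.execute("DELETE FROM words")
    return lengths


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("car", "NA"), ("home", "NA"), ("question", "PR")],
)
conn.commit()

moved = move_words_by_length(conn)
```

Doing the copy and the delete in one transaction means a failure halfway through does not leave words in both places.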
I'm not going to ask why you need to do all this moving around, but I'm not sure it's a good idea. Assuming it is, take a look at the LENGTH() function for MySQL and then try something like this. Note that LENGTH() counts bytes in MySQL, while CHAR_LENGTH() counts characters, which matters for multi-byte text.
INSERT INTO table_Char3 (Word)
SELECT Word FROM Words WHERE LENGTH(Word) = 3;
You can move them to new tables like this:
create table word1char as select word from words where length(trim(word)) = 1;
..
create table word3chars as select word from words where length(trim(word)) = 3;

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with an uppercase letter:
CREATE OR REPLACE FUNCTION mysort(text) returns text IMMUTABLE as $$
SELECT CASE WHEN substring($1 from 1 for 1) =
upper( substring($1 from 1 for 1)) then 'AAAA' || $1 else $1 END
;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to change 'AAAA' for something more apt) and hurts performance, of course.
If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
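The two-characters-per-letter key described above can be sketched in Python to check that it yields the ordering asked for. This is only an illustration of the idea, not a PostgreSQL function: the mapping table covers the Polish diacritics only, the digit ranks are an assumption, and sort_key is a made-up name.

```python
# Hypothetical mapping: each Polish diacritic letter becomes its ASCII base
# letter plus a rank digit that places it right after the plain letter.
_RANK = {
    "ą": "a1", "ć": "c1", "ę": "e1", "ł": "l1", "ń": "n1",
    "ó": "o1", "ś": "s1", "ź": "z1", "ż": "z2",
}
# Add the uppercase variants (for example 'Ś' -> 'S1')
_RANK.update({k.upper(): v[0].upper() + v[1] for k, v in list(_RANK.items())})


def sort_key(name):
    """Expand every character to two ASCII characters; plain letters get rank 0."""
    return "".join(_RANK.get(ch, ch + "0") for ch in name)


names = ["sępoleński", "świecki", "toruński", "Włocławek"]

# Python compares strings by code point, which for ASCII keys matches the C locale
ordered = sorted(names, key=sort_key)
```

Because the keys are pure ASCII, uppercase names sort first and 'ś' lands between 's' and 't', which is the order the question asks for. Storing the key in its own indexed column would avoid recomputing it on every query.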
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).