Spark SQL regex to extract date, file name and brand - sql

Currently I have several files and I want to upload them to a DB, creating new columns with some metadata on them. An example of the files I have is the following:
MYBRAND-GOOD_20210202.tab
MYBRAND-BAD_20210202.tab
MYBRAND_20210202.tab
Each file has x, y, z columns, and I want to add further columns with metadata derived from the file name. The result I would like is the following:
Table MYBRAND-GOOD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a | b | c | GOOD | 20210202 | tab | MYBRAND-GOOD_20210202
Table MYBRAND-BAD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a | b | c | BAD | 20210202 | tab | MYBRAND-BAD_20210202
Table MYBRAND
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a | b | c | MYBRAND | 20210202 | tab | MYBRAND_20210202
What I'm currently doing is the following :
SELECT x,y,z,
split(INPUT_FILE_NAME(),'- | _')[1] AS brand,
regexp_extract(INPUT_FILE_NAME(), '.*/modified_dttm=(.*)/.+', 1) AS FILE_DATE,
regexp_extract(regexp_replace(INPUT_FILE_NAME()\\,'%20'\\,'')\\, '.*/.*-([0-9]{4}-[0-9]{2}-[0-9]{2}).tab'\\, 1)) AS SOURCE_DETAILS
regexp_extract(INPUT_FILE_NAME(), '^([^\.]+)\.?', 0) AS NAME
However, I'm facing several problems (since I'm not very proficient with regex):
brand fails if the name doesn't have a '-' separator (as in 'MYBRAND')
I'm not sure whether FILE_DATE is doing what it's supposed to do
SOURCE_DETAILS is giving me empty results
NAME is ok, but I would like to exclude the '.'
If someone could guide me with these regex rules, which I don't follow completely, I would appreciate any correction.

We can write one pattern for the whole file name and vary the group index argument of regexp_extract() for each desired element.
(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)
Using that pattern each time, you select which capture group to return:
SELECT x, y, z,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 3) AS Brand,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 4) AS FileDate,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 5) AS SourceDetails,
regexp_extract(INPUT_FILE_NAME(), '(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)', 1) AS Name
You parenthesize each subpattern you want to capture, so we start with a parenthesis pair right at the beginning to capture the name. Then we match the literal MYBRAND (the match is case-sensitive), then start a second group, made optional with ?, because the hyphen and what follows it may be missing. Inside it, the third group captures the alphanumerics [A-Za-z0-9]* that make up the brand. When the hyphen part is absent, as in MYBRAND_20210202.tab, group 3 does not match and regexp_extract returns an empty string for it rather than a null. Next comes an underscore followed by the fourth group, which captures the eight digits of the date, \d{8}. We close the first parenthesis here to end the file name capture, then match a literal dot, and the final group captures the file type (\w+).
Depending on how your Spark session parses string literals, the backslashes may need to be doubled (\\d, \\., \\w).
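To check the capture groups outside Spark, here is a minimal Python sketch that applies the same pattern to the three example file names. The function name parse_file_name and the fallback of brand to the literal MYBRAND are assumptions made for illustration, based on the desired output in the question; they are not part of the original answer.

```python
import re

# Same grouping as the pattern above: 1 = name, 3 = brand, 4 = date, 5 = extension.
PATTERN = re.compile(r"(MYBRAND(-([A-Za-z0-9]*))?_(\d{8}))\.(\w+)")


def parse_file_name(path):
    """Return the metadata fields for one file path, or None if nothing matches."""
    m = PATTERN.search(path)
    if m is None:
        return None
    return {
        "name": m.group(1),
        # Group 3 is unset when there is no "-SUFFIX" part. Falling back to the
        # literal prefix is an assumption taken from the question's desired output.
        "brand": m.group(3) or "MYBRAND",
        "file_date": m.group(4),
        "source_details": m.group(5),
    }
```

For MYBRAND-GOOD_20210202.tab this yields brand GOOD, and for MYBRAND_20210202.tab it yields brand MYBRAND. In Spark SQL a similar fallback could be written as a CASE expression that tests group 3 for the empty string.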

Related

SQL replace list of strings with element prefixes

In Postgres, I have a table with a column which is a list of text:
devdb=> \d txyz
Table "public.txyz"
Column | Type | Collation | Nullable | Default
---------------+--------+-----------+----------+---------
status | text | | |
lstcol | text[] | | |
and lstcol contains
devdb=> select lstcol from txyz limit 1 ;
lstcol
----------------------------------------------------------------------
{"ABCD - Company One Ltd","EFG - Second Corp."}
I want to replace each element contained in the list with the word that precedes the " - ", obtaining
{"ABCD","EFG"}
How can I achieve that?
It is fine to create another column, and then replace the original one.
My SQL isn't stellar and this project has a lot of it. Any help is deeply appreciated.
Many thanks
You can update the existing table (i.e. transform the existing column contents) like this:
update txyz
set lstcol = (select array_agg(trim(split_part(s, '-', 1))) from unnest(lstcol) s);
And it would be good to vacuum table txyz after that.
One method is a lateral join which pulls the array apart, picks out the piece you want, and then reaggregates:
select t.*, x.ar
from txyz t cross join lateral
(select array_agg(split_part(col, ' - ', 1)) as ar
from unnest(t.lstcol) col
) x;
Here is a db<>fiddle.
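The unnest / split_part / array_agg steps amount to a per-element transformation of the list. A small Python sketch of the same logic, useful for checking the expected result on sample data (the function name prefixes is hypothetical):

```python
def prefixes(items, sep=" - "):
    """Keep only the text before the first separator in each element.

    Like split_part(col, ' - ', 1), an element without the separator is
    returned whole; strip() mirrors the trim() used in the UPDATE variant.
    """
    return [item.split(sep, 1)[0].strip() for item in items]
```

Applied to the sample row, prefixes(["ABCD - Company One Ltd", "EFG - Second Corp."]) gives ["ABCD", "EFG"].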
You should read the official Postgres documentation.
https://www.postgresql.org/docs/13/arrays.html - this part of the manual describes arrays and the operations on them.

How to replace text contained in one row with text contained in another row using a select statement

I am crafting a sql query that dynamically builds a where clause. I was able to transform the separate pieces of the where clause as return rows like so:
-------------------------------------------
| ID | Query Part |
-------------------------------------------
| TOKEN 1 | (A = 1 OR B = 2) |
-------------------------------------------
| TOKEN 2 | ([TOKEN 1] or C = 3) |
-------------------------------------------
| TOKEN 3 | ([TOKEN 2] and D = 4) |
-------------------------------------------
My goal is to wrap the results above in STUFF and/or REPLACE (or something entirely different I hadn't considered) to output the following result:
(((A=1 OR B=2) OR C=3) AND D=4)
Ideally there would be no temp table necessary but I am open to recommendations.
Thank you for any guidance, this has had me pretty stumped at work.
It's unusual. It looks like the query part you want is only Token 3. The process should then replace any [token] tags in this query part with the corresponding query parts. In the resulting query part, the process should again replace any [token] tags with the corresponding query parts. This continues until there are no more [token] tags to replace.
I think there should be a way of indicating the master query (i.e. Token 3); then use a recursive common table expression to build the expression up until there are no more [token] tags.
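The repeated substitution described here can be sketched in Python to make the termination condition concrete. This is only an illustration of the idea, not the recursive CTE itself; the function name expand and the dictionary layout are assumptions.

```python
import re

TOKEN = re.compile(r"\[([^\]]+)\]")  # matches tags such as [TOKEN 1]


def expand(parts, root):
    """Substitute [token] tags in parts[root] until none are left.

    parts maps a token ID to its query part. Bracketed text that is not a
    known token ID is left untouched.
    """
    expr = parts[root]
    # A chain of n tokens needs at most n passes; the extra pass detects
    # that nothing changed. More passes than that means a cycle.
    for _ in range(len(parts) + 1):
        new_expr = TOKEN.sub(lambda m: parts.get(m.group(1), m.group(0)), expr)
        if new_expr == expr:
            return expr
        expr = new_expr
    raise ValueError("token references form a cycle")
```

With the three rows from the question (and the closing parenthesis on TOKEN 2), expand(parts, "TOKEN 3") returns (((A = 1 OR B = 2) or C = 3) and D = 4). A recursive CTE would need the same two ingredients: a starting row and a stop condition once no tags remain.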

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores data in Postgres xml columns, which means that I in fact have (after evaluating xpath('/data/clients/name/text()')::text[]):
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it like it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use JSON" are not options.
I am stuck with this datamodel.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join protects you from duplicates: a name like "alicia nob" matches more than one search pattern and, with a plain join, would come back once for every match.
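For the array-valued data model from the edit, the intended semantics are "keep the office if any name matches any pattern". A Python sketch of that check, handy for validating a SQL query against sample rows (the function name offices_matching and the row layout are assumptions):

```python
import re


def offices_matching(rows, patterns):
    """Return the office IDs where at least one name matches at least one pattern.

    rows is a list of (names, office_id) pairs. re.search is unanchored,
    like the ~ operator. any() stops at the first hit, so each office is
    reported once, which is the semi-join behaviour.
    """
    compiled = [re.compile(p) for p in patterns]
    return [
        office_id
        for names, office_id in rows
        if any(rx.search(name) for name in names for rx in compiled)
    ]
```

On the three sample rows with the patterns .*ob and ali.*, this returns offices 1 and 3; office 2 (anthony, walter) matches neither pattern.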
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When a string array is cast to text, elements that contain special characters or consist of keywords are enclosed in double quotes, so {jimmy,"walter, james"} has two entries. Also note that ~ matches against any part of the string, unlike LIKE, which must match the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)

Remove Text from a String in an SSIS Package

I am currently updating an already existing SSIS package.
The current Package pulls data from an Excel Spread Sheet that is provided by our IT Department. It lists Machine Names of Computers and counts it for a License Report.
I currently have the Job (derived column) strip off the M (Mobile) or D (Desktop) from the first part of the machine name so that it returns just the user name, which is what I need for the report.
MBRUBAKERBR => BRUBAKERBR
However, our IT Department just implemented Windows 7 and with it a new Naming convention.
Now there is a 76A, B, C or D that is added to the end of all of the updated machines. If the machine has not been updated then it stays with the older Naming Convention (seen Above).
There are also machines that have to stay on XP; their names have been updated to have X3A, B, C or D at the end.
MBRUBAKERBR76A or DBRUBAKERX3C
What I need is to remove the last part of the name so that I just get the user name out of it for reporting.
The issue is that I can't use a LEFT, RIGHT, LTRIM or RTRIM expression, as some of the computer names will only have the M or D in front (they have not yet been upgraded).
What can I do to remove these characters without rebuilding this package?
UPDATE: I would really like to update the existing Expression that Removed the M and D.
Here is the Expression that I am using.
SUBSTRING(Name,2,50)
this is in a Derived Column in my SSIS Package.
As for Sample Data here is what it looks like coming in.
| Name |
| MBrubakerBR76A |
| MBROCKSKX3A |
| DGOLDBERGZA |
| MWILLIAMSEL |
| DEASTST76C |
| DCUSICKEVX3D |
This is what I want it to return.
| Name |
| BRUBAKERBR |
| BROCKSK |
| GOLDBERGZA |
| WILLIAMSEL |
| EASTST |
| CUSICKEV |
Let me know if you need any more information or examples.
First determine whether the machine has been upgraded. If it has, strip the last three characters and the first letter; if not, strip only the first letter. I avoided the TRIM functions to keep the code clear.
SELECT
machineName,
CASE WHEN RIGHT(machineName, 3) Like '%[0-9]%' THEN
SUBSTRING(machineName, 2, len(machineName) - 4)
ELSE
RIGHT(machineName, len(machineName)-1)
END AS UserName
From MachineList
SQL Fiddle Example
SSIS Expression
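Pattern matching (LIKE) is not available in SSIS expressions, so compare the suffix explicitly instead. Note that SSIS expressions use == for equality:
(LEFT(RIGHT(machineName, 3), 2) == "X3" || LEFT(RIGHT(machineName, 3), 2) == "76") ? SUBSTRING(machineName, 2, LEN(machineName) - 4) : RIGHT(machineName, LEN(machineName) - 1)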
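The same rule can be checked against the sample data with a short Python sketch. The function name user_name is hypothetical, and the upper-casing is an assumption taken from the expected output (MBrubakerBR76A should become BRUBAKERBR); the expressions above do not change case.

```python
def user_name(machine_name):
    """Strip the leading M/D and, if present, a trailing 76x or X3x suffix."""
    name = machine_name.upper()
    # The two characters before the last one identify an upgraded machine.
    if name[-3:-1] in ("76", "X3"):
        return name[1:-3]
    return name[1:]
```

Running it over the six sample names reproduces the expected column: BRUBAKERBR, BROCKSK, GOLDBERGZA, WILLIAMSEL, EASTST, CUSICKEV.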

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with an uppercase letter:
CREATE OR REPLACE FUNCTION mysort(text) returns text IMMUTABLE as $$
SELECT CASE WHEN substring($1 from 1 for 1) =
upper( substring($1 from 1 for 1)) then 'AAAA' || $1 else $1 END
;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to replace 'AAAA' with something more apt) and it hurts performance, of course.
If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
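A minimal Python sketch of the two-character encoding described above, to show how the transformed keys sort under plain ASCII comparison. The mapping table and the function name sort_key are assumptions for illustration: it covers only the Polish letters and keeps the case of the base letter so that capitalised names sort first, as in the C locale.

```python
# Each Polish letter maps to its ASCII base letter plus a rank digit;
# plain letters get rank 0 (so 's' -> 's0', 'ś' -> 's1').
POLISH = {
    "ą": "a1", "ć": "c1", "ę": "e1", "ł": "l1", "ń": "n1",
    "ó": "o1", "ś": "s1", "ź": "z1", "ż": "z2",
}


def sort_key(name):
    """Build an ASCII-only key whose byte order follows the Polish alphabet."""
    out = []
    for ch in name:
        lower = ch.lower()
        pair = POLISH.get(lower, lower + "0")
        if ch.isupper():
            # Keep uppercase so that 'W0' sorts before 's0' in ASCII order.
            pair = pair[0].upper() + pair[1]
        out.append(pair)
    return "".join(out)
```

Here sort_key("świecki") gives s1w0i0e0c0k0i0, and sorting the four sample names by this key yields Włocławek, sępoleński, świecki, toruński. Stored in an extra column, such a key could be indexed and used directly in ORDER BY.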
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).