Remove Text from a String in an SSIS Package - sql

I am currently updating an already existing SSIS package.
The current package pulls data from an Excel spreadsheet provided by our IT Department. It lists the machine names of computers and counts them for a license report.
I currently have a Derived Column strip the M (Mobile) or D (Desktop) from the start of the machine name so that it returns just the user name, which is what I need for the report.
MBRUBAKERBR => BRUBAKERBR
However, our IT Department just implemented Windows 7 and, with it, a new naming convention.
Now 76A, 76B, 76C or 76D is added to the end of the name of every updated machine. If a machine has not been updated, it keeps the older naming convention (seen above).
There are also machines that have to stay on XP; their names have been updated to end in X3A, X3B, X3C or X3D.
MBRUBAKERBR76A or DBRUBAKERX3C
What I need is to remove the last part of the name so that I just get the user name out of it for reporting.
The issue is that I can't use a plain LEFT, RIGHT, LTRIM or RTRIM expression, as some of the computer names only have the M or D in front (they have not yet been upgraded).
What can I do to remove these characters without rebuilding this package?
UPDATE: I would really like to update the existing expression that removes the M and D.
Here is the expression that I am using:
SUBSTRING(Name,2,50)
This is in a Derived Column in my SSIS package.
As for Sample Data here is what it looks like coming in.
| Name |
| MBrubakerBR76A |
| MBROCKSKX3A |
| DGOLDBERGZA |
| MWILLIAMSEL |
| DEASTST76C |
| DCUSICKEVX3D |
This is what I want it to return.
| Name |
| BRUBAKERBR |
| BROCKSK |
| GOLDBERGZA |
| WILLIAMSEL |
| EASTST |
| CUSICKEV |
Let me know if you need any more information or examples.

First determine whether the machine has been upgraded. If it has, strip the last three characters and the first letter; if it has not, strip only the first letter. I avoided the TRIM functions to keep the code clear.
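SELECT
    machineName,
    CASE
        -- Upgraded machines (76x / X3x) have a digit in the last three characters
        WHEN RIGHT(machineName, 3) LIKE '%[0-9]%'
            THEN SUBSTRING(machineName, 2, LEN(machineName) - 4)
        -- Otherwise only the leading M or D is removed
        ELSE RIGHT(machineName, LEN(machineName) - 1)
    END AS UserName
FROM MachineList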
SQL Fiddle Example
SSIS Expression
Pattern matching (LIKE) does not work in SSIS expressions, so try this instead. Note that SSIS expressions use == for equality:
(LEFT(RIGHT(machineName, 3), 2) == "X3" || LEFT(RIGHT(machineName, 3), 2) == "76") ? SUBSTRING(machineName, 2, LEN(machineName) - 4) : RIGHT(machineName, LEN(machineName) - 1)
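The same branching logic can be checked outside SSIS against the sample data from the question. The following is only a Python sketch, not part of the package: the function name user_name is hypothetical, and the upper-casing step is an assumption made because the expected output in the question is all upper case.

```python
def user_name(machine_name: str) -> str:
    """Strip the leading M/D and, if present, the trailing 76x / X3x suffix."""
    # Assumption: normalise case, since the expected output is upper case.
    name = machine_name.upper()
    # The suffix is three characters long: "76" or "X3" followed by A-D.
    upgraded = name[-3:-1] in ("76", "X3")
    # Drop the first character always, and the last three only when upgraded.
    return name[1:-3] if upgraded else name[1:]


samples = ["MBrubakerBR76A", "MBROCKSKX3A", "DGOLDBERGZA",
           "MWILLIAMSEL", "DEASTST76C", "DCUSICKEVX3D"]
results = [user_name(s) for s in samples]
print(results)
```

Checking the two-character marker explicitly, as the SSIS expression does, is slightly stricter than the LIKE '%[0-9]%' test, which would also treat any user name ending in a digit as upgraded.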

Related

Open sums in SQL / dynamic selection of tables

Much ink has been spilled on the topic of sum types in SQL. The standard solutions are called absorption, separation, and partition; see, e.g.: https://www.inf.unibz.it/~montali/teaching/1415/dpm/slides/4.relational-mapping.pdf .
I want to ask about how to encode open sums. Normal sums allow a field to be one of a fixed set of several different types; with open sums, this set is not fixed.
The basic setup in our program: There is a list of "triggers," where each trigger can be one of many different things. Plugins can be written defining new trigger types, although the set of trigger types can be assumed to be known at compile time.
We want a table of all triggers.
Our current best idea:
1. Dynamically create a materialized view of the following form:
id | id_in_plugin_table | thing_in_main_program_it_refs | plugin_name
---------------------------------------------------------------------
1  | 27                 | 8                             | RegexTrigger
2  | 27                 | 12                            | RidiculouslyUnsafeCustomJSTrigger
This relation is automatically generated from the various plugin tables, each of which has its own ID and a thing_in_main_program_it_refs field.
For illustration, here's what the referenced tables may look like.
RegexTrigger table:
id | thing_in_main_program_it_refs | regex
---------------------------------------------------------------------
27 | 8                             | hel*o
RidiculouslyUnsafeCustomJSTrigger table:
id | thing_in_main_program_it_refs | custom_js
---------------------------------------------------------------------
27 | 12                            | (x) => isPrime(x.length())
2. Either use two roundtrips to look up the plugin table and then query it, or combine them into a single SQL program which uses EXEC.
I'm happy with part 1, but not with part 2. Neither option sounds efficient, and the latter option uses EXEC.
So, we're looking for either (a) a better way to dynamically select a table in a query, or (b) a different approach to open sums.
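For illustration, the first part (generating the combined relation from the plugin tables) might be sketched as follows. This is a Python sketch using SQLite, which is an assumption: SQLite has no materialized views, so a plain view stands in for one, the view name all_triggers and the helper build_trigger_view_sql are hypothetical, and the surrogate id column is left out. The plugin table names are interpolated as identifiers, which is only reasonable because the set of plugins is known ahead of time.

```python
import sqlite3

# Plugin tables assumed to be known when the view is generated.
PLUGIN_TABLES = ["RegexTrigger", "RidiculouslyUnsafeCustomJSTrigger"]

# One SELECT per plugin table; the table name doubles as the plugin_name tag.
SELECT_TEMPLATE = (
    "SELECT id AS id_in_plugin_table, thing_in_main_program_it_refs, "
    "'{name}' AS plugin_name FROM \"{name}\""
)


def build_trigger_view_sql(plugin_tables):
    """Build a CREATE VIEW statement that unions every plugin table."""
    selects = [SELECT_TEMPLATE.format(name=name) for name in plugin_tables]
    return "CREATE VIEW all_triggers AS " + " UNION ALL ".join(selects)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RegexTrigger (
        id INTEGER, thing_in_main_program_it_refs INTEGER, regex TEXT);
    CREATE TABLE RidiculouslyUnsafeCustomJSTrigger (
        id INTEGER, thing_in_main_program_it_refs INTEGER, custom_js TEXT);
    INSERT INTO RegexTrigger VALUES (27, 8, 'hel*o');
    INSERT INTO RidiculouslyUnsafeCustomJSTrigger
        VALUES (27, 12, '(x) => isPrime(x.length())');
""")
conn.execute(build_trigger_view_sql(PLUGIN_TABLES))
rows = conn.execute(
    "SELECT * FROM all_triggers ORDER BY thing_in_main_program_it_refs"
).fetchall()
print(rows)
```

This covers only the generation of the combined relation; looking up plugin-specific columns still needs a second query or dynamic SQL, which is the part the question is asking about.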

Spark SQL regex to extract date, file name and brand

Currently I have several files and I want to upload them to a DB, creating new columns with some metadata on them. An example of the files I have is the following:
MYBRAND-GOOD_20210202.tab
MYBRAND-BAD_20210202.tab
MYBRAND_20210202.tab
Each file has x, y, z columns, and I additionally want to create 3 new columns with metadata, based on some properties of the files. The result I would like is the following:
Table MYBRAND-GOOD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c GOOD 20210202 tab MYBRAND-GOOD_20210202
Table MYBRAND-BAD
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c BAD 20210202 tab MYBRAND-BAD_20210202
Table MYBRAND
x | y | z | brand | FILE_DATE | SOURCE_DETAILS | Name
a. b c MYBRAND 20210202 tab MYBRAND_20210202
What I'm currently doing is the following :
SELECT x,y,z,
split(INPUT_FILE_NAME(),'- | _')[1] AS brand,
regexp_extract(INPUT_FILE_NAME(), '.*/modified_dttm=(.*)/.+', 1) AS FILE_DATE,
regexp_extract(regexp_replace(INPUT_FILE_NAME()\\,'%20'\\,'')\\, '.*/.*-([0-9]{4}-[0-9]{2}-[0-9]{2}).tab'\\, 1)) AS SOURCE_DETAILS
regexp_extract(INPUT_FILE_NAME(), '^([^\.]+)\.?', 0) AS NAME
However, I'm facing several problems (I'm not very proficient with regex):
brand fails if the name doesn't have a '-' separator (as in 'MYBRAND')
I'm not sure whether FILE_DATE is doing what it's supposed to do
SOURCE_DETAILS is giving me empty results
NAME is OK, but I would like to exclude the '.'
If someone could guide me through these regex rules, which I don't completely follow, I would appreciate any correction.
We can write one pattern for the whole string and vary the index argument of regexp_extract() for each desired element.
(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)
Using that pattern each time, you can select which capture group to display:
Select x, y, z,
Regexp_extract(INPUT_FILE_NAME(),'(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)', 3) AS Brand,
Regexp_extract(INPUT_FILE_NAME(),'(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)', 4) AS FileDate,
Regexp_extract(INPUT_FILE_NAME(),'(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)', 5) AS SourceDetails,
Regexp_extract(INPUT_FILE_NAME(),'(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)', 1) AS Name
You parenthesize each subpattern you want to capture, so we start with a parenthesis right at the beginning to capture the name. Then we match the literal MYBRAND, then start a second group around the hyphen and what follows it, and make that group optional with ?. Inside it, the third group captures the alphanumerics [A-Za-z0-9]* that make up the brand. When the optional part is absent, regexp_extract returns an empty string for that group, so a file without a hyphenated suffix yields an empty brand. Next comes an underscore, followed by a fourth group that captures the eight digits of the date, \d{8,8}. We close the first group here to end the file name capture, then match a dot, and the fifth group captures the file type (\w+).
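The group numbering can be verified quickly outside Spark. The following Python sketch applies the same pattern with the standard re module; the helper name parse_file_name and the example paths are hypothetical. Python reports an unmatched optional group as None, so the sketch converts it to an empty string to mirror what regexp_extract returns.

```python
import re

# Same pattern as above: group 1 = name, 3 = brand, 4 = date, 5 = file type.
PATTERN = re.compile(r"(MYBRAND(-([A-Za-z0-9]*))?_(\d{8,8}))\.(\w+)")


def parse_file_name(path: str) -> dict:
    """Extract name, brand, date and file type from a file path."""
    match = PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognised file name: {path}")
    return {
        "name": match.group(1),
        # Unmatched optional group is None in Python; use "" like Spark does.
        "brand": match.group(3) or "",
        "file_date": match.group(4),
        "source_details": match.group(5),
    }


good = parse_file_name("/landing/MYBRAND-GOOD_20210202.tab")
plain = parse_file_name("/landing/MYBRAND_20210202.tab")
print(good)
print(plain)
```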

Transpose/Pivot Excel file in Pentaho (using multiple files)

I've been having some trouble with the following situation: There's an Excel file I need to use which has the information in the following format:
ColumnA | ColumnB
Name | John
Business | Pentaho
Address | Evergreen 123
Job type | Food processing
NameBoss | Boss lv1
Phone | 555-NoPhone
Mail | thisATmail
What I need to do is get all column A as different columns, ending with 7 different columns, each one with one value, which is the data in column B. Additionally, the integration is reading the filename as an extra output field:
SELECT
'${FILES_ROOT}/proyectos/BUSINESS_NAME/B_NAME_OPER/archivos_fuente/NÓMINA BAC - ' ||nombre_empresa||'.xlsx' as nombre_archivo
--, nombre_empresa
FROM "public".maestro_empresa
The transformation for the Excel file is set up like this:
As can be seen in the Fields tab of the transformation, I added each column manually, since the data in the Excel file does not have headers.
With this done, I am not sure how to proceed in order to get the transposed data I need. What can I do?
The end result I am looking for is something like this:
Name | Business | Address | Job type | NameBoss | Phone | Mail | excel_name
John | Pentaho | Evergreen 123 | Food processing | Boss lv1 | 555-NoPhone | thisAtMail | ExcelName.xlsx
You can do this easily with the 'Row denormaliser' step: first read the input from the Excel file, then pass it to the 'Row denormaliser' step. You can see a sample here.
Note: remove the 'Id' column from my sample if you always expect to get a single line.
If your ColumnA values are dynamic (not known in advance), you can use this Metadata Injection sample, where you need to read the same Excel input twice but are not required to specify the column names. Please run the transformation "MetaDataInjectionPV.ktr".
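What the denormalising step does to the data can be sketched in a few lines of Python. This only illustrates the reshaping and is not Pentaho code: the function name denormalise is hypothetical, and the sketch assumes each key in ColumnA appears once per file.

```python
def denormalise(rows, excel_name):
    """Turn (key, value) rows into a single record, one column per key."""
    # Each ColumnA value becomes a column name, each ColumnB value its content.
    record = {key: value for key, value in rows}
    # The file name travels along as an extra output field.
    record["excel_name"] = excel_name
    return record


rows = [
    ("Name", "John"),
    ("Business", "Pentaho"),
    ("Address", "Evergreen 123"),
    ("Job type", "Food processing"),
    ("NameBoss", "Boss lv1"),
    ("Phone", "555-NoPhone"),
    ("Mail", "thisATmail"),
]
record = denormalise(rows, "ExcelName.xlsx")
print(record)
```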

Transpose variable number of rows into columns in OpenRefine

I have an xml file containing records from a library catalogue. I have imported it into OpenRefine but all the values are in one column. I want to transpose it so each field in the record has its own column. However, this is complicated by the fact that a) each field is optional so does not exist in all records and b) many fields are repeatable so can appear multiple times in each record. Here's a simplified example of what the data looks like:
| RecordID | Tag | Data |
| 1 | 040a | CaABCD |
| 1 | 245a | Go fish |
| 1 | 245a | A guide to fish |
| 1 | 246i | Fish series |
| 1 | 260a | Fishing friends |
| 2 | 040a | CaABDC |
| 2 | 245a | Happy trails |
| 2 | 246i | Hiking series |
| 2 | 260i | The happy hiker |
| 2 | 500a | Notes |
I have read the Q&A here Openrefine - Transpose rows into columns based on text but the problem with this solution is that if I concatenate all the values together I have no way to be sure what field they belong in anymore, as my data is much more complicated than the data in that question (my actual data has 25+ fields and many thousands of records).
I was able to get closer using Google Sheets and making a pivot table with a calculated field (as in PivotTable to show values, not sum of values - see the answer at the very bottom). However, I still don't know how to handle the repeating fields. In the pivot table the multiple values are there but only the first displays (double-clicking on an individual cell brings up a details table which lists all the values), so when I copy-paste the table I lose the additional values. I would like to concatenate them but I cannot see a way to do so within the pivot table.
Can you think of any other way I could do this, in OpenRefine or another tool? Thanks!
The classic way to fix this in OpenRefine is to use "Transpose -> Columnize by key value". But this feature is poorly documented and can cause headaches even for OpenRefine developers. In your case, repeated fields will be problematic, so here is a possible solution.
1° Go to the "Tag" column, click on "Transpose -> Columnize by key value" and use the following configuration (don't forget the "Note column (optional)")
The result will look like this (my dataset is not exactly the same as yours; I modified a value to run some tests)
2° In the new column "Record ID: 040 a", click on "edit column -> Move Column To Beginning".
3° If you want to merge the repeated fields, go to each column that contains them and click on "Edit Cells -> Join Multi Value cells" by choosing a separator, for example "|".
The end result will look like this.
To get rid of unnecessary columns: Click on Export -> Custom tabular export and deselect the columns whose name starts with RecordId.
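If OpenRefine proves awkward, the same reshaping (one column per tag, repeated values joined with a separator) can be sketched in Python. This is an alternative outside OpenRefine rather than what the steps above do internally; the function name columnize and the "|" separator are illustrative choices.

```python
from collections import defaultdict

rows = [
    ("1", "040a", "CaABCD"),
    ("1", "245a", "Go fish"),
    ("1", "245a", "A guide to fish"),
    ("1", "246i", "Fish series"),
    ("1", "260a", "Fishing friends"),
    ("2", "040a", "CaABDC"),
    ("2", "245a", "Happy trails"),
    ("2", "246i", "Hiking series"),
    ("2", "260i", "The happy hiker"),
    ("2", "500a", "Notes"),
]


def columnize(rows, sep="|"):
    """Group values by record and tag, joining repeated tags with sep."""
    records = defaultdict(lambda: defaultdict(list))
    for record_id, tag, data in rows:
        records[record_id][tag].append(data)
    # Optional tags are simply absent from records that do not have them.
    return {
        record_id: {tag: sep.join(values) for tag, values in tags.items()}
        for record_id, tags in records.items()
    }


table = columnize(rows)
print(table["1"])
```

Because every value stays under its own tag, repeated fields can be merged without losing track of which field they belong to.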
OpenRefine also has a native MARC importer which might be something worth trying if you need to work with MARC data in the future. MARCEdit also has some specific OpenRefine support built in.

Postgres matching against an array of regular expressions

My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
I could not figure it out by myself sadly :/.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores its data in Postgres XML columns, which means that (after evaluating (xpath('/data/clients/name/text()'))::text[]) I in fact have:
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it the way it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use JSON" are not options.
I am stuck with this data model.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
    select unnest(array['.*ob', 'ali.*']) as pattern
)
select
    o.name, o.officeid
from
    office o
where exists (
    select null
    from patterns p
    where o.name ~ p.pattern
)
The semi-join helps protect you from cases where a name like "alicia nob", which meets multiple search patterns, would otherwise come back once for every match.
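The effect of the semi-join can be illustrated with a small Python sketch: each row is tested against all patterns and kept at most once. The data mirrors the question's first table, plus the "alicia nob" case; re.search is used because the Postgres ~ operator also matches anywhere in the string.

```python
import re

patterns = [".*ob", "ali.*"]
office = [
    ("bob", 1),
    ("alice", 1),
    ("alicia nob", 2),  # matches both patterns
    ("walter", 2),
]

# any() stops at the first matching pattern, so a row appears once at most,
# which is the behaviour of WHERE EXISTS (...) in the query above.
matches = [
    (name, office_id)
    for name, office_id in office
    if any(re.search(pattern, name) for pattern in patterns)
]
print(matches)
```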
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When a string array is cast to text, strings that contain special characters or consist of keywords are enclosed in double quotes, so {jimmy,"walter, james"} is two entries. Also, ~ matches against any part of the string, unlike LIKE, which must match the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)
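For comparison, an element-wise match over the name arrays can be sketched in Python. It tests every name against every pattern instead of matching the text form of the whole array, so the quoting and braces of the cast never enter the comparison. The rows reuse the array data from the question, and the function name matching_offices is hypothetical.

```python
import re

patterns = ["wal", "ant"]
workers = [
    (["bob", "alice"], 1),
    (["anthony", "walter"], 2),
    (["alicia", "walter"], 3),
]


def matching_offices(rows, patterns):
    """Return office ids where any name matches any pattern (unanchored)."""
    return [
        office_id
        for names, office_id in rows
        if any(re.search(p, name) for name in names for p in patterns)
    ]


offices = matching_offices(workers, patterns)
print(offices)
```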