Lookup using contains match - Pentaho

I have to design an ETL process to load data into a table, but I shouldn't load values matching keys in a lookup table. For example:
*Input Table*
Cab Ride
Ride in Cab
Booked Cab
Self drive
Car pooling
*Lookup Table*
Cab
Taxi
*Destination Table*
Self drive
Car pooling
As you can see, the destination table does not include the rows that contain Cab.
Please let me know if this is possible in Pentaho or SQL.

The simplest way is to build a regex:
Your lookup table feeds the various strings that you want to filter out (or in);
Using a Group By step you concatenate all those strings separated by |; the result is "Cab|Taxi".
Prepend .*( and append ).* with a Calculator step, ending up with .*(Cab|Taxi).*; call this field "regex_filter".
Cross join this single row with the main data stream;
Now you can use a Filter Rows step with the condition "NOT input_field REGEXP regex_filter" (you may want to prepend (?i) to the regex to make it case insensitive).
See attached example: Regex filter in PDI 5.4
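If you would rather do it in plain SQL (the question allows either), the same contains-based exclusion can be written with NOT EXISTS and LIKE. This is only a sketch; input_table(ride_text) and lookup_table(keyword) are made-up names standing in for your actual tables:
-- Sketch only: input_table(ride_text) and lookup_table(keyword) are assumed names.
SELECT i.ride_text
FROM input_table i
WHERE NOT EXISTS (
    SELECT 1
    FROM lookup_table l
    WHERE i.ride_text LIKE CONCAT('%', l.keyword, '%')
);
Every input row that contains any of the lookup keywords is filtered out, so only "Self drive" and "Car pooling" would be loaded.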

Related

Datasource Establishment in Tableau and 170,000 records

I have two Excel data sources with about 175,000 rows. I'm trying to set up a join (Add New Join Clause) using the INNER option between the two data sources. The left data source includes certain member ID #s. Unfortunately, the right data source's member ID #s are buried inside a large field called Member Desc, something like below:
Datasource Left
Member ID #
ALL89098
Datasource Right
Member Desc
YTRNNN TO=ALL89098_KIA TO BE OR NOT OR
POALL89098 JOE
So, I need to deal with two scenarios, as you can see above: the member ID sits inside Member Desc after a TO=, or it could appear anywhere in the text, as in scenario 2 (POALL89098).
If I can't establish the join between these two columns from different data sources in Tableau, I can run SQL statements instead, since both data sources are also loaded as two different tables in a SQL Server DB.
I'm trying to use a CONTAINS calculation in Tableau, such as the one below, but it is running very, very slowly. It is only Tableau Desktop with 16 GB RAM.
IF CONTAINS([Member Desc],[Member id #]) THEN
[Member id #]
ELSE
"NOT FOUND"
END
Thanks so much for your time.
So, is there a way to have the REGEXP within IF/ELSE or CASE statements?
You can create a join calculation. The highlighted dropdown shows where this can be found:
As long as the format of the Member ID in [Member Desc] has some pattern, it can be extracted with Regex. As you mention in your question, one way the ID may present itself is after a "TO=" and it looks like it ends before a "_". The following regex calculated field will pull the string between the two:
REGEXP_EXTRACT([Member Desc],"([^TO=]*)(?=_)")
The result should properly join the two datasources:
The above is an outline which I hope sets you on the right path. I realize that there may be a few different ways in which the [Member ID] presents itself, so I won't be able to nail down the exact regex, but if there is any pattern at all then the format above should work. (i.e., even if the only pattern is that [Member ID] is three letters followed by four numbers - or it always starts with an A and ends with something else - etc.)
Regex should also perform better than a contains() function, but do be aware that the function does need to search through every string in every row to make the join.
Edit in response to comment:
To add multiple conditions, try the following method:
IF LEN(REGEXP_EXTRACT([Member Desc],"([^FROM=]*)(?=,)")) > 0
THEN REGEXP_EXTRACT([Member Desc],"([^FROM=]*)(?=,)")
ELSEIF LEN(REGEXP_EXTRACT([Member Desc],"([^TO=]*)(?=,)")) > 0
THEN REGEXP_EXTRACT([Member Desc],"([^TO=]*)(?=,)")
ELSEIF [...Put as many of these as might match your pattern]
THEN [...Put as many of these as might match your pattern]
END
Essentially the calculation is going down the list and trying each possibility. I changed yours a little to look at the length (LEN()) of the returned value, which should compare fairly quickly, as it is an integer. As this calculation iterates through each ELSEIF and finds a match, it will stop iterating through the list -- so it's important to put the most likely match at the top. The result of the calculated field should be a member ID. If there is no match, there really isn't a need for an ELSE statement because the inner join will exclude it automatically.
Edit in response to comment:
Thank you. I see your recommendations.
I think you are going to have to find a way to strip out the member ID from the Member Desc in SQL. There should be some pattern to the member ID.
For instance, is it always 3 letters followed by 5 numbers, or something similar?
If you can come up with a pattern, then you can use SQL and some combination of SUBSTRING, CHARINDEX, and/or LIKE '%Text%' or a regex
pattern to strip out the actual member ID in the SQL Server table as its own field before bringing it into Tableau.
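For example, here is a hedged sketch of the CHARINDEX/SUBSTRING approach for the TO= case; MemberTable and MemberDesc are assumed names, and the second scenario (an ID embedded anywhere) would still need a known ID pattern:
-- Sketch only: MemberTable(MemberDesc) is an assumed name. Pulls the text
-- between 'TO=' and the next '_' (or end of string) when 'TO=' is present.
SELECT MemberDesc,
       CASE
           WHEN CHARINDEX('TO=', MemberDesc) > 0
           THEN SUBSTRING(MemberDesc,
                          CHARINDEX('TO=', MemberDesc) + 3,
                          CHARINDEX('_', MemberDesc + '_', CHARINDEX('TO=', MemberDesc))
                              - CHARINDEX('TO=', MemberDesc) - 3)
       END AS ExtractedMemberId
FROM MemberTable;
Rows without a TO= come back as NULL here, which is where a pattern-based check (LIKE or regex on the known ID format) would have to take over.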

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that any indexes won't be used for seek operations because of the preceding wildcard.
Using full-text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that replacing the search string with Guil* in the query below would return results). Certainly, running the following returns no results:
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address, along with the primary key of the base table.
When you insert an address into the base table, insert its substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
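As a rough sketch of the population step, a recursive CTE can generate the suffixes in one pass instead of a per-row trigger. This assumes the base table and key are [Table](table_id, Address_Field) as in the query below (Table is bracketed only because TABLE is a reserved word):
-- Sketch only: assumes [Table](table_id, Address_Field) and
-- AddressSubstrings(table_id, AddressSubstring) already exist.
WITH suffixes AS (
    SELECT table_id,
           CAST(Address_Field AS NVARCHAR(4000)) AS AddressSubstring
    FROM [Table]
    UNION ALL
    SELECT table_id,
           CAST(SUBSTRING(AddressSubstring, 2, LEN(AddressSubstring)) AS NVARCHAR(4000))
    FROM suffixes
    WHERE LEN(AddressSubstring) > 1
)
INSERT INTO AddressSubstrings (table_id, AddressSubstring)
SELECT table_id, AddressSubstring
FROM suffixes
OPTION (MAXRECURSION 0);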
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
AddressSubstrings ads
ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham, so LIKE should make use of an index (and a full-text index also works).
If you are interested in the right way to handle this problem, a reasonable place to start is the Postgres documentation. It uses a method similar to the above, but based on n-grams. The only problem with n-grams for your particular problem is that they require rewriting the comparison as well as changing the storage.
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL Server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
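A minimal sketch of that setup on SQL Server, using the table and column from the question (the computed column and index names are made up):
-- Sketch only: adds a computed column with the reversed address and indexes it,
-- so the reversed LIKE becomes a left-anchored, index-friendly search.
-- PERSISTED stores the reversed value so it is computed once at write time.
ALTER TABLE [Table] ADD Address_Field_Rev AS REVERSE(Address_Field) PERSISTED;
CREATE INDEX IX_Table_Address_Field_Rev ON [Table] (Address_Field_Rev);

SELECT *
FROM [Table]
WHERE Address_Field_Rev LIKE REVERSE('ilson') + '%';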
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious: any search path optimisation - the decision whether an index is used, which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It boils down to what Vertica's text search feature does, where that problem is solved out of the box. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col |other_col
42|The Restaurant at the End of the Universe|Arthur Dent
43|The Hitch-Hiker's Guide to the Galaxy |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42| 1|Restaurant
42| 2|End
42| 3|Universe
43| 1|Hitch-Hiker
43| 2|Guide
43| 3|Galaxy
Normally, you suppress typical "fillers" like articles, prepositions and apostrophe-s, and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]
You could start with the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then use a combination of (probably nested) CHARINDEX() and SUBSTRING() calls, using the index you get from the CROSS JOIN with a series of integers as described in the post linked above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
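An over-simplified sketch of building that auxiliary table on SQL Server: it splits on single spaces only via STRING_SPLIT (available from SQL Server 2016), leaves out filler-word removal, and uses made-up names base_table and search_tokens; the fuller CHARINDEX()/SUBSTRING() tally approach described above follows the same shape.
-- Sketch only: token order is not guaranteed by STRING_SPLIT here,
-- so seq is just a per-id row number.
SELECT b.id,
       ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY (SELECT NULL)) AS seq,
       s.value AS search_token
INTO search_tokens
FROM base_table b
CROSS APPLY STRING_SPLIT(b.the_search_col, ' ') AS s
WHERE s.value <> '';

CREATE INDEX IX_search_tokens_token ON search_tokens (search_token);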
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

Apache Pig: Extracting url query parameters that appear in arbitrary order

I have a logfile with urls that are tagged with custom Google Analytics campaign parameters (utm_source, utm_medium, utm_campaign). I need to extract the parameters from the urls and create a csv file where source, medium and campaign appear each in their own column (plus several other fields from the logfile).
This is how I started (url is the field that contains the url obviously):
extracted = foreach mydata GENERATE date, time,
FLATTEN(REGEX_EXTRACT_ALL(url, '.*utm_source=(.*)&utm_medium=(.*)&utm_campaign=(.*)&.*?'))
AS (source:CHARARRAY, medium:CHARARRAY, campaign:CHARARRAY);
This works, but only as long as the parameters appear in a fixed order (and are not preceded by another parameter in the url).
So this will e.g. extract data from https://www.example.com/page.html?&utm_source=publisher&utm_medium=display&utm_campaign=standard&someotherparam but not from https://www.example.com/page.html?&utm_medium=display&utm_source=publisher&utm_campaign=standard&someotherparam. Since the parameter order is not consistent that doesn't work for me.
I have tried multiple conditions for the regexp separated by or (|), but that only ever gave me the first match. I have also tried to extract each parameter in its own extract command and then join the data, but that took ages and ended up duplicating the data.
So what would be the best (or at least a working) way to rewrite my pig command so that it extracts all three utm parameters from the urls independently of the order in which they appear?
I would simply have three REGEX_EXTRACT:
... FOREACH mydata GENERATE REGEX_EXTRACT(url, '.*utm_source=([^&]*)', 1) AS source:CHARARRAY
...
You could probably do it with just one regex, but I find this simpler and more readable.

Test multiple regex on each document

I am getting all documents from a MongoDB collection (millions of them), and I have a lot of regexes in PostgreSQL.
I want to test each regex until one matches, on multiple fields contained in the documents.
Do you have any idea how to do that?
I tried with a Filter Rows step, but I can't figure out how to loop over all the regexes from PostgreSQL.
You can solve your problem by using a Join rows (Cartesian product) component. One of your inputs will have to read in the docs, the other will have to read in the regular expressions. The join component will create a cross product from these, resulting in every possible combination of regexes and docs. You will have to feed this stream into the Filter Rows component and send the result to some output.
The following transformation will mimic this approach (it reads from CSV files, but that should not make any difference compared to reading from PostgreSQL or MongoDB):
The input data for "documents" is configured as follows:
The input data for "regular expressions" is configured as follows:
The Join Rows step does not have to be configured at all, since we will NOT provide a join condition, which effectively makes it a full Cartesian product.
In the Filter Rows component you will have to use the DOC_TEXT and REGEX_TEXT fields to execute the check based upon the REGEXP operator.
For this document input
DOC_ID;DOC_TEXT
1;DFGBGGG
2;UHLLJAL
3;JJJJHHH
4;FGAKKBL
and this regex input
REGEX_ID;REGEX_TEXT
1;.*A.*
2;.*B.*
the transformation will output the following result:
DOC_ID;DOC_TEXT;REGEX_ID;REGEX_TEXT
1;DFGBGGG;2;.*B.*
2;UHLLJAL;1;.*A.*
4;FGAKKBL;1;.*A.*
4;FGAKKBL;2;.*B.*
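For comparison, if both inputs were staged in PostgreSQL tables (docs and regexes are made-up names for this sketch), the same cross-product-plus-filter logic can be expressed directly in SQL with the ~ regex operator:
-- Sketch only: docs(doc_id, doc_text) and regexes(regex_id, regex_text)
-- are assumed staging tables; ~ is PostgreSQL's regex-match operator.
SELECT d.doc_id, d.doc_text, r.regex_id, r.regex_text
FROM docs d
CROSS JOIN regexes r
WHERE d.doc_text ~ r.regex_text;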

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row, which add placeholders for the Supervisor piece that's missing when the unique identifier is (2, 1).
Towards this end, I started putting together this pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
sorted = ORDER data BY key, value;
GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times - once per key-value pair).
Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if, for most rows, most of the columns are not filled out, a better solution IMO would be to use a map. The built-in TOMAP UDF combined with a custom UDF to merge maps would enable you to do this.
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect it is no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.