I am getting all documents from a MongoDB collection (millions of them), and I have a lot of regexes stored in PostgreSQL.
I want to test each regex against multiple fields contained in the documents, until one matches.
Do you have any idea how to do that?
I tried a Filter Rows step, but I can't figure out how to loop over all the regexes from PostgreSQL.
You can solve your problem by using a Join rows (cartesian product) step. One of your inputs has to read in the documents, the other has to read in the regular expressions. The join step builds the cross product of the two streams, resulting in every possible combination of regex and document. Feed this stream into a Filter Rows step and send the result to some output.
The following transformation mimics this approach (it reads from CSV files, but that should not make any difference compared to reading from PostgreSQL or MongoDB):
The input data for "documents" is configured as follows:
The input data for "regular expressions" is configured as follows:
The Join rows step does not have to be configured at all: since we do NOT provide a join condition, it effectively produces the full Cartesian product.
In the Filter Rows step you compare the DOC_TEXT and REGEX_TEXT fields using the REGEXP operator.
For this document input
DOC_ID;DOC_TEXT
1;DFGBGGG
2;UHLLJAL
3;JJJJHHH
4;FGAKKBL
and this regex input
REGEX_ID;REGEX_TEXT
1;.*A.*
2;.*B.*
the transformation will output the following result:
DOC_ID;DOC_TEXT;REGEX_ID;REGEX_TEXT
1;DFGBGGG;2;.*B.*
2;UHLLJAL;1;.*A.*
4;FGAKKBL;1;.*A.*
4;FGAKKBL;2;.*B.*
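For reference, here is a hedged sketch of the same cross-join-plus-regex logic in plain PostgreSQL. The table and column names are made up to mirror the CSV example above (in the real setup the documents come from MongoDB), so this is only meant to illustrate what the transformation computes:
-- Pair every document with every regex (Cartesian product), then keep only the matches
SELECT d.doc_id, d.doc_text, r.regex_id, r.regex_text
FROM documents d
CROSS JOIN regexes r
WHERE d.doc_text ~ r.regex_text;  -- ~ is PostgreSQL's regular-expression match operator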
I want to query a subset of a record in the bitcoin blockchain using the google bigquery database. I go here and click view dataset https://console.cloud.google.com/marketplace/details/bigquery-public-data/bitcoin-blockchain. Then, on the left sidebar, it seems you have to click the dropdown at 'bigquery-public-data', then click 'bitcoin_blockchain' then 'transactions'. Then on the right you have to click the button 'Query Table'. This is the only way I have found to select the table -- just copying and pasting the command below won't recreate the error.
Based on the table that appears after following the above instructions, I noticed that outputs is a RECORD type. I would like to view only one string from inside the record. The string is called output_pubkey_base58.
So I read the docs, and the docs imply the command would be:
SELECT outputs.output_pubkey_base58 FROM `bigquery-public-data.bitcoin_blockchain.transactions` LIMIT 1000;
I get an error: Cannot access value on Array<Struct<output_satoshis .... I also tried outputs[0].output_pubkey_base58, which didn't work.
The annoying thing is that this problem has the same shape as the first example in the docs, where they query the citiesLived.place field from inside the citiesLived record with the same kind of command: https://cloud.google.com/bigquery/docs/legacy-nested-repeated
You need to unnest the array into a new variable.
SELECT o.output_pubkey_base58
FROM `bigquery-public-data.bitcoin_blockchain.transactions`,
  UNNEST(outputs) AS o
LIMIT 1000
I feel the confusion here is about legacy SQL vs. standard SQL. In standard SQL you must use UNNEST, as described in the documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#differences_in_repeated_field_handling
Selecting nested repeated leaf fields
Using legacy SQL, you can "dot" into a nested repeated field without needing to consider where the repetition occurs. In standard SQL, attempting to "dot" into a nested repeated field results in an error.
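For comparison, a hedged sketch of both dialects. The legacy-SQL table reference below uses the usual bracket-and-colon convention and is my assumption, not something taken from the question:
-- Legacy SQL: repeated fields are implicitly flattened, so "dotting" into the leaf field works
SELECT outputs.output_pubkey_base58
FROM [bigquery-public-data:bitcoin_blockchain.transactions]
LIMIT 1000;
-- Standard SQL: the repeated field must be unnested first
SELECT o.output_pubkey_base58
FROM `bigquery-public-data.bitcoin_blockchain.transactions`, UNNEST(outputs) AS o
LIMIT 1000;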
I have a web app that allows users to save LaTeX content to a SQL Server 2012 database. I am running a full-text query as below to search for a LaTeX expression.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'2x-4=0');
The problem with the above query is that some of the messages it returns do not contain the LaTeX expression 2x-4=0. For example, a message whose saved value is shown below is also returned, even though it clearly does not contain 2x-4=0.
<p>Another example of inline Latex is \$x=34\$.</p>
<p>What are the roots of following equation: \$x^2 - 2x + 1 = 0\$?</p>
Question
Why is this happening, and is there a way to get the correct records returned when doing a full-text search for the LaTeX expression 2x-4=0? I have tried repopulating the full-text index for the table, but it had no effect.
UPDATE 1
Strangely, the following LaTeX expression filter always returns exact matches. Here I am searching for $2x-4=0$ rather than 2x-4=0.
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'$2x-4=0$');
My app uses two types of delimiters for LaTeX expressions: $$ for paragraph display and \$ for inline display, so there will always be a $ symbol surrounding a LaTeX expression stored in the database. The trailing delimiter could be \$, but full-text search seems to ignore the backslash character.
Why this modified query returns exact matches is not clear to me.
UPDATE 2
Another approach that works accurately is the one mentioned in the answer; the full query is below. The LIKE operator ends up scanning only those rows that were already selected by the full-text search.
WITH x AS
(
    SELECT MessageID, Message
    FROM Messages m
    WHERE CONTAINS (m.Message, N'2x-4=0')
)
SELECT MessageID, Message
FROM x
WHERE x.Message LIKE '%2x-4=0%'
To understand why this happens, you can run the following query (1033 is the English language ID):
select * from sys.dm_fts_parser('2x-4=0', 1033, 0,1)
In my instance, the parser output shows that all parts of the search criteria are treated as noise words except for 2x. Therefore, I suspect your full-text index simply does not contain the full 2x-4=0 string, and instead you get results with occurrences of 2x.
I tried adding 2x-4=0 to my own FTS index and CONTAINS was able to find it as the top result for both CONTAINS(col, '2x-4=0') and CONTAINS(col, '"2x-4=0"'). However, partial matches were included too right after the exact match.
Note that when extra whitespace is added around = in the search term, the FTS parser won't accept it and complains about a syntax error.
CONTAINS is more like an end-user search operation, with support for keywords like NEAR, AND and OR. Try adding quotes within the quotes, to force the exact search term:
SELECT MessageID, Message FROM Messages m WHERE CONTAINS (m.Message, N'"2x-4=0"');
This is called <simple-term> in the documentation.
You can also try the LIKE operator:
SELECT MessageID, Message FROM Messages m WHERE m.Message LIKE '%2x-4=0%';
But note that this is probably slower than CONTAINS because it doesn't use the full-text index. If it's too slow, you can even combine both in one query, so that CONTAINS uses the index to filter the result set down and LIKE then applies the final exact matching.
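A hedged sketch of that combination in a single query (the CTE form in UPDATE 2 above is the more explicit two-step variant; there is no strict guarantee of predicate evaluation order, but the optimizer will normally apply the full-text predicate via the index first):
SELECT MessageID, Message
FROM Messages m
WHERE CONTAINS(m.Message, N'"2x-4=0"')  -- full-text index narrows down candidate rows
  AND m.Message LIKE '%2x-4=0%';        -- exact substring check on those candidates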
I have an Azure DocumentDB collection with 100 documents. I have tokenized an array of search terms in each document for performing a keyword-based search.
I was able to search on just one keyword using below SQL query for DocumentDB:
SELECT VALUE c FROM root c JOIN word IN c.tags WHERE
CONTAINS(LOWER(word), LOWER('keyword'))
However, this only allows searching on a single keyword. I want to be able to search given multiple keywords. For this, I tried the query below:
SELECT * FROM c WHERE ARRAY_CONTAINS(c.tags, "Food") OR
ARRAY_CONTAINS(c.tags, "Dessert") OR ARRAY_CONTAINS(c.tags, "Spicy")
This works, but it is case-sensitive. How do I make it case-insensitive? I tried using the scalar function LOWER like this
LOWER(c.tags), LOWER("Dessert")
but this doesn't seem to work with ARRAY_CONTAINS.
Any idea how I can perform a case-insensitive search on multiple keywords using a DocumentDB SQL query?
Thanks,
AB
The best way to deal with the case sensitivity is to store the tags in the tags array all lower case (or all upper case) and then just apply LOWER(<user-input-tag>) at query time.
As for searching on multiple user-input tags, your approach of building a series of OR clauses is probably the best one.
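A minimal sketch of what that looks like, assuming the tags are stored lower-cased and the keywords are the same ones from the question; LOWER is applied only to the incoming keyword, so ARRAY_CONTAINS still compares against the stored lower-case values:
SELECT * FROM c
WHERE ARRAY_CONTAINS(c.tags, LOWER("Food"))
   OR ARRAY_CONTAINS(c.tags, LOWER("Dessert"))
   OR ARRAY_CONTAINS(c.tags, LOWER("Spicy"))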
I have a logfile with urls that are tagged with custom Google Analytics campaign parameters (utm_source, utm_medium, utm_campaign). I need to extract the parameters from the urls and create a csv file where source, medium and campaign appear each in their own column (plus several other fields from the logfile).
This is how I started (url is the field that contains the url obviously):
extracted = foreach mydata GENERATE date, time,
FLATTEN(REGEX_EXTRACT_ALL(url, '.*utm_source=(.*)&utm_medium=(.*)&utm_campaign=(.*)&.*?'))
AS (source:CHARARRAY, medium:CHARARRAY, campaign:CHARARRAY);
This works, but only as long as the parameters appear in a fixed order (and are not preceded by another parameter in the URL).
So this will, for example, extract data from https://www.example.com/page.html?&utm_source=publisher&utm_medium=display&utm_campaign=standard&someotherparam but not from https://www.example.com/page.html?&utm_medium=display&utm_source=publisher&utm_campaign=standard&someotherparam. Since the parameter order is not consistent, that doesn't work for me.
I have tried multiple alternatives in the regexp separated by or (|), but that only ever gave me the first match. I have also tried extracting each parameter with its own extract command and then joining the data, but that took ages and ended up duplicating the data.
So what would be the best (or at least a working) way to rewrite my Pig command so that it extracts all three utm parameters from the URLs independently of the order in which they appear?
I would simply use three REGEX_EXTRACT calls:
... FOREACH mydata GENERATE REGEX_EXTRACT(url, '.*utm_source=([^&]*).*', 1) AS source:CHARARRAY
...
You could probably do it with a single regex, but I find this simpler and more readable.
I am using the MS SQL Express full-text function CONTAINS to select data. However, when I selected the same data with the LIKE operator, I realised that CONTAINS misses a few rows.
Rebuilding the indexes didn't help.
Sql: brs.SearchText like '%aprilis%' and CONTAINS(brs.SearchText, '*aprilis*')
The contains function missed rows like:
22-28.aprīlis
[1.aprīlis]
Sīraprīlis
PS. If I search directly CONTAINS(brs.SearchText, '*22-28.aprīlis*'), then it finds them
CONTAINS is functionality based on the full-text index. It supports words, phrases, and prefix matches on words, but not suffix matches. So you can match words that start with 'aprilis', but not words that end with it or contain it somewhere in the middle. You might be able to take advantage of a thesaurus for these terms.
This is explained in more detail in the documentation.
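A hedged sketch of the distinction, reusing the brs alias from the question as if it were the table name (the actual table name is not shown there). A prefix term has to be wrapped in double quotes inside the CONTAINS search condition; an arbitrary substring match needs LIKE and cannot use the full-text index:
-- Prefix search: matches indexed words that start with the term
SELECT * FROM brs WHERE CONTAINS(brs.SearchText, N'"aprīlis*"');
-- Substring search: also matches the term in the middle of a word such as Sīraprīlis, but scans without the index
SELECT * FROM brs WHERE brs.SearchText LIKE N'%aprīlis%';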