Remove overlapping substrings within a BigQuery STRING field

I'm trying to find the most efficient way to remove overlapping substrings from a string field value on BigQuery. My use case is the same as Combining multiple regex substitutions but within BigQuery.
If I sum up the post above:
With the following list of substrings: ["quick brown fox", "fox jumps"]
I want:
A quick brown fox jumps over the lazy dog to be replaced by A over the lazy dog.
My thought was to write a JS UDF that does a job similar to what's described in the post above, i.e. create a mask of the whole string and loop over the substrings to identify which characters to remove. Do you have better ideas?
Thanks for your help

I couldn't find out how to do this in Standard SQL
Below is for BigQuery Standard SQL and does whole thing in one shot - just one [simple] query
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'A quick brown fox jumps over the lazy dog' text
), list AS (
SELECT ['quick brown fox', 'fox jumps'] phrases
)
SELECT text AS original_text, REGEXP_REPLACE(text, STRING_AGG(pattern, '|'), '') processed_text FROM (
SELECT DISTINCT text, SUBSTR(text, MIN(start), MAX(finish) - MIN(start) + 1) pattern FROM (
SELECT *, COUNTIF(flag) OVER(PARTITION BY text ORDER BY start) grp FROM (
SELECT *, start > LAG(finish) OVER(PARTITION BY text ORDER BY start) flag FROM (
SELECT *, start + phrase_len - 1 AS finish FROM (
SELECT *, LENGTH(cut) + 1 + OFFSET * phrase_len + IFNULL(SUM(LENGTH(cut)) OVER(win), 0) start
FROM `project.dataset.table`, list,
UNNEST(phrases) phrase,
UNNEST([LENGTH(phrase)]) phrase_len,
UNNEST(REGEXP_EXTRACT_ALL(text, r'(.+?)' || phrase)) cut WITH OFFSET
WINDOW win AS (PARTITION BY text, phrase ORDER BY OFFSET ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)))) GROUP BY text, grp
) GROUP BY text
with output
Row original_text processed_text
1 A quick brown fox jumps over the lazy dog A over the lazy dog
I tested above with few more complex / tricky texts and it still worked
Brief explanation:
1. gather all occurrences of the phrases in the list, with their respective starts and ends
2. combine overlapping fragments and calculate their respective starts and ends
3. extract new fragments based on the starts and ends from step 2
4. order them by length DESC and generate a regular expression
5. finally do REGEXP_REPLACE using the regular expression generated in step 4
Above might look messy - but in reality it does all above in one query and in pure SQL
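For readers who find the nested query hard to follow, here is a minimal JavaScript sketch of the same idea (collect occurrences, merge overlapping spans, cut them out). It is not the query above translated line by line: the name removeOverlapping is hypothetical, and it slices the merged spans out directly instead of building a regular expression.

```javascript
// Hypothetical helper, for illustration only: remove every occurrence of the
// given phrases, treating overlapping occurrences as one span.
function removeOverlapping(text, phrases) {
  // Step 1: collect [start, end) for every occurrence of every phrase.
  const spans = [];
  for (const phrase of phrases) {
    if (phrase.length === 0) continue; // an empty phrase would never advance
    let from = 0;
    let at;
    while ((at = text.indexOf(phrase, from)) !== -1) {
      spans.push([at, at + phrase.length]);
      from = at + 1; // advance by one so self-overlapping hits are found too
    }
  }
  // Step 2: sort by start and merge spans that overlap.
  spans.sort((x, y) => x[0] - y[0] || x[1] - y[1]);
  const merged = [];
  for (const [start, end] of spans) {
    const last = merged[merged.length - 1];
    if (last && start < last[1]) {
      last[1] = Math.max(last[1], end);
    } else {
      merged.push([start, end]);
    }
  }
  // Remaining steps: keep only the text between the merged spans.
  let out = '';
  let pos = 0;
  for (const [start, end] of merged) {
    out += text.slice(pos, start);
    pos = end;
  }
  return out + text.slice(pos);
}
```

Note that nothing here collapses whitespace, so the space on each side of a removed span is kept; add a whitespace cleanup step if you need a single space.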

Using a custom JS UDF seems to work, but I've seen BigQuery run faster!
CREATE FUNCTION `myproject.mydataset.keyword_remover_js`(label STRING) RETURNS STRING LANGUAGE js AS """
var keywords = ["quick brown fox", "fox jumps"];
var mask = new Array(label.length).fill(1);
var reg = new RegExp("(" + keywords.join("|") + ")", 'g');
var found;
while (found = reg.exec(label)) {
for (var i = found.index; i < reg.lastIndex; i++) {
mask[i] = 0;
}
reg.lastIndex = found.index+1;
}
var result = []
for (var i = 0; i < label.length; i++) {
if (mask[i]) {
result.push(label[i])
}
}
return result.join('').replace(/ +/g,' ').replace(/^ +| +$/g,'')
""";

Related

Longest Common SubString, BigQuery, SQL

Given a table with two string columns:
A | B
John likes to go jumping | Max likes swimming but he also likes to go jumping
John is cool | max is smart
John | max
In BigQuery SQL, how can I find the longest common substring, such that I get:
A | B | C
John likes to go jumping | Max likes swimming but he also likes to go jumping | likes to go jumping
John is cool | max is smart | is
John | max | null

Try below very much SQL'ish approach
select A, B,
(
select string_agg(word, ' ' order by a_pos) phrase
from unnest(split(A, ' ')) word with offset a_pos
join unnest(split(B, ' ')) word with offset b_pos
using(word)
group by b_pos - a_pos
order by length(phrase) desc
limit 1
) as C
from `project.dataset.table`
When applied to the sample data in your question, the output matches the expected column C.
Obviously your example is very simple, so in a real use case you might need to adjust the above to reflect reality.
Also note: there are many other approaches to this problem that SO already has multiple answers for, including mine. For text similarity they are mostly based on a JS UDF with Levenshtein distance or similar algorithms.
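To make the grouping trick easier to see, here is a rough JavaScript mirror of the query: matching words are bucketed by the difference of their positions, and the longest bucket wins. The function name longestSharedPhrase is made up for this sketch; it is meant to show the mechanics, not to replace the SQL.

```javascript
// Hypothetical JavaScript mirror of the SQL above, for illustration only.
function longestSharedPhrase(a, b) {
  const wordsA = a.split(' ');
  const wordsB = b.split(' ');
  // Bucket matching words by (bPos - aPos), like GROUP BY b_pos - a_pos.
  const groups = new Map();
  wordsA.forEach((word, aPos) => {
    wordsB.forEach((other, bPos) => {
      if (word === other) {
        const key = bPos - aPos;
        if (!groups.has(key)) groups.set(key, []);
        // aPos only increases, so each bucket is already ordered by a_pos.
        groups.get(key).push(word);
      }
    });
  });
  // Longest phrase by character length, like ORDER BY LENGTH(phrase) DESC LIMIT 1.
  let best = null;
  for (const words of groups.values()) {
    const phrase = words.join(' ');
    if (best === null || phrase.length > best.length) best = phrase;
  }
  return best; // null when no word is shared
}
```

One thing the sketch makes visible: words in the same bucket are not required to be adjacent. For 'a x b' and 'a y b' both 'a' and 'b' have difference 0, so the result is 'a b' even though that is not a common substring. If that matters for your data, you need an extra contiguity check.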

This is probably not a problem for SQL to solve (though it is very simple to solve in any scripting language). However, BigQuery does support JS-based UDFs, which usually come in handy for such problems.
Here is an option (which at its core is not SQL) that you can use in BigQuery:
CREATE TEMP FUNCTION lcsub(a STRING, b STRING)
RETURNS STRING
LANGUAGE js AS """
  // Longest run of whole words (space-delimited) shared by a and b.
  if (a === null || b === null) return null;
  a = a.split(' ');
  b = b.split(' ');
  let la = a.length;
  let lb = b.length;
  let output = [];
  for (var i = 0; i < la; i++) {
    for (var j = 0; j < lb; j++) {
      if (a[i] == b[j]) {
        // Extend the match that starts at a[i] / b[j] one word at a time.
        let len = 1;
        while (i + len < la && j + len < lb && a[i + len] == b[j + len]) {
          len++;
        }
        // Also covers a match on the last word of b, which has nothing to extend.
        if (len >= output.length) {
          output = b.slice(j, j + len);
        }
      }
    }
  }
  // Return NULL rather than an empty string when nothing is shared.
  return output.length ? output.join(' ') : null;
""";
select A, B, lcsub(A, B) as C from dataset.table

Add array of other records from the same table to each record

My project is a Latin language learning app. My DB has all the words I'm teaching, in the table 'words'. It has the lemma (the main form of the word), along with the definition and other information the user needs to learn.
I show one word at a time for them to guess/remember what it means. The correct word is shown along with some wrong words, like:
What does Romanus mean? Greek - /Roman/ - Phoenician - barbarian
What does domus mean? /house/ - horse - wall - senator
The wrong options are randomly drawn from the same table, and must be from the same part of speech (adjective, noun...) as the correct word; but I am only interested in their lemma. My return value looks like this (some properties omitted):
[
{ lemma: 'Romanus', definition: 'Roman', options: ['Greek', 'Phoenician', 'barbarian'] },
{ lemma: 'domus', definition: 'house', options: ['horse', 'wall', 'senator'] }
]
What I am looking for is a more efficient way of doing it than my current approach, which runs a new query for each word:
// All the necessary requires are here
class Word extends Model {
static async fetch() {
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: ['lemma', 'definition'], // also a few other columns I need
});
const wordsWithOptions = await Promise.all(words.map(this.addOptions.bind(this)));
return wordsWithOptions;
}
static async addOptions(word) {
const options = await this.findAll({
order: [Sequelize.literal('RANDOM()')],
limit: 3,
attributes: ['lemma'],
where: {
partOfSpeech: word.dataValues.partOfSpeech,
lemma: { [Op.not]: word.dataValues.lemma },
},
});
return { ...word.dataValues, options: options.map((row) => row.dataValues.lemma) };
}
}
So, is there a way I can do this with raw SQL? How about Sequelize? It would even help just to have a name for what I'm trying to do, so that I can Google it.
EDIT: I have tried the following and at least got somewhere:
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: {
include: [[sequelize.literal(`(
SELECT lemma FROM words AS options
WHERE "partOfSpeech" = "options"."partOfSpeech"
ORDER BY RANDOM() LIMIT 1
)`), 'options']],
},
});
Now, there are two problems with this. First, I only get one option, when I need three; but if the query has LIMIT 3, I get: SequelizeDatabaseError: more than one row returned by a subquery used as an expression.
The second error is that while the code above does return something, it always gives the same word as an option! I thought to remedy that with WHERE "partOfSpeech" = "options"."partOfSpeech", but then I get SequelizeDatabaseError: invalid reference to FROM-clause entry for table "words".
So, how do I tell PostgreSQL "for each row in the result, add a column with an array of three lemmas, WHERE existingRow.partOfSpeech = wordToGoInTheArray.partOfSpeech?"

Revised
Well that seems like a different question and perhaps should be posted that way, but...
The main technique remains the same: JOIN instead of sub-select. The difference is generating the list of lemmas and then piping them into the initial query. In a single statement this can get nasty.
As single statement (actually this turned out not to be too bad):
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma in( select lemma
from words
order by random()
limit 4 --<<< replace with parameter
)
group by w.lemma, w.definition;
The other approach builds a small SQL function to randomly select a specified number of lemmas. This selection is then piped into the (renamed) function from the previous fiddle.
create or replace
function exam_lemma_definition_options(lemma_array_in text[])
returns table (lemma text
,definition text
,option text[]
)
language sql strict
as $$
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(lemma_array_in)
group by w.lemma, w.definition;
$$;
create or replace
function exam_lemmas(num_of_lemmas integer)
returns text[]
language sql
strict
as $$
select string_to_array(string_agg(lemma,','),',')
from (select lemma
from words
order by random()
limit num_of_lemmas
) ll
$$;
Using this approach your calling code reduces to a single SQL statement:
select *
from exam_lemma_definition_options(exam_lemmas(4))
order by lemma;
This permits you to specify the number of lemmas to select (in this case 4), limited only by the number of rows in the words table. See revised fiddle.
Original
Instead of using a sub-select to get the option words just JOIN.
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(array['Romanus', 'domus'])
group by w.lemma, w.definition;
See fiddle. Obviously this will not necessarily produce the same options as your question shows, due to the random() selection, but it will get matching parts of speech. I will leave translation to your source language to you; or you can use the function option and reduce your SQL to a simple "select *".
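If it helps to see what the lateral join computes, here is a small in-memory JavaScript sketch of the same per-row logic. It is only an illustration: the function name addOptions, the optional random parameter, and the row shape { lemma, definition, partOfSpeech } are assumptions based on the question, and in practice you would let PostgreSQL do this work.

```javascript
// Illustrative in-memory equivalent of the JOIN LATERAL query (hypothetical helper).
function addOptions(words, chosenLemmas, optionCount = 3, random = Math.random) {
  return words
    .filter((w) => chosenLemmas.includes(w.lemma)) // WHERE w.lemma = ANY(...)
    .map((w) => {
      // The "lateral" part: the candidate list depends on the current outer row w.
      const candidates = words.filter(
        (o) => o.partOfSpeech === w.partOfSpeech && o.lemma !== w.lemma
      );
      // Fisher-Yates shuffle, then take the first few (ORDER BY random() LIMIT 3).
      for (let i = candidates.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
      }
      return {
        lemma: w.lemma,
        definition: w.definition,
        options: candidates.slice(0, optionCount).map((o) => o.definition),
      };
    });
}
```

The point of LATERAL is exactly the inner filter: the subquery on the right is re-evaluated for each row on the left, which a plain correlated sub-select in the SELECT list cannot do when it has to return more than one row.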

OPENJSON - modify statement to ignore first part of the string

We receive auto-generated emails from an application, and we export those to our database as they arrive at the Inbox. The table is called dbo.MailArchive.
Up until recently, the body of the email has always looked like this...
Status: Completed
Successful actions count: 250
Page load count: 250
...except with different numbers and statuses. Note that there is a carriage return on the blank line after Page load count.
The entirety of this data gets written to a field called Mail_Body - then we run the following statement using OPENJSON to parse those lines into their own columns in the record:
DECLARE @PI varchar(7) = '%[^' + CHAR(13) + CHAR(10) + ']%';
SELECT j.Status,
j.Successful_Actions_Count,
j.Page_Load_Count
FROM dbo.MailArchive m
CROSS APPLY(VALUES(REVERSE(m.Mail_Body),PATINDEX(@PI,REVERSE(m.Mail_Body)))) PI(SY,I)
CROSS APPLY(VALUES(REVERSE(STUFF(PI.SY,1,PI.I,''))))S(FixedString)
CROSS APPLY OPENJSON (CONCAT('{"', REPLACE(REPLACE(S.FixedString, ': ', '":"'), CHAR(13) + CHAR(10), '","'), '"}'))
WITH (Status varchar(100) '$.Status',
Successful_Actions_Count int '$."Successful actions count"',
Page_Load_Count int '$."Page load count"') j;
Beginning today, there are certain emails where the body of the email looks like this:
Agent did not meet defined success criteria on this run.
Status: Completed
Successful actions count: 250
Page load count: 250
To clarify, that's one new line at the top, a carriage return at the end of that line, and a carriage return on the blank line between the new line and the Status line. At this time, there is no consistent way to predict which emails will come in with this new line, and which ones won't.
How can I modify our OPENJSON statement to say, If this first line exists in the body, skip/ignore it and parse lines 3 through 5, else just do exactly what I have above? Or perhaps even better to future-proof it, always ignore everything before the word Status?

Since your data has new leading and trailing rows, I think a simple aggregation in concert with a string_split() and a CROSS APPLY would be more effective than my previous XML answer and the current JSON approach
Example or dbFiddle
Select A.ID
,Status = stuff(Pos1,1,charindex(':',Pos1),'')
,Action = try_convert(int,stuff(Pos2,1,charindex(':',Pos2),''))
,PageCnt = try_convert(int,stuff(Pos3,1,charindex(':',Pos3),''))
From YourTable A
Cross Apply (
Select [Pos1] = max(case when Value like 'Status:%' then value end)
,[Pos2] = max(case when Value like '%actions count:%' then value end)
,[Pos3] = max(case when Value like 'Page load count:%' then value end)
From string_split(SomeCol,char(10))
) B
Returns
ID Status Action PageCnt
1 Completed 250 250
Note: Use an OUTER APPLY if you want to see NULLs
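The idea behind the query (split the body into lines, then pick each value by what its line starts with, ignoring everything else) can be sketched outside SQL as well. The JavaScript below is only an illustration; parseMailBody, toInt and the result field names are hypothetical.

```javascript
// Hypothetical parser mirroring the string_split + conditional aggregation idea.
function parseMailBody(body) {
  const result = { status: null, successfulActions: null, pageLoads: null };
  for (const rawLine of body.split('\n')) {
    const line = rawLine.trim(); // also drops a trailing carriage return
    const colon = line.indexOf(':');
    if (colon === -1) continue; // e.g. the new "Agent did not meet..." line
    const value = line.slice(colon + 1).trim();
    if (line.startsWith('Status:')) {
      result.status = value;
    } else if (line.includes('actions count:')) {
      result.successfulActions = toInt(value);
    } else if (line.startsWith('Page load count:')) {
      result.pageLoads = toInt(value);
    }
  }
  return result;
}

// Returns a number for a plain integer string, otherwise null (like try_convert).
function toInt(text) {
  return /^-?\d+$/.test(text) ? Number(text) : null;
}
```

Because each value is located by its label rather than by its line number, extra lines before or after the three known ones do not affect the result.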

Find Substring - SQL

I need to find a substring that is in a text field that is actually partially xml. I tried converting it to xml and then use the .value method but to no avail.
The element(substring) I am looking for is a method name that looks like this:
AssemblyQualifiedName="IPMGlobal.CRM2011.IPM.CustomWorkflowActivities.ProcessChildRecords,
where the method at the end, "ProcessChildRecords", could be another name such as "SendEmail". I know I can use "CustomWorkflowActivities." and the , (comma) to find the substring (method name), but I'm not sure how to accomplish it. In addition, there may be more than one instance of "CustomWorkflowActivities.<method>" listed.
Some Clarifications:
Below is my original query. It returns the first occurrence in each row but no additional ones. For example, I might have in the string '...IPM.CustomWorkflowActivities.ProcessChildRecords...' and '...IPM.CustomWorkflowActivities.GetworkflowContext...'
The current query only returns: Approve Time Process, ipm_mytimesheetbatch, ProcessChildRecords
SELECT WF.name WFName,
(
SELECT TOP 1 Name
FROM entity E
WHERE WF.primaryentity = E.ObjectTypeCode
) Entity,
Convert(xml, xaml) Xaml,
SUBSTRING(xaml, Charindex('CustomWorkflowActivities.', xaml) + Len('CustomWorkflowActivities.'), Charindex(', IPMGlobal.CRM2011.IPM.CustomWorkflowActivities, Version=1.0.0.0', xaml) - Charindex('CustomWorkflowActivities.', xaml) - Len('CustomWorkflowActivities.'))
FROM FilteredWorkflow WF
WHERE 1 = 1
AND xaml LIKE '%customworkflowactivities%'
AND statecodename = 'Activated'
AND typename = 'Definition'
ORDER BY NAME

If you are using Oracle you could use REGEXP function:
WITH cte(t) as (
SELECT 'AssemblyQualifiedName="IPMGlobal.CRM2011.IPM.CustomWorkflowActivities.ProcessChildRecords,' FROM dual
)
SELECT t,
regexp_replace(t, '.*CustomWorkflowActivities.(.+)\,.*', '\1') AS r
FROM cte;
DBFiddle Demo
SQL Server:
WITH cte(t) as (
SELECT 'AssemblyQualifiedName="IPMGlobal.CRM2011.IPM.CustomWorkflowActivities.ProcessChildRecords,asfdsa'
)
SELECT t,SUBSTRING(t, s, CHARINDEX(',', t, s)-s)
FROM (SELECT t, PATINDEX( '%CustomWorkflowActivities.%', t) + LEN('CustomWorkflowActivities.') AS s
FROM cte
) sub;
DBFiddle Demo 2
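Both queries above return only the first method name per row. As a rough illustration of the pattern needed to pick up every occurrence (text between "CustomWorkflowActivities." and the next comma), here is a JavaScript sketch. It runs outside the database, and the function name extractActivityNames is made up for the example.

```javascript
// Hypothetical helper: list every method name that follows
// "CustomWorkflowActivities." and is terminated by a comma.
function extractActivityNames(xaml) {
  // The name itself may not contain a comma, dot, quote or whitespace.
  const pattern = /CustomWorkflowActivities\.([^,."\s]+),/g;
  return Array.from(xaml.matchAll(pattern), (m) => m[1]);
}
```

The assembly name "IPMGlobal.CRM2011.IPM.CustomWorkflowActivities, Version=..." is not picked up, because there "CustomWorkflowActivities" is followed by a comma rather than a dot.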

bigquery url decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?

Below is built on top of @sigpwned's answer, but slightly refactored and wrapped in a SQL UDF (which does not have the limitations that a JS UDF has, so it is safe to use)
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
can be tested with example from question as below
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with result
Row url column_name
1 http://www.example.com/hello?v=12345 http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
Update: a further optimized SQL UDF
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
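The technique in this last version (treat each run of consecutive %XX escapes as one chunk of bytes, so multi-byte UTF-8 characters decode together, and pass everything else through) may be easier to follow in a few lines of JavaScript. This is only a sketch of the idea; the function name urlDecode is hypothetical and it relies on the TextDecoder global of a modern runtime such as Node.js.

```javascript
// Hypothetical JavaScript mirror of the SQL UDF, for illustration only.
function urlDecode(encoded) {
  // Same shape as the SQL pattern: runs of %XX escapes, or runs of non-% text.
  // A stray % that is not followed by two hex digits matches neither branch
  // and is dropped, as with REGEXP_EXTRACT_ALL in the SQL version.
  const pieces = encoded.match(/(?:%[0-9a-fA-F]{2})+|[^%]+/g) || [];
  const decoder = new TextDecoder('utf-8');
  return pieces
    .map((piece) => {
      if (!piece.startsWith('%')) return piece; // literal text, keep as is
      // "%c5%82" -> ["c5", "82"] -> bytes 0xC5 0x82 -> one UTF-8 character
      const bytes = Uint8Array.from(piece.slice(1).split('%'), (hex) => parseInt(hex, 16));
      return decoder.decode(bytes);
    })
    .join('');
}
```

Grouping consecutive escapes matters for non-ASCII text: decoding %c5 and %82 separately would give two invalid byte sequences instead of one character.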

It's a good feature request, but currently there is no built in BigQuery function that provides URL decoding.

One more workaround is using a user-defined function.
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
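try {
// decodeURIComponent also decodes reserved characters such as %3A, %2F, %3F and %3D, which decodeURI leaves encoded
return decodeURIComponent(enc);
} catch (e) { return null; }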
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
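One thing worth checking before relying on a JS UDF here is which built-in decoder it calls. A small sketch, using the encoded value from the question (safeDecode is a hypothetical wrapper name):

```javascript
// The "url" parameter value from the question.
const encoded = 'http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345';

// decodeURI is meant for whole URIs, so it leaves escapes of reserved
// characters (; / ? : @ & = + $ , #) untouched. Here that is every escape.
const viaDecodeURI = decodeURI(encoded);

// decodeURIComponent decodes every escape; malformed input throws URIError,
// which this wrapper turns into null.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (e) {
    return null;
  }
}

const viaComponent = safeDecode(encoded);
```

For this input viaDecodeURI is identical to the encoded string, while viaComponent is the decoded URL, so decodeURIComponent is the one to use for a single extracted query parameter.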

I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b'')) FROM (SELECT
id,
ARRAY_AGG(CASE
WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
ELSE CAST(y AS bytes)
END ORDER BY i) AS ps
FROM (SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x) AS x
CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i GROUP BY id);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
EDIT: @MikhailBerylyant's post has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!
