BigQuery - remove duplicate substrings from string - google-bigquery

I have a string which contains repeated substrings like
The big black dog big black dog is a friendly friendly dog who who lives nearby nearby
How can I remove these in BQ so that the result looks like this
The big black dog is a friendly dog who lives nearby
I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase
with text as (
select "The big black dog big black dog is a friendly
friendly dog who who lives nearby nearby" as phrase
)
select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text
Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase
i.e. dog dog is considered a repetition of dog,
but dog good dog is not
similarly for phrases:
good dog good dog is repetition of good dog
but good dog catch ball good dog is not considered a repetition

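In addition to @Mikhail's wonderful SQL-ish approach, you might consider the regexp JS UDF approach below.
The regular expression captures a run of at least two characters followed by a space, and matches only when the same run follows immediately (a lookahead on the backreference). Each match is replaced with an empty string, so only the last copy of a repeated run is kept.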
CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
const re = /(.{2,}) (?=\\1)/g;
return s.replace(re, '');
""";
WITH sample_data AS (
SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
SELECT 'good dog good dog' UNION ALL
SELECT 'good dog good dog good dog' UNION ALL
SELECT 'good dog good dog good' UNION ALL
SELECT 'dog dog dog' UNION ALL
SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;
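For readers who want to try the pattern outside BigQuery, here is a minimal sketch of the same regular expression in Python. Python's re module is used as a stand-in for the JS engine; the function name mirrors the UDF above, everything else is illustrative.

```python
import re

# Same pattern as the JS UDF: a run of 2+ characters, a space,
# and a lookahead requiring the same run to follow immediately.
REPEAT = re.compile(r"(.{2,}) (?=\1)")

def dedup_repetition(s: str) -> str:
    # Remove every copy that is directly followed by an identical copy.
    return REPEAT.sub("", s)

phrase = ("The big black dog big black dog is a friendly friendly dog "
          "who who lives nearby nearby")
print(dedup_repetition(phrase))
```

Note that the pattern works on characters rather than whole words, so an input such as "the then" is also shortened (to "then"); adding word boundaries may be worth considering if that matters for your data.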

Consider below naive approach
with recursive candidates as (
select id, phrase, offset_a, len, string_agg(word_b, ' ' order by offset_b) seq
from your_table, unnest([0,1,2,3,4,5]) as len,
unnest(split(phrase, ' ')) word_a with offset offset_a
join unnest(split(phrase, ' ')) word_b with offset offset_b
on offset_b between offset_a and offset_a + len
group by id, phrase, offset_a, len
), dups as (
select id,phrase, seq, row_number() over(partition by phrase order by array_length(split(seq, ' ')) desc) as pos
from (
select distinct t1.id, t1.phrase, t1.offset_a, t1.seq
from candidates t1
join candidates t2
on t1.seq = t2.seq
and t1.phrase = t2.phrase
and t1.offset_a = t2.offset_a + t2.len + 1
)
), iterations as (
select id, phrase, 1 as pos, phrase || ' ' as dedupped_phrase from your_table
union all
select i.id, i.phrase, pos + 1, regexp_replace(dedupped_phrase, r'(' || seq || ' )+', '' || seq || ' ')
from iterations i
join dups d
using(pos, phrase, id)
)
select id, phrase, trim(dedupped_phrase) as dedupped_phrase
from iterations
qualify 1 = row_number() over(partition by phrase order by pos desc)
So, assuming your_table is
the output is
Step 1 - the CTE named candidates collects all possible word sequences to consider in the next step. Note: unnest([0,1,2,3,4,5]) as len sets the possible lengths of the sequences to consider; because the join takes offsets from offset_a through offset_a + len, these values correspond to sequences of 1 to 6 words. You can use unnest(generate_array(0,5)) as len instead if you want to save some typing.
Step 2 - the CTE named dups identifies the duplicate sequences obtained in the previous step and ranks them from longest (by word count) to shortest.
Step 3 - the CTE named iterations eliminates the dups one by one (for every phrase).
Step 4 - finally, the last SELECT retrieves only the last iteration for each initial phrase.
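The idea behind these steps (look for adjacent repeated word sequences, longest first, and collapse them) can be sketched procedurally. The Python below is not a translation of the SQL, only a rough illustration of the same strategy; the function name and the max_len parameter are hypothetical.

```python
def dedup_word_runs(phrase: str, max_len: int = 6) -> str:
    """Collapse adjacent repeated word sequences, longest sequences first."""
    words = phrase.split()
    # Try sequence lengths from max_len words down to a single word.
    for n in range(max_len, 0, -1):
        i = 0
        while i + 2 * n <= len(words):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                # The next n words repeat the current n words: drop the repeat
                # and re-check the same position for further repeats.
                del words[i + n:i + 2 * n]
            else:
                i += 1
    return " ".join(words)

print(dedup_word_runs(
    "The big black dog big black dog is a friendly friendly dog "
    "who who lives nearby nearby"
))
```

Working on whole words avoids partial-word matches, at the cost of having to pick an upper bound on the sequence length, just as the SQL version does with len.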

Related

Big Query SQL concatenating or linking values in piped strings or arrays of strings

I have two fields in a big BQ table. Both fields are strings. Both strings are formatted to represent several values separated by pipes, the same number of values in each string. I need to associate each value in the first set with each value in the second set. For example:
id names nums
x "a|b|c" "3|9|5"
y "d" "1"
z "e|f" "4|7"
I need to come out with a result like:
x,a,3
x,b,9
x,c,5
y,d,1
z,e,4
z,f,7
The second input string is a sequence of numbers but I don't mind if this comes out as numeric or string (I'll figure out the casting).
It seems obvious that I need to use split() at some point and convert the strings to arrays, but how can I concatenate the arrays from left to right rather than length-wise?
I know I can use a double unnest with an index and select only where the indexes are equal, however I don't want to use this method because it makes the already large input table massive (before selecting the equal indexes).
Thanks for any thoughts!
Here is one method that generates a series of subscripts and uses arrays to extract the values:
select id, split(names, '|')[safe_ordinal(n)] as names,
split(nums, '|')[safe_ordinal(n)] as nums
from (select 'x' as id, 'a|b|c' as names, '3|9|5' as nums union all
select 'y', 'd', '1' union all
select 'z', 'e|f', '4|7'
) t cross join
unnest(generate_array(1, array_length(split(names, '|')))) as n;
This uses the length of names to determine how many values there are.
Alternative option for BigQuery Standard SQL
#standardSQL
SELECT id, name, num
FROM `project.dataset.table`,
UNNEST(SPLIT(names, '|')) name WITH OFFSET
JOIN UNNEST(SPLIT(nums, '|')) num WITH OFFSET
USING(OFFSET)
If applied to the sample data in your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'x' id, 'a|b|c' names, '3|9|5' nums UNION ALL
SELECT 'y', 'd', '1' UNION ALL
SELECT 'z', 'e|f', '4|7'
)
SELECT id, name, num
FROM `project.dataset.table`,
UNNEST(SPLIT(names, '|')) name WITH OFFSET
JOIN UNNEST(SPLIT(nums, '|')) num WITH OFFSET
USING(OFFSET)
the result is
Row id name num
1 x a 3
2 x b 9
3 x c 5
4 y d 1
5 z e 4
6 z f 7
The above can be refactored to use a UDF so that the main query becomes simple and readable, as in the example below
#standardSQL
CREATE TEMP FUNCTION xxx(arr1 ANY TYPE, arr2 ANY TYPE)
RETURNS ARRAY<STRUCT<name STRING, num STRING>> AS (
ARRAY(
SELECT AS STRUCT el1, el2
FROM UNNEST(SPLIT(arr1, '|')) el1 WITH OFFSET
JOIN UNNEST(SPLIT(arr2, '|')) el2 WITH OFFSET
USING(OFFSET)
));
SELECT id, x.*
FROM `project.dataset.table`, UNNEST(xxx(names, nums)) x
obviously with the same output
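Both answers pair the values by position (ordinal or offset) instead of building a full cross product. As a rough illustration of that pairing outside SQL, here is a small Python sketch using the sample rows from the question; the function name explode is hypothetical.

```python
rows = [
    ("x", "a|b|c", "3|9|5"),
    ("y", "d", "1"),
    ("z", "e|f", "4|7"),
]

def explode(rows):
    """Pair the i-th name with the i-th number for every input row."""
    out = []
    for row_id, names, nums in rows:
        # zip walks both split lists in step, like joining on the same OFFSET.
        for name, num in zip(names.split("|"), nums.split("|")):
            out.append((row_id, name, num))
    return out

for row in explode(rows):
    print(",".join(row))
```

Each input row yields only as many output rows as it has values, which is the property the question asks for.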

Specific string matching

I am working in SQL Server 2012. In my table, there is a column called St_Num and its data is like this:
St_Num status
------------------------------
128 TIMBER RUN DR EXP
128 TIMBER RUN DRIVE EXP
Notice that there are spelling variations in the data above. If the number (128 in this case) and the first 3 letters in the St_Num column are the same, both rows should be considered the same, so the output should be:
St_Num status
-----------------------------
128 TIMBER RUN DR EXP
I did some search regarding this and found that left or substring function can be handy here but I have no idea how they will be used here to get what I need and don't know even if they can solve my issue. Any help regarding how to get the desired output would be great.
This will output only the first of the matching rows:
with cte as (
select *,
row_number() over (order by (select null)) rn
from tablename
)
select St_Num, status from cte t
where not exists (
select 1 from cte
where
left(St_Num, 7) = left(t.St_Num, 7)
and
rn < t.rn
)
This could possibly be done by using a subquery in the same way that you would eliminate duplicates in a table so:
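SELECT St_Num, status
FROM <your_table> a
WHERE NOT EXISTS (SELECT 1
FROM <your_table> b
WHERE SUBSTRING(b.St_Num, 1, 7) = SUBSTRING(a.St_Num, 1, 7)
-- every row matches itself on the prefix, so compare only against shorter rows
AND LEN(b.St_Num) < LEN(a.St_Num));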
This would only work however if the number is guaranteed to be 3 characters long, or if you don't mind it taking more characters in the case that the number is fewer characters.
You can use grouping by status and substring(St_Num,1,3)
with t(St_Num, status) as
(
select '128 TIMBER RUN DR' ,'EXP' union all
select '128 TIMBER RUN DRIVE','EXP'
)
select min(St_Num) as St_Num, status
from t
group by status, substring(St_Num,1,3);
St_Num status
----------------- ------
128 TIMBER RUN DR EXP
I don't really approve of your matching logic . . . but that is not your question. The big issue is how long is the number before the string. So, you can get the shortest of the addresses using:
select distinct t.*
from t
where not exists (select 1
from t t2
where left(t2.st_num, patindex('%[a-zA-Z]%', t2.st_num) + 2) = left(t.st_num, patindex('%[a-zA-Z]%', t.st_num) + 2) and
len(t2.st_num) < len(t.st_num)
);
I still have odd feeling that your criteria is not enough to match same addresses but this might help, since it considers also length of the number:
WITH ParsedAddresses(st_num, status, number)
AS
(
SELECT st_num,
status,
number = ROW_NUMBER() OVER(PARTITION BY LEFT(st_num, CHARINDEX(' ', st_num) + 3) ORDER BY LEN(st_num))
FROM <table_name>
)
SELECT st_num, status FROM ParsedAddresses
WHERE number = 1
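The matching rule used in these answers (leading number plus the first three letters of the street name, keeping the shortest spelling) can be sketched in Python as follows. This is only an illustration under that assumption; match_key and shortest_per_key are hypothetical names, and the rule itself may well be too loose for real address data, as noted above.

```python
import re

def match_key(address: str) -> str:
    """Build the comparison key: house number + first three letters after it."""
    m = re.match(r"\s*(\d+)\s+([A-Za-z]{1,3})", address)
    # Fall back to the whole string when the address has no leading number.
    return f"{m.group(1)} {m.group(2).upper()}" if m else address

def shortest_per_key(addresses):
    """Keep the shortest address for every key, in first-seen key order."""
    best = {}
    for address in addresses:
        key = match_key(address)
        if key not in best or len(address) < len(best[key]):
            best[key] = address
    return list(best.values())

print(shortest_per_key(["128 TIMBER RUN DR", "128 TIMBER RUN DRIVE"]))
```

Because the key is derived from wherever the number ends, this does not depend on the number being exactly three digits long.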

SQL unique combinations

I have a table with three columns with an ID, a therapeutic class, and then a generic name. A therapeutic class can be mapped to multiple generic names.
ID therapeutic_class generic_name
1 YG4 insulin
1 CJ6 maleate
1 MG9 glargine
2 C4C diaoxy
2 KR3 supplies
3 YG4 insulin
3 CJ6 maleate
3 MG9 glargine
I need to first look at the individual combinations of therapeutic class and generic name and then want to count how many patients have the same combination. I want my output to have three columns: one being the combo of generic names, the combo of therapeutic classes and the count of the number of patients with the combination like this:
Count Combination_generic combination_therapeutic
2 insulin, maleate, glargine YG4, CJ6, MG9
1 supplies, diaoxy C4C, KR3
One way to match patients by the sets of pairs (therapeutic_class, generic_name) is to create the comma-separated strings in your desired output, and to group by them and count. To do this right, you need a way to identify the pairs. See my Comment under the original question and my Comments to Gordon's Answer to understand some of the issues.
I do this identification in some preliminary work in the solution below. As I mentioned in my Comment, it would be better if the pairs and unique ID's existed already in your data model; I create them on the fly.
Important note: This assumes the comma-separated lists don't become too long. If you exceed 4000 characters (or approx. 32000 characters in Oracle 12, with certain options turned on), you CAN aggregate the strings into CLOBs, but you CAN'T GROUP BY CLOBs (in general, not just in this case), so this approach will fail. A more robust approach is to match the sets of pairs, not some aggregation of them. The solution is more complicated, I will not cover it unless it is needed in your problem.
with
-- Begin simulated data (not part of the solution)
test_data ( id, therapeutic_class, generic_name ) as (
select 1, 'GY6', 'insulin' from dual union all
select 1, 'MH4', 'maleate' from dual union all
select 1, 'KJ*', 'glargine' from dual union all
select 2, 'GY6', 'supplies' from dual union all
select 2, 'C4C', 'diaoxy' from dual union all
select 3, 'GY6', 'insulin' from dual union all
select 3, 'MH4', 'maleate' from dual union all
select 3, 'KJ*', 'glargine' from dual
),
-- End of simulated data (for testing purposes only).
-- SQL query solution continues BELOW THIS LINE
valid_pairs ( pair_id, therapeutic_class, generic_name ) as (
select rownum, therapeutic_class, generic_name
from (
select distinct therapeutic_class, generic_name
from test_data
)
),
first_agg ( id, tc_list, gn_list ) as (
select t.id,
listagg(p.therapeutic_class, ',') within group (order by p.pair_id),
listagg(p.generic_name , ',') within group (order by p.pair_id)
from test_data t join valid_pairs p
on t.therapeutic_class = p.therapeutic_class
and t.generic_name = p.generic_name
group by t.id
)
select count(*) as cnt, tc_list, gn_list
from first_agg
group by tc_list, gn_list
;
Output:
CNT TC_LIST GN_LIST
--- ------------------ ------------------------------
1 GY6,C4C supplies,diaoxy
2 GY6,KJ*,MH4 insulin,glargine,maleate
You are looking for listagg() and then another aggregation. I think:
select therapeutics, generics, count(*)
from (select id, listagg(therapeutic_class, ', ') within group (order by therapeutic_class) as therapeutics,
listagg(generic_name, ', ') within group (order by generic_name) as generics
from t
group by id
) t
group by therapeutics, generics;
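The first answer mentions that matching the sets of pairs directly is more robust than grouping by aggregated strings. A minimal Python sketch of that idea, using the rows from the question (with the generic name spelled insulin for both patients), might look like this; the variable names are illustrative.

```python
from collections import Counter, defaultdict

rows = [
    (1, "YG4", "insulin"), (1, "CJ6", "maleate"), (1, "MG9", "glargine"),
    (2, "C4C", "diaoxy"), (2, "KR3", "supplies"),
    (3, "YG4", "insulin"), (3, "CJ6", "maleate"), (3, "MG9", "glargine"),
]

# Collect the set of (therapeutic_class, generic_name) pairs per patient ID.
pairs_by_id = defaultdict(set)
for patient_id, therapeutic_class, generic_name in rows:
    pairs_by_id[patient_id].add((therapeutic_class, generic_name))

# Count how many patients share exactly the same set of pairs.
combo_counts = Counter(frozenset(pairs) for pairs in pairs_by_id.values())

for combo, count in combo_counts.items():
    classes = ", ".join(tc for tc, _ in sorted(combo))
    generics = ", ".join(gn for _, gn in sorted(combo))
    print(count, generics, classes)
```

Comparing sets sidesteps both the ordering question and the string-length limit that the listagg approach runs into.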

Combine(concatenate) rows based on dates via SQL

I have the following table.
Animal Vaccine_Date Vaccine
Dog 1/1/2016 x
Dog 2/1/2016 y
Dog 2/1/2016 z
Cat 2/1/2016 y
Cat 2/1/2016 z
I want to be able to combine vaccines that are on the same animal and same date, so that they appear in the same cell. The table below is what the desired end result would be.
Animal Vaccine_Date Vaccine
Dog 1/1/2016 x
Dog 2/1/2016 y,z
Cat 2/1/2016 y,z
I have tried to create a volatile table to do so but I am not having any luck and I don't think Teradata recognizes the Group_concat.
UPDATED 20180419
Teradata (not sure which version) has added XMLAGG, which would be a better choice here than the recursive approach.
Original answer:
Teradata doesn't have group_concat/listagg functionality. There are a couple of workarounds. My favorite is to use a recursive CTE. It's not terribly efficient, but it's well documented and supported functionality.
In your case:
WITH RECURSIVE recCTE AS
(
SELECT
Animal,
Vaccine_Date,
CAST(min(Vaccine) as VARCHAR(50)) as vaccine_list, --big enough to hold concatenated list
1 as depth, --used to determine the largest/last group_concat (the full group) in the final SELECT
min(Vaccine) as Vaccine --last vaccine appended so far; aggregated because of the GROUP BY
FROM your_table --placeholder: replace your_table with the actual table name
GROUP BY 1,2
UNION ALL
SELECT
recCTE.Animal,
recCTE.Vaccine_Date,
recCTE.vaccine_list || ',' || your_table.Vaccine,
recCTE.depth + 1,
your_table.Vaccine
FROM recCTE
INNER JOIN your_table ON
recCTE.Animal = your_table.Animal AND
recCTE.Vaccine_Date = your_table.Vaccine_Date AND
your_table.Vaccine > recCTE.Vaccine
)
--Now select the result with the largest depth for each animal/vaccine_date combo
SELECT * FROM recCTE
QUALIFY ROW_NUMBER() OVER (PARTITION BY animal,vaccine_date ORDER BY depth desc) = 1
You may have to tweak that a little bit (possibly trim the vaccine values before concatenating and whatnot), but it should get you in the ballpark. You can check out the recursive CTE documentation, but it's pretty dry. There are a lot of tutorials out there too, if you are unfamiliar. Teradata's implementation of recursive CTEs is very similar to the T-SQL and PostgreSQL implementations as well.
As another option, you can check out the as-of-yet undocumented tdstats.udfconcat(), as explained by the extremely knowledgeable @dnoeth in a thread on the Teradata Community website.
Try this. The STUFF function is ideal for such situations (note that STUFF combined with FOR XML PATH is SQL Server syntax):
SELECT
Animal, Vaccine_Date,
STUFF(
(SELECT DISTINCT ',' + Vaccine
FROM TableName
WHERE Animal = a.Animal AND Vaccine_Date = a.Vaccine_Date
FOR XML PATH (''))
, 1, 1, '') AS VaccineList
FROM TableName AS a
GROUP BY Animal, Vaccine_Date
You can use this query:
SELECT
Animal, Vaccine_Date,
LISTAGG(Vaccine, ',') WITHIN GROUP (ORDER BY Vaccine) "names"
FROM table_name
GROUP BY Animal, Vaccine_Date
Hope you got it
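Whatever the dialect, all of these answers do the same thing: group by animal and date, then join the vaccines of each group into one string. A small Python sketch of that grouping, using the rows from the question, is shown here for illustration only.

```python
from collections import defaultdict

rows = [
    ("Dog", "1/1/2016", "x"),
    ("Dog", "2/1/2016", "y"),
    ("Dog", "2/1/2016", "z"),
    ("Cat", "2/1/2016", "y"),
    ("Cat", "2/1/2016", "z"),
]

# Group the vaccines by (animal, date); dicts keep first-seen key order.
grouped = defaultdict(list)
for animal, vaccine_date, vaccine in rows:
    grouped[(animal, vaccine_date)].append(vaccine)

# Join each group's vaccines, sorted, into one comma-separated cell.
result = [
    (animal, vaccine_date, ",".join(sorted(vaccines)))
    for (animal, vaccine_date), vaccines in grouped.items()
]

for row in result:
    print(*row)
```

The grouping key is the pair (animal, date), which is why the SQL versions need both columns in the GROUP BY.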

How to SELECT only few rows from a column in sql (SQLite Database Browser)?

I am using SQLite Database Browser.
Table : Test
Test has a single column "Words" with some values as shown below :
Words
--------
apple
pen
xerox
notebook
toys
zoo
stars
apes
Write an sql query (which should execute in SQLite Database Browser) to select words between 'xerox' and 'stars' & words from 'pen' up to apes.
This may be one option:
SELECT * FROM test
WHERE ROWID BETWEEN
(SELECT ROWID FROM test WHERE words = 'xerox') + 1
AND
(SELECT ROWID FROM test WHERE words = 'stars') - 1
UNION ALL
SELECT '---'
UNION ALL
SELECT * FROM test
WHERE ROWID BETWEEN
(SELECT ROWID FROM test WHERE words = 'pen')
AND
(SELECT ROWID FROM test WHERE words = 'apes');
select *
from Test
where Words between 'stars' and 'xerox'
or Words between 'apes' and 'pen'
Note that the first argument to between should be smaller than the second. So 'stars' comes before 'xerox' because 's' comes before 'x' in the alphabet.
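The two answers read "between" differently: the first uses row position (ROWID), the second alphabetical order. A quick way to see the difference is to run both against an in-memory SQLite database. The sketch below uses Python's built-in sqlite3 module and assumes the rows were inserted in the order shown in the question, so that the ROWID values follow that order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (Words TEXT)")
words = ["apple", "pen", "xerox", "notebook", "toys", "zoo", "stars", "apes"]
conn.executemany("INSERT INTO Test (Words) VALUES (?)", [(w,) for w in words])

# Positional reading: rows stored strictly between 'xerox' and 'stars'.
by_position = [r[0] for r in conn.execute("""
    SELECT Words FROM Test
    WHERE rowid BETWEEN (SELECT rowid FROM Test WHERE Words = 'xerox') + 1
                    AND (SELECT rowid FROM Test WHERE Words = 'stars') - 1
    ORDER BY rowid
""")]

# Alphabetical reading: words that sort between 'stars' and 'xerox' (inclusive).
by_alphabet = [r[0] for r in conn.execute("""
    SELECT Words FROM Test
    WHERE Words BETWEEN 'stars' AND 'xerox'
    ORDER BY rowid
""")]

print(by_position)
print(by_alphabet)
```

Which of the two is wanted depends on how the exercise is meant to be read; the positional version also relies on ROWID reflecting insertion order, which holds for a freshly filled table like this one.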