Google BigQuery: iterate CONTAINS function over subquery - sql

Let's assume I have two tables:
girls prefixes
------ ----------
Le-na -na
Lo-ve -ve
Li-na -la
Lu-na -ta
Len-ka -ya
All girl names and prefixes are of different lengths!
I want to select all girl names that contain a prefix from the prefixes table, and to do it in a single query (imagine I have many names and many prefixes).
I understand that for a single case it can be done like this:
SELECT girls,SOME(girls CONTAINS ("-na")) WITHIN RECORD FROM prefixes
But how do I implement iteration of CONTAINS function over subquery?
e.g.
SELECT girls,SOME(girls CONTAINS (SELECT * FROM prefixes))
WITHIN RECORD FROM prefixes
–– this doesn't work, because a subselect is not allowed in the SELECT clause
I'd really appreciate any ideas, I've tried to search for this but couldn't find my case.

Have you tried just using a join?
select *
from girls g join
     prefixes p
     on g.girls like concat('%', p.prefix);
This should work using standard SQL.
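For what it's worth, the same join can be sketched against an in-memory SQLite database (SQLite has no CONCAT function, so `||` stands in; the table and column names follow the question):

```python
import sqlite3

# Sketch of the LIKE join above against an in-memory SQLite database.
# SQLite has no CONCAT(), so '||' replaces concat('%', p.prefix).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE girls (girls TEXT);
CREATE TABLE prefixes (prefix TEXT);
INSERT INTO girls VALUES ('Le-na'), ('Lo-ve'), ('Li-na'), ('Lu-na'), ('Len-ka');
INSERT INTO prefixes VALUES ('-na'), ('-ve'), ('-la'), ('-ta'), ('-ya');
""")
rows = conn.execute("""
    SELECT g.girls
    FROM girls g
    JOIN prefixes p ON g.girls LIKE '%' || p.prefix
    ORDER BY g.girls
""").fetchall()
print([r[0] for r in rows])  # Len-ka drops out: no suffix matches it
```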

Assuming that the prefixes (well, suffixes) are always three characters, you can perform an efficient semi-join with the result of SUBSTR:
#standardSQL
WITH Girls AS (
  SELECT name
  FROM UNNEST(['Le-na', 'Lo-ve', 'Li-na', 'Lu-na', 'Len-ka']) AS name
),
Suffixes AS (
  SELECT suffix
  FROM UNNEST(['-na', '-ve', '-la', '-ta', '-ya']) AS suffix
)
SELECT name
FROM Girls
WHERE EXISTS (
  SELECT 1 FROM Suffixes WHERE suffix = SUBSTR(name, LENGTH(name) - 2)
);
Or you can use LIKE, but it is equivalent to performing a cross join with a filter, so it probably won't be as fast:
#standardSQL
WITH Girls AS (
  SELECT name
  FROM UNNEST(['Le-na', 'Lo-ve', 'Li-na', 'Lu-na', 'Len-ka']) AS name
),
Suffixes AS (
  SELECT suffix
  FROM UNNEST(['-na', '-ve', '-la', '-ta', '-ya']) AS suffix
)
SELECT name
FROM Girls
WHERE EXISTS (
  SELECT 1 FROM Suffixes WHERE name LIKE CONCAT('%', suffix)
);
Edit: another option that enumerates all name suffixes for use in the semi-join:
#standardSQL
WITH Girls AS (
  SELECT name
  FROM UNNEST(['Le-na', 'Lo-ve-lala', 'Li-na', 'Lu-eya', 'Len-ka']) AS name
),
Suffixes AS (
  SELECT suffix
  FROM UNNEST(['-na', '-ve', '-lala', '-ta', '-eya']) AS suffix
),
GirlNamePermutations AS (
  SELECT name, SUBSTR(name, LENGTH(name) + 1 - len) AS name_suffix
  FROM Girls
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, (SELECT MAX(LENGTH(suffix)) FROM Suffixes))) AS len
)
SELECT name
FROM GirlNamePermutations
WHERE EXISTS (
  SELECT 1
  FROM Suffixes
  WHERE suffix = name_suffix
);
If you know the range of suffix lengths, you could hard-code it instead, e.g. replace:
CROSS JOIN UNNEST(GENERATE_ARRAY(1, (SELECT MAX(LENGTH(suffix)) FROM Suffixes))) AS len
with:
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 5)) AS len
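The enumeration idea is easy to check outside SQL; a minimal Python sketch of the same logic, using the example data:

```python
# Python sketch of the suffix-enumeration approach above, using the
# example data: generate every trailing substring up to the longest
# suffix length, then check membership in the suffix set.
girls = ['Le-na', 'Lo-ve-lala', 'Li-na', 'Lu-eya', 'Len-ka']
suffixes = {'-na', '-ve', '-lala', '-ta', '-eya'}
max_len = max(len(s) for s in suffixes)  # like MAX(LENGTH(suffix))

matches = [name for name in girls
           if any(name[-k:] in suffixes for k in range(1, max_len + 1))]
print(matches)  # Len-ka is the only non-match
```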

Below is for BigQuery Standard SQL
#standardSQL
WITH girls AS (
  SELECT name
  FROM UNNEST(['Le-na', 'Lo-ve', 'Li-na', 'Lu-na', 'Len-ka']) AS name
),
suffixes AS (
  SELECT suffix
  FROM UNNEST(['-na', '-ve', '-la', '-ta', '-ya']) AS suffix
)
SELECT name
FROM girls
JOIN suffixes
ON ENDS_WITH(name, suffix)
As an option, in case you need to extend this to find fragments anywhere inside the name, you can use REGEXP_CONTAINS:
SELECT name
FROM girls
JOIN suffixes
ON REGEXP_CONTAINS(name, suffix)
Or use STARTS_WITH to match by prefixes (vs. suffixes):
SELECT name
FROM girls
JOIN suffixes
ON STARTS_WITH(name, suffix)
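If it helps to see the three predicates side by side, here is a small Python sketch of their behavior on the question's sample data (string methods stand in for ENDS_WITH/STARTS_WITH; `re.search` stands in for REGEXP_CONTAINS):

```python
import re

# Side-by-side Python analogues of the three BigQuery predicates used
# above, on the question's sample data. Note REGEXP_CONTAINS treats the
# suffix as a regular expression; these suffixes contain no
# metacharacters, so a plain re.search works.
girls = ['Le-na', 'Lo-ve', 'Li-na', 'Lu-na', 'Len-ka']
suffixes = ['-na', '-ve', '-la', '-ta', '-ya']

ends = [g for g in girls if any(g.endswith(s) for s in suffixes)]      # ENDS_WITH
inside = [g for g in girls if any(re.search(s, g) for s in suffixes)]  # REGEXP_CONTAINS
starts = [g for g in girls if any(g.startswith(s) for s in suffixes)]  # STARTS_WITH

print(ends, starts)  # starts is empty: no name begins with a suffix
```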


How does Partitioning By a Substring in T-SQL Work?

While browsing, I found a perfect example of what I'm looking for. In this code example, all country names that appear in long-formatted rows are concatenated together into one result, with a comma between each country.
Select CountryName from Application.Countries;
Select SUBSTRING(
  (
    SELECT ',' + CountryName AS 'data()'
    FROM Application.Countries FOR XML PATH('')
  ), 2, 9999) As Countries
Source: https://www.mytecbits.com/microsoft/sql-server/concatenate-multiple-rows-into-single-string
My question is: how can you partition these results with a second column that would read as "Continent" in such a way that each country would appear within its respective continent? The theoretical "OVER (PARTITION BY Continent)" in this example would not work without an aggregate function before it. Perhaps there is a better way to accomplish this? Thanks.
Use a continents table (you seem not to have one, so derive it with DISTINCT), and then use the same code in a CROSS APPLY, with the WHERE clause as a "join" condition:
select *
from
(
  select distinct continent
  from Application.Countries
) t1
cross apply
(
  select substring(
  (
    select ',' + CountryName as 'data()'
    from Application.Countries as c
    where c.continent = t1.continent
    for xml path('')
  ), 2, 9999) as Countries
) t2
Note that it is more usual, and arguably has more finesse, to use stuff(x, 1, 1, '') instead of substring(x, 2, 9999) to remove the leading comma.
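For comparison, engines that ship an aggregate string-concatenation function don't need the FOR XML PATH workaround at all; a sketch using SQLite's GROUP_CONCAT (the sample continents and countries are invented):

```python
import sqlite3

# Same per-continent concatenation with an aggregate function instead of
# the FOR XML PATH trick. GROUP_CONCAT is SQLite's rough equivalent of
# SQL Server 2017+'s STRING_AGG; the sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Countries (Continent TEXT, CountryName TEXT);
INSERT INTO Countries VALUES
  ('Europe', 'France'), ('Europe', 'Spain'), ('Asia', 'Japan');
""")
rows = conn.execute("""
    SELECT Continent, GROUP_CONCAT(CountryName, ',') AS Countries
    FROM Countries
    GROUP BY Continent
    ORDER BY Continent
""").fetchall()
print(rows)  # one row per continent, countries joined by commas
```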

How to unnest BigQuery nested records into multiple columns

I am trying to unnest the below table.
I am using the below unnest query to flatten it:
SELECT
  id,
  name,
  keyword
FROM `project_id.dataset_id.table_id`,
  UNNEST(`groups`) AS `groups`
WHERE id = 204358
The problem is, this duplicates the rows (except name), as is the case when flattening a table.
How can I modify the query to put the names in two different columns rather than rows?
Expected output below -
That's because the comma is a cross join; in combination with an unnested array it is a lateral cross join: you repeat the parent row for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of rows, but a table must have a fixed number of columns.
So you need a way to decide which array row becomes which column.
E.g. with
SELECT
  id,
  name,
  groups[ORDINAL(1)] AS firstArrayEntry,
  groups[ORDINAL(2)] AS secondArrayEntry,
  keyword
FROM `project_id.dataset_id.table_id`
WHERE id = 204358
If your array had a key-value pair you could decide using the key. E.g.
SELECT
  id,
  name,
  (SELECT value FROM UNNEST(groups) WHERE key = 'key1') AS key1,
  keyword
FROM `project_id.dataset_id.table_id`
WHERE id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but it has restrictions too, and I'm not sure how computation-heavy it is.
Consider the below simple solution:
select * from (
  select id, name, keyword, offset
  from `project_id.dataset_id.table_id`,
    unnest(`groups`) with offset
)
pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, the output is:
Note: when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list respectively; for example, for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
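The fixed-column constraint the pivot runs into shows up outside SQL as well; a small Python sketch that pads variable-length arrays to the width of the longest one (sample rows invented, mirroring the (id, groups, keyword) shape):

```python
# The fixed-width constraint behind the pivot, sketched in Python: pad
# each variable-length array to the longest one, like extending the
# `in (...)` list to the maximum offset. Sample rows are invented.
rows = [
    (204358, ['name_a', 'name_b'], 'OVG'),
    (204400, ['name_c'], 'XYZ'),
]
max_len = max(len(groups) for _, groups, _ in rows)  # MAX(ARRAY_LENGTH(`groups`))

pivoted = [(id_, *(groups + [None] * (max_len - len(groups))), keyword)
           for id_, groups, keyword in rows]
print(pivoted)  # missing offsets become NULL-like Nones
```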
In case, for whatever reason, you want to improve on this, use the below, where everything is built dynamically so you don't need to know in advance how many columns there will be in the output:
execute immediate (select '''
select * from (
  select id, name, keyword, offset
  from `project_id.dataset_id.table_id`,
    unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg('' || pos, ', ') || '''))
'''
from (
  select pos
  from (
    select max(array_length(`groups`)) cnt
    from `project_id.dataset_id.table_id`
  ), unnest(generate_array(1, cnt)) pos
))
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
       (SELECT g.name
        FROM UNNEST(t.groups) g WITH OFFSET n
        WHERE key = 'OVG'
        ORDER BY n
        LIMIT 1
       ) AS name_1,
       (SELECT g.name
        FROM UNNEST(t.groups) g WITH OFFSET n
        WHERE key = 'OVG'
        ORDER BY n
        LIMIT 1 OFFSET 1
       ) AS name_2,
       'OVG' AS keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;

Select rows where the title contains any term from a list

I have a dataset that has product_id, product_url, email, product_title, etc.
I want to pull rows where product_title contains certain adjectives (Fabulous, Stunning, Rare, Amazing, Unique, etc.) from a list of 400+ words.
How do I do this without writing a separate condition for each word? I am using SQLite.
You can construct a single query using OR:
select t.*
from t
where t.title like '%Fabulous%' or
      t.title like '%Stunning%' or
      . . . ;
If the words are stored in a separate table, you could use exists:
select t.*
from t
where exists (select 1
              from interesting_words iw
              where t.title like '%' || iw.word || '%'
             );
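A runnable sketch of the second approach using Python's sqlite3 module, with invented table and column names (the word table is loaded once, then a single query does all the matching):

```python
import sqlite3

# Runnable sketch of the EXISTS approach with Python's sqlite3 module;
# table and column names are invented. The 400+ words go into a
# one-column table once, and a single query filters the titles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_title TEXT);
CREATE TABLE interesting_words (word TEXT);
""")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("Fabulous red vase",), ("Plain mug",), ("Rare stamp",)])
conn.executemany("INSERT INTO interesting_words VALUES (?)",
                 [("Fabulous",), ("Stunning",), ("Rare",)])
rows = conn.execute("""
    SELECT product_title
    FROM products t
    WHERE EXISTS (SELECT 1
                  FROM interesting_words iw
                  WHERE t.product_title LIKE '%' || iw.word || '%')
""").fetchall()
print([r[0] for r in rows])  # 'Plain mug' matches no word
```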

tricky SQL with substrings

I have a table (postgres) with a varchar field that has content structured like:
".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."
The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.
That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record must also contain ".. John;Smith; ..".
The problem is, this business rule has been violated due to some bug. I would like to know how many rows in the table contain a uuid that occurs in more than one place with different combinations of [givenname];[surname].
I'd appreciate if someone could help me out with the SQL to accomplish this.
Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:
select
substring(col, 'uuid=([[:alnum:]]+)') as uuid,
string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names
from mytable
group by substring(col, 'uuid=([[:alnum:]]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;
Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028
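The same uuid-vs-names check is easy to prototype in Python before running it against the real table; the patterns below mirror the SQL ones, except that the uuid pattern also accepts hyphens so the full uuid serves as the grouping key (sample strings invented in the question's format):

```python
import re
from collections import defaultdict

# Python prototype of the uuid-vs-names check. The uuid pattern here
# also accepts hyphens, so the whole uuid is the grouping key; the
# sample strings are invented in the question's format.
rows = [
    "junk;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9;junk",
    "junk;John;Smyth;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9;junk",
    "junk;Jane;Doe;uuid=1c32e9e1-e29e-4211-b11e-e20b2cb78da9;junk",
]
names_per_uuid = defaultdict(set)
for row in rows:
    uuid = re.search(r"uuid=([0-9a-f-]+)", row).group(1)
    name = re.search(r"(\w+;\w+);uuid=", row).group(1)  # givenname;surname
    names_per_uuid[uuid].add(name)

# uuids violating the business rule: more than one distinct name
violations = {u: sorted(n) for u, n in names_per_uuid.items() if len(n) > 1}
print(violations)
```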
If you only want to count:
select
count(*) as cnt_uuids,
sum(num_names) as cnt_names,
sum(num_rows) as cnt_rows
from
(
select
count(*) as num_rows,
count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names
from mytable
group by substring(col, 'uuid=([[:alnum:]]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1
) flaws;
But as has been mentioned already: This is not how a database should be used.
I assume you know all the reasons why this is a bad data format, but you are stuck with it. Here is my approach:
select v.user_id, array_agg(distinct names)
from (select v.id,
             max(el) filter (where n = un) as user_id,
             array_agg(el order by el) filter (where n in (un - 2, un - 1)) as names
      from (select v.id, u.*,
                   max(u.n) filter (where el like 'uuid=%') over (partition by v.id) as un
            from (values (1, 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (2, 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (3, 'junkgoeshere;John;Smith;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (4, 'junkgoeshere;John;Jay;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..')
                 ) v(id, str) cross join lateral
                 unnest(regexp_split_to_array(v.str, ';')) with ordinality u(el, n)
           ) v
      where n between un - 2 and un
      group by v.id
     ) v
group by user_id
having min(names) <> max(names);
Here is a db<>fiddle.
This assumes that the fields are separated by semicolons. Your data format is just awful, not just as a string but because the names are not identified. So, I am assuming they are the two fields before the user_id field.
So, this implements the following logic:
Breaks up the string by semicolons, with an identifying number.
Finds the number for the user_id.
Extracts the previous two fields together with the user_id field.
Then uses aggregation to find cases where there are multiple matches.
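That logic can be sketched in Python as a sanity check (sample strings invented; the uuid is shortened for readability):

```python
from collections import defaultdict

# The four steps above as a Python sanity check (sample strings
# invented; the uuid is shortened for readability).
def name_and_uuid(s):
    parts = s.split(';')                              # 1. split on semicolons
    un = next(i for i, p in enumerate(parts)
              if p.startswith('uuid='))               # 2. locate the uuid field
    return parts[un], ';'.join(parts[un - 2:un])      # 3. uuid + two preceding fields

rows = [
    "junkgoeshere;John;Smith;uuid=7c32e9e1;..",
    "junkgoeshere;John;Jay;uuid=7c32e9e1;..",
]
seen = defaultdict(set)
for uuid, name in (name_and_uuid(r) for r in rows):
    seen[uuid].add(name)

# 4. aggregate: uuids paired with more than one distinct name
bad = [u for u, names in seen.items() if len(names) > 1]
print(bad)
```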

SQL : Turning rows into columns

I need to turn row values into columns; for example:
SELECT s.section_name,
s.section_value
FROM tbl_sections s
this outputs :
section_name section_value
-----------------------------
sectionI One
sectionII Two
sectionIII Three
desired output :
sectionI sectionII sectionIII
-----------------------------------------
One Two Three
This is probably better done client-side in the programming language of your choice.
You absolutely need to know the section names in advance to turn them into column names.
Updated answer for Oracle 11g (using the new PIVOT operator):
SELECT * FROM
  (SELECT section_name, section_value FROM tbl_sections)
PIVOT (
  MAX(section_value)
  FOR section_name IN ('sectionI', 'sectionII', 'sectionIII')
)
For older versions, you could do some self-joins:
WITH data AS
  (SELECT section_name, section_value FROM tbl_sections)
SELECT
  one.section_value AS "sectionI",
  two.section_value AS "sectionII",
  three.section_value AS "sectionIII"
FROM
  (SELECT section_value FROM data WHERE section_name = 'sectionI') one
  CROSS JOIN
  (SELECT section_value FROM data WHERE section_name = 'sectionII') two
  CROSS JOIN
  (SELECT section_value FROM data WHERE section_name = 'sectionIII') three
or also use the MAX trick and "aggregate":
SELECT
  MAX(DECODE(section_name, 'sectionI', section_value, '')) AS "sectionI",
  MAX(DECODE(section_name, 'sectionII', section_value, '')) AS "sectionII",
  MAX(DECODE(section_name, 'sectionIII', section_value, '')) AS "sectionIII"
FROM tbl_sections