How to efficiently select records matching substring in another table using BigQuery?

I have a table of several million strings that I want to match against a table of about twenty thousand strings like this:
#standardSQL
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Unfortunately this is taking an awfully long time.
Considering that the fragment table is only 20k records, can I load it into a JavaScript array using a UDF and match it that way? I'm trying to figure out how to do this right now, but perhaps there's already some magic I could use here to make this faster. I tried a CROSS JOIN and got resource exceeded fairly quickly. I've also tried using EXISTS, but I can't reference record.name inside that subquery's WHERE without getting an error.
Example using Public Data
This seems to reflect about the same amount of data ...
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT LOWER(name) AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')

Below is for BigQuery Standard SQL
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT DISTINCT LOWER(name) AS name
FROM `bigquery-public-data.usa_names.usa_1910_current`
), temp_record AS (
SELECT record, TO_JSON_STRING(record) id, name, item
FROM record, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
), temp_fragment AS (
SELECT name, item FROM fragment, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
)
SELECT AS VALUE ANY_VALUE(record) FROM (
SELECT ANY_VALUE(record) record, id, r.name name, f.name fragment_name
FROM temp_record r
JOIN temp_fragment f
USING(item)
GROUP BY id, name, fragment_name
)
WHERE name LIKE CONCAT('%', fragment_name, '%')
GROUP BY id
The query above completed in 375 seconds, while the original query was still running at 2740 seconds, so I did not wait for it to complete.
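The idea behind the query above (join on shared words first, then confirm with LIKE) can be sketched outside BigQuery. This is only an illustration in Python under my own assumptions: the function name `match_by_shared_word` and the sample strings are hypothetical, not part of the answer. It also shows a consequence of pre-filtering on whole words: a fragment that only appears inside a longer word is never considered.

```python
import re
from collections import defaultdict

def match_by_shared_word(records, fragments):
    """Return {record: sorted fragments found in it}, pre-filtered by shared words."""
    # Inverted index: word -> fragments that contain that word.
    index = defaultdict(set)
    for frag in fragments:
        for word in re.findall(r"\w+", frag):
            index[word].add(frag)

    matches = {}
    for rec in records:
        # Candidates share at least one whole word with the record (the JOIN USING(item) step).
        candidates = set()
        for word in set(re.findall(r"\w+", rec)):
            candidates |= index.get(word, set())
        # Confirm each candidate with a real substring test (the LIKE step).
        found = sorted(f for f in candidates if f in rec)
        if found:
            matches[rec] = found
    return matches

# Hypothetical sample data, already lowercased as in the query.
records = ["i met mary yesterday", "rosemary is a herb", "nothing here"]
fragments = ["mary", "helen"]
result = match_by_shared_word(records, fragments)
```

With this sample, "rosemary is a herb" is not returned even though it contains "mary" as a substring, because no whole word is shared. Whether that difference from the original LIKE join matters depends on the data.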

Mikhail's answer appears to be faster, but let's have one that needs neither SPLIT nor separating the text into words.
First, compute a regular expression with all the words to be searched:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT FORMAT('(%s)',STRING_AGG(name,'|'))
FROM fragment
Now you can take that resulting string, and use it in a REGEX ignoring case:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), largestring AS (
SELECT '(?i)(mary|margaret|helen|more_names|more_names|more_names|josniel|khaiden|sergi)'
)
SELECT record.* FROM `record`
WHERE REGEXP_CONTAINS(record.name, (SELECT * FROM largestring))
(~510 seconds)
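For readers who want to try the single-regex idea locally, here is a minimal Python sketch. It assumes a non-empty fragment list; `build_pattern` and `filter_records` are hypothetical helper names. Unlike the STRING_AGG step above, it escapes each fragment, which matters if a fragment could contain regex metacharacters.

```python
import re

def build_pattern(fragments):
    # Escape each fragment so characters like "." are matched literally,
    # and list longer fragments first so the alternation prefers them.
    ordered = sorted(set(fragments), key=lambda f: (-len(f), f))
    body = "|".join(re.escape(f) for f in ordered)
    return re.compile("(" + body + ")", re.IGNORECASE)

def filter_records(records, fragments):
    # Keep only the records in which at least one fragment occurs.
    pattern = build_pattern(fragments)
    return [r for r in records if pattern.search(r)]

# Hypothetical sample data.
kept = filter_records(["I met Mary", "nothing", "HELEN said hi"], ["mary", "helen"])
```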

As alluded to in my question, I worked on a version using a JavaScript UDF which solves this, albeit more slowly than the answer I accepted. For completeness, I'm posting it here because perhaps someone (like myself in the future) may find it useful.
CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
for (var i in fragments) {
if (str.indexOf(fragments[i]) >= 0) {
return fragments[i];
}
}
return null;
""";
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
WHERE text IS NOT NULL
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name IS NOT NULL
GROUP BY name
), fragment_array AS (
SELECT ARRAY_AGG(name) AS names, COUNT(*) AS count
FROM fragment
GROUP BY LENGTH(name)
), records_with_fragments AS (
SELECT record.name,
CONTAINS_ANY(record.name, fragment_array.names)
AS fragment_name
FROM record INNER JOIN fragment_array
ON CONTAINS_ANY(name, fragment_array.names) IS NOT NULL
)
SELECT * EXCEPT(rownum) FROM (
SELECT record.name,
records_with_fragments.fragment_name,
ROW_NUMBER() OVER (PARTITION BY record.name) AS rownum
FROM record
INNER JOIN records_with_fragments
ON records_with_fragments.name = record.name
AND records_with_fragments.fragment_name IS NOT NULL
) WHERE rownum = 1
The idea is that the list of fragments is small enough to be processed in an array, similar to Felipe's answer using regular expressions. The first thing I do is create a fragment_array table which is grouped by fragment length, a cheap way of preventing an over-sized array, which I found can cause UDF timeouts.
Next I create a table called records_with_fragments that joins those arrays to the original records, finding only those which contain a matching fragment using the JavaScript UDF CONTAINS_ANY(). This will result in a table containing some duplicates since one record may match multiple fragments.
The final SELECT then pulls in the original record table, joins to records_with_fragments to determine which fragment matched, and also uses the ROW_NUMBER() function to prevent duplicates, e.g. only showing the first row of each record as uniquely identified by its name.
Now, the reason I do the join in the final query is because in my actual data there are more fields I want besides just the string being matched. Earlier on in my actual data I create a table of DISTINCT strings which then later need to be re-joined.
Voila! Not the most elegant but it gets the job done.
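The same flow can be tried out in plain Python. This is a rough sketch with hypothetical names (`bucket_by_length`, `first_match_per_record`); one deliberate difference is that it checks the shortest fragments first, whereas ROW_NUMBER() without ORDER BY in the query above picks an arbitrary match.

```python
from collections import defaultdict

def bucket_by_length(fragments):
    # Mirrors GROUP BY LENGTH(name): one smaller array per fragment length.
    buckets = defaultdict(list)
    for frag in fragments:
        buckets[len(frag)].append(frag)
    return dict(buckets)

def contains_any(text, fragments):
    # Same logic as the JavaScript UDF: first fragment found, else None.
    for frag in fragments:
        if frag in text:
            return frag
    return None

def first_match_per_record(records, fragments):
    buckets = bucket_by_length(fragments)
    result = {}
    for rec in records:
        for length in sorted(buckets):
            hit = contains_any(rec, buckets[length])
            if hit is not None:
                result[rec] = hit  # keep one match per record, like rownum = 1
                break
    return result

# Hypothetical sample data.
matched = first_match_per_record(["Mary and Helen", "nobody"], ["Helen", "Mary", "Al"])
```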

Related

Select rows according to another table with a comma-separated list of items

I have a table test.
select b from test
b is a text column and contains Apartment,Residential
The other table is a parcel table with a classification column. I'd like to use test.b to select the right classifications in the parcels table.
select * from classi where classification in(select b from test)
this returns no rows
select * from classi where classification =any(select '{'||b||'}' from test)
same story with this one
I may make a function to loop through the b column but I'm trying to find an easier solution
Test case:
create table classi as
select 'Residential'::text as classification
union
select 'Apartment'::text as classification
union
select 'Commercial'::text as classification;
create table test as
select 'Apartment,Residential'::text as b;
You don't actually need to unnest the array:
SELECT c.*
FROM classi c
JOIN test t ON c.classification = ANY (string_to_array(t.b, ','));
db<>fiddle here
The problem is that = ANY takes a set or an array, and IN takes a set or a list, and your ambiguous attempts resulted in Postgres picking the wrong variant. My formulation makes Postgres expect an array as it should.
For a detailed explanation see:
How to match elements in an array of composite type?
IN vs ANY operator in PostgreSQL
Note that my query also works for multiple rows in table test. Your demo only shows a single row, which is a corner case for a table ...
But also note that multiple rows in test may produce (additional) duplicates. You'd have to fold duplicates or switch to a different query style to de-duplicate. Like:
SELECT c.*
FROM classi c
WHERE EXISTS (
SELECT FROM test t
WHERE c.classification = ANY (string_to_array(t.b, ','))
);
This prevents duplication from elements within a single test.b, as well as from across multiple test.b. EXISTS returns a single row from classi per definition.
The most efficient query style depends on the complete picture.
You need to first split b into an array and then get the rows. A couple of alternatives:
select * from nj.parcels p where classification = any(select unnest(string_to_array(b, ',')) from test)
select p.* from nj.parcels p
INNER JOIN (select unnest(string_to_array(b, ',')) from test) t(classification) ON t.classification = p.classification;
Essential to both is the unnest surrounding string_to_array.
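To see why the split is essential, here is a small Python analogy (not PostgreSQL; the function names are hypothetical). Comparing against the whole comma-separated string finds nothing, while comparing against its elements behaves like = ANY (string_to_array(b, ',')).

```python
def select_without_split(classi_rows, test_rows):
    # Analogue of "classification IN (select b from test)":
    # each classification is compared with the whole string 'Apartment,Residential'.
    whole = set(test_rows)
    return [c for c in classi_rows if c in whole]

def select_with_split(classi_rows, test_rows):
    # Analogue of "= ANY (string_to_array(b, ','))": compare with each element.
    # Like string_to_array, split(",") does not trim spaces around the items.
    wanted = set()
    for b in test_rows:
        wanted.update(b.split(","))
    return [c for c in classi_rows if c in wanted]

# Same values as the test case in the question.
classi = ["Residential", "Apartment", "Commercial"]
test = ["Apartment,Residential"]
```

Because the second function tests membership once per classification, it also never returns a row twice, similar in spirit to the EXISTS variant.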

with XMLDIFF, how to compare only the fields that my xml elements have in common?

introduction:
I have a query that uses a pipeline function. I won't change the names of the returned columns, but I will add other columns.
I want to compare the result of the old query with that of the new query (syntactically always the same, select * from mypipelinefunction, but I have changed the pipeline function).
I have used "select *" instead of listing the column names because there are a lot of them.
code:
The code example is simplified to focus on the problem addressed in the title: there is no pipeline function, and only two "identical" queries are tested. One query has one more column than the other.
SELECT
XMLDIFF (
XMLTYPE.createXML (
DBMS_XMLGEN.getxml ('select 1 one, 2 two from dual')),
XMLTYPE.createXML (
DBMS_XMLGEN.getxml ('select 1 one from dual')))
from dual;
I want XMLDIFF to report no difference, because the only columns I care about are the columns that are in common.
In short I would like to have this result
<xd:xdiff xsi:schemaLocation="http://xmlns.oracle.com/xdb/xdiff.xsd http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xd="http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</xd:xdiff>
instead of this result
<xd:xdiff xsi:schemaLocation="http://xmlns.oracle.com/xdb/xdiff.xsd http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xd="http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><xd:delete-node xd:node-type="element" xd:xpath="/ROWSET[1]/ROW[1]/TWO[1]"/></xd:xdiff>
Is it possible to force XMLDIFF to compare only the columns that are in common?
Another way to fix this problem would be a shortcut in TOAD that transforms select * from t into select first_column, ..., last_column from t. It should work even if t is a pipeline function.
If you only care about certain columns, wrap your query in an outer query that outputs only those columns:
SELECT XMLDIFF (
XMLTYPE.createXML (
DBMS_XMLGEN.getxml (
'SELECT one FROM (select 1 one, 2 two from dual)'
)
),
XMLTYPE.createXML (
DBMS_XMLGEN.getxml (
'SELECT one FROM (select 1 one from dual)'
)
)
) AS diff
FROM DUAL;
Which outputs:
DIFF
<xd:xdiff xsi:schemaLocation="http://xmlns.oracle.com/xdb/xdiff.xsd http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xd="http://xmlns.oracle.com/xdb/xdiff.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><?oracle-xmldiff operations-in-docorder="true" output-model="snapshot" diff-algorithm="global"?></xd:xdiff>
db<>fiddle here
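If the XML from both queries is available outside the database, the "common columns only" comparison can also be done client-side. The following Python sketch is an alternative to XMLDIFF, not part of the answer above; it assumes the ROWSET/ROW layout that the XMLDIFF output in the question refers to, pairs rows by position, and uses a hypothetical function name.

```python
import xml.etree.ElementTree as ET

def diff_common_columns(old_xml, new_xml):
    """List (row_number, column, old_value, new_value) for columns present in both rows."""
    old_rows = ET.fromstring(old_xml).findall("ROW")
    new_rows = ET.fromstring(new_xml).findall("ROW")
    diffs = []
    # Rows are paired by position; extra rows on either side are ignored here.
    for i, (old, new) in enumerate(zip(old_rows, new_rows), start=1):
        old_cols = {col.tag: col.text for col in old}
        new_cols = {col.tag: col.text for col in new}
        # Only columns that exist in both rows are compared.
        for tag in sorted(old_cols.keys() & new_cols.keys()):
            if old_cols[tag] != new_cols[tag]:
                diffs.append((i, tag, old_cols[tag], new_cols[tag]))
    return diffs

# Hand-written stand-ins for the two DBMS_XMLGEN.getxml results.
with_two = "<ROWSET><ROW><ONE>1</ONE><TWO>2</TWO></ROW></ROWSET>"
only_one = "<ROWSET><ROW><ONE>1</ONE></ROW></ROWSET>"
no_diff = diff_common_columns(with_two, only_one)
```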

Extract nested values as columns Google BigQuery?

I have a table with nested values, like the following:
I'd like to grab the values, with keys as columns without multiple cross joins.
i.e.
SELECT
owner_id,
owner_type,
domain,
metafields.value AS name,
metafields.value AS image,
metafields.value AS location,
metafields.value AS draw
FROM
example_table
Obviously, the above won't work for this, but the following output would be desired:
In the actual table there are hundreds of metafields per owner_id, and hundreds of owner_ids, and owner_types. Multiple joins to other tables for owner_types is fine, but for the same owner type, I don't want to have to join multiple times.
Basically, I need to be able to select the key to which the column corresponds, and display the relevant value for that column. Without, having to display every metafield available.
Any way of doing this?
Consider below approach
select * except(id) from (
select t.* except(metafields),
to_json_string(t) id, key, value
from your_table t, unnest(metafields) kv
)
pivot (min(value) for key in ('name', 'image', 'location', 'draw'))
if applied to sample data in your question - output is
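What the UNNEST plus PIVOT combination does can be mimicked in a few lines of Python. This is a sketch on made-up rows (the question's sample table is not reproduced here), with a hypothetical function name; it keeps only the requested keys and, like MIN(value), takes the smallest value if a key repeats.

```python
def pivot_metafields(rows, keys):
    """Turn each row's list of {key, value} pairs into one column per requested key."""
    out = []
    for row in rows:
        # Keep every field except the nested one (like "t.* except(metafields)").
        flat = {k: v for k, v in row.items() if k != "metafields"}
        values = {}
        for kv in row["metafields"]:
            if kv["key"] in keys:
                prev = values.get(kv["key"])
                # Like MIN(value): keep the smallest value when a key repeats.
                values[kv["key"]] = kv["value"] if prev is None else min(prev, kv["value"])
        for k in keys:
            flat[k] = values.get(k)  # None when the key is absent, like NULL
        out.append(flat)
    return out

# Made-up sample row; "other" is a metafield that is not selected.
rows = [{
    "owner_id": 1, "owner_type": "product", "domain": "beta.com",
    "metafields": [
        {"key": "name", "value": "big"},
        {"key": "image", "value": "pic.png"},
        {"key": "other", "value": "ignored"},
    ],
}]
pivoted = pivot_metafields(rows, ["name", "image", "location", "draw"])
```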
You can use subqueries and SAFE_OFFSET to get the value at a specific position in an array.
You also need STRING_AGG, which returns a value (either STRING or BYTES) obtained by concatenating non-null values.
With the information you shared, you can use the query below.
With this code, you will get all the columns separated by a comma:
WITH sequences AS
(
SELECT 1 as ID,"product" AS owner_type,"beta.com" AS domain,["name","image","lcation","draw"] AS metalfields_key, ["big","pic.png","utha","1"] AS metalfields_value
),
Val as(
SELECT distinct id, owner_type,domain, value FROM sequences, sequences.metalfields_value as value, sequences.metalfields_key
), text as(
SELECT
id, owner_type, domain,
STRING_AGG(value ORDER BY value) AS Text
FROM Val
GROUP BY owner_type, domain, id
)
In this code, you will get each element that is separated by a comma and return them by columns.
SELECT DISTINCT t1.id, t1.owner_type,domain,
split(t1.text, ',')[SAFE_offset(1)] as name,
split(t1.text, ',')[SAFE_offset(2)] as image,
split(t1.text, ',')[SAFE_offset(3)] as location,
split(t1.text, ',')[SAFE_offset(0)] as draw
from text as t1
You can see the result.

Find out what pattern was matched when using a LIKE query?

If I were to perform a normal query with a bunch of LIKE conditions in it, would it be possible to return which search term actually resulted in the row being returned?
So if I ran :
select cand_id
FROM cand_kw
WHERE client_id='client'
AND ( ( UPPER(kw) LIKE '%ANDREW%' AND UPPER(kw) LIKE '%POSTINGS%' )
OR ( UPPER(kw) LIKE '%BRET%' )
OR ( UPPER(kw) LIKE '%TIM%' ) )
And it returned some rows, is there a way to tag on which term was actually matched in each row? So if '%ANDREW%' was what caused a row to be returned, I could then show that information.
The database engine is Oracle 9i. I realize that this is normally the job of something like full-text search, which this database is not set up to handle, so I am just trying to fake it in a way.
It is a bit tricky, because more than one keyword may match. You could use a CASE expression in the SELECT clause, but then you would get the first matching keyword only.
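Another approach would be to put each keyword on a separate row, use a join to filter the original table, and then aggregate the list of matching keywords. Note that LISTAGG, used below, is only available from Oracle 11g Release 2 onward, so on 9i the aggregation step would need a different technique.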
So:
SELECT c.cand_id, LISTAGG(k.kw, ', ') WITHIN GROUP (ORDER BY k.kw) matches
FROM cand_kw c
INNER JOIN (
SELECT 'ANDREW' kw FROM DUAL
UNION ALL SELECT 'POSTINGS' FROM DUAL
UNION ALL SELECT 'BRET' FROM DUAL
UNION ALL SELECT 'TIM' FROM DUAL
) k ON c.kw LIKE '%' || k.kw || '%'
GROUP BY c.cand_id
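The join-then-aggregate idea is easy to check on a few rows in Python. This sketch uses hypothetical sample rows and a hypothetical function name; it compares case-insensitively (as the UPPER(kw) conditions in the question do) and collects each keyword once per candidate, whereas LISTAGG as written would repeat a keyword that matches several rows of the same cand_id.

```python
def matched_keywords(rows, keywords):
    """rows: (cand_id, kw) pairs. Returns {cand_id: 'KW1, KW2'} for matching candidates."""
    hits_by_cand = {}
    for cand_id, kw in rows:
        text = kw.upper()
        # One "join" per keyword: does the row's text contain it?
        hits = {k for k in keywords if k.upper() in text}
        if hits:
            hits_by_cand.setdefault(cand_id, set()).update(hits)
    # Aggregate per candidate, ordered like WITHIN GROUP (ORDER BY k.kw).
    return {cand: ", ".join(sorted(hits)) for cand, hits in hits_by_cand.items()}

# Hypothetical sample rows.
rows = [(1, "Andrew postings"), (2, "tim"), (3, "nobody"), (1, "Bret")]
tagged = matched_keywords(rows, ["ANDREW", "POSTINGS", "BRET", "TIM"])
```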

I want to join two tables with a common column in BigQuery?

To join the tables, I am using the following query.
SELECT *
FROM(select user as uservalue1 FROM [projectname.FullData_Edited]) as FullData_Edited
JOIN (select user as uservalue2 FROM [projectname.InstallDate]) as InstallDate
ON FullData_Edited.uservalue1=InstallDate.uservalue2;
The query works but the joined table only has two columns uservalue1 and uservalue2.
I want to keep all the columns present in both the table. Any idea how to achieve that?
#legacySQL
SELECT <list of fields to output>
FROM [projectname:datasetname.FullData_Edited] AS FullData_Edited
JOIN [projectname:datasetname.InstallDate] AS InstallDate
ON FullData_Edited.user = InstallDate.user
or (and preferable)
#standardSQL
SELECT <list of fields to output>
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
ON FullData_Edited.user = InstallDate.user
Note that using SELECT * in such cases leads to an ambiguous column name error, so it is better to list explicitly the columns/fields you need in your output.
One way around it is the USING() syntax, as in the example below.
Assuming that user is the ONLY ambiguous field, it does the trick:
#standardSQL
SELECT *
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
USING (user)
For example:
#standardSQL
WITH `projectname.datasetname.FullData_Edited` AS (
SELECT 1 user, 'a' field1
),
`projectname.datasetname.InstallDate` AS (
SELECT 1 user, 'b' field2
)
SELECT *
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
USING (user)
returns
user field1 field2
1 a b
whereas using ON FullData_Edited.user = InstallDate.user gives below error
Error: Duplicate column names in the result are not supported. Found duplicate(s): user
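The difference between USING and ON in the shape of the SELECT * result can be reproduced locally. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in (an assumption on my part, with made-up table names); SQLite tolerates the duplicate column that BigQuery rejects, so here it simply shows up twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE full_data (user INTEGER, field1 TEXT);
    CREATE TABLE install_date (user INTEGER, field2 TEXT);
    INSERT INTO full_data VALUES (1, 'a');
    INSERT INTO install_date VALUES (1, 'b');
""")

def column_names(sql):
    # Column names of the result set, in output order.
    cur = conn.execute(sql)
    return [d[0] for d in cur.description]

# USING merges the join column into a single output column.
using_cols = column_names(
    "SELECT * FROM full_data JOIN install_date USING (user)")
# ON keeps the join column from both sides.
on_cols = column_names(
    "SELECT * FROM full_data f JOIN install_date i ON f.user = i.user")
```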
Don't use subqueries if you want all columns:
SELECT *
FROM [projectname.FullData_Edited] as FullData_Edited JOIN
[projectname.InstallDate] as InstallDate
ON FullData_Edited.user = InstallDate.user;
You may have to list out the particular columns you want to avoid duplicate column names.
While you are at it, you should also switch to standard SQL.