I'm working with BigQuery to process some AdWords data and, more precisely, to extract all the URL parameters from our destination URLs so we can organize them better.
I wrote the following query to return all the parameters available in the "DestinationURL" field of the table:
SELECT Parameter
FROM (SELECT NTH(1, SPLIT(Params,'=')) as Parameter,
FROM (SELECT
AdID,
NTH(1, SPLIT(DestinationURL,'?')) as baseurl,
split(NTH(2, SPLIT(DestinationURL,'?')),'&') as Params
FROM [adwords_accounts_ads.ads_all]
HAVING Params CONTAINS '='))
GROUP BY 1
Running this gives me 6 parameters. That is correct but incomplete, because in this test table I know there are 2 other parameters in the URLs that were not fetched: one called 'group' and the other called 'utm_content'.
Now if I run:
SELECT Parameter
FROM (SELECT NTH(1, SPLIT(Params,'=')) as Parameter,
FROM (SELECT
AdID,
NTH(1, SPLIT(DestinationURL,'?')) as baseurl,
split(NTH(2, SPLIT(DestinationURL,'?')),'&') as Params
FROM [adwords_accounts_ads.ads_all]
HAVING Params CONTAINS 'p='))
GROUP BY 1
I get the "group" parameter showing.
The question is: shouldn't the
"CONTAINS '='"
condition include everything matched by
"CONTAINS 'p='"
in the result? The same happens for 't=' instead of '='.
Does anyone know how I can fix that, or even how to extract all the parameters from a string that contains a URL?
PS: using LIKE yields the exact same thing.
Thanks!
SPLIT creates a REPEATED output type, and you have to FLATTEN the table to query it correctly.
Here I used FLATTEN on Params and the output is now correct:
SELECT NTH(1, SPLIT(Params, '=')) AS Param,
       NTH(2, SPLIT(Params, '=')) AS Value
FROM FLATTEN((SELECT
                AdID,
                NTH(1, SPLIT(DestinationURL, '?')) AS baseurl,
                SPLIT(NTH(2, SPLIT(DestinationURL, '?')), '&') AS Params
              FROM
                (SELECT 1 AS AdID, 'http://www.example.com.br/?h=Passagens+Aereas&source=google&vt=0' AS DestinationURL)
              HAVING Params CONTAINS '='),
             Params)
Outputs:
+-----+--------+------------------+
| Row | Param  | Value            |
+-----+--------+------------------+
|  1  | h      | Passagens+Aereas |
|  2  | source | google           |
|  3  | vt     | 0                |
+-----+--------+------------------+
NOTE: The Web UI always flattens your results, but if you select a destination table and uncheck "Flatten Results", you will get a single row with a repeated Params column.
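For what it's worth, here is a rough standard SQL equivalent, since FLATTEN only exists in legacy SQL. This is just a sketch against the sample row above, using UNNEST for the repeated field:
#standardSQL
-- Sketch only: UNNEST plays the role FLATTEN plays in legacy SQL.
-- Assumes every DestinationURL contains a '?'.
WITH ads AS (
  SELECT 1 AS AdID,
         'http://www.example.com.br/?h=Passagens+Aereas&source=google&vt=0' AS DestinationURL
)
SELECT
  SPLIT(param, '=')[OFFSET(0)] AS Param,
  SPLIT(param, '=')[SAFE_OFFSET(1)] AS Value
FROM ads,
UNNEST(SPLIT(SPLIT(DestinationURL, '?')[OFFSET(1)], '&')) AS param
WHERE param LIKE '%=%'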
I have the following table:
postgres=# \d so_rum;
Table "public.so_rum"
Column | Type | Collation | Nullable | Default
-----------+-------------------------+-----------+----------+---------
id | integer | | |
title | character varying(1000) | | |
posts | text | | |
body | tsvector | | |
parent_id | integer | | |
Indexes:
"so_rum_body_idx" rum (body)
I wanted to do a phrase search query, so I came up with the query below, for example:
select id from so_rum
where body @@ phraseto_tsquery('english','Is it possible to toggle the visibility');
This gives me results that only match the exact phrase. However, there are documents where the distance between lexemes is greater, and the above query doesn't give me back that data. For example: 'it is something possible to do toggle between the . . . visibility' doesn't get returned. I know I can get it returned with a <2> (for example) distance operator by writing the to_tsquery manually.
But I want to understand how to do this in my SQL statement itself, so that I get the results first with a distance of 1, then 2 and so on (maybe up to 6-7), and finally append the results that simply contain all the search words, like the following query:
select count(id) from so_rum
where body @@ to_tsquery('english','string & string . . . ')
Is it possible to do this in a single query with good performance?
I don't see a canned solution to this. It sounds like you need to use plainto_tsquery to get all the results with all the lexemes, and then implement your own custom ranking function to rank them by distance between the lexemes, and maybe filter out ones with the wrong order.
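One hedged sketch of that idea (assuming PostgreSQL 9.6+, with the lexemes hard-coded purely for illustration): try increasing phrase distances and keep, per document, the tightest distance that matches.
-- sketch only: loosen the allowed gap step by step (1..7) and keep
-- the first (smallest) distance that matches each document
select distinct on (s.id) s.id, d.dist
from generate_series(1, 7) as d(dist)
cross join lateral (
    select id
    from so_rum
    where body @@ to_tsquery('english',
        format('possible <%s> toggle <%s> visibility', d.dist, d.dist))
) as s
order by s.id, d.dist;
A final plain to_tsquery('english', 'possible & toggle & visibility') pass could then be appended with UNION ALL for the bare all-words matches, though I would benchmark this before trusting its performance.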
I am using Google BigQuery and I ran into the following issue:
I have a table (A) like this:
| time | request |
|------------------------|-----------------|
|2019-09-24 11:10:00 UTC | fakewebsite.com |
|2019-09-24 11:10:00 UTC | realwebsite.com |
|........................|.................|
|2019-09-24 11:10:00 UTC | foobwebsite.com |
|2019-09-24 11:10:00 UTC | barrwebsite.com |
And another table (B) like this:
| blacklist |
|---------------|
| foo.com |
| ... |
| bar.com |
I want to make a query that will grab a modified version of the values inside the blacklist field of table B as follows:
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude --this will return only "foo" from "foo.com"
and then return all values from the request field of table A where none of the to_exclude was found.
I know how to do this for one value but I don't know how to do this for multiple. I am looking for something like the following:
#standardSQL
WITH tmp_blacklist AS
(SELECT
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude
FROM
mydataset.B)
SELECT
request
FROM
mydataset.A
WHERE
request NOT LIKE ("%value1%", "%value2%", ..., "%valuen%") -- I can't use OR along with the NOT LIKE since the values are too many and they will change.
The n values are the values of the tmp_blacklist table.
Also, if I don't define the table with the WITH and instead define it after the NOT LIKE, I get the following error: Scalar subquery produced more than one element, which makes sense if LIKE expects only one element. But then again, even if that were fixed it's only half the job done, since I want the "%value%" and not just the value from the table.
I searched online for a way to do this and found people saying it can't be done, plus some workarounds combining LIKE and IN, where people said it will be very slow if one of the tables grows to have tons of data (my case).
What is the best way to do this?
One method uses NOT EXISTS:
SELECT a.request
FROM mydataset.A a
WHERE NOT EXISTS (SELECT 1
                  FROM tmp_blacklist bl
                  WHERE a.request LIKE CONCAT('%', bl.to_exclude, '%')
                 );
Note that this can be expensive. You might want to test constructing the exclusion string as:
'value1|value2|value3'
and then using regular expressions.
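Something along these lines, perhaps (a sketch reusing the tmp_blacklist CTE from the question; if the extracted names can contain regex metacharacters they would need escaping first):
#standardSQL
WITH tmp_blacklist AS (
  SELECT SPLIT(NET.REG_DOMAIN(blacklist),
               CONCAT('.', NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude
  FROM mydataset.B
)
SELECT request
FROM mydataset.A
WHERE NOT REGEXP_CONTAINS(
  request,
  (SELECT STRING_AGG(to_exclude, '|') FROM tmp_blacklist)  -- 'value1|value2|...'
)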
I am crafting a SQL query that dynamically builds a WHERE clause. I was able to transform the separate pieces of the WHERE clause into returned rows like so:
-----------------------------------
| ID      | Query Part            |
-----------------------------------
| TOKEN 1 | (A = 1 OR B = 2)      |
-----------------------------------
| TOKEN 2 | ([TOKEN 1] OR C = 3)  |
-----------------------------------
| TOKEN 3 | ([TOKEN 2] AND D = 4) |
-----------------------------------
My goal is to wrap the current return results above in a stuff and or replace (or something entirely different I hadn't considered) to output the following result:
(((A=1 OR B=2) OR C=3) AND D=4)
Ideally there would be no temp table necessary but I am open to recommendations.
Thank you for any guidance, this has had me pretty stumped at work.
It's unusual. It looks like the query part you want is only TOKEN 3. The process should then replace any [token] tags in that query part with the corresponding query parts, and with the resulting query part, replace any [token] tags again, continuing until there are no more [token] tags left.
I think there should be a way of indicating the master query (i.e. TOKEN 3), then using a recursive common table expression to build the expression up until there are no more [token]s, as in the sketch below.
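Something like this, perhaps (SQL Server syntax; the QueryParts table, its column names, and treating TOKEN 3 as the master are all assumptions based on the sample data):
WITH expand AS (
    SELECT CAST(QueryPart AS varchar(4000)) AS Expr, 0 AS Depth
    FROM QueryParts
    WHERE ID = 'TOKEN 3'                              -- the master token
    UNION ALL
    SELECT CAST(REPLACE(e.Expr, '[' + q.ID + ']', q.QueryPart) AS varchar(4000)),
           e.Depth + 1
    FROM expand e
    JOIN QueryParts q
      ON CHARINDEX('[' + q.ID + ']', e.Expr) > 0      -- Expr still references q
)
SELECT TOP (1) Expr                                   -- deepest row = fully expanded
FROM expand
ORDER BY Depth DESC;
Against the sample rows this expands TOKEN 3 to (((A = 1 OR B = 2) OR C = 3) AND D = 4); it assumes at most one [token] reference per query part.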
I am having an issue and I'm not sure how to solve it.
I have an SSRS report that pulls from a table. I want a parameter filter to show de-duplicated values based on available options in one of the columns.
So my dataset uses a query like:
SELECT * FROM table1 WITH (NOLOCK) WHERE col1 IN (#param)
Then I want a parameter called param that gets its available and default values from col1 in the above dataset, and I want them de-duplicated.
From reading online I learned I have to create a dummy param and use VBA code to de-duplicate that list.
So I have these params:
param_dummy that gets its available and default values from col1 in the above dataset
param that gets a de-duplicate list from param_dummy using Code.RemoveDuplicates
But I'm having an issue with circular logic: param gets its values from param_dummy, which gets its values from the dataset/query, which in turn uses param.
How can I solve this?
One thought is to remove the WHERE col1 IN (#param) and instead use a filter on the Tablix table in the SSRS report. This works but I am wondering how efficient it is.
And/or if anyone has any other suggestions I am all ears.
Updated to add more details...
So let us say I have a table in my DB like so:
| id | col1 | col2 |
|----|------|--------|
| 1 | a | hello |
| 2 | b | how |
| 3 | a | are |
| 4 | c | you |
| 5 | d | on |
| 6 | a | this |
| 7 | b | lovely |
| 8 | c | day |
What I want is:
a Tablix to show all the fields from the table
a dropdown filter where the user can select among the available values in col1 (de-duplicated)
a text filter that allows nulls where a user can filter on col2
the parameters will have default values so the table will load on page load
So I have a dataset with a query like so:
SELECT
*
FROM dbo.table1
WHERE col1 IN (#col1options) AND (#col2value IS NULL OR col2 = #col2value)
Then for col1options I would set the available and default values to "Get values from a query", using the above dataset and col1.
But this won't work, since the query/dataset depends on col1options, which gets its default values from that same query/dataset.
I can use a second dataset but that means making multiple calls to the SQL server and I want to avoid that.
I'm not sure I understand your issue so this is a guess...
If you mean you want to be able to filter your data by choosing one or more entries from a specific column in the table, but this column has duplicates and you want your parameter list to not show duplicates, then this is what to do.
Create a new report
Add dataset dsMain as SELECT * FROM myTable WHERE myColumn IN (#myParam)
Add dataset dsParamValues as SELECT DISTINCT myColumn FROM myTable ORDER BY myColumn
Edit the #myParam parameter properties and set the available and default values to a query, then choose dsParamValues
Add your table/matrix control and set its dataset property to dsMain
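Side by side, the two dataset queries would look like this (keeping the question's # placeholder convention; myTable and myColumn are stand-ins):
-- dsParamValues: feeds the parameter's available/default values, de-duplicated
SELECT DISTINCT myColumn FROM myTable ORDER BY myColumn
-- dsMain: feeds the table/matrix, filtered by the multi-value parameter
SELECT * FROM myTable WHERE myColumn IN (#myParam)
The circularity goes away because dsParamValues does not reference any parameter.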
Found an easier solution.
Follow this link to build the "dummy" hidden parameter, the visible parameter, and the de-dupe VBA code.
Add a Tablix properties filter where param is in the visible/non-hidden parameter from the above VBA (FYI, double-click to add the parameter).
Adding via double-click will append a (0) at the end; remove the (0).
It should work as expected at that point! You should be able to select one, some, or all parameters, and your report should update accordingly.
My client wants the possibility to match a set of data against an array of regular expressions, meaning:
table:
name   | officeId (foreign key)
-------+-----------------------
bob    | 1
alice  | 1
alicia | 2
walter | 2
and he wants to do something along those lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of [alicia, walter] ~ ANY of [.*ob, ali.*] results in true.
Sadly, I could not figure it out by myself :/
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId ... where name ~ ANY[.*ob, ali.*], because:
This application stores data in Postgres XML columns, which means I do in fact have (after evaluating (xpath('/data/clients/name/text()'))::text[]):
table:
name              | officeId (foreign key)
------------------+-----------------------
[bob, alice]      | 1
[anthony, walter] | 2
[alicia, walter]  | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it like it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use JSON" are not options.
I am stuck with this data model.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross join and a semi-join. On small data sets this would probably work pretty well. On large data sets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
select unnest(array['.*ob', 'ali.*']) as pattern
)
select
o.name, o.officeid
from
office o
where exists (
select null
from patterns p
where o.name ~ p.pattern
)
The semi-join protects you from cases where a name like "alicia nob" meets multiple search patterns and would otherwise come back once for every match.
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When casting a string array to text, strings containing special characters or consisting of keywords are enclosed in double quotes, e.g. {jimmy,"walter, james"} is two entries. Also, when matching with ~, the pattern is matched against any part of the string, unlike LIKE, where the pattern must match the whole string.
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
id | xss | officeid
----+-------------------------+----------
2 | {anthony,walter} | 2
3 | {alicia,walter} | 3
4 | {"walter, james"} | 5
5 | {jimmy,"walter, james"} | 4
(4 rows)