Is there a way to check if any items in a string array are in a string in Snowflake/Redshift? - sql

I am looking for a way to check if a string contains any words in another field which is a single string that holds a list of items. Something like this...
id items (STRING)
1 burger;hotdog
I have a second dataset that might look like...
transaction_id description amount
10 cheeseburger 10
Now I need to grab the amount if the description matches any of the items in the first table. In this case it does match the string burger. However, I can't seem to get the SQL right: if I were to use LIKE ANY in Snowflake, I'd need to pass in ('%burger%','%hotdog%'), which are two separate strings, and I can't make explicit calls because each id/item permutation may be different in the first table. In Redshift, when I try to use
CASE WHEN lower(t.description) SIMILAR TO '%(' || replace(items,';','|') || ')%' then amount END
I get the following error: Specified types or functions (one per INFO message) not supported on Redshift tables.
Thanks in advance!

If you're wanting a Snowflake answer:
WITH keys AS (
SELECT * FROM VALUES (1,'burger;hotdog') a(id,items)
), data AS (
SELECT * FROM VALUES (10,'cheeseburger',10) b(transaction_id, description, amount)
), seq_keys AS (
SELECT s.seq_id, f.value as key
FROM (
SELECT seq8() as seq_id, k.*
FROM keys AS k
) AS s
,lateral flatten(input=>split(s.items,';')) F
)
SELECT d.*, sk.*
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%'||sk.key||'%'
gives:
TRANSACTION_ID DESCRIPTION AMOUNT SEQ_ID KEY
10 cheeseburger 10 0 "burger"
If you then take a distinct on the SEQ_ID, you can de-dupe when multiple keys match. I would be inclined to also add an ID to the "data" table.
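For anyone who wants to sanity-check the matching rule outside the warehouse, here is a minimal Python sketch of the same idea: split the items string on ';' and do a case-insensitive substring test against the description. The function name matching_items and the sample dict/list are made up for illustration, and a plain substring test is only an approximation of ILIKE (it does not treat % or _ inside an item as wildcards).

```python
def matching_items(description: str, items: str, sep: str = ";") -> list[str]:
    """Return the items (split on sep) that occur in description, ignoring case."""
    haystack = description.lower()
    return [item for item in items.split(sep) if item and item.lower() in haystack]


# Hypothetical sample data mirroring the two tables in the question.
keys = {1: "burger;hotdog"}                 # id -> items
transactions = [(10, "cheeseburger", 10)]   # (transaction_id, description, amount)

# One output row per (transaction, matching item), like the JOIN in the query.
matches = [
    (tx_id, description, amount, key_id, item)
    for tx_id, description, amount in transactions
    for key_id, items in keys.items()
    for item in matching_items(description, items)
]
print(matches)  # [(10, 'cheeseburger', 10, 1, 'burger')]
```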

Related

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
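If the CROSS JOIN UNNEST semantics feel opaque, it may help to see the same "unwrapping" as a nested loop. This is only an illustrative Python sketch using the toy data from the query; the variable names are mine.

```python
# Toy rows: (name, array), as in the raw CTE.
raw = [("A", [1, 2, 3, 4, 5]), ("B", [5, 4, 3, 2, 1])]

# Each array element becomes its own row, paired with the name
# from the same source row. That pairing is what the join does.
long_format = [(name, val) for name, arr in raw for val in arr]

for row in long_format:
    print(row)
```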
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT operator! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
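To make the long-to-wide step concrete, here is a rough Python equivalent of the same toy example. enumerate stands in for WITH OFFSET, and because each (name, offset) pair occurs exactly once, a plain assignment does the job that ANY_VALUE does in the query. The names are illustrative only.

```python
from collections import defaultdict

raw = [("A", [1, 2, 3, 4, 5]), ("B", [5, 4, 3, 2, 1])]

# Long format with the position of each value, like WITH OFFSET.
long_format = [
    (name, val, offset)
    for name, arr in raw
    for offset, val in enumerate(arr)
]

# Pivot: one dict per name, one key per offset.
wide = defaultdict(dict)
for name, val, offset in long_format:
    wide[name][f"vals_{offset}"] = val

wide = dict(wide)
print(wide["A"])
```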
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
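The string-building step is easy to check in isolation. Here is a small Python sketch of what that sub-expression produces; pivot_in_list is a hypothetical helper name, not anything BigQuery provides.

```python
def pivot_in_list(first: int, last: int) -> str:
    """Comma-separated integers from first to last inclusive."""
    # GENERATE_ARRAY(first, last) includes both endpoints, hence last + 1.
    return ",".join(str(x) for x in range(first, last + 1))


in_list = pivot_in_list(0, 4)
query_tail = "FOR offset IN (" + in_list + ")"
print(query_tail)  # FOR offset IN (0,1,2,3,4)
```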
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
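As a rough illustration of the "build the query in your scripting language" route, here is a Python sketch that generates the repetitive A[OFFSET(i)] select list from the question. The helper names, the table name your_table and the default index_ prefix are placeholders; it only builds the SQL text and leaves running it to whatever client you use. Since it interpolates identifiers straight into the string, it should only be fed trusted names.

```python
def offset_columns(array_col: str, length: int, prefix: str = "index_") -> str:
    """One 'A[OFFSET(i)] AS index_i' expression per array position."""
    return ",\n".join(
        f"  {array_col}[OFFSET({i})] AS {prefix}{i}" for i in range(length)
    )


def build_query(table: str, array_col: str, length: int) -> str:
    """Assemble the full SELECT statement as a string."""
    return f"SELECT\n{offset_columns(array_col, length)}\nFROM {table}"


# Use length=1000 for the arrays described in the question.
print(build_query("your_table", "A", 3))
```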
Consider the approach below
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like the below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

pull specific data from broken json in database, using redshift

I have a source table in which one important column contains broken JSON. Below is a sample of the data:
event_properties
"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"
"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"
"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"
"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"
Each line represents one record. Is there a way to extract the voucher_id and voucher_name values, given that there is more than one variation in the data?
The goal is to extract the voucher id and voucher name like this:
voucher_id voucher_name
684883298 voucher 1
712001960 voucher 2
638584138 voucher 1
642124374 voucher 3
I'm using Redshift.
You can try this:
first verify if the JSON is valid with is_valid_json().
if not, inspect what is required to make it valid, in this case by removing leading and trailing ".
in this scenario, use trim() to remove the redundant " chars.
use json_extract_path_text() to obtain values.
SQL:
with json_data as (
select '"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"'::text as j union
select '"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"'::text union
select '"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"'::text union
select '"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"'::text)
select j,
is_valid_json(j),
trim('"' from j) as j_trimmed,
is_valid_json(j_trimmed),
json_extract_path_text(j_trimmed, 'voucher_id') as voucher_id
from json_data;
Yields the voucher_id values. Then use the same method to get the other keys' values.
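If you ever need to check such values outside Redshift, note that the sample looks like JSON that has been encoded twice: a JSON string whose content is itself a JSON object. Under that assumption (that the stored value literally contains the outer quotes and the backslashes), a Python sketch can simply decode twice instead of trimming. This is a different technique from the trim() approach above, and extract_voucher is a made-up name.

```python
import json

# One sample value, kept as a raw string so the backslashes survive.
raw = r'"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"'


def extract_voucher(value: str) -> tuple[int, str]:
    """Decode a double-encoded JSON value and pull out the two fields."""
    inner = json.loads(value)    # first pass: JSON string literal -> text
    record = json.loads(inner)   # second pass: text -> dict
    return record["voucher_id"], record["voucher_name"]


print(extract_voucher(raw))  # (684883298, 'voucher 1')
```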

SQL - count amount of occurences for items in different price diapasons

I have a question about SQL, and I honestly tried to search methods before asking. I will give an abstract (but precise) description below, and will greatly appreciate your example of solution (SQL query).
What I have:
Table A with category ids of the items and prices (in USD) for each item. category id has int type of value, price is string and looks like "USD 200000000" (real value is multiplied by 10^7). Tables also has a kind column with int type of value.
Table B with relation of category id and name.
What I need:
Get a table with price ranges ("diapasons", like 0-100 | 100-200 | ...) as column names, and count the number of items for each category id (as row names) in each price range. All results must be filtered by the kind parameter (from table A) with value 3.
Questions, that I encountered (and which caused to ask for an example of SQL query):
Cut "USD " from the price string value, divide it by 10^7 and convert to float.
Gather diapasons of price values (0-100 | 100-200 | ...), with given step in the given interval (max price is considered as unknown at the start). Example: step 100 on 0-500 interval, and step 200 for values >500.
Put diapasons of price values into column names of the result table.
For each diapason, count amount of items in each category (category_id). Left limit of diapason shall not be considered (e.g. on 1000-1200 diapason, items with price 1000 shall not be considered).
Using B table, display name instead of category id.
Response is appreciated, ignorance will be understood.
If you only need category ids, then you do not need B. What you are looking for is conditional aggregation, something like:
select category_id,
sum(case when cast(substring(price, 4, 100) as int)/10000000 < 100 then 1 else 0 end) as price_000_100,
sum(case when cast(substring(price, 4, 100) as int)/10000000 >= 100 and cast(substring(price, 4, 100) as int)/10000000 < 200
then 1 else 0
end) as price_100_200,
. . .
from a
group by category_id
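To make the parsing and bucketing rules from the question explicit, here is a hedged Python sketch. It assumes the price always looks like "USD <integer>", scales by 10^7, uses step 100 up to 500 and step 200 above, and treats each range as left-exclusive and right-inclusive as the question asks (which is slightly different from the < / >= boundaries in the query above). The sample rows and function names are invented for illustration.

```python
import math
from collections import Counter
from decimal import Decimal

SCALE = Decimal(10) ** 7  # stored value = real price * 10^7


def parse_price(price: str) -> Decimal:
    """'USD 200000000' -> Decimal('20')."""
    _currency, raw = price.split()
    return Decimal(raw) / SCALE


def price_range(usd: Decimal) -> str:
    """Label of the (low, high] range containing usd."""
    if usd <= 500:
        high = max(100, math.ceil(usd / 100) * 100)
        return f"{high - 100}-{high}"
    high = 500 + math.ceil((usd - 500) / 200) * 200
    return f"{high - 200}-{high}"


# Invented sample rows: (category_id, price, kind).
rows = [
    (1, "USD 200000000", 3),    # $20
    (1, "USD 1000000000", 3),   # $100, right edge of 0-100
    (1, "USD 1500000000", 3),   # $150
    (2, "USD 7000000000", 3),   # $700, right edge of 500-700
    (2, "USD 300000000", 1),    # filtered out: kind != 3
]

counts = Counter(
    (cat, price_range(parse_price(price)))
    for cat, price, kind in rows
    if kind == 3
)
print(dict(counts))
```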
There is no standard way to do what you describe.
That is because to do (3) you need a pivot aka crosstab, and this is not in ANSI SQL. Each DBMS has its own implementation. Plus dynamic columns in a pivot table are an additional complication.
For example, Postgres calls it a "crosstab" and requires the tablefunc module to be installed. See this SO question and the documentation. Compare to SQL Server, which uses the PIVOT command.
You can get close using reasonably standard SQL.
Here is an example based on SQLite. A little bit of conversion would provide a solution for other systems, e.g. SUBSTR would be substring(string [from int] [for int]) in PostgreSQL.
Assuming a data table of format:
and a category name table of:
then the following code will produce:
WITH dataCTE AS
(SELECT product_id AS 'ID', CAST(SUBSTR(price, 5) AS INT)/10000000 AS 'USD',
CASE WHEN (CAST(SUBSTR(price, 5) AS INT)/10000000) <= 500 THEN
100 ELSE 200
END AS 'Interval'
FROM data
WHERE kind = 3),
groupCTE AS
(SELECT dataCTE.ID AS 'ID', dataCTE.USD AS 'USD', dataCTE.Interval AS 'Interval',
CASE WHEN dataCTE.Interval = 100 THEN
CAST(dataCTE.USD AS INT)/100
ELSE
(CAST(dataCTE.USD-500 AS INT)/200)+5
END AS 'GroupID'
FROM dataCTE),
cleanCTE AS
(SELECT *, CASE WHEN groupCTE.Interval = 100 THEN
CAST(groupCTE.GroupID *100 AS VARCHAR)
|| '-' ||
CAST((groupCTE.GroupID *100)+99 AS VARCHAR)
ELSE
CAST(((groupCTE.GroupID-5)*200)+500 AS VARCHAR)
|| '-' ||
CAST(((groupCTE.GroupID-5)*200)+500+199 AS VARCHAR)
END AS 'diapason'
FROM groupCTE
INNER JOIN cat_name AS cn ON groupCTE.ID = cn.cat_id)
SELECT *
FROM cleanCTE;
If you modify the last SELECT to:
SELECT name, diapason, COUNT(diapason)
FROM cleanCTE
GROUP BY name, diapason;
then you get a grouped output:
This is as close as you will get without specifying the exact system; even then you will have a problem with dynamically creating the column names.

How to find the next sequence number in oracle string field

I have a database table with document names stored as a VARCHAR and I need a way to figure out what the lowest available sequence number is. There are many gaps.
name partial seq
A-B-C-0001 A-B-C- 0001
A-B-C-0017 A-B-C- 0017
In the above example, it would be 0002.
The distinct name values total 227,705. The number of "partial" combinations is quite large A=150, B=218, C=52 so 1,700,400 potential combinations.
I found a way to iterate through from min to max per distinct value and list all the "missing" (aka available) values, but this seems inefficient given we are not using anywhere close to the max potential partial combinations (10,536 out of 1,700,400).
I'd rather have a table based on existing data with a partial value, its next available sequence value, and a non-existent partial means 0001.
Thanks
Hmmmm, you can try this:
select coalesce(min(to_number(seq)), 0) + 1
from t
where partial = 'A-B-C-' and
not exists (select 1
from t t2
where t2.partial = t.partial and
to_number(T2.seq) = to_number(t.seq) + 1
);
EDIT:
You can use to_char() to convert it back to a character, if necessary.
For all partials you need a group by:
select partial, coalesce(min(to_number(seq)), 0) + 1
from t
where not exists (select 1
from t t2
where t2.partial = t.partial and
to_number(T2.seq) = to_number(t.seq) + 1
)
group by partial;
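For comparison, here is a small Python sketch of the "lowest unused number per partial" idea. It assumes the sequence is always the last 4 characters of the name, and next_available is a made-up helper. Unlike the queries above, it starts counting from 1, so a partial whose lowest existing number is 0005 yields 0001 here, whereas the queries would return 6.

```python
from collections import defaultdict


def next_available(names: list[str], width: int = 4) -> dict[str, str]:
    """Map each partial to its lowest unused sequence number, zero-padded."""
    used: dict[str, set[int]] = defaultdict(set)
    for name in names:
        # Assumption: the sequence is the last `width` characters.
        partial, seq = name[:-width], name[-width:]
        used[partial].add(int(seq))

    result = {}
    for partial, seqs in used.items():
        candidate = 1
        while candidate in seqs:
            candidate += 1
        result[partial] = f"{candidate:0{width}d}"
    return result


table = next_available(["A-B-C-0001", "A-B-C-0017"])
print(table)                      # {'A-B-C-': '0002'}
print(table.get("X-Y-Z-", "0001"))  # unknown partial falls back to 0001
```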

SQL group by number and replace characters

I have data stored in my database for mobile numbers.
I want to group by the column number in the database.
For example, some numbers may show 44123456789 and 0123456789 which is the same number. How can I group these together?
SELECT DIGITS(column_name) FROM table_name
Use this form in the database and assign the result to a variable; you can then match its digits against the other numbers.
Not sure it really suits you, but you could build this kind of subquery:
SELECT ta.`phone_nbr`,
COALESCE(list.`normalized_nbr`,ta.`phone_nbr`) AS nbr
FROM (
SELECT
t.`phone_nbr`,
SUBSTRING(t.`phone_nbr`,2) AS normalized_nbr
FROM `your_table` t
WHERE LEFT(t.`phone_nbr`,1) = '0'
UNION
SELECT
t.`phone_nbr`,
sub.`filter_nbr` AS normalized_nbr
FROM `your_table` t,
( SELECT
SUBSTRING(t2.`phone_nbr`,2) AS filter_nbr
FROM `your_table` t2
WHERE LEFT(t2.`phone_nbr`,1) = '0') sub
WHERE LEFT(t.`phone_nbr`,1) != '0'
AND t.`phone_nbr` LIKE CONCAT('%',sub.`filter_nbr`)
) list
RIGHT OUTER JOIN `your_table` ta
ON ta.`phone_nbr` = list.`phone_nbr`
It will return you a list of phone numbers with their "normalized" number, i.e. with the 0 or international prefix removed if there is a duplicate match, and the raw number otherwise.
You can then use a GROUP BY clause on the nbr field, join on the phone_nbr for the rest of your query.
It has some limits, as it can group numbers that merely look alike once stripped: +49123456789, +44123456789 and 0123456789 will all end up with the same normalized number.
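If the numbers all belong to one country, a simpler normalization may be enough. This Python sketch assumes a single known country code (44 here, taken from the example in the question) and strips either that code or a leading 0 before grouping. The normalize function and the sample list are illustrative only, and a number from another country is left untouched rather than merged.

```python
from collections import defaultdict


def normalize(number: str, country_code: str = "44") -> str:
    """Strip a leading 0 or the assumed country code from a phone number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if digits.startswith("0"):
        return digits[1:]
    if digits.startswith(country_code):
        return digits[len(country_code):]
    return digits


groups = defaultdict(list)
for nbr in ["44123456789", "0123456789", "+44123456789", "49123456789"]:
    groups[normalize(nbr)].append(nbr)

print(dict(groups))
```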