I'm trying to query all the unique values in my "Tags" column.
Each row in the Tags column can contain multiple comma-separated values.
So, without being forced to normalize the data, how can I query a multi-valued column?
Example Rows:
Networking
Professionalism
Time Management
Communication, Networking
Career Management, Professionalism
Networking
Communication
Attitude, Interpersonal Skills, Professionalism
Business Protocol, Career Management, Communication, Leadership
Business Protocol, Networking
If the maximum number of elements is predictable you can use this (please note that you need to use UNION, not UNION ALL)
Select Distinct Trim(thefield) From thetable Where Instr(thefield, ',') = 0
UNION
-- first element of any multi-valued row
Select Distinct Trim(Mid(thefield, 1, Instr(thefield, ',') - 1)) From thetable Where Instr(thefield, ',') > 0
UNION
-- second element of rows with exactly one comma
Select Distinct Trim(Mid(thefield, Instr(thefield, ',') + 1)) From thetable Where Len(thefield) - Len(Replace(thefield, ',', '')) = 1
UNION
-- second element of rows with exactly two commas
Select Distinct Trim(Mid(thefield, Instr(thefield, ',') + 1, Instr(Instr(thefield, ',') + 1, thefield, ',') - Instr(thefield, ',') - 1)) From thetable Where Len(thefield) - Len(Replace(thefield, ',', '')) = 2
UNION
-- third element of rows with exactly two commas
Select Distinct Trim(Mid(thefield, Instr(Instr(thefield, ',') + 1, thefield, ',') + 1)) From thetable Where Len(thefield) - Len(Replace(thefield, ',', '')) = 2
-- ...and so on: for each extra comma, add one Select per element position, increasing the Where condition by one
Looks a bit clunky, but should do the job. Untested; the Trim() calls are there to strip the leading space that follows each comma in the sample data.
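If your engine has a built-in string splitter, the whole exercise collapses to a couple of lines. A minimal sketch, assuming SQL Server 2016+ (the Instr/Mid functions above suggest Access, which has no such splitter) and the same thetable/thefield names:

SELECT DISTINCT LTRIM(s.value) AS tag
FROM thetable
CROSS APPLY STRING_SPLIT(thefield, ',') AS s;

Postgres users can get the same effect with regexp_split_to_table(thefield, ',\s*').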
First post, hope I don't do anything too crazy
I want to go from JSON/object to long in terms of formatting.
I have a table set up as follows (note: there will be a large but finite number of 50+ activity columns, 2 is a minimal working example). I'm not concerned about the formatting of the date column - different problem.
customer_id (varchar), activity_count (object, int), activity_duration (object, numeric)
sample starting point
In this case I'd like to explode this into:
customer_id (varchar), time_period, activity_count (int), activity_duration (numeric)
sample end point - long
minimum data set
WITH smpl AS (
SELECT
'12a' AS id,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 26,
'd1912', 6,
'd2001', 73) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 260.1,
'd1912', 30,
'd2001', 712.3) AS activity_duration
UNION ALL
SELECT
'13b' AS id,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2,
'd1912', 3,
'd2001', 4) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2.2,
'd1912', 3.3,
'd2001', 4.3) AS activity_duration
)
select * from smpl
Extra credit for also taking this from JSON/object to wide (in Google BigQuery it's SELECT id, activity_count.* FROM tbl).
Thanks in advance.
I've tried tons of random FLATTEN() based joins. In this instance I probably just need one working example.
This needs to scale to a moderate but finite number of objects (e.g. 50).
I'll also see if I can combine this with: Lateral flatten two columns without repetition in snowflake
Using FLATTEN:
WITH smpl AS (...)
SELECT s1.id, s1.key, s1.value AS activity_count, s2.value AS activity_duration
FROM (SELECT id, key, value FROM smpl, TABLE(FLATTEN(input => activity_count))) AS s1
JOIN (SELECT id, key, value FROM smpl, TABLE(FLATTEN(input => activity_duration))) AS s2
  ON s1.id = s2.id AND s1.key = s2.key;
Output:
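ID   KEY    ACTIVITY_COUNT  ACTIVITY_DURATION
12a  d1910  0               0
12a  d1911  26              260.1
12a  d1912  6               30
12a  d2001  73              712.3
13b  d1910  1               1
13b  d1911  2               2.2
13b  d1912  3               3.3
13b  d2001  4               4.3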
@Lukasz Szozda gets close, but that answer doesn't scale as well with multiple variables (it's essentially a bunch of cartesian products, and I'd need a lot of ON conditions). I have a known constraint (each field is in a strict format) so it's easy to recycle the key.
After WAY WAY WAY too much messing with this (off-and-on searches for weeks) it finally clicked, and it's pretty easy.
SELECT
    id, key, activity_count[key], activity_duration[key]
    -- further object columns (say, a hypothetical activity_duration2[key]) are added the same way
FROM smpl, LATERAL FLATTEN(input => activity_count);
You can also use things other than key, such as index.
It's inspired by the link below, but I just didn't quite follow it at first:
https://stackoverflow.com/a/36804637/20994650
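As for the extra credit (wide format): Snowflake has no equivalent of BigQuery's activity_count.*, but since the key names are known up front you can pull them out of the objects directly. A sketch against the smpl CTE above; the aliases and casts are my assumptions based on the sample values:

SELECT
    id,
    activity_count['d1910']::int              AS count_d1910,
    activity_count['d1911']::int              AS count_d1911,
    activity_duration['d1910']::number(10,1)  AS duration_d1910,
    activity_duration['d1911']::number(10,1)  AS duration_d1911
    -- ...and so on for the remaining periods
FROM smpl;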
Not being an SQL expert, I am struggling with the following:
I inherited a largish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly short-lived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:
phen_ID   record_time   producer_id   consumer_id   state   ...
000123    10198789                                  start
          10298776      000123        000112        hjhkk
000124    10477886                                  start
          10577876      000124        000123        iuiii
000124    10876555                                  end
Each phenomenon (phen_ID) has a start event and, theoretically, an end event, although the end might not have occurred yet and thus not been recorded. Each phenomenon can then go through several states. Unfortunately, for some states the ID is recorded in either the producer or the consumer field. Also, the number of states is not fixed, and neither is the time between the states.
To begin with, I need to create an SQL statement that, for each phen_ID, shows the start time and the time of the last recorded event (which could be an end state or one of the intermediate states).
Just considering a single phen_ID, I managed to pull together the following SQL:
WITH myconstants (var1) AS (
    VALUES ('000123')
)
SELECT min(l.record_time), max(l.record_time)
FROM (
    SELECT DISTINCT *
    FROM public.phen_table
    JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) AS l
As the start state always has the lowest record_time for the specific phenomenon, the above statement correctly returns the recorded time range as one row, irrespective of what the end state is.
Obviously, here I have to supply the phen_ID manually.
How can I make this work so that I get a row with the start time and maximum recorded time for each unique phen_ID? I played around with trying to fit in something like select distinct phen_id ... but was not able to "feed" the IDs automatically into the above. Or am I completely off the mark here?
Addition:
Just to clarify, the ideal output using the table above would look something like this:
ID       min-time    max-time
000123   10198789    10577876   (min-time is start, max-time is state iuiii)
000124   10477886    10876555   (min-time is start, max-time is end state)
union all might be an option:
select phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from (
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id
On the other hand, if you want prioritization, then you can use coalesce():
select coalesce(phen_id, producer_id, consumer_id) as phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)
The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and values differ, then the first query takes in account all non-null values, while the second considers only the "first" non-null value.
Edit
In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:
select x.phen_id,
min(p.record_time) as min_record_time,
max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id
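For reference, against the sample table the union all and lateral variants both produce the following (note the extra row for 000112, which only ever appears as a consumer_id):

phen_id   min_record_time   max_record_time
000112    10298776          10298776
000123    10198789          10577876
000124    10477886          10876555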
I think you're on the right track. Try this and see if it is what you are looking for:
select
     min(record_time) as min_record_time
    ,max(record_time) as max_record_time
    ,coalesce(phen_id, producer_id, consumer_id) as phen_id
from public.phen_table
group by coalesce(phen_id, producer_id, consumer_id)
I know there are some posts on pivoting, which I have used to get where I am today (thanks to the BQ community!). But this post seeks some advice on optimising things where a large number of pivot columns is needed and distributed table joins are needed... as well as deduping. Not asking much, right!
Objective:
We have 2 large BQ tables, with a full 10 years of history, that need joining:
sales_order_header (13 GB, 1.35 million rows)
sales_order_line (50 GB, 5 million rows)
This is a typical 'header/line' one-to-many relationship. The data for the tables unfortunately arrives as 2 separate streams rather than 1 document style where the line is nested inside the header, which would be ideal. So distributed joins become necessary for some of the views our BI tool (Tableau) wants to periodically (every 60 mins) call to ingest 'cleansed' data that is:
deduped (both tables that is)
joined header to line (on salesOrderId)
each has its own array of 'sourceData' name/value pairs that needs unpacking/'pivoting' so it's not an array
Point 3 presents an issue in its own right. We have a column called 'sourceData' which is basically where the core data is: it's an array of string name/value pairs (a row in BQ is a replication of a single row from a DB, so the key is a column name and the value is that column's value for a single row).
Now I think here lies the issue: as there are 250 array entries (we know the exact number up front), this equates to 250 'unnest' statements each, using the best approach I can think of, sub-selects:
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
250 times
And this is done as a pattern for each of the header and line tables' respective views.
So the SQL for the view for just retrieving a deduped, flattened/pivoted array for the sales_order_header table is as follows. The sales_order_line has the same pattern for its view:
#standardSQL
WITH latest_snapshot_dups AS (
SELECT
salesOrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", lastUpdated) AS lastUpdatedTimestampUTC,
sourceData,
_PARTITIONTIME AS bqPartitionTime
FROM
`project.ds.sales_order_header_refdata`
),
latest_snapshot_nodups AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY salesOrderId ORDER BY lastUpdatedTimestampUTC DESC) AS rowNum
FROM latest_snapshot_dups
)
SELECT
salesOrderId,
lastUpdatedTimestampUTC,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM
latest_snapshot_nodups
WHERE
rowNum = 1
Although just showing one here, we have these two similar views (with a total of 250 + 300 = 550 unique subqueries that unnest/pivot), and when I now join the header view with the line view I run into an issue straight away, exceeding a limit on subqueries.
Is there a better way to do this, assuming this is the data there is to work with? A better way to 'pivot' perhaps? Or a more efficient way of building a single view that optimises the order of things, rather than using 2 discrete views?
Thanks for your help BQ Community!
I run into an issue straight away exceeding a limit of subqueries
You are currently using the below pattern (I removed the less significant parts of the code for simplicity):
#standardSQL
SELECT
salesOrderId,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM latest_snapshot_nodups
Try the below pattern:
#standardSQL
SELECT
salesOrderId,
MAX(IF(name = 'a', val, NULL)) AS a,
MAX(IF(name = 'b', val, NULL)) AS b,
....250 of these
FROM latest_snapshot_nodups, UNNEST(sourceData) kv
GROUP BY salesOrderId
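BigQuery has since gained a native PIVOT operator that expresses the same MAX(IF(...)) aggregation declaratively. A sketch, assuming the same latest_snapshot_nodups CTE; the IN list still has to enumerate all 250 keys:

#standardSQL
SELECT *
FROM (
  SELECT salesOrderId, kv.name, kv.val
  FROM latest_snapshot_nodups, UNNEST(sourceData) kv
  WHERE rowNum = 1
)
PIVOT (MAX(val) FOR name IN ('a' AS a, 'b' AS b))  -- ....250 of these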
I'm looking for something like SELECT PRODUCT(table.price) FROM table GROUP BY table.sale, similar to how SUM works.
Have I missed something on the documentation, or is there really no PRODUCT function?
If so, why not?
Note: I looked for the function in Postgres, MySQL and MSSQL and found none, so I assumed no SQL dialect supports it.
For MSSQL you can use this. It can be adapted for other platforms: it's just maths and aggregates on logarithms.
SELECT
GrpID,
CASE
WHEN MinVal = 0 THEN 0
WHEN Neg % 2 = 1 THEN -1 * EXP(ABSMult)
ELSE EXP(ABSMult)
END
FROM
(
SELECT
GrpID,
--log of +ve row values
SUM(LOG(ABS(NULLIF(Value, 0)))) AS ABSMult,
--count of -ve values. Even = +ve result.
SUM(SIGN(CASE WHEN Value < 0 THEN 1 ELSE 0 END)) AS Neg,
--anything * zero = zero
MIN(ABS(Value)) AS MinVal
FROM
Mytable
GROUP BY
GrpID
) foo
Taken from my answer here: SQL Server Query - groupwise multiplication
I don't know why there isn't one, but (taking more care over negative numbers) you can use logs and exponents to do it:
select exp (sum (ln (table.price))) from table ...
There is no PRODUCT set function in the SQL Standard. It would appear to be a worthy candidate, though (unlike, say, a CONCATENATE set function: it's not a good fit for SQL, e.g. the resulting data type would involve multivalues and pose a problem as regards first normal form).
The SQL Standards aim to consolidate functionality across SQL products circa 1990 and to provide 'thought leadership' on future development. In short, they document what SQL does and what SQL should do. The absence of a PRODUCT set function suggests that in 1990 no vendor thought it worthy of inclusion and that there has been no academic interest in introducing it into the Standard since.
Of course, vendors always have sought to add their own functionality, these days usually as extensions to the Standards rather than tangentially to them. I don't recall seeing a PRODUCT set function (or even demand for one) in any of the SQL products I've used.
In any case, the workaround is fairly simple using the LOG and EXP scalar functions (plus some logic to handle negatives) with the SUM set function; see @gbn's answer for some sample code. I've never needed to do this in a business application, though.
In conclusion, my best guess is that there is no demand from SQL end users for a PRODUCT set function; further, that anyone with an academic interest would probably find the workaround acceptable (i.e. would not value the syntactic sugar a PRODUCT set function would provide).
Out of interest, there is indeed demand in SQL Server Land for new set functions but for those of the window function variety (and Standard SQL, too). For more details, including how to get involved in further driving demand, see Itzik Ben-Gan's blog.
You can perform a product aggregate function, but you have to do the maths yourself, like this...
SELECT
Exp(Sum(IIf(Abs([Num])=0,0,Log(Abs([Num])))))*IIf(Min(Abs([Num]))=0,0,1)*(1-2*(Sum(IIf([Num]>=0,0,1)) Mod 2)) AS P
FROM
Table1
Source: http://productfunctionsql.codeplex.com/
There is a neat trick in T-SQL (not sure if it's ANSI) that allows you to concatenate string values from a set of rows into one variable. It looks like it works for multiplication as well:
declare @Floats as table (value float)
insert into @Floats values (0.9)
insert into @Floats values (0.9)
insert into @Floats values (0.9)
declare @multiplier float = null
select
    @multiplier = isnull(@multiplier, 1) * value
from @Floats
select @multiplier
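For the three 0.9 rows above, the final select returns 0.729 (0.9 × 0.9 × 0.9), give or take the usual float rounding.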
This can potentially be more numerically stable than the log/exp solution.
I think that is because no number type is able to accommodate so many products. As databases are designed for large numbers of records, a product of 1000 numbers would be supermassive, and in the case of floating-point numbers, the propagated error would be huge.
Also note that using log can be a dangerous solution. Although mathematically log(a*b) = log(a) + log(b), this might not hold exactly in computers, as we are not dealing with true real numbers. If you calculate exp(log(a) + log(b)) instead of a*b, you may get unexpected results. For example:
SELECT 9999999999*99999999974482, EXP(LOG(9999999999)+LOG(99999999974482))
in SQL Server returns
999999999644820000025518, 9.99999999644812E+23
So my point is: when you do the product, do it carefully and test it heavily.
One way to deal with this problem (if you are working in a scripting language) is to use the group_concat function.
For example, SELECT group_concat(table.price) FROM table GROUP BY table.sale
This will return a string with all prices for the same sale value, separated by a comma.
Then with a parser you can get each price and do the multiplication. (In PHP you can even use the array_reduce function; in fact, the php.net manual has a suitable example.)
Cheers
Another approach, based on the fact that the cardinality of a cartesian product is the product of the cardinalities of the particular sets ;-)
⚠ WARNING: This example is just for fun and is rather academic; don't use it in production! (Apart from anything else, it only works for positive and practically small integers.) ⚠
with recursive t(c) as (
select unnest(array[2,5,7,8])
), p(a) as (
select array_agg(c) from t
union all
select p.a[2:]
from p
cross join generate_series(1, p.a[1])
)
select count(*) from p where cardinality(a) = 0;
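For the array [2,5,7,8] this returns 560 (= 2 × 5 × 7 × 8): each recursive step fans every row out into a[1] copies of its tail, so the number of rows that end up with an empty array is exactly the product.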
The problem can be solved using modern SQL features such as window functions and CTEs. Everything is standard SQL and, unlike logarithm-based solutions, it does not require switching from the integer world to the floating-point world nor handling nonpositive numbers. Just number the rows and evaluate the product in a recursive query until no rows remain:
with recursive t(c) as (
select unnest(array[2,5,7,8])
), r(c,n) as (
select t.c, row_number() over () from t
), p(c,n) as (
select c, n from r where n = 1
union all
select r.c * p.c, r.n from p join r on p.n + 1 = r.n
)
select c from p where n = (select max(n) from p);
As your question involves grouping by the sale column, things get a little bit more complicated, but it's still solvable:
with recursive t(sale,price) as (
select 'multiplication', 2 union
select 'multiplication', 5 union
select 'multiplication', 7 union
select 'multiplication', 8 union
select 'trivial', 1 union
select 'trivial', 8 union
select 'negatives work', -2 union
select 'negatives work', -3 union
select 'negatives work', -5 union
select 'look ma, zero works too!', 1 union
select 'look ma, zero works too!', 0 union
select 'look ma, zero works too!', 2
), r(sale,price,n,maxn) as (
select t.sale, t.price, row_number() over (partition by sale), count(1) over (partition by sale)
from t
), p(sale,price,n,maxn) as (
select sale, price, n, maxn
from r where n = 1
union all
select p.sale, r.price * p.price, r.n, r.maxn
from p
join r on p.sale = r.sale and p.n + 1 = r.n
)
select sale, price
from p
where n = maxn
order by sale;
Result:
sale,price
"look ma, zero works too!",0
multiplication,560
negatives work,-30
trivial,8
Tested on Postgres.
Here is an Oracle solution for anyone who needs it:
with data(id, val) as (
    select 1,  1.0 from dual union all
    select 2, -2.0 from dual union all
    select 3,  1.0 from dual union all
    select 4,  2.0 from dual
),
neg(val, modifier) as (
    select exp(sum(ln(abs(val)))),
           case when mod(count(*), 2) = 0 then 1 else -1 end
    from data
    where val < 0
),
pos(val) as (
    select exp(sum(ln(val)))
    from data
    where val >= 0
)
select (select val * modifier from neg) * (select val from pos) as product
from dual
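Against the sample data this returns -4 (1 × -2 × 1 × 2). One caveat: if there are no negative rows at all, the neg CTE yields NULL and so would the whole product; wrapping the subselects in nvl() would guard against that.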
I am practicing SQL in Microsoft SQL Server 2012 (not a homework question), and have a table Names. The table shows baby names by year, with columns Sex (gender of name), N (number of babies having that name), Yr (year), and Name (the name itself).
I need to write a query using only one SELECT statement that returns the most popular baby name by year, with gender, the year, and the number of babies named. So far I have:
SELECT *
From Names
ORDER By N DESC;
This gives the highest values of N in descending order, but with years repeating. I need to limit it to only the highest value in each year, and everything I have tried to do so has thrown errors. Any advice you can give me on this would be appreciated.
Off the top of my head, something like the following would normally let you do it in (technically) one SELECT statement. That statement includes sub-SELECTs, but I'm not immediately seeing an alternative that wouldn't.
When there are joint top-ranking names, both queries will bring back all the joint top results, so there may not be exactly one answer per year. If you then just need a single representative row from those results, look at using SELECT TOP 1, perhaps adding ORDER BY to get the first alphabetically.
Most popular by year regardless of gender:
-- ONE PER YEAR:
SELECT n.Yr, n.Name, n.Sex, n.N FROM Names n
WHERE NOT EXISTS (
    SELECT 1 FROM Names n2
    WHERE n2.Yr = n.Yr
    AND n2.N > n.N
)
Most popular by year for each gender:
-- ONE PER GENDER PER YEAR:
SELECT n.Yr, n.Name, n.Sex, n.N FROM Names n
WHERE NOT EXISTS (
    SELECT 1 FROM Names n2
    WHERE n2.Yr = n.Yr
    AND n2.Sex = n.Sex
    AND n2.N > n.N
)
Performance is, despite the verbosity of the SQL, usually on a par with alternatives when using this pattern (often better).
There are other approaches, including using GROUP BY or window functions, but personally I find this one more readable and standard across DBMSs.
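For completeness, here is a sketch of the window-function alternative (RANK() is available in SQL Server 2012), written against the Names(Sex, N, Yr, Name) schema from the question; like the NOT EXISTS version it keeps ties:

SELECT Yr, Name, Sex, N
FROM (
    SELECT Yr, Name, Sex, N,
           RANK() OVER (PARTITION BY Yr ORDER BY N DESC) AS rnk
           -- for one winner per gender per year, use PARTITION BY Yr, Sex
    FROM Names
) ranked
WHERE rnk = 1;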