BigQuery - Sum of products of multiple values within two columns - google-bigquery

I have two string columns, ticket_units_count and price, each holding several semicolon-delimited values (for example 5;4 and 33104.0;23449.0).
I want to calculate a new column that adds up the products of the paired values from ticket_units_count and price, so it must be:
5 * 33104.0 + 4 * 23449.0 = 259316
How can I do that in BigQuery?
I tried this one
SELECT
SUM(CAST(price AS FLOAT64) * CAST(ticket_units_count AS INT64))
FROM table
But it shows this error: Bad double value: 33104.0;23449.0
I need help writing a query that returns the expected result.

Consider below approach
select *,
  ( select sum(cast(_count as int64) * cast(_price as float64))
    from unnest(split(ticket_units_count, ';')) _count with offset
    join unnest(split(price, ';')) _price with offset
    using (offset)
  ) as total
from your_table
if applied to the sample data in your question - output is total = 259316.0
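For reference, a self-contained version of this approach; the inline sample row is an assumption reconstructed from the error message and the expected result in the question:
with your_table as (
  -- hypothetical sample row implied by the question
  select '5;4' as ticket_units_count, '33104.0;23449.0' as price
)
select *,
  ( select sum(cast(_count as int64) * cast(_price as float64))
    from unnest(split(ticket_units_count, ';')) _count with offset
    join unnest(split(price, ';')) _price with offset
    using (offset)
  ) as total
from your_table
-- total = 5 * 33104.0 + 4 * 23449.0 = 259316.0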


How do I get the closest match (VLOOKUP) for an entire column in Google BigQuery SQL?

I am trying to take a column of original prices, apply a discount %, and return the closest match from a predetermined set of values. These allowable values are found in another table that is just one column of prices. I am curious to hear how ties would be handled. Please note that this is for a long list of items, so this would have to apply to an entire column. The specific syntax needed is Google BigQuery.
I envision this functioning similarly to Excel's VLOOKUP with approximate match = 1. In practice, I will apply the same solution to multiple price points in the results table (e.g. origPrice, 25% off, 50% off, 75% off, etc.), but I figured that I could copy-paste the solution multiple times.
The below example shows a 50% price reduction.
allowableDiscounts
discountPrice
$51.00
$48.50
$40.00

productInfo
Item     OrigPrice
Apple    $100.00
Banana   $98.00

Desired Output
Item     OrigPrice   exact50off   closestMatch
Apple    $100.00     $50.00       $51.00
Banana   $98.00      $44.00       $40.00
I have researched solutions here and elsewhere. Most of what I found suggested sorting the allowableDiscounts table by the absolute value of the difference between exact50off and discountPrice. That worked great for one instance, but I could not figure out how to apply that to an entire list of prices.
I have workarounds both in SQL and excel that can accomplish the same task manually, but I am looking for something to match the above function so that way if the allowableDiscounts table changes, the calculations will reflect that without recoding.
SELECT
  p.Item,
  p.OrigPrice,
  p.OrigPrice * 0.5 AS exact50off
  --new code from allowableDiscounts.discountPrice
FROM
  productInfo AS p
--WHERE
--  filters applied as needed
You can work this out with a CROSS JOIN, then compute the smallest difference and filter out the other generated records (those with larger differences).
The smallest difference is found by assigning a rank to the differences within each <Item, OrigPrice> partition (with ROW_NUMBER), then discarding all rows ranked higher than 1.
WITH cte AS (
  SELECT *,
    OrigPrice * 0.5 AS exact50off,
    ROW_NUMBER() OVER (PARTITION BY Item, OrigPrice
                       ORDER BY ABS(discountPrice - OrigPrice * 0.5)) AS rn
  FROM productInfo
  CROSS JOIN allowableDiscounts
)
SELECT Item,
       OrigPrice,
       exact50off,
       discountPrice
FROM cte
WHERE rn = 1
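To try this out, the question's sample rows can be inlined as CTEs (a sketch assuming the sample values shown above):
WITH productInfo AS (
  SELECT 'Apple' AS Item, 100.00 AS OrigPrice UNION ALL
  SELECT 'Banana', 98.00
),
allowableDiscounts AS (
  SELECT discountPrice FROM UNNEST([51.00, 48.50, 40.00]) AS discountPrice
),
cte AS (
  SELECT *,
    OrigPrice * 0.5 AS exact50off,
    ROW_NUMBER() OVER (PARTITION BY Item, OrigPrice
                       ORDER BY ABS(discountPrice - OrigPrice * 0.5)) AS rn
  FROM productInfo
  CROSS JOIN allowableDiscounts
)
SELECT Item, OrigPrice, exact50off, discountPrice AS closestMatch
FROM cte
WHERE rn = 1
-- Apple: exact50off 50.0 -> closestMatch 51.0; Banana: 49.0 -> 48.5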
In case the tables are large, as you stated, a cross join is not feasible and a window function is the way to go.
First we create a function nearest, which returns whichever of its two arguments (x or y) is closest to a target value.
Then we define both tables, allowableDiscounts and productInfo. Next, we union these tables into helper. The first column tmp holds the value 1 if the row comes from the main table productInfo, and for these rows we calculate the column exact50off. For rows from allowableDiscounts the tmp column is set to 0 and the exact50off column is filled with the discountPrice entries. We add the allowableDiscounts rows again, but for the column exact75off.
We query the helper table and use:
last_value(if(tmp=0,exact50off,null) ignore nulls) over (order by exact50off)
tmp=0: keep only entries from the table allowableDiscounts
last_value gets the nearest lower value from allowableDiscounts
We run the same again, but with desc, to obtain the nearest higher value.
The function nearest then yields the closer of the two.
The same is done analogously for exact75off.
create temp function nearest(target any type, x any type, y any type) as (
  if(abs(target - x) > abs(target - y), y, x)
);
with allowableDiscounts as (
  select * from unnest([51, 48.5, 40, 23, 20]) as discountPrice
),
productInfo as (
  select "Apple" as item, 100 as OrigPrice union all
  select "Banana", 98 union all
  select "Banana cheap", 88
),
helper as (
  select
    1 as tmp,                        # records which table the row comes from
    item, OrigPrice,                 # all columns of the table productInfo
    OrigPrice / 2 as exact50off,     # calc 50% off
    OrigPrice * 0.25 as exact75off,  # calc 75% off
  from productInfo
  union all                          # rows for 50% off
  select
    0 as tmp,
    null, null,                      # null entries for the two productInfo columns
    discountPrice as exact50off,     # possible values for 50% off
    null                             # other calc (75%)
  from allowableDiscounts
  union all                          # rows for 75% off
  select
    0 as tmp,
    null, null,                      # null entries for the two productInfo columns
    null,                            # other calc (50%)
    discountPrice,                   # possible values for 75% off
  from allowableDiscounts
)
select *,
  nearest(exact50off,
    last_value(if(tmp = 0, exact50off, null) ignore nulls) over (order by exact50off),
    last_value(if(tmp = 0, exact50off, null) ignore nulls) over (order by exact50off desc)
  ) as closestMatch50off,
  nearest(exact75off,
    last_value(if(tmp = 0, exact75off, null) ignore nulls) over (order by exact75off),
    last_value(if(tmp = 0, exact75off, null) ignore nulls) over (order by exact75off desc)
  ) as closestMatch75off,
from helper
qualify tmp = 1
order by exact50off
Yet another approach
create temp function vlookup(data array<float64>, key float64)
returns float64 language js as r'''
  let closestMatch = null;
  let closestDifference = Number.MAX_VALUE;
  for (let i = 0; i < data.length; i++) {
    const difference = Math.abs(data[i] - key);
    if (difference < closestDifference) {
      closestMatch = data[i];
      closestDifference = difference;
    }
  }
  return closestMatch;
''';
with priceOffList as (
  select *
  from unnest([25, 50, 75]) off
)
select * from (
  select Item, OrigPrice, off, offPrice, vlookup(arr, offPrice) as closestMatch
  from productInfo,
       (select array_agg(discountPrice order by discountPrice) arr from allowableDiscounts),
       priceOffList,
       unnest([OrigPrice * off / 100]) as offPrice
)
pivot (any_value(offPrice) offPrice, any_value(closestMatch) closestMatch for off in (25, 50, 75))
if applied to the sample data in your question - output is as expected
Use the ABS(X) function to compute the absolute difference between the columns of the two tables, treating a difference of zero as an exact match and a difference between 0.01 and 4.0 as a near match for the various discount values, as below. Use a LEFT JOIN to keep all rows of your leading table productInfo and get either matching values or NULL from the allowableDiscounts table.
SELECT
  p.Item,
  p.OrigPrice,
  p.OrigPrice * 0.5 AS exact50off,
  p.OrigPrice * 0.25 AS exact25off,
  p.OrigPrice * 0.75 AS exact75off,
  q.discountPrice AS closestMatch
FROM productInfo AS p
LEFT JOIN allowableDiscounts q
  ON ABS(p.OrigPrice * 0.50 - q.discountPrice) = 0
  OR ABS(p.OrigPrice * 0.50 - q.discountPrice) BETWEEN 0.01 AND 4.0
  OR ABS(p.OrigPrice * 0.25 - q.discountPrice) = 0
  OR ABS(p.OrigPrice * 0.75 - q.discountPrice) = 0
  OR ABS(p.OrigPrice * 0.25 - q.discountPrice) BETWEEN 0.01 AND 4.0
  OR ABS(p.OrigPrice * 0.75 - q.discountPrice) BETWEEN 0.01 AND 4.0;

How to unnest BigQuery nested records into multiple columns

I am trying to unnest the table below.
I am using the unnest query below to flatten the table:
SELECT
  id,
  name,
  keyword
FROM `project_id.dataset_id.table_id`,
  unnest(`groups`) as `groups`
WHERE id = 204358
The problem is that this duplicates the rows (except name), as is the case when flattening a table.
How can I modify the query to put the names in two different columns rather than rows?
Expected output below -
That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join: you repeat the parent row for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of rows, but a table must have a fixed number of columns.
So you need a way to decide which row becomes which column.
E.g. with
SELECT
  id,
  name,
  `groups`[ordinal(1)] as firstArrayEntry,
  `groups`[ordinal(2)] as secondArrayEntry,
  keyword
FROM `project_id.dataset_id.table_id`
WHERE id = 204358
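To try the ordinal() approach without the original table (it was only shown as an image), here is a sketch with hypothetical inline data; safe_ordinal() is useful here because ordinal(2) raises an error when the array has fewer than two elements:
with t as (
  -- hypothetical shape: top-level name/keyword plus an array of structs
  select 204358 as id, 'parent' as name, 'OVG' as keyword,
         [struct('name_1' as gname), struct('name_2' as gname)] as `groups`
)
select
  id,
  name,
  `groups`[safe_ordinal(1)] as firstArrayEntry,  -- whole struct; NULL if the element is missing
  `groups`[safe_ordinal(2)] as secondArrayEntry,
  keyword
from t
where id = 204358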
If your array had a key-value pair you could decide using the key. E.g.
SELECT
  id,
  name,
  (select value from unnest(`groups`) where key = 'key1') as key1,
  keyword
FROM `project_id.dataset_id.table_id`
WHERE id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but this one has restrictions too, and I'm not sure how computation-heavy it is.
Consider below simple solution
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
if applied to the sample data in your question - output is as expected
Note, when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list accordingly - for example offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns
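For illustration, the same solution applied to hypothetical inline data of the assumed shape (the real table was only shown as an image); here the offset is materialized as pos inside the subquery:
with sample_table as (
  -- hypothetical stand-in for project_id.dataset_id.table_id
  select 204358 as id, 'OVG' as keyword,
         [struct('name_1' as name), struct('name_2' as name)] as `groups`
)
select * from (
  select id, keyword, name, offset + 1 as pos
  from sample_table, unnest(`groups`) with offset
)
pivot (max(name) name for pos in (1, 2))
-- one row per (id, keyword) with columns name_1 and name_2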
If for whatever reason you want to improve on this, use the version below, where everything is built dynamically so you don't need to know in advance how many columns the output will have:
execute immediate (select '''
select * from (
  select id, name, keyword, offset
  from `project_id.dataset_id.table_id`,
  unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg(cast(pos as string), ', ' order by pos) || '''))
'''
from (select pos from (
  select max(array_length(`groups`)) cnt
  from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
       (SELECT g.name
        FROM UNNEST(t.`groups`) g WITH OFFSET n
        WHERE g.key = 'OVG'
        ORDER BY n
        LIMIT 1
       ) AS name_1,
       (SELECT g.name
        FROM UNNEST(t.`groups`) g WITH OFFSET n
        WHERE g.key = 'OVG'
        ORDER BY n
        LIMIT 1 OFFSET 1
       ) AS name_2,
       'OVG' AS keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;

How to calculate metrics between two tables

How do I calculate metrics between two tables? In addition, I noticed that when using FROM tbl1, tbl2 there is noise: the WHERE filters did not seem to work, and a total count(*) was returned.
Query:
select
count(*) filter(WHERE tb_get_gap.system in ('LINUX','UNIX')) as gaps,
SUM(CAST(srvs AS INT)) filter(WHERE tb_getcountsrvs.type = 'LZ') as total,
100 - (( gaps / total ) * 100)
FROM tb_get_gap, tb_getcountsrvs
Error:
SQL Error [42703]: ERROR: column "gaps" does not exist
I need a count over the tb_get_gap table where system in ('LINUX', 'UNIX'), then a SUM() of the srvs field in the tb_getcountsrvs table where type = 'LZ', and right after that to apply the formula 100 - ((gaps / total) * 100).
It would seem that you cannot define gaps and also use it in the same query. In SQL Server you would have to use the logic twice. Maybe a subquery would work better.
select 100 - ((t.gaps / t.total) * 100)
from
(
select
count(*) filter(WHERE tb_get_gap.system in ('LINUX','UNIX')) as gaps,
SUM(CAST(srvs AS INT)) filter(WHERE tb_getcountsrvs.type = 'LZ') as total
FROM tb_get_gap, tb_getcountsrvs
) t
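Note that the comma in FROM tb_get_gap, tb_getcountsrvs is a cross join, which is what produced the "noise" you saw: every row of one table gets paired with every row of the other, so counts and sums are inflated and the filters appear not to work. A minimal sketch that avoids it by computing each aggregate in its own single-row subquery (assuming Postgres, given the FILTER syntax; the numeric cast avoids integer division):
select 100 - ((g.gaps::numeric / s.total) * 100) as pct
from (
  select count(*) as gaps
  from tb_get_gap
  where system in ('LINUX', 'UNIX')
) g
cross join (
  select sum(cast(srvs as int)) as total
  from tb_getcountsrvs
  where type = 'LZ'
) s;
Here the cross join is harmless because each subquery returns exactly one row.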

SQL subqueries: problem with the ANY operator in a basic SQL query. Don't know why it is required in my query and how it works

I get the following error
Scalar subquery contains more than one row; SQL statement: select * from trip where price= ( select price from hiking_trip ); [90053-193]
From the SQL
SELECT *
FROM trip
WHERE price = (
SELECT price
FROM hiking_trip
);
I know the error disappears if I add ANY to my code after =, but I don't understand why it does not work without it. Shouldn't it give me the price equal to the given condition? And why does ANY make it work?
UPDATE: I got the point you told me, but then
select *
from country
where exists
(
select *
from mountain
where
mountain.country_id=country.id
);
in this query, won't the SELECT * statement in the subquery return more than one row or column? Yet it is working here.
It's because the subquery SELECT price FROM hiking_trip returns multiple rows.
You would need to use IN instead of = for it to work.
SELECT * FROM trip WHERE price IN ( SELECT price FROM hiking_trip );
This is because your subquery is returning more than one row.
You can use IN instead of =
SELECT * FROM trip WHERE price IN ( SELECT price FROM hiking_trip );
or you can use LIMIT
SELECT * FROM trip WHERE price = (SELECT price FROM hiking_trip LIMIT 1)
which restricts the subquery result to a single row (an arbitrary one, since there is no ORDER BY)
This fails because = is expecting a single value for the comparison, not a list of values which is what the select returns. Technically, it wants a scalar subquery -- a subquery that returns 1 column and 0 or 1 rows.
You can fix the syntax problem in many ways.
Use aggregation to return one value:
WHERE trip.price = ( SELECT MAX(price) FROM hiking_trip )
Use limit to return one value:
WHERE trip.price = (SELECT price FROM hiking_trip LIMIT 1)
Use in to match any value:
WHERE trip.price IN (SELECT price FROM hiking_trip)
Use any:
WHERE trip.price = ANY (SELECT price FROM hiking_trip)
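Regarding the EXISTS query in the update: EXISTS only tests whether the subquery returns at least one row, and the subquery's select list is never compared to any outer value, so it may return any number of rows and columns. A small illustration:
-- EXISTS checks row existence only; the select list is irrelevant,
-- so SELECT * (or SELECT 1) returning many rows is fine here.
SELECT *
FROM country c
WHERE EXISTS (
  SELECT 1
  FROM mountain m
  WHERE m.country_id = c.id
);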

Is there a better way to calculate the median (not average)

Suppose I have the following table definition:
CREATE TABLE x (i serial primary key, value integer not null);
I want to calculate the MEDIAN of value (not the AVG). The median is a value that divides the set into two subsets containing the same number of elements. If the number of elements is even, the median is the average of the biggest value in the lower segment and the smallest value of the upper segment. (See Wikipedia for more details.)
Here is how I manage to calculate the MEDIAN, but I guess there must be a better way:
SELECT AVG(values_around_median) AS median
FROM (
  SELECT DISTINCT
    (CASE WHEN FIRST_VALUE(above) OVER w2
          THEN MIN(value) OVER w3
          ELSE MAX(value) OVER w2 END) AS values_around_median
  FROM (
    SELECT LAST_VALUE(value) OVER w AS value,
           SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
    FROM x
    GROUP BY value
    WINDOW w AS (ORDER BY value)
    ORDER BY value
  ) AS find_if_values_are_above_or_below_median
  WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
         w3 AS (PARTITION BY above ORDER BY value ASC)
) AS find_values_around_median
Any ideas?
Yes, with PostgreSQL 9.4, you can use the newly introduced inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function that is specified in the SQL standard as well.
WITH t(value) AS (
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 100
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
t;
For the three values above this returns 2 (the middle value), whereas AVG() would return about 34.3. This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
Indeed there IS an easier way. In Postgres you can define your own aggregate functions. I posted functions to do median as well as mode and range to the PostgreSQL snippets library a while back.
http://wiki.postgresql.org/wiki/Aggregate_Median
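For reference, the approach on that wiki page boils down to accumulating values into an array and picking the middle element(s) in a final function; a sketch along those lines (close to the wiki version, so treat the wiki page as authoritative):
CREATE OR REPLACE FUNCTION _final_median(anyarray)
RETURNS float8 AS $$
  WITH q AS (
    SELECT val
    FROM unnest($1) val
    WHERE val IS NOT NULL
    ORDER BY 1
  ),
  cnt AS (
    SELECT count(*) AS c FROM q
  )
  SELECT avg(val)::float8
  FROM (
    SELECT val FROM q
    LIMIT  2 - MOD((SELECT c FROM cnt), 2)                   -- one middle row if odd count, two if even
    OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1, 0)  -- skip the lower half
  ) q2;
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE median(anyelement) (
  SFUNC = array_append,    -- accumulate values into an array
  STYPE = anyarray,
  FINALFUNC = _final_median,
  INITCOND = '{}'
);

-- usage: SELECT median(value) FROM x;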
A simpler query for that:
WITH y AS (
  SELECT value, row_number() OVER (ORDER BY value) AS rn
  FROM x
  WHERE value IS NOT NULL
),
c AS (SELECT count(*) AS ct FROM y)
SELECT CASE WHEN c.ct % 2 = 0 THEN
         round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2 + 1)), 3)
       ELSE
         (SELECT value FROM y WHERE y.rn = (c.ct + 1)/2)
       END AS median
FROM c;
Major points
Ignores NULL values.
The core feature is the row_number() window function, which has been available since version 8.4.
The final SELECT gets one row for odd counts and the avg() of two rows for even counts. The result is numeric, rounded to 3 decimal places.
A test shows that this version is 4x faster than (and yields correct results, unlike) the query in the question:
CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);
For googlers: there is also http://pgxn.org/dist/quantile
The median can be calculated in one line after installing this extension.
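A sketch of the usage, assuming the extension's quantile(value, fraction) aggregate signature (check the extension's docs):
-- median = 50th percentile
CREATE EXTENSION quantile;
SELECT quantile(value, 0.5) FROM x;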
Simple SQL with native Postgres functions only:
select
  case count(*) % 2
    when 1 then (array_agg(num order by num))[count(*)/2 + 1]
    else ((array_agg(num order by num))[count(*)/2]::double precision
          + (array_agg(num order by num))[count(*)/2 + 1]) / 2
  end as median
from unnest(array[5,17,83,27,28]) num;
Sure you can add coalesce() or something if you want to handle nulls.
-- "values" must be double-quoted: VALUES is a reserved word in Postgres
CREATE TABLE array_table (id integer, "values" integer[]);
INSERT INTO array_table VALUES (1, '{1,2,3}');
INSERT INTO array_table VALUES (2, '{4,5,6,7}');

select id, "values", cardinality("values") as array_length,
  (case when cardinality("values") % 2 = 0 and cardinality("values") > 1
        then ("values"[cardinality("values")/2] + "values"[cardinality("values")/2 + 1]) / 2::float
        else "values"[(cardinality("values") + 1)/2]::float
   end) as median
from array_table
Or you can create a function and use it anywhere in your further queries.
CREATE OR REPLACE FUNCTION median(a integer[])
RETURNS float AS $median$
DECLARE
  abc float;
BEGIN
  SELECT (case when cardinality(a) % 2 = 0 and cardinality(a) > 1
               then (a[cardinality(a)/2] + a[cardinality(a)/2 + 1]) / 2::float
               else a[(cardinality(a) + 1)/2]::float
          end) INTO abc;
  RETURN abc;
END;
$median$ LANGUAGE plpgsql;
select id, "values", median("values") from array_table
Use the function below for finding the nth percentile:
CREATE OR REPLACE FUNCTION nth_percentil(anyarray, int)
RETURNS anyelement AS
$$
  -- cast the computed position to int: array subscripts must be integers
  SELECT $1[($2/100.0 * array_upper($1,1) + 1)::int];
$$
LANGUAGE SQL IMMUTABLE STRICT;
In your case it's the 50th percentile.
Use the query below to get the median:
SELECT nth_percentil(ARRAY(SELECT Field_name FROM table_name ORDER BY 1), 50)
This will give you the 50th percentile, which is basically the median.
Hope this is helpful.