Inclusion of nulls with ANY_VALUE in BigQuery


I have a 'vendors' table that looks like this...
company, itemKey, itemPriceA, itemPriceB
companyA, 203913, 20, 10
companyA, 203914, 20, 20
companyA, 203915, 25, 5
companyA, 203916, 10, 10
It has potentially millions of rows per company and I want to query it to bring back a representative delta between itemPriceA and itemPriceB for each company. I don't care which delta I bring back as long as it isn't zero/null (like row 2 or 4), so I was using ANY_VALUE like this...
SELECT company
, ANY_VALUE(CASE WHEN (itemPriceA-itemPriceB)=0 THEN null ELSE (itemPriceA-itemPriceB) END)
FROM vendors
GROUP BY 1
It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...
"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."
If ANY_VALUE returns null "when expression is NULL for all rows in the group" it should NEVER return null for companyA right (since only 2 of 4 rows are null)? But the second sentence sounds like it will indeed include the null rows.
P.s. you may be wondering why I don't simply add a WHERE clause saying "WHERE itemPriceA-itemPriceB>0" but in the event that a company has ONLY matching prices, I still want the company to be returned in my results.

Clarification
I'm afraid the accepted answer will have to show stronger evidence that contradicts the docs.
@Raul Saucedo suggests that the following BigQuery documentation is referring to WHERE clauses:
rows for which expression is NULL are considered and may be selected
This is not the case. WHERE clauses are not mentioned anywhere in the ANY_VALUE docs. (Nowhere on the page. Try to ctrl+f for it.) And the docs are clear, as I'll explain.
@d3wannabe is correct to wonder about this:
It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...
"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."
But the docs are not contradictory. The 2 sentences coexist.
"Returns NULL when expression is NULL for all rows in the group." So if all rows in a column are NULL, it will return NULL.
"ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected." So if the column has rows mixed with NULLs and actual data, it will select anything from that column, including nulls.
How to create an ANY_VALUE without nulls in BigQuery
We can use ARRAY_AGG to turn a group of values into a list. This aggregate function has the option to IGNORE NULLS. We then select 1 item from the list after ignoring nulls.
If we have a table with 2 columns: id and mixed_data, where mixed_data has some rows with nulls:
SELECT
id,
ARRAY_AGG( -- turn the mixed_data values into a list
mixed_data -- we'll create an array of values from our mixed_data column
IGNORE NULLS -- there we go!
LIMIT 1 -- only fill the array with 1 thing
)[SAFE_OFFSET(0)] -- grab the first item in the array
AS any_mixed_data_without_nulls
FROM your_table
GROUP BY id
See similar answers here:
https://stackoverflow.com/a/53508606/6305196
https://stackoverflow.com/a/62089838/6305196
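To make the pattern concrete for the vendors table from the question, here is a minimal Python sketch that models what ARRAY_AGG(x IGNORE NULLS LIMIT 1)[SAFE_OFFSET(0)] does per group. It is only a model of the semantics, not BigQuery itself: the helper names and the companyB row are assumptions added for illustration, and BigQuery gives no guarantee about which non-null value comes first unless you add an ORDER BY inside ARRAY_AGG.

```python
from collections import defaultdict

# Sample rows: (company, itemKey, itemPriceA, itemPriceB).
# companyB is a hypothetical row added to cover the "only matching prices" case.
vendors = [
    ("companyA", 203913, 20, 10),
    ("companyA", 203914, 20, 20),
    ("companyA", 203915, 25, 5),
    ("companyA", 203916, 10, 10),
    ("companyB", 300001, 7, 7),
]


def first_non_null(values):
    """Model of ARRAY_AGG(x IGNORE NULLS LIMIT 1)[SAFE_OFFSET(0)]."""
    kept = [v for v in values if v is not None][:1]  # IGNORE NULLS, then LIMIT 1
    return kept[0] if kept else None  # SAFE_OFFSET(0): NULL when the array is empty


def representative_delta(rows):
    """Group by company and pick one non-zero delta, or None if there is none."""
    deltas = defaultdict(list)
    for company, _item_key, price_a, price_b in rows:
        delta = price_a - price_b
        # Same idea as the CASE expression: a zero delta becomes NULL.
        deltas[company].append(delta if delta != 0 else None)
    return {company: first_non_null(vals) for company, vals in deltas.items()}


result = representative_delta(vendors)
print(result)
```

A company whose prices all match still appears in the result, with a null delta, which is what the question asks for.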
Update, 2022-08-12
There is evidence that the docs may be inconsistent with the actual behavior of the function. See Samuel's latest answer to explore his methodology.
However, we cannot know if the docs are incorrect and ANY_VALUE behaves as expected or if ANY_VALUE has a bug and the docs express the intended behavior. We don't know if Google will correct the docs or the function when they address this issue.
Therefore I would continue to use ARRAY_AGG to create a safe ANY_VALUE that ignores nulls until we see a fix from Google.
Please upvote the issue in Google's Issue Tracker to see this resolved.

This is an explanation of how ANY_VALUE works with null values.
ANY_VALUE always returns the first value, if there is a value different from null.
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, "banana",null,null]) as fruit;
It returns null if all rows have null values. This refers to the sentence:
“Returns NULL when expression is NULL for all rows in the group”
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, null, null]) as fruit
It returns null if a value is null and you select for it in the WHERE clause. This refers to the sentences:
“ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which
expression is NULL are considered and may be selected.”
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST(["apple", "banana", null]) as fruit
where fruit is null
The result always depends on the filter you use and on the expression inside ANY_VALUE.
In this example, the WHERE clause keeps only the two rows whose delta is different from 0, and ANY_VALUE picks one of them for each company.
SELECT e.company, ANY_VALUE(itemPriceA-itemPriceB) as value
FROM `vendor` e
where (itemPriceA-itemPriceB)!=0
group by e.company

The documentation says that "NULL are considered and may be" returned by an any_value statement. However, I am quite sure the documentation is wrong here. In the current implementation, which was tested on 13th August 2022, the any_value will return the first value of that column. However, if the table does not have an order by specified, the sorting may be random due to processing of the data on several nodes.
For testing, a large table of nulls is needed, and GENERATE_ARRAY comes in handy for that. The table tmp holds an array of 1 million zeros, where zero stands in for null. The table tbl then places these 1 million zeros before and after the entries [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9]. Finally, NULLIF(x,0) AS x replaces all zeros with null.
Several tests of any_value are run against the test table tbl. If the table is not sorted further, the first non-null value of that column is returned: -100.
WITH
tmp AS (SELECT ARRAY_AGG(0) AS tmp0 FROM UNNEST(GENERATE_ARRAY(1,1000*1000))),
tbl AS (
SELECT
NULLIF(x,0) AS x,
IF(x!=0,x,NULL) AS y,
rand() AS rand
FROM
tmp,
UNNEST(ARRAY_CONCAT(tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0)) AS x )
SELECT "count rows", COUNT(1) FROM tbl
UNION ALL SELECT "count items not null", COUNT(x) FROM tbl
UNION ALL SELECT "any_value(x): (returns first non null element in list: -100)", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "2nd run", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "3rd run", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "any_value(y)", ANY_VALUE(y) FROM tbl
UNION ALL SELECT "order asc", ANY_VALUE(x) FROM (Select * from tbl order by x asc)
UNION ALL SELECT "order desc (returns largest element: 9)", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order desc", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order abs(x) desc", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) desc )
UNION ALL SELECT "order abs(x) asc (smallest number: 3)", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) asc )
UNION ALL SELECT "order rand asc", ANY_VALUE(x) FROM (Select * from tbl order by rand asc )
UNION ALL SELECT "order rand desc", ANY_VALUE(x) FROM (Select * from tbl order by rand desc )
This gives the following result:
The first non-null entry, -100, is returned.
Sorting the table by this column causes any_value to always return the first entry.
In the last two examples, the table is ordered by random values, thus any_value returns random entries.
If the dataset is larger than 2 million rows, the table may be split internally for processing; this results in an unordered table. Without an ORDER BY, the first entry of the table, and thus the result of any_value, cannot be predicted.
For testing this, please replace the 10th line by
UNNEST(ARRAY_CONCAT(tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0,tmp0)) AS x )

Related

Data Studio obtain a transposed table from BigQuery and hidden conditional formatting in text/strings

Several shops should be monitored about their status in a Data Studio dashboard.
There are fewer than 20 shops, and I show only two of them in this example. The BigQuery table has a shop column and the following columns: status, info, sold_today and update_time. The shop and update_time columns are always filled, but the others are filled only if there is a change.
Task: For each shop the last entries of all columns should be shown.
Here is the BigQuery code for the sample table:
create or replace table dsadds.eu.dummy as(
Select "shop A" as shop, 1000 as sold_today, "sale ABC" as info, 0 as status, timestamp("2022-09-05") as update_time
union all select "shop A", null, null, 1, "2022-09-06"
union all select "shop A", 500, "open", 3, "2022-09-01"
union all select "shop B", 700, "open", 3, current_timestamp()
)
In Data Studio, with conditional formatting (Status=1 marked red), this table looks like this:
As you can see, "Shop A" is shown several times and with null values.
With the following custom BigQuery query in Data Studio I can obtain the last entry of each shop:
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.eu.dummy
group by 1
)
select * from tbl
This results in the following table, which shows all the needed information:
However, the users would like this table to be transposed and look like this:
On the right-hand side it is shown with the final textbox for labeling the rows.
Of course, it is possible to build a Scorecard for each entry, but with 10 shops and three fields for each, the limit of charts per page was reached.
Question
Is there a way to transpose a table and also do the conditional formatting?

The task is to return one column for each shop plus an id column to sort the results. A column must have a single data type: we cannot return a string in one row and an integer in another. Thus all integer values have to be formatted as strings in BigQuery.
To transpose, we build tlb_array. Grouping by shop generates one array per shop. Its first entry is the shop name (shop), its second entry is the info column, and its third entry is the sold_today column, an integer that we cast to a string. Each entry also carries an id as its entry number. Unnesting this array turns each entry into a row, and the next select statement groups these rows again by id. There we create a column for each shop, and the if condition only considers data for that shop. We end up with a table of three rows, with the row number in id and the needed data in the shop columns.
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.eu.dummy
group by 1
),
tlb_array as (
Select shop,X.* from tbl,
unnest([
struct(1 as id,shop as value),
struct(2,info),
struct(3,cast(sold_today as string))]) X
)
select id,
any_value(if(shop="shop A",value,null)) as ShopA,
any_value(if(shop="shop B",value,null)) as ShopB,
from tlb_array
group by 1
Thus we return only text via strings. But we want to apply conditional formatting without adding further columns. The trick is to include special characters in the returned string. ´ or ' would be possible, but visible characters would distract the user, so space characters are a better choice. Unicode has several space characters of different widths, so a number can be encoded as a sequence of space characters. You have to host the following UDF yourself. It encodes each decimal digit of a number as a different Unicode space character.
CREATE OR REPLACE FUNCTION `dsadds.us.number_to_space`(x INT64) AS (
(
SELECT
CONCAT(" ",
string_agg(SUBSTRING(
CODE_POINTS_TO_STRING([0x2007, 0x2002, 0x2004, 0x2005, 0x2006, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F]),
y-47,1),"")
,"- ")
FROM
UNNEST(TO_CODE_POINTS(CAST(x AS string))) y )
);
Then you can use this function in your Custom BigQuery in Data Studio:
with tbl as
(select shop,
array_agg(sold_today ignore nulls order by update_time desc limit 1)[safe_offset(0)] sold_today,
array_agg(info ignore nulls order by update_time desc limit 1)[safe_offset(0)] info,
array_agg(status ignore nulls order by update_time desc limit 1)[safe_offset(0)] status,
from dsadds.us.dummy
group by 1
),
tlb_array as (
Select shop,X.* from tbl,
unnest([
struct(1 as id,concat(shop,dsadds.us.number_to_space(status)) as value),
struct(2,concat(info,dsadds.us.number_to_space(status))),
struct(3,cast(sold_today as string))]) X
)
select id,
any_value(if(shop="shop A",value,null)) as ShopA,
any_value(if(shop="shop B",value,null)) as ShopB,
from tlb_array
group by 1
This results in the needed table. The (hidden) space characters have to be copied from the table (possible only in view mode, not in edit mode, of Data Studio) and conditional formatting rules added (text contains: ). Please also add a textbox over the first column to hide it and enter the labels for each row.
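The digit-to-space encoding performed by number_to_space can be sketched in Python to make the mapping easier to inspect. This is a hedged illustration, not a port of the UDF: it reuses the ten code points from the CODE_POINTS_TO_STRING call, assumes non-negative integers, leaves out the literal prefix and suffix that the UDF concatenates, and adds a hypothetical space_to_number helper that is not part of the original answer.

```python
# The ten code points from the UDF; the list index is the decimal digit
# (SUBSTRING(..., y - 47, 1) maps '0' (code point 48) to position 1, and so on).
SPACE_CODE_POINTS = [
    0x2007, 0x2002, 0x2004, 0x2005, 0x2006,
    0x2008, 0x2009, 0x200A, 0x202F, 0x205F,
]


def number_to_space(x: int) -> str:
    """Encode each decimal digit of x as a distinct Unicode space character."""
    if x < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return "".join(chr(SPACE_CODE_POINTS[int(digit)]) for digit in str(x))


def space_to_number(spaces: str) -> int:
    """Hypothetical inverse, useful for checking what a cell encodes."""
    return int("".join(str(SPACE_CODE_POINTS.index(ord(ch))) for ch in spaces))


encoded = number_to_space(1)
print([hex(ord(ch)) for ch in encoded])
```

Printing the code points, as in the last line, is a convenient way to see which invisible characters a conditional formatting rule needs to match.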

How to get 0 if no row found from sql query in sql server

I am getting a blank value from this query in SQL Server:
SELECT TOP 1 Amount from PaymentDetails WHERE Id = '5678'
There is no matching row, which is why it returns blank. If there is no row, I want it to return 0.
I already tried COALESCE, but it's not working.
How can I solve this?

You are selecting an arbitrary amount, so one method is aggregation:
SELECT COALESCE(MAX(Amount), 0)
FROM PaymentDetails
WHERE Id = '5678';
Note that if id is a number, then don't use single quotes for the comparison.
To be honest, I would expect SUM() to be more useful than an arbitrary value:
SELECT COALESCE(SUM(Amount), 0)
FROM PaymentDetails
WHERE Id = '5678';
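The reason aggregation helps is that an aggregate query without GROUP BY always returns exactly one row, even when no rows match, so COALESCE has a NULL to replace. The following sketch checks this with Python's built-in sqlite3 module rather than SQL Server; the table contents are made up for illustration, and the behavior shown is standard SQL that SQL Server shares.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PaymentDetails (Id TEXT, Amount INTEGER)")
conn.execute("INSERT INTO PaymentDetails VALUES ('1234', 50)")  # hypothetical row

# Plain SELECT: nothing matches, so the result set is empty and
# COALESCE on the column would never be evaluated.
plain = conn.execute(
    "SELECT Amount FROM PaymentDetails WHERE Id = '5678'"
).fetchall()

# Aggregate without GROUP BY: exactly one row comes back, MAX over zero
# rows is NULL, and COALESCE turns that NULL into 0.
fallback = conn.execute(
    "SELECT COALESCE(MAX(Amount), 0) FROM PaymentDetails WHERE Id = '5678'"
).fetchone()[0]

print(plain, fallback)
```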

You can wrap the subquery in an ISNULL:
SELECT ISNULL((SELECT TOP 1 Amount from PaymentDetails WHERE Id = '5678' ORDER BY ????),0) AS Amount;
Don't forget to add a column (or columns) to your ORDER BY as otherwise you will get inconsistent results when more than one row has the same value for Id. If Id is unique, however, then remove both the TOP and ORDER BY as they aren't needed.
You should never, however, use TOP without an ORDER BY unless you are "happy" with inconsistent results.

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start

I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
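As a quick check of the windowed min() approach, the sketch below runs the same query shape against the sample data from the question using Python's built-in sqlite3 module. It assumes an SQLite build with window function support (3.25 or later) instead of PostgreSQL, spells the frame out as ROWS BETWEEN 3 PRECEDING AND CURRENT ROW, and adds an outer ORDER BY only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

# best = lowest place over the current row and the three rows before it.
rows = conn.execute(
    """
    SELECT day, athlete, place
    FROM (
        SELECT *, min(place) OVER (PARTITION BY athlete
                                   ORDER BY day
                                   ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS best
        FROM t
    ) sub
    WHERE best > 1
    ORDER BY day
    """
).fetchall()

for row in rows:
    print(row)
```

On this data the query returns the rows for 2013-06-19 and 2013-06-20, the same two rows the question expects.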

Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.

; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE table (a integer, b integer, c integer, d integer);
INSERT INTO table (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many selected entries, grouped in threes like this.
I'm running postgresql 8.3

This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM (
SELECT rn, 0 AS rk, (x[rn]).*
FROM (
SELECT x, generate_series(1, array_upper(x, 1)) AS rn
FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
) y
UNION ALL
SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...

In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
a, b, c, d
From
table
Union Select Top 1
'', '', '', ''
From
table
Union Select Top 3 Skip 3
a, b, c, d
From
table
Please, don't actually try to do that.
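For comparison, if the display or generating code can be changed after all, the grouping takes only a few lines on the client side. The sketch below is one possible way to do it in Python; the function name with_blank_rows and its parameters are hypothetical, and it assumes blank rows should appear only between groups, never after the last one.

```python
def with_blank_rows(rows, group_size=3, blank=None):
    """Return rows with a blank marker inserted between groups of group_size."""
    out = []
    for index, row in enumerate(rows):
        # Before starting a new group (but not before the very first row),
        # emit the separator.
        if index and index % group_size == 0:
            out.append(blank)
        out.append(row)
    return out


data = [
    (1, 2, 3, 4),
    (5, 6, 7, 8),
    (9, 10, 11, 12),
    (13, 14, 15, 16),
    (17, 18, 19, 20),
    (21, 22, 23, 24),
    (25, 26, 27, 28),
]

for row in with_blank_rows(data, blank=("", "", "", "")):
    print(",".join(str(value) for value in row))
```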

You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (SELECT CAST(CAST(:interval as REAL) /
(:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER),
dataColumn
FROM dataTable),
blankIncluder(rowNum, dataColumn) as (SELECT rowNum, dataColumn
FROM dataList
UNION ALL
SELECT rowNum - 1, :blankDataColumn
FROM dataList
WHERE MOD(rowNum - 1, :interval) = 0
AND rowNum > :interval)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.

How to do this query in T-SQL

I have table with 3 columns A B C.
I want to select * from this table, but ordered by a specific ordering of column A.
In other words, let's say column A contains "stack", "over", "flow".
I want to select * from this table, and order by column A in this specific ordering: "stack", "flow", "over" - which is neither ascending nor descending.
Is it possible?

You can use a CASE statement in the ORDER BY clause. For example ...
SELECT *
FROM Table
ORDER BY
CASE A
WHEN 'stack' THEN 1
WHEN 'flow' THEN 2
WHEN 'over' THEN 3
ELSE NULL
END
Check out Defining a Custom Sort Order for more details.

A couple of solutions:
Create another table with your sort order, join on Column A to the new table (which would be something like TERM as STRING, SORTORDER as INT). Because things always change, this avoids hard coding anything and is the solution I would recommend for real world use.
If you don't want the flexibility of adding new terms and orders, just use a CASE statement to transform each term into an number:
CASE A WHEN 'stack' THEN 1 WHEN 'flow' THEN 2 WHEN 'over' THEN 3 END
and use it in your ORDER BY.

If you have a lot of elements with custom ordering, you could add those elements to a table and give them a value. Join with the table and each column can have a custom order value.
select
main.a,
main.b,
main.c
from dbo.tblMain main
left join tblOrder rank on rank.a = main.a
order by rank.OrderValue
If you have only 3 elements as suggested in your question, you could use a case in the order by...
select
*
from dbo.tblMain
order by case
when a='stack' then 1
when a='flow' then 2
when a='over' then 3
else 4
end
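The CASE-based ordering is easy to try out in isolation. The sketch below uses Python's built-in sqlite3 module instead of SQL Server, with a made-up tblMain table; the row with the value "other" is an assumption added to show where the ELSE branch places unlisted values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblMain (a TEXT, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO tblMain VALUES (?, ?, ?)",
    [("over", 1, 1), ("other", 2, 2), ("flow", 3, 3), ("stack", 4, 4)],
)

# Map each value to its desired position; anything unlisted sorts last.
ordered = [
    row[0]
    for row in conn.execute(
        """
        SELECT a
        FROM tblMain
        ORDER BY CASE a
                     WHEN 'stack' THEN 1
                     WHEN 'flow' THEN 2
                     WHEN 'over' THEN 3
                     ELSE 4
                 END
        """
    )
]

print(ordered)
```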
