Possible to do subselect in Google Sheets - sql

I have the following data in the movies data range:
I would like to do a subselect to get the highest-grossing movie for that director. I can do it by adding a new column like this:
Note that I've used the hacky 'nested-query' notation to remove the header row and just return a single scalar value:
=QUERY(QUERY(movies, "SELECT MAX(C) WHERE A='"&A2&"' GROUP BY A", 0), "SELECT * OFFSET 1", 0)
However, I was wondering whether I could run a single query on the director|movie|boxoffice columns with a subselect inside the query statement. I suppose it would look something like:
=QUERY(movies, "SELECT A, B, C, (SELECT MAX(C) WHERE A='"&A2&"' GROUP BY A)", 0)
I believe the answer to this is a straight 'no', but I was curious if there's any sort of sub-query composability within the google sheets query language, or if I just need to sort of figure out workarounds here?
https://developers.google.com/chart/interactive/docs/querylanguage

try:
=INDEX(IFNA(VLOOKUP(A2:A, SORT(A2:C, 3, ), 3, )))
or whole:
=INDEX({A1:C, {"highestgrossing"; IFNA(VLOOKUP(A2:A, SORT(A2:C, 3, ), 3, ))}})
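As far as I can tell, the formula works because SORT(A2:C, 3, ) orders the rows by the third column descending (the empty argument is read as FALSE), so the exact-match VLOOKUP hits the largest value for each director first. A minimal Python sketch of the same logic, using made-up rows and a hypothetical helper name:

```python
def highest_grossing(rows):
    # Counterpart of SORT(A2:C, 3, ): order by box office, largest first.
    by_gross_desc = sorted(rows, key=lambda r: r[2], reverse=True)
    # Counterpart of the exact-match VLOOKUP: the first hit per director wins.
    best = {}
    for director, _movie, gross in by_gross_desc:
        best.setdefault(director, gross)
    # Append the looked-up value to every original row, keeping row order.
    return [(d, m, g, best[d]) for d, m, g in rows]


# Hypothetical sample data: (director, movie, boxoffice).
movies = [
    ("Director A", "Movie 1", 100),
    ("Director A", "Movie 2", 250),
    ("Director B", "Movie 3", 75),
]
result = highest_grossing(movies)
```

Each row keeps its original position and gains a fourth value holding the maximum for its director, which is what the "whole" variant of the formula returns.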


Query function Google Sheet - Aggregation + (unwanted) Sorting

I am trying to do what started as a simple task but turned out to be more complicated.
I must run a local sum of a column over different elements of another column with a query function.
The issue arises because the query performs an unwanted sorting of the grouped column (it is in the format of working weeks - strings) and I cannot get it to unsort or re-sort in the original format.
Initial query is:
=query(A1:B350,"select A, sum(B) group by A")
Subsequently I tried with:
=query(A1:B350,"select A, sum(B) where A matches '"&join("|", query(G2:G, "select G where G is not null"))& "' group by A")
but the unwanted sorting remains.
Any idea how to keep the original order, or how to prevent it from changing?
Thank you in advance
To sort correctly, you need to align single digits. You can do this either in the source data or using a formula:
=QUERY({INDEX(REGEXREPLACE(A:A,"-(\d)$","-0$1")),B:B},"SELECT Col1, SUM(Col2) GROUP BY Col1")
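To see why the padding matters, here is a small Python sketch of the same regex replacement. The week labels are an assumption about the format (year, dash, week number); the real data may look different.

```python
import re


def pad_week(label):
    # Same idea as the REGEXREPLACE above: a single trailing digit after
    # the dash gets a leading zero, so "2021-9" becomes "2021-09".
    return re.sub(r"-(\d)$", r"-0\1", label)


# Hypothetical labels in an assumed "year-week" format.
weeks = ["2021-10", "2021-9", "2021-2"]

raw_sorted = sorted(weeks)                           # plain string sort
padded_sorted = sorted(pad_week(w) for w in weeks)   # sort after padding
```

A plain string sort puts "2021-10" before "2021-2", because the comparison is character by character. After padding, string order and week order agree.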
try:
=INDEX(IFNA(VLOOKUP(G2:G,
QUERY(A1:B350, "select A,sum(B) group by A label sum(B)''"), {1, 2}, 0)))
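The idea here is to aggregate first and then look the totals up in whatever order column G lists the keys, so the sort order of the grouped query stops mattering. A rough Python sketch of that two-step logic, with invented data and a hypothetical function name:

```python
def sums_in_listed_order(rows, wanted_order):
    # Like the inner QUERY with "group by A": total the values per key.
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    # Like the VLOOKUP wrapped in IFNA: one output row per wanted key, in
    # the wanted order; keys without data produce an empty row.
    return [(key, totals[key]) if key in totals else (None, None)
            for key in wanted_order]


# Hypothetical data: (week label, amount), plus the order to report in.
rows = [("2021-10", 5), ("2021-9", 1), ("2021-10", 2), ("2021-2", 4)]
wanted = ["2021-9", "2021-10", "2021-2", "2021-3"]
report = sums_in_listed_order(rows, wanted)
```

The output follows the order of the wanted list, not the order the grouping step happens to produce.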

SQL Distinct keyword

I would be thankful if someone could help me categorize the word DISTINCT. I am learning SQL and understand what it does, but is it a function, an attribute, or a keyword like SELECT, FROM and WHERE? I guess it is a keyword, in which case what does it mean to write two keywords together (i.e. SELECT DISTINCT <tuple of attributes> FROM <relation>)?
It is a keyword and could be used in different contexts:
Select distinct field1, field2, field3
from myTable;
In this context the returned data has only one row for each distinct combination of field1, field2 and field3 values. For example, this data:
field1, field2, field3
1, 2, 3
1, 2, 3
1, 1, 1
1, 2, 1
with distinct would return:
1, 2, 3
1, 1, 1
1, 2, 1
In other words, it behaves like a GROUP BY on all fields included in the SELECT list.
It is also used with aggregations like this:
Select count(distinct productId)
from OrderDetails;
This counts each productId only once within the group (the example does not add any explicit grouping). The query above would answer a question like: how many of our products have had any sale so far?
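Both uses can be tried quickly with Python's built-in sqlite3 module. This is only a sketch against SQLite with the sample rows from above; the COUNT(DISTINCT ...) here runs on field2 instead of a productId column, purely to reuse the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (field1 INTEGER, field2 INTEGER, field3 INTEGER)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, 2, 3), (1, 2, 3), (1, 1, 1), (1, 2, 1)],
)

# SELECT DISTINCT: one row per distinct combination of the selected fields.
distinct_rows = conn.execute(
    "SELECT DISTINCT field1, field2, field3 FROM myTable"
).fetchall()

# Grouping by every selected field yields the same set of rows.
grouped_rows = conn.execute(
    "SELECT field1, field2, field3 FROM myTable "
    "GROUP BY field1, field2, field3"
).fetchall()

# COUNT(DISTINCT ...): each value is counted once (field2 holds 2, 2, 1, 2).
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT field2) FROM myTable"
).fetchone()[0]
```

Row order is not guaranteed without an ORDER BY, so compare the two result sets as sets or sort them first.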
I have done a bit of searching and found it is called :
Conditional expressions : DISTINCT predicate.
Not sure if it is what you are looking for.
It is a keyword, used primarily in two contexts:
SELECT DISTINCT
COUNT(DISTINCT . . .)
In some databases, it is also allowed with set operators, as in UNION DISTINCT. This makes explicit the default behavior of the set operator (which removes duplicates).
Conceptually, it modifies the action to work only on distinct values.
It can be used with other aggregation functions, but that usage is rarely useful. Needing it usually points to a problem with the data, often a data modeling problem.
In the language it is a keyword, but note that at the core, those language keywords serve to denote invocations of operators ("relational" operators, though pseudo-relational is really more accurate) and operators are functions ...
So there is a bit of a case to say that it serves both purposes, and that the distinction you are asking about is actually rather irrelevant.
examples
SELECT ... : invocation of "bag" projection / SELECT DISTINCT ... : invocation of "relational" projection.
WHERE a IS NOT DISTINCT FROM b : the "more relational" equality operator (that yields true if both a and b are null) / WHERE a = b : the "less relational" equality operator (that yields unknown rather than true if both a and b are null, so the WHERE clause rejects the row).
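The difference between the two comparisons is easy to observe with Python's sqlite3 module. Note the assumption: this sketch uses SQLite's IS operator, which is SQLite's null-safe comparison (newer SQLite versions also accept the IS NOT DISTINCT FROM spelling, but the bundled version may not). The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [(None, None), (1, 1), (1, None)],
)

# "a = b" evaluates to NULL (unknown) when either side is NULL,
# so only the (1, 1) row passes the WHERE clause.
equals_count = conn.execute(
    "SELECT COUNT(*) FROM pairs WHERE a = b"
).fetchone()[0]

# SQLite's "a IS b" treats two NULLs as equal, like IS NOT DISTINCT FROM,
# so both (NULL, NULL) and (1, 1) pass.
null_safe_count = conn.execute(
    "SELECT COUNT(*) FROM pairs WHERE a IS b"
).fetchone()[0]

# The raw comparison result: Python sees SQL NULL as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
```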

PostgreSQL: How to access column on anonymous record

I have a problem that I'm working on. Below is a simplified query to show the problem:
WITH the_table AS (
SELECT a, b
FROM (VALUES('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data AS (
SELECT 'data7' AS c, array_agg(ROW(a, b)) AS d
FROM the_table
)
SELECT c, d[array_upper(d, 1)]
FROM my_data
In the my_data section, you'll notice that I'm creating an array from multiple rows, and the array is returned in one row with other data. This array needs to contain the information for both a and b, and keep the two values linked together. What would seem to make sense would be to use an anonymous row or record (I want to avoid actually creating a composite type).
This all works well until I need to start pulling data back out. In the above instance, I need to access the last entry in the array, which is done easily by using array_upper, but then I need to access the value in what used to be the b column, which I cannot figure out how to do.
Essentially, right now the above query is returning:
"data7";"(data5,6)"
And I need to return
"data7";6
How can I do this?
NOTE: While in the above example I'm using text and integers as the types for my data, they are not the actual final types, but are rather used to simplify the example.
NOTE: This is using PostgreSQL 9.2
EDIT: For clarification, something like SELECT 'data7', 6 is not what I'm after. Imagine that the_table is actually pulling from database tables and not the WITH statement that I put in for convenience, and I don't readily know what data is in the table.
In other words, I want to be able to do something like this:
SELECT c, (d[array_upper(d, 1)]).b
FROM my_data
And get this back:
"data7";6
Essentially, once I've put something into an anonymous record by using the row() function, how do I get it back out? How do I split up the 'data5' part and the 6 part so that they don't both return in one column?
For another example:
SELECT ROW('data5', 6)
makes 'data5' and 6 return in one column. How do I take that one column and break it back into the original two?
I hope that clarifies
If you can install the hstore extension:
with the_table as (
select a, b
from (values('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data as (
select 'data7' as c, array_agg(row(a, b)) as d
from the_table
)
select c, (avals(hstore(d[array_upper(d, 1)])))[2]
from my_data
;
c | avals
-------+-------
data7 | 6
This is just a very quick throw together around a similarish problem - not an answer to your question. This appears to be one direction towards identifying columns.
with x as (select 1 a, 2 b union all values (1,2),(1,2),(1,2))
select a from x;

get first or last item in an aggregate when doing GROUP BY

I came across the following old discussion on Google Groups about the capability of selecting the first/last value in an aggregate:
https://groups.google.com/forum/?fromgroups=#!msg/bigquery-discuss/1WAJw1UC73w/_RbUCsMIvQ4J
I was wondering if the answer given is still up-to-date. More specifically, is it possible, without doing JOIN or using nested records to do something like:
SELECT foo, LAST(bar) last_bar FROM table GROUP BY foo HAVING last_bar = b
that for the following table:
foo, bar
1, a
1, b
2, b
2, c
3, b
would return:
foo, last_bar
1, b
3, b
If it is not possible, I was thinking about doing the same with a combination of
GROUP_CONCAT and REGEXP_MATCH on the end of the concatenation:
SELECT foo, GROUP_CONCAT(bar) concat_bar from table GROUP BY foo HAVING REGEXP_MATCH(concat_bar, "b$")
but that only works if aggregation is done in the order of the rows. Is it the case?
I like to use array aggregation to get first/last values:
SELECT foo, ARRAY_AGG(bar)[OFFSET(0)] AS bar FROM test GROUP BY foo;
You can also add LIMIT to aggregation: ARRAY_AGG(bar LIMIT 1) to make it faster.
It lets you use ORDER BY if you want to sort it by a column or get the last value instead:
ARRAY_AGG(bar ORDER BY foo DESC)
Also you can filter out null values with ARRAY_AGG(bar IGNORE NULLS)
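For checking results by hand, the "last value per group" logic can be sketched in plain Python. The sketch assumes an explicit ordering column (called pos here, which is not in the original table), since "last" only has a meaning once an order is defined.

```python
def last_per_group(rows):
    # rows are (foo, bar, pos); pos is a hypothetical explicit ordering column.
    last = {}
    for foo, bar, _pos in sorted(rows, key=lambda r: r[2]):
        last[foo] = bar  # later positions overwrite earlier ones
    return last


# The table from the question, with an assumed position added to each row.
rows = [(1, "a", 1), (1, "b", 2), (2, "b", 3), (2, "c", 4), (3, "b", 5)]

last = last_per_group(rows)
# Counterpart of HAVING last_bar = b.
matching = sorted(foo for foo, bar in last.items() if bar == "b")
```

With this data the groups whose last value is b are foo = 1 and foo = 3, which is the result the question asks for.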
I was trying to solve a similar problem and came to the same conclusion using GROUP_CONCAT
Give this a try:
SELECT foo, REGEXP_REPLACE(group_concat(bar),".*,","") as last_bar
FROM [dataset.table]
GROUP BY foo
There is no guarantee to the ordering of records stored in BigQuery, so this would likely fail at some point. Will the "last entry" always be the largest? If so, perhaps the following is what you're looking for?
SELECT foo, MAX(bar) FROM test GROUP BY foo

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE tbl (a integer, b integer, c integer, d integer);
INSERT INTO tbl (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many selected rows, with every three rows grouped together like this.
I'm running postgresql 8.3
This should work in PostgreSQL 8.3:
SELECT a, b, c, d
FROM (
SELECT rn, 0 AS rk, (x[rn]).*
FROM (
SELECT x, generate_series(1, array_upper(x, 1)) AS rn
FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
) y
UNION ALL
SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
a, b, c, d
From
table
Union Select Top 1
'', '', '', ''
From
table
Union Select Top 3 Skip 3
a, b, c, d
From
table
Please, don't actually try to do that.
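If the consuming script can be changed after all, adding the blank rows there takes only a few lines. A minimal Python sketch, with a hypothetical function name and the rows from the question:

```python
def with_blank_rows(rows, group_size=3, blank=("", "", "", "")):
    # Insert a blank row before every group after the first, so blanks
    # only appear between groups, never at the start or the end.
    out = []
    for i, row in enumerate(rows):
        if i > 0 and i % group_size == 0:
            out.append(blank)
        out.append(row)
    return out


# Rows as they would come back from a plain SELECT a, b, c, d query.
rows = [
    (1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12),
    (13, 14, 15, 16), (17, 18, 19, 20), (21, 22, 23, 24),
    (25, 26, 27, 28),
]
display_rows = with_blank_rows(rows)
```

The query stays a plain SELECT, and the grouping size becomes a display setting instead of something baked into the SQL.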
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (SELECT CAST(CAST(:interval as REAL) /
(:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER),
dataColumn
FROM dataTable),
blankIncluder(rowNum, dataColumn) as (SELECT rowNum, dataColumn
FROM dataList
UNION ALL
SELECT rowNum - 1, :blankDataColumn
FROM dataList
WHERE MOD(rowNum - 1, :interval) = 0
AND rowNum > :interval)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.