I have read that anything you might do with a SQL window function can be implemented with creative use of joins and so on, but I cannot figure out how. I'm using SQLite in this project, which doesn't currently have window functions.
I have a table with four columns:
CREATE TABLE foo (
    id  INTEGER PRIMARY KEY,
    x   REAL NOT NULL,
    y   REAL NOT NULL,
    val REAL NOT NULL,
    UNIQUE (x, y));
and a convenience function DIST(x1, y1, x2, y2) that returns the distance between two points.
What I want: for every row in that table, the entire row from that same table that is within a certain distance (e.g. 25 km) and has the lowest "val". For rows with the same "val", I want to use the lowest distance as a tie-breaker.
My current solution is running n+1 queries, which works but is yucky:
SELECT * FROM foo;
... then, for each row returned, I run [where "src" is the row I just got]:
SELECT * FROM foo
WHERE DIST(foo.x, foo.y, src.x, src.y)<25
ORDER BY val ASC, DIST(foo.x, foo.y, src.x, src.y) ASC
LIMIT 1
But I really want it in a single query, partially for my own interest, and partially because it makes it much easier to work with some other tools I have.
Use a correlated subquery to get the ID of the wanted row for each source row, then use that to join back to the table:

SELECT src.*, foo2.*
FROM (SELECT foo.*,
             (SELECT f2.id
              FROM foo AS f2
              WHERE DIST(foo.x, foo.y, f2.x, f2.y) < 25
              ORDER BY f2.val ASC, DIST(foo.x, foo.y, f2.x, f2.y) ASC
              LIMIT 1) AS id2
      FROM foo) AS src
JOIN foo AS foo2 ON foo2.id = src.id2;

The alias f2 is what makes the correlation work: inside the subquery, foo still refers to the current outer row, so the subquery can compare every candidate row against it. Note that the plain JOIN silently drops source rows that have no neighbour within 25 km (their id2 is NULL); switch to a LEFT JOIN if you want to keep them.
I have a table with a field A where each entry is a fixed-length array of integers (say length = 1000). I want to convert it into 1000 columns, named index_i for i = 0, 1, 2, ..., 999, where each element is the corresponding integer. I can get it done with something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999
I want to know what would be an elegant way of doing this. Thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) takes those arrays of values and turns each array into a set of (five) rows, every one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the EXECUTE IMMEDIATE statement, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
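For example, a quick sanity check of that operator (the alias is just for illustration):

SELECT 'col_' || CAST(3 AS STRING) AS name  -- returns 'col_3'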
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block isn't executed until after the query string has been built, so at build time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it have access to the values.
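A minimal sketch of that prerequisite view, assuming a dataset d and that the toy data has been stored as a table d.raw with columns name and a:

CREATE VIEW d.long_format AS
SELECT r.name, vals, offset
FROM d.raw AS r, UNNEST(r.a) AS vals WITH OFFSET;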
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider the approach below:
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
Applied to dummy data like below (with 10 elements instead of 1000):
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

index_0 index_1 index_2 index_3 index_4 index_5 index_6 index_7 index_8 index_9
--------------------------------------------------------------------------------
10      11      12      13      14      15      16      17      18      19
20      21      22      23      24      25      26      27      28      29
30      31      32      33      34      35      36      37      38      39
How can I calculate the average of a person's (in this case, a player's) x and y position while creating a new table and adding said average to a new column?
CREATE TABLE PlayerStatistics AS SELECT
PLAY_Name
FROM
player;
ALTER TABLE
PlayerStatistics ADD AveragePosition DECIMAL(6, 5)
SELECT
AVG(
Player1(T1) - X,
Player1(T1) - Y
))
FROM
tracksdataview
The end result of the code is a new table with one column of the player's name/id and another column that has an average value of both the x and y positions in each row.
Depending on your DBMS, you may be able to combine your calculation and CREATE TABLE statement.
CREATE TABLE PlayerStatistics AS
SELECT
PLAY_Name,
CAST((Player_X + Player_Y) / 2 AS DECIMAL(6,5)) AS AveragePosition
FROM player p
LEFT JOIN tracksdataview tdv ON p.play_name = tdv.play_name -- Get track data (if any)
;
You may need to CAST the x, y values as FLOAT before doing the division. Give it a try and let me know.
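For instance, a minimal sketch of that cast, assuming the hypothetical column names Player_X and Player_Y from the query above:

CREATE TABLE PlayerStatistics AS
SELECT
    p.PLAY_Name,
    CAST((CAST(tdv.Player_X AS FLOAT) + CAST(tdv.Player_Y AS FLOAT)) / 2
         AS DECIMAL(6,5)) AS AveragePosition
FROM player p
LEFT JOIN tracksdataview tdv ON p.play_name = tdv.play_name;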
I suspect that a new table might not be the best solution to your problem. Consider the case where a position X or Y changes over time: that change would not be reflected in your "derived" attributes in the separate table.
My suggestion would be to generate a view that will always "look" at the original table:
CREATE VIEW PlayerStatistics AS
SELECT t.*, ta.ax - t.X AS devX, ta.ay - t.Y AS devY
FROM tracksdataview t
INNER JOIN (SELECT playerId, AVG(X) AS ax, AVG(Y) AS ay
            FROM tracksdataview
            GROUP BY playerId) ta
        ON ta.playerId = t.playerId
As I was uncertain about the type of "average" you want, I calculated an average over all positions of a particular player and then created two columns showing the player's x and y deviations from their average position.
(I also made the assumption that an ID column (playerId) exists ...)
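If what you actually want is one row per player holding the averages themselves, a minimal sketch under the same playerId assumption:

CREATE VIEW PlayerAverages AS
SELECT playerId, AVG(X) AS avgX, AVG(Y) AS avgY
FROM tracksdataview
GROUP BY playerId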
I have a very small database which includes 6 points, with the columns id, the_geom, and descr. My aim is to write a PL/pgSQL function which finds the pair of points whose distance from each other is maximal. As output, I would like to show the id or descr of the two points, and also the distance between them.
I have tried to write a function with RETURNS TABLE, but would SETOF text be a better solution?
You may try something like a cross join to find all combinations, then order by the difference. If your table name were foo, something similar to:
SELECT set1.id, set2.id, abs(set1.the_geom - set2.the_geom) -- may want to use the earthdistance extension here
FROM foo set1, foo set2
WHERE set1.id != set2.id
ORDER BY 3 DESC;
And if you need earth distance to calculate the distance itself - http://www.postgresql.org/docs/9.3/static/earthdistance.html
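For example, a sketch using earthdistance, assuming the_geom is of type point and stores (longitude, latitude); the <@> operator returns the distance between two points in statute miles:

SELECT set1.id, set1.descr, set2.id, set2.descr,
       (set1.the_geom <@> set2.the_geom) AS distance_miles
FROM foo set1
JOIN foo set2 ON set1.id < set2.id  -- each unordered pair once
ORDER BY distance_miles DESC
LIMIT 1;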
In Oracle, is it possible to have a sub-query within a select statement that returns a column if exactly one row is returned by the sub-query and null if none or more than one row is returned by the sub-query?
Example:
SELECT X,
Y,
Z,
(SELECT W FROM TABLE2 WHERE X = TABLE1.X) /* but return null if 0 or more than 1 rows is returned */
FROM TABLE1;
Thanks!
How about going about it in a different way? A simple LEFT OUTER JOIN with a subquery should do what you want:
SELECT T1.X
      ,T1.Y
      ,T1.Z
      ,T2.W
FROM TABLE1 T1
LEFT OUTER JOIN (
    SELECT X
          ,MIN(W) AS W
    FROM TABLE2
    GROUP BY X
    HAVING COUNT(*) = 1
    ) T2 ON T2.X = T1.X;
This will only return items that have exactly 1 instance of X, and LEFT OUTER JOIN it back to the table when appropriate (leaving the non-matches NULL).
This is also ANSI-compliant join syntax, which the optimizer can generally handle efficiently.
Besides a CASE solution or rewriting the inline subquery as an outer join, this will work, if you can apply an aggregate function (MIN or MAX) on the W column:
SELECT X,
Y,
Z,
(SELECT MIN(W) FROM TABLE2 WHERE X = TABLE1.X HAVING COUNT(*) = 1) AS W
FROM TABLE1;
SELECT
    X, Y, Z, (SELECT MAX(W) FROM TABLE2 WHERE X = TABLE1.X HAVING COUNT(*) = 1)
FROM
    TABLE1;
My answer is: don't use subselects (unless you are sure ...).
There is no need, and it is not a good idea, to use a subselect here, as PlantTheIdea mentioned, for two reasons.
Explanation:
A subselect means one SELECT for each row of the primary select's result set, i.e. if you get 1000 rows, you also issue 1000 (small) SELECT statements against your DB system (ignoring the optimizer here).
And(!)
With a subselect you have a good chance of hiding (or overriding) a serious database or query problem. You are only expecting zero (NULL) or exactly one row (both easily resolvable with a [LEFT OUTER] JOIN); if the subselect returns more than one row, something is wrong, and the SQL error points that out.
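For illustration, this is the failure mode being referred to, sketched against the tables from the question (assuming TABLE2 has two rows for some X):

SELECT X, Y, Z,
       (SELECT W FROM TABLE2 WHERE TABLE2.X = TABLE1.X) AS W
FROM TABLE1;
-- ORA-01427: single-row subquery returns more than one row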
the "HAVING COUNT(X) = 1" of course correct, has the small (or not small) problem, thats: "why is there a count of more than one row?"
I spent hours of lifetime finding a workarround like this, just ending up in "dont do it if you are realy sure ..."
I see this differently from a "having" like
...
HAVING date = MAX(date) -- depends on SQL dialect
or
WHERE date = (SELECT MAX(date) FROM same_table)
And with my last example I again want to point out: if you get more than one row here (e.g. both from today), you have a DB problem; you should use a timestamp instead, for example.
I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE tbl (a integer, b integer, c integer, d integer);
INSERT INTO tbl (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many selected entries, grouped in threes like this.
I'm running postgresql 8.3
This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM (
   SELECT rn, 0 AS rk, (x[rn]).*
   FROM (
      SELECT x, generate_series(1, array_upper(x, 1)) AS rn
      FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
      ) y
   UNION ALL
   SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
   ORDER BY rn, rk
   ) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query (see the sketch after these notes).
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first SELECT and substitute '' for NULL in the second.
Query will be slow with very big tables.
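A sketch of that substitution for just columns a and b (types per the CREATE TABLE above):

SELECT a, b
FROM (
   SELECT rn, 0 AS rk, (x[rn]).a, (x[rn]).b
   FROM (
      SELECT x, generate_series(1, array_upper(x, 1)) AS rn
      FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
      ) y
   UNION ALL
   SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, NULL::integer, NULL::integer
   ORDER BY rn, rk
   ) z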
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
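For reference, a minimal sketch of that plpgsql idea, assuming the tbl above and that NULL rows are acceptable blanks:

CREATE OR REPLACE FUNCTION tbl_with_blanks()
  RETURNS SETOF tbl AS
$$
DECLARE
   r tbl;
   i int := 0;
BEGIN
   FOR r IN SELECT * FROM tbl LOOP
      i := i + 1;
      RETURN NEXT r;               -- emit the data row
      IF i % 3 = 0 THEN
         RETURN NEXT (NULL::tbl);  -- emit a "blank" separator row
      END IF;
   END LOOP;
   RETURN;
END
$$ LANGUAGE plpgsql;

Call it with SELECT * FROM tbl_with_blanks();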
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
(SELECT a, b, c, d
 FROM tbl
 ORDER BY a
 LIMIT 3)
UNION ALL
SELECT NULL, NULL, NULL, NULL
UNION ALL
(SELECT a, b, c, d
 FROM tbl
 ORDER BY a
 LIMIT 3 OFFSET 3)
Please, don't actually try to do that.
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) AS
     (SELECT CAST(CAST(:interval AS REAL) / (:interval - 1)
                  * ROW_NUMBER() OVER(ORDER BY dataColumn) AS INTEGER),
             dataColumn
      FROM dataTable),
     blankIncluder (rowNum, dataColumn) AS
     (SELECT rowNum, dataColumn
      FROM dataList
      UNION ALL
      SELECT rowNum - 1, :blankDataColumn
      FROM dataList
      WHERE MOD(rowNum - 1, :interval) = 0
        AND rowNum > :interval)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.