BigQuery: running last value and table join - google-bigquery

Table_1 is my Sales table:
Time | item | ...
-----------------
1 | X | ...
1 | Y | ...
2 | X | ...
4 | X | ...
6 | X | ...
6 | Y | ...
Table_2 is my Cost table:
Time | item | Cost
-----------------
1 | X | a
1 | Y | b
3 | X | c
4 | X | d
4 | Y | e
5 | X | f
What I'm trying to achieve is:
For each row in Table_1, get the latest Cost value from Table_2, i.e. the Cost row with the greatest Time that is at most the Table_1 row's Time.
The result should look like this:
Time | item | ... | Cost
------------------------
1 | X | ... | a
1 | Y | ... | b
2 | X | ... | a
4 | X | ... | d
6 | X | ... | f
6 | Y | ... | e
(I know it's straightforward with traditional SQL, using a correlated subquery in the SELECT list or a non-equi join, but BigQuery doesn't allow it.)
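For reference, a sketch of the "traditional SQL" approach the question alludes to, as a correlated subquery in the SELECT list (assuming the table and column names shown above; as the question notes, BigQuery does not accept this):
SELECT s.Time, s.item,
       (SELECT c.Cost
        FROM Table_2 c
        WHERE c.item = s.item AND c.Time <= s.Time
        ORDER BY c.Time DESC
        LIMIT 1) AS Cost
FROM Table_1 s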

Try below:
SELECT sales.time AS time, sales.item AS item, cost
FROM (
  SELECT sales.item, sales.time, cost.cost AS cost,
    cost.time - sales.time AS delta,
    ROW_NUMBER() OVER(PARTITION BY sales.item, sales.time ORDER BY delta DESC) AS win
  FROM Table_1 AS sales
  LEFT JOIN Table_2 AS cost
    ON sales.item = cost.item
  WHERE cost.time - sales.time <= 0
)
WHERE win = 1
ORDER BY 1, 2
This should give you exactly the result you expect:
time | item | cost
------------------
   1 | x    | a
   1 | y    | b
   2 | x    | a
   4 | x    | d
   6 | x    | f
   6 | y    | e
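In BigQuery Standard SQL, which (unlike legacy SQL) accepts inequality conditions in an INNER JOIN, a sketch of the same idea could look like this (same assumed table and column names):
SELECT time, item, cost
FROM (
  SELECT
    s.time, s.item, c.cost,
    ROW_NUMBER() OVER(PARTITION BY s.item, s.time ORDER BY c.time DESC) AS rn
  FROM Table_1 AS s
  JOIN Table_2 AS c
    ON s.item = c.item AND c.time <= s.time
)
WHERE rn = 1
ORDER BY time, item
For each sales row, the join keeps only cost rows at or before the sale's time, and ROW_NUMBER() picks the latest of those.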

Related

Remove CROSS JOIN LATERAL from postgres query that spans many to many

I have the following three tables (many to many):
Location
+====+==============+===+===+=============+
| id | coord_system | x | y | last_update |
+====+==============+===+===+=============+
| | | | | |
+----+--------------+---+---+-------------+
Mapping
+=============+============+
| location_id | history_id |
+=============+============+
| | |
+-------------+------------+
History
+====+=======+======+
| id | speed | date |
+====+=======+======+
| | | |
+----+-------+------+
The location table represents physical x, y locations within a specific coordinate system. For each x, y location at least one row in the history table exists. Each row in the history table can point to multiple rows in the location table.
Important to note is that (coord_system, x, y) is indexed and unique. I don't think it makes a difference, but all ids and coord_system are UUIDs; in the code examples below I will use letters to make them easier to read. The location and history tables have additional columns, but they do not change the scope of the question. The last_update column on the location table should match the date column on the History table (I come back to this later in the post).
The goal is to fetch the most recent history row for a range of (coord_system, x, y). Currently this is done with a CROSS JOIN LATERAL, like so:
SELECT *
FROM location loc
CROSS JOIN LATERAL (
  SELECT *
  FROM history hist
  LEFT JOIN mapping map ON hist.id = map.history_id
  WHERE map.location_id = loc.id
  ORDER BY date DESC
  LIMIT 1
) AS records
WHERE loc.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
  AND (x >= 403047 AND x <= 404047)
  AND (y >= 16451337 AND y <= 16452337);
For this specific range of x, y and coord_system the query takes ~25 seconds to run and returns 182 351 rows.
I am not extremely experienced in SQL, but thought that the goal of this query could also be achieved using a regular join. If I do a join across the three tables, with the same x, y and coord_system "filters" it takes about 2 seconds and returns ~3 million rows. I tried to be clever and use the dates to prune down the result:
SELECT *
FROM history hist
RIGHT JOIN mapping map ON hist.id = map.history_id
RIGHT JOIN location loc ON loc.id = map.location_id
WHERE loc.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
  AND (x >= 403047 AND x <= 404047)
  AND (y >= 16451337 AND y <= 16452337)
  AND loc.last_update = hist.date
This got very close to the same result as the original query: 182 485 rows in ~3 seconds. Unfortunately, the result needs to be exactly the same. I am guessing I made a logical mistake in my query and came here hoping someone can point it out.
My question is: is there a clever way to make a join take only the rows that have the "newest" date from the history.date column? As expected, I am trying to make the query run as quickly as possible while maintaining the correct result set.
In the table below I show a toy example of the join and the results I would expect (marked in the "return_row" column).
+=============+==============+===+===+=============+============+============+=======+============+============+
| location.id | coord_system | x | y | location_id | history_id | history.id | speed | date | return_row |
+=============+==============+===+===+=============+============+============+=======+============+============+
| 0 | a | 1 | 1 | 0 | 0 | 0 | 3.0 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 0 | a | 1 | 1 | 0 | 1 | 1 | 3.1 | 2020/10/30 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 0 | a | 1 | 1 | 0 | 2 | 2 | 3.2 | 2020/10/29 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 1 | a | 1 | 2 | 1 | 3 | 3 | 3.1 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 1 | a | 1 | 2 | 1 | 4 | 4 | 3.0 | 2020/10/30 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 2 | a | 2 | 2 | 2 | 5 | 5 | 4 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 3 | b | 1 | 1 | 3 | 6 | 6 | 5 | 2020/10/1 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
Does it work better with DISTINCT ON?
SELECT DISTINCT ON (l.id) l.id, h.date, ... -- enumerate the columns here
FROM location l
LEFT JOIN mapping m ON m.location_id = l.id
LEFT JOIN history h ON h.id = m.history_id
WHERE l.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
  AND l.x BETWEEN 403047 AND 404047
  AND l.y BETWEEN 16451337 AND 16452337
ORDER BY l.id, h.date DESC
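Another option worth benchmarking is a plain join plus ROW_NUMBER(), which keeps exactly one (newest) history row per location without a lateral subquery. A sketch, reusing the table and column names from the question:
SELECT *
FROM (
  SELECT l.*, h.speed, h.date,
         ROW_NUMBER() OVER (PARTITION BY l.id ORDER BY h.date DESC) AS rn
  FROM location l
  JOIN mapping m ON m.location_id = l.id
  JOIN history h ON h.id = m.history_id
  WHERE l.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
    AND l.x BETWEEN 403047 AND 404047
    AND l.y BETWEEN 16451337 AND 16452337
) ranked
WHERE rn = 1;
Like DISTINCT ON, it ranks the joined rows in one pass rather than running a subquery per location.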

Lexicographical sorting of a Postgres table column based on values of another column

I have a table, say initial_freq, in a PostgreSQL database (version 10.4):
initial | freq
---------+------
r | 20
s | 20
a | 10
m | 10
p | 7
k | 6
d | 5
n | 3
g | 3
c | 3
v | 3
b | 3
h | 2
y | 2
j | 2
i | 1
The requirement is that whenever there is a tie in the freq column,
the corresponding values in the initial column must be sorted
alphabetically.
The required output looks like this:
initial | freq
---------+------
r | 20
s | 20
a | 10
m | 10
p | 7
k | 6
d | 5
b | 3
c | 3
g | 3
n | 3
v | 3
h | 2
j | 2
y | 2
i | 1
This is a part of a large problem, most of which I have solved except this one.
I realize that this might be a dynamic programming problem, and I can solve it in other programming languages.
I am a complete novice in the SQL world. Any help will be much
appreciated.
Use ORDER BY to order by freq DESC and then by initial.
SELECT *
FROM your_table
ORDER BY freq DESC, initial;
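A minimal reproduction with the data from the question, for anyone who wants to verify the ordering (column types assumed):
CREATE TABLE initial_freq (initial text, freq integer);
INSERT INTO initial_freq VALUES
  ('r',20),('s',20),('a',10),('m',10),('p',7),('k',6),('d',5),('n',3),
  ('g',3),('c',3),('v',3),('b',3),('h',2),('y',2),('j',2),('i',1);
SELECT * FROM initial_freq ORDER BY freq DESC, initial;
-- rows tied on freq (b, c, g, n, v at 3) come back alphabetically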

SQL delete row where one value matches a value in a DIFFERENT column

I want to delete a row if a particular x value matches a y value in any row in the same table.
Ex:
| x | y |
| 4 | 2 |
| 2 | 6 |
| 8 | 1 |
| 3 | 1 |
| 7 | 8 |
| 9 | 5 |
would become:
| x | y |
| 4 | 2 |
| 3 | 1 |
| 7 | 8 |
| 9 | 5 |
Use EXISTS:
DELETE FROM yourtable
WHERE EXISTS (SELECT 1 FROM yourtable b WHERE b.y = yourtable.x)
If your DB allows it (e.g. MySQL's multi-table DELETE syntax), a self-join may work:
DELETE xside
FROM foo AS xside
JOIN foo AS yside ON xside.x = yside.y
DELETE FROM tab WHERE x IN (SELECT y FROM tab)
An alternate version using EXISTS, which also copes with NULL values in the y column:
DELETE FROM tab t WHERE EXISTS (SELECT 1 FROM tab ta WHERE ta.y = t.x)
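In PostgreSQL specifically, the same self-reference can also be written with DELETE ... USING (a sketch, reusing the table name tab from the answers above):
DELETE FROM tab t
USING tab u
WHERE t.x = u.y;
Rows whose x value appears as some row's y value are joined at least once and therefore deleted.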

Window running function except current row

I have a theoretical question, so I'm not interested in alternative solutions. Sorry.
Q: Is it possible to get the window running function values for all previous rows, except current?
For example:
with
  t(i,x,y) as (
    values
      (1,1,1),(2,1,3),(3,1,2),
      (4,2,4),(5,2,2),(6,2,8)
  )
select
  t.*,
  sum(y) over (partition by x order by i) - y as sum,
  max(y) over (partition by x order by i) as max,
  count(*) filter (where y > 2) over (partition by x order by i) as cnt
from
  t;
Actual result is
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | 1 | 0
2 | 1 | 3 | 1 | 3 | 1
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | 4 | 1
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 8 | 2
(6 rows)
I want the max and cnt columns to behave like the sum column, so the result should be:
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | | 0
2 | 1 | 3 | 1 | 1 | 0
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | | 0
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 4 | 1
(6 rows)
It can be achieved using a simple subquery like
select t.*, lag(y,1) over (partition by x order by i) as yy from t
but is it possible using only window function syntax, without subqueries?
Yes, you can. This does the trick:
with
  t(i,x,y) as (
    values
      (1,1,1),(2,1,3),(3,1,2),
      (4,2,4),(5,2,2),(6,2,8)
  )
select
  t.*,
  sum(y) over w as sum,
  max(y) over w as max,
  count(*) filter (where y > 2) over w as cnt
from t
window w as (partition by x order by i
             rows between unbounded preceding and 1 preceding);
The frame_clause selects just those rows from the window frame that you are interested in.
Note that in the sum column you'll get NULL rather than 0 because of the frame clause: the first row of each partition has no preceding rows, so its frame is empty. You can coalesce() this away if needed, as shown below.
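For example, a sketch of the coalesce() variant (same sample data as above):
with
  t(i,x,y) as (
    values
      (1,1,1),(2,1,3),(3,1,2),
      (4,2,4),(5,2,2),(6,2,8)
  )
select
  t.*,
  coalesce(sum(y) over w, 0) as sum,  -- empty frame on each partition's first row becomes 0
  max(y) over w as max,
  count(*) filter (where y > 2) over w as cnt
from t
window w as (partition by x order by i
             rows between unbounded preceding and 1 preceding);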

select the most recent in all groups with the same value in one column

The question isn't very clear, but I'll illustrate what I mean. Suppose my table looks like this:
item_name | date added | val1 | val2
------------------------------------
1 | date+1 | 10 | 20
1 | date | 12 | 21
2 | date+1 | 5 | 6
3 | date+3 | 3 | 1
3 | date+2 | 5 | 2
3 | date | 3 | 1
I want to select rows 1, 3, and 4, as they are the most recent entries for each item.
Try this:
select *
from tableX t1
where t1.date_added = (select max(t2.date_added)
                       from tableX t2
                       where t2.item_name = t1.item_name)
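A window-function alternative, sketched with the same assumed table name and a date_added column:
select *
from (
  select t.*,
         row_number() over (partition by item_name order by date_added desc) as rn
  from tableX t
) ranked
where rn = 1;
Unlike the correlated subquery, this returns exactly one row per item_name even if two rows share the same newest date_added.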