How can I generate a UID on-the-fly in bigquery SQL? - sql

I am trying to join a table with itself. Here is an MWE of the problem:
WITH elems as (
SELECT letter, generate_uuid() randomid
FROM
UNNEST(SPLIT('aabcdefghij', '')) letter
),
l as (SELECT * FROM elems),
r as (SELECT * FROM elems)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter
If you run this, you will see that the random IDs on the left and on the right are different. Obviously, if you uncomment the other join instead, it returns no results. The same happens for row_number() OVER (), and because my top-level elements are not unique I cannot simply use row_number() OVER (ORDER BY letter), as it could still assign different IDs to the two "a" entries.
The actual table is obviously way more complex, and contains arrays of arrays. However, as here, the top level elements are not necessarily unique, so I need to generate UIDs before unnesting, so I can later join them together correctly.
I understand that a work-around would be to save the table with the UID first, and then do the self-join, but I had hoped I wouldn't need to do that, as in general this data doesn't need identification at this level. So if there is some way of making the UID persistent through my queries, rather than generated anew on-demand, it would really help me.

A CTE in a WITH clause is held only in memory and is not materialized: each reference re-evaluates the query, and generate_uuid() was made to always regenerate unique values, so every evaluation produces fresh IDs. Creating a true temporary table fixes the issue.
Here is an example of a script that creates a temporary table at `your-project.dataset.test_guid_2` with a five-second expiration, then uses it:
CREATE TABLE `your-project.dataset.test_guid_2`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 SECOND)
) AS
SELECT letter, CAST(generate_uuid() AS STRING) randomid
FROM
UNNEST(SPLIT('abcdefghij', '')) letter;
WITH
l as (SELECT * FROM `your-project.dataset.test_guid_2`),
r as (SELECT * FROM `your-project.dataset.test_guid_2`)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter

Related

Using Select * in a SQL JOIN returns the wrong id value for the wrong table

I have two tables (PlayerDTO and ClubDTO) and am using a JOIN to fetch data as follows:
SELECT * FROM PlayerDTO AS pl
INNER JOIN ClubDTO AS cl
ON pl.currentClub = cl.id
WHERE cl.nation = 7
This returns the correct rows from PlayerDTO, but in every row the id column has been changed to the value of the currentClub column (eg instead of pl.id 3,456 | pl.currentClub 97, it has become pl.id 97 | pl.currentClub 97).
So I tried the query listing all the columns by name instead of Select *:
SELECT pl.id, pl.nationality, pl.currentClub, pl.status, pl.lastName FROM PlayerDTO AS pl
INNER JOIN ClubDTO AS cl
ON pl.currentClub = cl.id
WHERE cl.nation = 7
This works correctly and doesn’t change any values.
PlayerDTO has over 100 columns (I didn’t list them all above for brevity, but I included them all in the query) but obviously I don’t want to write every column name in every query.
So could somebody please explain why Select * changes the id value and what I need to do to make it work correctly? All my tables have a column called id, is that something to do with it?
SELECT * is, according to the docs, shorthand for "select all columns" (source: dev.mysql.com).
Both of your tables have an id column, so the joined result set contains two columns named id. MySQL returns both, but a client layer that maps each row to a name-keyed structure can keep only one value per name, so the later id silently overwrites the earlier one. So select exactly what you want:
SELECT pl.id, *otherfieldsyouwant* FROM PlayerDTO AS pl...
Or...
SELECT pl.* FROM PlayerDTO AS pl...
Typically, SELECT * is bad form. The odds that you are using every field are astronomically low, and the more data you pull, the slower the query runs.
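The overwrite happens in the client layer, whenever a driver maps the row to a name-keyed structure. A minimal sketch using Python's sqlite3 with made-up miniature versions of the two tables (column names follow the question; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PlayerDTO (id INTEGER, currentClub INTEGER);
    CREATE TABLE ClubDTO (id INTEGER, nation INTEGER);
    INSERT INTO PlayerDTO VALUES (3456, 97);
    INSERT INTO ClubDTO VALUES (97, 7);
""")

cur = conn.execute("""
    SELECT * FROM PlayerDTO AS pl
    INNER JOIN ClubDTO AS cl ON pl.currentClub = cl.id
    WHERE cl.nation = 7
""")

# SELECT * yields two columns named 'id' -- one from each table
cols = [d[0] for d in cur.description]  # ['id', 'currentClub', 'id', 'nation']
# mapping by name makes the later 'id' overwrite the earlier one
row = dict(zip(cols, cur.fetchone()))
print(row)  # {'id': 97, 'currentClub': 97, 'nation': 7}
```

Listing pl.id explicitly, or aliasing it (e.g. pl.id AS player_id), avoids the collision regardless of the client library.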

Create new rows depending on values in Hive/Impala

I am trying to do an operation on Hive/Impala and I don't know how to continue. First, I explain what I want to do. I have the following table:
Well, I want to create a new row for each missing position, and assign it a zero value. The table would look like this:
I do not know if it is possible to create this functionality in Hive or Impala, either one would suit me.
Thanks a lot!
You can use a trick in Hive where you generate a string of spaces and then split the string into an array and turn the array into a table:
select pe.i, coalesce(t.value, 0) as value
from (select i, x
from (select max(position) as max_position
from t
) p lateral view
posexplode(split(space(p.max_position), ' ')) pe as i, x
) pe left join
t
on pe.i = t.position;
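The mechanics of the space()/split()/posexplode() trick can be sketched outside Hive. Assuming a made-up max_position of 4: space(n) yields a string of n spaces, splitting on ' ' yields n+1 elements, and posexplode numbers them 0 through n.

```python
# Simulate Hive's space()/split()/posexplode() pipeline for max_position = 4.
max_position = 4

s = " " * max_position         # space(max_position)
parts = s.split(" ")           # split(..., ' ') -> n+1 empty strings
rows = list(enumerate(parts))  # posexplode: (pos, val) pairs

print(len(parts))              # 5
print([i for i, _ in rows])    # [0, 1, 2, 3, 4]
```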
Based on @GordonLinoff's answer, I got what I wanted, with some changes. It is basically what he says, but with his answer split into two separate queries, because in Hive you cannot use a LATERAL VIEW and a JOIN in the same query. The solution would be:
create table t1 as
select i, x
from (select max(position) as max_position from t) p
lateral view posexplode(split(space(p.max_position), ' ')) pe as i, x;
select a.i, coalesce(b.value, 0) as value
from t1 a LEFT JOIN t b
on a.i = b.position
where a.i != 0
Thanks Gordon!

Selecting ambiguous column from subquery with postgres join inside

I have the following query:
select x.id0
from (
select *
from sessions
inner join clicked_products on sessions.id0 = clicked_products.session_id0
) x;
Since id0 is in both sessions and clicked_products, I get the expected error:
column reference "id0" is ambiguous
However, to fix this problem in the past I simply needed to specify a table. In this situation, I tried:
select sessions.id0
from (
select *
from sessions
inner join clicked_products on sessions.id0 = clicked_products.session_id0
) x;
However, this results in the following error:
missing FROM-clause entry for table "sessions"
How do I return just the id0 column from the above query?
Note: I realize I can trivially solve the problem by getting rid of the subquery all together:
select sessions.id0
from sessions
inner join clicked_products on sessions.id0 = clicked_products.session_id0;
However, I need to do further aggregations and so do need to keep the subquery syntax.
The only way you can do that is by using aliases for the columns returned from the subquery so that the names are no longer ambiguous.
Qualifying the column with the table name does not work, because sessions is not visible at that point (only x is).
True, this way you cannot use SELECT *, but you shouldn't do that anyway. For a reason why, your query is a wonderful example:
Imagine that you have a query like yours that works, and then somebody adds a new column with the same name as a column in the other table. Then your query suddenly and mysteriously breaks.
Avoid SELECT *. It is ok for ad-hoc queries, but not in code.
select x.id from
(select sessions.id0 as id, clicked_products.* from sessions
inner join
clicked_products on
sessions.id0 = clicked_products.session_id0 ) x;
However, any other columns you need from the table sessions have to be listed explicitly, since sessions.* would reintroduce the ambiguous id0.
I assume:
select x.id from (select sessions.id0 id
from sessions
inner join clicked_products
on sessions.id0 = clicked_products.session_id0 ) x;
should work.
Another option is to use a Common Table Expression (CTE), which is more readable and easier to test, but you still need aliases or unique column names.
In general, selecting everything with * is not a good idea: reading all columns is a waste of I/O.
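A rough sketch of the CTE variant, run here with Python's sqlite3 over invented tables (the id0/session_id0 names follow the question; the COUNT is a made-up stand-in for the "further aggregations" mentioned):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (id0 INTEGER);
    CREATE TABLE clicked_products (id0 INTEGER, session_id0 INTEGER);
    INSERT INTO sessions VALUES (1), (2);
    INSERT INTO clicked_products VALUES (100, 1), (101, 1), (102, 2);
""")

rows = conn.execute("""
    WITH x AS (
        SELECT sessions.id0 AS session_id  -- alias removes the ambiguity
        FROM sessions
        INNER JOIN clicked_products
            ON sessions.id0 = clicked_products.session_id0
    )
    SELECT session_id, COUNT(*) FROM x
    GROUP BY session_id
    ORDER BY session_id
""").fetchall()
print(rows)  # [(1, 2), (2, 1)]
```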

Select all rows after a certain id from other table

For an assignment I have to do certain consults from a db.
I need to find all the moves that the pokemon "bulbasaur" has and list their names, power, pp and accuracy.
I've tried this command and I get only one move out of 9 as a result:
SELECT name, power, pp, accuracy
FROM MOVES
WHERE id=(SELECT move_id FROM POKEMONS_MOVES WHERE pokemon_id=(SELECT id
FROM POKEMONS WHERE name="bulbasaur"));
vine-whip|45|25|100
I am using sqlite3 btw. Thanks in advance.
Since there is a cross-reference (junction) table in between, this can easily be achieved with joins:
SELECT
m.name, m.type_id, m.power, m.pp, m.accuracy
FROM
Moves m
INNER JOIN
Pokemons_Moves pm ON pm.move_id = m.id
INNER JOIN
Pokemons p ON p.id = pm.pokemon_id
WHERE
p.name = 'bulbasaur'
You are using scalar subqueries:
WHERE id=(SELECT ...)
This means that if the subquery returns more than one row, the database ignores all rows but the first. (Other databases will error out if that happens.)
To search through all IDs returned by the subquery, you have to use the IN operator:
SELECT name, power, pp, accuracy
FROM Moves
WHERE id IN (SELECT move_id
FROM Pokemons_Moves
WHERE pokemon_id = (SELECT id
FROM Pokemons
WHERE name = 'bulbasaur'));
(The second subquery returns only one value.)
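Since the question uses sqlite3, the difference between = and IN against a multi-row subquery is easy to demonstrate from Python with a made-up mini-schema (the IDs and the moves other than vine-whip are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pokemons (id INTEGER, name TEXT);
    CREATE TABLE Moves (id INTEGER, name TEXT);
    CREATE TABLE Pokemons_Moves (pokemon_id INTEGER, move_id INTEGER);
    INSERT INTO Pokemons VALUES (1, 'bulbasaur');
    INSERT INTO Moves VALUES (10, 'vine-whip'), (11, 'tackle'), (12, 'growl');
    INSERT INTO Pokemons_Moves VALUES (1, 10), (1, 11), (1, 12);
""")

sub = ("SELECT move_id FROM Pokemons_Moves "
       "WHERE pokemon_id = (SELECT id FROM Pokemons WHERE name = 'bulbasaur')")

# '=' treats the subquery as scalar: SQLite keeps only its first row
eq = conn.execute(f"SELECT name FROM Moves WHERE id = ({sub})").fetchall()
# IN matches every row the subquery returns
inn = conn.execute(f"SELECT name FROM Moves WHERE id IN ({sub})").fetchall()
print(len(eq), len(inn))  # 1 3
```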

Filter data based on a condition in Redshift

I came across one more issue while resolving the previous problem:
So, I have this data:
For each route -> I want to get only those rows where ob exists in rb. Hence, this output:
I know this also needs to be worked through a temp table. Earlier I was doing this, as suggested by @smb:
select * from table_name as a
inner join
(select load, rb from table_name
group by load, rb) as b
on a.load = b.load
and
a.ob = b.rb
but this solution will give me:
And this is incorrect as it doesn’t take into account the route.
It’d be great if you guys could help :)
Thanks
Updated to add in the route:
The answer would be a nested join. The concept is:
Get a list of distinct (route, ob, rb) combinations.
Join back to the original data where ob = ob, lane = rb, and route = route.
Code as follows:
select * from table_name as a
inner join
(select route, ob, rb from table_name
group by route, ob, rb) as b
on a.ob = b.ob
and
a.lane = b.rb
and
a.route = b.route
I have done an example using a temp table here so you can see it in action.
Note that if your data is large, you should consider making sure your distribution key is among the join columns. This ensures Redshift knows that no rows need to be joined across different compute nodes, so it can execute multiple local joins and therefore be more efficient.
There are a few ways (an IN clause is simple but often slower on larger sets):
select *
from table
where lane in (select rb from table)
or (I find EXISTS faster on larger sets, but try both):
select *
from table
where exists (select 'x' from table t_inner
where t_inner.rb = table.lane)
Either way, create an index on the rb column for speed.
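The equivalence of the IN and EXISTS forms can be checked on a small made-up table (Python's sqlite3 here rather than Redshift; column names follow the question, data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (route TEXT, lane TEXT, rb TEXT);
    INSERT INTO t VALUES ('r1', 'a', 'b'), ('r1', 'b', 'c'), ('r2', 'x', 'a');
""")

# IN form: keep rows whose lane appears anywhere in the rb column
in_rows = conn.execute(
    "SELECT * FROM t WHERE lane IN (SELECT rb FROM t)").fetchall()

# EXISTS form: same filter expressed as a correlated subquery
exists_rows = conn.execute(
    "SELECT * FROM t WHERE EXISTS (SELECT 'x' FROM t AS t_inner "
    "WHERE t_inner.rb = t.lane)").fetchall()

print(in_rows == exists_rows)  # True
```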