I am trying to do an operation in Hive/Impala and I don't know how to continue. First, let me explain what I want to do. I have the following table:
Well, I want to create a new row for each missing position and assign it a value of zero. The table would look like this:
I do not know whether this is possible in Hive or Impala; either one would suit me.
Thanks a lot!
You can use a trick in Hive where you generate a string of spaces and then split the string into an array and turn the array into a table:
select pe.i, coalesce(t.value, 0) as value
from (select i, x
      from (select max(position) as max_position
            from t
           ) p lateral view
           posexplode(split(space(p.max_position), ' ')) pe as i, x
     ) pe left join
     t
     on pe.i = t.position;
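To see why this trick yields one row per position, here is a minimal Python sketch that mimics what space, split and posexplode do in the query above. The function name fill_missing_positions and the sample rows are hypothetical, and the sketch assumes each position appears at most once in the table.

```python
def fill_missing_positions(rows):
    """rows is a list of (position, value) pairs, like the table t."""
    max_position = max(position for position, _ in rows)
    # space(n) gives n spaces; splitting on ' ' gives n + 1 empty strings.
    pieces = (" " * max_position).split(" ")
    # posexplode yields (index, element) pairs with index 0..max_position.
    by_position = dict(rows)
    # LEFT JOIN plus coalesce(value, 0): missing positions get 0.
    return [(i, by_position.get(i, 0)) for i, _x in enumerate(pieces)]


rows = [(1, 10), (3, 7), (5, 2)]  # hypothetical sample data
filled = fill_missing_positions(rows)
print(filled)
```

Note that index 0 is generated as well, so a filter on the index may be needed if the positions start at 1.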
Based on @GordonLinoff's answer, I got what I wanted, with some changes. Basically it is what he says, but split into two separate queries, because I could not get LATERAL VIEW and JOIN to work in the same Hive query. The solution would be:
create table t1 as
select i, x
from (select max(position) as max_position from t) p
lateral view posexplode(split(space(p.max_position), ' ')) pe as i, x;

select a.i, coalesce(b.value, 0) as value
from t1 a left join t b
  on a.i = b.position
where a.i != 0;
Thanks Gordon!
I am trying to join a table with itself. Here is a MWE of the problem:
WITH ten_elems as (
  SELECT letter, generate_uuid() randomid
  FROM
    UNNEST(SPLIT('aabcdefghij', '')) letter
),
l as (SELECT * FROM ten_elems),
r as (SELECT * FROM ten_elems)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter
If you run this, you will see that the random IDs on the left and on the right are different. Obviously, if you instead uncomment the other join, it returns no results. The same happens for row_number() OVER (), and because my top-level elements are not unique I cannot simply use row_number() OVER (ORDER BY letter), as it may still assign different IDs to the two "a" entries.
The actual table is obviously way more complex, and contains arrays of arrays. However, as here, the top level elements are not necessarily unique, so I need to generate UIDs before unnesting, so I can later join them together correctly.
I understand that a work-around would be to save the table with the UID first, and then do the self-join, but I had hoped I wouldn't need to do that, as in general this data doesn't need identification at this level. So if there is some way of making the UID persistent through my queries, rather than generated anew on-demand, it would really help me.
As far as I know, a non-recursive WITH clause (CTE) in BigQuery is not materialized: each reference to it re-runs the subquery, so generate_uuid() produces fresh values every time. If you create a real temporary table, that fixes the issue.
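The effect of materializing the IDs can be reproduced outside BigQuery. The following is a rough sketch using Python's built-in sqlite3 module as a stand-in (it is not BigQuery itself): generate_uuid is registered here as a hypothetical user-defined function, and the table names letters and elems are made up for the illustration. Once the IDs are stored in a real table, both sides of the self-join see the same values.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for BigQuery's GENERATE_UUID(), registered as a SQLite UDF.
conn.create_function("generate_uuid", 0, lambda: str(uuid.uuid4()))

conn.execute("CREATE TABLE letters (letter TEXT)")
conn.executemany("INSERT INTO letters VALUES (?)", [(c,) for c in "aabcdefghij"])

# Materialize the IDs once; later reads return the stored values.
conn.execute(
    "CREATE TEMP TABLE elems AS "
    "SELECT letter, generate_uuid() AS randomid FROM letters"
)

# Self-join on the stored ID: every row matches exactly itself.
joined = conn.execute(
    "SELECT l.letter, l.randomid, r.randomid "
    "FROM elems l INNER JOIN elems r ON l.randomid = r.randomid"
).fetchall()
print(len(joined))
```

Joining on the stored randomid pairs each row only with itself, including the two "a" rows, which joining on letter cannot do.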
Here is an example script that creates a table which expires after 5 seconds, your-project.dataset.test_guid_2, and then uses it.
CREATE TABLE `your-project.dataset.test_guid_2`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 SECOND)
) AS
SELECT letter, CAST(generate_uuid() AS STRING) randomid
FROM
UNNEST(SPLIT('abcdefghij', '')) letter;
WITH
l as (SELECT * FROM `your-project.dataset.test_guid_2`),
r as (SELECT * FROM `your-project.dataset.test_guid_2`)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter
Output:
I came across one more issue while resolving the previous problem:
So, I have this data:
For each route -> I want to get only those rows where ob exists in rb. Hence, this output:
I know this also needs to be worked through a temp table. Earlier I was doing this as suggested by @smb:
select * from table_name as a
inner join
(select load, rb from table_name
group by load, rb) as b
on a.load = b.load
and
a.ob = b.rb
but this solution will give me:
And this is incorrect as it doesn’t take into account the route.
It’d be great if you guys could help :)
Thanks
Updated to add in route.
The answer is a nested join. The concept is:
1. Get a list of distinct combinations of route, ob and rb.
2. Join to the original data where ob = ob, lane = rb and route = route.
Code as follows:
select * from table_name as a
inner join
(select route, ob, rb from table_name
group by route, ob, rb) as b
on a.ob = b.ob
and
a.lane = b.rb
and
a.route = b.route
I have done an example using a temp table here so you can see it in action.
Note that if your data is large, you should consider making sure your dist key is used in the join. This lets Redshift know that no rows need to be joined across different compute nodes, so it can execute multiple local joins and therefore be more efficient.
There are a few ways. An IN subquery is simple but often slower on larger sets:
select *
from table
where lane in (select rb from table)
Or use EXISTS (I find EXISTS faster on larger sets, but try both):
select *
from table
where exists (select 'x' from table t_inner
where t_inner.rb = table.lane)
Either way, create an index on the rb column for speed.
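To compare the join and EXISTS formulations once the route is taken into account, here is a small sketch using Python's sqlite3 module as a stand-in engine. The rows are hypothetical (the original data is not shown), and the columns follow the question's route, ob and rb names. The row with ob = "D" on route 1 is the interesting case: "D" appears as an rb only on route 2, so a filter that ignores the route would wrongly keep it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (route INTEGER, ob TEXT, rb TEXT)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [
        (1, "A", "B"), (1, "B", "C"), (1, "D", "A"),
        (2, "X", "D"), (2, "D", "Y"),
    ],
)

# Join against the distinct (route, rb) pairs.
with_join = conn.execute(
    """
    SELECT a.route, a.ob, a.rb
    FROM table_name AS a
    INNER JOIN (SELECT route, rb FROM table_name GROUP BY route, rb) AS b
        ON a.route = b.route AND a.ob = b.rb
    ORDER BY a.route, a.ob
    """
).fetchall()

# Correlated EXISTS restricted to the same route.
with_exists = conn.execute(
    """
    SELECT a.route, a.ob, a.rb
    FROM table_name AS a
    WHERE EXISTS (SELECT 1 FROM table_name AS b
                  WHERE b.route = a.route AND b.rb = a.ob)
    ORDER BY a.route, a.ob
    """
).fetchall()
print(with_join == with_exists)
```

Grouping the derived table by (route, rb) keeps the join from duplicating rows when the same rb occurs several times on a route.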
Two tables are joined by an ID pointer in the second table.
It is a really old database that splits content into 8000-character chunks, and now I need to combine them again.
Main table (dbo.pressrelease_tbl): [id], [headline], [body], [body2], [picname], [picpath], [postrelease], [postdate]
Second table (dbo.pressrelease2_tbl): [id], [pr1id], [body2]
Pr1id points to the main data table. The Main data table's [body2] is a bool "yes" or Null.
I want both bodies to be combined into one in the final output.
Select * FROM dbo.pressrelease_tbl m
LEFT JOIN dbo.pressrelease2_tbl m1
ON m1.pr1id = m.id
I am stuck on the concatenate part.
Use CONCAT() and make sure at least one of the strings is nvarchar(max):
SELECT m.*, CONCAT(CAST(m.body AS nvarchar(max)), m1.body2) concatBody
FROM dbo.pressrelease_tbl m
LEFT JOIN dbo.pressrelease2_tbl m1
ON m1.pr1id = m.id
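One detail worth checking is what happens to rows with no continuation row. SQL Server's CONCAT() treats NULL arguments as empty strings, so those rows keep their original body. The sketch below reproduces that behaviour with Python's sqlite3 module as a stand-in: SQLite's || operator returns NULL when either side is NULL, so COALESCE is used to emulate CONCAT(). The sample rows are hypothetical and only the columns needed for the join are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pressrelease_tbl (id INTEGER, body TEXT)")
conn.execute(
    "CREATE TABLE pressrelease2_tbl (id INTEGER, pr1id INTEGER, body2 TEXT)"
)
# Hypothetical data: release 1 has a continuation row, release 2 does not.
conn.executemany(
    "INSERT INTO pressrelease_tbl VALUES (?, ?)",
    [(1, "first half, "), (2, "short release")],
)
conn.execute("INSERT INTO pressrelease2_tbl VALUES (10, 1, 'second half')")

combined = conn.execute(
    """
    SELECT m.id, m.body || COALESCE(m1.body2, '') AS concatBody
    FROM pressrelease_tbl AS m
    LEFT JOIN pressrelease2_tbl AS m1
        ON m1.pr1id = m.id
    ORDER BY m.id
    """
).fetchall()
print(combined)
```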
I have a query that returns a one-row table, and I need to use this result in a subsequent computation. Here is a simplified, non-working example (just to show what I'm trying to achieve):
SELECT amount / (SELECT SUM(amount) FROM [...]) FROM [...]
I tried some nested sub-selects and joins (cross join of the one row table with the other table) but didn't find any working solution. Is there a way to get this working in BigQuery?
Thanks, Radek
EDIT:
OK, I found a solution:
select
t1.x / t2.y as z
from
(select 1 as k, amount as x from [...] limit 10) as t1
join
(select 1 as k, sum(amount) as y from [...]) as t2
on
t1.k = t2.k;
but I'm not sure if this is the best way to do it...
With the recently announced ratio_to_report() window function:
SELECT RATIO_TO_REPORT(amount) OVER() AS z
FROM [...]
ratio_to_report takes the amount and divides it by the sum of the amounts over all result rows.
The way you've found (essentially a cross join using a dummy key) is the best way I know of to do this query. We've thought about adding an explicit cross join operator to make it easier to see how to do this, but cross join can get expensive if not done correctly (e.g. if done on two large tables can create n^2 results).
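As a quick check of the dummy-key join, here is a sketch that runs the same shape of query with Python's sqlite3 module standing in for BigQuery. The payments table and its amounts are hypothetical, chosen so that the ratios are exact in floating point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount REAL)")
# Hypothetical amounts summing to 16.
conn.executemany("INSERT INTO payments VALUES (?)", [(2.0,), (6.0,), (8.0,)])

# Every row of t1 carries k = 1 and so does the single row of t2,
# which makes the join pair each amount with the grand total.
ratios = [
    row[0]
    for row in conn.execute(
        """
        SELECT t1.x / t2.y AS z
        FROM (SELECT 1 AS k, amount AS x FROM payments) AS t1
        JOIN (SELECT 1 AS k, SUM(amount) AS y FROM payments) AS t2
            ON t1.k = t2.k
        ORDER BY z
        """
    )
]
print(ratios)
```

Because the aggregate side has exactly one row, the join returns one row per input row rather than an n^2 result.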
Is it possible to accomplish the equivalent of a LEFT JOIN with a subselect when multiple columns are required?
Here's what I mean.
SELECT m.*, (SELECT * FROM model WHERE id = m.id LIMIT 1) AS models FROM make m
As it stands now doing this gives me a 'Operand should contain 1 column(s)' error.
Yes, I know this is possible with LEFT JOIN, but I was told it was possible with a subselect, so I'm curious as to how it's done.
There are many practical uses for what you suggest.
This hypothetical query would return the most recent release_date (contrived example) for any make with at least one release_date, and null for any make with no release_date:
SELECT m.make_name,
sub.max_release_date
FROM make m
LEFT JOIN
(SELECT id,
max(release_date) as max_release_date
FROM make
GROUP BY 1) sub
ON sub.id = m.id
A subselect used as a column in the SELECT list can return only one column, so you would need one subselect for each column that you want returned from the model table.
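A minimal sketch of the one-subselect-per-column approach follows, using Python's sqlite3 module as a stand-in for MySQL. The schema is hypothetical: the make_id, model_name and release_year columns and all sample rows are invented for the illustration. Each subselect uses the same ORDER BY so that both columns come from the same model row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE make (id INTEGER, make_name TEXT)")
conn.execute(
    "CREATE TABLE model "
    "(id INTEGER, make_id INTEGER, model_name TEXT, release_year INTEGER)"
)
# Hypothetical data: the third make has no models at all.
conn.executemany(
    "INSERT INTO make VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex"), (3, "Initech")],
)
conn.executemany(
    "INSERT INTO model VALUES (?, ?, ?, ?)",
    [(1, 1, "Roadster", 2001), (2, 1, "Wagon", 2010), (3, 2, "Coupe", 2005)],
)

# One scalar subselect per column wanted from model.
latest = conn.execute(
    """
    SELECT m.make_name,
           (SELECT model_name FROM model WHERE make_id = m.id
            ORDER BY release_year DESC LIMIT 1) AS model_name,
           (SELECT release_year FROM model WHERE make_id = m.id
            ORDER BY release_year DESC LIMIT 1) AS release_year
    FROM make AS m
    ORDER BY m.id
    """
).fetchall()
print(latest)
```

As with a LEFT JOIN, a make without any model still appears, with NULL in the subselect columns. The cost is that the lookup is repeated once per column.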