Why does Hive return this result? - hive

An interesting question. There is a table test.test which comprises 11 columns (all string) and only one row of data:
+----+----+----+----+----+----+----+----+----+-----+-----+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 | c11 |
+----+----+----+----+----+----+----+----+----+-----+-----+
| 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10  | 11  |
+----+----+----+----+----+----+----+----+----+-----+-----+
If I execute the query in Hive:
select c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11
from test
where c1='1'
sort by c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11
limit 10;
It returns:
1 2 11 3 4 5 6 7 8 9 10
The column contents have shifted: the value 11 (from c11) now appears in the third position.
The same query returns the expected result in Spark SQL. I also tried the following variants in Hive:
select..from..where..sort by..
select..from..sort by..limit..
select..from..where..limit..
and all of them return the expected result.
Could anyone please explain what causes this weird result?

Related

Extract all rows with the same value in c2 column if at least in one row we have required value in column c1

Sorry for the title, I'll do my best to explain it here
We have a table like the following:
c1 | c2 |c3
____________
111 | 11 | 2
123 | 11 | 3
111 | 44 | 4
156 | 88 | 7
111 | 44 | 8
444 | 44 | 1
123 | 11 | 4
123 | 55 | 4
First, I want to check c3 and find the rows that have the value 2 in the c3 column:
WHERE c3 = 2
This part is easy.
After that, we take the value of c1 from the row found in the previous step. Here it is 111, because that row has 2 in c3.
Now comes the difficult part: for every c2 value that appears at least once together with 111 in c1, we want all the rows with that c2 value.
For example, the first row has 111 in c1, so we take its c2 value (11) and look for all the rows in the table where c2 = 11, even if c1 is no longer 111.
These are the rows I want to get in the end:
c1 | c2 |c3
____________
111 | 11 | 2
123 | 11 | 3
111 | 44 | 4
111 | 44 | 8
444 | 44 | 1
123 | 11 | 4
Hmmm . . . one way translates your requirements almost directly into IN conditions:
select t.*
from t
where t.c2 in (select t2.c2
               from t t2
               where t2.c1 in (select t3.c1 from t t3 where t3.c3 = 2)
              );
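To check the nested IN query against the sample data, here is a minimal sketch that loads the question's rows into an in-memory SQLite database through Python's sqlite3 module. The table name t comes from the answer; the integer column types and the helper name matching_rows are assumptions made for illustration.

```python
import sqlite3

# Sample rows (c1, c2, c3) copied from the question.
ROWS = [
    (111, 11, 2), (123, 11, 3), (111, 44, 4), (156, 88, 7),
    (111, 44, 8), (444, 44, 1), (123, 11, 4), (123, 55, 4),
]

# Same nested IN conditions as in the answer above.
QUERY = """
SELECT t.c1, t.c2, t.c3
FROM t
WHERE t.c2 IN (SELECT t2.c2
               FROM t AS t2
               WHERE t2.c1 IN (SELECT t3.c1 FROM t AS t3 WHERE t3.c3 = 2))
"""

def matching_rows(rows):
    """Load rows into a throwaway table and run the nested IN query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Column types are an assumption; the question does not state them.
        conn.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER, c3 INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

result = matching_rows(ROWS)
for row in result:
    print(row)
```

The rows with c2 = 88 and c2 = 55 drop out, because neither c2 value ever occurs together with c1 = 111.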

Create variations of data efficiently

I have 4 parameters. I need to create a new table that holds all possible variations of these 4 parameters divided by themselves.
Here is an example with 4 parameters.
The original parameters:
p1 | p2 | p3 | p4 |
=====+=====+=====+=====+
8 | 8 | 8 | 8 |
The new table should contain:
p1 | p2 | p3 | p4 |
=====+=====+=====+=====+
8 | 8 | 8 | 8 | The original row
1 | 8 | 8 | 8 | One cell divided by 8 (4 rows overall)
8 | 1 | 8 | 8 |
8 | 8 | 1 | 8 |
8 | 8 | 8 | 1 |
1 | 1 | 8 | 8 | Two cells divided by 8 (6 rows overall)
1 | 8 | 1 | 8 |
1 | 8 | 8 | 1 |
8 | 1 | 1 | 8 |
8 | 1 | 8 | 1 |
8 | 8 | 1 | 1 |
1 | 1 | 1 | 8 | Three cells divided by 8 (4 rows overall)
1 | 1 | 8 | 1 |
1 | 8 | 1 | 1 |
8 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | All cells divided by 8 (1 row overall)
I'm looking for the most efficient way to do it, because the next step might be to do the same with 5 parameters (and with all kinds of mathematical operations).
I thought of using a WHILE loop, but I don't know how to iterate over the columns the way a nested for loop would in C/C++/Java/Python etc. Are there other ways to create this? What is the most efficient way to do it?
The cross join solution:
select *
from
(values (8),(1)) as q1(p1)
cross join
(values (8),(1)) as q2(p2)
cross join
(values (8),(1)) as q3(p3)
cross join
(values (8),(1)) as q4(p4)
Based on your comment, you can use apply:
select p1.p1, p2.p2, p3.p3, p4.p4
from t cross apply
(values (t.p1, t.p1 / 8.0)) p1(p1) cross apply
(values (t.p2, t.p2 / 8.0)) p2(p2) cross apply
(values (t.p3, t.p3 / 8.0)) p3(p3) cross apply
(values (t.p4, t.p4 / 8.0)) p4(p4) ;
Note: A SQL query returns a fixed number of columns. If you need to handle a variable number of parameters, then you need dynamic SQL.
That said, it should be obvious how to extend this to whatever specific number of parameters you need.
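As an illustration of the dynamic SQL remark, the sketch below builds the cross join for any number of parameters and runs it in an in-memory SQLite database via Python's sqlite3 module. The function names build_variations_query and variations are hypothetical, and SQLite stands in for the original database, so each derived table is written as SELECT ... UNION ALL SELECT ... instead of a VALUES list with a column alias.

```python
import sqlite3

def build_variations_query(n):
    """Return a CROSS JOIN query over n two-row derived tables (q1..qn)."""
    tables = [
        # Row 1: the original value. Row 2: the value divided by the divisor.
        f"(SELECT ? AS p{i} UNION ALL SELECT ? / ?) AS q{i}"
        for i in range(1, n + 1)
    ]
    columns = ", ".join(f"q{i}.p{i}" for i in range(1, n + 1))
    return f"SELECT {columns} FROM " + " CROSS JOIN ".join(tables)

def variations(params, divisor=8.0):
    """Return all 2**n combinations of (value, value / divisor) per parameter."""
    sql = build_variations_query(len(params))
    args = []
    for p in params:
        # Three placeholders per derived table: value, value, divisor.
        args.extend([p, p, float(divisor)])
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql, args).fetchall()
    finally:
        conn.close()

rows = variations([8, 8, 8, 8])
print(len(rows))
```

Only the column and table names are formatted into the SQL text; the values themselves are bound as parameters. Going from 4 to 5 parameters is then just a longer input list.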

Postgres Query Based on Previous and Next Rows

I'm trying to solve the bus routing problem in PostgreSQL, which requires visibility of the previous and next rows. Here is my solution.
Step 1) Have one edges table which represents all the edges (the source and target columns represent vertices, i.e. bus stops):
postgres=# select id, source, target, cost from busedges;
id | source | target | cost
----+--------+--------+------
1 | 1 | 2 | 1
2 | 2 | 3 | 1
3 | 3 | 4 | 1
4 | 4 | 5 | 1
5 | 1 | 7 | 1
6 | 7 | 8 | 1
7 | 1 | 6 | 1
8 | 6 | 8 | 1
9 | 9 | 10 | 1
10 | 10 | 11 | 1
11 | 11 | 12 | 1
12 | 12 | 13 | 1
13 | 9 | 15 | 1
14 | 15 | 16 | 1
15 | 9 | 14 | 1
16 | 14 | 16 | 1
Step 2) Have a table which represents bus details such as from time, to time, edge, etc.
NOTE: I have used an integer format for the "from" and "to" columns for faster results, as I can do an integer comparison, but I can replace it with a better format if one is available.
postgres=# select id, "busedgeId", "busId", "from", "to" from busedgetimes;
id | busedgeId | busId | from | to
----+-----------+-------+-------+-------
18 | 1 | 1 | 33000 | 33300
19 | 2 | 1 | 33300 | 33600
20 | 3 | 2 | 33900 | 34200
21 | 4 | 2 | 34200 | 34800
22 | 1 | 3 | 36000 | 36300
23 | 2 | 3 | 36600 | 37200
24 | 3 | 4 | 38400 | 38700
25 | 4 | 4 | 38700 | 39540
Step 3) Use Dijkstra's algorithm to find the shortest path.
Step 4) Get the upcoming buses from the busedgetimes table, earliest first, for the path found by Dijkstra's algorithm.
Problem: I am finding it difficult to write the query for Step 4.
For example: suppose I get the path as edges 2, 3, 4 to travel from source vertex 2 to target vertex 5 in the above records. Getting the first bus for the first edge is not so hard, as I can simply query with from < 'expected departure' order by from desc. But for the second edge, the from condition requires the to time of the first result row. The query also needs a filter on the edge ids.
How can I achieve this in a single query?
I am not sure I understood your problem correctly, but getting values from other rows can be done with window functions (https://www.postgresql.org/docs/current/static/tutorial-window.html):
SELECT
    id,
    lag("to") OVER (ORDER BY id) AS prev_to,
    "from",
    "to",
    lead("from") OVER (ORDER BY id) AS next_from
FROM busedgetimes;
The lag function moves the value of the previous row into the current one. The lead function does the same with the next row. So you are able to calculate a difference between last arrival and current departure or something like that.
Result:
id | prev_to | from  | to    | next_from
---+---------+-------+-------+-----------
18 |         | 33000 | 33300 | 33300
19 | 33300   | 33300 | 33600 | 33900
20 | 33600   | 33900 | 34200 | 34200
21 | 34200   | 34200 | 34800 | 36000
22 | 34800   | 36000 | 36300 |
Please note that "from" and "to" are reserved words in PostgreSQL. It would be better to choose other names.
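To try lag and lead on all eight rows from the question, here is a small sketch using Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); SQLite stands in for PostgreSQL here, and the helper name neighbours is made up for the example. Because it uses all eight rows, the last lines differ from a five-row result.

```python
import sqlite3

# Rows (id, busedgeId, busId, from, to) copied from the question.
ROWS = [
    (18, 1, 1, 33000, 33300),
    (19, 2, 1, 33300, 33600),
    (20, 3, 2, 33900, 34200),
    (21, 4, 2, 34200, 34800),
    (22, 1, 3, 36000, 36300),
    (23, 2, 3, 36600, 37200),
    (24, 3, 4, 38400, 38700),
    (25, 4, 4, 38700, 39540),
]

# lag() pulls "to" from the previous row, lead() pulls "from" from the next.
QUERY = """
SELECT id,
       lag("to")    OVER (ORDER BY id) AS prev_to,
       "from",
       "to",
       lead("from") OVER (ORDER BY id) AS next_from
FROM busedgetimes
ORDER BY id
"""

def neighbours(rows):
    """Return each row together with its previous arrival and next departure."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE busedgetimes (id INTEGER, "busedgeId" INTEGER, '
            '"busId" INTEGER, "from" INTEGER, "to" INTEGER)'
        )
        conn.executemany("INSERT INTO busedgetimes VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

result = neighbours(ROWS)
for row in result:
    print(row)
```

The first row has no predecessor and the last row has no successor, so prev_to and next_from come back as None (NULL) there.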

How to use previous row's column's value for calculating the next row's column's value

I have a table
Id | Aisle | OddEven | Bay | Size | Y-Axis
3 | A1 | Even | 14 | 10 | 100
1 | A1 | Even | 16 | 10 |
6 | A1 | Even | 20 | 10 |
12 | A1 | Even | 26 | 5 | 150
10 | A1 | Even | 28 | 5 |
11 | A1 | Even | 32 | 5 |
2 | A1 | Odd | 13 | 10 | 100
5 | A1 | Odd | 17 | 10 |
4 | A1 | Odd | 19 | 10 |
9 | A1 | Odd | 23 | 5 | 150
7 | A1 | Odd | 25 | 5 |
8 | A1 | Odd | 29 | 5 |
I want it to look like this:
Id | Aisle | OddEven | Bay | Size | Y-Axis
1 | A1 | Even | 14 | 10 | 100
2 | A1 | Even | 16 | 10 | 110
3 | A1 | Even | 20 | 10 | 120
4 | A1 | Even | 26 | 5 | 150
5 | A1 | Even | 28 | 5 | 155
6 | A1 | Even | 32 | 5 | 160
7 | A1 | Odd | 13 | 10 | 100
8 | A1 | Odd | 17 | 10 | 110
9 | A1 | Odd | 19 | 10 | 120
10 | A1 | Odd | 23 | 5 | 150
11 | A1 | Odd | 25 | 5 | 155
12 | A1 | Odd | 29 | 5 | 160
I need a select query and an update query. Some Y-Axis values are already filled in (at the start of each Odd/Even group). For each remaining row, I need to take the previous row's Y-Axis value and add the current row's Size to get the current Y-Axis. This continues until a row that already has a Y-Axis value is found; that row skips the calculation, and the next row builds on its value.
My thinking process is this:
Id will definitely be used; however, the Id is not sequential, as shown in my example,
so I need to have
ROW_NUMBER() OVER (PARTITION BY Aisle,OddEven,Bay ORDER BY Aisle,OddEven,Bay)
Then some kind of JOIN of the table to itself, with ON T1.RN = T2.RN - 1
Where I am stuck is that the first row has no previous value, yet the query will try to update it.
If anyone has an idea for a SELECT and UPDATE query for SQL Server 2008, it will be greatly appreciated! Thanks.
You seem to want a cumulative sum. This would be easier in SQL Server 2012+. You can do this in SQL Server 2008 using outer apply:
select t.*, cume_value
from t outer apply
     (select sum(size) + sum(yaxis) as cume_value
      from t t2
      where t2.aisle = t.aisle and t2.oddeven = t.oddeven and
            t2.bay < t.bay
     ) t2;
A little more difficult on 2008, but I think this is what you are looking for:
Declare @Table table (Id int,Aisle varchar(25),OddEven varchar(25),Bay int,Size int,[Y-Axis] int)
Insert Into @Table values
(3,'A1','Even',14,10 ,100),
(1,'A1','Even',16,10 ,0),
(6,'A1','Even',20,10 ,0),
(12,'A1','Even',26,5,150),
(10,'A1','Even',28,5,0),
(11,'A1','Even',32,5,0),
(2,'A1','Odd',13,10 ,100),
(5,'A1','Odd',17,10 ,0),
(4,'A1','Odd',19,10 ,0),
(9,'A1','Odd',23,5,150),
(7,'A1','Odd',25,5,0),
(8,'A1','Odd',29,5,0)
;with cteBase as (
Select *
,IDNew=Row_Number() over (Order By Aisle,Bay)
,RowNr=Row_Number() over (Order By Aisle,OddEven,Bay)
From @Table
)
, cteGroup as (Select TmpRowNr=RowNr,GrpNr=Row_Number() over (Order By RowNr) from cteBase where [Y-Axis]>0)
, cteFinal as (
Select A.*
,GrpNr = (Select max(GrpNr) from cteGroup Where TmpRowNr<=RowNr)
From cteBase A
)
Select ID=Row_Number() over (Order By A.OddEven,A.Bay)
,A.Aisle
,A.OddEven
,A.Bay
,A.Size
,[Y-Axis] = Sum(case when B.[Y-Axis]>0 then B.[Y-Axis] else B.Size end)
From cteFinal A
Join cteFinal B on (B.RowNr<=A.RowNr and A.GrpNr=B.GrpNr)
Group By
A.IDNew
,A.Aisle
,A.OddEven
,A.Bay
,A.Size
Order By A.OddEven,A.Bay
Returns
ID Aisle OddEven Bay Size Y-Axis
1 A1 Even 14 10 100
2 A1 Even 16 10 110
3 A1 Even 20 10 120
4 A1 Even 26 5 150
5 A1 Even 28 5 155
6 A1 Even 32 5 160
7 A1 Odd 13 10 100
8 A1 Odd 17 10 110
9 A1 Odd 19 10 120
10 A1 Odd 23 5 150
11 A1 Odd 25 5 155
12 A1 Odd 29 5 160
I gotta leave my computer, but the update query should be easy to work out from here.
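Below is the select query (note that ORDER BY inside OVER for aggregates such as MAX and SUM requires SQL Server 2012 or later):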
select row_number() over (order by oddeven,bay) id,
Aisle,
OddEven,
Bay,
Size,
max(ISNULL([Y-Axis],0)) over (partition by Aisle, OddEven,Size order by bay)
+ sum(CASE WHEN [Y-Axis] is null THEN Size ELSE 0 END) over (partition by Aisle,OddEven,size order by Bay) as [Y-Axis]
from oddseven
order by id
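As a cross-check of the rule described in the question (a blank Y-Axis becomes the previous row's Y-Axis plus the current row's Size, within the same Aisle and OddEven group, ordered by Bay), here is a plain Python sketch. It is not a translation of any of the queries above; the function name fill_y_axis and the tuple layout are assumptions for illustration.

```python
# Rows (Id, Aisle, OddEven, Bay, Size, YAxis) copied from the question;
# None marks a blank Y-Axis cell.
ROWS = [
    (3, "A1", "Even", 14, 10, 100), (1, "A1", "Even", 16, 10, None),
    (6, "A1", "Even", 20, 10, None), (12, "A1", "Even", 26, 5, 150),
    (10, "A1", "Even", 28, 5, None), (11, "A1", "Even", 32, 5, None),
    (2, "A1", "Odd", 13, 10, 100), (5, "A1", "Odd", 17, 10, None),
    (4, "A1", "Odd", 19, 10, None), (9, "A1", "Odd", 23, 5, 150),
    (7, "A1", "Odd", 25, 5, None), (8, "A1", "Odd", 29, 5, None),
]

def fill_y_axis(rows):
    """Renumber rows by (Aisle, OddEven, Bay) and fill blank Y-Axis values."""
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[3]))
    filled = []
    prev_group, prev_y = None, None
    for new_id, (_, aisle, odd_even, bay, size, y) in enumerate(ordered, start=1):
        group = (aisle, odd_even)
        if y is None and group == prev_group and prev_y is not None:
            # Blank cell: previous row's Y-Axis plus this row's Size.
            y = prev_y + size
        # A pre-filled value is kept as-is and becomes the new starting point.
        filled.append((new_id, aisle, odd_even, bay, size, y))
        prev_group, prev_y = group, y
    return filled

result = fill_y_axis(ROWS)
for row in result:
    print(row)
```

A blank cell at the very start of a group stays None, since there is no earlier value to build on.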

Using an outerjoin to find where all corresponding values for a tuple are zero

I have the following table data (e0 is the primary key):
+-----+----+----+----+----+
| e0 | e1 | e2 | e3 | e4 |
+-----+----+----+----+----+
| 111 | 2 | 5 | 7 | 0 |
| 222 | 2 | 5 | 7 | 0 |
| 333 | 3 | 6 | 8 | 7 |
| 444 | 1 | 3 | 2 | 2 |
| 555 | 1 | 3 | 2 | 0 |
| 666 | 1 | 3 | 2 | 0 |
| 777 | 6 | 3 | 4 | 0 |
| 888 | 6 | 3 | 4 | 0 |
| 999 | 6 | 3 | 4 | 0 |
+-----+----+----+----+----+
This is part of an exercise where I need to use an outerjoin to find which tuples of (e1,e2,e3) have ALL corresponding values of e4 as 0 (i.e. the query has to return (2,5,7) and (6,3,4)). I've tried a few solutions, but all of them still include (1,3,2) which is not meant to happen.
Does anybody have an idea for an outerjoin that would return (2,5,7) and (6,3,4)?
I would just use NOT EXISTS, but to express that using outer joins you can use the query below.
SELECT DISTINCT a.e1,
a.e2,
a.e3
FROM data a
LEFT OUTER JOIN data b
ON a.e1 = b.e1
AND a.e2 = b.e2
AND a.e3 = b.e3
AND b.e4 <> 0
WHERE b.e1 IS NULL
And the NOT EXISTS method
SELECT DISTINCT a.e1,
a.e2,
a.e3
FROM data a
WHERE NOT EXISTS (SELECT *
FROM data b
WHERE a.e1 = b.e1
AND a.e2 = b.e2
AND a.e3 = b.e3
AND b.e4 <> 0)
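Both variants can be run against the sample data with the following sketch, which uses Python's sqlite3 module and an in-memory database. The table name data comes from the answer; the integer column types and the helper name run are assumptions for illustration.

```python
import sqlite3

# Rows (e0, e1, e2, e3, e4) copied from the question.
ROWS = [
    (111, 2, 5, 7, 0), (222, 2, 5, 7, 0), (333, 3, 6, 8, 7),
    (444, 1, 3, 2, 2), (555, 1, 3, 2, 0), (666, 1, 3, 2, 0),
    (777, 6, 3, 4, 0), (888, 6, 3, 4, 0), (999, 6, 3, 4, 0),
]

# Anti-join: keep rows of a that have no partner b with a non-zero e4.
LEFT_JOIN_QUERY = """
SELECT DISTINCT a.e1, a.e2, a.e3
FROM data a
LEFT OUTER JOIN data b
  ON a.e1 = b.e1 AND a.e2 = b.e2 AND a.e3 = b.e3 AND b.e4 <> 0
WHERE b.e1 IS NULL
"""

NOT_EXISTS_QUERY = """
SELECT DISTINCT a.e1, a.e2, a.e3
FROM data a
WHERE NOT EXISTS (SELECT *
                  FROM data b
                  WHERE a.e1 = b.e1 AND a.e2 = b.e2
                    AND a.e3 = b.e3 AND b.e4 <> 0)
"""

def run(query, rows):
    """Load rows into a throwaway table and return the query result as a set."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE data (e0 INTEGER PRIMARY KEY, e1 INTEGER, "
            "e2 INTEGER, e3 INTEGER, e4 INTEGER)"
        )
        conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)
        return set(conn.execute(query).fetchall())
    finally:
        conn.close()

left_join_result = run(LEFT_JOIN_QUERY, ROWS)
not_exists_result = run(NOT_EXISTS_QUERY, ROWS)
print(sorted(left_join_result))
```

The tuple (1, 3, 2) is excluded because row 444 has e4 = 2, so every row of that group finds a matching b in the join and fails the IS NULL test.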
I'm not sure if this actually gives the desired results (semantically); but it doesn't use an OUTER JOIN at all:
SELECT e1, e2, e3 FROM (
SELECT e0, e1, e2, e3, e4, COUNT(*) AS c FROM data
GROUP BY e1, e2, e3
HAVING c > 1
) AS b
WHERE b.e4 = 0
It does give the rows and columns you specify from your data set; but I'm not sure I'm understanding the question quite right.
Why do you need to use an OUTER JOIN? Is this equivalent to Martin Smith's answer?