Impala sub-query returns strange result - Hive

I am running the Impala queries below and getting strange results. Why does the 2nd query return zero results, and how can I overcome this? I am building several data pipelines with multiple tables, so I have to use WITH at the beginning of my queries.
1. Query: select * from test where name <> 'INSERT'
+----+--------+
| id | name   |
+----+--------+
| 2  | DELETE |
| 2  | HELLO  |
+----+--------+
Fetched 2 row(s) in 0.13s
2. Query: with temp as (select * from test where name <> 'INSERT') select * from temp
Modified 0 row(s) in 0.23s
3. Query: with temp as (select * from test where name <> 'HELLO') select * from temp
+----+--------+
| id | name   |
+----+--------+
| 2  | DELETE |
| 1  | INSERT |
+----+--------+
Fetched 2 row(s) in 0.12s
It should return the records named 'DELETE' and 'HELLO' for the 2nd query, but it gives no results. I also noticed the output says "Modified", so I am guessing it is being executed as DML.
Note: using Impala Shell v2.11.0-cdh5.14.2.
The same query works fine in Hive.

It seems to work on my side
with temp as (
  SELECT *
  FROM (SELECT 'DELETE' AS name
        UNION SELECT 'HELLO' AS name
        UNION SELECT 'INSERT' AS name) AS subq
  WHERE name <> 'INSERT')
select * from temp;
+--------+
| name   |
+--------+
| HELLO  |
| DELETE |
+--------+
2 rows selected (0.118 seconds)
Could you post the EXPLAIN PLAN of your second query?
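For reference, a quick way to pull that plan from impala-shell is to prefix the statement with EXPLAIN; a sketch, reusing the test table from the question:
-- In impala-shell, EXPLAIN prints the query plan instead of executing the statement
-- (table name "test" taken from the question)
EXPLAIN
WITH temp AS (SELECT * FROM test WHERE name <> 'INSERT')
SELECT * FROM temp;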

Related

Padding the result of a DISTINCT SQLite query

I searched and figured out that I could use either substr with || or a printf statement with format specifiers to add padding to the results, but that doesn't seem to work when I have DISTINCT in the SQLite query.
I've a table called timeLapse that looks like so:
+----+-------+-----------+
| ID | Time  | Status    |
+----+-------+-----------+
| 1  | 0.001 | Initiated |
| 1  | 0.002 | Cranked   |
| 3  | 0.002 | Initiated |
| 2  | 0.002 | Initiated |
| 2  | 0.003 | Cranked   |
+----+-------+-----------+
I could query the distinct IDs with something like SELECT distinct(ID) FROM timeLapse as IDs, which returns this:
+-----+
| IDs |
+-----+
| 1   |
| 2   |
| 3   |
+-----+
However, I would like to pad the resultant distinct rows like so:
+----------+
| IDs      |
+----------+
| Object-1 |
| Object-2 |
| Object-3 |
+----------+
My query SELECT substr('Object-' || DISTINCT(ID), 10, 10) as IDs FROM timeLapse results in an error:
"[17:22:47] Error while executing SQL query on database 'machining': near "distinct": syntax error"
Could someone please help me understand what I am doing wrong here? I am enormously thankful for your time and help.
Get the DISTINCT first, before applying the substr() function:
select substr('Object-' || t1.ID, 1, 10) as IDs
from (SELECT DISTINCT(ID) ID FROM timeLapse) t1
see sqlfiddle
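The underlying issue is that DISTINCT is a modifier of the SELECT clause, not a function, so it cannot be nested inside an expression like substr(). As an alternative sketch (not from the answer above), applying DISTINCT to the already-concatenated value works as well:
-- DISTINCT applies to the whole select expression, so this single-level query
-- also returns one prefixed value per distinct ID
SELECT DISTINCT 'Object-' || ID AS IDs FROM timeLapse;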
All credit to the user named ϻᴇᴛᴀʟ; from their answer I understood that I should move the DISTINCT into a sub-query within this query.
This resolves my problem:
select printf('Object-%s', t1.ID) as IDs
FROM (SELECT DISTINCT(id) ID FROM timeLapse) t1

Selecting only distinct rows based on a column in Knex

I'm using Knex, a pretty nice SQL builder.
I've got a table called Foo which has 3 columns
+--------------+-----------------+
| id           | PK              |
+--------------+-----------------+
| idFoo        | FK (not unique) |
+--------------+-----------------+
| serialNumber | Number          |
+--------------+-----------------+
I'd like to select all rows with idFoo IN (1, 2, 3).
However I'd like to avoid duplicate records based on the same idFoo.
Since that column is not unique there could be many rows with the same idFoo.
A possible solution
The query below will of course return all rows with idFoo IN (1, 2, 3), including duplicates.
// selects rows for the given idFoo values, duplicates included
db.select(
    "id",
    "idFoo",
    "serialNumber"
  )
  .from("foo")
  .whereIn("idFoo", [1, 2, 3])
However, this will return results with duplicated idFoo values, like so:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1  | 2     | 56454        |
| 2  | 3     | 75757        |
| 3  | 3     | 00909        |
| 4  | 1     | 64421        |
+----+-------+--------------+
What I need is this:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1  | 2     | 56454        |
| 3  | 3     | 00909        |
| 4  | 1     | 64421        |
+----+-------+--------------+
I could take the result and filter out the duplicates in JavaScript, but I'd specifically like to avoid that and write this in Knex.
The question is: how can I do this with Knex code?
I know it can be done with plain SQL (perhaps something using GROUP BY), but I'd specifically like to achieve this in "pure" Knex without using raw SQL.
Knex.js supports groupBy natively. You can write:
knex('foo').whereIn('id',
  knex('foo').max('id').groupBy('idFoo')
)
Which is rewritten to the following SQL:
SELECT * FROM foo
WHERE id IN (
  SELECT max(id) FROM foo
  GROUP BY idFoo
)
Note that you need to use the subselect to make sure you won't mix values from different rows within the same group.
In plain SQL you would do it like this:
Perform a self join and look for a row with the same idFoo but a bigger id. If you don't find one, the joined columns are NULL and you know your row has the biggest id in its group.
SELECT t1.id, t1.idFoo, t1.serialNumber
FROM foo as t1
LEFT JOIN foo as t2
ON t1.id < t2.id
AND t1.idFoo = t2.idFoo
WHERE t2.idFoo IS NULL
So check how to do a left join in Knex.js.
EDIT:
Just checked the documentation and built this (not tested):
// keep only rows with no larger id in the same idFoo group
knex.select('t1.*')
  .from('foo as t1')
  .leftJoin('foo as t2', function() {
    this.on('t1.id', '<', 't2.id')
        .andOn('t1.idFoo', '=', 't2.idFoo')
  })
  .whereNull('t2.idFoo')

Aggregate ENTIRE rows based on a single field without querying the source twice or using CTEs?

Assume I have the following table:
+--------+--------+--------+
| field1 | field2 | field3 |
+--------+--------+--------+
| a      | a      | 1      |
| a      | b      | 2      |
| a      | c      | 3      |
| b      | a      | 1      |
| b      | b      | 2      |
| c      | b      | 2      |
| c      | b      | 3      |
+--------+--------+--------+
I want to select only the rows where field3 is the minimum value, so only these rows:
+--------+--------+--------+
| field1 | field2 | field3 |
+--------+--------+--------+
| a      | a      | 1      |
| b      | a      | 1      |
| c      | b      | 2      |
+--------+--------+--------+
The most popular solution is to query the source twice: once directly, and once joined to a subquery in which the source is queried again and aggregated. However, since my data source is actually a derived table/subquery itself, I'd have to duplicate the subquery in my SQL, which is ugly. The other option is to use a WITH CTE and reuse the subquery, which would be nice, but Teradata, the database I am using, doesn't support CTEs in views (though it does in macros, which are not an option for me right now).
So is it possible in standard SQL to group multiple records into a single record using only a single field in the aggregation, without querying the source twice or using a CTE?
This is possible using a window function:
select *
from (
  select column_1, column_2, column_3,
         min(column_3) over (partition by column_1) as min_col_3
  from the_table
) t
where column_3 = min_col_3;
The above is standard SQL and I believe Teradata also supports window functions.
The derived table is necessary because you can't refer to a column alias in the where clause - at least not in standard SQL.
I think Teradata actually allows that using the qualify operator, but as I have never used it, I am not sure:
select *
from the_table
qualify min(column_3) over (partition by column_1) = column_3;
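Since the source here is itself a derived table, the window-function version nests without repeating the subquery; a rough sketch using the question's field names, where the inner SELECT is only a placeholder for your existing derived table:
select *
from (
  select field1, field2, field3,
         min(field3) over (partition by field1) as min_f3
  from (
    -- placeholder: your existing derived table / subquery goes here
    select field1, field2, field3 from some_source
  ) src
) t
where field3 = min_f3;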
Use NOT EXISTS to return a row only if there is no other row with the same field1 value but a lower field3 value:
select *
from table t1
where not exists (select 1
                  from table t2
                  where t2.field1 = t1.field1
                    and t2.field3 < t1.field3)

Stored Procedures - Updating and Inserting

I'm new to using stored procedures. What is the best way to update and insert using a stored procedure? I have two tables that I can match on a distinct ID. I want to update when the ID exists in both my load table and my destination table, and insert when the row does not exist in my destination table. An example template would be very helpful, thanks!
If I understood correctly, you want to select values from one table and insert them into another table; if the id already exists in the second table, you need to update the row instead. If I'm not wrong, you need something like this:
mysql> select * from table_1;
+----+-----------+-----------+
| id | name      | last_name |
+----+-----------+-----------+
| 1  | fagace    | acero     |
| 2  | ratangelo | saleh     |
| 3  | hectorino | josefino  |
+----+-----------+-----------+
3 rows in set (0.00 sec)
mysql> select * from table_2;
+----+-----------+-----------+
| id | name      | last_name |
+----+-----------+-----------+
| 1  | fagace    | acero     |
| 2  | ratangelo | saleh     |
+----+-----------+-----------+
2 rows in set (0.00 sec)
mysql> insert into table_2 select t1.id,t1.name,t1.last_name from table_1 t1 on duplicate key update name=t1.name, last_name=t1.last_name;
Query OK, 1 row affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> select * from table_2;
+----+-----------+-----------+
| id | name      | last_name |
+----+-----------+-----------+
| 1  | fagace    | acero     |
| 2  | ratangelo | saleh     |
| 3  | hectorino | josefino  |
+----+-----------+-----------+
3 rows in set (0.00 sec)
mysql>
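Since the question asks for a stored procedure template, here is a minimal sketch that wraps the statement above in a MySQL procedure; the procedure name is made up, and table_1/table_2 are just the example names from the session above:
DELIMITER //

CREATE PROCEDURE upsert_table_2()
BEGIN
    -- same upsert as above: insert new ids, update rows whose id already exists
    INSERT INTO table_2
    SELECT t1.id, t1.name, t1.last_name
    FROM table_1 t1
    ON DUPLICATE KEY UPDATE
        name = t1.name,
        last_name = t1.last_name;
END //

DELIMITER ;

-- usage:
CALL upsert_table_2();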
You should look up the SQL MERGE statement.
It allows you to perform UPSERTs: INSERT if a key value does not already exist, UPDATE if it does.
http://technet.microsoft.com/en-us/library/bb510625.aspx
However, your requirement to check for a key value in two places before you perform an update does make it more complex. I haven't tried this, but I would think a VIEW or a CTE could be used to establish whether the ID exists in both of your tables, and then base the MERGE on the CTE/VIEW.
But definitely start by looking at MERGE!
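For reference, a minimal sketch of what a MERGE-based upsert could look like in SQL Server; the destination/load_table names and the name/last_name columns below are assumptions, not taken from the question:
MERGE INTO destination AS d
USING load_table AS l
    ON d.id = l.id
WHEN MATCHED THEN
    UPDATE SET d.name = l.name, d.last_name = l.last_name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, name, last_name)
    VALUES (l.id, l.name, l.last_name);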

Google Cloud SQL - "SELECT .. AS .." doesn't change the column name

I'm trying to retrieve some column values from a Google Cloud SQL database, and I want to rename the output column as follows:
select id as tot from mytable
the query result I get is this:
id
value1
value2
value3
...
As you can see, the column name is not changed as I wanted. I also tried
select id as 'tot' from mytable
and the following (which in fact works, but obviously sets every row's value to the literal 'id' instead of the real row value):
select 'id' as 'tot' from mytable
the table schema I'm testing on is as simple as possible:
create table mytable(id varchar(10))
Has anyone encountered this problem before? Am I missing something or doing something wrong?
Thanks in advance for any help, best regards
This works as expected using the command line tool and looks like a bug in the web UI.
Here's what I got using the command line tool (google_sql.sh):
sql> insert into mytable values('1'), ('2'), ('3');
3 row(s) affected.
sql> select * from mytable;
+----+
| id |
+----+
| 1  |
| 2  |
| 3  |
+----+
3 rows in set (0.10 sec)
sql> select id as 'tot' from mytable
-> ;
+-----+
| tot |
+-----+
| 1   |
| 2   |
| 3   |
+-----+
3 rows in set (0.09 sec)
sql> select 'id' as 'tot' from mytable
-> ;
+-----+
| tot |
+-----+
| id  |
| id  |
| id  |
+-----+
3 rows in set (0.09 sec)
I just logged a bug for this: https://code.google.com/p/googlecloudsql/issues/detail?id=60
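As a side note (not part of the answer above), the difference between the last two queries comes down to MySQL quoting rules: single quotes produce a string literal, while backticks (or no quotes at all) denote an identifier. A quick sketch:
-- Backtick-quoted (or unquoted) alias: renames the id column to tot
SELECT id AS `tot` FROM mytable;

-- String literal in the select list: every row returns the constant 'id'
SELECT 'id' AS tot FROM mytable;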