Insert row number into table - hive

I am trying to insert a row number into a table. The row_number() function works when performing a select query but the query doesn't work when I use it as part of an INSERT INTO TABLE query. I have also tried via Create Table As Select but I get the same seemingly generic error.
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
Example: This does not work.
INSERT INTO TABLE tablea
SELECT
column1,
column2,
row_number() over (order by column2 desc)
FROM
tableb;
Example: This does work
SELECT
column1,
column2,
row_number() over (order by column2 desc)
FROM
tableb;
Any pointers? Thanks!
EDIT: I'm using Hive 1.1.0 as part of CDH 5.4.8.

I tried what you describe and it works. Here are my HQL statements:
create table tablea (id int, name string);
insert into tablea values (1, 'test1');
insert into tablea values (2, 'test2');
create table tableb (id int, name string, row_num int);
insert into tableb select id, name, row_number() over ( order by name desc) from tablea;
select * from tableb;
outcome
+------------+--------------+-----------------+--+
| tableb.id | tableb.name | tableb.row_num |
+------------+--------------+-----------------+--+
| 2 | test2 | 1 |
| 1 | test1 | 2 |
+------------+--------------+-----------------+--+

OK, it looks like this was because the table's storage format was ORC. After setting the table to TEXTFILE, the problem goes away.


SQL Server : SELECT ID having only a single condition

I have a patients table with details such as the conditions each patient has. From the table below I want to select the patients and claims that have ONLY a single condition, 'Hypertension'. For example, Patient B is the expected output. Patient A will not be selected because he claimed for multiple conditions.
+----+---------+--------------+
| ID | ClaimID | Condition |
+----+---------+--------------+
| A | 14234 | Hypertension |
| A | 14234 | Diabetes |
| A | 63947 | Diabetes |
| B | 23853 | Hypertension |
+----+---------+--------------+
I tried using the NOT IN condition as below, but it doesn't seem to help:
SELECT ID, ClaimID, Condition
FROM myTable
WHERE Condition IN ('Hypertension')
AND Condition NOT IN ('Diabetes')
One method uses not exists:
select t.*
from mytable t
where t.condition = 'Hypertension' and
not exists (select 1
from mytable t2
where t2.id = t.id and t2.condition <> t.condition
);
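To see why the row-by-row filter in the question lets patient A through while NOT EXISTS does not, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module rather than SQL Server, and the table name mytable and its column names are taken from the queries above purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id TEXT, claimid INTEGER, condition TEXT);
    INSERT INTO mytable VALUES
        ('A', 14234, 'Hypertension'),
        ('A', 14234, 'Diabetes'),
        ('A', 63947, 'Diabetes'),
        ('B', 23853, 'Hypertension');
""")

# Filter from the question: each row is tested on its own,
# so patient A's Hypertension row still passes.
row_filter = conn.execute("""
    SELECT id, claimid, condition FROM mytable
    WHERE condition IN ('Hypertension') AND condition NOT IN ('Diabetes')
    ORDER BY id
""").fetchall()

# NOT EXISTS checks the patient's other rows before accepting a row.
only_hypertension = conn.execute("""
    SELECT t.id, t.claimid, t.condition FROM mytable t
    WHERE t.condition = 'Hypertension'
      AND NOT EXISTS (SELECT 1 FROM mytable t2
                      WHERE t2.id = t.id AND t2.condition <> t.condition)
""").fetchall()

print(row_filter)
print(only_hypertension)
```

The first query returns a row for both A and B; the second returns only B's row.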
Or you can do it like this:
select
id,
claim_id,
condition
from
patient
where
id in
(
select
id
from
patient
group by
id having count (distinct condition) = 1
);
Result:
id claim_id condition
-- ----------- ----------------
B 23853 Hypertension
(1 rows affected)
Setup:
create table patient
(
id varchar(1),
claim_id int,
condition varchar(16)
);
insert into patient (id, claim_id, condition) values ('A', 14234, 'Hypertension');
insert into patient (id, claim_id, condition) values ('A', 14234, 'Diabetes');
insert into patient (id, claim_id, condition) values ('A', 63947, 'Diabetes');
insert into patient (id, claim_id, condition) values ('B', 23853, 'Hypertension');
You can do this with a CTE.
I set up this CTE with two parameters, one being the Condition you seek, and the other being the max number of combined conditions to find (in your case 1).
DECLARE @myTable TABLE (Id VARCHAR(1), ClaimID INT, Condition VARCHAR(100))
INSERT INTO @myTable (Id, ClaimID, Condition)
SELECT 'A',14234,'Hypertension' UNION ALL
SELECT 'A',14234,'Diabetes' UNION ALL
SELECT 'A',63947,'Diabetes' UNION ALL
SELECT 'B',23853,'Hypertension'
DECLARE @Condition VARCHAR(100)
DECLARE @MaxConditions TINYINT
SET @Condition='Hypertension'
SET @MaxConditions=1
; WITH CTE AS
(
SELECT *, COUNT(2) OVER(PARTITION BY ClaimID) AS CN
FROM @myTable T1
WHERE EXISTS (SELECT 1 FROM @myTable T2 WHERE T1.ClaimID=T2.ClaimID AND T2.Condition=@Condition)
)
SELECT *
FROM CTE
WHERE CN<=@MaxConditions
If you don't care about the fluff and just want all ClaimIDs with just ONE condition, regardless of which condition it is, use this:
DECLARE @myTable TABLE (Id VARCHAR(1), ClaimID INT, Condition VARCHAR(100))
INSERT INTO @myTable (Id, ClaimID, Condition)
SELECT 'A',14234,'Hypertension' UNION ALL
SELECT 'A',14234,'Diabetes' UNION ALL
SELECT 'A',63947,'Diabetes' UNION ALL
SELECT 'B',23853,'Hypertension'
DECLARE @MaxConditions TINYINT
SET @MaxConditions=1
; WITH CTE AS
(
SELECT *, COUNT(2) OVER(PARTITION BY ClaimID) AS CN
FROM @myTable T1
)
SELECT *
FROM CTE
WHERE CN<=@MaxConditions
Here is one method using a HAVING clause:
SELECT t.*
FROM mytable t
WHERE EXISTS (SELECT 1
FROM mytable t2
WHERE t2.id = t.id
HAVING Count(CASE WHEN condition = 'Hypertension' THEN 1 END) > 0
AND Count(CASE WHEN condition != 'Hypertension' THEN 1 END) = 0)
And yet a couple of other ways to do this:
declare @TableA table(Id char,
ClaimId int,
Condition varchar(250));
insert into @TableA (id, claimid, condition)
values ('A', 14234, 'Hypertension'),
('A', 14234, 'Diabetes'),
('A', 63947, 'Diabetes'),
('B', 23853, 'Hypertension')
select id, claimid, condition
from @TableA a
where not exists(select id
from @TableA b
where a.id = b.id
group by b.id
having count(b.id) > 1)
OR
;with cte as
(
select id, claimid, condition
from @TableA
)
,
cte2 as
(
Select id, count(Id) as counts
from cte
group by id
having count(id) < 2
)
Select cte.id, claimid, condition
From cte
inner join
cte2
on cte.id = cte2.id
I decided to revise my answer into an appropriate one.
A simple solution to your question is to count the rows instead of the ID values (since it's not an integer).
Here is a simple introduction:
SELECT
ID
FROM
@PatientTable
GROUP BY
ID
HAVING
ID = ID AND COUNT(*) = 1
This returns the ID B:
+----+
| ID |
+----+
| B |
+----+
Of course, this is not enough, as you may be working with a large data set and need more filtering.
So we will use it as a sub-query.
Using it as a sub-query is simple:
SELECT
ID,
ClaimID,
Condition
FROM
@PatientTable
WHERE
ID = (SELECT ID AS NumberOfClaims FROM @PatientTable GROUP BY ID HAVING ID = ID AND COUNT(*) = 1)
This will return
+----+---------+--------------+
| ID | ClaimID | Condition |
+----+---------+--------------+
| B | 23853 | Hypertension |
+----+---------+--------------+
So far so good, but there is another issue we may face. Say several patients each have a single claim: the query as written then fails, because a subquery compared with = may return at most one value. To show all matching patients, use IN rather than = in the WHERE clause:
WHERE
ID IN (SELECT ID AS NumberOfClaims FROM @PatientTable GROUP BY ID HAVING ID = ID AND COUNT(*) = 1)
This will list all patients that fall under this condition.
If you need more conditions to filter, you just add them to the WHERE clause and you'll be good to go.
SELECT customer_id, sum(ct)
FROM (SELECT customer_id, CASE WHEN category = 'X' THEN 0 else 1
end ct
FROM MASTER_TABLE
) AS t1
GROUP BY customer_id
HAVING sum(ct) = 0
A customer_id whose sum(ct) is 1 or more has at least one category other than 'X'; the HAVING clause keeps only those with none.
Use a join instead of a subquery. Joins often perform as well as or better than subqueries, although this depends on the optimizer and the data. You can use the query below.
SELECT T1.id, T1.claimid, T1.Condition
FROM mytable T1
INNER JOIN
(
select id, count(Condition) counter
from mytable
group by id HAVING COUNT(DISTINCT CONDITION)=1
) T2 ON T1.ID=T2.ID
WHERE T2.counter=1

Output the results of several SELECT statements to an excel sheet in their own columns

I have a query that I want to turn into a stored proc which has, right now, about 6 select statements in it of similar data. Each one just brings back phone numbers in one column except each of the columns is named differently.
Basically it is:
SELECT PhoneNumber as PhoneGroup1 FROM PhoneNumberTable
SELECT PhoneNumber as PhoneGroup2 FROM PhoneNumberTable
SELECT PhoneNumber as PhoneGroup3 FROM PhoneNumberTable
It is actually more complex than that, but those are the results I get in a nutshell.
I then will go and copy/paste each column and its header name into a spreadsheet into Column A for PhoneGroup1, Column B for PhoneGroup2, etc.
PhoneGroup1 | PhoneGroup2 | PhoneGroup3
4856562281 | 9498675309 | 6238471273
7452837719 | 5739542855 | 4745856147
8472639273 | 6495232247 | 9516538847
Is there any way I can have this export to an excel sheet?
Thank you guys for any guidance!
I think I understand what you're trying to do. Do you have something like this:
declare @tbl1 table ( id int )
declare @tbl2 table ( id int )
insert into @tbl1 values(1),(2),(3)
insert into @tbl2 values(10),(20),(30)
select * from @tbl1
union
select * from @tbl2
which returns this result set:
id
----
1
2
3
10
20
30
but you really want this result set?
id1 id2
---- ----
1 10
2 20
3 30
I can see a way to do this using row numbers. Basically, you give each row returned from the individual tables a row number, and then you join the tables together matching on the row numbers. It looks like this in my example:
declare @tbl1 table ( id int )
declare @tbl2 table ( id int )
insert into @tbl1 values(1),(2),(3)
insert into @tbl2 values(10),(20),(30)
select t1.id as id1, t2.id as id2
from
(
select 'table1' as header, id, row_number() over (order by id) rnum
from @tbl1 t1
) t1
inner join
(
select 'table2' as header, id, row_number() over (order by id) rnum
from @tbl2 t2
) t2 on t1.rnum = t2.rnum
To add a column, add another join to the query. If your tables have different numbers of rows and you want to see all rows, use left or full outer joins instead of inner joins.
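As a rough illustration of the unequal-length case, the sketch below pairs a three-row table with a two-row table by row number. It assumes SQLite 3.25 or later (for window functions) through Python's sqlite3 module instead of SQL Server, and it uses a LEFT JOIN from the longer table; the table names tbl1 and tbl2 mirror the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (id INTEGER);
    CREATE TABLE tbl2 (id INTEGER);
    INSERT INTO tbl1 VALUES (1), (2), (3);
    INSERT INTO tbl2 VALUES (10), (20);
""")

# Number the rows of each table, then match row 1 to row 1, row 2 to row 2, ...
# The LEFT JOIN keeps rows of tbl1 that have no partner in tbl2.
paired = conn.execute("""
    SELECT t1.id AS id1, t2.id AS id2
    FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rnum FROM tbl1) t1
    LEFT JOIN
         (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rnum FROM tbl2) t2
      ON t1.rnum = t2.rnum
    ORDER BY t1.rnum
""").fetchall()

print(paired)
```

The unmatched third row of tbl1 comes back with None (NULL) in the second column. A LEFT JOIN only preserves extra rows on the left side, so when either table may be the longer one a full outer join is the closer fit.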

is it a bug in SQL Server 2008?

create table Mytable1
(ID int,
Fname varchar(50)
)
create table Mytable2
(ID int,
Lname varchar(50)
)
insert into Mytable1 (ID,Fname)
values (1,'you')
insert into Mytable1 (ID,Fname)
values (2,'Tou')
insert into Mytable1 (ID,Fname)
values (3,'Nou')
insert into Mytable2 (ID,Lname)
values (1,'you2')
The field Fname does not exist in table Mytable2, but the following query still returns a result:
select * from Mytable1 where Fname in (select Fname from Mytable2)
Note: I use SQL Server 2008, and the result is all rows of table Mytable1.
Is it a bug in SQL?
No, it's not a bug.
You can see what's happening a bit more clearly if you add table aliases to the fields used throughout the query:
select * from Mytable1 mt1
where mt1.Fname in (select mt1.Fname from Mytable2 mt2)
- ie. the subquery is referencing (and returning) values from the main query.
If you change the query to:
select * from Mytable1 mt1
where mt1.Fname in (select mt2.Fname from Mytable2 mt2)
- you get an error.
(SQLFiddle here)
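The same name-resolution behaviour can be reproduced outside SQL Server. The following is a small sketch that assumes SQLite through Python's sqlite3 module, using the tables from the question; the exact error text differs between engines, so the sketch only records that an error occurred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Mytable1 (ID INTEGER, Fname TEXT);
    CREATE TABLE Mytable2 (ID INTEGER, Lname TEXT);
    INSERT INTO Mytable1 VALUES (1, 'you'), (2, 'Tou'), (3, 'Nou');
    INSERT INTO Mytable2 VALUES (1, 'you2');
""")

# Unqualified Fname is not found in Mytable2, so it binds to the outer Mytable1.
rows = conn.execute(
    "SELECT ID, Fname FROM Mytable1 "
    "WHERE Fname IN (SELECT Fname FROM Mytable2) ORDER BY ID"
).fetchall()

# Qualifying the column with the inner alias removes the fallback and fails.
try:
    conn.execute(
        "SELECT * FROM Mytable1 mt1 "
        "WHERE mt1.Fname IN (SELECT mt2.Fname FROM Mytable2 mt2)"
    )
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(rows)
print(qualified_error)
```

Qualifying every column with its table alias is a cheap way to turn this kind of silent mistake into an error.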
No, this is not a bug: http://bugs.mysql.com/bug.php?id=26801
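Apparently, an unqualified column in the subquery that is not found in the inner table is resolved against the outer table. Here is the same effect in MySQL with the tables swapped, where Lname resolves to the outer Mytable2: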
mysql> select *, (select Lname from Mytable1 limit 1) from Mytable2 where Lname in (select Lname from Mytable1 );
+------+-------+--------------------------------------+
| ID | Lname | (select Lname from Mytable1 limit 1) |
+------+-------+--------------------------------------+
| 1 | you2 | you2 |
+------+-------+--------------------------------------+
1 row in set (0.01 sec)

Checking for duplicate data in SQL Server

Please don't ask me why but there is a lot of duplicate data where every field is duplicated.
For example
alex, 1
alex, 1
liza, 32
hary, 34
I will need to eliminate from this table one of the alex, 1 rows
I know this algorithm will be very inefficient, but it does not matter. I will need to remove duplicate data.
What is the best way to do this? Please keep in mind I do not have 2 fields, I actually have about 10 fields to check on.
As you said, yes this will be very inefficient, but you can try something like
DECLARE @TestTable TABLE(
Name VARCHAR(20),
SomeVal INT
)
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'liza', 32
INSERT INTO @TestTable SELECT 'hary', 34
SELECT *
FROM @TestTable
;WITH DuplicateVals AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) RowID
FROM @TestTable
)
DELETE FROM DuplicateVals WHERE RowID > 1
SELECT *
FROM @TestTable
I understand this does not answer the specific question (eliminating dupes in the SAME table), but I'm offering this solution because it is very fast and might work best for the author.
A speedy solution, if you don't mind creating a new table: create a new table named NewTable with the same schema.
Then execute this SQL:
Insert into NewTable
Select
name,
num
from
OldTable
group by
name,
num
Just include every field name in both the select and group by clauses.
Method A. You can get a deduped version of your data using
SELECT field1, field2, ...
INTO Deduped
FROM Source
GROUP BY field1, field2, ...
for example, for your sample data,
SELECT name, number
FROM Source
GROUP BY name, number
yields
alex 1
hary 34
liza 32
Then simply drop the old table and rename the new one. Of course, there are a number of fancy in-place solutions, but this is the clearest way to do it.
Method B. An in-place method is to create a primary key and delete duplicates that way. For example, you can
ALTER TABLE Source ADD sid INT IDENTITY(1,1);
which makes Source look like this
alex 1 1
alex 1 2
liza 32 3
hary 34 4
then you can use
DELETE FROM Source
WHERE sid NOT IN
(SELECT MIN(sid)
FROM Source
GROUP BY name, number)
which will give the desired result. Of course, "NOT IN" is not exactly the most efficient, but it will do the job. Alternatively, you can LEFT JOIN the grouped table (maybe stored in a TEMP table), and do the DELETE that way.
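Method B can be tried end to end with a small sketch. It assumes SQLite through Python's sqlite3 module, where an INTEGER PRIMARY KEY AUTOINCREMENT column stands in for the SQL Server IDENTITY column; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- sid plays the role of the IDENTITY column added in Method B.
    CREATE TABLE Source (sid INTEGER PRIMARY KEY AUTOINCREMENT,
                         name TEXT, number INTEGER);
    INSERT INTO Source (name, number) VALUES
        ('alex', 1), ('alex', 1), ('liza', 32), ('hary', 34);
""")

# Keep the lowest sid of every (name, number) group and delete the rest.
deleted = conn.execute("""
    DELETE FROM Source
    WHERE sid NOT IN (SELECT MIN(sid) FROM Source GROUP BY name, number)
""").rowcount

remaining = conn.execute(
    "SELECT sid, name, number FROM Source ORDER BY sid"
).fetchall()

print(deleted)
print(remaining)
```

With about ten fields to compare, every one of them would go into the GROUP BY list.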
create table DuplicateTable(name varchar(10), number int)
insert DuplicateTable
values
('alex', 1),
('alex', 1),
('liza', 32),
('hary', 34);
with cte
as
(
select *, row_number() over(partition by name, number order by name) RowNumber
from DuplicateTable
)
delete cte
where RowNumber > 1
A slightly different solution, which requires a primary key (or unique index):
Suppose you have a table your_table (id as PK, name, and num):
DELETE
FROM your_table
FROM your_table AS t2
WHERE
(select COUNT(*) FROM your_table y
where t2.name = y.name and t2.num = y.num) >1
AND t2.id !=
(SELECT top 1 id FROM your_table z
WHERE t2.name = z.name and t2.num = z.num);
I assumed that name and num are NOT NULL. If they can contain NULL values, you need to change the WHERE clauses in the sub-queries.

How do I Populate a 2-Column table with Unrelated Data from 2 Different Sources?

I have 2 tables, each with an identity column. What I want to do is populate a new 2-column table with those identities so that it results in a pairing of the identities.
Now, I am perfectly able to populate one column of my new table with the identities from one of the tables, but can't get the identities from the other table into my new table. If this isn't the best 1st step to take though, please let me know.
Thank you
You may want to try something like the following:
INSERT INTO t3 (id, value_1, value_2)
SELECT t1.id, t1.value, t2.value
FROM t1
JOIN t2 ON (t2.id = t1.id);
Test case (MySQL):
CREATE TABLE t1 (id int, value int);
CREATE TABLE t2 (id int, value int);
CREATE TABLE t3 (id int, value_1 int, value_2 int);
INSERT INTO t1 VALUES (1, 100);
INSERT INTO t1 VALUES (2, 200);
INSERT INTO t1 VALUES (3, 300);
INSERT INTO t2 VALUES (1, 10);
INSERT INTO t2 VALUES (2, 20);
INSERT INTO t2 VALUES (3, 30);
Result:
SELECT * FROM t3;
+------+---------+---------+
| id | value_1 | value_2 |
+------+---------+---------+
| 1 | 100 | 10 |
| 2 | 200 | 20 |
| 3 | 300 | 30 |
+------+---------+---------+
3 rows in set (0.00 sec)
You can populate a table with the INSERT...SELECT syntax, and the SELECT can be the result of a join between two (or more) tables.
INSERT INTO NewTable (col1, col2)
SELECT a.col1, b.col2
FROM a JOIN b ON ...conditions...;
So if you can express the pairing as a SELECT, you can insert it into your table.
If the two tables are unrelated and there's no way to express the pairing, then you're asking how to make a non-relational data store, and there are no relational rules for that.
An option would be to create a counter for each of the columns that would operate as a unique identifier and then join on the counter.
For SQL Server this would work:
SELECT one.column1, two.column2
FROM (SELECT RANK() OVER (ORDER BY column1) AS id,
column1
FROM table1) one
LEFT JOIN (SELECT RANK() OVER (ORDER BY column2) AS id,
column2
FROM table2) two ON one.id = two.id
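One caveat with using RANK() as the counter: tied values share a rank, so the generated id is only unique when the ordered column has no duplicates (true for identity columns, not in general). The sketch below compares RANK() with ROW_NUMBER() on a column with a repeated value. It assumes SQLite 3.25 or later through Python's sqlite3 module, and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 INTEGER);
    INSERT INTO table1 VALUES (5), (5), (7);  -- hypothetical data with a tie
""")

# RANK() repeats a number for ties and then skips; ROW_NUMBER() never repeats.
ranks = conn.execute("""
    SELECT column1,
           RANK()       OVER (ORDER BY column1) AS rnk,
           ROW_NUMBER() OVER (ORDER BY column1) AS rn
    FROM table1
    ORDER BY rn
""").fetchall()

print(ranks)
```

Here the two rows with value 5 both get rank 1 and the next rank is 3, while ROW_NUMBER() yields 1, 2, 3. If the columns being paired can contain duplicates, ROW_NUMBER() is probably the safer join key.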