I am new to Hive and Hadoop. For practice, I am trying to join two sample tables in Hive; the tables do not have any primary/foreign key relationship.
The tables are as follows:
Employees table:
id  name    gender  salary  departmentid
1   mark    male    3333    1
2   Steve   male    5464    3
3   Ben     male    3873    2
4   bender  male    9298    1
5   fender  male    654     2
Departments table:
id  name     location
1   IT       NEW YORK
2   HR       LONDON
3   PAYROLL  SYDNEY
hive> select employees.name as employee_name, departments.name as department_name
> from employees
> join departments on departments.id = employees.departmentid;
Result:
Query ID = cloudera_20170911030505_93378edb-f8b8-45d0-9141-3fe065211f3d
Total jobs = 1
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am coming to Hive from SQL. How can I solve this error? Any help would be appreciated.
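Running set hive.auto.convert.join=false; before the query fixed the issue in my case. This setting stops Hive from automatically converting the join into a map join, which is the step that runs the local task (MapredLocalTask) named in the error.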
Query:
select T1.name as employee_name, T2.name as department_name
from employees T1
join departments T2 on T1.departmentid=T2.id;
Hive up to version 0.13 does not support primary keys. They were introduced in later releases of Hive.
So all you need to make sure of is that the join columns exist in both tables. If the join columns contain duplicate values, the join will produce multiple records for each match.
For other use cases, you can try a left outer join, right outer join, or full outer join. Be careful when using cross joins.
If the error still persists, please send more details about the table schema you used. You can run show create table db_name.table_name to view the complete schema.
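To make the point about duplicate join values concrete, here is a minimal sketch that runs the same join on the sample data. It uses Python's built-in sqlite3 as a stand-in for Hive, so it only illustrates the join semantics and says nothing about the MapredLocalTask error itself. The extra 'IT-COPY' department row is a made-up duplicate added purely for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive; only the join semantics are illustrated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, gender TEXT,
                        salary INTEGER, departmentid INTEGER);
CREATE TABLE departments (id INTEGER, name TEXT, location TEXT);
INSERT INTO employees VALUES
  (1, 'mark', 'male', 3333, 1), (2, 'Steve', 'male', 5464, 3),
  (3, 'Ben', 'male', 3873, 2), (4, 'bender', 'male', 9298, 1),
  (5, 'fender', 'male', 654, 2);
INSERT INTO departments VALUES
  (1, 'IT', 'NEW YORK'), (2, 'HR', 'LONDON'), (3, 'PAYROLL', 'SYDNEY');
""")

JOIN_SQL = """
SELECT T1.name AS employee_name, T2.name AS department_name
FROM employees T1
JOIN departments T2 ON T1.departmentid = T2.id
ORDER BY T1.id
"""

# One row per employee while department ids are unique.
rows = conn.execute(JOIN_SQL).fetchall()

# Hypothetical duplicate: a second department row that reuses id 1.
conn.execute("INSERT INTO departments VALUES (1, 'IT-COPY', 'BOSTON')")

# Every employee in department 1 now matches two department rows.
rows_with_duplicate = conn.execute(JOIN_SQL).fetchall()

print(len(rows), len(rows_with_duplicate))
```

With the duplicate in place, each employee in department 1 is returned once per matching department row, which is the "multiple records" effect described above.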
I am having trouble wrapping my head around the on statement when doing a self-join. Let's say we have the following table:
employeeid  name  managerid  salary
1           Mike  3          35000
2           Rob   1          45000
3           Todd  NULL       25000
4           Ben   1          55000
5           Sam   1          65000
I want to perform a self join to return the employee name and their manager's name.
When I perform the following self join I get an incorrect result:
SELECT E.name as Employee,M.name as Manager
FROM tblEmployees E
LEFT JOIN tblEmployees M
ON E.Employeeid=M.managerid
However, when I reverse the columns on the on statement using the query below:
SELECT E.name as Employee,M.name as Manager
FROM tblEmployees E
LEFT JOIN tblEmployees M
ON E.managerid=M.Employeeid
I get the correct answer.
Why? How do I know which columns to select in an on statement?
Here's my explanation:
The table you have is structured with each row representing an employee in the company.
You are interested in determining who is each employee's manager.
You can find that by joining the table to itself, where the lookup values are the manager ids (managerid) on the employee side and the reference column is the employee ids (employeeid) on the manager side.
The first query is wrong because it uses the employeeid column for the lookup values and the managerid column for reference, so it matches each employee to the people they manage rather than to their manager.
To get the manager of each employee, you need to use the managerid column as the lookup column and the employeeid column as the reference column.
Hope that's not too confusing!
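If it helps, here is a small sketch that runs both ON conditions against the sample data so the difference is visible. It uses Python's sqlite3 rather than the asker's database, and the ORDER BY clauses are my addition to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblEmployees (employeeid INTEGER, name TEXT,
                           managerid INTEGER, salary INTEGER);
INSERT INTO tblEmployees VALUES
  (1, 'Mike', 3, 35000), (2, 'Rob', 1, 45000), (3, 'Todd', NULL, 25000),
  (4, 'Ben', 1, 55000), (5, 'Sam', 1, 65000);
""")

# Lookup value = E.managerid, reference column = M.employeeid:
# each employee row picks up the row of its manager.
correct = conn.execute("""
SELECT E.name AS Employee, M.name AS Manager
FROM tblEmployees E
LEFT JOIN tblEmployees M ON E.managerid = M.employeeid
ORDER BY E.employeeid
""").fetchall()

# Reversed condition: each employee row picks up the rows of the people
# it manages, so the "Manager" column actually holds subordinates.
reversed_on = conn.execute("""
SELECT E.name AS Employee, M.name AS Manager
FROM tblEmployees E
LEFT JOIN tblEmployees M ON E.employeeid = M.managerid
ORDER BY E.employeeid, M.employeeid
""").fetchall()

for row in correct:
    print(row)
```

A quick way to choose the ON condition: the alias you treat as "the manager" (M) must be matched on its own key (M.employeeid), and the alias you treat as "the employee" (E) supplies the pointer to it (E.managerid).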
I'm supposed to answer this for class, and it's tricky (for me)
Write a SELECT query to output the name of all employees with the name of their supervisor. If the employee has no supervisor, the supervisor name column should contain the text 'No Supervisor'.
The primary key field in my db is employeeid. Each employee has a first and last name, and each employee also has a supervisorid.
The table for this is shown below (sorry for the layout):
employeeid  lastname  firstname  salary  supervisorid
1           Stolz     Ted        25000   NULL
2           Boswell   Nancy      23000   1
3           Hargett   Vincent    22000   1
4           Weekley   Kevin      22000   3
5           Metts     Geraldine  22000   2
6           McBride   Jeffrey    21000   2
7           Xiong     Jay        20000   3
I was wondering how I could write this without a CASE statement that hard-codes each of the 7 employees, like this:
when concat(firstname,' ',lastname)='Nancy Boswell' then 'Ted Stolz'
In larger tables this would simply be a HUGE statement, is there a better way to do it?
Thanks!
EDIT:
I've now tried this:
SELECT
EMP1.employeeid as 'employee',
EMP2.supervisorid as 'manager'
FROM
employee EMP1
LEFT OUTER JOIN
employee EMP2
ON
emp1.employeeid = emp2.supervisorid;
However, I am seeing duplicate rows: for some reason employees 2 and 3 appear twice, so 9 rows are returned instead of 7.
Also, I need to display their names, not their IDs. Does that mean I need to join the result of this join back to the employee table to get the names? How would I do this?
Thanks for the feedback guys!
You need to join the table to itself on supervisorid. This might seem strange if you are new to SQL, but it is very common. The join tells SQL to attach the supervisor's row to the employee's row by matching the employee's supervisorid against the supervisor's primary key.
SELECT
*
FROM
EMPLOYEES EMP1
LEFT OUTER JOIN
EMPLOYEES EMP2
ON
-- make link between tables here
Note that the above query is deliberately incomplete; it is only an outline. The LEFT OUTER JOIN gives employees without a supervisor NULL values in the supervisor columns; with an inner join, those employees would be left out entirely.
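For reference, one possible way to fill in that outline is sketched below, using Python's sqlite3 so it can be run as-is. The aliases emp and sup, the use of COALESCE for the 'No Supervisor' text, and the || concatenation operator (SQLite syntax; the question used concat()) are my assumptions, not something the assignment prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, lastname TEXT,
                       firstname TEXT, salary INTEGER, supervisorid INTEGER);
INSERT INTO employee VALUES
  (1, 'Stolz', 'Ted', 25000, NULL), (2, 'Boswell', 'Nancy', 23000, 1),
  (3, 'Hargett', 'Vincent', 22000, 1), (4, 'Weekley', 'Kevin', 22000, 3),
  (5, 'Metts', 'Geraldine', 22000, 2), (6, 'McBride', 'Jeffrey', 21000, 2),
  (7, 'Xiong', 'Jay', 20000, 3);
""")

# emp is the employee row, sup is the matching supervisor row (if any).
# When there is no match, sup.* is NULL, the concatenation is NULL,
# and COALESCE substitutes the fallback text.
result = conn.execute("""
SELECT emp.firstname || ' ' || emp.lastname AS employee,
       COALESCE(sup.firstname || ' ' || sup.lastname, 'No Supervisor') AS supervisor
FROM employee emp
LEFT OUTER JOIN employee sup ON sup.employeeid = emp.supervisorid
ORDER BY emp.employeeid
""").fetchall()

for row in result:
    print(row)
```

Because each employee has at most one supervisorid and employeeid is unique, this direction of the join returns exactly one row per employee.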
I have a table in SQL Server 2008 that, for explanation purposes, contains ID, Employee and ManagerID.
eg:
ID Employee ManagerID
1 A NULL
2 B 2
3 C 2
I want to write a query that returns the rows that have no ManagerID, plus the rows where ManagerID is equal to the ID.
The result should look like this,
ID Employee ManagerID
1 A NULL
2 B 2
In essence no managers can be managers of managers.
At first I thought that it would be simple using a SELF Join and an EXCLUDE SQL statement however I cannot get this to work. I would prefer not to USE the EXCLUDE statement as my actual table has more columns and related data that I would like to return.
If you could help, I would be grateful.
select id, employee, managerid
from your_table
where managerid is null
or managerid = id
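As a quick check, here is the same filter run against the three sample rows, using Python's sqlite3 instead of SQL Server 2008. The table name your_table is the placeholder from the query above, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (id INTEGER, employee TEXT, managerid INTEGER);
INSERT INTO your_table VALUES (1, 'A', NULL), (2, 'B', 2), (3, 'C', 2);
""")

# Keep rows with no manager, plus rows that are their own manager.
# "managerid = id" is never true for a NULL managerid, so the
# IS NULL test is needed as a separate condition.
kept = conn.execute("""
SELECT id, employee, managerid
FROM your_table
WHERE managerid IS NULL
   OR managerid = id
ORDER BY id
""").fetchall()

print(kept)
```

No self join is needed here because both conditions can be evaluated from a single row.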
Here is the requirement.
I have two tables, shown below. OrderList is a data table with three fields that store a StaffId, which is a foreign key to the Staff table. Note that some records may not have a value in these fields.
OrderList
OrderId Marketing_Staff_ID Finance_Staff_Id ManagerId
1 STAFF001 STAFF002 STAFF003
2 STAFF005 STAFF003
3 STAFF004 STAFF004 STAFF003
4 STAFF001 STAFF002 STAFF003
5 STAFF001 STAFF007
Staff
Staff_Id Staff_Name
STAFF001 Jack C.K.
STAFF002 William. C
STAFF005 Someone
I want to write a SQL statement that also selects the staff name for each staff ID in every OrderList record. (For records without a staff ID, show N/A in the name field.)
OrderId Mkt StaffID Name Finance StaffId Name ManagerId, Name
1 STAFF001 Jack C.K. STAFF002 William. STAFF003 Chan.Chi
So how can I write the SQL? A left join or something else?
Thank you very much as I am really a beginner in SQL.
You need multiple LEFT JOINs:
select ol.orderid,
ol.marketing_staff_id,
coalesce(ms.staff_name, 'n/a') as marketing_name,
ol.finance_staff_id,
coalesce(fi.staff_name, 'n/a') as finance_name
from orderlist as ol
left join staff as ms on ms.staff_id = ol.marketing_staff_id
left join staff as fi on fi.staff_id = ol.finance_staff_id
I'll leave it up to you to add the join for the manager.
The above is ANSI SQL and should work in every DBMS. There is a slight possibility that your DBMS does not support the coalesce function which simply replaces a NULL value with something different. You will need to check the manual of your DBMS in that case.
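For anyone who wants to see the full pattern, here is a sketch with the manager join added, run through Python's sqlite3. Some of the data is assumed: the STAFF003 'Chan.Chi' row is taken from the expected output in the question rather than from the Staff table, and for order 2 I assumed the finance column is the empty one, since the original layout does not show which column is blank. The alias mg is also mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (staff_id TEXT, staff_name TEXT);
CREATE TABLE orderlist (orderid INTEGER, marketing_staff_id TEXT,
                        finance_staff_id TEXT, managerid TEXT);
INSERT INTO staff VALUES
  ('STAFF001', 'Jack C.K.'), ('STAFF002', 'William. C'),
  ('STAFF005', 'Someone'),
  ('STAFF003', 'Chan.Chi');  -- assumed row, name taken from the expected output
INSERT INTO orderlist VALUES
  (1, 'STAFF001', 'STAFF002', 'STAFF003'),
  (2, 'STAFF005', NULL, 'STAFF003'),  -- assumed: finance id is the missing one
  (3, 'STAFF004', 'STAFF004', 'STAFF003');
""")

# One LEFT JOIN to staff per staff-id column, each under its own alias.
orders = conn.execute("""
SELECT ol.orderid,
       ol.marketing_staff_id, COALESCE(ms.staff_name, 'N/A') AS marketing_name,
       ol.finance_staff_id,   COALESCE(fi.staff_name, 'N/A') AS finance_name,
       ol.managerid,          COALESCE(mg.staff_name, 'N/A') AS manager_name
FROM orderlist AS ol
LEFT JOIN staff AS ms ON ms.staff_id = ol.marketing_staff_id
LEFT JOIN staff AS fi ON fi.staff_id = ol.finance_staff_id
LEFT JOIN staff AS mg ON mg.staff_id = ol.managerid
ORDER BY ol.orderid
""").fetchall()

for row in orders:
    print(row)
```

Both a NULL staff ID (order 2) and an ID with no row in staff (STAFF004 in order 3) end up as 'N/A', because in either case the LEFT JOIN finds no match.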
I have an Access database that has two tables that are related by PK/FK. Unfortunately, the database tables have allowed for duplicate/redundant records and has made the database a bit screwy. I am trying to figure out a SQL statement that will fix the problem.
To better explain the problem and goal, I have created example tables to use as reference:
alt text http://img38.imageshack.us/img38/9243/514201074110am.png
You'll notice there are two tables, a Student table and a TestScore table where StudentID is the PK/FK.
The Student table contains duplicate records for students John, Sally, Tommy, and Suzy. In other words the John's with StudentID's 1 and 5 are the same person, Sally 2 and 6 are the same person, and so on.
The TestScore table relates test scores with a student.
Ignoring how/why the Student table allowed duplicates, etc - The goal I'm trying to accomplish is to update the TestScore table so that it replaces the StudentID's that have been disabled with the corresponding enabled StudentID. So, all StudentID's = 1 (John) will be updated to 5; all StudentID's = 2 (Sally) will be updated to 6, and so on. Here's the resultant TestScore table that I'm shooting for (Notice there is no longer any reference to the disabled StudentID's 1-4):
alt text http://img163.imageshack.us/img163/1954/514201091121am.png
Can you think of a query (compatible with MS Access's JET Engine) that can accomplish this goal? Or, maybe, you can offer some tips/perspectives that will point me in the right direction.
Thanks.
The only way to do this is through a series of queries and temporary tables.
First, I would create the following Make Table query that you would use to create a mapping of the bad StudentID to correct StudentID.
Select S1.StudentId As NewStudentId, S2.StudentId As OldStudentId
Into zzStudentMap
From Student As S1
Inner Join Student As S2
On S2.Name = S1.Name
Where S1.Disabled = False
And S2.StudentId <> S1.StudentId
And S2.Disabled = True
Next, you would use that temporary table to update the TestScore table with the correct StudentID.
Update TestScore
Inner Join zzStudentMap
On zzStudentMap.OldStudentId = TestScore.StudentId
Set StudentId = zzStudentMap.NewStudentId
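Here is a rough, runnable sketch of the same two steps, using Python's sqlite3 rather than Access. Two things differ from the queries above and are my substitutions: SQLite has no Select ... Into, so the mapping table is built with CREATE TABLE ... AS, and it does not accept Update ... Inner Join, so the update uses a correlated subquery instead. The TestScoreId and Score columns and the TestScore rows are made up, since the original screenshots are not available; the Student rows follow the description in the question (IDs 1-4 disabled, 5-8 enabled).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudentId INTEGER, Name TEXT, Disabled INTEGER);
CREATE TABLE TestScore (TestScoreId INTEGER, StudentId INTEGER, Score INTEGER);
INSERT INTO Student VALUES
  (1, 'John', 1), (2, 'Sally', 1), (3, 'Tommy', 1), (4, 'Suzy', 1),
  (5, 'John', 0), (6, 'Sally', 0), (7, 'Tommy', 0), (8, 'Suzy', 0);
-- Hypothetical scores, some pointing at disabled ids.
INSERT INTO TestScore VALUES (1, 1, 90), (2, 5, 85), (3, 2, 70), (4, 7, 60), (5, 4, 88);

-- Step 1: map each disabled id to the enabled id with the same name.
CREATE TABLE zzStudentMap AS
SELECT S1.StudentId AS NewStudentId, S2.StudentId AS OldStudentId
FROM Student AS S1
INNER JOIN Student AS S2 ON S2.Name = S1.Name
WHERE S1.Disabled = 0
  AND S2.StudentId <> S1.StudentId
  AND S2.Disabled = 1;

-- Step 2: repoint scores that still reference a disabled id.
UPDATE TestScore
SET StudentId = (SELECT NewStudentId FROM zzStudentMap
                 WHERE OldStudentId = TestScore.StudentId)
WHERE StudentId IN (SELECT OldStudentId FROM zzStudentMap);
""")

student_map = conn.execute(
    "SELECT OldStudentId, NewStudentId FROM zzStudentMap ORDER BY OldStudentId"
).fetchall()
updated_ids = [row[0] for row in conn.execute(
    "SELECT StudentId FROM TestScore ORDER BY TestScoreId")]

print(student_map)
print(updated_ids)
```

Note that matching on Name alone only works if no two different students share a name; that is an assumption carried over from the queries above.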
The most common technique to identify duplicates in a table is to group by the fields that represent duplicate records:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
25 Brian Smith
In this case we want to remove one of the Brian Smith Records, or in your case, update the ID field so they both have the value of 25 or 1 (completely arbitrary which one to use).
SELECT min(id) AS id, first_name, last_name
FROM example
GROUP BY first_name, last_name
Using min on ID will return:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
If you use max you would get
ID FIRST_NAME LAST_NAME
25 Brian Smith
3 George Smith
I usually use this technique to delete the duplicates, not update them:
DELETE FROM example
WHERE ID NOT IN (SELECT MAX(ID)
                 FROM example
                 GROUP BY first_name, last_name)
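A small sketch of both steps on the three sample rows, using Python's sqlite3 (the original question was about Access, so treat this only as an illustration of the GROUP BY technique). The ORDER BY clauses are added just to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example (id INTEGER, first_name TEXT, last_name TEXT);
INSERT INTO example VALUES
  (1, 'Brian', 'Smith'), (3, 'George', 'Smith'), (25, 'Brian', 'Smith');
""")

# One row per (first_name, last_name) group, keeping the smallest id.
min_rows = conn.execute("""
SELECT MIN(id) AS id, first_name, last_name
FROM example
GROUP BY first_name, last_name
ORDER BY 1
""").fetchall()

# Delete every row that is not the highest id within its name group.
conn.execute("""
DELETE FROM example
WHERE id NOT IN (SELECT MAX(id)
                 FROM example
                 GROUP BY first_name, last_name)
""")
remaining = conn.execute(
    "SELECT id, first_name, last_name FROM example ORDER BY id"
).fetchall()

print(min_rows)
print(remaining)
```

Running the SELECT first is a cheap way to preview which rows will survive before committing to the DELETE.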