I've been wondering how MapReduce works for Hive. More specifically, I want to understand how the data in a table is turned into key-value pairs.
I have this table with, say, 3 partitions on HDFS
emp_table
+---+---------------+---+----------+
| id| name|age|department|
+---+---------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3|Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5| Lee Thompkins| 31| Forensics|
+---+---------------+---+----------+
and I run this query on it
SELECT id, name, department, count(department) FROM emp_table GROUP BY department;
How would the data be broken down into key/value pairs?
My theory is that the key would be the column name and values would be the, well, values for the particular column.
Key Value
id 1, 2, 3, 4, 5
name James Gordon, Harvey Bullock, Kristen Kringle, Edward Nygma, Lee Thompkins
department Homicide, Homicide, Records, Forensics, Forensics
I haven't found any resources on the net regarding this, so I'm not sure if I'm correct. Could someone help clarify this for me?
Also, please let me know if I have made any incorrect assumptions (which I suspect are many)
Hive's execution engine generates a detailed plan for running the MapReduce jobs. The plan contains details such as
the number of MapReduce jobs
the key-value pairs and join conditions for each map-reduce stage.
Just run the command below at the Hive prompt and walk through the plan to see the key-value pairs used in MapReduce.
explain SELECT id, name, department, count(department) FROM emp_table GROUP BY department;
Also look at EXPLAIN EXTENDED and a sample analysis of the EXPLAIN output.
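For intuition, here is a rough Python sketch of what the map, shuffle, and reduce phases do for a GROUP BY. The key is the value of the GROUP BY expression (the department), not the column name, and the values are the per-row records being aggregated. This is a simplification for illustration, not Hive's actual plan; use EXPLAIN for that.

```python
from collections import defaultdict

# The sample emp_table rows: (id, name, age, department)
rows = [
    (1, "James Gordon", 30, "Homicide"),
    (2, "Harvey Bullock", 35, "Homicide"),
    (3, "Kristen Kringle", 28, "Records"),
    (4, "Edward Nygma", 30, "Forensics"),
    (5, "Lee Thompkins", 31, "Forensics"),
]

# Map phase: each mapper reads rows from its split and emits
# (department, 1) -- the GROUP BY column becomes the key.
mapped = [(dept, 1) for (_id, _name, _age, dept) in rows]

# Shuffle phase: the framework groups all emitted values by key,
# so every value for one key lands on the same reducer.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce phase: each reducer sees one key plus all its values and
# applies the aggregate (here, COUNT(department) == sum of the 1s).
counts = {dept: sum(ones) for dept, ones in shuffled.items()}
print(counts)  # {'Homicide': 2, 'Records': 1, 'Forensics': 2}
```

So the key is per-row data (the grouping value), not a column name mapping to all of a column's values.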
I want to join or update the following two tables, summing up df for words that already exist. So if the word endeavor does not exist in the first table, it should be added with its df value; if the word hallo exists in both tables, the df values should be summed.
FYI I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.
Table name: df
+--------+----+
| word| df|
+--------+----+
|vicinity| 5|
| hallo| 2|
| admire| 3|
| settled| 1|
+--------+----+
Table name: word_list
+----------+---+
|      word| df|
+----------+---+
|     hallo|  1|
|   settled|  1|
|  endeavor|  1|
+----------+---+
So in the end the updated/combined table should look like this:
+----------+---+
|      word| df|
+----------+---+
|  vicinity|  5|
|     hallo|  3|
|    admire|  3|
|   settled|  2|
|  endeavor|  1|
+----------+---+
What I've tried to do so far is the following:
SELECT df.word, df.df + word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
None of them worked: I either got a table of just NULL values, some NULL values, or an exception. I'm sure there must be a simple SQL statement to achieve this, but I've been stuck on it for hours and haven't found anything comparable on Stack Overflow.
You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured, it's very easy. Look at this fiddle. I used MariaDB 10.3 since you didn't specify a version, but these queries should be compliant with just about any DBMS.
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2
select word, sum(df) as df
from (
select * from df
UNION ALL
select * from word_list
) z
group by word
order by sum(df) desc;
UNION is the vertical cousin of JOIN: UNION combines two datasets vertically (row-wise), while JOIN combines them horizontally, by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need UNION ALL here so the union returns all rows, because the default behavior is to return only unique rows. In this dataset, since settled has a value of 1 in both tables, it would appear only once in the UNION without the ALL keyword, so the SUM would give a df of 1 instead of the 2 you are expecting.
The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.
Let me know if this worked for you.
Due to partial duplicates in some of my database, after some LEFT JOINs I wind up with several (but not all) rows where I have partial data, along with NULLs. For a unique user, one row may have a ZIP code, and another row may have the STATE of that same user.
Let me show you an example:
|email |state |zip |
|-----------------|------|------|
|unique#email.com |NULL |40502 |
|unique#email.com |KY |NULL |
|other#email.com |FL |34744 |
|other#email.com |FL |34744 |
|third#email.com |OH |NULL |
Rows that are full duplicates (such as other#email.com in my example) are easy enough to clean up with a GROUP BY clause, and some people, like third#email.com in my example, have NULLs, which is fine. But for unique#email.com I have the state in one row and the ZIP in another. What is the best way to combine those two into one row?
A desired result would be:
|email |state |zip |
|-----------------|------|------|
|unique#email.com |KY |40502 |
|other#email.com |FL |34744 |
|third#email.com |OH |NULL |
For the data you have provided, you can use aggregation:
select email, max(state) as state, max(zip) as zip
from t
group by email;
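Here is that aggregation run against an in-memory SQLite copy of the sample data as a quick check (SQLite standing in for whatever DBMS you're on; the point is that MAX simply ignores NULLs, so it picks up whichever row has a value):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (email TEXT, state TEXT, zip TEXT);
    INSERT INTO t VALUES
        ('unique#email.com', NULL, '40502'),
        ('unique#email.com', 'KY',  NULL),
        ('other#email.com',  'FL',  '34744'),
        ('other#email.com',  'FL',  '34744'),
        ('third#email.com',  'OH',  NULL);
""")

# MAX ignores NULLs, so for each email it returns the non-NULL value
# from whichever row has one (or NULL if no row does, as for third#).
rows = con.execute("""
    SELECT email, MAX(state) AS state, MAX(zip) AS zip
    FROM t
    GROUP BY email
    ORDER BY email
""").fetchall()
print(rows)
```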
That said, you can probably fix this in the query used to generate the data. Also, if you want multiple rows for a given email in the result set, then you should ask a new question with a clearer example of data.
I came upon a problem with source data that I need to solve.
I have a table with an All Employees column that contains several names that I would like to extract and use to fill the other columns. Please see the example below, where All Employees comes from the raw data and I have to fill in all the other columns to the right.
Task|All Employees |Lead Employee1|Lead Employee2|Lead Employee|Reg Employee1|Reg Employee2|Reg Employee
1 |Mark Emily Robert|Mark |Emily |Multiple |Robert |NULL |Robert
2 |Mark Robert |Mark |NULL |Mark |Robert |NULL |Robert
3 |Robert |NULL |NULL |NULL |Robert |NULL |Robert
There's around 50 employees and a small rotation (people come and go).
The easiest solution would be to use several nested IIFs for every group (more or less 20 employees per group). That would mean changing the IIF every time there is a change in the team. I was thinking of streamlining it a bit and use additional table where I could keep track of current and previous employees below.
Team members table
Employee|Position
Mark |Lead Employee
Emily |Lead Employee
Robert |Reg Employee
There should be one employee per group assigned to a task, so I have to keep track of all situations where there are multiples of them (handing a task over to a colleague during vacation, for example).
I don't have a problem getting the data for a group (a simple WHERE clause), but I don't know if there is a way to use some LIKE expression that checks for any occurrence of (for example) a Lead Employee and fills the table with it. I know that filling the second column would be easier, because I would use a similar query and just exclude the already-found employee (replace it with an empty string).
Can you tell me if this is doable (if yes, please give me a hint or direction), or should I stick with nested IIFs?
I have a data source in which I need to return all pairs of events (event1, event2) from a single data source, where field1 from event1 matches field2 from event2.
For example, let's say I have the following data.
I need to return pairs of events where the id field from event1 matches the referrer_id field from event2, to produce the following report.
Adam Anderson referred Betty Burger on 2016-01-02 08:00:00.000
Adam Anderson referred Carol Camp on 2016-01-03 08:00:00.000
Betty Burger referred Darren Dougan on 2016-01-04 08:00:00.000
In sql I can do this quite easily with the following command.
select a.first_name as first1, a.last_name as last1, b.first_name as first2,
b.last_name as last2, b.date as date
from myTable a
inner join myTable b on a.id = b.referrer_id;
which returns exactly the data I need.
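If it helps to see the self-join concretely, here is the same query against an in-memory SQLite table. The ids, referrer_ids, and dates below are reconstructed from the sample report, so treat them as assumed data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE myTable (
        id INTEGER, referrer_id INTEGER,
        first_name TEXT, last_name TEXT, date TEXT
    );
    INSERT INTO myTable VALUES
        (1, NULL, 'Adam',   'Anderson', '2016-01-01 08:00:00.000'),
        (2, 1,    'Betty',  'Burger',   '2016-01-02 08:00:00.000'),
        (3, 1,    'Carol',  'Camp',     '2016-01-03 08:00:00.000'),
        (4, 2,    'Darren', 'Dougan',   '2016-01-04 08:00:00.000');
""")

# Self-join: alias a is the referrer, alias b the person referred.
# Adam matches twice (Betty and Carol), producing one row per pair.
rows = con.execute("""
    SELECT a.first_name, a.last_name, b.first_name, b.last_name, b.date
    FROM myTable a
    INNER JOIN myTable b ON a.id = b.referrer_id
    ORDER BY b.date
""").fetchall()
for first1, last1, first2, last2, date in rows:
    print(f"{first1} {last1} referred {first2} {last2} on {date}")
```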
Now, I've been attempting to replicate this in a splunk query and have run into quite a few issues. First I attempted to use the transaction command, but that aggregated all of the related events together as opposed to matching them a pair at a time.
Next, I attempted to use a subsearch: first finding the id, then searching for the first event by id, and then appending the second event by referral_id. Since append creates a new row instead of adding to the same row, I then used stats to aggregate the resulting rows by the matching id field. I also tried appendcols, but that didn't return anything for me.
...
| table id
| map search="search id=$id$
| fields first_name, last_name, id
| rename first_name as first1
| rename last_name as last1
| rename id as match_id
| append [search $id$
| search referral_id=$id$
| fields first_name, last_name, referral_id, date
| rename first_name as first2
| rename last_name as last2
| rename referral_id as match_id]"
| fields first1, last1, first2, last2, match_id, time
| stats values(first1) as first1, values(last1) as last1, values(first2) as first2,
values(last2) as last2, values(time) as time by match_id
The above query works for me and gives me the table I need, but it is incredibly slow due to the repeated searches over the entire time frame, and also limited by the map maxsearches which, for whatever reason, cannot be set to unlimited.
This seems like an overly complicated solution, especially in comparison to the sql query. Surely there must exist a simpler, faster way that this can be done, which isn't limited by the arbitrary limited settings or the multiple repeating search queries. I would greatly appreciate any help.
I ended up using append. Using join gave me faster results, but didn't return every matching pair; for my example it returned two rows instead of three, pairing Adam with Betty but not Adam with Carol.
Using append returned a full list, and using stats by id gave me the result I was looking for: a full list of each matching pair. It also produced extra empty fields, so I had to remove those and then expand the resulting multivalue fields into their own rows. Splunk doesn't offer a multi-field mvexpand, so I used a workaround.
...
| rename id as matchId, first_name as first1, last_name as last1
| table matchId, first1, last1
| append [
search ...
| rename referrer_id as matchId, first_name as first2, last_name as last2
| table matchId, first2, last2, date]
| stats list(first1) as first1, list(last1) as last1, list(first2) as first2, list(last2) as last2, list(date) as date by matchId
| search first1!=null last1!=null first2!=null last2!=null
| eval zipped=mvzip(mvzip(first2, last2, ","), date, ",")
| mvexpand zipped
| makemv zipped delim=","
| eval first2=mvindex(zipped, 0)
| eval last2=mvindex(zipped, 1)
| eval date=mvindex(zipped, 2)
| fields - zipped
This is faster than using map with multiple subsearches, and gives all of the results. It is still limited by the maximum subsearch size, but at least provides the necessary data.
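The mvzip/mvexpand/makemv step at the end is easier to see outside Splunk. Here is a small Python sketch of the same idea: zip the parallel multivalue fields together, then emit one output row per zipped tuple. The field names match the query above; the sample values are assumed for illustration:

```python
# One aggregated row as it looks after "stats list(...) by matchId":
# the referrer's fields on the left, parallel lists of everyone they
# referred on the right.
row = {
    "matchId": 1,
    "first1": "Adam", "last1": "Anderson",
    "first2": ["Betty", "Carol"],
    "last2": ["Burger", "Camp"],
    "date": ["2016-01-02 08:00:00.000", "2016-01-03 08:00:00.000"],
}

# mvzip + mvexpand + makemv equivalent: zip the parallel lists and
# produce one flat row per (first2, last2, date) tuple.
expanded = [
    {"matchId": row["matchId"], "first1": row["first1"], "last1": row["last1"],
     "first2": f2, "last2": l2, "date": d}
    for f2, l2, d in zip(row["first2"], row["last2"], row["date"])
]
for r in expanded:
    print(r)
```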
So I have a data model which is set up with a table that contains NAME, ID, and CONDITION columns for a series of objects (each object has a unique id number). The rest of the attributes for these objects are contained in columns of several respective tables based on the object type (there are some different attributes associated with each type). All the type-specific tables have an ID column so that the objects can be matched to the master list.
I want to write an sql query that will return information about objects of several different types based on the CONDITION tied to their unique ID.
Here is a simplified example of what I am working with:
object_master_list
| ID | NAME | CONDITION |
-------------------------
|1234| obj1| true|
|0000| obj2| false|
|1236| obj3| true|
|0001| obj4| false|
|5832| obj5| true|
|6698| obj6| false|
|6699| obj7| false|
obj_type_one
| ID | NAME | HEIGHT |
-------------------------
|1234| obj1| o1height|
|0000| obj2| o2height|
|5832| obj5| o5height|
|6699| obj7| o7height|
obj_type_two
| ID | NAME | WEIGHT |
-------------------------
|1236| obj3| o3weight|
|0001| obj4| o4weight|
|6698| obj6| o6weight|
As you can see, there is no correlation between NAME and type or ID and type.
I am currently working in iReport, and I have been using the query designer and editing it manually as necessary.
Right now an example query would look like:
SELECT
object_master_list."NAME" AS NAME,
obj_type_one."HEIGHT" AS HEIGHT,
obj_type_two."WEIGHT" AS WEIGHT
FROM
object_master_list INNER JOIN obj_type_one ON object_master_list."ID" =
obj_type_one."ID"
INNER JOIN obj_type_two ON obj_type_two."ID" = object_master_list."ID"
WHERE
object_master_list."CONDITION" = 'true'
My query is returning no results. From the research I have done on SQL joins, I believe this is what's happening: the chained INNER JOINs keep only rows whose ID appears in the master list and in both type tables, and since each object belongs to exactly one type table, no row survives both joins.
iReport stores and utilizes the values returned from a query row by row, with a field for each column. So ideally I should end up with this:
$F{NAME} which will receive the following values in succession ("obj1", "obj3", "obj5")
$F{HEIGHT} with value series (o1height, null, o5height)
$F{WEIGHT} with value series (null, o3weight, null)
The table representation I suppose would look like this:
| NAME | HEIGHT | WEIGHT |
------------------------------
| obj1| o1height| null|
| obj3| null| o3weight|
| obj5| o5height| null|
My question is how do I accomplish this?
I ran into this on a smaller scale before, so I am aware that I could use subreports or create multiple datasets, but frankly I have a lot of object types and would rather not if I can help it. I am also not allowed to add a TYPE column to the master list.
Thanks in advance for any replies.
You can use LEFT JOIN in the following way:
select o1.name, o2.height, o3.weight
from object_master_list o1 left join obj_type_one o2 on o1.id = o2.id
left join obj_type_two o3 on o1.id = o3.id
where o1.condition = 'true'
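To see why LEFT JOIN fixes it, here is the same query run against an in-memory SQLite copy of the sample tables (SQLite standing in for whatever database iReport is pointed at):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE object_master_list (id TEXT, name TEXT, condition TEXT);
    CREATE TABLE obj_type_one (id TEXT, name TEXT, height TEXT);
    CREATE TABLE obj_type_two (id TEXT, name TEXT, weight TEXT);
    INSERT INTO object_master_list VALUES
        ('1234','obj1','true'), ('0000','obj2','false'),
        ('1236','obj3','true'), ('0001','obj4','false'),
        ('5832','obj5','true'), ('6698','obj6','false'),
        ('6699','obj7','false');
    INSERT INTO obj_type_one VALUES
        ('1234','obj1','o1height'), ('0000','obj2','o2height'),
        ('5832','obj5','o5height'), ('6699','obj7','o7height');
    INSERT INTO obj_type_two VALUES
        ('1236','obj3','o3weight'), ('0001','obj4','o4weight'),
        ('6698','obj6','o6weight');
""")

# LEFT JOIN keeps every master row that passes the WHERE filter and
# fills NULL wherever a type table has no matching id, instead of
# dropping the row the way INNER JOIN does.
rows = con.execute("""
    SELECT o1.name, o2.height, o3.weight
    FROM object_master_list o1
    LEFT JOIN obj_type_one o2 ON o1.id = o2.id
    LEFT JOIN obj_type_two o3 ON o1.id = o3.id
    WHERE o1.condition = 'true'
    ORDER BY o1.name
""").fetchall()
print(rows)
```

Each row comes back with the value from whichever type table the object lives in and NULL for the other, which matches the desired output above.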