Given the result of a GROUP operation, how can I store each nested bag in a folder named after the group? - apache-pig

I have a relation D:
grunt> DESCRIBE D;
D: {i: int,l: chararray}
on which a GROUP is applied:
grunt> G = group D by i;
grunt> illustrate G;
-------------------------------------
| D | i:int | l:chararray |
-------------------------------------
| | 1 | B |
| | 1 | A |
| | 2 | A |
-------------------------------------
-----------------------------------------------------------------------
| G | group:int | D:bag{:tuple(i:int,l:chararray)} |
-----------------------------------------------------------------------
| | 1 | {(1, B), (1, A)} |
| | 2 | {(2, A)} |
-----------------------------------------------------------------------
How can I store each nested bag G.D in a file named after the corresponding group, i.e. /output/1, /output/2?
I understand I can't use a STORE operation inside a FOREACH block. In fact, the following doesn't work:
grunt> foreach G { store D into '/output/' + ((chararray) group) }

The MultiStorage() storage function will do what you want. It is available in the piggybank jar, which you can download from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and add to your classpath.
Example:
input
1,A
1,B
2,A
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (i:int,l:chararray);
B = GROUP A BY i;
STORE B INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
Now the output folder contains two directories named 1 and 2, each holding the records for the corresponding group value.
Output:
output$ ls
1 2 _SUCCESS
Reference:
https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html

Related

Joining two tables with grouped elements in Oracle sql

So I have the following two tables (simplified):
Table 1: FOLDERS
ID | DESC_FOLDER | TEMPLATE_ID
---------------------------------
... | ... | ...
20 | Folder 1 | 52
21 | Folder 2 | 55
... | | ...
Table 2: TEMPLATES
ID | DESC_TEMPLATE | GROUP
-----------------------------
... | ... | ...
51 | Template 1 | abc
52 | Template 2 | abc
53 | Template 3 | abc
54 | Template 4 | abc
55 | Template 5 | NULL
... | ... | ...
The result should be a list with all the templates and their corresponding folder.
Expected Result:
DESC_TEMPLATE | DESC_FOLDER
---------------------------
Template 1 | Folder 1
Template 2 | Folder 1
Template 3 | Folder 1
Template 4 | Folder 1
Template 5 | Folder 2
I have problems with the grouped templates, because only one template of each group is connected to a folder. The following SQL command obviously only returns the templates directly connected to a folder. How can I extend my command to get the desired output?
Select
T.DESC_TEMPLATE,
F.DESC_FOLDER
from
TEMPLATES T,
FOLDERS F
where
T.ID = F.TEMPLATE_ID
Thanks a lot for your help!
I think a window function will solve your problem:
Select T.DESC_TEMPLATE,
       MAX(F.DESC_FOLDER) OVER (PARTITION BY T.GROUP) as DESC_FOLDER
from TEMPLATES T
left join FOLDERS F
on T.ID = F.TEMPLATE_ID;
In old-style Oracle outer-join syntax, the join condition would instead be written as
where T.ID = F.TEMPLATE_ID (+)

Add Column in a Spark Dataframe ,based on a parametric sql query dependent on values of some fields of the dataframe

I have several Spark DataFrames (we can call them table A, table B, etc.).
I want to add a column to table A based on the result of a query against one of the other tables, but the table to query changes from row to row, depending on the value of one of the fields of table A. So this query has to be parametric.
Below I show an example to make the problem clear:
Every table has the column OID and a column TableName with the name of the current table, plus other columns.
This is the fixed query to be performed on Tab A to add new column:
Select $ColumnName from $TableName where OID=$oids
Tab A
| oids|TableName |ColumnName | other fields|New Column: ValueOidDb
================================================================
| 2 | Book | Title | x |result query:harry potter
| 8 | Book | Isbn | y |result query: 556
| 1 | Author | Name | z |result query:Tolkien
| 4 | Category |Description| b |result query: Commedy
Tab Book
| OID |TableName |Title |Isbn |other fields|
================================================================
| 2 | Book |harry potter| 123 | x |
| 8 | Book | hobbit | 556 | y |
| 21 | Book | etc | 8942 | z |
| 5 | Book | etc2 | 984 | b |
Tab Author
| OID |TableName |Name |nationality |other fields|
================================================================
| 5 | Author |J.Rowling | eng | x |
| 2 | Author |Geor. Martin| us | y |
| 1 | Author | Tolkien | eng | z |
| 13 | Author | Dan Brown | us | b |
Tab Category
| OID | TableName |Description |
=====================================
| 12 | Category | Fantasy |
| 4 | Category | Commedy |
| 9 | Category | Thriller |
| 7 | Category | Action |
I tried with this UDF:
def setValueOid = (oid: Int, TableName: String, TableColumn: String) => {
  try {
    sqlContext.sql(s"Select $TableColumn from $TableName where OID = $oid").first().toString()
  }
  catch {
    case x: java.lang.NullPointerException => "error"
  }
}
sqlContext.udf.register("setValueOid", setValueOid)
val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
+ " setValueOid(oid, Table,AttributeDatabaseColumn ) as ValueOidDb"
+ " FROM TAB A")
I put the code in a try/catch because otherwise it throws a NullPointerException, but it still doesn't work: it always falls into the catch and returns the error string.
If I call the function directly, without going through a SQL query, and just pass some manual parameters, it works perfectly:
val try=setValueOid(8,"BOOK","ISBN")
try: String = [0977326403 ]
I read here that it is not possible to run a query inside a UDF:
Trying to execute a spark sql query from a UDF
So how can I solve my problem? I don't know how to make a parametric join. I tried this:
%sql
Select all attributes TAB A,
FROM TAB A as a
join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
on a.Table=b.TableName
but it gave me this exception:
org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
One option:
transform each Book, Author, Category to a form:
root
|-- oid: integer (nullable = false)
|-- tableName: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
For example, the first record in Book:
val book = Seq((2L, "Book",
  Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x")
)).toDF("oid", "tableName", "properties")
+---+---------+---------------------------------------------------------+
|oid|tableName|properties |
+---+---------+---------------------------------------------------------+
|2 |Book |Map(title -> harry potter, Isbn -> 123, other field -> x)|
+---+---------+---------------------------------------------------------+
union Book, Author, Category as properties.
val properties = book.union(author).union(category)
join with the base table:
val comb = properties.join(table, Seq("oid", "tableName"))
use case when ... based on tableName to pull the new column's value out of the properties field; a sketch of this last step follows.
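A minimal sketch of that last step, reusing the comb DataFrame from the join above and assuming Tab A exposes its ColumnName field as a column named columnName. Instead of an explicit case-when chain it uses a small UDF that looks the requested key up in the properties map, which achieves the same thing for this layout:
import org.apache.spark.sql.functions.{col, udf}

// Look up, for each row, the key named in columnName inside the properties map.
// This stands in for a case-when chain over tableName.
val lookup = udf((props: Map[String, String], key: String) => props.getOrElse(key, "not found"))

val result = comb.withColumn("ValueOidDb", lookup(col("properties"), col("columnName")))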

Displaying a pair that have same value in another table

I'm trying to make a query that pairs workers who work at the same place. The relational model I'm asking about looks like this:
Employee(EmNum, name)
Work(FiNum*, EmNum*)
Field(FiNum, Title)
(bold indicates primary key)
Right now my code looks like this:
SELECT work.finum, e1.name,e1.emnum,e2.name,e2.emnum
FROM employee e1
INNER JOIN employee e2
on e1.EmNum = e2.EmNum
INNER JOIN work
on e1.emnum = work.emnum
This gives me a result like:
| finum | name | emnum | name_1 | emnum_1 |
| 1 | a | 1 | a | 1 |
| 1 | b | 2 | b | 2 |
| 2 | c | 3 | c | 3 |
| 3 | d | 4 | d | 4 |
| 3 | e | 5 | e | 5 |
while I want the result to be like
| finum | name | emnum | name_1 | emnum_1 |
| 1 | a | 1 | b | 2 |
| 1 | b | 2 | a | 1 |
| 3 | d | 4 | e | 5 |
| 3 | e | 5 | d | 4 |
I'm quite new at SQL, so I can't really think of a way to do this. Any help or input would be appreciated.
Thanks
Your question is slightly unclear, but my guess is that you're trying to find employees that worked at the same place, i.e. the same finum in work but on a different row. You can do that this way:
SELECT w1.finum, e1.name,e1.emnum, e2.name,e2.emnum
from work w1
join work w2 on w1.finum = w2.finum and w1.emnum != w2.emnum
join employee e1 on e1.emnum = w1.emnum
join employee e2 on e2.emnum = w2.emnum
If you don't want to repeat the records (both 1 <-> 2 and 2 <-> 1), change the != in the join to > or <.
I'm trying to make a query that pairs workers who work at the same place.
Presumably the "places" are represented by the Field table. If you want to pair up employees on that basis, then you should be performing a join conditioned on field numbers being the same, as opposed to one conditioned on employee numbers being the same.
It looks like your main join wants to be a self-join of Work to Work on records with matching FiNum. To get the employee names in the result you will also need to join Employee twice. To avoid employees being paired with themselves, filter those cases out via a WHERE clause, as in the sketch below.
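A minimal sketch of that shape, using the table and column names from the question and filtering out the self-pairs in a WHERE clause:
SELECT w1.FiNum, e1.Name, e1.EmNum, e2.Name, e2.EmNum
FROM Work w1
JOIN Work w2 ON w2.FiNum = w1.FiNum
JOIN Employee e1 ON e1.EmNum = w1.EmNum
JOIN Employee e2 ON e2.EmNum = w2.EmNum
WHERE w1.EmNum <> w2.EmNum;
As in the first answer, changing <> to < keeps only one row per pair.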

How would this get stored in a database?

OK, so I have a name, a time/date, and an array...
So in my database table I store the name and time/date, but what about the array?
This array doesn't have a fixed size... It is an array of tuples (x, y) <- even that is an array.
I want to associate the name and time/date with this array. I heard it's bad practice (and impossible) to store an array directly in a database column. I'm using sqlite3.
How do I solve this problem? Do I just have the name/time-date table point to a newly constructed table for the array?
Just create a data table with the X, Y values, the array index, and a foreign key pointing to a row in an entry table.
Data table:
ID | Index | X | Y | EntryID
0 | 0 | 3.2 | 4.3 | 1
1 | 1 | 2.1 | 1.2 | 1
........
n-1 | n-1 | xn | yn | 1
# The above is from array 1, below from another array
n | 0 | 2.2 | 2.4 | 2
n+1 | 1 | 2.1 | 1.9 | 2
.........
n+m-1 | m-1 | xm | ym | 2
Entry table:
ID | Name | DateTime
1 | user3043594 | 2013-..
2 | Steinar Lima | 2012-..
You store all entries from all arrays in this table and filter them based on entry ID. Then you can do a join to get the name from the entry table. A minimal sqlite3 sketch follows.
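The table and column names below are illustrative, not prescribed by the answer:
CREATE TABLE entry (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    created TEXT NOT NULL            -- date/time, e.g. ISO-8601 text
);

CREATE TABLE point (
    id       INTEGER PRIMARY KEY,
    idx      INTEGER NOT NULL,       -- position of the (x, y) pair within its array
    x        REAL NOT NULL,
    y        REAL NOT NULL,
    entry_id INTEGER NOT NULL REFERENCES entry(id)
);

-- Rebuild one array, in order, together with the owning entry's name:
SELECT e.name, p.idx, p.x, p.y
FROM entry e
JOIN point p ON p.entry_id = e.id
WHERE e.id = 1
ORDER BY p.idx;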

Renaming fields after a JOIN takes time?

In the following code, how much does renaming fields after a join hurt the computation time of the script? Is it optimized in Pig? Or does it really go through every record?
-- tables A: (f1, f2, id) and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B by id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;
Does the FOREACH command go through every record of C? If yes, is there a way to optimize?
Thanks.
Don't worry about optimizing this; there may be a slight overhead in renaming the fields, but it won't trigger an additional Map/Reduce job. The field projection will occur in the reducer after your JOIN.
Consider the two pieces of code and the MapReduce plans given by EXPLAIN below.
Without Renaming
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------
With Renaming
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, -- This
A::f2 as f2, -- section
B::id as id, -- is
B::g1 as g1, -- different
B::g2 as g2; --
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------
The difference is in the Reduce plans. Without renaming:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
versus with renaming:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
In short, there will be other things you can optimize in your script before worrying about renaming. Since you'll be going through every record anyway because of the join, renaming will just be a cheap extra step.