How to convert a tuple to a string in Pig? - apache-pig

I have data as
id company
1 (a,b)
2 (a,c)
3 (f,g,h)
company is a tuple; I generate it with BagToTuple(sortedbag.company) AS company.
I would like to remove the tuple formatting so that the data looks like the following:
id company
1 a b
2 a c
3 f g h
I would like the company column to have no brackets and its values to be separated by spaces. Thanks.
===================update
I have the data set as
id company
1 a
1 b
1 a
2 c
2 a
I wrote the following code:
record = load....
grp = GROUP record BY id;
newdata = FOREACH grp GENERATE group AS id,
COUNT(record) AS counts,
BagToTuple(record.company) AS company;
The output looks like:
id count company
1 3 (a,b,a)
2 2 (c,a)
But I would like the company values to be sorted and distinct, with no brackets, and separated by spaces.
The result I expect is the following:
id count company
1 3 a b
2 2 a c

I think you can just replace BagToTuple with BagToString in the last step:
newdata2 = FOREACH grp
GENERATE group AS id, COUNT(record) as counts,
BagToString(record.company, ' ') as company:chararray;
STORE newdata2 INTO 'outdir2' USING PigStorage('#');
After the script runs:
$ cat outdir2/part-r-00000
1#3#a b a
2#2#a c
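The output above still contains a duplicate for id 1 (a b a), while the question asks for sorted, distinct values. As a way to check the expected result outside Pig, here is a minimal Python sketch of that logic. The names records and summarize are hypothetical, and this is not a Pig script; in Pig the same effect would typically come from DISTINCT and ORDER inside a nested FOREACH, which the answer does not show.

```python
from collections import defaultdict

# Sample rows from the question: (id, company)
records = [(1, "a"), (1, "b"), (1, "a"), (2, "c"), (2, "a")]

def summarize(rows):
    """Group by id; return (id, row count, sorted distinct companies joined by spaces)."""
    groups = defaultdict(list)
    for row_id, company in rows:
        groups[row_id].append(company)
    return [
        # The count uses all rows; the string uses only the distinct values, sorted.
        (row_id, len(companies), " ".join(sorted(set(companies))))
        for row_id, companies in sorted(groups.items())
    ]

result = summarize(records)
for row in result:
    print(*row)
```

Note that the count stays at 3 for id 1 because it counts rows before deduplication, which matches the expected result in the question.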

To convert a general tuple to a string without writing a UDF, you can use BagToString(TOBAG(your_tuple)).

You can use the built-in FLATTEN operator: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Flatten+Operator

Related

SQL Update table with all corresponding data from another table concatenated?

Running PostgreSQL 12.4. I am trying to accomplish this but the syntax given there doesn't seem to be working on psql, and I could not find another approach.
I have the following data:
Table 1
ID Trait
1 X
1 Y
1 Z
2 A
2 B
Table 2
ID Traits, Listed
1
2
3
4
I would like to create the following result:
Table 2
ID Traits, Listed
1 X + Y + Z
2 A + B
Concatenating with + would be ideal, as some traits have inherent commas.
Thank you for any help!
Try something like:
update table2
SET traits = agg.t
FROM
(select id,string_agg(trait, ',') t from table1 group by id) agg
where
table2.id = agg.id;
Regarding "Concatenating with + would be ideal, as some traits have inherent commas":
you can use whatever delimiter you like; just change the second argument to string_agg, for example string_agg(trait, ' + ').
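To try the idea without a PostgreSQL server, the following sketch uses Python's built-in sqlite3 module. It is an approximation under stated assumptions: SQLite's group_concat stands in for PostgreSQL's string_agg, a correlated subquery replaces UPDATE ... FROM, and the column name traits is assumed from the answer's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, trait TEXT);
    CREATE TABLE table2 (id INTEGER, traits TEXT);
    INSERT INTO table1 VALUES (1,'X'),(1,'Y'),(1,'Z'),(2,'A'),(2,'B');
    INSERT INTO table2 (id) VALUES (1),(2),(3),(4);
""")

# group_concat(trait, ' + ') is the SQLite counterpart of string_agg(trait, ' + ').
# Ids without traits (3 and 4) end up with NULL in this variant.
conn.execute("""
    UPDATE table2
    SET traits = (
        SELECT group_concat(trait, ' + ')
        FROM table1
        WHERE table1.id = table2.id
    )
""")

rows = conn.execute("SELECT id, traits FROM table2 ORDER BY id").fetchall()
for row in rows:
    print(row)
```

The order of traits inside each concatenated string is not guaranteed here; in PostgreSQL, string_agg accepts an ORDER BY inside the call if a fixed order matters.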

How to Take Average of Data Within a Column to create new Variable

I have a simple task which, being a novice coder, I cannot wrap my head around.
I have a data set which I am trying to manipulate.
It looks like this:
UniqueID Day Var AverageVar
1 1 X
1 2 Y
1 3 Z
2 1 A
2 2 B
2 3 C
I would like to create this new "AverageVar" variable which computes an average across the three days for each unique ID.
So, for example, for the first three rows I would like AverageVar to display (X + Y + Z)/3. Is there any easy code for this in SQL or R?
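-- Compute the per-ID average once and store it in a helper table.
SELECT * INTO newtable
FROM
(SELECT UniqueID, AVG(Var) AS AverageVar
FROM oldtable
GROUP BY UniqueID) AS averages;
-- Join the averages back onto the original rows.
SELECT O.UniqueID, O.Day, O.Var, N.AverageVar
FROM oldtable O
INNER JOIN
newtable N
ON O.UniqueID = N.UniqueID;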
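The same result can be sketched in a single query by joining against the aggregate directly, without a helper table. The example below runs it through Python's sqlite3 module; the numeric values are invented for illustration (the question only gives placeholders X, Y, Z), and the table name oldtable is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oldtable (UniqueID INTEGER, Day INTEGER, Var REAL);
    -- Invented sample values standing in for X, Y, Z and A, B, C
    INSERT INTO oldtable VALUES
        (1, 1, 3.0), (1, 2, 6.0), (1, 3, 9.0),
        (2, 1, 1.0), (2, 2, 2.0), (2, 3, 6.0);
""")

# Join every row to the average of its own UniqueID group.
rows = conn.execute("""
    SELECT o.UniqueID, o.Day, o.Var, n.AverageVar
    FROM oldtable AS o
    INNER JOIN (
        SELECT UniqueID, AVG(Var) AS AverageVar
        FROM oldtable
        GROUP BY UniqueID
    ) AS n ON o.UniqueID = n.UniqueID
    ORDER BY o.UniqueID, o.Day
""").fetchall()

for row in rows:
    print(row)
```

Each of the three rows for a given UniqueID carries the same AverageVar, which is the layout the question asks for.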

How to Quickly Flatten a SQL Table

I'm using Presto. If I have a table like:
ID CATEGORY VALUE
1 a ...
1 b
1 c
2 a
2 b
3 b
3 d
3 e
3 f
How would you convert to the below without writing a case statement for each combination?
ID A B C D E F
1
2
3
I've never used Presto and the documentation seems pretty thin, but based on this article it looks like you could do
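SELECT
id,
-- element_at returns NULL when an id has no row for that category
element_at(kv, 'a') AS A,
element_at(kv, 'b') AS B,
element_at(kv, 'c') AS C,
element_at(kv, 'd') AS D,
element_at(kv, 'e') AS E,
element_at(kv, 'f') AS F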
FROM (
SELECT id, map_agg(category, value) kv
FROM vtable
GROUP BY id
) t
Although I'd recommend doing this in the display layer if possible since you have to specify the columns. Most reporting tools and UI grids support some sort of dynamic pivoting that will create columns based on the source data.
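If the pivot is done outside the database, as suggested above, the map-per-id approach translates fairly directly. The sketch below is a hypothetical Python illustration: the values are invented (the question elides them), and pivot and columns are names introduced here, not part of Presto.

```python
from collections import defaultdict

# (id, category, value) rows; the values are invented for illustration
rows = [
    (1, "a", 10), (1, "b", 11), (1, "c", 12),
    (2, "a", 20), (2, "b", 21),
    (3, "b", 31), (3, "d", 33), (3, "e", 34), (3, "f", 35),
]
columns = ["a", "b", "c", "d", "e", "f"]

def pivot(rows, columns):
    """Build one category->value map per id, then read the wanted columns from it."""
    per_id = defaultdict(dict)
    for row_id, category, value in rows:
        per_id[row_id][category] = value
    return [
        # dict.get yields None for categories an id does not have
        (row_id, *(kv.get(col) for col in columns))
        for row_id, kv in sorted(per_id.items())
    ]

table = pivot(rows, columns)
for line in table:
    print(line)
```

The column list could also be derived from the data (the distinct categories), which is the part that plain SQL cannot do without knowing the columns in advance.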
My 2 cents:
If you know "possible" values:
SELECT
m['web'] AS web,
m['shopping'] AS shopping,
m['news'] AS news,
m['music'] AS music,
m['images'] AS images,
m['videos'] AS videos,
m[''] AS empty
FROM (
SELECT histogram(data_tab) AS m
FROM datahub
WHERE
year = 2017
AND month = 5
AND day = 7
AND name = 'search'
) searches
No PIVOT function (yet)!

Same entity from different tables/procedures

I have 2 procedures (say A and B). They both return data with a similar set of columns (Id, Name, Count). To be more concrete, example results of the procedures are listed below:
A:
Id Name Count
1 A 10
2 B 11
B:
Id Name Count
1 E 14
2 F 15
3 G 16
4 H 17
The IDs are generated with ROW_NUMBER() because I don't have my own identifiers for these records; they are aggregated values.
In code, I query both results using the same class, NameAndCountView.
And finally, my problem. When I look at the results after executing both procedures sequentially, I get the following:
A:
Id Name Count
1 A 10 ->|
2 B 11 ->|
|
B: |
Id Name Count |
1 A 10 <-|
2 B 11 <-|
3 G 16
4 H 17
As you can see, results in the second set are replaced with the results that have the same IDs in the first set. Presumably the problem occurs because I use the same class for retrieving the data, right?
The question is how to make this work without creating an additional NameAndCountView2-like class.
If possible, and if you don't really mind what the original Id values are, maybe you can try having the first query return even Ids:
ROW_NUMBER() over (order by .... )*2
while the second returns odd Ids:
ROW_NUMBER() over (order by .... )*2+1
This would also allow you to know where the Ids come from.
I guess this generalizes to n queries by having query number i (0 to n-1, as in the example above) select
ROW_NUMBER() over (order by .... )*n+i
Hope this will help
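To see why the multiplied row numbers cannot collide, here is a small Python sketch of the formula row_number * n + i. The function name interleaved_ids and its parameters are hypothetical; the row counts (2 and 4) are taken from the example results in the question.

```python
def interleaved_ids(row_count, query_index, query_count):
    """Ids for one query: row_number * query_count + query_index, row_number starting at 1."""
    return [n * query_count + query_index for n in range(1, row_count + 1)]

ids_a = interleaved_ids(2, 0, 2)  # procedure A: 2 rows, query index 0 (even ids)
ids_b = interleaved_ids(4, 1, 2)  # procedure B: 4 rows, query index 1 (odd ids)

print(ids_a)
print(ids_b)

# Ids from different queries never overlap because they differ modulo query_count.
assert not set(ids_a) & set(ids_b)
```

The originating query can be recovered from an id as id % query_count, which is what lets you tell where each Id came from.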

How to do 'Summarizing' in Pig Latin?

I'm trying to do a summarize operation with Pig.
For example, I have a table called t3:
product price country
A 5 Italy
B 4 USA
C 12 France
A 5 Italy
B 7 Russia
I need to summarize using two keys: product and country.
I concatenate product and country.
Then I have to sum the price values wherever the CONCAT result repeats.
Where the CONCAT result does not repeat, the price stays the same as in the t3 table.
The expected output could be:
CONCAT Price_1
AItaly 10
BUSA 4
CFrance 12
BRussia 7
In Pig I wrote the following script (the code is wrong, but it shows the idea):
t3 = LOAD '/home/Desktop/3_table/3_table.data' AS (product:chararray, price:int, country:chararray);
c1 = FOREACH t3 GENERATE CONCAT(product, country);
c2 = FOREACH t3 GENERATE *, c1;
product_1 = GROUP c2 BY c1;
price_1 = FOREACH product_1 GENERATE group, SUM(product_1.price);
STORE price_1 INTO 'summarise_by_2_ID' USING PigStorage('\t');
Maybe someone can explain how to reach the expected result?
Thanks a lot in advance!
If you want to calculate the sum per product and country you do not need to use the concat function. Just group by those two fields.
A = LOAD 's.txt' USING PigStorage('\t') AS (product:chararray, price:int, country:chararray);
B = GROUP A BY (product, country);
C = FOREACH B GENERATE CONCAT(group.product,group.country), SUM(A.price);
Actually, the CONCAT is not necessary here; it is only there to format the output as expected.
DUMP C;
(AItaly,10)
(BUSA,4)
(BRussia,7)
(CFrance,12)
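For readers who want to verify the numbers outside Pig, the grouping by the pair (product, country) can be sketched in a few lines of Python. The function name sum_by_product_country is hypothetical; the rows are the t3 sample from the question.

```python
from collections import defaultdict

# Rows of t3 from the question: (product, price, country)
t3 = [
    ("A", 5, "Italy"),
    ("B", 4, "USA"),
    ("C", 12, "France"),
    ("A", 5, "Italy"),
    ("B", 7, "Russia"),
]

def sum_by_product_country(rows):
    """Sum prices per (product, country) pair; concatenate the key only for display."""
    totals = defaultdict(int)
    for product, price, country in rows:
        totals[(product, country)] += price
    return {product + country: total for (product, country), total in totals.items()}

summary = sum_by_product_country(t3)
for key, total in summary.items():
    print(key, total)
```

As in the Pig answer, the grouping key is the pair itself, and the concatenation happens only when formatting the output.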