How do I add a column, preserving the existing columns, without listing them all? - apache-pig

I want to add a new column to an alias, preserving all the existing ones.
A = foreach A generate
A.id as id,
A.date as date,
A.foo as foo,
A.bar as bar,
A.foo / A.bar as foobar;
Can I do that without listing all of them explicitly?

Yes. Let's say you have an alias like:
A: {num1:int, num2:int}
and you want to calculate the sum while keeping num1 and num2. You can do this like:
B = FOREACH A GENERATE *, num1 + num2 AS num3:int ;
DESCRIBE B;
B: {num1:int, num2:int, num3:int}
Used like this, the * operator generates all fields.

Related

Python unpacking operator / unpack subquery in separated columns

In Python there is the unpacking operator (*), which lets you take an iterable (tuple, list, generator, etc.) and pass each of its items as a separate argument to a function. I want to do the same thing with a PostgreSQL subquery, but I can't find any information on it.
I want to do something like this:
INSERT INTO tabla1(a, b, c) SELECT a, *(SELECT b, c FROM tabla2 LIMIT 1) FROM tabla3
To avoid doing two almost identical selects and speed up my queries.
I want to AVOID something like this:
INSERT INTO tabla1(a, b, c) SELECT a, (SELECT b FROM tabla2 LIMIT 1), (SELECT c FROM tabla2 LIMIT 1) FROM tabla3
I tried the following:
Reviewing the documentation
Using a WITH statement (it doesn't work for me because I can't relate a query column to the subquery, which I would need to do)
Reading a question on this site (I don't have the link)
My question is: is there something similar in PostgreSQL, or any way to fill multiple columns from a single subquery? For example, something like the WITH statement, where you can write name_of_the_query.column?
Edit
This is the query I wrote using WITH. It is the test query with the real names; I hope it improves the question.
WITH ztabla02 AS (SELECT (CASE WHEN LEFT(maecuent.cuenta, 1) IN ('1','2','3') THEN
array[(SELECT descripcio FROM ztabla02 WHERE c_tabla='PR' AND c_clave=maecuent.cod_pcia), 'ARGENTINA']
ELSE
array['', (SELECT descripcio FROM ztabla02 WHERE c_tabla='PA' AND c_clave=maecuent.cod_pcia)]
END)
AS tuple)
SELECT ztabla02.tuple[1], ztabla02.tuple[2] FROM maecuent
Error:
ERROR: missing FROM-clause entry for table "maecuent"
LINE 1: WITH ztabla02 AS (SELECT (CASE WHEN LEFT(maecuent.cuenta, 1)...
^
You can do it with a CTE:
with t2 as (
SELECT b, c FROM tabla2 LIMIT 1
)
INSERT INTO tabla1(a, b, c)
SELECT t3.a, t2.b, t2.c
FROM tabla3 t3
CROSS JOIN t2
Note that LIMIT without ORDER BY returns an arbitrary row.
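As a rough check of the CTE approach, here is a minimal sketch that runs the same statement against toy data. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made only so the example is self-contained. The sample rows are invented, and an ORDER BY b has been added so that the row picked by LIMIT 1 is deterministic.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tabla1 (a INTEGER, b INTEGER, c INTEGER);
    CREATE TABLE tabla2 (b INTEGER, c INTEGER);
    CREATE TABLE tabla3 (a INTEGER);
    INSERT INTO tabla2 VALUES (20, 200), (10, 100);  -- invented sample rows
    INSERT INTO tabla3 VALUES (1), (2);
""")

# The subquery is written once in the CTE and both of its columns are reused.
# ORDER BY b (not in the original answer) makes the LIMIT 1 row deterministic.
conn.execute("""
    WITH t2 AS (
        SELECT b, c FROM tabla2 ORDER BY b LIMIT 1
    )
    INSERT INTO tabla1 (a, b, c)
    SELECT t3.a, t2.b, t2.c
    FROM tabla3 t3
    CROSS JOIN t2
""")

rows = conn.execute("SELECT a, b, c FROM tabla1 ORDER BY a").fetchall()
print(rows)  # [(1, 10, 100), (2, 10, 100)]
```

Every row of tabla3 is paired with the single row of t2, so both b and c come from one subquery.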

PIG REPLACE with NULL

I have three values A, B and C.
I want to be able to replace the value of C with a NULL value if A AND B have values in their cells.
Unsure where to go. I've tried something like
FOR EACH X GENERATE REPLACE(C, ((A IS NOT NULL AND B IS NOT NULL) ? NULL:C) ;
But unsure if this will work, it doesn't seem right. I don't want to add any more values, just update the value of C?
Maybe something like
FOR EACH X GENERATE (A IS NOT NULL AND B IS NOT NULL) ? NULL:C AS NEW_C;
Then drop C, whilst retaining A, B and NEW_C?
You can simply do:
Y = FOREACH X GENERATE A, B, (A IS NOT NULL AND B IS NOT NULL ? NULL : C) AS C;
There is no need to create NEW_C and then drop C: a FOREACH carries only the fields you name in GENERATE into the new relation (or all of them, if you use GENERATE *).

Oracle: outer join(+) with or clause replacement

I have an enormous select that schematically looks like this:
SELECT c_1, c_2, ..., c_j FROM t_1, t_2, ..., t_k
WHERE e_11 = e_12(+)
AND e_21 = e_22(+)
AND ...
AND e_l1 = e_l2(+)
ORDER BY o
where j, k and l are in the hundreds and e_mn is a column from some table. I need to add new columns A_1 and A_2 to the select from a new table T. The new columns are connected to the existing select via a column, call it B, from a table R. I want the rows where A_1 = B or A_2 = B, plus the rows where no A_i corresponds to the value B.
Suppose I only had to deal with tables T and R then I want this:
SELECT * FROM R
LEFT OUTER JOIN T
ON (A_1 = B OR A_2 = B)
To mimic this behaviour I'd want something like this in the big select:
SELECT c_1, c_2, ..., c_j, A_1, A_2 FROM t_1, t_2, ..., t_k, T
WHERE e_11 = e_12(+)
AND e_21 = e_22(+)
AND ...
AND e_l1 = e_l2(+)
AND (B = A_1(+) OR B = A_2(+))
ORDER BY o
This is, however, syntactically incorrect, since the (+) operator cannot be used with an OR clause. And if I leave out the (+)'s, I lose the rows where there is no A_i corresponding to B.
What are my options here? Can I somehow find a way to do this without changing the whole body of the select? I doubt there is a reasonable way to do this, nevertheless I'd appreciate any help.
Thanks.
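For reference, the two-table behaviour the question wants to reproduce can be checked on toy data. This is only a sketch: it uses Python's sqlite3 module rather than Oracle (an assumption made so it can run standalone), the rows in R and T are invented, and it says nothing about how to fit the join into the large (+)-style select.

```python
import sqlite3

# SQLite stands in for Oracle here (assumption); table contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (B INTEGER);
    CREATE TABLE T (A_1 INTEGER, A_2 INTEGER);
    INSERT INTO R VALUES (1), (2), (3);
    INSERT INTO T VALUES (1, 10), (20, 2);
""")

# ANSI-style outer join with an OR in the join condition.
rows = conn.execute("""
    SELECT R.B, T.A_1, T.A_2
    FROM R
    LEFT OUTER JOIN T ON (T.A_1 = R.B OR T.A_2 = R.B)
    ORDER BY R.B
""").fetchall()
print(rows)  # [(1, 1, 10), (2, 20, 2), (3, None, None)]
```

B = 3 has no matching A_1 or A_2, and its row is still returned with NULLs in the T columns, which is the behaviour that is lost when the (+) markers are dropped.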

How to use a variable AS a where clause?

I have one where clause which I have to use multiple times. I am quite new to Oracle SQL, so please forgive my newbie mistakes :). I have searched this website, but could not find the answer :(. Here's the SQL statement:
var condition varchar2(100)
exec :condition := 'column 1 = 1 AND column2 = 2, etc.'
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from table_name
where category = X AND :condition
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3))
) A
,
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100)) as content
from table_name
where category = Y AND :condition
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100))) B
GROUP BY
a.content, b.content
The content field is a CLOB field and unfortunately all the values needed are in the same column. My query does not work, of course.
You can't use a bind variable for that much of a where clause, only for specific values. You could use a substitution variable if you're running this in SQL*Plus or SQL Developer (and maybe some other clients):
define condition = 'column 1 = 1 AND column2 = 2, etc.'
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from table_name
where category = X AND &condition
...
From other places, including JDBC and OCI, you'd need to have the condition as a variable and build the query string using that, so it's repeated in the code that the parser sees. From PL/SQL you could use dynamic SQL to achieve the same thing. I'm not sure why just repeating the conditions is a problem though, binding arguments if values are going to change. Certainly with two clauses like this it seems a bit pointless.
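The "build the query string in client code" idea can be sketched as follows. This is an illustration under stated assumptions, not Oracle client code: it uses Python's sqlite3 module as a stand-in driver, substr in place of DBMS_LOB.SUBSTR, and a hypothetical table my_table with invented columns col1, col2 and inhoud. The condition text is kept in one variable and spliced into both subqueries, while the values inside it are still passed as bind parameters.

```python
import sqlite3

# Hypothetical shared condition; the values are placeholders, not literals.
CONDITION = "col1 = ? AND col2 = ?"


def build_query(condition: str) -> str:
    """Splice the same trusted condition text into both subqueries."""
    return f"""
        SELECT a.content, b.content
        FROM (SELECT substr(inhoud, 1, 3) AS content
              FROM my_table WHERE category = 'X' AND {condition}) a,
             (SELECT substr(inhoud, 1, 100) AS content
              FROM my_table WHERE category = 'Y' AND {condition}) b
    """


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (category TEXT, col1 INTEGER, col2 INTEGER, inhoud TEXT);
    INSERT INTO my_table VALUES
        ('X', 1, 2, 'abcdef'),
        ('Y', 1, 2, 'long description'),
        ('X', 9, 9, 'zzzzzz');
""")

# The condition appears twice in the SQL text, so its values are bound twice.
params = (1, 2) * 2
rows = conn.execute(build_query(CONDITION), params).fetchall()
print(rows)  # [('abc', 'long description')]
```

Only text that the program itself controls should be spliced into the statement this way; anything coming from user input belongs in the bound parameters.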
But maybe you could approach this from a different angle and remove the need to repeat the where clause. Querying the table twice might not be efficient anyway. You could apply your condition once as a subquery, but without knowing your indexes or the selectivity of the conditions this could be worse:
with sub_table as (
select category, content
from my_table
where category in (X, Y)
and column 1 = 1 AND column2 = 2, etc.
)
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from sub_table
where category = X
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3))
) A
,
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100)) as content
from sub_table
where category = Y
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100))) B
GROUP BY
a.content, b.content
I'm not sure what the grouping is for - to eliminate duplicates? This only really makes sense if you have a single X and Y record matching the other conditions, doesn't it? Maybe I'm not following it properly.
You could also use a case expression:
select max(content_x), max(content_y)
from (
select
case when category = X
then DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3) end as content_x,
case when category = Y
then DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100) end as content_y
from my_table
where category in (X, Y)
and column 1 = 1 AND column2 = 2, etc.
)
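A small runnable sketch of the case-expression approach, again using Python's sqlite3 module as a stand-in for Oracle (an assumption), with substr in place of DBMS_LOB.SUBSTR and invented sample rows. Each row fills only the column for its own category, and max() then collapses the rows, ignoring the NULLs left by the other branch.

```python
import sqlite3

# SQLite stands in for Oracle (assumption); the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (category TEXT, inhoud TEXT);
    INSERT INTO my_table VALUES
        ('X', 'abcdef'),
        ('Y', 'long description'),
        ('Z', 'ignored');
""")

# One pass over the table: each CASE yields NULL for the other category,
# and max() skips NULLs when collapsing the rows into a single result row.
row = conn.execute("""
    SELECT max(content_x), max(content_y)
    FROM (
        SELECT
            CASE WHEN category = 'X' THEN substr(inhoud, 1, 3) END AS content_x,
            CASE WHEN category = 'Y' THEN substr(inhoud, 1, 100) END AS content_y
        FROM my_table
        WHERE category IN ('X', 'Y')
    )
""").fetchone()
print(row)  # ('abc', 'long description')
```

As with the original query, this only gives a meaningful pair when a single X record and a single Y record match the conditions.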

Self cross-join in pig is disregarded

Suppose one has data like this:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
And then a cross-join is done on A, A:
B = CROSS A, A;
DUMP B;
(1,2,3)
(4,2,1)
Why is the second A optimized out of the query?
info: pig version 0.11
== UPDATE ==
If I sort A like:
C = ORDER A BY a1;
D = CROSS A, C;
it gives a correct cross-join.
davek is correct -- you cannot CROSS (or JOIN) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH.
I have several macros that I use frequently and IMPORT by default in all of my Pig scripts in case I need them. One is used for just this purpose:
DEFINE DUPLICATE(in) RETURNS out
{
$out = FOREACH $in GENERATE *;
};
This will work for you wherever in your pipeline you need a duplicate:
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;
Note that even though A1 and A2 are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS or JOIN, this probably doesn't matter.
I think you have to load the data twice to achieve what you want.
i.e.
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;