Filtering inside a FOREACH block by a condition value calculated inside the different blocks in PIG script

I have 2 datasets, and I need to match records from dataset 1 against
dataset 2, like this:
dataset 1 = [sourceID, details, key]
1, details1, 1111
2, details2, 1112
3, details3, 1113
4, details4, 1114
...
dataset2 = [key1, key2, number]
1111,1112,3
1111,1114,1
1112,1113,11
...
output:
1, details1, 1111, 2, details2, 1112, 3
1, details1, 1111, 4, details4, 1114, 1
2, details2, 1112, 3, details3, 1113, 11
....
I tried as follows:
a = foreach dataset1 {
b = filter dataset2 by dataset1.key1 matches dataset1.key;
c = filter dataset2 by dataset1.key2 matches dataset1.key;
generate b, c;
};
Any help would be great, please.
Many thanks.

Run two joins?
B = join dataset1 by key, dataset2 by key1;
C = join dataset1 by key, B by key2;
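To illustrate what the two joins are meant to produce, here is a minimal sketch in plain Python rather than Pig. It uses the sample rows from the question and assumes that key is unique within dataset1; the names by_key and result are only illustrative.

```python
# Sample rows from the question: (sourceID, details, key) and (key1, key2, number).
dataset1 = [
    (1, "details1", 1111),
    (2, "details2", 1112),
    (3, "details3", 1113),
    (4, "details4", 1114),
]
dataset2 = [
    (1111, 1112, 3),
    (1111, 1114, 1),
    (1112, 1113, 11),
]

# Index dataset1 by key (assumption: each key appears only once in dataset1).
by_key = {key: (source_id, details, key) for source_id, details, key in dataset1}

# First "join" resolves key1, second "join" resolves key2; pairs with an unknown key are dropped.
result = [
    by_key[key1] + by_key[key2] + (number,)
    for key1, key2, number in dataset2
    if key1 in by_key and key2 in by_key
]

for row in result:
    print(row)
```

Each output row carries the dataset1 record for key1, the dataset1 record for key2, and then the number, which is the shape asked for in the question.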

Related

How to compute cosine similarity between two texts in presto?

Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take the texts as the inputs; it takes maps instead. I am not sure how to convert the texts into those maps in presto. I want the following, if we have a table like this:
id | text1 | text2
1 | a b b | b c
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
i.e., the two texts combined have three words: 'a', 'b', and 'c'; text1 has 1 count of 'a', 2 counts of 'b', and 0 counts of 'c', which go into the first MAP; similarly, text2 has 0 counts of 'a', 1 count of 'b', and 1 count of 'c', which go into the second MAP.
The final table should look like this:
id | text1 | text2 | all_unique_words | map1 | map2 | similarity
1 | a b b | b c | [a b c] | [1, 2, 0] | [0, 1, 1] | 0.63
How can we convert two texts into two such maps in presto? Thanks in advance!
Use split to transform the string into an array and then, depending on the Presto version, use either the unnest + histogram trick or array_frequency:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;
Output:
id | _col1
1 | 0.6324555320336759
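As a cross-check of the arithmetic outside Presto, here is a small Python sketch that builds the word-count maps and applies the cosine formula by hand. The helper name cosine_similarity and the choice to return 0.0 for an empty text are assumptions of this sketch, not a description of how Presto behaves.

```python
import math
from collections import Counter


def cosine_similarity(text1, text2):
    """Cosine similarity between the word-count vectors of two texts."""
    counts1 = Counter(text1.split())
    counts2 = Counter(text2.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption of this sketch: an empty text has similarity 0
    return dot / (norm1 * norm2)


similarity = cosine_similarity("a b b", "b c")
print(similarity)
```

For 'a b b' and 'b c' the dot product is 2 and the norms are sqrt(5) and sqrt(2), so the result is 2 / sqrt(10), roughly 0.632, which agrees with the query output.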

How to use a table in SQL WITH statement?

I am trying to use a pre-existing table in the SQL statement at the bottom of the question rather than the data that is being generated in the SQL statement. Currently, there is some data that is generated using:
WITH polys(poly_id, geom) AS (VALUES (1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'::GEOMETRY),
(2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))'::GEOMETRY)),
However, let's say I already have a table named polys with the poly_id and geom columns, exactly as what would be created above. How can I insert my pre-existing polys table into this SQL statement (i.e. what syntax would I use)?
I have tried the following to add a pre-existing polys table using:
CREATE TABLE polys_pts AS
WITH polys(poly_id, geom) AS,
with the following error:
ERROR: syntax error at or near ","
LINE 2: WITH polys(poly_id, geom) AS,
^
Full Code:
CREATE TABLE polys_pts AS
WITH polys(poly_id, geom) AS (VALUES (1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'::GEOMETRY),
(2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))'::GEOMETRY)),
pnt_clusters AS (SELECT polys.poly_id,
CASE
WHEN ST_Area(polys.geom)>9 THEN ST_ClusterKMeans(pts.geom, 8) OVER(PARTITION BY polys.poly_id)
ELSE ST_ClusterKMeans(pts.geom, 2) OVER(PARTITION BY polys.poly_id)
END AS cluster_id, pts.geom FROM polys,
LATERAL ST_Dump(ST_GeneratePoints(polys.geom, 1000, 1)) AS pts),
centroids AS (SELECT cluster_id, ST_PointOnSurface(ST_collect(geom)) AS geom FROM pnt_clusters GROUP BY poly_id, cluster_id),
neg_buffer AS (SELECT poly_id, (ST_Buffer(geom, -0.4, 'endcap=flat join=round')) geom FROM polys GROUP BY poly_id, polys.geom),
neg_buffer_pts_out AS (SELECT a.cluster_id, (a.geom) geom FROM centroids a WHERE EXISTS (SELECT 1 FROM neg_buffer b WHERE ST_Intersects(a.geom, b.geom))),
neg_buffer_pts_in AS (SELECT a.cluster_id, (a.geom) geom FROM centroids a WHERE NOT EXISTS (SELECT 1 FROM neg_buffer b WHERE ST_Intersects(a.geom, b.geom))),
snap_pts_clusters_in AS (SELECT DISTINCT ST_ClosestPoint(ST_ExteriorRing(a.geom), b.geom) AS geom FROM neg_buffer a, neg_buffer_pts_in b),
node_pts AS (SELECT ST_StartPoint(ST_ExteriorRing(geom)) geom FROM neg_buffer),
snap_pts AS (SELECT b.cluster_id, a.geom FROM snap_pts_clusters_in a JOIN centroids b ON ST_DWithin(a.geom, b.geom, 0.4))
SELECT a.cluster_id, (a.geom) geom FROM snap_pts a WHERE NOT EXISTS (SELECT 1 FROM node_pts b WHERE ST_Intersects(a.geom, b.geom))
UNION SELECT c.cluster_id, (c.geom) geom FROM neg_buffer_pts_out c ORDER BY cluster_id;
I'm not sure I understand your question, so I'll give you a broad answer.
To create a table from a query you must use:
CREATE TABLE foo AS
SELECT * FROM my_table;
CTEs are built as:
WITH
tmp1 AS (
SELECT * from my_table1
), -- comma
tmp2 AS (
SELECT * from my_table2
)
SELECT * from tmp1 JOIN tmp2 ON tmp1.id = tmp2.id -- no comma
;
Note that a , separates the different "temporary" tables defined in the CTE, but the final statement is not preceded by a ,
So to create a table from a CTE the syntax will be:
CREATE TABLE foo AS
WITH
tmp1 AS (
SELECT * from my_table1
),
tmp2 AS (
SELECT * from my_table2
)
SELECT * from tmp1 JOIN tmp2 ON tmp1.id = tmp2.id -- no comma
;
Creating a table from a VALUES clause works the same way as in the other cases:
CREATE TABLE polys2 AS
VALUES
(1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'::GEOMETRY),
(2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))'::GEOMETRY)
;
If you already have a table called polys2, created for example as shown in the previous example, you can replace
CREATE TABLE polys_pts AS
WITH
polys(poly_id, geom) AS (
VALUES
(1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'::GEOMETRY),
(2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))'::GEOMETRY)),
pnt_clusters AS (SELECT polys.poly_id, ...
with
CREATE TABLE polys_pts AS
WITH
polys(poly_id, geom) AS (
SELECT poly_id, geom FROM polys2
),
pnt_clusters AS (SELECT polys.poly_id, ...
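The same pattern can be tried quickly without PostGIS. The sketch below uses Python's sqlite3 module, stores the geometry as plain text (an assumption, since SQLite has no GEOMETRY type), and uses a hypothetical second CTE called poly_ids in place of the real clustering steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pre-existing table, standing in for the real polys2 (geometry kept as text here).
    CREATE TABLE polys2 (poly_id INTEGER, geom TEXT);
    INSERT INTO polys2 VALUES
        (1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'),
        (2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))');

    -- CREATE TABLE ... AS comes first, then WITH, then the CTEs separated by commas.
    CREATE TABLE polys_pts AS
    WITH
    polys(poly_id, geom) AS (
        SELECT poly_id, geom FROM polys2
    ),
    poly_ids AS (  -- hypothetical placeholder for the later CTEs
        SELECT poly_id FROM polys
    )
    SELECT poly_id FROM poly_ids;
""")

rows = conn.execute("SELECT poly_id FROM polys_pts ORDER BY poly_id").fetchall()
print(rows)
```

The point is only the statement layout: the CTE named polys reads from the existing table, and the rest of the chain can keep referring to polys unchanged.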
Um, the question is not 100% clear to me ... I am not familiar with the peculiarities of PostgreSQL, but my first bet would be to try
WITH polys(...) AS (...),
pnt_clusters AS (...)
CREATE TABLE polys_pts AS (
SELECT ..
FROM polys... etc.
)
but I guess this is not allowed, since WITH only goes with DML (data manipulation) statements, unlike DDL (data definition) statements such as CREATE
so.. my next bet would be to take polys and pnt_clusters, which you defined inside the WITH clause, and inline them inside the SELECT statement, given that
WITH a AS (
SELECT x, y FROM z
)
SELECT *
FROM a
is the same as
SELECT *
FROM (
SELECT x, y
FROM z
) AS a
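A quick way to convince yourself of that equivalence is to run both forms against the same data. This is a minimal sketch using Python's sqlite3 module; the table z and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table z with one extra column w that neither query selects.
    CREATE TABLE z (x INTEGER, y INTEGER, w INTEGER);
    INSERT INTO z VALUES (1, 10, 100), (2, 20, 200);
""")

# CTE form.
with_cte = conn.execute(
    "WITH a AS (SELECT x, y FROM z) SELECT * FROM a ORDER BY x"
).fetchall()

# Inline subquery form.
with_subquery = conn.execute(
    "SELECT * FROM (SELECT x, y FROM z) AS a ORDER BY x"
).fetchall()

print(with_cte == with_subquery)
```

Both queries return the same two-column rows here, so a CTE can be swapped for an inline subquery when the surrounding statement does not accept WITH.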
Well, otherwise I would split the process into two steps: create some kind of temporary tables for polys and pnt_clusters first, and then do the create...
The definition of a CTE must be a complete statement, so you have to use
WITH polys(poly_id, geom) AS (
SELECT *
FROM (VALUES
(1, 'POLYGON((1 1, 1 5, 4 5, 4 4, 2 4, 2 2, 4 2, 4 1, 1 1))'::GEOMETRY),
(2, 'POLYGON((6 6, 6 10, 8 10, 9 7, 8 6, 6 6))'::GEOMETRY)
) AS p(p, g)
)

SQL ARRAY: Select ID from my_table where "arrayvalue" = "defined_arrayvalue"

This is a beginner question about arrays. I hope the answer is simple.
The example is taken from Oracle Spatial, but I think it is valid for all arrays.
I have this SELECT:
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO -- column GEOM contains spatial data
FROM
my_table D
I get this result:
73035 MDSYS.SDO_ELEM_INFO_ARRAY(1, 2, 1)
73036 MDSYS.SDO_ELEM_INFO_ARRAY(1, 4, 3, 1, 2, 1, 11, 2, 2, 19, 2, 1)
73037 MDSYS.SDO_ELEM_INFO_ARRAY(1, 2, 1)
Now I want to SELECT all rows where (1,2,1) is defined:
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO
FROM
my_table D
WHERE
-- Pseudo-Code is following
D.GEOM.SDO_ELEM_INFO is "(1, 2, 1)";
So, in simple words: "array_from_row = defined_array".
I found a lot about IMPLODE and TABLE and COLLECT etc. But how to define a clause on two arrays?
Thanks for help!
Try an IN clause; you can also combine it with equality checks:
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO
FROM
my_table D
WHERE
D.GEOM.SDO_ELEM_INFO in (1, 2, 1) or ( D.GEOM.SDO_ELEM_INFO = 1 or D.GEOM.SDO_ELEM_INFO = 2 or D.GEOM.SDO_ELEM_INFO = 3);

May I know how can I construct the follow query in SQL Server?

CREATE TABLE (
A INT NOT NULL,
B INT NOT NULL
)
A is an enumerated values of 1, 2, 3, 4, 5
B can be any values
I would like to count() the number of B groups that contain a specific subset of A, e.g. {1, 2}
Example:
A B
1 7 *
2 7 *
3 7
1 8 *
2 8 *
1 9
3 9
When B = 7, A = 1, 2, 3. Good
When B = 8, A = 1, 2. Good
When B = 9, A = 1, 3. Not satisfied: 2 is missing
So the count will be 2 (when B = 7 and 8)
If I've understood you correctly, we want to find B values for which we have both a 1 and a 2 in A, and then we want to know how many of those we have.
This query does this:
declare @t table (A int not null, B int not null)
insert into @t(A,B) values
(1,7),
(2,7),
(3,7),
(1,8),
(2,8),
(1,9),
(3,9)
select COUNT(DISTINCT B) from (
select B
from @t
where A in (1,2)
group by B
having COUNT(DISTINCT A) = 2
) t
One or both of the DISTINCTs may be unnecessary - it depends on whether your data can contain repeating values.
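For anyone without a SQL Server instance at hand, the same logic can be checked with Python's sqlite3 module. This is only a sketch: the table name t is an assumption, and it uses an ordinary table rather than a T-SQL table variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (A INT NOT NULL, B INT NOT NULL);
    INSERT INTO t (A, B) VALUES
        (1, 7), (2, 7), (3, 7),
        (1, 8), (2, 8),
        (1, 9), (3, 9);
""")

# Inner query: keep only A values in the wanted subset, then require both to be present per B.
# Outer query: count how many B groups survived.
(count,) = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT B
        FROM t
        WHERE A IN (1, 2)
        GROUP BY B
        HAVING COUNT(DISTINCT A) = 2
    ) AS matched
""").fetchone()

print(count)
```

With the sample rows, B = 7 and B = 8 each have both 1 and 2, while B = 9 only has 1, so the count is 2.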
If I understand correctly and the requirement is to find Bs with a series of As that doesn't have any "gaps", you could compare the difference between the minimal and maximal A with number of records (per B, of course):
SELECT b
FROM mytable
GROUP BY b
HAVING COUNT(*) = MAX(a) - MIN(a) + 1
SELECT COUNT(DISTINCT B) FROM TEMP T WHERE T.B NOT IN
(SELECT B FROM
(SELECT B,A,
LAG (A,1) OVER (PARTITION BY B ORDER BY A) AS PRE_A
FROM Temp) K
WHERE K.PRE_A IS NOT NULL AND K.A<>K.PRE_A+1);

SQL querying the same table twice with criteria

I have one table, which contains something like:
ID, parent_item, Comp_item
1, 123, a
2, 123, b
3, 123, c
4, 456, a
5, 456, b
6, 456, d
7, 789, b
8, 789, c
9, 789, d
10, a, a
11, b, b
12, c, c
13, d, d
I need to return only the parent_items that have a Comp_item of both a and b,
so I should only get:
123
456
Here is a canonical way to do this:
SELECT parent_item
FROM yourTable
WHERE Comp_item IN ('a', 'b')
GROUP BY parent_item
HAVING COUNT(DISTINCT Comp_item) = 2
The idea here is to aggregate by parent_item, restricting to only records having a Comp_item of a or b, and then assert that the number of distinct Comp_item values is 2.
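The aggregation approach can be tried against the sample rows from the question with Python's sqlite3 module. This is a rough sketch; the table name yourTable is a placeholder and parent_item is stored as text, which is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (ID INTEGER, parent_item TEXT, Comp_item TEXT);
    INSERT INTO yourTable VALUES
        (1, '123', 'a'), (2, '123', 'b'), (3, '123', 'c'),
        (4, '456', 'a'), (5, '456', 'b'), (6, '456', 'd'),
        (7, '789', 'b'), (8, '789', 'c'), (9, '789', 'd'),
        (10, 'a', 'a'), (11, 'b', 'b'), (12, 'c', 'c'), (13, 'd', 'd');
""")

# Keep rows whose Comp_item is a or b, then require both values per parent_item.
rows = conn.execute("""
    SELECT parent_item
    FROM yourTable
    WHERE Comp_item IN ('a', 'b')
    GROUP BY parent_item
    HAVING COUNT(DISTINCT Comp_item) = 2
    ORDER BY parent_item
""").fetchall()

print(rows)
```

Parents such as 789, a and b each contribute only one of the two values after the WHERE filter, so only 123 and 456 remain.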
Alternatively you could use INTERSECT:
select parent_item from my_table where comp_item = 'a'
intersect
select parent_item from my_table where comp_item = 'b';
If you have a parent item table, the most efficient method is possibly:
select p.*
from parent_items p
where exists (select 1 from t1 where t1.parent_id = p.parent_id and t1.comp_item = 'a') and
exists (select 1 from t1 where t1.parent_id = p.parent_id and t1.comp_item = 'b');
For optimal performance, you want an index on t1(parent_id, comp_item).
I should emphasize that I very much like the aggregation solution by Tim. I bring this up because performance was brought up in a comment. Both intersect and group by expend effort aggregating (in the first case to remove duplicates, in the second explicitly). An approach like this does not incur that cost -- assuming that a table with unique parent ids is available.
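A minimal sketch of the EXISTS variant, again with Python's sqlite3 module. The parent_items table, its parent_id column and the index name are assumptions, built here from the sample data only so that the query has something to run against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (parent_id TEXT, comp_item TEXT);
    INSERT INTO t1 VALUES
        ('123', 'a'), ('123', 'b'), ('123', 'c'),
        ('456', 'a'), ('456', 'b'), ('456', 'd'),
        ('789', 'b'), ('789', 'c'), ('789', 'd'),
        ('a', 'a'), ('b', 'b'), ('c', 'c'), ('d', 'd');

    -- Hypothetical parent table with one row per parent id.
    CREATE TABLE parent_items (parent_id TEXT PRIMARY KEY);
    INSERT INTO parent_items SELECT DISTINCT parent_id FROM t1;

    -- Index suggested in the answer, so each EXISTS probe can be an index lookup.
    CREATE INDEX idx_t1_parent_comp ON t1 (parent_id, comp_item);
""")

rows = conn.execute("""
    SELECT p.parent_id
    FROM parent_items p
    WHERE EXISTS (SELECT 1 FROM t1 WHERE t1.parent_id = p.parent_id AND t1.comp_item = 'a')
      AND EXISTS (SELECT 1 FROM t1 WHERE t1.parent_id = p.parent_id AND t1.comp_item = 'b')
    ORDER BY p.parent_id
""").fetchall()

print(rows)
```

Whether this actually beats the GROUP BY or INTERSECT versions depends on the database and the data, so it is worth measuring on the real tables.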