Left Semi Join on Geo-Spatial tables in Spark-SQL & GeoMesa - apache-spark-sql

Problem:
I have 2 tables (d1 & d2) containing Geo-spatial points. I want to carry out the following query:
select * from table1 where table1.point is within 50km of any point in table2.point
I am using Spark-SQL with GeoMesa & Accumulo to achieve this (Spark as the processing engine, Accumulo as the data store & GeoMesa for the geospatial libraries).
The above query is essentially a left semi join, but I am not sure how to express it in Spark-SQL because, as far as I have read, subqueries can't be used in the WHERE clause.

I was able to achieve this using:
select * from d1 left semi join d2 on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)
Spark broadcasts d2 and carries out the join, but it still takes a long time because d1 has 5 billion rows and d2 has 10 million.
I am not sure whether there is a more efficient way to achieve the same result.
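One thing worth trying is making the broadcast explicit with a join hint, so Spark keeps broadcasting d2 even if it exceeds spark.sql.autoBroadcastJoinThreshold. A minimal sketch, assuming the same tables and GeoMesa functions as above (the hint syntax is standard Spark SQL):
select /*+ BROADCAST(d2) */ *
from d1 left semi join d2
on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)
Beyond that, most of the remaining cost is the per-pair geometry test itself, so pre-filtering d1 down to the bounding box of d2's points before the join is usually where the larger savings are.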

Related

Presto SQL left joining using ST_intersects, ST_crosses yield unexpected results

Sorry if the title is not informative.
I'm using AWS Athena.
I have two tables:
1. transaction_table
location   time         status   type   ...   deleted_at
BLOB       2020-09-01
BLOB       2020-09-02
BLOB       2020-09-03
2. area_table
boundary         created_at   deleted_at
POLYGON((...))   2020-09-01   null
POLYGON((...))   2020-09-01   null
POLYGON((...))   2020-09-01   2020-10-01
For each row in transaction_table I want to attach the appropriate boundary:
select date(time) as dt
, count(time) As cnt
from transaction_table t
left join area_table a
on ST_intersects(boundary, ST_Point(ST_X(t.location), ST_Y(t.location)))
where t.status = 'complete'
and t.deleted_at is null
and t.time >= date('2020-09-01')
and a.deleted_at is null
group by date(time);
The problem is that when I use ST_intersects or ST_contains, the daily cnt decreases compared with the query without the left join, which does not make sense to me, since a left join should always output at least as many rows as the left table.
Both the left and right tables contain no null values, and there are no one-to-many matches that would multiply rows (if there were, the query with the left join would return more rows than the one without).
Right now, using ST_Crosses fixes the problem: it outputs the same result with and without the left join. But I am not sure why the number of rows decreases in my query above.
EDIT: ST_Crosses doesn't seem to join any rows, hence the same value as querying without the left join. So my question is: why does the daily cnt decrease when using a left join with ST_intersects or ST_contains? The same query in MySQL (ST_Point -> Point) runs perfectly fine.
From https://prestodb.io/docs/current/functions/geospatial.html and https://dev.mysql.com/doc/refman/5.7/en/gis-class-point.html.
Point(lat, lng) gives a point object, which is zero-dimensional.
ST_Point(lat, lng) is a geometry and is two-dimensional.
So I guess ST_intersects(Geom, Geom) and ST_intersects(Geom, Point) work differently, but this still does not explain the reduced daily cnt on the left join.
Athena is based on Presto 0.172, and according to the release notes there were no geospatial functions available in that version:
Presto Functions in Athena
Presto 0.172 Documentation
Athena's geospatial functions are implemented as a Presto Plugin and the full Reference is available here: List of Supported Geospatial Functions.
One thing to consider is the argument order of ST_Point: it is ST_Point(longitude, latitude), so longitude is the first argument and latitude the second.
You are also referencing the right table in the WHERE condition (a.deleted_at is null), and this can definitely result in fewer rows.
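To spell that out: a.deleted_at is null is evaluated after the left join and throws away every transaction row that the join padded with NULLs, effectively turning the left join into an inner join. A sketch of the corrected query, with the right-table filter moved into the ON clause (same tables and columns as in the question):
select date(t.time) as dt
, count(t.time) as cnt
from transaction_table t
left join area_table a
on ST_intersects(a.boundary, ST_Point(ST_X(t.location), ST_Y(t.location)))
and a.deleted_at is null
where t.status = 'complete'
and t.deleted_at is null
and t.time >= date('2020-09-01')
group by date(t.time);
With the filter in the ON clause, transactions that fall in no active boundary are kept (with NULL area columns) instead of being dropped, so the daily cnt can no longer shrink below the no-join query.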

BigQuery join too slow for a table of small size

I have a table with the following details:
- Table size: 39.6 MB
- Number of rows: 691,562
- 2 columns: contact_guid STRING, program_completed STRING
- Column 1 is a UUID-like string, around 30 characters long
- Column 2 is a string, around 50 characters long
I am trying this query:
#standardSQL
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2,
COUNT(*) AS cnt
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed=cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid
GROUP BY
cp1.contact_guid,
cp2.contact_guid
HAVING cnt > 1
ORDER BY cnt DESC
Time taken to execute: 1200 secs
I know I am doing a self join, and best practices say to avoid self joins.
My Questions:
I feel this table size in MB is too small for BigQuery, so why is it taking so much time? And what does a small table mean for BigQuery in the context of joins, in terms of number of rows and size in bytes?
Is the number of rows too large? 700K^2 is about 10^11 rows produced during the join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much on how big a table can be for a join or how long it can be expected to run. How do we estimate a rough execution time?
Execution Details:
As shown on the screenshot you provided, you are dealing with an exploding join.
In this case step 3 takes 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query didn't provision enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: either avoid exploding joins, or accept that exploding joins will take a long time to run. But as explained in the question, you already knew that!
How about generating all the extra rows in one operation (do the join, materialize the result) and then running another query to process the 459 million rows, as sketched below? The first query will be slow for the reasons explained, but the second one will run quickly, as BigQuery will provision enough resources to deal with that amount of data.
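A sketch of that two-step approach, assuming BigQuery DDL is available and using a hypothetical destination table data.contact_pairs_materialized:
#standardSQL
CREATE TABLE `data.contact_pairs_materialized` AS
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed = cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid;
The aggregation then runs against the materialized table, which BigQuery sizes resources for up front:
#standardSQL
SELECT p1, p2, COUNT(*) AS cnt
FROM `data.contact_pairs_materialized`
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC;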
I agree with the suggestions below:
- see if you can rephrase your query using analytic functions (by Tim)
- using analytic functions would be a much better idea (by Elliott)
Below is how I would write it:
#standardSQL
SELECT
p1, p2, COUNT(1) AS cnt
FROM (
SELECT
contact_guid AS p1,
ARRAY_AGG(contact_guid) OVER(my_win) guids
FROM `data.contact_pairs_program_together`
WINDOW my_win AS (
PARTITION BY program_completed
ORDER BY contact_guid DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try it and let us know if it helped.

SQL : how to calculate an average from a query

I've been using SQL for about a week at my first full time job, and I'm trying to calculate some statistics from a query where I've combined columns from separate tables.
Specifically, I'm trying to calculate an average from a combined table, where I have applied filters (or constraints? I'm not clear on the SQL lingo).
From doing research on Google, I learned how to calculate an average:
SELECT AVG(column_name)
FROM table_name
The problem I'm having is that this only seems to work with existing tables in the database, not with new queries I have created.
A simplified version of my code is as follows:
SELECT
Animal_Facts.Animal_Name, Animal_Facts.Prev_Reg_Amount,
Names.Given_Name, Animal_Class.Class_Description
FROM
Names
INNER JOIN
Animal_Facts ON Names.Name_Key = Animal_Facts.Name_Key
INNER JOIN
Animal_Class ON Animal_Facts.Class_Key = Animal_Class.Class_Key
This query combines four columns from three tables, where Class_Description describes whether the animal is desexed, microchipped, owned by a pensioner, etc., and Prev_Reg_Amount is the registration fee paid.
I want to find the average fee paid by pensioners, so I included the following line of code to filter the table:
AND Animal_Class.Class_Description LIKE ('%pensioner%')
And then to calculate the average I add:
SELECT AVG(Animal_Facts.Prev_Reg_Amount) from Animal_Facts
So my total code is:
SELECT
Animal_Facts.Animal_Name, Animal_Facts.Prev_Reg_Amount,
Names.Given_Name, Animal_Class.Class_Description
FROM
Names
INNER JOIN
Animal_Facts ON Names.Name_Key = Animal_Facts.Name_Key
INNER JOIN
Animal_Class ON Animal_Facts.Class_Key = Animal_Class.Class_Key
AND Animal_Class.Class_Description LIKE ('%pensioner%')
SELECT AVG(Animal_Facts.Prev_Reg_Amount)
FROM Animal_Facts
Now the problem is that, after checking this calculation in Excel, I'm not actually getting the average of the pensioner data but the average of all the data. Is there a way to calculate averages (and other statistics) directly from my combined query in SQL?
Note: I am able to calculate all these statistics by exporting the data to Excel, but it is much more time consuming. I'd much rather learn how to do this within SQL.
SELECT AVG(af.Prev_Reg_Amount)
FROM
Animal_Facts af
INNER JOIN Animal_Class ac
ON af.Class_Key = ac.Class_Key
AND Class_Description LIKE ('%pensioner%')
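If you want the same statistic for every class at once rather than just for pensioners, the same join supports a GROUP BY. A sketch using the table and column names from the question:
SELECT ac.Class_Description, AVG(af.Prev_Reg_Amount) AS Avg_Fee
FROM
Animal_Facts af
INNER JOIN Animal_Class ac
ON af.Class_Key = ac.Class_Key
GROUP BY ac.Class_Description;
The WHERE/ON filter and the aggregate live in one statement here, which is why the average is computed only over the rows the join and filters keep, rather than over all of Animal_Facts.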

ms access query very slow

I have this ms access query:
SELECT t1.sb, suchbegriff2, menge
FROM (SELECT artnr & '/' & [lfdnr-kal] AS sb, left(suchbegriff,7) &
val(right(suchbegriff,4)) AS suchbegriff2
FROM kvks
WHERE suchbegriff like '*/*') AS t1
INNER JOIN (SELECT artnr & '/' & [lfdnr-kal] AS sb,
[artnr-hz] & '/' & val(lfdnr) AS hz, menge
FROM konf
WHERE [artnr-hz]<>'') AS t2
ON (t1.sb=t2.sb) AND (t1.suchbegriff2=t2.hz);
It runs really slowly (over 30 seconds). I figured out that it is because of the inner join part; if I leave it out, the speed is fine.
Could it be slow because the joined fields are calculated expressions?
EDIT:
I modified the query based on the answer of Smandoli:
SELECT kvks.artnr & '/' & kvks.[lfdnr-kal] AS sb,
left(suchbegriff,7) & val(right(suchbegriff,4)) AS suchbegriff2,
konf.menge
FROM kvks, konf
WHERE kvks.suchbegriff like '*/*'
and konf.[artnr-hz]<>''
and kvks.artnr=konf.artnr
and kvks.[lfdnr-kal]=konf.[lfdnr-kal]
and left(suchbegriff,7) & val(right(suchbegriff,4))=[artnr-hz] & '/' & val(lfdnr)
It now runs correctly.
Thanks for your contribution.
You do have a complicated mess with those calculated fields. Why not join more directly? The query below leaves one '/' unaccounted for, but it should show what I'm thinking of.
SELECT
t1.artnr & '/' & t1.[lfdnr-kal] AS sb,
left(t1.suchbegriff,7) & val(right(t1.suchbegriff,4)) AS suchbegriff2,
t2.menge
FROM kvks AS t1, konf AS t2
WHERE (t1.suchbegriff like '*/*')
AND (t2.[artnr-hz]<>'')
AND (t1.artnr=t2.artnr)
AND (t1.[lfdnr-kal]=t2.[lfdnr-kal])
AND (left(t1.suchbegriff,7)=t2.[artnr-hz])
AND (val(right(t1.suchbegriff,4))=val(t2.lfdnr));
For the inner join, you can try to use a saved query (or temp table) instead of writing the query at run time.
So, I would first try to abstract this query
SELECT artnr & '/' & [lfdnr-kal] AS sb,
[artnr-hz] & '/' & val(lfdnr) AS hz, menge
FROM konf
WHERE [artnr-hz]<>''
Second, if possible, I would abstract some of the functions out of the queries. You could do this with VBA, or by manipulating the data outside of the queries.
Third, you could always create a field on your table that combines the two fields you need, as sketched below.
E.g.: make a new column in your konf table that stores the value of artnr & '/' & [lfdnr-kal].
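A sketch of that in Access SQL, assuming the underlying fields are text and using a hypothetical column name sb:
ALTER TABLE konf ADD COLUMN sb TEXT(50);
UPDATE konf SET sb = artnr & '/' & [lfdnr-kal];
Once the column is populated (and indexed), the join can compare stored values instead of evaluating the expression for every candidate row.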
What you need to do is limit the functions/calculations/concatenations of fields at run time. That's a lot for a query to do, and if it's running slowly I would suspect a direct correlation with that, or with something incorrect in your indexes/joins.
If you've identified this as a join issue, you can use VBA to spin up a temp table with your queries, and use those as the record source instead of the SQL.
Also, if you don't utilize a temp table, at least save the queries. This allows Access to have a plan for running the queries, whereas your query is 100% run-time dependent.
Your query runs slowly because of the nesting and then the joining. You can try creating temp tables and using those tables in the query. Creating a temp table is better practice than making the query complex.
I think you just have to add some indexes.
(t1.sb=t2.sb) AND (t1.suchbegriff2=t2.hz);
These comparisons are very suspicious. Are the four underlying fields indexed?
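The computed aliases sb and suchbegriff2 themselves can't be indexed, but indexes on the underlying fields let Access narrow the candidate rows before evaluating the expressions. A sketch with hypothetical index names:
CREATE INDEX idx_kvks_key ON kvks (artnr, [lfdnr-kal]);
CREATE INDEX idx_konf_key ON konf (artnr, [lfdnr-kal]);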

Oracle Creating Table using Data from Previous Tables with Calculations

I currently have two tables, cts (time, symbol, open, close, high, low, volume) and dividends (time, symbol, dividend). I am attempting to make a third table named dividend_percent with columns Time, Symbol, and Percent. To get the percentage for the dividend, I believe the formula to be ((close - (open + dividend)) / open) * 100.
The request, however, exceeded the size allowed by Oracle XE and thus failed, but I don't believe my request should have been that big.
SQL> create table dividend_percent
  2  as (select c.Time, c.Symbol, (((c.close-(c.open+d.dividend))/c.open)*100) PRCNT
  3  from cts c inner join dividend d
  4  on c.Symbol=d.Symbol);
from cts c inner join dividend d
*
ERROR at line 3:
ORA-12953: The request exceeds the maximum allowed database size of 11 GB
Am I writing the query wrong, or in a way that's really inefficient? The two tables are big, but I don't think too big.
Perhaps you could make a view which combines the two tables and performs the necessary calculation when needed. Note that your original query joins on SYMBOL alone, pairing every cts row with every dividend row for that symbol, which is likely what blew past the size limit; the view below also joins on TIME so each row matches at most once:
CREATE VIEW DIVIDEND_PERCENT_VIEW AS
SELECT c.TIME,
c.SYMBOL,
((c.CLOSE - (c.OPEN + d.DIVIDEND)) / c.OPEN) * 100 AS PRCNT
FROM CTS c
INNER JOIN DIVIDEND d
ON c.SYMBOL = d.SYMBOL AND
c.TIME = d.TIME
WHERE c.OPEN <> 0;
This avoids storing the data twice and still performs the PRCNT calculation for data added after the view is created as well as for pre-existing data.
If you intend to perform DML operations against the result and keep it in sync as a stored table, you could use a materialized view instead, as sketched below.
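A minimal sketch of the materialized-view variant; the BUILD and REFRESH options here are assumptions to tune to how often cts and dividends change:
CREATE MATERIALIZED VIEW DIVIDEND_PERCENT_MV
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
AS
SELECT c.TIME,
       c.SYMBOL,
       ((c.CLOSE - (c.OPEN + d.DIVIDEND)) / c.OPEN) * 100 AS PRCNT
FROM CTS c
INNER JOIN DIVIDEND d
ON c.SYMBOL = d.SYMBOL AND
   c.TIME = d.TIME
WHERE c.OPEN <> 0;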