How to perform an equi-join on Geode with non-key column - gemfire

Now I have 3 regions like below:
region1 ID, phone_number, name
region2 ID, credit_bill_number
region3 phone_number, phone_bill_number
I know that I can join region1 and region2 by co-locating region2 with region1, and perform the query by using "ID" to join.
Is there a way to perform a join across region1, region2, and region3, where region1 joins region2 on "ID" and region1 joins region3 on "phone_number"?

All regions involved in a join must be colocated Partitioned Regions or Replicated Regions. Bottom line is, the data involved in the join must all be on the same JVM. If you can't figure out a way to write a PartitionResolver that forces the Region3 data to co-locate with the Region1 data then Region3 could be set up as a ReplicatedRegion, which would enable the join to work. Just make sure you mention Region1 first.

Related

How does SQL count(distinct) work in this case?

I'm trying to find the match no in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables : match_details and soccer_country. I don't understand how the count(distinct) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Germany')
OR team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;
As Lamak mentioned, this is an awkward way to write the query, but there are many ways to approach it.
The WHERE clause keeps only rows whose team is Germany or Poland, so COUNT(DISTINCT team_id) = 2 makes sure that both teams appear in the match. If the data ever contained repeated rows (for example from a Cartesian result), one team could appear more than once in the same match; counting distinct TEAM_ID values ignores that repetition.
Now, that said, other "team" data structures I have seen have a single record per match and a column for EACH TEAM playing the match. That makes this easier by a long shot, but this is still a relatively easy query.
Break the query down a little, and consider a large-scale set of data (not that this, or even any professional league, would have record counts large enough to slow down a SQL engine).
Your first criterion is games with Germany. So let's start with that.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
So, why even look at any other record/match if Germany is not even part of the match on either side? This query by itself would return 6 matches from the sample data of 51 matches. So now all you need to do is join to the match_details table a second time for only those matches, requiring that the second team is Poland.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
-- joining again for the same match Germany was already qualified
JOIN match_details md2
on md1.match_no = md2.match_no
-- but we want the OTHER team record since Germany was first team
and md1.team_id != md2.team_id
-- and on to the second country table based on the SECOND team ID
JOIN soccer_country sc2
on md2.team_id = sc2.country_id
-- and the second team was Poland
AND sc2.country_name = 'Poland'
Yes, it may be a longer query, but by eliminating the 45 other matches up front (again, thinking of a LARGE database), you avoid blowing through tons of data and work on a very small set, then finish with only the Germany / Poland matches. No aggregates, counts, or distincts, just direct joins.
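To see the two approaches side by side, here is a minimal sketch that runs both queries against an in-memory SQLite database. The country IDs and match rows are invented for illustration and are not the exercise's real data; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical miniature of the exercise tables; IDs and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE soccer_country (country_id INTEGER, country_name TEXT);
CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);
INSERT INTO soccer_country VALUES (1, 'Germany'), (2, 'Poland'), (3, 'France');
INSERT INTO match_details VALUES
    (10, 1), (10, 2),   -- Germany vs Poland
    (11, 1), (11, 3),   -- Germany vs France
    (12, 2), (12, 3);   -- Poland vs France
""")

# Direct-join version: start from Germany's rows, then look for Poland
# as the other team in the same match.
join_query = """
SELECT md1.match_no
FROM match_details md1
JOIN soccer_country sc1
  ON md1.team_id = sc1.country_id AND sc1.country_name = 'Germany'
JOIN match_details md2
  ON md1.match_no = md2.match_no AND md1.team_id != md2.team_id
JOIN soccer_country sc2
  ON md2.team_id = sc2.country_id AND sc2.country_name = 'Poland'
"""

# Aggregate version from the question.
count_query = """
SELECT match_no
FROM match_details
WHERE team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Germany')
   OR team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2
"""

join_matches = [row[0] for row in conn.execute(join_query)]
count_matches = [row[0] for row in conn.execute(count_query)]
print(join_matches, count_matches)
```

On this toy data both queries pick out the single Germany / Poland match; which one performs better on a real engine depends on indexes and the optimizer.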
FEEDBACK
Let's take a look at some BAD sample data... which, as all programmers know, does not exist (NOT). Anyhow, let's take a look at these few matches (team names are shown instead of IDs for simplicity).
Match  Team ID
52     Poland
52     Poland
53     Germany
53     Germany
If you were to run the query with a plain COUNT instead of COUNT(DISTINCT team_id), both match 52 and 53 would show up, as Poland appears 2 times for match 52 and Germany 2 times for match 53. With DISTINCT, each of these matches counts only 1 team and is thus excluded. Does that help? Again, no such thing as bad data :)
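And here is another set of sample matches, one with more than 2 teams and others with only a single entry:
Match  Team ID
54     France
54     Poland
54     England
55     Hungary
56     Austria
NONE of these matches would be returned. Match 54 has 3 distinct teams, but only Poland's row survives the WHERE clause, so its distinct count is 1. Matches 55 and 56 have only a single entry each (no opponent to compete against), and neither entry is Germany or Poland.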
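The effect of DISTINCT on this kind of bad data can be checked with a small sketch. The rows below mirror the sample matches in this answer plus one clean Germany / Poland match (number 10, invented here); the numeric team IDs are also made up.

```python
import sqlite3

# Hypothetical IDs: 1 = Germany, 2 = Poland, 3 = France, 4 = England.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE soccer_country (country_id INTEGER, country_name TEXT);
CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);
INSERT INTO soccer_country VALUES
    (1, 'Germany'), (2, 'Poland'), (3, 'France'), (4, 'England');
INSERT INTO match_details VALUES
    (10, 1), (10, 2),            -- clean Germany vs Poland match
    (52, 2), (52, 2),            -- bad data: Poland listed twice
    (53, 1), (53, 1),            -- bad data: Germany listed twice
    (54, 3), (54, 2), (54, 4);   -- bad data: three teams in one match
""")

# Same query as in the question, with the HAVING expression swapped in.
base = """
SELECT match_no
FROM match_details
WHERE team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Germany')
   OR team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Poland')
GROUP BY match_no
HAVING {having} = 2
ORDER BY match_no
"""

with_distinct = [r[0] for r in conn.execute(base.format(having="COUNT(DISTINCT team_id)"))]
without_distinct = [r[0] for r in conn.execute(base.format(having="COUNT(*)"))]
print(with_distinct)      # only the real Germany vs Poland match
print(without_distinct)   # also lets the duplicated rows through
```

Match 54 is excluded in both runs because only its Poland row passes the WHERE clause.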
2nd FEEDBACK
To clarify the query: in the short query for just Germany, the aliased instance "md1" is already sitting on a record for a Germany match. In the second join, to "md2", I only care about the same match, so I join on the same match_no. The "!=" means NOT EQUAL ("!" is logical NOT), so the join says: from md1, join to md2 on the same match number, but only where the teams are NOT the same. The first instance holds Germany's team ID (already qualified), so md2 gives me the other team's ID. I can then use the md2 team ID to join to the country table and confirm that it is Poland.
Does this now clarify things for you?

Find entries in same table WITHOUT similar location

Imagine I have a table of fast food restaurants (FASTFOOD). Each of them has geo coordinates set in columns GEO_X and GEO_Y, as well as a column FRANCHISE. Franchise may be MCDONALDS or BURGERKING.
I want to select all Burger Kings which do NOT have a McDonalds within a specific distance, measured in geo coordinate degrees.
How do I do this?
I AM able to list the Burger Kings that DO have a McDonalds within a certain distance:
select t.*
from FASTFOOD t
INNER JOIN FASTFOOD s ON (ABS(t.geo_x - s.geo_x) < 0.01 AND ABS(t.geo_y - s.geo_y) < 0.01)
WHERE t.FRANCHISE= 'BURGERKING'
AND s.FRANCHISE = 'MCDONALDS';
But I have no idea how to find the "opposite".
The result sets of my query are the same whether I use an INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL OUTER JOIN, as all entries do have set geo coordinates.
I AM able to list the Burger Kings that DO have a McDonalds within a certain distance
Use:
[all Burger Kings] EXCEPT [Burger Kings that have a McDonalds nearby]
That should leave only those without one nearby.
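A minimal sketch of that set-difference idea, run against an in-memory SQLite database. The `id` column and the coordinates are assumptions added for illustration (the question does not name a key column), and some databases spell EXCEPT as MINUS. A NOT EXISTS variant is included as an equivalent alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 'id' is a hypothetical key column; the coordinates are invented.
CREATE TABLE FASTFOOD (id INTEGER, FRANCHISE TEXT, geo_x REAL, geo_y REAL);
INSERT INTO FASTFOOD VALUES
    (1, 'BURGERKING', 10.000, 20.000),
    (2, 'MCDONALDS',  10.005, 20.005),   -- within 0.01 degrees of Burger King 1
    (3, 'BURGERKING', 30.000, 40.000),   -- no McDonalds nearby
    (4, 'MCDONALDS',  50.000, 60.000);
""")

# All Burger Kings, minus those that the question's query already finds.
except_query = """
SELECT t.id FROM FASTFOOD t WHERE t.FRANCHISE = 'BURGERKING'
EXCEPT
SELECT t.id
FROM FASTFOOD t
INNER JOIN FASTFOOD s
   ON (ABS(t.geo_x - s.geo_x) < 0.01 AND ABS(t.geo_y - s.geo_y) < 0.01)
WHERE t.FRANCHISE = 'BURGERKING' AND s.FRANCHISE = 'MCDONALDS'
"""

# Equivalent anti-join written with NOT EXISTS.
not_exists_query = """
SELECT t.id
FROM FASTFOOD t
WHERE t.FRANCHISE = 'BURGERKING'
  AND NOT EXISTS (
      SELECT 1 FROM FASTFOOD s
      WHERE s.FRANCHISE = 'MCDONALDS'
        AND ABS(t.geo_x - s.geo_x) < 0.01
        AND ABS(t.geo_y - s.geo_y) < 0.01)
"""

lonely_except = sorted(r[0] for r in conn.execute(except_query))
lonely_not_exists = sorted(r[0] for r in conn.execute(not_exists_query))
print(lonely_except, lonely_not_exists)
```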
Try this
-- 0.01 stands in for 'geo_z', the specified distance between two franchises
select t.*
from FASTFOOD t
left join FASTFOOD s
  on s.FRANCHISE = 'MCDONALDS'
 and ABS(t.geo_x - s.geo_x) < 0.01
 and ABS(t.geo_y - s.geo_y) < 0.01
where t.FRANCHISE = 'BURGERKING'
  -- keep only the Burger Kings for which no McDonalds row matched
  and s.FRANCHISE is null

JOIN the same table on two columns

I use JOINs to replace country and product IDs in import and export data with actual country and products names stored in separate tables. In the data source table (data), there are two columns with country IDs, for origin and destination, both of which I am replacing with country names.
The code I have come up with refers to the country_names table twice – as country_names, and country_names2, – which doesn’t seem to be very elegant. I expected to be able to refer to the table just once, by a single name. I would be grateful if someone pointed me to a more elegant and maybe more efficient way to achieve the same result.
SELECT
country_names.name AS origin,
country_names2.name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM
OEC.year_origin_destination_hs92_6 AS data
JOIN
OEC.products_hs_92 AS product_names
ON
data.hs92 = product_names.hs92
JOIN
OEC.country_names AS country_names
ON
data.origin = country_names.id_3char
JOIN
OEC.country_names AS country_names2
ON
data.dest = country_names2.id_3char
WHERE
data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The table to convert product IDs to product names has 6K+ rows. Here is a small sample:
id hs92 name
63215 3215 Ink
2130110 130110 Lac
21002 1002 Rye
2100200 100200 Rye
52706 2706 Tar
20902 902 Tea
42203 2203 Beer
42302 2302 Bran
178703 8703 Cars
The table to convert country IDs to country names (which is the table I have to JOIN on twice) has 264 rows for all countries in the world. (id_3char is the column used.) Here is a sample:
id id_3char name
euchi chi Channel Islands
askhm khm Cambodia
eublx blx Belgium-Luxembourg
eublr blr Belarus
eumne mne Montenegro
euhun hun Hungary
asmng mng Mongolia
nabhs bhs Bahamas
afsen sen Senegal
And here is a sample of data from the import and export data table with a total of 205M rows that has the two columns origin and dest that I am making a join on:
year origin dest hs92 export_val import_val
2009 can isr 300410 2152838.47 3199.24
1995 chn jpn 590190 275748.65 554154.24
2000 deu gmb 100610 1573508.44 1327.0
2008 deu jpn 540822 10000.0 202062.43
2010 deu ukr 950390 1626012.04 159423.38
2006 esp prt 080530 2470699.19 125291.33
2006 grc ind 844859 8667.0 3182.0
2000 ltu deu 630399 6018.12 5061.96
2005 usa zaf 290219 2126216.52 34561.61
1997 ven ecu 281122 155347.73 1010.0
I think you already have it done such that it can be considered good enough to just use as is :o)
Meanwhile, if for some reason you really want to avoid two joins on that country table, you can materialize the select statement below into, let's say, an `OEC.origin_destination_pairs` table:
SELECT
o.id_3char o_id_3char,
o.name o_name,
d.id_3char d_id_3char,
d.name d_name
FROM `OEC.country_names` o
CROSS JOIN `OEC.country_names` d
Then you can just join on that new table as below
SELECT
country_names.o_name AS origin,
country_names.d_name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM OEC.year_origin_destination_hs92_6 AS data
JOIN OEC.products_hs_92 AS product_names
ON data.hs92 = product_names.hs92
JOIN OEC.origin_destination_pairs AS country_names
ON data.origin = country_names.o_id_3char
AND data.dest = country_names.d_id_3char
WHERE data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The motivation behind the above is the cost of storing and querying in your particular case.
Your `OEC.country_names` table is only about 10KB in size.
Each time you query it you pay as if it were 10MB (charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query).
So if you materialize the table mentioned above, it will still be less than 10MB, so there is no difference in querying charges.
The situation is similar for storing that table: no visible change in charges.
You can check more about pricing here
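As a rough check that the pairs table gives the same result as joining the country table twice, here is a small sketch using an in-memory SQLite database. It is only an illustration of the join logic: the rows are invented, the product join is left out for brevity, and the table names drop the `OEC.` prefix. With the 264 countries mentioned in the question, the cross join would hold 264 x 264 = 69,696 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample rows; only the column names follow the question.
CREATE TABLE country_names (id_3char TEXT, name TEXT);
CREATE TABLE data (year INTEGER, origin TEXT, dest TEXT, export_val REAL);
INSERT INTO country_names VALUES
    ('deu', 'Germany'), ('jpn', 'Japan'), ('ukr', 'Ukraine');
INSERT INTO data VALUES
    (2013, 'deu', 'jpn', 100.0),
    (2014, 'deu', 'jpn', 50.0),
    (2013, 'deu', 'ukr', 20.0),
    (2014, 'jpn', 'deu', 5.0);

-- Materialize every (origin, destination) pair once.
CREATE TABLE origin_destination_pairs AS
SELECT o.id_3char AS o_id_3char, o.name AS o_name,
       d.id_3char AS d_id_3char, d.name AS d_name
FROM country_names o
CROSS JOIN country_names d;
""")

# Original approach: the country table joined twice under two aliases.
two_join_rows = conn.execute("""
SELECT o.name, d.name, SUM(data.export_val)
FROM data
JOIN country_names o ON data.origin = o.id_3char
JOIN country_names d ON data.dest = d.id_3char
GROUP BY o.name, d.name
ORDER BY o.name, d.name
""").fetchall()

# Alternative: a single join on the materialized pairs table.
pairs_rows = conn.execute("""
SELECT p.o_name, p.d_name, SUM(data.export_val)
FROM data
JOIN origin_destination_pairs p
  ON data.origin = p.o_id_3char AND data.dest = p.d_id_3char
GROUP BY p.o_name, p.d_name
ORDER BY p.o_name, p.d_name
""").fetchall()

pair_count = conn.execute("SELECT COUNT(*) FROM origin_destination_pairs").fetchone()[0]
print(two_join_rows == pairs_rows, pair_count)
```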

Oracle REGEXP_SUBSTR for string matching b/w two columns

The problem
Users are frequently inputting "country name" strings into the "city name" field. Heuristically, this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be input and then the country would be "TAIWAN". I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then identify where possible rankings associated with strictly the "country" names in the "city" field.
I have two tables that I am attempting to leverage to track down data validation issues. Tbl1, named "Customer_Address", is comprised of geographic columns (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2, named "HR_Countries", is a clean table of 2-letter ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc.) and some other fields not presently used.
The initial step is to query "Customer_Address" to find City_Names LIKE a series of OR statements (LIKE '%CHINA', OR LIKE 'TAIWAN', OR etc etc) and count the number of occurrences where the City_Name is like the designated country_name string I passed it and the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH" so my overall results are satisfactory for the first phase.
Part of the query does a LEFT join from Tbl1 to Tbl2 to add the risk rating from tbl2 as a result of the query against tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity CountOfOccurences RR
China 15 High
Taiwan 2000 Medium
Japan 250 Low
Taipei, Taiwan 25 NULL
Kabul, Afghanistan 10 NULL
Shenzen China 100 NULL
Afghanistan 52 Very High
Example of Tbl2 Data (head(tbl2), n=6)
CountryName CountryCode RR
China CN High
Taiwan TW High
Iraq IQ Very High
Cuba CU Medium
Lebanon LB Very High
Greece GR High
So my question(s) are as follows:
1) Instead of manually passing in a series of OR statements for country codes, is there a better way to use Tbl2 as the matching "LIKE" driving the query?
2) Can you recommend a better way of comparing the output of the query (see Tbl1 example) and ensuring that multiple strings (Taipei, Taiwan, etc.) are appropriately aggregated and bring back the correct 'RR' rating?
Thanks for taking the time to review this and respond.
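One possible direction for question 1, sketched under assumptions: let the country table drive the match through the same INSTR join shown above, and aggregate by country name so that "TAIPEI TAIWAN" and "Taipei, Taiwan" land in the same bucket. The example runs on SQLite rather than Oracle, the rows are invented, and the column names (city_nm, country_name, rr) are guesses based on the snippets in the question. False positives such as "Lebanon, OH" would still need the exclusions already described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented rows; column names are assumptions based on the question.
CREATE TABLE customer_address (customer_num INTEGER, city_nm TEXT);
CREATE TABLE hr_countries (country_name TEXT, country_code TEXT, rr TEXT);
INSERT INTO customer_address VALUES
    (1, 'TAIPEI TAIWAN'),
    (2, 'TAIWAN'),
    (3, 'Taipei, Taiwan'),
    (4, 'SHENZEN CHINA'),
    (5, 'CHINA'),
    (6, 'KABUL');
INSERT INTO hr_countries VALUES
    ('China', 'CN', 'High'),
    ('Taiwan', 'TW', 'High');
""")

# Every country name in hr_countries acts as a search term, so no
# hand-written list of OR ... LIKE conditions is needed.
query = """
SELECT risk.country_name, risk.rr, COUNT(*) AS occurrences
FROM customer_address addr
JOIN hr_countries risk
  ON INSTR(UPPER(addr.city_nm), UPPER(risk.country_name)) > 0
GROUP BY risk.country_name, risk.rr
ORDER BY risk.country_name
"""

summary = conn.execute(query).fetchall()
print(summary)
```

Grouping on the matched country name, not on the raw city string, is what keeps the risk rating from coming back NULL for the combined "city, country" values.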

Modelling country adjacency in SQL

I'm trying to model which countries border each other in MySQL. I have three tables:
nodes
-----
node_id MEDIUMINT
countries
---------
country_id MEDIUMINT (used as a foreign key for nodes.node_id)
country CHAR(64)
iso_code CHAR(2)
node_adjacency
--------------
node_id_1 MEDIUMINT (used as a foreign key for nodes.node_id)
node_id_2 MEDIUMINT (used as a foreign key for nodes.node_id)
I appreciate the nodes table is redundant in this example, but this is part of a larger architecture where nodes can represent many other items other than countries.
Here's some data (IDs (which appear in all three tables) and countries)
59 Bosnia and Herzegovina
86 Croatia
130 Hungary
178 Montenegro
227 Serbia
232 Slovenia
Croatia is bordered by all the other countries, and this is represented in the node_adjacency table as:
59 86
86 130
86 178
86 227
86 232
So Serbia's ID may appear as a node_id_1 or a node_id_2. The data in this table is essentially undirected graph data.
Questions:
Given the name 'Croatia', what SQL should I use to retrieve its neighbours?
Bosnia and Herzegovina
Hungary
Montenegro
Serbia
Slovenia
Would there be any retrieval efficiency gains in storing the adjacency information as directed graph data? E.g. Croatia borders Hungary, and Hungary borders Croatia, essentially duplicating storage of the relationships:
86 130
130 86
This is just off the top of my head, so I don't know if it's the most performant solution and it may need a tweak, but I think it should work:
SELECT
BORDER.country
FROM
Countries AS C
LEFT OUTER JOIN Node_Adjacency NA1 ON
NA1.node_id_1 = C.country_id OR
NA1.node_id_2 = C.country_id
INNER JOIN Countries AS BORDER ON
(
BORDER.country_id = NA1.node_id_1 OR
BORDER.country_id = NA1.node_id_2
) AND
BORDER.country_id <> C.country_id
WHERE
C.country = 'CROATIA'
Since your graph is not directed, I don't think that it makes sense to store it as a directed graph. You might also want to Google "Celko SQL Graph" as he has done a lot of advanced work on trees, graphs, and hierarchies in SQL and has an excellent book devoted to the subject.
I would store both relations (Hungary borders Croatia, Croatia borders Hungary) so that you only ever need to query one column.
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_2 = c.country_id
WHERE n.node_id_1 = 86
To do both columns, simply union two queries together (borrowing from Matthew Jones):
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_2 = c.country_id
WHERE n.node_id_1 = 86
UNION
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_1 = c.country_id
WHERE n.node_id_2 = 86
If you do it this way, you duplicate your query instead of your data (use 50% less space), and it's still really simple.
You can create a union view to avoid duplication:
CREATE VIEW adjacency_view (node_id_1, node_id_2)
AS
SELECT node_id_1, node_id_2 FROM node_adjacency
UNION ALL
SELECT node_id_2, node_id_1 FROM node_adjacency
So your query becomes quite straightforward:
SELECT c1.country
FROM adjacency_view
INNER JOIN countries AS c1 on c1.country_id = adjacency_view.node_id_1
INNER JOIN countries AS c2 on c2.country_id = adjacency_view.node_id_2
WHERE c2.country = 'CROATIA'
If you are duplicating relationships (i.e. country A shares a border with B, and B shares a border with A), you can get away with a simple select. If you store only one relationship per pair of countries, you will need to search by both columns in the node_adjacency table (running two select statements and performing a union).
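The union-view approach can be tried out with the sample data from the question. This sketch uses an in-memory SQLite database, not MySQL, so the types are simplified and the country name is matched with its exact case; the `neighbours` helper is a hypothetical name introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (country_id INTEGER, country TEXT);
CREATE TABLE node_adjacency (node_id_1 INTEGER, node_id_2 INTEGER);
INSERT INTO countries VALUES
    (59, 'Bosnia and Herzegovina'), (86, 'Croatia'), (130, 'Hungary'),
    (178, 'Montenegro'), (227, 'Serbia'), (232, 'Slovenia');
-- One row per border, stored in one direction only.
INSERT INTO node_adjacency VALUES
    (59, 86), (86, 130), (86, 178), (86, 227), (86, 232);

-- The view presents every border in both directions.
CREATE VIEW adjacency_view AS
SELECT node_id_1, node_id_2 FROM node_adjacency
UNION ALL
SELECT node_id_2 AS node_id_1, node_id_1 AS node_id_2 FROM node_adjacency;
""")

def neighbours(country_name):
    """Return the sorted names of countries bordering country_name."""
    rows = conn.execute("""
        SELECT c1.country
        FROM adjacency_view
        INNER JOIN countries AS c1 ON c1.country_id = adjacency_view.node_id_1
        INNER JOIN countries AS c2 ON c2.country_id = adjacency_view.node_id_2
        WHERE c2.country = ?
        ORDER BY c1.country
    """, (country_name,))
    return [r[0] for r in rows]

croatia_neighbours = neighbours("Croatia")
hungary_neighbours = neighbours("Hungary")
print(croatia_neighbours)
print(hungary_neighbours)
```

The storage stays undirected, and the view lets a single-column lookup work no matter which side of the pair a country was stored on.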