Join Query in Apache Ignite

How is a spatial join query executed among the nodes in PARTITIONED mode? Since Ignite partitions the data (1024 partitions by default) among the nodes using Rendezvous affinity hashing, how is the join operation executed across the partitions? Suppose I have two spatial datasets in caches (pCache and qCache), each containing 10 partitions (1, ..., 10). How does Ignite perform the join operation on these two datasets? Is it partition 1 of pCache with partition 1 of qCache?
My second question: how does Ignite perform the same operation in the case of a distributed join?

There is no correspondence between partitions of different caches. If you run a join operation, then by default only a local lookup is performed. If the data is not collocated, this approach may give you a partial result.
When all-to-all mapping is performed, every node has to communicate with every other node, so on the order of N² messages are sent in the cluster in total, where N is the number of nodes. This is called a distributed join, and it affects performance significantly. It may be enabled either in the connection string in the case of the JDBC driver, or by using the SqlFieldsQuery#setDistributedJoins(...) method in the case of the cache query API.
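For illustration, a minimal sketch of enabling a distributed join through the cache query API (the table names and join column here are assumptions, not taken from the question):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

// Distributed joins return a complete result for non-collocated data,
// at the cost of all-to-all network traffic between the nodes.
Ignite ignite = Ignition.ignite();
IgniteCache<?, ?> pCache = ignite.cache("pCache");

SqlFieldsQuery qry = new SqlFieldsQuery(
        "select p.id, q.id from P p join Q q on p.regionId = q.regionId")
        .setDistributedJoins(true);

pCache.query(qry).getAll();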
The recommended way to do joins is to collocate the data in such a way that no distributed joins are needed. Ignite has a feature called affinity collocation, designed specifically for this purpose. You can specify a field of an object that will be used to calculate the affinity function. The value of this field doesn't have to be unique, but it must be a part of the key. So, if you want to perform joins on two tables, you may collocate them by affinity, and no distributed joins will be needed.
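A minimal sketch of such a collocated key, assuming both caches are keyed by an object like the one below (the class and field names are illustrative, not from the question):

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

class SpatialKey {
    private long id;           // unique part of the key

    @AffinityKeyMapped
    private long regionId;     // entries of pCache and qCache with the same regionId
                               // land on the same node, so the join can stay local

    SpatialKey(long id, long regionId) {
        this.id = id;
        this.regionId = regionId;
    }
}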

Related

Best practice to do join in Apache Ignite

I have two large tables A and B, and I want to join these two tables on two columns, say, project_id and customer_id.
When I do the join in Apache Ignite, I find that the performance is very bad. After investigating, I think the problem is that the data resides on different nodes randomly.
When the join happens, data is transferred between nodes in order to bring rows with the same project_id and customer_id from A and B onto the same node.
For my case, I see two options:
1) Load data into the Ignite cluster based on A's and B's project_id and customer_id, so that there is no data transfer when doing the join. This solution can work but is not flexible.
2) Use only one node to hold all the data. This solution can work, but there is a memory limit for a single node (not too much data can be held by one node).
I would ask which solution would be a better choice, thanks!
The former (1) is recommended. You should load the data in such a fashion that data for the same project_id and customer_id is on the same node in both tables.
This is called affinity collocation and it is paramount to get right to have good performance of Ignite queries (and sometimes for them to work at all).
Ignite will take care of affinity collocation for you if you set it up correctly, but there are a few caveats right away:
The affinity key has to be a part of the primary key (not a value field).
The affinity key has to be a single field, so you have to choose between project_id and customer_id, use a composite type (a nested POJO with its own implications), or perhaps a synthetic value (see the sketch after this list).
There is a possibility of uneven partition distribution. Imagine you have a single large customer (or project). When processing this customer, all nodes but a single one will be idle and unused.
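To illustrate the "synthetic value" option from the list above, a hedged sketch (the class and field names are made up for this example) could combine both columns into a single affinity field, so that rows of A and B sharing both project_id and customer_id land on the same node:

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

class JoinKey {
    private long rowId;                 // unique part of the primary key

    @AffinityKeyMapped
    private String projectCustomer;     // synthetic affinity value, e.g. "42:1001"

    JoinKey(long rowId, long projectId, long customerId) {
        this.rowId = rowId;
        this.projectCustomer = projectId + ":" + customerId;
    }
}

Keep the third caveat in mind: if one project/customer pair dominates the data, all of its rows end up in a single partition.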

Efficient combination of a large number of affinity calls in Apache Ignite

We would like to compute on a large, partitionable dataset of 'products' in Ignite (100,000+ products, each linked to a large amount of extra data in different caches). We have several use cases:
1) Launch a compute job, limited to a large number (100's) of products, with a strong focus on responsiveness (<200ms). We can use the product ID as an affinity key to collocate all extra data with the products. But affinityRun only allows a single key to be specified, which would mean we need to launch 100's of compute jobs. Ideally we would be able to do an affinityRun on the entire set of product IDs at once, and let Ignite distribute the compute job to the relevant nodes, but we struggle to find a way to do this. (The compute job would then use local queries only on those compute nodes.)
2) Launch a compute job over the entire space of products in an efficient manner. We could launch the compute job on each compute node and use local queries, but that would no longer give us the benefits of falling back to backup partitions in case a primary partition is unavailable. This is an extreme case of problem number 1, just with a huge (all) number of product IDs as input.
We've been brainstorming about this for a while now, but it seems like we're missing something. Any ideas?
There is a version of affinityRun that takes a partition number as a parameter. Distribute your task per partition, and each node on the receiving end will process the data residing in that partition (just run a scan query for the partition). In case of failure, you simply restart the process for a partition and can filter out already-processed items with custom logic.
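A rough sketch of that approach (the cache name "products" and the value type are assumptions for this example):

import java.util.Collections;
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

// Fan out one job per partition; each job scans only the partition it was mapped to,
// so all reads stay local to the node that currently owns that partition.
Ignite ignite = Ignition.ignite();
int parts = ignite.affinity("products").partitions();

for (int p = 0; p < parts; p++) {
    final int part = p;
    ignite.compute().affinityRun(Collections.singleton("products"), part, () -> {
        IgniteCache<Long, Object> cache = Ignition.localIgnite().cache("products");
        try (QueryCursor<Cache.Entry<Long, Object>> cur =
                 cache.query(new ScanQuery<Long, Object>().setPartition(part))) {
            for (Cache.Entry<Long, Object> e : cur) {
                // process e.getValue() locally; a failed partition can simply be re-submitted
            }
        }
    });
}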
An affinity job is simply one that executes on the data node where the key/value resides.
There are several ways to send a job to a particular node, not only by affinity key. For example, you can send it based on the consistent ID, and in 2.4.10 (if I remember correctly) they added a way to query backups explicitly.
Regarding your scenario, I can think of the following solution:
SqlFieldsQuery query = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
You can prepare an affinity job with the above SQL, in which you select all products (from that node only), iterate over them, and run all further queries locally to fetch the remaining product information. Send that job to the required node, do your computation, reduce the result, and return it to the client.
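A hedged sketch of that idea (the cache/table name, the column, and the key used to pick the target node are all assumptions): an affinity job that runs the SQL locally on the node that owns the given key.

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

Ignite ignite = Ignition.ignite();
Object keyOnTargetNode = 42L;   // any key whose primary partition lives on the node we want

ignite.compute().affinityRun("CacheTable", keyOnTargetNode, () -> {
    IgniteCache<?, ?> cache = Ignition.localIgnite().cache("CacheTable");
    SqlFieldsQuery q = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
    try (QueryCursor<List<?>> cur = cache.query(q)) {
        for (List<?> row : cur) {
            // run further local-only queries here to gather the product information
        }
    }
});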

Clustering in BigQuery using CREATE TABLE

Unsure if I cluster correctly. Basically, I am looking at GCP billing info for, say, 50 clients. Each client has a billing_ID and I cluster on that billing_ID. I use the clustered table for a Data Studio dashboard.
See the SQL query below for what I do right now:
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT
*
FROM
`datagcp`
WHERE
usage_start_time BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP(CURRENT_DATE)
It is successfully clustered like this; I just don't see a noticeable query performance increase!
I thought that by clustering on billing_ID I would see an increase in dashboard performance.
Please consider the following points:
Cluster structure
A cluster field is composed of an array of fields, like boxes, from outer to inner, as stated in the BigQuery documentation:
When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
This means, as #Gordon wrote, that the WHERE part of your query needs to filter from the outer field to the inner one to make the most of your cluster field. In your case, if userId is part of the WHERE, you need to change your cluster field to match it.
Cluster limitation
Clustering typically works better for queries that scan over 1 GB of data, so if you are not scanning that amount of data you won't see the improvement you are looking for.
Cluster with Ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that clustering is a process which BigQuery performs in the background, separately from the insert operation and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this, you will need to use the MERGE command to rebuild your clustering in order to get the most out of it.
From the docs:
“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.

Best way to lookup elements in GemFire Region

I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan; there can be more than 10,000 items.
What will be an efficient way to look up element in Regions?
Please suggest.
Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
To begin with, a GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several other operations that are not available in Map, like getAll(Collection keys):Map.
get(key):value is not going to be the most "efficient" method for looking up multiple items at once, but getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
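For example, a small sketch of the bulk lookup (the region name and key type are assumptions, cache refers to your Cache/ClientCache instance as in the query example further down, and the imports assume the Apache Geode packages):

import java.util.Arrays;
import java.util.Map;
import org.apache.geode.cache.Region;

// Fetch many entries in a single call instead of issuing one get(key) per scanned item.
Region<String, Object> items = cache.getRegion("/Items");
Map<String, Object> values = items.getAll(Arrays.asList("key1", "key2", "key3"));
values.forEach((key, value) -> {
    // validate each value here
});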
You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that matches the values that need to be evaluated, you can express this criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To call upon the GemFire QueryService to write the query above, you would...
Region people = cache.getRegion("/People");
...
QueryService queryService = people.getRegionService().getQueryService();
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });
// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When the appropriate Indexes are added to your GemFire Regions, then the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
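For instance, a quick sketch of creating such an Index, reusing the people Region from the snippet above (the index name is made up, and the checked exceptions are caught broadly just to keep the example short):

QueryService queryService = people.getRegionService().getQueryService();
try {
    // Index the field referenced by the predicate so the query can avoid a full scan.
    queryService.createIndex("ageIdx", "p.age", "/People p");
} catch (Exception e) {
    // e.g. the index already exists or the region was not found
}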
Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service to query the data locally on the node where it actually exists (i.e. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
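A hedged sketch of such a Function (the class name and validation details are illustrative; the packages and generics assume a recent Apache Geode release):

import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;

public class ValidatePeopleFunction implements Function<Object> {

    @Override
    public void execute(FunctionContext<Object> context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;

        // Only the local (primary) buckets of the PARTITION Region hosted on this member.
        Region<Object, Object> localData = PartitionRegionHelper.getLocalDataForContext(rfc);

        int invalid = 0;
        for (Object person : localData.values()) {
            // apply your validation logic here and count failures, e.g. invalid++
        }

        context.getResultSender().lastResult(invalid);
    }

    @Override
    public String getId() {
        return ValidatePeopleFunction.class.getSimpleName();
    }
}

It would then be invoked with something like FunctionService.onRegion(people).execute(new ValidatePeopleFunction()), collecting the per-member results on the caller.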
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG.
For Function Execution, you can use SD GemFire's Function Execution annotation support.
Be careful though: using the SD Repository abstraction inside a Function context will not automatically limit the query to the "local" data set of the PARTITION Region. SD Repositories always work on the entire data set of the "logical" Region, where the data is distributed across the nodes in the cluster in a partitioned (sharded) setup.
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is, in which case a Function might be the most advantageous approach, i.e. bringing the logic to your data rather than the data to your logic. The latter involves selecting the matching data on the nodes where it resides, which could mean several network hops depending on your topology and configuration (i.e. "single-hop access", etc.), serializing the data to send it over the wire, thereby increasing the saturation of your network, and so on.
Depending on your UC, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for verifying as early as possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John
Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.

How to Let Spark Handle Bigger Data Sets?

I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. These tables also have some columns that can be used as the 'key' to partition the tables.
Previously, the app ran fine, but now the data set has 3-4 times as much data as before. My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GBs.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (with the 'key' I mentioned above) into many smaller data sets that can be dealt with independently. But the burden would be on me... it defeats the very reason to use Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is less than a certain configurable size, Spark broadcasts the entire table to each executor so that the join may be performed locally everywhere. Your observation above is consistent with this. You can also provide the broadcast hint explicitly to Spark, like so: df1.join(broadcast(df2))
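In Java, a minimal sketch of that hint might look like this (the table names and join column are assumptions for the example):

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Broadcasting the smaller side lets every executor perform the join locally,
// avoiding the large shuffle.
SparkSession spark = SparkSession.builder().appName("broadcast-join-example").getOrCreate();

Dataset<Row> large = spark.table("large_table");
Dataset<Row> small = spark.table("small_dim");

Dataset<Row> joined = large.join(broadcast(small),
        large.col("join_key").equalTo(small.col("join_key")));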
Other than that, can you please provide more specifics about your problem?
Some time ago I was also grappling with joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs (not the Dataset API). I wrote about my findings [here]; these may be of some use to you as you try to reason about the underlying data shuffle.
Update: according to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property; its default value is 10 MB, and it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, this automatic size detection is supported only for Hive Metastore tables.
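For example, a small sketch of adjusting that threshold at runtime, assuming a SparkSession named spark (the 100 MB value is only an example; setting it to -1 disables automatic broadcast joins):

// Raise the threshold so somewhat larger tables still qualify for a broadcast join.
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", String.valueOf(100L * 1024 * 1024));

The same property can also be passed with --conf on spark-submit.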