I have two large tables A and B, and I want to join these two tables on two columns, say, project_id and customer_id.
When I do the join in Apache Ignite, I find that the performance is very bad. After investigating, I think the problem is that the data resides on different nodes more or less randomly.
When the join happens, data has to be transferred between nodes so that rows with the same project_id and customer_id from A and B end up on the same node.
For my case, I see two possible solutions:
1. Load the data into the Ignite cluster based on A's and B's project_id and customer_id, so that there is no data transfer during the join. This solution works but is not flexible.
2. Use only one node to hold all the data. This solution works, but a single node's memory limits how much data can be held.
Which solution would be the better choice? Thanks!
The former (1.) is recommended. You should load the data so that rows with the same project_id and customer_id end up on the same node for both tables.
This is called affinity collocation and it is paramount to get right to have good performance of Ignite queries (and sometimes for them to work at all).
Ignite will take care of affinity collocation for you if you set it up correctly, but there are a few caveats right away (a DDL sketch follows this list):
The affinity key has to be a part of the primary key (not a value field).
The affinity key has to be a single field, so you have to choose between project_id and customer_id, use a composite type (a nested POJO, with its own implications), or perhaps derive a synthetic value from both.
There is a possibility of uneven partition distribution. Imagine you have a single very large customer (or project): while processing that customer, all nodes but one will be idle and unused.
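For illustration, here is a minimal sketch in Ignite SQL DDL, assuming customer_id is chosen as the single affinity key (the val column is a placeholder; in the Java API the same effect is achieved by marking a key field with the @AffinityKeyMapped annotation):
CREATE TABLE A (
  project_id  BIGINT,
  customer_id BIGINT,
  val         VARCHAR,
  PRIMARY KEY (project_id, customer_id)
) WITH "template=partitioned,affinity_key=customer_id";
CREATE TABLE B (
  project_id  BIGINT,
  customer_id BIGINT,
  val         VARCHAR,
  PRIMARY KEY (project_id, customer_id)
) WITH "template=partitioned,affinity_key=customer_id";
-- Rows of A and B sharing a customer_id now live on the same node,
-- so this join requires no cross-node data transfer:
SELECT *
FROM A
JOIN B ON A.project_id = B.project_id
      AND A.customer_id = B.customer_id;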
Unsure if I cluster correctly. Basically, I am looking at GCP billing info for, say, 50 clients. Each client has a Billing_ID and I cluster on that billing_ID. I use the clustered table for a Data Studio dashboard.
See the SQL query below for what I do right now:
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT
*
FROM
`datagcp`
WHERE
usage_start_time BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP(CURRENT_DATE)
It is successfully clustered like this; I just do not see a noticeable query performance increase!
I thought that by clustering it on billing_ID I would see an increase in dashboard performance.
Please consider the following points:
Cluster structure
A cluster field is composed of an array of fields, like boxes, from outer to inner, as stated in the BigQuery documentation:
When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
This means, as @Gordon wrote, that the WHERE clause in your query needs to filter from the outer field to the inner one to make the most of your cluster field. In your case, if userId is part of the WHERE clause, you need to change your cluster field to match it (see the example below).
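For instance, a sketch of a dashboard query that can actually exploit the clustering, assuming the standard GCP billing export columns (the cost column and the literal billing account id are placeholders):
SELECT
  billing_account_id,
  SUM(cost) AS total_cost
FROM
  `dashboardgcp`
WHERE
  DATE(usage_start_time) >= '2019-06-01'          -- prunes partitions
  AND billing_account_id = '01AB23-CD45EF-GH67IJ' -- prunes clustered blocks
GROUP BY
  billing_account_id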
Cluster limitation
Clustering typically works better for queries that scan over 1 GB of data, so if you are not scanning that much data you won't see the improvement you are looking for.
Cluster with Ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that clustering is a process BigQuery performs offline, separately from the insert operation and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this you will need to use the MERGE command to rebuild your cluster in order to get the most out of it (see the sketch after the quote below).
From the docs:
“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
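A hedged sketch of the blunt variant, re-materializing the table in place rather than using MERGE (table definition taken from the question; BigQuery has since added automatic re-clustering, so check whether this is still needed):
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT * FROM `dashboardgcp`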
I want all the nodes in a cluster to have an equal data load. With the default affinity function that is not happening.
As of now, we have 3 nodes. We use group ID as the affinity key, we have 3 group IDs (1, 2 and 3), and we limit cache partitions to the group IDs. Overall, nodes = group IDs = cache partitions, so that each node has an equal number of partitions.
Would it be okay to write a custom affinity function? What would we lose by doing so? Has anyone written a custom affinity function?
The affinity function doesn't guarantee an even distribution across all nodes. It's statistical... and three values isn't really enough to make sure the data is "fairly" distributed.
So, yes, writing a new affinity function would work. The downsides: you need to make it fast (it's called a lot) and you'd be hard-coding it to your current node topology. What happens when you add a new node? What happens when a node fails? Also, you'd potentially be putting all your data into three partitions, which makes it harder to scale out (one of the main advantages of Ignite's architecture).
As an alternative, I'd look at your data model. Splitting your data into three chunks is too coarse for things to work automatically.
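As a hedged illustration in Ignite SQL DDL (the table and column names are hypothetical): with an affinity key that takes only three distinct values, entries can occupy at most three of the (default) 1024 partitions no matter how many nodes you add, while a higher-cardinality key spreads naturally:
CREATE TABLE jobs_coarse (
  job_id   BIGINT,
  group_id INT,                 -- only 3 distinct values: at most 3 partitions hold data
  payload  VARCHAR,
  PRIMARY KEY (job_id, group_id)
) WITH "template=partitioned,affinity_key=group_id";
CREATE TABLE jobs_fine (
  job_id      BIGINT,
  customer_id BIGINT,           -- high-cardinality key: data spreads over all partitions
  payload     VARCHAR,
  PRIMARY KEY (job_id, customer_id)
) WITH "template=partitioned,affinity_key=customer_id";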
How is a spatial join query executed among the nodes in PARTITIONED mode? Since Ignite partitions the data (1024 partitions by default) among the nodes using rendezvous affinity hashing, how is the join executed among the partitions? Suppose I have two spatial datasets in the cache (pCache and qCache), each containing 10 partitions (1, ..., 10). How does Ignite perform the join on these two datasets? Is partition 1 of pCache joined with partition 1 of qCache?
My second question: how does Ignite perform the same operation in the case of a distributed join?
There is no correspondence between partitions of different caches. If you run a join operation, then by default only a local lookup will be performed. If data is not collocated, then this approach may give you a partial result.
When all-to-all mapping is performed, every node has to communicate with every other node, so on the order of N² messages are sent in the cluster in total, where N is the number of nodes. This is called a distributed join, and it affects performance significantly. It may be enabled either in the connection string in the case of the JDBC driver, or by using the SqlFieldsQuery#setDistributedJoins(...) method in the case of the cache query API (see the sketch below).
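For reference, a hedged sketch of both ways to enable it (the host, table and column names are placeholders):
-- Thin JDBC driver: enable distributed joins in the connection string:
--   jdbc:ignite:thin://127.0.0.1?distributedJoins=true
-- Cache query API (Java): new SqlFieldsQuery(sql).setDistributedJoins(true)
SELECT p.id, q.id
FROM pTable AS p
JOIN qTable AS q ON p.region_id = q.region_id;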
The recommended way to do joins is to collocate the data in such a way that no distributed joins are needed. Ignite has a feature called affinity collocation, designed specially for this purpose. You can specify a field of an object that will be used to calculate the affinity function. The value of this field doesn't have to be unique, but it should be a part of the key. So, if you want to perform joins on two tables, you may collocate them by affinity, and no distributed joins will be needed.
I am new to Apache Ignite and come from a Data Warehousing background.
So pardon me if I try to relate to Ignite through DBMS jargon.
I have gone through forums but I am still unclear about some of the basics.
I also would like specific answers to the scenario I have posted later.
1.) CacheMode=PARTITIONED
a.) When a cache is declared as partitioned, does the data get equally
partitioned across all nodes by default?
b.) Is there an option to provide a "partition key" based on which the data
would be distributed across the nodes? Is this what we call the Affinity
Key?
c.) How is partitioning different from affinity and can a cache have both
partition and affinity key?
2.) Affinity Concept
With an Affinity Key defined, when I load data (using loadCache()) into a partitioned cache, will the source rows be sent to the node they belong to or all the nodes on the cluster?
3.) If I create one index on the cache, does it by default become the partition/
affinity key as well? In such a scenario, how is a partition different from an index?
SCENARIO DESCRIPTION
I want to load data from a persistence layer into a Staging Cache (assume ~2B records) using loadCache(). The cache resides on a 4-node cluster.
a.) How do I load data such that each node has to process only 0.5B records?
Is it by using PARTITIONED cache mode and defining an affinity key?
Then I want to read transactions from the Staging Cache in TRANSACTIONAL atomicity mode, look up a Target Cache and do some operations.
b.) When I do the lookup on the Target Cache, how can I ensure that the lookup happens only on the node where the data resides, and not on all the nodes on which the Target Cache resides?
Would that be using the AffinityKeyMapper API? If yes, how?
c.) Let's say I wanted to do a lookup on a key other than the affinity key column; can creating an index on the lookup column help? Would I end up scanning all nodes in that case?
Staging Cache
CustomerID
CustomerEmail
CustomerPhone
Target Cache
Seq_Num
CustomerID
CustomerEmail
CustomerPhone
StartDate
EndDate
This is answered on Apache Ignite users forum: http://apache-ignite-users.70518.x6.nabble.com/Understanding-Cache-Key-Indexes-Partition-and-Affinity-td11212.html
Ignite uses AffinityFunction [1] for data distribution. AF implements two mappings: key->partition and partition->node.
The Key->Partition mapping assigns each cache entry to a partition. It is not concerned with backups, but with data collocation/distribution over partitions.
Usually the entry key (actually, its hash code) is used to calculate the partition an entry belongs to.
But you can use an AffinityKey [2], which will be used instead, to manage data collocation. See also the org.apache.ignite.cache.affinity.AffinityKey javadoc.
The Partition->Node mapping determines the primary and backup nodes for a partition. It is not concerned with data collocation, but with backups and partition distribution among nodes.
Cache.loadCache just makes all nodes call the localLoadCache method, which calls CacheStore.loadCache. So each grid node will load all the data from the cache store and then discard the data that is not local to that node.
The same data may reside on several nodes if you use backups. The AffinityKey should be a part of the entry key, and if an AffinityKey mapping is configured, then the AffinityKey will be used instead of the entry key for the entry->partition mapping and will be passed to the AffinityFunction.
Indexes always reside on the same node as the data.
a. To achieve this, you should implement the CacheStore.loadCache method to load data only for certain partitions. E.g. you can store a partition ID for each row in the database (see the sketch below).
However, if you change the affinity function or the number of partitions, you have to update the partition IDs for the entries in the database as well.
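A hedged sketch of the database side of this idea (the table and column names are hypothetical; each node would obtain the partitions it owns from the Java API, e.g. Affinity#primaryPartitions(ClusterNode), and substitute them into the query):
-- Precompute and store each row's Ignite partition id, then let every
-- node load only the partitions it currently owns:
SELECT customer_id, customer_email, customer_phone
FROM staging_source
WHERE partition_id IN (0, 17, 42);  -- placeholder: this node's partitions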
The other way: if it is possible, you can load all the data into a single node and then add the other nodes to the grid. The data will be rebalanced over the nodes automatically.
b. The AffinityKey is always used if it is configured, since it should be a part of the entry key. So the lookup will always happen on the node where the data resides.
c. I can't understand the question. Would you please clarify it, if it is still relevant?
How small should a table using Diststyle ALL be in Amazon Redshift?
It says here: http://dwbitechguru.blogspot.com/2014/11/performance-tuning-in-amazon-redshift.html
that for very small tables, Redshift should use DISTSTYLE ALL instead of EVEN or KEY. How small is small? If I were to specify a row count in the WHERE clause of the query select relname, reldiststyle from pg_class, how many rows should I specify?
It really depends on the size of the cluster you are using. DISTSTYLE ALL will copy the data of your table to all nodes, to mitigate the data transfer requirement across nodes. You can find out the size of your table and the storage available on your Redshift nodes; if you can afford to store a copy of the table on every node, do it! (See the query below.)
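For example, a quick way to check table sizes against your cluster's capacity via the svv_table_info system view (the size column counts 1 MB blocks):
SELECT "table", diststyle, size AS size_mb, tbl_rows
FROM svv_table_info
ORDER BY size DESC;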
Also, if you have a requirement to join other tables with this table very frequently, say in 70% of your queries, I believe it is worth the space if you want better query performance.
If the join keys across your tables have similar cardinality, you can instead distribute all the tables on that key, so that matching keys lie on the same node, which obviates replicating the data.
I would suggest trying out the two options above, comparing average run times over around 10 queries, and then coming to a decision.
Considering a star schema, distribution style ALL is normally used for dimension tables. This has the advantage of speeding up joins; let's explain it through an example. If we would like to obtain the quantity sold of each product by country, we would need to join fact_sales with the dim_store table on the store_id key.
So setting DISTSTYLE ALL on dim_store lets the join be computed in parallel, avoiding the shuffling we would pay for with DISTSTYLE EVEN. Alternatively, you can let Redshift choose an optimal distribution style automatically by setting DISTSTYLE AUTO; for more info check this link. A DDL sketch follows.
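A minimal DDL sketch of this layout (column types are assumptions; the DISTSTYLE/DISTKEY clauses are the point):
CREATE TABLE dim_store (
  store_id INT,
  country  VARCHAR(64)
) DISTSTYLE ALL;                      -- full copy on every node: no data movement for joins
CREATE TABLE fact_sales (
  store_id   INT,
  product_id INT,
  quantity   INT
) DISTSTYLE KEY DISTKEY (store_id);   -- rows spread across nodes by store_id
SELECT d.country, f.product_id, SUM(f.quantity) AS qty_sold
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
GROUP BY d.country, f.product_id;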