What is the difference between the affinityCall and affinityRun methods?
The doc doesn't say: https://apacheignite.readme.io/docs/affinity-collocation
And the javadoc is nearly identical:
/**
 * Executes given job on the node where partition is located (the partition is primary on the node)
 * <p>
 * It's guaranteed that the data of all the partitions of all participating caches,
 * the affinity key belongs to, will be present on the destination node throughout the job execution.
 *
 * @param cacheNames Names of the caches to reserve the partition. The first cache uses for affinity co-location.
 * @param partId Partition to reserve.
 * @param job Job which will be co-located on the node with given affinity key.
 * @return Job result.
 * @throws IgniteException If job failed.
 */
vs.
/**
 * Executes given job on the node where partition is located (the partition is primary on the node)
 * <p>
 * It's guaranteed that the data of all the partitions of all participating caches,
 * the affinity key belongs to, will be present on the destination node throughout the job execution.
 *
 * @param cacheNames Names of the caches to reserve the partition. The first cache is used for
 * affinity co-location.
 * @param partId Partition number.
 * @param job Job which will be co-located on the node with given affinity key.
 * @throws IgniteException If job failed.
 */
Thank you
The difference is that IgniteCompute#affinityCall() returns the result of the computation, while IgniteCompute#affinityRun() returns void.
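The same call-vs-run split can be sketched in plain Python, with concurrent.futures standing in for Ignite's compute API (this is an analogy, not the Ignite API; the job functions are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def compute_sum():
    # an affinityCall-style job: it produces a value for the caller
    return sum(range(10))

results = []
def record_sum():
    # an affinityRun-style job: executed purely for its side effects
    results.append(sum(range(10)))

value = executor.submit(compute_sum).result()  # the caller receives 45
done = executor.submit(record_sum).result()    # always None, like void
executor.shutdown()
print(value, done, results)
```

Both jobs execute on the worker; the only difference is whether anything comes back to the caller.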
We recently moved to BigQuery's flat-rate billing model. This means we purchase a finite amount of slots and all BigQuery queries in our organisation will make use of those slots.
I am wondering whether the number of slots used by a query is affected by the use of LIMIT. To put it another way, will this query:
select *
from project.dataset.table
use more slots than this one?
select *
from project.dataset.table
limit 10
?
I was using the PostgreSQL Java JDBC driver. This error popped up when I was running a large batch query SELECT * FROM mytable where (pk1, pk2, pk3) in ((?,?,?),(?,?,?).....) with ~20k composite ids (i.e., ~60k placeholders).
The call stack for the exception:
org.postgresql.util.PSQLException: ERROR: stack depth limit exceeded
Hint: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2552)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2284)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
at io.agroal.pool.wrapper.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:78)
...
This looks like a server-side error. It's tricky because:
it's hard for me to configure server-side settings...
even if I can configure them, it's hard to know how large a query will blow up the server-side stack
Any ideas for this? Or what's the best practice for such a large id query?
I am not sure about the maximum number of entries for an IN clause (1000?), but it is way less than 20K. The common way to handle that many is to create a staging table, in this case containing the variables. Call them v1, v2, v3. Now load the staging table from a file (CSV), then use:
select *
from mytable
where (pk1, pk2, pk3) in
(select v1,v2,v3
from staging_table
);
With this format there is no limit on the number of items in the staging table.
Once the process is complete, truncate the staging table in preparation for the next cycle.
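A minimal sketch of the staging-table pattern, using Python's built-in sqlite3 as a stand-in for the real PostgreSQL server (table and column names follow the answer above and are illustrative; in production the staging table would be bulk-loaded with COPY rather than executemany):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (pk1 INTEGER, pk2 INTEGER, pk3 INTEGER, payload TEXT)")
cur.execute("CREATE TABLE staging_table (v1 INTEGER, v2 INTEGER, v3 INTEGER)")

# Some sample data in the main table.
cur.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(i, i, i, f"row-{i}") for i in range(100)],
)

# Bulk-load the wanted composite ids into the staging table.
wanted = [(i, i, i) for i in range(0, 100, 10)]
cur.executemany("INSERT INTO staging_table VALUES (?, ?, ?)", wanted)

# One fixed-size query replaces the huge IN list of 60k placeholders,
# so the server never has to parse a giant expression tree.
rows = cur.execute(
    """
    SELECT m.*
    FROM mytable m
    WHERE (m.pk1, m.pk2, m.pk3) IN (SELECT v1, v2, v3 FROM staging_table)
    """
).fetchall()
print(len(rows))

cur.execute("DELETE FROM staging_table")  # truncate for the next cycle
conn.close()
```

The query text stays constant no matter how many ids are staged, which is exactly what sidesteps the stack-depth error.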
I have a large table (about 59 million rows, 7.1 GB), already ordered as I want, and I want to query this table and get a row_number() for each row of the table.
Unfortunately I get the error
Resources exceeded during query execution: The query could not be executed in the allotted memory.
Is there a way to increase allotted memory in BigQuery ?
Here is my query. I don't see how I can simplify it, but if you have any advice I'll take it:
SELECT
row_number() over() as rowNumber,
game,
app_version,
event_date,
user_pseudo_id,
event_name,
event_timestamp,
country,
platform
FROM
`mediation_time_BASE`
Here is the complete error message :
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 146% of limit. Top memory consumer(s): analytic OVER() clauses: 98% other/unattributed: 2%
Edit:
The query here represents a list of event starts and ends, and I need to link each start event with its end, so I follow this tip: https://www.interfacett.com/blogs/how-to-use-values-from-previous-or-next-rows-in-a-query-in-sql-server/
For that I need the rows numbered with row_number() in order to split this subquery in two (event starts on one hand and event ends on the other), join them, and then have one row per event with the start and end of the event, as follows (where subquery represents the query with the row_number()):
SELECT
(case when lead(inter.rowNumber) OVER(ORDER BY inter.rowNumber) - inter.rownumber =1
then lead(inter.rowNumber) OVER(ORDER BY inter.rowNumber)
else inter.rownumber end) as rowNumber,
min(inter_success.rowNumber) as rowNumber_success,
inter.game,
inter.app_version,
inter.event_date,
inter.user_pseudo_id,
inter.event_timestamp as event_start,
min(inter_success.event_timestamp) as event_end,
inter_success.event_name as results
FROM
(SELECT * FROM `subquery` where event_name = 'interstitial_fetch') as inter INNER JOIN
(SELECT * FROM `subquery` where event_name = 'interstitial_fetch_success') as inter_success
ON inter.rowNumber < inter_success.rowNumber and inter.game= inter_success.game and inter.app_version = inter_success.app_version and inter.user_pseudo_id = inter_success.user_pseudo_id
GROUP BY inter.rowNumber,inter.game,inter.app_version,inter.event_date,inter.user_pseudo_id,inter.event_timestamp,inter_success.event_name
This works fine with a smaller dataset, but doesn't for 59 million rows...
TL;DR: You don't need to increase the memory for BigQuery.
In order to answer that you need to understand how BigQuery works. BigQuery relies on executor machines called slots. These slots are all similar in type and have a limited amount of memory.
Now, many of the operations split the data between multiple slots (like GROUP BY), each slot performs a reduction on a portion of the data and sends the result upwards in the execution tree.
Some operations must be performed on a single machine (like SORT and OVER).
When your data overflows the slot's memory, you get the described error. Hence, what you would really need is to change the slot type to a machine with more memory. That's unfortunately not possible. You will have to follow the query best practices to avoid single-slot operations on too much data.
One thing that may help you is to calculate the OVER() clause with PARTITION BY, so that each partition is sent to a different machine. Another thing that usually helps is to move to standard SQL if you haven't done that yet.
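A minimal sketch of the PARTITION BY idea, using Python's built-in sqlite3 window functions as a stand-in for BigQuery (column names follow the question; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (game TEXT, event_timestamp INTEGER)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("g1", 3), ("g1", 1), ("g2", 2), ("g2", 5), ("g1", 2)],
)

# A global OVER () funnels every row through one worker; partitioning the
# numbering restarts it per group, letting groups be processed independently.
rows = cur.execute(
    """
    SELECT game,
           event_timestamp,
           ROW_NUMBER() OVER (PARTITION BY game ORDER BY event_timestamp) AS rowNumber
    FROM events
    ORDER BY game, rowNumber
    """
).fetchall()
for row in rows:
    print(row)
```

Since the later join in the question matches rows on game, app_version and user_pseudo_id anyway, numbering within those partitions may be sufficient, and it avoids pushing all 59 million rows through a single slot.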
As per the official documentation, you need to request an increase in the slots for your reservation...
Maximum concurrent slots per project for on-demand pricing — 2,000
The default number of slots for on-demand queries is shared among all
queries in a single project. As a rule, if you're processing less than
100 GB of queries at once, you're unlikely to be using all 2,000
slots.
To check how many slots you're using, see Monitoring BigQuery using
Stackdriver. If you need more than 2,000 slots, contact your sales
representative to discuss whether flat-rate pricing meets your needs.
Refer to this for slots; the process for requesting more memory is here.
For increasing the BigQuery slots in the project, you may have to contact Google Cloud support or buy reservations.
I assume you were using a WITH clause for the subquery, which in turn runs out of memory. My proposed solution is to create an expiring table that expires automatically in a few days, with the syntax
OPTIONS(expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 DAY))
With this approach, I imagine inserting the 59 million rows of the query result into an expiring table will use far fewer slots. Replace your subsequent subquery with the expiring table name.
To avoid being billed for the expiring table, you may delete it after all the dependent queries have executed.
TL;DR
If a replica node goes down and a new partition map is not yet available, will a read with consistency level = ALL fail?
Example:
Given this Aerospike cluster setup:
- 3 physical nodes: A, B, C
- Replicas = 2
- Read consistency level = ALL (reads consult both nodes holding the data)
And this sequence of events:
- A piece of data "DAT" is stored into two nodes, A and B
- Node B goes down.
- Immediately after B goes down, a read request ("request 1") is performed with consistency ALL.
- After ~1 second, a new partition map is generated. The cluster is now aware that B is gone.
- "DAT" now becomes replicated at node C (to preserve replicas=2).
- Another read request ("request 2") is performed with consistency ALL.
It is reasonable to say "request 2" will succeed.
Will "request 1" succeed? Will it:
a) Succeed because two reads were attempted, even if one node was down?
b) Fail because one node was down, meaning only 1 copy of "DAT" was available?
Request 1 and request 2 will both succeed. The behavior of the consistency level policies is described here: https://discuss.aerospike.com/t/understanding-consistency-level-overrides/711.
The gist for read/write consistency levels is that they only apply when there are multiple versions of a given partition within the cluster. If there is only one version of a given partition in the cluster then a read/write will only go to a single node regardless of the consistency level.
So given an Aerospike cluster of A,B,C where A is master and B is
replica for partition 1.
Assume B fails and C is now replica for partition 1. Partition 1
receives a write and the partition version is changed.
Now B is restarted and returns to the cluster. Partition 1 on B will
now be different from A and C.
A read arrives with consistency all to node A for a key on Partition
1 and there are now 2 versions of that partition in the cluster. We
will read the record from nodes A and B and return the latest
version (not fail the read).
Time lapse
Migrations are now complete, for partition 1, A is master, B is
replica, and C no longer has the partition.
A read arrives with consistency all to node A. Since there is only
one version of Partition 1, node A responds to the client without
consulting node B.
I have used the default degree of parallelism for performance tuning and got the best results. But I am unsure of the impact when some other job accesses the same table at the same time.
sample code below.
select /*+ FULL(customer) PARALLEL(customer, default) */ customer_name from customer;
The number of servers available is 8. How does this default degree of parallelism work? Will it have an impact if some other job runs a query on the same table at the same time? Before moving this query to production, I would like to know whether this will have an impact. Thanks!
From documentation:
PARALLEL (DEFAULT):
The optimizer calculates a degree of parallelism equal to the number
of CPUs available on all participating instances times the value of
the PARALLEL_THREADS_PER_CPU initialization parameter.
The maximum degree of parallelism is limited by the number of CPUs in the system. The formula used to calculate the limit is:
PARALLEL_THREADS_PER_CPU * CPU_COUNT * the number of instances available
By default this means all the open instances on the cluster, but it can be constrained using PARALLEL_INSTANCE_GROUP or a service specification. This is the default.
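Plugging example numbers into the formula above (the thread-per-CPU value and instance count are assumptions for illustration; only the count of 8 comes from the question, and it is read here as the CPU count):

```python
# Assumed values -- check V$PARAMETER on the actual system.
parallel_threads_per_cpu = 2   # common default for PARALLEL_THREADS_PER_CPU
cpu_count = 8                  # "The number of servers available is 8"
instances = 1                  # single-instance database assumed

# DOP limit = PARALLEL_THREADS_PER_CPU * CPU_COUNT * number of instances
default_dop = parallel_threads_per_cpu * cpu_count * instances
print(default_dop)
```

So under these assumptions a single PARALLEL(customer, default) query could claim 16 parallel execution servers, which is why a concurrent job on the same table may find fewer servers available.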