Is there a param for setting 'Application Priority'? - hive

I am looking for a way to set the application priority for a task in Hive. When a task is submitted, I want to give it a high priority (like 100). This value can be seen on the YARN applications page.
I am looking for a param like 'set mapreduce.map.memory.mb=4096;', so that I can set the priority for the task the same way.

In a Hive session, it is also possible to set the same property used in the MapReduce job configuration:
SET mapreduce.job.priority=VERY_HIGH;
For an already running job, by its job ID:
hadoop job -set-priority <job-id> <priority>
There are five possible priority levels:
VERY_HIGH
HIGH
NORMAL
LOW
VERY_LOW
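For example, in a Hive session (the job ID below is made up for illustration):

SET mapreduce.job.priority=VERY_HIGH;

hadoop job -set-priority job_1499999999999_0001 HIGH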
Queue priority configuration: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setup_for_application_priority.

Related

Spring JMS message-driven-channel-adapter - the number of consumers doesn't reduce to the standard level

I have a message-driven-channel-adapter with max-concurrent-consumers set to 100 and concurrent-consumers set to 2. During a load test I saw the number of concurrent consumers increase, but after the load test it did not come back down to the configured level. I am checking it with the RabbitMQ management portal.
When the project is restarted (with no load test) the GET (Empty) rate is 650/s, but after a load test it stays at about 2500/s and does not return to 650/s. It seems the number of consumers is increased but never reduced back to the original value.
How can I make it reduce to the normal level again?
Here is my message-driven-channel-adapter definition:
<int-jms:message-driven-channel-adapter id="inboundAdapter"
        auto-startup="true"
        connection-factory="jmsConnectionFactory"
        destination="inboundQueue"
        channel="requestChannel"
        error-channel="errorHandlerChannel"
        receive-timeout="-1"
        concurrent-consumers="2"
        max-concurrent-consumers="100" />
With receive-timeout="-1", the container has no control over an idle consumer (it is blocked in the JMS client).
You also need to set max-messages-per-task for the container to consider stopping a consumer.
<int-jms:message-driven-channel-adapter id="inboundAdapter"
        auto-startup="true"
        connection-factory="jmsConnectionFactory"
        destination-name="inboundQueue"
        channel="requestChannel"
        error-channel="errorHandlerChannel"
        receive-timeout="5000"
        concurrent-consumers="2"
        max-messages-per-task="10"
        max-concurrent-consumers="100" />
The time elapsed for an idle consumer is receiveTimeout * maxMessagesPerTask.
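With the values above, for example, an idle consumer becomes eligible to be stopped after roughly 5000 ms × 10 = 50 seconds.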

Change JTA transaction timeout from default to custom

I am using Atomikos for JTA transaction.
I have following setting for JTA:
UserTransactionImp userTransactionImp = new UserTransactionImp();
userTransactionImp.setTransactionTimeout(900);
but when my code performs a JTA transaction that takes more than 5 minutes (the default value), it throws an exception:
Caused by: com.atomikos.icatch.RollbackException: Prepare: NO vote
at com.atomikos.icatch.imp.ActiveStateHandler.prepare(ActiveStateHandler.java:231)
at com.atomikos.icatch.imp.CoordinatorImp.prepare(CoordinatorImp.java:681)
at com.atomikos.icatch.imp.CoordinatorImp.terminate(CoordinatorImp.java:970)
at com.atomikos.icatch.imp.CompositeTerminatorImp.commit(CompositeTerminatorImp.java:82)
at com.atomikos.icatch.imp.CompositeTransactionImp.commit(CompositeTransactionImp.java:336)
at com.atomikos.icatch.jta.TransactionImp.commit(TransactionImp.java:190)
... 25 common frames omitted
It looks like it is taking the default JTA transaction timeout, even though I am setting the timeout explicitly to 15 minutes (900 seconds).
I tried the following properties in the application.properties file, however it still takes the default timeout value (300 seconds).
spring.jta.atomikos.properties.max-timeout=600000
spring.jta.atomikos.properties.default-jta-timeout=10000
I have also tried the property below, but no luck:
spring.transaction.default-timeout=900
Can anyone suggest whether I need any other setting? I am using the WildFly plugin, Spring Boot and the Atomikos API for JTA transactions.
From the Atomikos documentation:
com.atomikos.icatch.max_timeout
Specifies the maximum timeout (in milliseconds) that can be allowed for transactions. Defaults to 300000. This means that calls to UserTransaction.setTransactionTimeout() with a value higher than configured here will be max'ed to this value. For 4.x or higher, a value of 0 means no maximum (i.e., unlimited timeouts are allowed).
Indeed, if you take a look at the Atomikos library source code (for both versions 4.0.0M4 and 3.7.0), in the createCC method from class com.atomikos.icatch.imp.TransactionServiceImp you will see:
if ( timeout > maxTimeout_ ) {
    timeout = maxTimeout_;
    //FIXED 20188
    LOGGER.logWarning ( "Attempt to create a transaction with a timeout that exceeds maximum - truncating to: " + maxTimeout_ );
}
So any attempt to specify a longer transaction timeout gets capped to maxTimeout_ which has a default value of 300000 set during initialization if none is specified.
You can set the com.atomikos.icatch.max_timeout as a JVM argument with:
-Dcom.atomikos.icatch.max_timeout=900000
or you could use The Advanced Case recipe specified in the Configuration for Spring Section from the Atomikos documentation.
I've resolved a similar problem where configuration in Spring Boot's application.yml (or application.properties) did not get picked up.
There was even a log message about this that I later found mentioned in the official docs.
Instead, I added a transactions.properties file (next to the application.yml) where I set my desired properties.
# Atomikos properties
# Service must be defined!
com.atomikos.icatch.service = com.atomikos.icatch.standalone.UserTransactionServiceFactory
# Override default properties.
com.atomikos.icatch.log_base_dir = ./atomikos
Some properties can be set within transactions.properties and others within the jta.properties file.
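For example, to lift the timeout cap discussed above, a minimal addition to transactions.properties might be (values are illustrative, in milliseconds):

# Allow transactions of up to 15 minutes
com.atomikos.icatch.max_timeout = 900000
com.atomikos.icatch.default_jta_timeout = 900000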

GraphDB's loadrdf tool loads ontology and data very slowly

I am using the GraphDB loadrdf tool to load an ontology and fairly big data. I set pool.buffer.size=800000 and the JVM -Xmx to 24g. I tried both parallel and serial modes. Both slow down once the repo's total statements go over about 10k, eventually dropping to 1 or 2 statements/second. Does anyone know whether this is normal behavior for loadrdf, or is there a way to optimize the performance?
Edit: I have increased tuple-index-memory. See part of my repo ttl configuration:
owlim:entity-index-size "45333" ;
owlim:cache-memory "24g" ;
owlim:tuple-index-memory "20g" ;
owlim:enable-context-index "false" ;
owlim:enablePredicateList "false" ;
owlim:predicate-memory "0" ;
owlim:fts-memory "0" ;
owlim:ftsIndexPolicy "never" ;
owlim:ftsLiteralsOnly "true" ;
owlim:in-memory-literal-properties "false" ;
owlim:transaction-mode "safe" ;
owlim:transaction-isolation "true" ;
owlim:disable-sameAs "true";
But somehow the process still slows down. It starts with "Global average rate: 1,402 st/s" but drops to "Global average rate: 20 st/s" after "Statements in repo: 61,831". I give my JVM -Xms24g -Xmx36g.
Can you please post your repository configuration? Inside it there is a parameter tuple-index-memory - it determines the number of changed disk pages we are allowed to keep in memory. The bigger this value is, the fewer flushes we do.
Check whether this is set to a value like 20G in your setup and retry the process.
I've looked at your repository configuration ttl. There is a parameter entity-index-size=45333 whose value needs to be increased, e.g. set it to 100 million (entity-index-size=100000000). The default value for that parameter in GraphDB 7 is 10M, but since you've set it explicitly it gets overridden.
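For example, the corrected line in the repository ttl would read:

owlim:entity-index-size "100000000" ;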
You can read more about that parameter here

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
However, a delay of 2 s is not efficient enough, and if I set DOWNLOAD_DELAY = 0, the crawler can only fetch about 10 pages before the target site starts returning something like "you are requesting too frequently".
What I want is to keep the download delay at 0 and, once the "requesting too frequently" message is found in the HTML, change the delay to 2 s. After a while it should switch back to zero.
Is there any module that can do this, or any better idea to handle such a case?
Update:
I found that there is an extension called AutoThrottle, but can it be customized with logic like this?
if "requesting too frequently" is found:
    increase DOWNLOAD_DELAY
If, right after you get the anti-spider page, you can get a data page again within 2 seconds, then what you are asking for probably requires writing a downloader middleware that checks for the anti-spider page, moves all scheduled requests to a renew-queue, and starts a looping call when the spider is idle to pull requests from the renew-queue (the looping interval is your hack for the new download delay). It then has to decide when the download delay is no longer necessary (this requires some testing), stop the looping, and reschedule all the requests in the renew-queue back to the Scrapy scheduler. You would need to use a Redis queue in the case of a distributed crawl. A heavily simplified sketch of the middleware idea is shown below.
With the download delay set to 0, in my experience throughput can easily go above 1000 items/min, so if the anti-spider page pops up after only 10 responses, it is not worth the effort. Instead, try to find out the fastest rate your target server allows, maybe 1.5 s, 1 s, 0.7 s, 0.5 s, etc., and then redesign your product to take into consideration the throughput your crawler can actually achieve.
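That said, if you do want the dynamic behavior, a heavily simplified sketch of the middleware idea (no renew-queue, just adjusting the per-domain download slot's delay) could look like the following. The ban text, the 2 s back-off, and the recovery step are assumptions to adapt for your target site:

class AdaptiveDelayMiddleware:

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # each domain has a download slot whose delay we can tweak;
        # this is the same mechanism AutoThrottle uses
        slot_key = request.meta.get('download_slot')
        slot = self.crawler.engine.downloader.slots.get(slot_key)
        if slot is None:
            return response
        if b'requesting too frequently' in response.body:
            slot.delay = 2.0  # back off
            # retry the banned request, bypassing the dupefilter
            return request.replace(dont_filter=True)
        slot.delay = max(0.0, slot.delay - 0.1)  # creep back toward zero
        return response

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.AdaptiveDelayMiddleware': 543} (path and priority are placeholders).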
You can use the AutoThrottle extension now. It is turned off by default. You can add these parameters to your project's settings.py file to enable it:
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, you can use the time module to set a dynamic delay.
import time

for i in range(10):
    # ... operation 1 ...
    time.sleep(i)  # the delay grows by one second each iteration
    # ... operation 2 ...
Now you can see the delay between operation 1 and operation 2.
Note: the variable 'i' is the delay in seconds.

Scalable delayed task execution with Redis

I need to design a Redis-driven scalable task scheduling system.
Requirements:
Multiple worker processes.
Many tasks, but long periods of idleness are possible.
Reasonable timing precision.
Minimal resource waste when idle.
Should use synchronous Redis API.
Should work for Redis 2.4 (i.e. no features from upcoming 2.6).
Should not use other means of RPC than Redis.
Pseudo-API: schedule_task(timestamp, task_data). Timestamp is in integer seconds.
Basic idea:
Listen for upcoming tasks on list.
Put tasks to buckets per timestamp.
Sleep until the closest timestamp.
If a new task appears with timestamp less than closest one, wake up.
Process all upcoming tasks with timestamp ≤ now, in batches (assuming that task execution is fast).
Make sure that concurrent workers don't process the same tasks. At the same time, make sure that no tasks are lost if we crash while processing them.
So far I can't figure out how to fit this in Redis primitives...
Any clues?
Note that there is a similar old question: Delayed execution / scheduling with Redis? In this new question I introduce more details (most importantly, many workers). So far I was not able to figure out how to apply old answers here — thus, a new question.
Here's another solution that builds on a couple of the others [1]. It uses the Redis WATCH command to remove the race condition without using Lua from Redis 2.6.
The basic scheme is:
Use a redis zset for scheduled tasks and redis queues for ready to run tasks.
Have a dispatcher poll the zset and move tasks that are ready to run into the redis queues. You may want more than 1 dispatcher for redundancy but you probably don't need or want many.
Have as many workers as you want which do blocking pops on the redis queues.
I haven't tested it :-)
The foo job creator would do:
def schedule_task(queue, data, delay_secs):
    # This calculation for run_at isn't great: it won't deal well with
    # daylight savings changes, leap seconds, and other time anomalies.
    # Improvements welcome :-)
    run_at = time.time() + delay_secs
    # Zset members must be strings, so serialize the task to JSON.
    member = json.dumps({'queue': queue, 'data': data})
    # If you're using redis-py's Redis class and not StrictRedis, swap
    # run_at and the member.
    redis.zadd(SCHEDULED_ZSET_KEY, run_at, member)

schedule_task('foo_queue', foo_data, 60)
The dispatcher(s) would look like:
while working:
    # In redis-py, WATCH/MULTI/EXEC actually live on a pipeline object;
    # this keeps the original pseudocode shape.
    redis.watch(SCHEDULED_ZSET_KEY)
    min_score = 0
    max_score = time.time()
    results = redis.zrangebyscore(
        SCHEDULED_ZSET_KEY, min_score, max_score, start=0, num=1, withscores=False)
    if not results:
        redis.unwatch()
        sleep(1)
    else:  # len(results) == 1
        task = json.loads(results[0])
        redis.multi()
        redis.rpush(task['queue'], task['data'])
        redis.zrem(SCHEDULED_ZSET_KEY, results[0])
        redis.exec()
The foo worker would look like:
while working:
    # blpop returns a (key, value) pair, or None on timeout
    task_data = redis.blpop('foo_queue', POP_TIMEOUT)
    if task_data:
        foo(task_data[1])
[1] This solution is based on not_a_golfer's, one at http://www.saltycrane.com/blog/2011/11/unique-python-redis-based-queue-delay/, and the redis docs for transactions.
You didn't specify the language you're using. There are at least three ready-made alternatives, so you may not have to write a single line of code yourself:
Celery has an optional redis broker.
http://celeryproject.org/
resque is an extremely popular redis task queue using redis.
https://github.com/defunkt/resque
RQ is a simple and small redis based queue that aims to "take the good stuff from celery and resque" and be much simpler to work with.
http://python-rq.org/
You can at least look at their design if you can't use them.
But to answer your question - what you want can be done with redis. I've actually written more or less that in the past.
EDIT:
As for modeling what you want on redis, this is what I would do:
queuing a task with a timestamp will be done directly by the client - you put the task in a sorted set with the timestamp as the score and the task as the value (see ZADD).
A central dispatcher wakes every N seconds, checks out the first timestamps on this set, and if there are tasks ready for execution, it pushes the task to a "to be executed NOW" list. This can be done with ZREVRANGEBYSCORE on the "waiting" sorted set, getting all items with timestamp<=now, so you get all the ready items at once. pushing is done by RPUSH.
workers use BLPOP on the "to be executed NOW" list, wake when there is something to work on, and do their thing. This is safe since redis is single threaded, and no 2 workers will ever take the same task.
once finished, the workers put the result back in a response queue, which is checked by the dispatcher or another thread. You can add a "pending" bucket to avoid failures or something like that.
so the code will look something like this (this is just pseudo code):
client:
    ZADD "new_tasks" <TIMESTAMP> <TASK_INFO>

dispatcher:
    while working:
        tasks = ZREVRANGEBYSCORE "new_tasks" <NOW> 0  # only take tasks with timestamp lower/equal than now
        for task in tasks:
            # do the delete and queue as a transaction
            MULTI
            RPUSH "to_be_executed" task
            ZREM "new_tasks" task
            EXEC
        sleep(1)
I didn't add the response queue handling, but it's more or less like the worker:
worker:
    while working:
        task = BLPOP "to_be_executed" <TIMEOUT>
        if task:
            response = work_on_task(task)
            RPUSH "results" response
Edit: stateless atomic dispatcher:
while working:
    MULTI
    ZREVRANGE "new_tasks" 0 1
    ZREMRANGEBYRANK "new_tasks" 0 1
    task = EXEC
    # this is the only risky place - you can solve it by using Lua internally in 2.6
    SADD "tmp" task
    if task.timestamp <= now:
        MULTI
        RPUSH "to_be_executed" task
        SREM "tmp" task
        EXEC
    else:
        MULTI
        ZADD "new_tasks" task.timestamp task
        SREM "tmp" task
        EXEC
    sleep(RESOLUTION)
If you're looking for a ready-made solution in Java, Redisson is right for you. It allows you to schedule and execute tasks (with cron-expression support) in a distributed way on Redisson nodes, using the familiar ScheduledExecutorService API, backed by a Redis queue.
Here is an example. First, define a task using the java.lang.Runnable interface. Each task can access the Redis instance via the injected RedissonClient object.
public class RunnableTask implements Runnable {

    @RInject
    private RedissonClient redissonClient;

    @Override
    public void run() {
        RMap<String, Integer> map = redissonClient.getMap("myMap");
        long result = 0;
        for (Integer value : map.values()) {
            result += value;
        }
        redissonClient.getTopic("myMapTopic").publish(result);
    }
}
Now it's ready to submit to the ScheduledExecutorService:
RScheduledExecutorService executorService = redisson.getExecutorService("myExecutor");
ScheduledFuture<?> future = executorService.schedule(new RunnableTask(), 10, TimeUnit.MINUTES);
future.get();
// or cancel it
future.cancel(true);
Examples with cron expressions:
executorService.schedule(new RunnableTask(), CronSchedule.of("10 0/5 * * * ?"));
executorService.schedule(new RunnableTask(), CronSchedule.dailyAtHourAndMinute(10, 5));
executorService.schedule(new RunnableTask(), CronSchedule.weeklyOnDayAndHourAndMinute(12, 4, Calendar.MONDAY, Calendar.FRIDAY));
All tasks are executed on Redisson node.
A combined approach seems plausible:
No new task timestamp may be less than current time (clamp if less). Assuming reliable NTP synch.
All tasks go to bucket-lists at keys, suffixed with task timestamp.
Additionally, all task timestamps go to a dedicated zset (key and score — timestamp itself).
New tasks are accepted from clients via separate Redis list.
Loop: Fetch oldest N expired timestamps via zrangebyscore ... limit.
BLPOP with timeout on new tasks list and lists for fetched timestamps.
If we got an old task, process it. If new, add it to its bucket and the zset.
Check whether processed buckets are empty. If so, delete the list and its entry from the zset. Probably do not check very recently expired buckets, to safeguard against time synchronization issues. End loop.
Critique? Comments? Alternatives?
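A rough, untested sketch of the loop above in Python with redis-py >= 3 (the key names, the "ts|payload" intake format, and handle_task are all invented for illustration):

import time
import redis

r = redis.Redis()
ZSET = 'bucket_timestamps'   # member and score are both the timestamp
INTAKE = 'new_tasks'         # clients LPUSH "ts|payload" onto this list

def bucket(ts):
    return 'bucket:%d' % ts

def handle_task(payload):
    print('processing', payload)  # hypothetical task handler

while True:
    now = int(time.time())
    due = [int(m) for m in r.zrangebyscore(ZSET, 0, now, start=0, num=10)]
    popped = r.blpop([INTAKE] + [bucket(ts) for ts in due], timeout=1)
    if popped:
        key, value = popped
        if key == INTAKE.encode():
            ts_raw, payload = value.split(b'|', 1)
            ts = max(int(ts_raw), now)   # clamp timestamps from the past
            r.rpush(bucket(ts), payload)
            r.zadd(ZSET, {str(ts): ts})  # redis-py >= 3 mapping form
        else:
            handle_task(value)
    for ts in due:                       # clean up drained buckets
        if r.llen(bucket(ts)) == 0:
            r.delete(bucket(ts))
            r.zrem(ZSET, str(ts))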
Lua
I made something similar to what's been suggested here, but optimized the sleep duration to be more precise. This solution is good if you have few inserts into the delayed task queue. Here's how I did it with a Lua script:
local laterChannel = KEYS[1]
local nowChannel = KEYS[2]
local currentTime = tonumber(KEYS[3])

-- peek at the earliest scheduled event
local first = redis.call("zrange", laterChannel, 0, 0, "WITHSCORES")
if #first ~= 2 then
    return "2147483647" -- the zset is empty: sleep "forever"
end

local execTime = tonumber(first[2])
local event = first[1]

if currentTime >= execTime then
    -- the event is due: move it from the zset to the list
    redis.call("zrem", laterChannel, event)
    redis.call("rpush", nowChannel, event)
    return "0"
else
    return tostring(execTime - currentTime)
end
It uses two "channels". laterChannel is a ZSET and nowChannel is a LIST. Whenever it's time to execute a task, the event is moved from the ZSET to the LIST. The Lua script responds with the number of milliseconds the dispatcher should sleep until the next poll: if the ZSET is empty, sleep forever; if it's time to execute something, do not sleep (i.e. poll again immediately); otherwise, sleep until it's time to execute the next task.
So what if something is added while the dispatcher is sleeping?
This solution works in conjunction with key space events. You basically need to subscribe to the key of laterChannel, and whenever there is an add event, you wake up all the dispatchers so they can poll again.
Then you have another dispatcher that uses the blocking left pop on nowChannel. This means:
You can run the dispatcher across multiple instances (i.e. it scales)
The polling is atomic so you won't have any race conditions or double events
The task is executed by any of the instances that are free
There are ways to optimize this even more. For example, instead of returning "0", you fetch the next item from the zset and return the correct amount of time to sleep directly.
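For completeness, a minimal dispatcher loop around the script might look like this in Python with redis-py (LUA_SOURCE, the key names, and the millisecond convention are assumptions carried over from the script above):

import time
import redis

r = redis.Redis()
move_due = r.register_script(LUA_SOURCE)  # LUA_SOURCE holds the script above

while True:
    now_ms = int(time.time() * 1000)
    sleep_ms = int(move_due(keys=['laterChannel', 'nowChannel', now_ms]))
    if sleep_ms > 0:
        # cap the sleep so newly added events are noticed even without
        # key space notifications
        time.sleep(min(sleep_ms / 1000.0, 5.0))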
Expiration
If you cannot use Lua scripts, you can use key space events on expired keys.
Subscribe to the channel and receive the event when Redis evicts the key. Then grab a lock: the first instance to do so moves the event to a list (the "execute now" channel). Then you don't have to worry about sleeps and polling; Redis will tell you when it's time to execute something.
execute_later(timestamp, eventId, event) {
    SET eventId event EXP timestamp
    SET "lock:" + eventId, ""
}
subscribeToEvictions(eventId) {
    var deletedCount = DEL eventId
    if (deletedCount == 1) {
        // move to list
    }
}
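Note that key space notifications are disabled by default; the Redis server must be configured to emit expiration events, e.g. CONFIG SET notify-keyspace-events Ex, and subscribers then listen on the __keyevent@<db>__:expired channel.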
This however has its own downsides. For example, if you have many nodes, all of them will receive the event and try to grab the lock. But I still think it's overall fewer requests than anything else suggested here.