Ignite Data streamer optimization - ignite

I am using below settings:
allowOverwrite: false
nodeParallelOperations: 1
autoFlushFrequency: 10
perNodeBufferSize: 5000000
My records size is around 2000 bytes. And see the "grid-data-loader-flusher"
thread stats as below:
Thread Count Average Longest Duration
grid-data-loader-flusher-#100 38 4,737,793.579 30,427,862 180,036,156
What would be the best configurations for Data streamer?
Thanks

Its good to have parallel streaming mode for data streamer. You can achieve this by collecting you key-value records in java Map and call the streamer.addData() method in parallel mode over that map. Here is the snippet.
maptoStream.entrySet().parallelStream().forEach(streamer::addData);
Also, if you are setting allowOverWrite to false then you cant use your custom stream receiver to process your collection of records. In this case it will skip the record(s) if it is already there in cache.
Regarding buffersize, you need to wait till buffer gets full each time to get it flushed automatically to cache. flush frequency comes to your rescue in this case and it will do periodic flushing. so whatever condition first satisfies(either buffer gets full or flush frequency reach) it will do flush. I preferred calling manual flush after above method call.
I observed that streamer works well with much more big collection on which you will call streamer.addData() method in parallel.

Related

Unable to exit while loop in UVM monitor

This might be a silly mistake from my side that I have overlooked but I'm fairly new to UVM and I tried tinkering with my code for a while before this. I'm trying to send in a stream of 8 bit data within a packet using Data valid stall protocol from my UVM driver to the DUT. I'm facing an issue with my input monitor not being able to pick up these transactions that are driven.
I have a while loop with a condition that the valid bit must be high and the stall bit should be low. As long as this condition holds good, the monitor needs to pick up the data byte and push into the queue. I know for a fact that the data is being picked up and pushed to a queue as I used $display statements along the way. The problem is arising once all the data bytes are received and the valid bit goes low. Ideally, this should cause the exit from the while loop but isn't doing so. Any help here would be appreciated. I have attached a snippet of the code below. Thanks in advance.
virtual task main_phase (uvm_phase phase);
$display("Run phase of input monitor");
collect_transfer();
endtask: main_phase
virtual task collect_transfer();
fork
forever begin
wait_for_valid_transaction_cycle();
create_and_populate_pkt();
broadcast_pkt();
#(iP0_vif.cb_iP0_MON);
end
join_none
endtask: collect_transfer
virtual task wait_for_valid_transaction_cycle();
wait(iP0_vif.cb_iP0_MON.ip_valid && ~iP0_vif.cb_iP0_MON.ip_stall);
endtask: wait_for_valid_transaction_cycle
virtual task create_and_populate_pkt();
pkt = Router_seq_item :: type_id :: create("pkt");
pkt.valid = iP0_vif.cb_iP0_MON.ip_valid;
pkt.sop = iP0_vif.cb_iP0_MON.ip_sop;
$display("before data collection");
while(iP0_vif.cb_iP0_MON.ip_valid === `HIGH && iP0_vif.cb_iP0_MON.ip_stall === `LOW) begin
$display("After checking for stall");
pkt.data = iP0_vif.cb_iP0_MON.ip_data;
$display(pkt.data);
pkt.data_q.push_front(pkt.data);
pkt.eop = iP0_vif.cb_iP0_MON.ip_eop;
$display("print check in input monitor # time = %0t", $time);
#(iP0_vif.cb_iP0_MON);
end
$display("before printing input packet from monitor");
Check_for_port_route_and_populate_packet_field(pkt);
print_packet(pkt);
endtask: create_and_populate_pkt
The $display statement "before printing input packet from monitor" is not being displayed.
HIGH is defined as a binary 1 and LOW is defined as a binary 0.
The output of the code in terms of display statements is as below.
before data collection
before checking for stall
After checking for stall
2
print check in input monitor # time = 105
before checking for stall
After checking for stall
1
print check in input monitor # time = 115
before checking for stall
After checking for stall
3
print check in input monitor # time = 125
It's possible that the main phase objection is being dropped elsewhere in your environment. UVM will automatically kill any threads that were spawned during a phase when it ends.
To fix this, do not object to the main phase in your monitor. Objecting to that phase is the responsibility of the threads creating the stimulus. Instead, you should be launching this monitor during the run_phase, which will ensure that your loop is not killed until the end of simulation.
Also, during the shutdown phase, you will want your monitor to object whenever it is currently seeing a packet. This will ensure that simulation doesn't end as soon as stimulus has been sent in, giving your other monitors time to collect responses from the DUT.

ServerXmlHttpRequest hanging sometimes when doing a POST

I have a job that periodically does some work involving ServerXmlHttpRquest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then its good for another month or three.
Eventually i started using Task Manager to create a process dump,
so i could analyze where the hang is. After five crash dumps (over the last 11 months or so) i get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects#20()
KERNELBASE.dll!_WaitForMultipleObjectsEx#20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects#20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!#BaseThreadInitThunk#12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart#8()
So i do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days (causing the system to miss financial transactions, until come Sunday night i get a call that it's broken).
It is of no help unless someone knows how to debug code, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts:
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
The KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it plausibly could be the issue. But the 30 second default timeout would have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task; even though the option is enabled to do do.
I will look into using the Windows Job API to add my self process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies
JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process
user-mode execution time limit, in 100-nanosecond ticks. Otherwise,
this member is ignored.
The system periodically checks to determine
whether each process associated with the job has accumulated more
user-mode time than the set limit. If it has, the process is
terminated.
If the job is nested, the effective limit is the most
restrictive limit in the job chain.
Although since Task Scheduler uses Job objects to also limit a task's time, i'm not hopeful that the Job Object can limit a job either.
Edit: Job objects cannot limit a process by process time - only user time. And with a process idle waiting for an object, it will not accumulate any user time - certainly not three minutes worth.
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
however, if I set the delay to 2s it is not efficient enough. If I set the DOWNLOAD_DELAY = 0.
The crawler is able to crawl about 10 pages. after that, the target page will return something like " you are requesting too frequently ".
What I want to do is the keep the download_delay to 0. once the "requesting too frequently" msg is found in the html. it change the delay to 2s. After a while it switch back to zero.
is there any module can do this? or any other better idea to handle such case?
Update:
I found that is a extension call AutoThrottle
but is it able to customize some logic like this??
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY
If right after you get anti-spider page, then in 2 seconds you can get data page, then what you are asking probably requires writing a downloader middleware
that checks for anti-spider page, reset all scheduled requests to a renew-queue, start a looping call when spider is idle to get request from the renew-queue, (the looping interval is your hack for a new download delay), and try to decide when the download delay is not necessary again (requires some tests), then stop the looping and reschedule all the requests in renew-queue to scrapy scheduler. You will need to use redis queue in case of distributed crawl.
With download delay set to 0, in my experience throughput can go easily above 1000 items/min. If anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead maybe you can try to find out how fast does your target server allow, may be 1.5s, 1s, 0.7s, 0.5s etc. Then maybe redesign your product takes into consideration the throughput your crawler can achieve.
You can use Auto Throttle extension now. It is turned off by default. You can add these parameters in your project's settings.py file to enable it.
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, You can use the time module to set the dynamic delay.
import time
for i in range(10):
*** Operations 1****
time.sleep( i )
*** Operations 2****
Now you can see the delay between Operations 1 and Operations 2.
Note:
the variable 'i' is in the form of seconds.

JMeter HTTP Request Post Body from File

I am trying to send an HTTP request via JMeter. I have created a thread group with a loop count of 25. I have a ramp up period of 120 and number of threads set to 30. Within the thread group, I have 20 HTTP Requests. I am a little confused as to how JMeter runs these requests. Do each of the 20 requests within a thread group run in a single thread, and each loop over a thread group runs concurrently on a different thread? Or do each of the 20 requests run in different threads as and when they are available.
My other question is, Over each loop, I want to vary the body of the post data that is being sent via the HTTP request. Is it possible to pass the post data body via a file instead of inserting the data into the JMeter Body Data Tab as show below:
However, instead of doing that, I want to define some kind of variable that picks a file based on iteration of the threadgroup that is running, for example, if it is looping over the thread group the second time, i want to call test2.txt, if the third time test3.txt etc and these text files will contain different post data. Could anyone tell me if this is possible with JMeter please and if so, how would I go about doing this.
Point 1 - JMeter concurrency
JMeter starts with 1 thread and spawns more threads as per ramp-up set. In your case (30 threads and 120 seconds ramp-up) another thread is being added each 4 seconds. Each thread executes 20 requests and if there is another loop - starts over, if there is no loop - the threads shuts down. To control load and concurrency JMeter provides 2 options:
Synchronizing Timer - pause all threads till specified threshold is reached and then release all of them at the same time
Constant Throughput Timer - to specify the load in requests per minute.
Point 2 - Send file instead of text
You can replace your request body with __fileToString function. If you want to parametrize it you can use nested function to provide current iteration - see below.
Point 3 - adding iteration as a parameter
JMeter provides 2 options on how you can increment a counter each loop
Counter config element - starts from specified value and gets incremented by specified value each time it's called.
__counter function - start from 1 and gets incremented by 1 each time it's being called. Can be "per-user" or "global"
See How to Use JMeter Functions post series for comprehensive information on above and more JMeter functions.

Scalable delayed task execution with Redis

I need to design a Redis-driven scalable task scheduling system.
Requirements:
Multiple worker processes.
Many tasks, but long periods of idleness are possible.
Reasonable timing precision.
Minimal resource waste when idle.
Should use synchronous Redis API.
Should work for Redis 2.4 (i.e. no features from upcoming 2.6).
Should not use other means of RPC than Redis.
Pseudo-API: schedule_task(timestamp, task_data). Timestamp is in integer seconds.
Basic idea:
Listen for upcoming tasks on list.
Put tasks to buckets per timestamp.
Sleep until the closest timestamp.
If a new task appears with timestamp less than closest one, wake up.
Process all upcoming tasks with timestamp ≤ now, in batches (assuming
that task execution is fast).
Make sure that concurrent worker wouldn't process same tasks. At the same time, make sure that no tasks are lost if we crash while processing them.
So far I can't figure out how to fit this in Redis primitives...
Any clues?
Note that there is a similar old question: Delayed execution / scheduling with Redis? In this new question I introduce more details (most importantly, many workers). So far I was not able to figure out how to apply old answers here — thus, a new question.
Here's another solution that builds on a couple of others [1]. It uses the redis WATCH command to remove the race condition without using lua in redis 2.6.
The basic scheme is:
Use a redis zset for scheduled tasks and redis queues for ready to run tasks.
Have a dispatcher poll the zset and move tasks that are ready to run into the redis queues. You may want more than 1 dispatcher for redundancy but you probably don't need or want many.
Have as many workers as you want which do blocking pops on the redis queues.
I haven't tested it :-)
The foo job creator would do:
def schedule_task(queue, data, delay_secs):
# This calculation for run_at isn't great- it won't deal well with daylight
# savings changes, leap seconds, and other time anomalies. Improvements
# welcome :-)
run_at = time.time() + delay_secs
# If you're using redis-py's Redis class and not StrictRedis, swap run_at &
# the dict.
redis.zadd(SCHEDULED_ZSET_KEY, run_at, {'queue': queue, 'data': data})
schedule_task('foo_queue', foo_data, 60)
The dispatcher(s) would look like:
while working:
redis.watch(SCHEDULED_ZSET_KEY)
min_score = 0
max_score = time.time()
results = redis.zrangebyscore(
SCHEDULED_ZSET_KEY, min_score, max_score, start=0, num=1, withscores=False)
if results is None or len(results) == 0:
redis.unwatch()
sleep(1)
else: # len(results) == 1
redis.multi()
redis.rpush(results[0]['queue'], results[0]['data'])
redis.zrem(SCHEDULED_ZSET_KEY, results[0])
redis.exec()
The foo worker would look like:
while working:
task_data = redis.blpop('foo_queue', POP_TIMEOUT)
if task_data:
foo(task_data)
[1] This solution is based on not_a_golfer's, one at http://www.saltycrane.com/blog/2011/11/unique-python-redis-based-queue-delay/, and the redis docs for transactions.
You didn't specify the language you're using. You have at least 3 alternatives of doing this without writing a single line of code in Python at least.
Celery has an optional redis broker.
http://celeryproject.org/
resque is an extremely popular redis task queue using redis.
https://github.com/defunkt/resque
RQ is a simple and small redis based queue that aims to "take the good stuff from celery and resque" and be much simpler to work with.
http://python-rq.org/
You can at least look at their design if you can't use them.
But to answer your question - what you want can be done with redis. I've actually written more or less that in the past.
EDIT:
As for modeling what you want on redis, this is what I would do:
queuing a task with a timestamp will be done directly by the client - you put the task in a sorted set with the timestamp as the score and the task as the value (see ZADD).
A central dispatcher wakes every N seconds, checks out the first timestamps on this set, and if there are tasks ready for execution, it pushes the task to a "to be executed NOW" list. This can be done with ZREVRANGEBYSCORE on the "waiting" sorted set, getting all items with timestamp<=now, so you get all the ready items at once. pushing is done by RPUSH.
workers use BLPOP on the "to be executed NOW" list, wake when there is something to work on, and do their thing. This is safe since redis is single threaded, and no 2 workers will ever take the same task.
once finished, the workers put the result back in a response queue, which is checked by the dispatcher or another thread. You can add a "pending" bucket to avoid failures or something like that.
so the code will look something like this (this is just pseudo code):
client:
ZADD "new_tasks" <TIMESTAMP> <TASK_INFO>
dispatcher:
while working:
tasks = ZREVRANGEBYSCORE "new_tasks" <NOW> 0 #this will only take tasks with timestamp lower/equal than now
for task in tasks:
#do the delete and queue as a transaction
MULTI
RPUSH "to_be_executed" task
ZREM "new_tasks" task
EXEC
sleep(1)
I didn't add the response queue handling, but it's more or less like the worker:
worker:
while working:
task = BLPOP "to_be_executed" <TIMEOUT>
if task:
response = work_on_task(task)
RPUSH "results" response
EDit: stateless atomic dispatcher :
while working:
MULTI
ZREVRANGE "new_tasks" 0 1
ZREMRANGEBYRANK "new_tasks" 0 1
task = EXEC
#this is the only risky place - you can solve it by using Lua internall in 2.6
SADD "tmp" task
if task.timestamp <= now:
MULTI
RPUSH "to_be_executed" task
SREM "tmp" task
EXEC
else:
MULTI
ZADD "new_tasks" task.timestamp task
SREM "tmp" task
EXEC
sleep(RESOLUTION)
If you're looking for ready solution on Java. Redisson is right for you. It allows to schedule and execute tasks (with cron-expression support) in distributed way on Redisson nodes using familiar ScheduledExecutorService api and based on Redis queue.
Here is an example. First define a task using java.lang.Runnable interface. Each task can access to Redis instance via injected RedissonClient object.
public class RunnableTask implements Runnable {
#RInject
private RedissonClient redissonClient;
#Override
public void run() throws Exception {
RMap<String, Integer> map = redissonClient.getMap("myMap");
Long result = 0;
for (Integer value : map.values()) {
result += value;
}
redissonClient.getTopic("myMapTopic").publish(result);
}
}
Now it's ready to sumbit it into ScheduledExecutorService:
RScheduledExecutorService executorService = redisson.getExecutorService("myExecutor");
ScheduledFuture<?> future = executorService.schedule(new CallableTask(), 10, 20, TimeUnit.MINUTES);
future.get();
// or cancel it
future.cancel(true);
Examples with cron expressions:
executorService.schedule(new RunnableTask(), CronSchedule.of("10 0/5 * * * ?"));
executorService.schedule(new RunnableTask(), CronSchedule.dailyAtHourAndMinute(10, 5));
executorService.schedule(new RunnableTask(), CronSchedule.weeklyOnDayAndHourAndMinute(12, 4, Calendar.MONDAY, Calendar.FRIDAY));
All tasks are executed on Redisson node.
A combined approach seems plausible:
No new task timestamp may be less than current time (clamp if less). Assuming reliable NTP synch.
All tasks go to bucket-lists at keys, suffixed with task timestamp.
Additionally, all task timestamps go to a dedicated zset (key and score — timestamp itself).
New tasks are accepted from clients via separate Redis list.
Loop: Fetch oldest N expired timestamps via zrangebyscore ... limit.
BLPOP with timeout on new tasks list and lists for fetched timestamps.
If got an old task, process it. If new — add to bucket and zset.
Check if processed buckets are empty. If so — delete list and entrt from zset. Probably do not check very recently expired buckets, to safeguard against time synchronization issues. End loop.
Critique? Comments? Alternatives?
Lua
I made something similar to what's been suggested here, but optimized the sleep duration to be more precise. This solution is good if you have few inserts into the delayed task queue. Here's how I did it with a Lua script:
local laterChannel = KEYS[1]
local nowChannel = KEYS[2]
local currentTime = tonumber(KEYS[3])
local first = redis.call("zrange", laterChannel, 0, 0, "WITHSCORES")
if (#first ~= 2)
then
return "2147483647"
end
local execTime = tonumber(first[2])
local event = first[1]
if (currentTime >= execTime)
then
redis.call("zrem", laterChannel, event)
redis.call("rpush", nowChannel, event)
return "0"
else
return tostring(execTime - currentTime)
end
It uses two "channels". laterChannel is a ZSET and nowChannel is a LIST. Whenever it's time to execute a task, the event is moved from the the ZSET to the LIST. The Lua script with respond with how many MS the dispatcher should sleep until the next poll. If the ZSET is empty, sleep forever. If it's time to execute something, do not sleep(i e poll again immediately). Otherwise, sleep until it's time to execute the next task.
So what if something is added while the dispatcher is sleeping?
This solution works in conjunction with key space events. You basically need to subscribe to the key of laterChannel and whenever there is an add event, you wake up all the dispatcher so they can poll again.
Then you have another dispatcher that uses the blocking left pop on nowChannel. This means:
You can have the dispatcher across multiple instances(i e it's scaling)
The polling is atomic so you won't have any race conditions or double events
The task is executed by any of the instances that are free
There are ways to optimize this even more. For example, instead of returning "0", you fetch the next item from the zset and return the correct amount of time to sleep directly.
Expiration
If you can not use Lua scripts, you can use key space events on expired documents.
Subscribe to the channel and receive the event when Redis evicts it. Then, grab a lock. The first instance to do so will move it to a list(the "execute now" channel). Then you don't have to worry about sleeps and polling. Redis will tell you when it's time to execute something.
execute_later(timestamp, eventId, event) {
SET eventId event EXP timestamp
SET "lock:" + eventId, ""
}
subscribeToEvictions(eventId) {
var deletedCount = DEL eventId
if (deletedCount == 1) {
// move to list
}
}
This however has it own downsides. For example, if you have many nodes, all of them will receive the event and try to get the lock. But I still think it's overall less requests any anything suggested here.