I am using IBM LSF and trying to get usage statistics during a certain period. I found that bhist does the job, but the short form bhist output does not show all of the fields I need.
What I want to know is:
Are bhist's output fields customizable? The fields I need are:
<jobid>
<user>
<queue>
<job_name>
<project_name>
<job_description>
<submission_time>
<pending_time>
<run_time>
If 1 is not possible, the long form (bhist -l) output shows everything I need, but the format is hard to manipulate. I've pasted an example of the format below.
For example, the number of lines between records is not fixed, and the word wrap in each event may break a line in the middle of a word I'm trying to scan for. How do I parse this format with sed and awk?
JobId <1531>, User <user1>, Project <default>, Command <example200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$H
OME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD
</home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after being resumed
by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.
Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
5 0 205 7 1 0 218
------------------------------------------------------------
.... repeat
I'm adding a second answer because it might help you with your problem without actually having to write your own solution (depending on the usage statistics you're after).
LSF already has a utility called bacct that computes and prints usage statistics about historical LSF jobs, filtered by various criteria.
For example, to get summary usage statistics about jobs that were dispatched/completed/submitted between time0 and time1, you can use (respectively):
bacct -D time0,time1
bacct -C time0,time1
bacct -S time0,time1
Statistics about jobs submitted by a particular user:
bacct -u <username>
Statistics about jobs submitted to a particular queue:
bacct -q <queuename>
These options can be combined as well; for example, if you want statistics about jobs that were submitted and completed within a particular time window for a particular project, you can use:
bacct -S time0,time1 -C time0,time1 -P <projectname>
The output provides some summary information about all jobs that match the provided criteria like so:
$ bacct -u bobbafett -q normal
Accounting information about jobs that are:
- submitted by users bobbafett,
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to queues normal,
- accounted on all service classes.
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
Total number of done jobs: 0 Total number of exited jobs: 32
Total CPU time consumed: 46.8 Average CPU time consumed: 1.5
Maximum CPU time of a job: 9.0 Minimum CPU time of a job: 0.0
Total wait time in queues: 18680.0
Average wait time in queue: 583.8
Maximum wait time in queue: 5507.0 Minimum wait time in queue: 0.0
Average turnaround time: 11568 (seconds/job)
Maximum turnaround time: 43294 Minimum turnaround time: 40
Average hog factor of a job: 0.00 ( cpu time / turnaround time )
Maximum hog factor of a job: 0.02 Minimum hog factor of a job: 0.00
Total Run time consumed: 351504 Average Run time consumed: 10984
Maximum Run time of a job: 1844674 Minimum Run time of a job: 0
Total throughput: 0.24 (jobs/hour) during 160.32 hours
Beginning time: Nov 11 17:55 Ending time: Nov 18 10:14
This command also has a long-form output that provides bhist -l-like information about each job, which might be a bit easier to parse (although still not all that easy; a rough parsing sketch follows the sample output):
$ bacct -l -u bobbafett -q normal
Accounting information about jobs that are:
- submitted by users bobbafett,
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to queues normal,
- accounted on all service classes.
------------------------------------------------------------------------------
Job <101>, User <bobbafett>, Project <default>, Status <EXIT>, Queue <normal>,
Command <sleep 100000000>
Wed Nov 11 17:37:45: Submitted from host <endor>, CWD <$HOME>;
Wed Nov 11 17:55:05: Completed <exit>; TERM_OWNER: job killed by owner.
Accounting information about this job:
CPU_T WAIT TURNAROUND STATUS HOG_FACTOR MEM SWAP
0.00 1040 1040 exit 0.0000 0M 0M
------------------------------------------------------------------------------
...
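If you do end up scraping bacct -l, the per-job accounting numbers at least follow a predictable pattern. Below is a rough awk sketch, under the assumptions that each record's header starts with "Job <" and that the numbers always sit on the line right after the CPU_T column header; treat it as a starting point rather than a robust parser:

bacct -l -u bobbafett -q normal | awk '
    /^Job </ {                                     # record header: the first <...> holds the job id
        match($0, /<[^>]*>/)
        jobid = substr($0, RSTART + 1, RLENGTH - 2)
    }
    grab { print jobid, $1, $2, $3, $4; grab = 0 } # CPU_T, WAIT, TURNAROUND, STATUS
    /CPU_T[[:space:]]+WAIT/ { grab = 1 }           # the numbers are on the next line
'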
Long-form output is pretty hard to parse. I know bjobs has an option for unformatted output (-UF) in older LSF versions, which makes it a bit easier, and the most recent version of LSF lets you customize which columns are printed in short-form output with -o.
Unfortunately, neither of these options is available with bhist. The only real possibilities for historical information are:
1. Figure out some way to parse bhist -l -- impractical and maybe not even possible due to the inconsistent formatting, as you've discovered (if you want to try anyway, see the sketch after this list).
2. Write a C program to do what you want using the LSF API, which exposes the functions that bhist itself uses to parse the lsb.events file. This is the file that stores all the historical information about the LSF cluster, and is what bhist reads to generate its output.
3. If C is not an option for you, then you could try writing a script to parse the lsb.events file directly -- the format is documented in the configuration reference. This is hard, but not impossible. Here is the relevant document for LSF 9.1.3.
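For what it's worth, if you do want to attempt #1, the main trick is to undo the word wrap before matching anything: the continuation lines appear to be indented, and the wrap falls mid-character, so indented lines can be glued back onto the previous logical line with no separator. An untested sketch based only on the sample above (it only pulls a few header fields; extend the second stage for the other fields you need):

bhist -l | awk '
    # indented line = continuation of the previous logical line; glue it on
    /^[[:space:]]/ { sub(/^[[:space:]]+/, ""); rec = rec $0; next }
    { if (rec != "") print rec; rec = $0 }
    END { if (rec != "") print rec }
' | awk -F'[<>]' '
    # with the wrapping undone, fields sit between < and >; e.g. the record
    # header line gives job id, user and project as fields 2, 4 and 6
    /^Job(Id)? </ { printf "%s %s %s\n", $2, $4, $6 }
'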
My personal recommendation would be #2 -- the function you're looking for is lsb_geteventrec(). You'd basically read each line in lsb.events one at a time and pull out the information you need.
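And if you end up going the lsb.events route (#3) with a script instead of C, a quick way to get a feel for the file before committing to a full parser is to tally the event types. I believe the event type is the first quoted field of each record and the first line of the file is a header, but check the configuration reference for your version; the path below is a placeholder for wherever your cluster keeps lsb.events:

awk 'NR > 1 { gsub(/"/, ""); types[$1]++ }    # skip the header line, strip quotes
     END    { for (t in types) print t, types[t] }' /path/to/logdir/lsb.events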
I'm struggling with a while() loop in a Conky script.
Here's what I want to do:
I'm piping a command's output to awk, extracting and formatting data.
The problem is: the output can contain 1 to n sections, and I want to get values from each one of them.
Here's the output sent to awk :
1) -----------
name: wu_1664392603_228876_0
WU name: wu_1664392603_228876
project URL: https://boinc.loda-lang.org/loda/
received: Thu Oct 6 15:31:40 2022
report deadline: Thu Oct 13 15:31:40 2022
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 220917
resources: 1 CPU
estimated CPU time remaining: 1379.480287
elapsed task time: 5858.009798
slot: 1
PID: 2221366
CPU time at last checkpoint: 5690.500000
current CPU time: 5712.920000
fraction done: 0.809000
swap size: 1051 MB
working set size: 973 MB
2) -----------
name: wu_1664392603_228908_0
WU name: wu_1664392603_228908
project URL: https://boinc.loda-lang.org/loda/
received: Thu Oct 6 15:31:53 2022
report deadline: Thu Oct 13 15:31:53 2022
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 220917
resources: 1 CPU
estimated CPU time remaining: 1393.925106
elapsed task time: 5849.961764
slot: 7
PID: 2221367
CPU time at last checkpoint: 5654.640000
current CPU time: 5682.160000
fraction done: 0.807000
swap size: 802 MB
working set size: 728 MB
...
And here's the final output I want:
boinc.loda wu_1664392603_2288 80.9 07/10 01h37
boinc.loda wu_1664392603_2289 80.7 07/10 02h38
I managed to get the data I want ("WU name", "project URL", "estimated CPU time remaining" AND "fraction done") from one particular section using this code:
${execi 60 boinccmd --get_tasks | awk -F': |://|/' '\
/URL/ && ++i==1 {u=$3}\
/WU/ && ++j==1 {w=$2}\
/fraction/ && ++k==1 {p=$2}\
/estimated/ && ++l==1 {e=strftime("%d/%m %Hh%M",$2+systime())}\
END {printf "%.10s %.18s %3.1f %s", u, w, p*100, e}\
'}
This is quite inelegant, as I must repeat this code n times, increasing the i, j, k, l values to get the whole dataset (n is related to CPU threads; my PC has 8 threads, so I repeat the code 8 times).
I'd like the script to adapt to other CPUs, where n could be anything from 1 to ...
The obvious solution is to use a while() loop, parsing the whole dataset.
But nesting a conditional loop into an awk sequence calling an external command seems too tricky for me, and Conky scripts aren't really easy to debug, as Conky may hang without any error output or log if the script's syntax is bad.
Any help will be appreciated :)
Assumptions:
the sample input shows 2 values for estimated that are ~14.5 seconds apart (1379.480287 and 1393.925106) but the expected output is showing the estimated values as being ~61 mins apart (07/10 01h37 and 07/10 02h38); for now I'm going to assume this is due to OP's repeated runs of execi returning widely varying values for the estimated lines
each section of execi output always contains 4 matching lines (URL, WU, fraction, estimated) and these 4 strings only occur once within a section of execi output
I don't have execi installed on my system, so to emulate OP's execi I've cut-n-pasted OP's sample execi results into a local file named execi.dat.
Tweaking OP's current awk script also lets us eliminate the need for a bash loop that repeatedly calls execi | awk:
cat execi.dat | awk -F': |://|/' '
FNR==NR     { st=systime() }                               # current epoch; used to turn "seconds remaining" into a wall-clock ETA
/URL/       { found++; u=$3 }                              # project host (3rd field once "://" and "/" are separators)
/WU/        { found++; w=$2 }                              # work unit name
/fraction/  { found++; p=$2 }                              # fraction done
/estimated/ { found++; e=strftime("%d/%m %Hh%M",$2+st) }   # ETA = now + estimated CPU time remaining
found==4    { printf "%.10s %.18s %3.1f %s\n", u, w, p*100, e; found=0 }   # one output line per section, then reset
'
This generates:
boinc.loda wu_1664392603_2288 80.9 06/10 17h47
boinc.loda wu_1664392603_2289 80.7 06/10 17h47
NOTE: the last value appears to be duplicated but that's due to the sample estimated values only differing by ~14.5 seconds
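If you want to drop this straight back into Conky, the same script should work inside the original ${execi ...} wrapper (untested in Conky itself, since I'm emulating execi with a file); the only change from the file-based version above is that systime() is captured once in a BEGIN block, since there is no input file for FNR==NR to be meaningful against:

${execi 60 boinccmd --get_tasks | awk -F': |://|/' '\
BEGIN       {st=systime()}\
/URL/       {found++; u=$3}\
/WU/        {found++; w=$2}\
/fraction/  {found++; p=$2}\
/estimated/ {found++; e=strftime("%d/%m %Hh%M",$2+st)}\
found==4    {printf "%.10s %.18s %3.1f %s\n", u, w, p*100, e; found=0}\
'}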
I need to create edges between a set of nodes, but it is not guaranteed that an edge doesn't already exist. I need to know which edges have been created so I can increment the edge counter of the two connected nodes.
I want to know the edge count for every node without querying the graph each time.
Example:
MERGE (u:user {id:999049043279872})
MERGE (g1:group {id:346709075951616})
MERGE (g2:group {id:346709075951617})
MERGE (g1)-[m1:member]->(u)
MERGE (g2)-[m2:member]->(u)
Sometimes the user is already a member of the group, so I don't want to increment the counter in that case.
I tried to use the result statistics, but they only return the total number of created relationships. I also thought about using a map and then filling its contents using ON CREATE SET after MERGE:
WITH {g1:0, g2:0} as res
MERGE (u:user {id:999049043279872})
MERGE (g1:group {id:346709075951616})
MERGE (g2:group {id:346709075951617})
MERGE (g1)-[m1:member]->(u)
ON CREATE SET res.g1 = 1
MERGE (g2)-[m2:member]->(u)
ON CREATE SET res.g2 = 1
RETURN res
But it does not work; the server crashes immediately after executing the query.
Exception:
------ FAST MEMORY TEST ------
17235:M 28 Feb 2022 16:56:50.016 # main thread terminated
17235:M 28 Feb 2022 16:56:50.017 # Bio thread for job type #0 terminated
17235:M 28 Feb 2022 16:56:50.017 # Bio thread for job type #1 terminated
17235:M 28 Feb 2022 16:56:50.018 # Bio thread for job type #2 terminated
Fast memory test PASSED, however your memory can still be broken.
Please run a memory test for several hours if possible.
------ DUMPING CODE AROUND EIP ------
Symbol: (null) (base: (nil))
Module: /lib/x86_64-linux-gnu/libc.so.6 (base 0x7fbfe3dcc000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=(nil) -D -b binary -m i386:x86-64 /tmp/dump.bin
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/redis/redis/issues
Suspect RAM error? Use redis-server --test-memory to verify it.
Segmentation fault
Any ideas?
Thanks in advance
Neo4j already stores a counter inside each node that tracks its number of relationships, providing fast count access. When you want to get the number of members in a group, you can simply do:
MATCH (g:group)
return size((g)<-[:member]-())
I have integrated nutch 1.13 along with solr-6.6.0 on CentOS Linux release 7.3.1611. I put about 10 URLs in the seed list, which is at /usr/local/apache-nutch-1.13/urls/seed.txt, and followed the tutorial.
The command I used is
/usr/local/apache-nutch-1.13/bin/crawl -i -D solr.server.url=httpxxx:8983/solr/nutch/ /usr/local/apache-nutch-1.13/urls/ crawl 100
1. It seems to run for one or two hours, and I get corresponding results in Solr, but during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal. Why aren't they being added to the seed list?
2. How do I know whether my crawldb is growing? It's been about a month and the only results I get in Solr are from the seed list and its links.
3. I have set the above command in crontab -e and Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?
I'm a total newbie and any additional info would be helpful.
1. It seems to run for one or two hours, and I get corresponding results in Solr, but during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal. Why aren't they being added to the seed list?
The seed file is never modified by Nutch; it is only read during the injection phase.
2. How do I know whether my crawldb is growing?
You should take a look at the readdb -stats option, which gives you something like this:
crawl.CrawlDbReader - Statistics for CrawlDb: test/crawldb
crawl.CrawlDbReader - TOTAL urls: 5584
crawl.CrawlDbReader - shortest fetch interval: 30 days, 00:00:00
crawl.CrawlDbReader - avg fetch interval: 30 days, 01:14:16
crawl.CrawlDbReader - longest fetch interval: 42 days, 00:00:00
crawl.CrawlDbReader - earliest fetch time: Tue Nov 07 09:50:00 CET 2017
crawl.CrawlDbReader - avg of fetch times: Tue Nov 14 11:26:00 CET 2017
crawl.CrawlDbReader - latest fetch time: Tue Dec 19 09:45:00 CET 2017
crawl.CrawlDbReader - retry 0: 5584
crawl.CrawlDbReader - min score: 0.0
crawl.CrawlDbReader - avg score: 5.463825E-4
crawl.CrawlDbReader - max score: 1.013
crawl.CrawlDbReader - status 1 (db_unfetched): 4278
crawl.CrawlDbReader - status 2 (db_fetched): 1014
crawl.CrawlDbReader - status 4 (db_redir_temp): 116
crawl.CrawlDbReader - status 5 (db_redir_perm): 19
crawl.CrawlDbReader - status 6 (db_notmodified): 24
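For reference, the same statistics can be printed standalone at any time (the crawl directory name is assumed to match the one used in your crawl command above):

/usr/local/apache-nutch-1.13/bin/nutch readdb crawl/crawldb -stats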
A good trick I always use is to put this command inside the crawl script provided by Nutch (bin/crawl), inside the loop:
for ((a=1; ; a++))
do
  ...
  # add these two lines inside the loop:
  echo "stats"
  __bin_nutch readdb "$CRAWL_PATH"/crawldb -stats
done
It's been about a month and the only results I get in Solr are from the seed list and its links.
There could be multiple causes; you should check the output of each phase and see how the funnel goes.
3. I have set the above command in crontab -e and Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?
I guess you've used the default Nutch Solr schema; check the url vs. id fields.
As far as I've worked with it, id is the unique identifier of a URL (which may contain redirects).
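Separately from the schema check, if I remember correctly Nutch 1.x also ships a deduplication job that marks duplicates in the crawldb, plus a cleaning job that pushes the corresponding deletes to the index. Something along these lines, with the paths and Solr URL assumed to match your setup, so double-check against your version's docs:

# mark URLs with duplicate content signatures in the crawldb
/usr/local/apache-nutch-1.13/bin/nutch dedup crawl/crawldb

# send delete requests for duplicate/gone documents to the index
/usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=httpxxx:8983/solr/nutch/ crawl/crawldb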
One of my redis servers is repeatedly going down today without any overt, diagnosable cause. My users all end up getting Error 111 connecting to unix socket: /var/run/redis/redis2.sock. Connection refused errors.
Looking into the logs at /var/log/redis, the last few lines capture nothing more nefarious than a scheduled backup:
[8248] 09 Mar 07:48:17.090 * 10 changes in 21600 seconds. Saving...
[8248] 09 Mar 07:48:17.374 * Background saving started by pid 47613
[47613] 09 Mar 07:51:02.257 * DB saved on disk
[47613] 09 Mar 07:51:02.486 * RDB: 526 MB of memory used by copy-on-write
[8248] 09 Mar 07:51:02.920 * Background saving terminated with success
The pid file still exists too, which implies the server wasn't formally shut down and redis was still daemonized?
I logged into my system and did sudo service redis-server restart twice to get it up and running. Apart from these logs, how else can I diagnose what might have gone wrong?
Update: I noticed that at the time of the first crash, disk swapping started taking place. This hasn't happened before. Moreover, cat /proc/sys/vm/swappiness confirms swappiness is set to 2.
free -m shows (after normal operation):
total used free shared buffers cached
Mem: 28136 27015 1120 305 80 6586
-/+ buffers/cache: 20349 7787
Swap: 1023 991 32
free -m shows (after the redis server goes down):
total used free shared buffers cached
Mem: 28136 8770 19365 305 60 441
-/+ buffers/cache: 8268 19868
Swap: 1023 1022 1
This sounds like the work of the OS's OOM killer - you can verify or discredit the hypothesis by reviewing /var/log/syslog.
In this case, the persistence job's overhead triggered the killer. You need to provision for that by setting maxmemory and allocating enough RAM to accommodate persistence's requirements, including COW.
Note that free isn't useful after the fact - you need to monitor your resources continuously.
As for swap, if you don't care about latency then you can certainly do that.
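A minimal sketch of both checks (the memory limit is just an example value, and the log path assumes an Ubuntu/Debian-style syslog; adjust for your distro and dataset size):

# confirm (or rule out) the OOM killer in the system log
grep -iE 'out of memory|oom-killer|killed process' /var/log/syslog

# cap Redis so dataset + copy-on-write overhead fits in RAM; also persist the
# same setting in redis.conf so it survives restarts
redis-cli -s /var/run/redis/redis2.sock config set maxmemory 20gb
redis-cli -s /var/run/redis/redis2.sock config get maxmemory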
I have a redis install on Ubuntu 14.04, and I seem to have nearly weekly issues with RDB snapshots not completing. Redis version is 3.0.4 64 bit.
3838:M 24 Feb 09:46:28.826 * Background saving terminated with success
3838:M 24 Feb 09:47:29.088 * 100000 changes in 60 seconds. Saving...
3838:M 24 Feb 09:47:29.230 * Background saving started by pid 17281
17281:signal-handler (1456338079) Received SIGTERM scheduling shutdown...
3838:M 24 Feb 13:24:19.358 # Background saving terminated by signal 9
3838:M 24 Feb 13:24:19.622 * 10 changes in 900 seconds. Saving...
3838:M 24 Feb 13:24:19.730 * Background saving started by pid 17477
What you see there is that at 9:47am the background save started, but when I found it at 1:24pm it appeared to be completely stalled. I found the forked process to have basically no activity - the amount of memory it was consuming wasn't increasing. I tried to "kill" the child process, but it never actually quit, so I had to kill it with extreme prejudice (-9).
When things are getting bad, I get the following errors in my app:
2016-02-24 13:11:12,046 [2344] ERROR kCollectors.Main - Error while adding to Redis: No connection is available to service this operation: SADD ALLCH
My redis config is to do rdb snapshots only (no AOF). The load is modification heavy, with thousands of writes per second.
Currently I'm at the point where no redis background save is succeeding, and the background process becomes so much larger than the regular process that my VM starts swapping. Here's my top output. 3838 is my redis instance, and 17477 is the background save process (as noted above):
top - 14:06:42 up 118 days, 2:05, 1 user, load average: 1.07, 1.07, 1.13
Tasks: 81 total, 3 running, 78 sleeping, 0 stopped, 0 zombie
%Cpu(s):  0.8 us,  1.5 sy,  0.0 ni, 45.8 id, 51.3 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
KiB Swap: 6289404 total, 3968236 used, 2321168 free. 4044 cached Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   36 root      20   0       0      0      0 S   2.3  0.0 288:05.05 kswapd0
 3838 rrr       20   0 7791836 3.734g    612 S   2.0 47.9 330:08.65 redis-server
17477 rrr       20   0 7792228 6.606g    364 D   1.0 84.7   0:43.49 redis-server
This is very interesting, since I don't remember ever reading about such issues, so discovering the root cause could be very useful.
So here you are reporting a child process that stays active for a long time and even continues to allocate memory. I have no explanation for this other than data corruption in the process memory, causing the RDB process to hit unexpected conditions and loop forever in some way.
A few questions:
Does this happen even if you restart the process? (However, please DON'T DO IT if you can avoid restarting and have not restarted yet, otherwise we may no longer be able to understand the root cause.)
While the RDB saving is active, do you see high CPU usage, and is the process shown as running in ps/top?
Could you try to interrupt the process with gdb -p <pid> and obtain a stack trace of the process?
Could you provide the Redis INFO output to check the version and other configuration details and state?
Could you check free output while this happens?
TLDR: is it possible the system is out of memory and swapping a lot? In that case the child process, while saving the RDB file, visits all the pages and forces everything back into the resident set. The system can't cope with that much I/O, so it takes ages to complete the RDB save.
EDIT: I just noticed you reported memory info:
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
So the system is out of memory and swapping like crazy, and this results in the behavior above. As RDB saving starts, copy-on-write will use a lot of additional memory, pushing the server over its memory limits.
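If it helps, the next time a save kicks off you can watch the forked saver's resident vs. swapped memory and the save status directly (17477 is the child pid from your top output; substitute the current one):

# resident vs. swapped-out memory of the forked saver, refreshed every 2 seconds
watch -n 2 "grep -E 'VmRSS|VmSwap' /proc/17477/status"

# Redis' own view of the background save
redis-cli info persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'

# system-wide swap-in/swap-out activity (si/so columns)
vmstat 5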
Thanks.