CasperJS and Parallel Crawling - phantomjs

I'm trying to crawl a website, but the crawl takes so long that I need to run multiple instances to shorten it. I've looked for other optimizations and aborted all unnecessary resource requests, but it's still too slow for me (around 8-9 seconds).
What is the easiest way to run CasperJS instances in parallel, or even just two CasperJS processes at the same time?
I tried GNU Parallel, following a blog post I found, but although the processes are alive they do not seem to crawl in parallel: the total execution time is the same as with one instance.
Should I use a Node.js server to create instances?
What is the easiest and most practical way?

Can you adapt this:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
cat $URLLIST |
parallel lynx -listonly -image_links -dump {} \; \
wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
grep -F $BASEURL |
grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN
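If GNU Parallel is not a requirement, a simpler route is xargs -P, which most xargs implementations support (it is not part of POSIX). The sketch below is only an illustration: CRAWL_CMD is a hypothetical placeholder that defaults to echo so the example runs anywhere. In practice it would be set to something like "casperjs crawl.js", where crawl.js is an assumed script that takes one URL as its argument.

```shell
#!/bin/sh
# CRAWL_CMD is a hypothetical placeholder for the real crawler command,
# e.g. CRAWL_CMD="casperjs crawl.js". It defaults to echo so the sketch runs as-is.
CRAWL_CMD=${CRAWL_CMD:-echo crawled}

# Read URLs from stdin, one per line, and keep at most $1 crawlers running.
crawl_parallel() {
    # $CRAWL_CMD is left unquoted on purpose so "casperjs crawl.js" splits into words.
    xargs -P "$1" -n 1 $CRAWL_CMD
}

printf '%s\n' http://example.com/a http://example.com/b http://example.com/c |
    crawl_parallel 2
```

Output order is not guaranteed, because the instances finish independently. If the total time does not drop with two workers, the bottleneck is probably outside the crawler (for example the remote server or the network).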

Related

Clone all projects from cgit

I have to download all projects that are hosted on some cgit instance. There are several hundreds of repositories, so it is tedious to do this manually.
How can it be done?
It seems possible to do it with curl by parsing the pages one by one, but is there a more convenient interface?
There does not seem to be any official or convenient API for CGit to export/clone all its repositories.
You can try those alternatives:
curl -s http://git.suckless.org/ |
xml sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a" -v '@title' -n |
grep . |
while read repo
do git clone git://git.suckless.org/$repo
done
Or:
curl -s http://git.suckless.org/ | xml pyx | awk '$1 == "Atitle" { print $2 }'
Or:
curl -s http://git.suckless.org/ | xml pyx | awk '$1 == "Atitle" { printf("git clone %s\n",$2) }' | s
I suspect this works for one page of Git repositories as listed by CGit; you might still have to repeat it for each subsequent page of repositories.
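Before cloning several hundred repositories, it can help to print the commands first and inspect them. A minimal sketch, assuming the repository names have already been extracted one per line (for example by one of the pipelines above). BASE and clone_commands are names introduced here for illustration, and the names fed in at the end are only samples.

```shell
#!/bin/sh
# BASE is an assumption: the clone prefix used in the example above.
BASE=${BASE:-git://git.suckless.org}

# Turn repository names on stdin into "git clone" command lines, skipping blanks.
clone_commands() {
    while IFS= read -r repo; do
        if [ -n "$repo" ]; then
            printf 'git clone %s/%s\n' "$BASE" "$repo"
        fi
    done
}

# Print the commands for review; append "| sh" to actually run them.
printf '%s\n' dwm st '' | clone_commands
```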

GNU Parallel -q option causing BCP "unknown option" errors (different string quotes on local vs remote hosts)

I'm seeing very strange behavior when using GNU Parallel to distribute export jobs using bcp from mssql-tools. It appears that with parallel's -q option, strings are interpreted differently on the local host than on remote hosts.
Running only as a loop through the files on the local host, the bcp processes throw no errors.
However, when distributing the file exports with parallel, the bcp processes executing on the local host throw
/opt/mssql-tools/bin/bcp: unknown option
errors, while those executing on remote hosts (via a --sshloginfile param) finish successfully. The basic code being run looks like...
# setting some vars to pass
TO_SERVER_ODBCDSN="-D -S MyMSSQLServer"
TO_SERVER_IP="-S 172.18.54.22"
DB="$dest_db" #TODO: enforce being more careful with this value
TABLE="$tablename" # MUST exist beforehand, case matters
USER=$(tail -n+1 $source_home/mssql-creds.txt | head -1)
PASSWORD=$(tail -n+2 $source_home/mssql-creds.txt | head -1)
DATAFILES="/some/path/to/files/"
TARGET_GLOB="*.tsv"
RECOMMEDED_IMPORT_MODE='-c' # makes a HUGE difference, see https://stackoverflow.com/a/16310219/8236733
DELIMITER="\\\t" # (currently not used) DO NOT use format like "'\t'", nested quotes seem to cause hard-to-catch error, want "\t" literal
....
bcpexport() {
filename=$1
TO_SERVER_ODBCDSN=$2
DB=$3
TABLE=$4 # MUST exist beforehand, case matters
USER=$5
PASSWORD=$6
RECOMMEDED_IMPORT_MODE=$7 # makes a HUGE difference, see https://stackoverflow.com/a/16310219/8236733
DELIMITER=$8 # not currently used
WORKDIR=$9
LOGDIR=${10}
....
/opt/mssql-tools/bin/bcp "$TABLE" in "$localfile" \
$TO_SERVER_ODBCDSN \
-U $USER -P $PASSWORD \
-d $DB \
$RECOMMEDED_IMPORT_MODE
-t "\t" \
-e ${localfile}.bcperror.log
}
export -f bcpexport
parallelization_pernode=5
parallel -q -j $parallelization_pernode \
--sshloginfile $source_home/parallel-nodes.txt \
--env bcpexport \
bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER $workingdir $logdir \
::: $DATAFILES/$TARGET_GLOB #from hdfs nfs gateway
Looking at the bash interpretation of the processes (by running ps -aux | grep bcp on the hosts that parallel is given in the --sshloginfile), for the remote hosts we see...
/bin/bash -c bcpexport() { ... /opt/mssql-tools/bin/bcp "$TABLE" in "$localfile" $TO_SERVER_ODBCDSN -U $USER -P $PASSWORD -d $DB $RECOMMEDED_IMPORT_MODE; -t "\t" -e ${localfile}.bcperror.log; ...
for the local host, the bash interpretation is...
/bin/bash -c bcpexport() { ... /opt/mssql-tools/bin/bcp "$TABLE" in "$localfile" $TO_SERVER_ODBCDSN -U $USER -P $PASSWORD -d $DB $RECOMMEDED_IMPORT_MODE; -t "\t" -e ${localfile}.bcperror.log; ...
that is, they look the same.
My current thought is that the "\t" in the bcp command is being interpreted in a problematic way. Debugging parallel without vs with the -q option we see...
$ parallel -j 5 --sshloginfile ./parallel-nodes.txt echo "Number {}: Running on \`hostname\`: \t" ::: 1 2 3 4 5
Number 4: Running on HW04.ucera.local: t
Number 1: Running on HW04.ucera.local: t
Number 2: Running on HW03.ucera.local: t
Number 5: Running on HW03.ucera.local: t
Number 3: Running on HW02.ucera.local: t
$ parallel -q -j 5 --sshloginfile ./parallel-nodes.txt echo "Number {}: Running on \`hostname\`: \t" ::: 1 2 3 4 5
Number 1: Running on `hostname`:
Number 4: Running on `hostname`:
Number 3: Running on `hostname`: \t
Number 2: Running on `hostname`: \t
Number 5: Running on `hostname`: \t
The bcp command needs the "\t" literal, not the "t" literal (and I suspect there are several other similar string corruptions; I also believe \t is the default for bcp anyway, but this is just an example and I want to keep \t for code clarity). I'm not sure how to get this on both local and remote nodes, or even why the behavior differs between remote and local.
Basically, I need the strings to be exactly the same on both local and remote hosts, even if they contain spaces or escape characters (note: I think this used to not be a problem, since I have older scripts on other machines that don't show it).
I'm not sure whether this counts more as a parallel problem or a bcp problem (currently I think something is going wrong with the -q option in parallel, but I'm not sure). Does anyone have debugging suggestions or fixes, or ideas about what could be happening?
Firstly, the reason why hostname is not expanded is due to -q. It quotes the ` so that it does not expand.
Secondly, I think what you see is the different behaviours in built-in echo and /bin/echo. Built-in echo depends on the shell. Here I compare echo \\\\t in different shells:
$ parallel --onall --tag -S sh@lo,bash@lo,csh@lo,tcsh@lo,ksh@lo,zsh@lo echo \\\\t ::: a
bash@lo \t a
tcsh@lo a
sh@lo a
ksh@lo \t a
zsh@lo a
csh@lo \t a
That does not, however, get you closer to a solution. If I were you I would use env_parallel to copy the environment variables. And if the login shell on the remote systems is not the same as your shell, then set PARALLEL_SHELL to force using that shell.
So:
#!/bin/bash
env_parallel --session
# setting some vars to pass
TO_SERVER_ODBCDSN="-D -S MyMSSQLServer"
:
:
PARALLEL_SHELL=bash env_parallel -q -j $parallelization_pernode ...
(there is no need for either --env or 'export -f' when using 'env_parallel --session')
# Cleanup (not needed if this is the last line in the script)
env_parallel --end-session
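On the echo point: because echo's handling of backslashes differs between shells, printf with a %s format is a more predictable way to check what string a job actually received. A small sketch (show_literal is a name introduced here, not part of GNU Parallel):

```shell
#!/bin/sh
# Print the argument exactly as received; %s never interprets backslash escapes.
show_literal() {
    printf '%s\n' "$1"
}

show_literal 'field\tseparator'
```

Using a helper like this instead of echo in the debugging command removes the shell's own echo behaviour from the comparison, so any remaining difference comes from quoting.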

NcFTP -S with -bb

I am trying to upload all changed files to my FTP server. However, I cannot use -S .tmp and -v when I use the -bb flag - and I can't use those options with ncftpbatch at all. Here is my code:
#!/bin/bash -eo pipefail
IN=$(git diff-tree --no-commit-id --name-only -r HEAD)
OUT=$(echo $IN | tr ";" "\n")
for file in "${OUT[@]}"; do
ncftpput -bb -S .tmp -v -u "zeussite@kolechia.heliohost.org" -p "*****" ftp.kolechia.heliohost.org "/" $file
done
ncftpbatch
As you can see, I need the -S .tmp to avoid breaking the site during uploads. -v provides output to prevent my CI service from timing out.
How can I upload only the changed files - without temporarily breaking the site? I'm thinking of just logging in separately for each file, but that is bad practice.
Why not launch a function in the background that just prints a dummy message such as "uploading, please wait", sleeps for a few seconds, and repeats? Once the loop is done you can kill that background job.
If you don't want any visible output, print something like:
printf "\0"
or
printf "a\b"
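The background-job idea can be sketched as follows. This is an illustration under assumptions, not a tested CI recipe: heartbeat and with_heartbeat are names made up here, the interval is arbitrary, and "sleep 1" stands in for the real upload loop.

```shell
#!/bin/sh
# Print a status line every $1 seconds until killed.
heartbeat() {
    while :; do
        echo "uploading, please wait"
        sleep "$1"
    done
}

# Run a command with the heartbeat in the background and return the command's status.
with_heartbeat() {
    interval=$1
    shift
    heartbeat "$interval" &
    hb_pid=$!
    status=0
    "$@" || status=$?
    # Stop the heartbeat whether or not the command succeeded.
    kill "$hb_pid" 2>/dev/null || true
    wait "$hb_pid" 2>/dev/null || true
    return "$status"
}

# "sleep 1" stands in for the real upload loop.
with_heartbeat 1 sleep 1
```

Returning the wrapped command's exit status keeps a failed upload visible to the CI service even though the heartbeat itself always ends by being killed.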

background xargs/wget not adhering to -P and -n limits

I'm having a problem with xargs and Wget when run as shell scripts in an Applescript app. I want Wget to run 4 parallel processes in the background. The problem: basically, when I try to run the process in the background with
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E -b 1> NUL 2> NUL
a Wget process is apparently started for each URL passed in from the .txt file. This is too burdensome on the user's memory. When I run it in the foreground, however, with something like:
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E
I seem to get the four parallel Wget processes I need. Does anybody know how to get this script to run in the background with only 4 processes? I'm a bit of a novice, and I'm afraid I can't figure out why backgrounding the process causes this change.
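You might run xargs in the background instead of passing -b to wget. With -b, each wget detaches immediately, so xargs considers the slot free and starts the next URL right away, which defeats -P: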
cat urls.txt | xargs -P4 -n1 wget -q &
Or if you want to return control to the AppleScript, disown the xargs process:
do shell script "cat urls.txt | xargs -P4 -n1 /usr/local/bin/wget -q & disown $!"
As far as I can tell, I have solved the problem with
cat urls.txt| (xargs -P4 -n1 wget -q -E >/dev/null 2>&1) &
There may well be a better solution, though...

Delete all the queues from RabbitMQ?

I installed rabbitmqadmin and was able to list all the exchanges and queues. How can I use rabbitmqadmin or rabbitmqctl to delete all the queues.
First, list your queues:
rabbitmqadmin list queues name
Then from the list, you'll need to manually delete them one by one:
rabbitmqadmin delete queue name='queuename'
Because of the output format, it doesn't appear that you can grep the response from list queues. Alternatively, if you're just looking for a way to clear everything (read: reset all settings, returning the installation to a default state), use:
rabbitmqctl stop_app
rabbitmqctl reset # Be sure you really want to do this!
rabbitmqctl start_app
This is actually very easy with the management plugin and policies:
Go to the Management Console (localhost:15672)
Go to the Admin tab
Go to the Policies tab (on the right side)
Add a policy
Fill in the fields
Virtual Host: Select
Name: Expire All Policies(Delete Later)
Pattern: .*
Apply to: Queues
Definition: expires with value 1 (change type from String to Number)
Save
Check the Queues tab again
All queues should now be deleted
And don't forget to remove the policy!
With rabbitmqadmin you can remove them with this one-liner:
rabbitmqadmin -f tsv -q list queues name | while read queue; do rabbitmqadmin -q delete queue name=${queue}; done
In RabbitMQ version 3.7.10 you can run the command below with root permissions:
rabbitmqctl list_queues | awk '{ print $1 }' | xargs -L1 rabbitmqctl delete_queue
Try this:
rabbitmqadmin list queues name | awk '{print $2}' | xargs -I qn rabbitmqadmin delete queue name=qn
If you don't have rabbitmqadmin installed, try to purge queues with rabbitmqctl:
rabbitmqctl list_queues | awk '{ print $1 }' | xargs -L1 rabbitmqctl purge_queue
If you're trying to delete queues because they're unused and you don't want to reset, one option is to set the queue TTL very low via a policy, wait for the queues to be auto-deleted once the TTL is passed and then remove the policy (https://www.rabbitmq.com/ttl.html).
rabbitmqctl.bat set_policy delq ".*" '{"expires": 1}' --apply-to queues
To remove the policy
rabbitmqctl clear_policy delq
Note that this only works for unused queues
Original info here: http://rabbitmq.1065348.n5.nabble.com/Deleting-all-queues-in-rabbitmq-td30933.html
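The set-policy, wait, clear-policy sequence can be wrapped so that the policy is not left behind by accident. The sketch below reuses the two rabbitmqctl commands from this answer; the run helper, the DRY_RUN switch, the function name, and the wait time are assumptions made for illustration. It defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Apply the expiry policy, wait $1 seconds (default 5, an arbitrary guess),
# then remove the policy again.
expire_all_queues() {
    run rabbitmqctl set_policy delq '.*' '{"expires": 1}' --apply-to queues
    run sleep "${1:-5}"
    run rabbitmqctl clear_policy delq
}

expire_all_queues 5
```

Setting DRY_RUN=0 runs the commands for real. As noted above, only unused queues expire this way.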
I made a deleteRabbitMqQs.sh script, which accepts arguments to search the list of queues for, selecting only the ones matching the pattern you want. If you pass no arguments, it will delete them all! It shows you the list of queues it's about to delete, letting you quit before doing anything destructive.
for word in "$@"
do
args=true
newQueues=$(rabbitmqctl list_queues name | grep "$word")
queues="$queues
$newQueues"
done
if [ $# -eq 0 ]; then
queues=$(rabbitmqctl list_queues name | grep -v "\.\.\.")
fi
queues=$(echo "$queues" | sed '/^[[:space:]]*$/d')
if [ "x$queues" == "x" ]; then
echo "No queues to delete, giving up."
exit 0
fi
read -p "Deleting the following queues:
${queues}
[CTRL+C quit | ENTER proceed]
"
while read -r line; do
rabbitmqadmin delete queue name="$line"
done <<< "$queues"
If you want different matching against the arguments you pass in, you can alter the grep in line four. When deleting all queues, it won't delete ones with three consecutive dots in them, because I figured that eventuality would be rarer than people who have rabbitmqctl printing its output in different languages.
Enjoy!
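One variation on the selection step: the loop in the script runs grep once per argument, so a queue that matches two patterns can end up listed twice. The sketch below is an alternative, not the author's code. It assumes the queue names arrive on stdin one per line, and filter_queues is a name introduced here. It passes all patterns to a single grep call instead.

```shell
#!/bin/sh
# Keep queue names (stdin, one per line) that match any pattern given as an
# argument; with no arguments every name passes through. Each name is printed once.
filter_queues() {
    if [ "$#" -eq 0 ]; then
        cat
        return
    fi
    # Rewrite the argument list from "p1 p2" to "-e p1 -e p2" for a single grep call.
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" -e "$1"
        shift
        n=$((n - 1))
    done
    grep "$@"
}

# Sample names are made up for the demonstration.
printf '%s\n' orders billing orders.dlq | filter_queues dlq bill
```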
Here is a way to do it with PowerShell. The URL may need to be updated.
$cred = Get-Credential
iwr -ContentType 'application/json' -Method Get -Credential $cred 'http://localhost:15672/api/queues' | % {
ConvertFrom-Json $_.Content } | % { $_ } | ? { $_.messages -gt 0} | % {
iwr -method DELETE -Credential $cred -uri $("http://localhost:15672/api/queues/{0}/{1}" -f [System.Web.HttpUtility]::UrlEncode($_.vhost), $_.name)
}
You can use rabbitmqctl eval as below:
rabbitmqctl eval 'IfUnused = false, IfEmpty = true, MatchRegex =
<<"^prefix-">>, [rabbit_amqqueue:delete(Q, IfUnused, IfEmpty) || Q <-
rabbit_amqqueue:list(), re:run(element(4, element(2, Q)), MatchRegex)
=/= nomatch ].'
The above will delete all empty queues in all vhosts that have a name
beginning with "prefix-".
You can edit the variables IfUnused, IfEmpty,
and MatchRegex as per your requirement.
Purging all queues (removing their messages, not the queues themselves) with a rabbitmqctl one-liner:
rabbitmqctl list_queues | awk '{ print $1 }' | sed 's/Listing//' | xargs -L1 rabbitmqctl purge_queue
You need not reset rabbitmq server to delete non-durable queues. Simply stop the server and start again and it will remove all the non-durable queues available.
In case you only want to purge the queues which are not empty (a lot faster):
rabbitmqctl list_queues | awk '$2!=0 { print $1 }' | sed 's/Listing//' | xargs -L1 rabbitmqctl purge_queue
For me, it takes 2-3 seconds to purge a queue (both empty and non-empty ones), so iterating through 50 queues is such a pain while I just need to purge 10 of them (40/50 are empty).
I tried rabbitmqctl and reset commands but they are very slow.
This is the fastest way I found (replace your username and password):
#!/bin/bash
# Stop on error
set -eo pipefail
USER='guest'
PASSWORD='guest'
curl -sSL -u $USER:$PASSWORD http://localhost:15672/api/queues/%2f/ | jq '.[].name' | sed 's/"//g' | xargs -L 1 -I@ curl -XDELETE -sSL -u $USER:$PASSWORD http://localhost:15672/api/queues/%2f/@
# To also delete exchanges uncomment next line
# curl -sSL -u $USER:$PASSWORD http://localhost:15672/api/exchanges/%2f/ | jq '.[].name' | sed 's/"//g' | xargs -L 1 -I@ curl -XDELETE -sSL -u $USER:$PASSWORD http://localhost:15672/api/exchanges/%2f/@
Note: This only works with the default vhost /
I tried the snippets above, but without piping everything through a single stream:
sudo rabbitmqctl list_queues | awk '{print $1}' > queues.txt; for line in $(cat queues.txt); do sudo rabbitmqctl delete_queue "$line"; done
I generate a file that contains all the queue names and loop through it line by line to delete them. For the loop, while read ... did not work for me; it always stopped at the first queue name.
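A likely cause of a while read loop stopping after the first name, though not confirmed for rabbitmqctl here, is that the command inside the loop reads from the same standard input as read and consumes the rest of the list. The sketch below reproduces the effect with greedy, a made-up stand-in for any stdin-reading command, and shows the usual workaround of redirecting the command's stdin from /dev/null.

```shell
#!/bin/sh
# greedy stands in for any command that reads stdin before doing its work.
greedy() {
    cat >/dev/null
    echo "deleted $1"
}

# The command shares the loop's stdin and swallows the remaining names.
loop_shared() {
    while IFS= read -r q; do
        greedy "$q"
    done
}

# Giving the command /dev/null as stdin leaves the list for read.
loop_isolated() {
    while IFS= read -r q; do
        greedy "$q" </dev/null
    done
}

printf '%s\n' q1 q2 q3 | loop_shared
printf '%s\n' q1 q2 q3 | loop_isolated
```

The first loop handles only q1, while the second handles all three names.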
Here is a faster version (using parallel; install it with sudo apt-get install parallel), expanding on the excellent answer by @admenva:
parallel -j 50 rabbitmqadmin -H YOUR_HOST_OR_LOCALHOST -q delete queue name={} ::: $(rabbitmqadmin -H YOUR_HOST_OR_LOCALHOST -f tsv -q list queues name)
This command deletes all your queues:
python rabbitmqadmin.py \
-H YOURHOST -u guest -p guest -f bash list queues | \
xargs -n1 | \
xargs -I{} \
python rabbitmqadmin.py -H YOURHOST -u guest -p guest delete queue name={}
This script is super simple because it uses -f bash, which outputs the queues as a list.
Then we use xargs -n1 to split that up into multiple variables
Then we use xargs -I{} that will run the command following, and replace {} in the command.
To list queues,
./rabbitmqadmin -f tsv -q list queues
To delete a queue,
./rabbitmqadmin delete queue name=name_of_queue
For those who have a problem installing rabbitmqadmin: you should first install Python.
UNIX-like operating system users need to copy rabbitmqadmin to a directory in PATH, e.g. /usr/local/bin.
Windows users will need to ensure Python is on their PATH, and invoke rabbitmqadmin as python.exe rabbitmqadmin.
Then
Browse to http://{hostname}:15672/cli/rabbitmqadmin to download.
Go to the containing folder then run cmd with administrator privilege
To list Queues
python rabbitmqadmin list queues
To delete Queue
python rabbitmqadmin delete queue name=Name_of_queue
To Delete all Queues
1- Declare Policy
python rabbitmqadmin declare policy name='expire_all_policies' pattern=.* definition={\"expires\":1} apply-to=queues
2- Remove the policy
python rabbitmqadmin delete policy name='expire_all_policies'
Following command worked for me:
sudo rabbitmqctl list_queues | awk '{print $1}' | xargs -I qn sudo rabbitmqctl delete_queue qn
There's a way to remove all queues and exchanges without scripts and full reset. You can just delete and re-create a virtual host from admin interface. This will work even for vhost /.
The only thing you'll need to restore is permissions for the newly created vhost.
Okay, important qualifier for this answer:
The question asks for either rabbitmqctl or rabbitmqadmin, but my answer needs both. Also note that this was tested on macOS 10.12.6 with the versions of rabbitmqctl and rabbitmqadmin that are installed with RabbitMQ via Homebrew, which brew list --versions identifies as rabbitmq 3.7.0.
rabbitmqctl list_queues -p <VIRTUAL_HOSTNAME> name | sed 1,2d | xargs -I qname rabbitmqadmin --vhost <VIRTUAL_HOSTNAME> delete queue name=qname
Another option is to delete the vhost associated with the queues. This will delete everything associated with the vhost, so be warned, but it is easy and fast.
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
This is a method I use. It is easy, clear and effective.
This is the script:
Vhost=the_vhost_name
User=user_name
Password=the_password
for i in `rabbitmqctl list_queues -p $Vhost | awk '{ print $1 }'`
do
echo "queue_name: $i"
curl -u $User:$Password -H "content-type:application/json" -XDELETE http://localhost:15672/api/queues/$Vhost/$i
done
Try this:
rabbitmqctl list_queues -q name > q.txt
IFS=$'\n' read -d '' -r -a queues < q.txt
count=${#queues[@]}
i=1; while (($i < $count)); do echo ${queues[$i]};rabbitmqctl delete_queue ${queues[$i]};i=$((i+1)); done
As per https://stackoverflow.com/a/52002145/3278855
To automate that, it's possible to use this curl:
curl -X PUT --data '{"pattern":".*","apply-to":"all","definition":{"expires":1},"priority":0}' -u guest:guest 'http://localhost:15672/api/policies/%2f/clear' && \
curl -X DELETE -u guest:guest 'http://localhost:15672/api/policies/%2f/clear'
Please note that %2f is the default vhost name (/), URL-encoded, and guest:guest is login:password.
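Because the vhost appears as a path segment in the HTTP API URL, a name containing characters such as / has to be percent-encoded, which is why / is written as %2f. A minimal sketch of such an encoder in plain shell, assuming ASCII-only names; urlencode is a helper defined here, not a RabbitMQ tool, and the second vhost name is made up.

```shell
#!/bin/sh
# Percent-encode every character except unreserved ones (letters, digits, . _ ~ -).
# Assumes ASCII input.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

urlencode /               # prints %2F
urlencode 'my vhost/dev'
```

The result can then be substituted into the URL in place of the hard-coded %2f, for example http://localhost:15672/api/queues/$(urlencode "$Vhost")/.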
rabbitmqadmin list queues|awk 'NR>3{print $4}'|head -n-1|xargs -I qname rabbitmqadmin delete queue name=qname