Clone all projects from cgit - automation

I have to download all projects that are hosted on some cgit instance. There are several hundred repositories, so it is tedious to do this manually.
How can it be done?
It seems possible to do it with curl by parsing the pages one by one. But is there a more convenient interface?

There does not seem to be any official or convenient API for CGit to export/clone all its repositories.
You can try these alternatives:
curl -s http://git.suckless.org/ |
xml sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a" -v '@title' -n |
grep . |
while read -r repo
do git clone "git://git.suckless.org/$repo"
done
Or:
curl -s http://git.suckless.org/ | xml pyx | awk '$1 == "Atitle" { print $2 }'
Or:
curl -s http://git.suckless.org/ | xml pyx | awk '$1 == "Atitle" { printf("git clone git://git.suckless.org/%s\n",$2) }' | sh
I suspect this works for one page of Git repositories as listed by CGit: you might still have to repeat it for all subsequent pages, as sketched below.
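A hedged sketch of that loop, assuming the cgit index paginates with an ofs offset parameter and lists 50 repositories per page (both may differ per instance):
ofs=0
while :; do
  repos=$(curl -s "http://git.suckless.org/?ofs=$ofs" |
    xml sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a" -v '@title' -n |
    grep .)
  # stop when a page yields no repository links
  [ -z "$repos" ] && break
  echo "$repos" | while read -r repo
  do git clone "git://git.suckless.org/$repo"
  done
  ofs=$((ofs + 50))
done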

Related

How to filter output of a URL

I have a URL, and when I send a request with curl, I get a big output.
curl https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame -H "Accept: application/json" -s
I get: https://pastebin.mozilla.org/QM6FN8MZ#L
But I just want to get the URL of the 720p version, I mean just:
https:\/\/caspian1.cdn.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjQ2NDJhYmQ4NGFiN2UzNDJkNGMxZWI3ZTNkMzlmZmQ5IiwiZXhwIjoxNjY5ODA5NzI1LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.havkkhJyXjBt_jHPVv4poEVb65_7tRsLIxO5pCO7tGE
Any idea how to do it?
I'm trying to use grep, but I don't know how to strip out everything other than the 720p URL.
curl https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame -H "Accept: application/json" -s | grep -e "720p"
You could go the html-parsing/json-parsing route, e.g.:
curl -s https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame |
# Normalize html
xmlstarlet fo -o -H -R 2> /dev/null |
# Extract relevant js bit
xmlstarlet sel -t -v '_:html/_:body/_:div/_:script' 2> /dev/null |
# Extract relevant json
sed -nE '/^ *var +options *= */ { s///; s/;$//p; }' |
# Extract desired url, i.e. the 720p in this case
jq -r '.multiSRC[][] | select( .label == "720p" ) | .src'
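If another quality is wanted, presumably only the label in the final jq filter changes (assuming the JSON also carries that variant), for example:
jq -r '.multiSRC[][] | select( .label == "480p" ) | .src'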
I would harness GNU AWK for this task in the following way:
wget --quiet -O - https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame | awk 'match($0, /http[^"]*720[^"]*/){print substr($0,RSTART,RLENGTH)}'
gives output
https:\/\/caspian1.cdn.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjY1OTcxYTRkNGZiMjkyYjk0NjM0Mjk2ODVkOTc3YjEwIiwiZXhwIjoxNjY5ODIxNDM2LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.NI2_6nwOxLEOxhWghsR2bOqzrXINXqqscbduHpCWwok
Explanation: I use wget with informational output such as the progress bar turned off (--quiet) and writing to standard output (-O -), which is piped into awk. Each line is matched against the regular expression http[^"]*720[^"]*, that is, http followed by zero or more (*) non-quote characters, followed by 720, followed by zero or more non-quote characters. If there is a match, I print the substring of the line containing that match; the match string function sets the RSTART and RLENGTH variables, which I then use in substr. Note: this might give false positives if there are other URLs containing 720.
(tested in GNU Wget 1.20.3 and GNU Awk 5.0.1)
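As a toy illustration of how match() sets RSTART and RLENGTH (the input string below is made up):
printf '%s\n' 'src="https:\/\/example.test\/video-720p.apt?sig=abc"' |
  awk 'match($0, /http[^"]*720[^"]*/){ print RSTART, RLENGTH; print substr($0, RSTART, RLENGTH) }'
This prints the match's start position, its length, and then the matched substring itself.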
Using any awk:
$ cat file | awk 'match($0,/"https?:\\\/\\\/[^"]*-720p\.apt\?[^"]*"/) { print substr($0,RSTART+1,RLENGTH-2) }'
https:\/\/caspian1.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6ImViODhjZDNlYzZhYzk3OTBhZDc3MWJhMzIyNWQ3NmZlIiwiZXhwIjoxNjY5ODE4Mjc5LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.e6do9Ha9EkDS46NZDoHT2dYHSOezu_TbdGAGblfi2tM
The contents of file are what you provided in pastebin, obviously just replace cat file with your curl command.
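For instance, with the URL from the question:
curl -s https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame |
  awk 'match($0,/"https?:\\\/\\\/[^"]*-720p\.apt\?[^"]*"/) { print substr($0,RSTART+1,RLENGTH-2) }'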

Get SVN URL of removed git-svn file

I would like to track a removed file as far back in history as possible, while using git-svn on a subdirectory of the SVN repository.
Using git log --full-history -- path/to/removed_file.py, I can see the history starting with the time the file was moved into the subdirectory I checked out using git-svn.
I can see which SVN revision that was in the git-svn commit message postfix, so I would now like to use svn log <full_url>@revision to see the rest of the history.
I know that I could use git svn info --url path/to/existing_file.py to see the required full SVN url, but what is a quick (ideally scriptable) way of getting the SVN URL of a file that is no longer in the repository?
To git, it doesn't matter much that a file foo/bar.py is removed in HEAD — as long as you have it in history, you can view every past version of it.
For the sake of concreteness, I'll take this git-svn repo from the LLVM project as an example. There, the file docs/todo.rst has been deleted in svn revision 308987, git commit fb572868… and is absent in master.
Let's first init a local clone.
$ git clone https://github.com/llvm-mirror/lnt && cd lnt
Cloning into 'lnt'...
...
$ git svn init https://llvm.org/svn/llvm-project/lnt/trunk
$ git update-ref refs/remotes/git-svn refs/remotes/origin/master
$
$ #-- ask svn info of anything to check setup and/or force laziness
$ git svn info --url README.md
Rebuilding .git/svn/refs/remotes/git-svn/.rev_map.91177308-0d34-0410-b5e6-96231b3b80d8 ...
r154126 = 3c3062527ac17b5fac440c55a3e1510d0ab8c9d9
r154135 = 82a95d29ac7d25c355fbd0898a44dc3e71a75fd8
...
r374687 = 446f9a3b651086e87684d643705273ef78045279
r374824 = 8c57bba3687ada10de5653ae46c537e957525bdb
Done rebuilding .git/svn/refs/remotes/git-svn/.rev_map.91177308-0d34-0410-b5e6-96231b3b80d8
https://llvm.org/svn/llvm-project/lnt/trunk/README.md
So it gives back the README.md URL as expected. Now let's try the case of a deleted file:
$ git svn info --url docs/todo.rst
svn: 'docs/todo.rst' is not under version control
It fails, just like you say. man git-svn says that info "Does not currently support a -r/--revision argument."
OK then, let's try emulating what it does, first by hand.
https://llvm.org/svn/llvm-project/lnt/trunk/README.md?r=374824 — this is the URL for a given file at a given revision.
Our vanished docs/todo.rst is available at https://llvm.org/svn/llvm-project/lnt/trunk/docs/todo.rst?p=308986. Notice the decrement: per git show fb572868 | grep git-svn-id, docs/todo.rst is already deleted in r308987 — so we request r308986.
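For reference, the git-svn-id trailer that the script below parses looks roughly like this (reconstructed here from the revision and repository UUID shown above, so treat it as illustrative):
git-svn-id: https://llvm.org/svn/llvm-project/lnt/trunk@308987 91177308-0d34-0410-b5e6-96231b3b80d8
Its second field is <url>@<revision>, which is exactly what the awk below splits apart.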
On to scripting it... rather simple job.
git-svn-oldinfo () {
  relfname="$1"
  git log -n1 -- "$relfname" \
    | awk '/git-svn-id:/ {sub(/@/, " ", $2); print $2}' \
    | { read baseurl rev; echo "${baseurl}/${relfname}?p=$((rev-1))"; }
}
#-- test:
$ git-svn-oldinfo docs/todo.rst
https://llvm.org/svn/llvm-project/lnt/trunk/docs/todo.rst?p=308986
Quick-n-dirty but tested — you're welcome to adjust & extend as needed.
Edit
Despite git log being a "porcelain" command (i.e. not really designed for scripting), it's quite possible to parse out the filenames from it too, if you're to query by globs like **/removed_file.py:
git-svn-oldinfo-glob () {
  fileglob="$1"
  git log -n1 --stat --format=oneline -- "$fileglob" \
    | { read commit msg; \
        read fullname _remainder_dummy; \
        git cat-file -p $commit \
          | tail -n1 \
          | awk '/git-svn-id:/ {sub(/@/, " ", $2); print $2}' \
          | { read baseurl rev; echo "${baseurl}/${fullname}?p=$((rev-1))"; }
      }
}
#-- test:
$ git-svn-oldinfo-glob '**/todo.rst'
https://llvm.org/svn/llvm-project/lnt/trunk/docs/todo.rst?p=308986
Take it with a grain of salt: it'll probably break in hilarious ways or output garbage if the glob matches multiple files, non-removed files, files with whitespace in the name, etc.
As always, check out man git-log and customize as needed.

How to get all tags from github api

I usually get the releases/tags from the GitHub API with the command below:
$ repo="helm/helm"
$ curl -sL https://api.github.com/repos/${repo}/tags |jq -r ".[].name"
v3.2.0-rc.1
v3.2.0
v3.1.3
v3.1.2
v3.1.1
v3.1.0-rc.3
v3.1.0-rc.2
v3.1.0-rc.1
v3.1.0
v3.0.3
v3.0.2
v3.0.1
v3.0.0-rc.4
v3.0.0-rc.3
v3.0.0-rc.2
v3.0.0-rc.1
v3.0.0-beta.5
v3.0.0-beta.4
v3.0.0-beta.3
v3.0.0-beta.2
v3.0.0-beta.1
v3.0.0-alpha.2
v3.0.0-alpha.1
v3.0.0
v2.16.6
v2.16.5
v2.16.4
v2.16.3
v2.16.2
v2.16.1
But in fact, it doesn't list all releases. What should I do?
For example, I can't get releases before v2.16.1, as in the link below:
https://github.com/helm/helm/tags?after=v2.16.1
I tried adding ?after=v2.16.1 to the curl API call in the same way, but it didn't help:
curl -sL https://api.github.com/repos/${repo}/tags?after=v2.16.1 |jq -r ".[].name"
I got the same output.
Reference: https://developer.github.com/v3/git/tags/
This could be because of pagination.
See this script as an example of detecting pages and adding the required ?page=x to access all the data from a GitHub API call.
Relevant extract:
# single-page results (no pagination) have no Link: section, so the grep result is empty
last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
# does this result use pagination?
if [ -z "$last_page" ]; then
    # no - this result has only one page
    rest_call "https://api.github.com${GITHUB_API_REST}"
else
    # yes - this result is on multiple pages
    for p in `seq 1 $last_page`; do
        rest_call "https://api.github.com${GITHUB_API_REST}?page=$p"
    done
fi
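For context, the Link: header that the sed expression whittles down looks roughly like this for the helm/helm tags endpoint (illustrative, not captured output):
Link: <https://api.github.com/repos/helm/helm/tags?page=2>; rel="next", <https://api.github.com/repos/helm/helm/tags?page=4>; rel="last"
The greedy .*page= in the first sed expression strips everything up to the last page=, and the second expression drops the trailing >; rel="last", leaving just the last page number.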
With help from @VonC, I got the result with the extra query string ?page=2, and I can query older releases the same way, and so on.
curl -sL https://api.github.com/repos/${repo}/tags?page=2 |jq -r ".[].name"
I can easily get the last page now.
$ GITHUB_API_REST="/repos/helm/helm/tags"
$ GITHUB_API_HEADER_ACCEPT="Accept: application/vnd.github.v3+json"
$ GITHUB_TOKEN=xxxxxxxx
$ last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
$ echo $last_page
4
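Putting the pieces together, a minimal sketch that walks every page and collects all the tag names (reusing the repo and last_page variables set above):
repo="helm/helm"
for p in $(seq 1 "$last_page"); do
  curl -sL "https://api.github.com/repos/${repo}/tags?page=$p" | jq -r '.[].name'
done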

extracting a number from a stream using sed/grep/awk

I am writing a script and I need to get a number out of a shell command's output. The command and its output are:
$ git branch -a -v --no-abbrev --contains $(git rev-parse HEAD)
* (HEAD detached at c5246b6) c5246b6907e46795741853852462914e7a5f60de Merge pull request 1166 from testPR into dev
remotes/origin-pull/1166/merge c5246b6907e46795741853852462914e7a5f60de Merge pull request 1166 from testPR into dev
I am trying to extract the 1166 out of the result by piping it through sed. Something like
$ git branch -a -v --no-abbrev --contains $(git rev-parse HEAD) | sed <pattern>
to get the 1166
My patterns so far don't seem to get the number I am expecting.
It seems that you're trying to extract the part of your remote branch name between the last 2 slashes. You may use grep with a Perl-compatible pattern to achieve that; here you are:
$ git branch ... | grep -oP '[^\/]+(?=\/[^\/]+$)'
1166
Brief explanation:
-P: interpret the pattern as a Perl-compatible regular expression (needed for the lookahead)
-o: print only the matched (non-empty) parts
[^\/]+ : the part grep prints, one or more non-slash characters
(?=\/[^\/]+$) : a lookahead requiring the match to be followed by the last slash and the final component of the line ([^\/]+$)
Not the answer to my exact question, but I am able to get what I want by modifying my bash command.
git branch -r -v --no-abbrev --contains $(git rev-parse HEAD) | awk '{print $1}'
This returns origin-pull/1166/merge, which is what I want.
Notice the -r in the command: -a would return both local and remote git branch info. This way, I can cheat and avoid the sed pattern; a further awk step can pull out just the number, as sketched below.
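A small follow-on sketch, assuming the pull ref keeps the origin-pull/<number>/merge layout and is the first line listed:
git branch -r -v --no-abbrev --contains $(git rev-parse HEAD) |
  awk '{ split($1, a, "/"); print a[2]; exit }'
This splits the first field on slashes and prints the middle component, i.e. 1166 in the example above.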

CasperJS and Parallel Crawling

I'm trying to crawl a website. However, my crawling process takes so long that I need to use multiple instances to shorten it. I've searched for other ways and aborted all the unnecessary resource requests, but it's still way too slow for me (around 8-9 seconds).
What is the easiest way to parallelize CasperJS instances, or even just run two casperjs processes at the same time to crawl in parallel?
I have used GNU parallel, following a blog post I found; however, it seems that although the processes are alive, they are not crawling in parallel, because the total execution time is still the same as with one instance.
Should I use a Node.js server to create instances?
What is the easiest and most practical way?
Can you adapt this:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
  cat $URLLIST |
    parallel lynx -listonly -image_links -dump {} \; \
      wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
    grep -F $BASEURL |
    grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
  mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN
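As for the CasperJS case itself, a minimal hedged sketch: GNU parallel can keep two crawler instances running at a time (casper_crawl.js and urls.txt are hypothetical names for your crawl script and your list of URLs, one per line):
# run at most 2 casperjs processes at once, passing one URL to each invocation
parallel -j2 casperjs casper_crawl.js {} :::: urls.txt
If each invocation still takes 8-9 seconds, the wall-clock time should roughly halve once two truly independent processes are running.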