Popups block bulk download of pdfs from website with wget

Popups block bulk download of pdfs from website with wget - pdf

I would like to download some free-to-download pdfs (copies of old newspaper) from this website of the Austrian National Library with wget using the bash script below:
for year in {14..57}; do
for month in `seq -w 1 12`; do # -w for leading zero
for day in `seq -w 1 31`; do
wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
done
done
done
Aside of some newspaper issues not being available, I cannot download any issues even though they exist. I would get errors such as the one for the existing issue of June 30, 1814 for example:
http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
Aufl"osen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
FEHLER 404: Not Found.
However, if you were to download the corresponding pdfs manually (here, see upper-right corner) you have to press "ok" in a pop-up acknowledgement. Once you did this, I can even download the issue via wget without a problem.
How can I tell wget to confirm via the command line the acknowledgements (the question you get once you want to download a pdf), see screenshot below? Is there a command in wget for that?

There are two issues in your code.
lgz newspaper is not available for all the dates
The PDF are not always generated and cached on the URL you used. You need to first run the other URL to make sure the PDF is generated
Below is the updated code that should work
#!/bin/bash
for year in {14..57}; do
DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" | gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)
for date in $DATES
do
echo "Downloading for $date"
curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed
wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf
done
done

Related

curl command not working on gitlab-ci pipeline

I am trying to pull a project ID using gitlab REST API v4, but when I issue the curl command, I get this error:
"jobs:test:script config should be a string or an array of strings"
The command is this one:
curl -k -H "PRIVATE-TOKEN: PRIVATE-TOKEN" "https://gitlab.nbg992.poc.dcn.telekom.de/api/v4/projects?search=$CI_PROJECT_NAME"
I tried to single quote it:
'curl -k -H "PRIVATE-TOKEN: PRIVATE-TOKEN" "https://gitlab.nbg992.poc.dcn.telekom.de/api/v4/projects?search=$CI_PROJECT_NAME"'
But when I do it, it removes the failure, but the command is ignored.
So I tried to eval it like this:
eval - 'curl -k -H "PRIVATE-TOKEN: PRIVATE-TOKEN" "https://gitlab.nbg992.poc.dcn.telekom.de/api/v4/projects?search=$CI_PROJECT_NAME"'
When I do it, the failure its produced again:
"jobs:test:script config should be a string or an array of strings"
Any clue how should I issue the curl command? I think what is causing the failure is the colon within the "PRIVATE-TOKEN: PRIVATE-TOKEN"

This worked for me
Declare Job variables in variables sections eg:
variables:
PRIVATE-TOKEN: "TokenValue"
PRIVATE_HEADER: "PRIVATE-TOKEN: ${PRIVATE-TOKEN}"
Then under Script Section of the CI file used Curl command as follows
script:
curl -k -H ${PRIVATE_HEADER} "https://gitlab.nbg992.poc.dcn.telekom.de/api/v4/projects?search=${CI_PROJECT_NAME}
Using the {} braces around variable names made sure that ":" issue doesn't show up

How to get all tags from github api

I usually get the releases/tags from github API with below command
$ repo="helm/helm"
$ curl -sL https://api.github.com/repos/${repo}/tags |jq -r ".[].name"
v3.2.0-rc.1
v3.2.0
v3.1.3
v3.1.2
v3.1.1
v3.1.0-rc.3
v3.1.0-rc.2
v3.1.0-rc.1
v3.1.0
v3.0.3
v3.0.2
v3.0.1
v3.0.0-rc.4
v3.0.0-rc.3
v3.0.0-rc.2
v3.0.0-rc.1
v3.0.0-beta.5
v3.0.0-beta.4
v3.0.0-beta.3
v3.0.0-beta.2
v3.0.0-beta.1
v3.0.0-alpha.2
v3.0.0-alpha.1
v3.0.0
v2.16.6
v2.16.5
v2.16.4
v2.16.3
v2.16.2
v2.16.1
But in fact, it doesn't list all releases, what should I do?
For example, I can't get release before v2.16.1 as below link
https://github.com/helm/helm/tags?after=v2.16.1
I try to reference the same way to add ?after=v2.16.1 in curl api
command, but no help
curl -sL https://api.github.com/repos/${repo}/tags?after=v2.16.1 |jq -r ".[].name"
I got same output.
Reference: https://developer.github.com/v3/git/tags/

This could be because of pagination
See this script as an example of detecting pages, and adding the required ?page=x to access to all the data from a GitHub API call.
Relevant extract:
# single page result-s (no pagination), have no Link: section, the grep result is empty
last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
# does this result use pagination?
if [ -z "$last_page" ]; then
# no - this result has only one page
rest_call "https://api.github.com${GITHUB_API_REST}"
else
# yes - this result is on multiple pages
for p in `seq 1 $last_page`; do
rest_call "https://api.github.com${GITHUB_API_REST}?page=$p"
done
fi

With help from #VonC, I got the result with extra query string ?page=2, if I'd like to query older releases and so on.
curl -sL https://api.github.com/repos/${repo}/tags?page=2 |jq -r ".[].name"
I can easily get the last page now.
$ GITHUB_API_REST="/repos/helm/helm/tags"
$ GITHUB_API_HEADER_ACCEPT="Accept: application/vnd.github.v3+json"
$ GITHUB_TOKEN=xxxxxxxx
$ last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
$ echo $last_page
4

Using Apache Bench to load test with JSON on Windows

I'm issuing the following command:
ab -n 100 -c 20 -k -v 1 -H "Accept-Encoding: gzip, deflate" -T "application/json" -p ab-login.txt http://localhost:222/
With the contents of ab-login.txt being:
{username:'user',password:'pass!'}
And I get an error:
Could not stat POST data file (ab-login.txt): Partial results are valid but processing is incomplete
What am I doing wrong?

I added
-v 4 -T 'application/x-www-form-urlencoded'
Also, a line in my post file looked like:
name='myName'
There were no quotes or anything like that.

ssh wget download jdk

I've SSHed into one of my VPS and I'm trying to install java on it. Not really sure how to go about downloading but I am trying to use wget to download and install JDK7 from the Oracle website.
this file in particular: http://download.oracle.com/otn-pub/java/jdk/7u40-b43/jdk-7u40-linux-x64.rpm
To download the file, it requires authentication and I don't know how to do that through wget.
SOmeone please help

# rpm
wget --no-cookies \
--no-check-certificate \
--header "Cookie: oraclelicense=accept-securebackup-cookie" \
"http://download.oracle.com/otn-pub/java/jdk/7u55-b13/jdk-7u55-linux-x64.rpm" \
-O jdk-7-linux-x64.rpm
# ubuntu
wget --no-cookies \
--no-check-certificate \
--header "Cookie: oraclelicense=accept-securebackup-cookie" \
"http://download.oracle.com/otn-pub/java/jdk/7u55-b13/jdk-7u55-linux-x64.tar.gz" \
-O jdk-7-linux-x64.tar.gz
# then
tar -xzvf jdk-7-linux-x64.tar.gz
refer: https://gist.github.com/scottvrosenthal/11187116

To download the file, it requires authentication and I don't know how
to do that through wget.
To be more accurate it requires license agreement.
If you have SSH access to the server you can copy file using scp command
scp /<path to the file>/jdk-7u40-linux-x64.rpm user-name#server-name:/tmp/
Then you go to the server by ssh and find it in /tmp/ folder

Authenticating with the new Google Analytics Core Reporting API and OAuth 2.0

Google Analytics's new "Core Reporting API" (version 3.0) "recommend[s] using OAuth 2.0 to authorize requests" (citation). Its documentation, though, is very unclear about how to do that. (It says "When you create your application, you register it with Google" (citation), but does a shell script count as an "application"?? If so, I should register the bash script at the "APIs Console", which doesn't give any guidance on how to do so.) Using Analytics' version 2.3, I run a bash script:
#!/bin/bash
# generates an XML file
googleAuth="$(curl https://www.google.com/accounts/ClientLogin -s \
-d Email=foo \
-d Passwd=bar \
-d accountType=GOOGLE \
-d source=curl-dataFeed-v2 \
-d service=analytics \
| awk /Auth=.*/)"
# ...
feedUri="https://www.google.com/analytics/feeds/data\
?ids=$table\
&start-date=$SD\
&end-date=$ED\
&dimensions=baz\
&metrics=xyzzy\
&prettyprint=true"
# ...
curl $feedUri --silent \
--header "Authorization: GoogleLogin $googleAuth" \
--header "GData-Version: 2" \
| awk # ...
How would I do something like this — a script that grabs whatever login token I need and sends it back — for the new Analytics?
(Incidentally, yes, I realize the results will be JSON, not XML.)

Here is a sample script
googleAuth="$(curl https://www.google.com/accounts/ClientLogin -s \
-d Email=$USER_EMAIL \
-d Passwd=$USER_PASS \
-d accountType=GOOGLE \
-d source=curl-accountFeed-v1 \
-d service=analytics \
| grep "Auth=" | cut -d"=" -f2)"
feedUri="https://www.googleapis.com/analytics/v3/data/ga\
?start-date=$START_DATE\
&end-date=$END_DATE\
&ids=ga:$PROFILE_ID\
&dimensions=ga:userType\
&metrics=ga:users\
&max-results=50\
&prettyprint=false"
curl $feedUri -s --header "Authorization: GoogleLogin auth=$googleAuth"

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Popups block bulk download of pdfs from website with wget - pdf

Related

curl command not working on gitlab-ci pipeline

How to get all tags from github api

Using Apache Bench to load test with JSON on Windows

ssh wget download jdk

Authenticating with the new Google Analytics Core Reporting API and OAuth 2.0

Categories

Resources