How to get xmllint to read from stdin?

I don't want to create an XML file.
I need to use --shell with cat to create filters.
I don't have --xpath in my version of xmllint (libxml2-2.7.6-14.el6.x86_64).
xml|xmllint --shell - <<< $(echo 'cat /')
-:1: parser error : Start tag expected, '<' not found
EDIT: clustat -x outputs an XML file and I want to parse out the active node.
I don't think there is a way to do it without --xpath, so I created a temporary XML file.
/usr/sbin/clustat -x > /tmp/clustat.xml
ACTIVENODE=$(xmllint --shell /tmp/clustat.xml <<< `echo 'cat //group/#owner'`|grep -v "^/ >"|cut -d= -f2|tr -d \")
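If your shell supports process substitution, the temporary file may be avoidable: xmllint sees the command output as a file, while stdin stays free for the shell commands (an untested sketch, using the same XPath as above):
ACTIVENODE=$(xmllint --shell <(/usr/sbin/clustat -x) <<< 'cat //group/#owner'|grep -v "^/ >"|cut -d= -f2|tr -d \")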

I had a similar issue where I had to unzip an XML file and then feed it to xmllint. The key is the "-" option, which tells xmllint to read from stdin.
For example:
$ bzip2 -dc dierehabilitati00delagoog_meta.xml.bz2 | xmllint --format
would fail, printing xmllint's usage message. Adding "-" worked:
$ bzip2 -dc dierehabilitati00delagoog_meta.xml.bz2 | xmllint --format -
<?xml version="1.0"?>
<metadata>
  <title>Die Rehabilitation im Strafrecht</title>
  <creator>Ernst Delaquis</creator>
  <mediatype>texts</mediatype>
  <collection>americana</collection>
</metadata>
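The same "-" convention works for other xmllint modes too; for example, a quick well-formedness check of piped XML (illustrative sketch, the URL is a placeholder):
$ curl -s https://example.com/feed.xml | xmllint --noout -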
Hope this helps.

Related

How to filter output of a URL

I have a URL, and when I send a request with curl, I get a big output.
curl https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame -H "Accept: application/json" -s
I get: https://pastebin.mozilla.org/QM6FN8MZ#L
But I just want to get the 720p URL, I mean just:
https:\/\/caspian1.cdn.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjQ2NDJhYmQ4NGFiN2UzNDJkNGMxZWI3ZTNkMzlmZmQ5IiwiZXhwIjoxNjY5ODA5NzI1LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.havkkhJyXjBt_jHPVv4poEVb65_7tRsLIxO5pCO7tGE
Any idea how to do it?
I'm trying to use grep, but I don't know how to strip out everything other than the 720p URL.
curl https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame -H "Accept: application/json" -s | grep -e "720p"
You could go the html-parsing/json-parsing route, e.g.:
curl -s https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame |
# Normalize html
xmlstarlet fo -o -H -R 2> /dev/null |
# Extract relevant js bit
xmlstarlet sel -t -v '_:html/_:body/_:div/_:script' 2> /dev/null |
# Extract relevant json
sed -nE '/^ *var +options *= */ { s///; s/;$//p; }' |
# Extract desired url, i.e. the 720p in this case
jq -r '.multiSRC[][] | select( .label == "720p" ) | .src'
I would harness GNU AWK for this in the following way:
wget --quiet -O - https://www.aparat.com/video/video/embed/videohash/lXhkG/vt/frame | awk 'match($0, /http[^"]*720[^"]*/){print substr($0,RSTART,RLENGTH)}'
gives output
https:\/\/caspian1.cdn.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjY1OTcxYTRkNGZiMjkyYjk0NjM0Mjk2ODVkOTc3YjEwIiwiZXhwIjoxNjY5ODIxNDM2LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.NI2_6nwOxLEOxhWghsR2bOqzrXINXqqscbduHpCWwok
Explanation: I use wget with informational output such as the progress bar turned off (--quiet) and writing to standard output (-O -), which is piped into awk. Each line is matched against the regular expression http[^"]*720[^"]*, that is: http, followed by zero or more (*) non-quote characters, followed by 720, followed by zero or more non-quote characters. If there is a match, I print the substring of the line containing that match; the match string function sets the RSTART and RLENGTH variables, which I use later in substr. Note: this might give false positives if there are other URLs containing 720.
(tested in GNU Wget 1.20.3 and GNU Awk 5.0.1)
Using any awk:
$ cat file | awk 'match($0,/"https?:\\\/\\\/[^"]*-720p\.apt\?[^"]*"/) { print substr($0,RSTART+1,RLENGTH-2) }'
https:\/\/caspian1.asset.aparat.com\/aparat-video\/de54245e862b62249b6b7958c734276547445778-720p.apt?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6ImViODhjZDNlYzZhYzk3OTBhZDc3MWJhMzIyNWQ3NmZlIiwiZXhwIjoxNjY5ODE4Mjc5LCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.e6do9Ha9EkDS46NZDoHT2dYHSOezu_TbdGAGblfi2tM
The contents of file are what you provided in pastebin, obviously just replace cat file with your curl command.
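Note that the URL in the page source is JSON-escaped (\/ instead of /). If you want a clean URL, the backslashes can be stripped in the same awk call (a sketch extending the command above):
$ cat file | awk 'match($0,/"https?:\\\/\\\/[^"]*-720p\.apt\?[^"]*"/) { s=substr($0,RSTART+1,RLENGTH-2); gsub(/\\\//,"/",s); print s }'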

String manipulation in .gitlab-ci variables

I'm trying to set up my CI file with some variables. I'm able to generate a variable like so:
...
variables:
  TARGET_PROJECT_DIR: "${CI_PROJECT_NAME}.git"
However, I don't seem to be able to do this:
...
variables:
  PROJECT_PROTOCOL_RELATIVE_URL: "${CI_PROJECT_URL//https:\/\/}.git"
If I run that in bash, I get the expected output which is gitlab.com/my/repo/url.git with the 'https://' removed and the '.git' appended.
My workaround has just been to export it in the 'script' section, but it feels a lot neater to add this to the variables section, since this is part of a template that is being inherited by the actual jobs. Is it possible?
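For reference, the script-section workaround amounts to something like this (a sketch reusing the same substitution; it works there because the job's shell performs the expansion, while GitLab's own expansion of the variables: section only understands plain $VAR / ${VAR} references):
script:
  - export PROJECT_PROTOCOL_RELATIVE_URL="${CI_PROJECT_URL//https:\/\/}.git"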
There are several more useful variables defined in the GitLab CI environment.
CI_PROJECT_PATH gives you the <namespace>/<project name> (or just <project name> if you have no extra namespace) string and
CI_SERVER_HOST gives you the server name, so you could do
variables:
  PROJECT_PROTOCOL_RELATIVE_URL: ${CI_SERVER_HOST}/${CI_PROJECT_PATH}.git
I have similar setups (also without quotes).
I'm not sure if that will work for you, since my runners and my server are under my control and I don't run pipelines with external projects.
But you can get all available variables displayed in the job log by running a job like this:
stages:
  - env

show-env:
  stage: env
  script:
    - env
Also always helpful is https://docs.gitlab.com/ee/ci/variables/predefined_variables.html
After looking around for similar challenges, I found your unanswered question. Here are my suggestions:
stages:
  - todo

todo-job:
  stage: todo
  only:
    - master
  script:
    # your question / example
    - echo ${CI_PROJECT_URL}
    - echo ${CI_PROJECT_URL:8:100}.git
    # Because you have the word "manipulation" in the title, here are some more examples:
    # Return the substring between the two '_'
    - INPUT="someletters_12345_moreleters.ext"
    - SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)_.*' `
    - echo $SUBSTRING
    # Store a substring in a new variable and print it
    - b=${INPUT:12:5}
    - echo $b
    # Substring using grep with a regex (more readable)
    - your_number=$(echo "someletters_12345_moreleters.ext" | grep -E -o '[0-9]{5}')
    - echo $your_number
    # Substring from a variable using grep with a regex (more readable)
    - your_number=$(echo "$INPUT" | grep -E -o '[0-9]{5}')
    - echo $your_number
    # Split a string and return a part using 'cut'
    - your_id=$(echo "Release V14_TEST-42" | cut -d "_" -f2 )
    - echo $your_id
    # Split the string in a variable and return a part using 'cut'
    - VAR="Release V14_TEST-42"
    - your_number=$(echo "$VAR" | cut -d "_" -f2 )
    - echo $your_number
GitLab output looks like:
$ echo ${CI_PROJECT_URL}
https://gitlab.com/XXXXXXXXXX/gitlab_related_projects/test
$ echo ${CI_PROJECT_URL:8:100}.git
gitlab.com/XXXXXXXXXX/gitlab_related_projects/test.git
$ INPUT="someletters_12345_moreleters.ext"
$ SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)_.*' `
$ echo $SUBSTRING
12345
$ b=${INPUT:12:5}
$ echo $b
12345
$ your_number=$(echo "someletters_12345_moreleters.ext" | grep -E -o '[0-9]{5}')
$ echo $your_number
12345
$ your_number=$(echo "$INPUT" | grep -E -o '[0-9]{5}')
$ echo $your_number
12345
$ your_number=$(echo "Release V14_TEST-42" | cut -d "_" -f2 )
$ echo $your_number
TEST-42
$ VAR="Release V14_TEST-42"
$ your_number=$(echo "$VAR" | cut -d "_" -f2 )
$ echo $your_number
TEST-42
Cleaning up project directory and file based variables
00:01
Job succeeded
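As a side note, bash prefix removal gives the same result as ${CI_PROJECT_URL:8:100} without hard-coding the offset (a small sketch, assuming the job shell is bash):
- echo "${CI_PROJECT_URL#https://}.git"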

How to get all tags from github api

I usually get the releases/tags from github API with below command
$ repo="helm/helm"
$ curl -sL https://api.github.com/repos/${repo}/tags |jq -r ".[].name"
v3.2.0-rc.1
v3.2.0
v3.1.3
v3.1.2
v3.1.1
v3.1.0-rc.3
v3.1.0-rc.2
v3.1.0-rc.1
v3.1.0
v3.0.3
v3.0.2
v3.0.1
v3.0.0-rc.4
v3.0.0-rc.3
v3.0.0-rc.2
v3.0.0-rc.1
v3.0.0-beta.5
v3.0.0-beta.4
v3.0.0-beta.3
v3.0.0-beta.2
v3.0.0-beta.1
v3.0.0-alpha.2
v3.0.0-alpha.1
v3.0.0
v2.16.6
v2.16.5
v2.16.4
v2.16.3
v2.16.2
v2.16.1
But in fact, it doesn't list all the releases. What should I do?
For example, I can't get releases before v2.16.1, as on this page:
https://github.com/helm/helm/tags?after=v2.16.1
I tried adding ?after=v2.16.1 to the curl API call in the same way, but it didn't help:
curl -sL https://api.github.com/repos/${repo}/tags?after=v2.16.1 |jq -r ".[].name"
I got the same output.
Reference: https://developer.github.com/v3/git/tags/
This could be because of pagination.
See this script as an example of detecting the number of pages, and adding the required ?page=x to access all the data from a GitHub API call.
Relevant extract:
# single-page results (no pagination) have no Link: header, so the grep result is empty
last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
# does this result use pagination?
if [ -z "$last_page" ]; then
# no - this result has only one page
rest_call "https://api.github.com${GITHUB_API_REST}"
else
# yes - this result is on multiple pages
for p in `seq 1 $last_page`; do
rest_call "https://api.github.com${GITHUB_API_REST}?page=$p"
done
fi
With help from @VonC, I can now get older releases with an extra query string such as ?page=2, and so on.
curl -sL https://api.github.com/repos/${repo}/tags?page=2 |jq -r ".[].name"
I can easily get the last page now.
$ GITHUB_API_REST="/repos/helm/helm/tags"
$ GITHUB_API_HEADER_ACCEPT="Accept: application/vnd.github.v3+json"
$ GITHUB_TOKEN=xxxxxxxx
$ last_page=`curl -s -I "https://api.github.com${GITHUB_API_REST}" -H "${GITHUB_API_HEADER_ACCEPT}" -H "Authorization: token $GITHUB_TOKEN" | grep '^Link:' | sed -e 's/^Link:.*page=//g' -e 's/>.*$//g'`
$ echo $last_page
4
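Putting it together, a loop over all pages (a sketch built from the commands above, with repo as defined in the question; GitHub also accepts per_page=100 to cut down the number of pages):
for p in $(seq 1 $last_page); do
  curl -sL "https://api.github.com/repos/${repo}/tags?page=$p" | jq -r '.[].name'
done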

Issue with genstrings for Swift file

genstrings works well to extract localizable content from .m files:
find . -name \*.m | xargs genstrings -o en.lproj
But it does not work for .swift files:
find . -name \*.swift | xargs genstrings -o en.lproj
The genstrings tool works fine with Swift as far as I can tell. Here is my test:
// MyClass.swift
let message = NSLocalizedString("This is the test message.", comment: "Test")
Then, in the folder with the class:
# generate strings for all swift files (even in nested directories)
$ find . -name \*.swift | xargs genstrings -o .
# See results
$ cat Localizable.strings
/* Test */
"This is the test message." = "This is the test message.";
$
I believe genstrings works as intended; however, Apple's xargs approach to generating strings from all your project's files is flawed and does not properly handle paths containing spaces.
That might be the reason why it's not working for you.
Try using the following:
find . -name \*.swift | tr '\n' '\0' | xargs -0 genstrings -o .
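Equivalently, find can emit the NUL separators itself, which also copes with newlines in file names (assuming a find that supports -print0, e.g. GNU or BSD):
find . -name '*.swift' -print0 | xargs -0 genstrings -o .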
We wrote a command-line tool that works for Swift files and merges its result with that of Apple's genstrings tool.
It supports both key and value in NSLocalizedString:
https://github.com/KeepSafe/genstrings_swift
There's an alternative tool called SwiftGenStrings:
Hello.swift
NSLocalizedString("hello", value: "world", comment: "Hi!")
SwiftGenStrings:
$ SwiftGenStrings Hello.swift
/* Hi! */
"hello" = "world";
Apple genstrings:
$ genstrings Hello.swift
Bad entry in file Hello.swift (line = 1): Argument is not a literal string.
Disclaimer: I worked on SwiftGenStrings.
There is a similar question here:
How to use genstrings across multiple directories?
find ./ -name "*.m" -print0 | xargs -0 genstrings -o en.lproj
The issue I was having with find/genstrings was twofold:
When it reached folder names with spaces (generated by the output of find), it would exit with an error
When it reached the file where I had my custom routine defined, it was giving me an error when trying to parse my actual function definition
To fix both those problems I'm using the following:
find Some/Path/ \( -name "*.swift" ! -name "MyExcludedFile.swift" \) | sed "s/^/'/;s/$/'/" | xargs genstrings -o . -s MyCustomLocalizedStringRoutine
To summarize: we use the find command to find the Swift files (excluding the ones we don't want), pipe the results into sed, which wraps each file path in quotes, and finally pipe that result into genstrings.
Xcode now includes a powerful tool for extracting localizations.
Just select your project on the left then Editor menu >> Export localizations.
You'll get a folder with all the text in your files as well as the Localizable.strings and InfoPlist.strings
More details here:
https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/LocalizingYourApp/LocalizingYourApp.html

AWK to process compressed files and printing original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing them and passing the data to gawk on the fly,
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of .gz files of rather large size to process.
I'm looking for anomalies (~0.001% of rows) and want to print the list of found inconsistencies ALONG with the file name and the row number that contained them.
If I could have all the files decompressed, I would simply use the FILENAME variable to get this.
Because of the large quantity and size of those files, I can't decompress them upfront.
Any ideas how to pass the file name (in addition to the gzip stdout) to gawk to produce the required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -r -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
To use xargs instead of the above loop would, I believe, require the body of the command to be in a pre-written script file, which xargs could then call with the filename.
This uses a combination of xargs and sh (to be able to pipe two commands, gzip and awk, together):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…
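One real pitfall with it: fname is substituted textually into the sh -c string, so a file name containing quotes, spaces, or $ can break the command. Passing the names as positional parameters instead avoids that (a sketch, with printbadrowsonly.awk as above):
find . -name '*.gz' -print0 | xargs -0 sh -c 'for f in "$@"; do gzip -dc "$f" | gawk -v origfile="$f" -f printbadrowsonly.awk; done' sh >> baddata.txt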