How to diff PDF files?

Sometimes, when I download a PDF file, such as one of my statements from my bank's website, and then, at a later time, download the same file again, both files differ.
How can I see how they differ?
I've tried:
diff file-1.pdf file-2.pdf
But that just prints:
Binary files file-1.pdf and file-2.pdf differ

Try:
diff -a file-1.pdf file-2.pdf | less
Or:
git diff -a file-1.pdf file-2.pdf
Example of diff's output:
1869,1870c1869,1870
< /CreationDate (D:20220504152530-00'00')
< /ModDate (D:20220504152530-00'00')
---
> /CreationDate (D:20220509154833-00'00')
> /ModDate (D:20220509154833-00'00')
Notes:
For either diff or git-diff, the -a, --text option "treat[s] all files as text". (See man diff or man git-diff.)
I use less in case diff -a outputs any binary data. (See this question and this comment.)
You must add the --no-index option after git diff -a when you run the command in a working tree controlled by Git and both files are inside that working tree. (See man git-diff.)
To view a PDF file's data as text, do less file.pdf.
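If the two files differ only inside compressed streams, diff -a will mostly show binary noise. In that case you can compare the extracted text, or decompress both files first; this assumes pdftotext (from poppler-utils) or qpdf is installed, which may not be the case on your system:
diff <(pdftotext -layout file-1.pdf -) <(pdftotext -layout file-2.pdf -)
Or:
qpdf --qdf --object-streams=disable file-1.pdf file-1.qdf.pdf
qpdf --qdf --object-streams=disable file-2.pdf file-2.qdf.pdf
diff -a file-1.qdf.pdf file-2.qdf.pdf | less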

Related

How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file without an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
Pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but clicking the second link tries to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, Header, Link
import re
import sys

"""
Pandoc filter to convert internal links for multifile documents
"""

headerL1 = []

def fix_links(key, value, format, meta):
    global headerL1
    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)
        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)
    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            # Escape the dot so only a literal ".md" is removed from the target
            newlinkref = re.sub(r'\.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')

if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash
MD_INPUT=$(find . -type f | grep md | sort)
# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
  echo "Filtering $file"
  pandoc \
    --filter fix-links.py \
    "$file" \
    -t json \
    -o "${file%.md}.json"
done
JSON_INPUT=$(find . -type f | grep json | sort)
echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex
echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code, the problem can be solved by replacing the generated \href directive with \hyperlink.
Once this is done the linking works as expected.
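For instance, since all internal anchors produced by the filter start with the numeric file id, a rough post-processing sketch over the generated LaTeX (a manual workaround, not a pandoc feature) would be:
sed 's/\\href{\([0-9][^}]*\)}/\\hyperlink{\1}/g' test.tex > test-fixed.tex
pdflatex test-fixed.tex
Ordinary URLs are left alone because they do not start with a digit.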
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.

How would you crack this (MD5 HashCat)?

I was given this file:
hashes.txt
experthead:e10adc3949ba59abbe56e057f20f883e
interestec:25f9e794323b453885f5181f1b624d0b
ortspoon:d8578edf8458ce06fbc5bb76a58c5ca4
reallychel:5f4dcc3b5aa765d61d8327deb882cf99
simmson56:96e79218965eb72c92a549dd5a330112
bookma:25d55ad283aa400af464c76d713c07ad
popularkiya7:e99a18c428cb38d5f260853678922e03
eatingcake1994:fcea920f7412b5da7be0cf42b8c93759
heroanhart:7c6a180b36896a0a8c02787eeafb0e4c
edi_tesla89:6c569aabbf7775ef8fc570e228c16b98
liveltekah:3f230640b78d7e71ac5514e57935eb69
blikimore:917eb5e9d6d6bca820922a0c6f7cc28b
johnwick007:f6a0cb102c62879d397b12b62c092c06
flamesbria2001:9b3b269ad0a208090309f091b3aba9db
oranolio:16ced47d3fc931483e24933665cded6d
spuffyffet:1f5c5683982d7c3814d4d9e6d749b21e
moodie:8d763385e0476ae208f21bc63956f748
nabox:defebde7b6ab6f24d5824682a16c3ae4
bandalls:bdda5f03128bcbdfa78d8934529048cf
I thought I had to separate them. For example, I put experthead, interestec, etc. in one file named wordtext.txt, and e10adc3949ba59abbe56e057f20f883e, etc. in another file called hash.txt.
I then ran this:
hashcat -m 0 -a 0 /Users/myname/Desktop/hash.txt /Users/myname/Desktop/wordtext.txt -O
but I couldn't get anything. I then googled e10adc3949ba59abbe56e057f20f883e and found that it corresponds to 123456, so now I don't know how to approach this problem.
Just leave the hashes in the txt file (erase the usernames); hashcat will sort them out by itself. What I do is: hashcat.exe -m 0 -a 0 hashFile.txt dict.txt --show
The file appears to be in username:hash format. By default, hashcat assumes that only hashes are in the target file.
You can change this behavior with hashcat's --username option.
You don't need to place the -O at the end. It should work perfectly without it, but you do need hashcat.exe in the beginning.
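For example, keeping hashes.txt exactly as provided (username:hash per line), something along these lines should work; the wordlist path is only a placeholder for whatever dictionary you want to try:
hashcat -m 0 -a 0 --username hashes.txt wordlist.txt
hashcat -m 0 --username --show hashes.txt
The second command prints the cracked results afterwards.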

How do I verify a tape backup with tar?

I am doing some tape backups with:
mt eom
tar cv Projects*
Note: $TAPE variable is set to /dev/nsa0
I would like to verify that the backup has been done properly, via md5sum. Therefore I could do:
tar cvf test.tar Projects*
And I would like to compare with the tar stored in the tape.
However, I do not know how I could extract the whole tar file on the tape into a backupInTape.tar file so I can do:
md5sum test.tar
md5sum backupInTape.tar
This is what I have tried in order to do something like that (at the right location on the tape, verified with tar tv):
tar xv > backupInTape.tar
But it still extracts everything.
I am using FreeBSD 12.1
You could get the data from tape with the dd command, EG:
dd if="$TAPE" of=data-from-tape.tar bs=126
Check your tar command's default block size, and replace the 126 with it. I believe GNU tar uses 126, but other tar implementations may still default to 20 or something. With some tape drives, the block size matters a lot.
Also, see https://serverfault.com/questions/293605/check-integrity-of-tar-gz-backup
Also, GNU tar can --diff (compare) an archive against the files on disk to confirm, byte for byte, that it was written correctly.
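Putting that together on FreeBSD (where the base-system checksum tool is md5 rather than md5sum), a rough sketch might look like this; the block size is an assumption, so match whatever your tar actually wrote, and position the tape to the correct archive first:
mt -f /dev/nsa0 rewind
mt -f /dev/nsa0 fsf 2      # example only: skip file marks if the archive is not the first file on the tape
dd if=/dev/nsa0 of=backupInTape.tar bs=10240
md5 test.tar backupInTape.tar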

How can I view all comments posted by users in a Bitbucket repository?

On the repository home page, I can see comments posted in recent activity at the bottom, but it only shows 10 comments.
I want to see all the comments posted since the beginning.
Is there any way?
Comments of pull requests, issues and commits can be retrieved using bitbucket’s REST API.
However it seems that there is no way to list all of them at one place, so the only way to get them would be to query the API for each PR, issue or commit of the repository.
Note that this takes a long time, since bitbucket has seemingly set a limit to the number of accesses via API to repository data: I got Rate limit for this resource has been exceeded errors after retrieving around a thousand results, then I could retrieve about only one entry per second elapsed from the time of the last rate limit error.
Finding the API URL to the repository
The first step is to find the URL to the repo. For private repositories, it is necessary to get authenticated by providing username and password (using curl’s -u switch). The URL is of the form:
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}
Running git remote -v from the local git repository should provide the missing values. Check the constructed URL (below referred to as $url) by verifying that repository information is correctly retrieved as JSON data from it: curl -u username $url.
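For example (the owner, repository and user names below are made up; substitute your own):
git remote -v
# origin  https://bitbucket.org/myteam/myrepo.git (fetch)
url=https://api.bitbucket.org/2.0/repositories/myteam/myrepo
curl -su myusername "$url"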
Fetching comments of commits
Comments of a commit can be accessed at $url/commit/{commitHash}/comments.
The resulting JSON data can be processed by a script. Beware that the results are paginated.
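If you need the comment bodies rather than just a count, one way to walk the pages is a loop like the following sketch; it assumes jq is installed, reuses the $url, $id and $pw placeholders from this answer, and relies on each paginated response carrying the next page's URL in its next field:
page="$url/commit/$id/comments"
while [ -n "$page" ] && [ "$page" != "null" ]; do
  json=$(curl -s -u username:"$pw" "$page")
  printf '%s\n' "$json" | jq -r '.values[].content.raw'
  page=$(printf '%s' "$json" | jq -r '.next')
done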
Below I simply extract the number of comments per commit. It is indicated by the value of the member size of the retrieved JSON object; I also request a partial response by adding the GET parameter fields=size.
My script getNComments.sh:
#!/bin/sh
pw=$1
id=$2
json=$(curl -s -u username:"$pw" \
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}/commit/$id/comments'?fields=size')
printf '%s' "$json" | grep -q '"type": "error"' \
&& printf "ERROR $id\n" && exit 0
nComments=$(printf '%s' "$json" | grep -o '"size": [0-9]*' | cut -d' ' -f2)
: ${nComments:=EMPTY}
checkNumeric=$(printf '%s' "$nComments" | tr -dc 0-9)
[ "$nComments" != "$checkNumeric" ] \
&& printf >&2 "!ERROR! $id:\n%s\n" "$json" && exit 1
printf "$nComments $id\n"
To use it, taking into account the possibility for the error mentioned above:
A) Prepare input data. From the local repository, generate the list of commits as wanted (run git fetch -a first if the local git repo needs updating); check out git help rev-list for how it can be customised.
git rev-list --all | sort > sorted-all.id
cp sorted-all.id remaining.id
B) Run the script. Note that the password is passed here as a parameter – so first assign it to a variable safely using stty -echo; IFS= read -r passwd; stty echo, in one line; also see security considerations below. The processing is parallelised onto 15 processes here, using the option -P.
< remaining.id xargs -P 15 -L 1 ./getNComments.sh "$passwd" > commits.temp
C) When the rate limit is reached, that is when getNComments.sh prints !ERROR!, kill the above command (Ctrl-C) and execute the commands below to update the input and output files. Wait a while for the request limit to increase, then re-execute the xargs command above and repeat until all the data is processed (that is, when wc -l remaining.id returns 0).
cat commits.temp >> commits.result
cut -d' ' -f2 commits.result | sort | comm -13 - sorted-all.id > remaining.id
D) Finally, you can get the commits which received comments with:
grep '^[1-9]' commits.result
Fetching comments of pull requests and issues
The procedure is the same as for fetching commits’ comments, but for the following two adjustments:
Edit the script to replace commit in the URL with pullrequests or with issues, as appropriate;
Let $n be the number of issues/PRs to search. The git rev-list command above becomes: seq 1 $n > sorted-all.id
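For reference, the adjusted per-item endpoints then look something like this (the ids are placeholders and the paths simply follow the same pattern as for commits):
curl -su username "$url/pullrequests/42/comments?fields=size"
curl -su username "$url/issues/7/comments?fields=size"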
The total number of PRs in the repository can be obtained with:
curl -su username $url/pullrequests'?state=&fields=size'
and, if the issue tracker is set up, the number of issues with:
curl -su username $url/issues'?fields=size'
Hopefully, the repository has few enough PRs and issues so that all data can be fetched in one go.
Viewing comments
They can be viewed normally via the web interface on their commit/PR/issue page at:
https://bitbucket.org/{repoOwnerName}/{repoName}/commits/{commitHash}
https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/{prId}
https://bitbucket.org/{repoOwnerName}/{repoName}/issues/{issueId}
For example, to open all PRs with comments in firefox:
awk '/^[1-9]/{print "https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/"$2}' PRs.result | xargs firefox
Security considerations
Arguments passed on the command line are visible to all users of the system, via ps ax (or /proc/$PID/cmdline). Therefore the bitbucket password will be exposed, which could be a concern if the system is shared by multiple users.
There are three commands getting the password from the command line: xargs, the script, and curl.
It appears that curl tries to hide the password by overwriting its memory, but it is not guaranteed to work, and even if it does, it leaves it visible for a (very short) time after the process starts. On my system, the parameters to curl are not hidden.
A better option could be to pass the sensitive information through environment variables. They should be visible only to the current user and root via ps axe (or /proc/$PID/environ); although it seems that there are systems that let all users access this information (do a ls -l /proc/*/environ to check the environment files’ permissions).
In the script simply replace the lines pw=$1 id=$2 with id=$1, then pass pw="$passwd" before xargs in the command line invocation. It will make the environment variable pw visible to xargs and all of its descendant processes, that is the script and its children (curl, grep, cut, etc.), which may or may not read the variable. curl does not read the password from the environment, but if its password hiding trick mentioned above works then it might be good enough.
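As a sketch, the step B invocation then becomes something like:
pw="$passwd" xargs -P 15 -L 1 ./getNComments.sh < remaining.id > commits.temp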
There are ways to avoid passing the password to curl via the command line, notably via standard input using the option -K -. In the script, replace curl -s -u username:"$pw" with printf -- '-s\n-u "%s"\n' "$authinfo" | curl -K - and define the variable authinfo to contain the data in the format username:password. Note that this method needs printf to be a shell built-in to be safe (check with type printf), otherwise the password will show up in its process arguments. If it is not a built-in, try with print or echo instead.
A simple alternative to an environment variable that will not appear in ps output in any case is via a file. Create a file with read/write permissions restricted to the current user (chmod 600), and edit it so that it contains username:password as its first line. In the script, replace pw=$1 with IFS= read -r authinfo < "$1", and edit it to use curl’s -K option as in the paragraph above. In the command line invocation replace $passwd with the filename.
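A sketch of that variant (the file name bb-auth is arbitrary):
umask 077
touch bb-auth && chmod 600 bb-auth
"${EDITOR:-vi}" bb-auth      # put username:password on the first line; using an editor keeps the password out of process arguments
< remaining.id xargs -P 15 -L 1 ./getNComments.sh bb-auth > commits.temp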
The file approach has the drawback that the password will be written to disk (note that files in /proc are not on the disk). If this too is undesirable, it is possible to pass a named pipe instead of a regular file:
mkfifo pipe
chmod 600 pipe
# make sure printf is a builtin, or use an equivalent instead
(while :; do printf -- '%s\n' "username:$passwd"; done) > pipe&
pid=$!
exec 3<pipe
Then invoke the script passing pipe instead of the file. Finally, to clean up do:
kill $pid
exec 3<&-
This will ensure the authentication info is passed directly from the shell to the script (through the kernel), is not written to disk and is not exposed to other users via ps.
You can go to Commits and see the top line for each commit; you will need to click on each one to see further information.
If I find a way to see all without drilling into each commit, I will update this answer.

AccuRev : How to get all files changed?

I am looking to get the list of files changed between a timestamp and now,
for example 2013/11/11 11:10:00-now.
The accurev hist command gives the files changed on that particular stream, but it does not include the changes that came from the parent stream.
Is there a way to get the list of changes that flowed in from parent streams?
Change the basis time of your child stream to the date of 2013/11/11 11:10:00. Then perform a diff by files across the child and parent stream.
Accurev 6 has added some new arguments for the diff command so the following should do the trick:
accurev diff -a -i -v MyStream -V MyStream -t "2013/11/11 11:10:00-now"
Alternatively you could try the accurev.py script, from the ac2git repo, which will return to you all the transactions that could have affected your stream. Run it like this:
python accurev.py deep-hist -p MyDepot -s MyStream -t "2013/11/11 11:10:00-now"