How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file without an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
Pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but when clicking the second link it wants to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, Header, Link
import re
import sys
"""
Pandoc filter to convert internal links for multifile documents
"""
headerL1 = []
def fix_links(key, value, format, meta):
    global headerL1
    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)
        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)
    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            # Escape the dot so that only a literal ".md" is removed
            newlinkref = re.sub(r'\.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')
if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash
# Match on the file extension rather than on any path that contains "md"
MD_INPUT=$(find . -type f -name '*.md' | sort)
# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
    echo "Filtering $file"
    pandoc \
        --filter fix-links.py \
        "$file" \
        -t json \
        -o "${file%.md}.json"
done
JSON_INPUT=$(find . -type f -name '*.json' | sort)
echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex
echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code the problem can be solved by replacing the generated \href directive with \hyperlink.
Once this is done the linking works as expected.
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.
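One possible direction, offered as a sketch rather than a confirmed fix: as far as I know, pandoc treats a link whose target starts with # as a reference to an identifier inside the same document, so the filter could rewrite links to other .md files into that form instead of only stripping the extension. The helper below is hypothetical (the name to_internal_target is mine) and assumes the naming scheme used by the filter above, where the top-level header label matches the file name without .md and sub-header labels are prefixed with it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def to_internal_target(linkref):
    """Return '#anchor' for a link to a local .md file, or None for any other link.

    Assumption: header identifiers follow the '<file stem>' and
    '<file stem>-<section>' scheme produced by the filter in the question.
    """
    parts = urlsplit(linkref)
    # Anything with a scheme or host (https://..., mailto:...) is external
    if parts.scheme or parts.netloc:
        return None
    path = PurePosixPath(parts.path)
    if path.suffix != ".md":
        return None
    anchor = path.stem
    # A link such as file.md#section points at a sub-header of that file
    if parts.fragment:
        anchor = f"{anchor}-{parts.fragment}"
    return f"#{anchor}"


if __name__ == "__main__":
    for ref in ["0001-file-a.md", "0001-file-a.md#details", "www.stackoverflow.com"]:
        print(ref, "->", to_internal_target(ref))
```

Inside fix_links, the Link branch could then return Link(t1, linktext, [target, t4]) whenever the helper returns a target, and leave the link untouched otherwise. This also separates internal from external links in one place.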

Related

How to diff PDF files?

Sometimes, when I download a PDF file, such as one of my statements from my bank's website, and then, at a later time, download the same file again, both files differ.
How can I see how they differ?
I've tried:
diff file-1.pdf file-2.pdf
But that just prints:
Binary files file-1.pdf and file-2.pdf differ
Try:
diff -a file-1.pdf file-2.pdf | less
Or:
git diff -a file-1.pdf file-2.pdf
Example of diff's output:
1869,1870c1869,1870
< /CreationDate (D:20220504152530-00'00')
< /ModDate (D:20220504152530-00'00')
---
> /CreationDate (D:20220509154833-00'00')
> /ModDate (D:20220509154833-00'00')
Notes:
For either diff or git-diff, the -a, --text option "treat[s] all files as text". (See man diff or man git-diff.)
I use less in case diff -a outputs any binary data. (See this question and this comment.)
You must add the --no-index option after git diff -a when you run the command in a working tree controlled by Git and both files are inside that working tree. (See man git-diff.)
To view a PDF file's data as text, do less file.pdf.
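If the only differences are timestamps like the ones in the example output, it can help to filter those lines out before comparing. The following is a minimal sketch, assuming the dates appear on their own lines as /CreationDate and /ModDate entries; the function name meaningful_diff is made up for illustration, and PDFs with compressed streams or differing /ID entries will still show other differences.

```python
import difflib
import re

# Lines that typically change on every download (assumption based on the diff output above)
VOLATILE = re.compile(rb"^\s*/(CreationDate|ModDate)\b")


def meaningful_diff(a, b, name_a="file-1.pdf", name_b="file-2.pdf"):
    """Return unified-diff lines for two PDFs given as bytes, ignoring date entries."""
    def keep(data):
        # latin-1 maps every byte to a character, so binary data never fails to decode
        return [line.decode("latin-1") for line in data.splitlines()
                if not VOLATILE.match(line)]
    return list(difflib.unified_diff(keep(a), keep(b), name_a, name_b, lineterm=""))


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
        print("\n".join(meaningful_diff(f1.read(), f2.read())))
```

An empty result then means the two files differ only in their creation and modification dates.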

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and come across an issue with adding output to a channel as I need the processes to run in an order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
    output:
    path "*.bam" into bams_ch
    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
        guppy_basecaller --config $params.P1_gupConf \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref $params.refGen \
        -i $params.P1_inDir \
        -s $params.P1_outDir \
        --gpu_runners_per_device $params.P1_GPU_runners \
        --num_callers $params.P1_callers
    """
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the bam files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
The output syntax was chosen since once this process is done, using the following channel works:
// Channel
// .fromPath("${params.P1_outDir}/pass/*.bam")
// .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
// .set { bams_ch }
But the problem is if I don't pass the files into the output channel of the first process, they run in parallel. I could simply be missing something in the extensive documentation in how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'
input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )
process GuppyBasecaller {
    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )
    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'
    input:
    path input_dir
    path guppy_config
    path ref_genome
    output:
    path "outdir/pass/*.bam" into bams_ch
    """
    mkdir outdir
    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}

Problems getting two output files in Nextflow

Hello all!
I'm trying to write a small Nextflow pipeline that runs vcftools commands on 300 VCFs. The pipeline takes four inputs: vcf, pop1, pop2 and a .txt file, and should generate two outputs: a .log.weir.fst file and a .log.log file. When I run the pipeline, it only produces the .log.weir.fst files but not the .log files.
Here's my process definition:
process fst_calculation {
    publishDir "${results_dir}/fst_results_pop1_pop2/", mode:"copy"
    input:
    file vcf
    file pop_1
    file pop_2
    file mart
    output:
    path "*.log.*"
    """
    while read linea
    do
        echo "[DEBUG] working in line: \$linea"
        inicio=\$(echo "\$linea" | cut -f3)
        final=\$(echo "\$linea" | cut -f4)
        cromosoma=\$(echo "\$linea" | cut -f1)
        segmento=\$(echo "\$linea" | cut -f5)
        vcftools --vcf ${vcf} \
            --weir-fst-pop ${pop_1} \
            --weir-fst-pop ${pop_2} \
            --out \$inicio.log --chr \$cromosoma \
            --from-bp \$inicio --to-bp \$final
    done < ${mart}
    """
}
And here's the workflow that calls the process:
/* Load files into channel */
pop_1 = Channel.fromPath("${params.fst_path}/pop_1")
pop_2 = Channel.fromPath("${params.fst_path}/pop_2")
vcf = Channel.fromPath("${params.fst_path}/*.vcf")
mart = Channel.fromPath("${params.fst_path}/*.txt")
/* Import modules */
include { fst_calculation } from './nf_modules/modules.nf'
/*
 * main pipeline logic
 */
workflow {
    p1 = fst_calculation(vcf, pop_1, pop_2, mart)
    p1.view()
}
When I check the work directory of the pipeline, I can see that it only generates the .log.weir.fst files. To verify whether my code was wrong, I ran "bash .command.sh" in the working directory, and this actually generates the two output files. So, is there a reason for not getting the two output files when I run the pipeline?
I appreciate any help.
Note that bash .command.sh and bash .command.run do different things. The latter is basically a wrapper around the former that sets up the environment and stages the declared input files, among other things. If running the latter produces the unusual behavior, you'll need to dig deeper.
It's not completely clear to me what the problem is here. My guess is that vcftools might behave differently when run non-interactively, such that it sends its logging to STDERR. If that's the case, the logging will be captured in a file called .command.err. To instead send that to a file, you can just redirect STDERR in the usual way (untested):
while IFS=\$'\\t' read -r cromosoma null inicio final segmento ; do
    >&2 echo "[DEBUG] Working with: \${cromosoma}, \${inicio}, \${final}, \${segmento}"
    vcftools \\
        --vcf "${vcf}" \\
        --weir-fst-pop "${pop_1}" \\
        --weir-fst-pop "${pop_2}" \\
        --out "\${inicio}.log" \\
        --chr "\${cromosoma}" \\
        --from-bp "\${inicio}" \\
        --to-bp "\${final}" \\
        2> "\${cromosoma}.\${inicio}.\${final}.log.log"
done < "${mart}"

liquibase : generate changelogs from existing database

Is it possible with liquibase to generate changelogs from an existing database?
I would like to generate one xml changelog per table (not every create table statements in one single changelog).
If you look into documentation it looks like it generates only one changelog with many changesets (one for each table). So by default there is no option to generate changelogs per table.
While liquibase generate-changelog still doesn't support splitting up the generated changelog, you can split it yourself.
If you're using JSON changelogs, you can do this with jq.
I created a jq filter to group the related changesets, and combined it with a Bash script to split out the contents. See this blog post
jq filter, split_liquibase_changelog.jq:
# Define a function for mapping a change onto its destination file name
# createTable and createIndex use the tableName field
# addForeignKeyConstraint uses baseTableName
# Default to using the name of the change, e.g. createSequence
def get_change_group: map(.tableName // .baseTableName)[0] // keys[0];
# Select the main changelog object
.databaseChangeLog
# Collect the changes from each changeSet into an array
| map(.changeSet.changes | .[])
# Group changes according to the grouping function
| group_by(get_change_group)
# Select the grouped objects from the array
| .[]
# Get the group name from each group
| (.[0] | get_change_group) as $group_name
# Select both the group name...
| $group_name,
# and the group, wrapped in a changeSet that uses the group name in the ID and
# the current user as the author (databaseChangeLog is an array of changeSets,
# matching the shape used for the includeAll changelog in the Bash script)
{ databaseChangeLog: [
    { changeSet: {
        id: ("table_" + $group_name),
        author: env.USER,
        changes: . } } ] }
Bash:
#!/usr/bin/env bash
# Example: ./split_liquibase_changelog.sh schema < changelog.json
set -e -o noclobber
OUTPUT_DIRECTORY="${1:-schema}"
OUTPUT_FILE="${2:-schema.json}"
# Create the output directory
mkdir --parents "$OUTPUT_DIRECTORY"
# --raw-output: don't quote the strings for the group names
# --compact-output: output one JSON object per line
jq \
    --raw-output \
    --compact-output \
    --from-file split_liquibase_changelog.jq \
    | while read -r group; do # Read the group name line
        # Read the JSON object line
        read -r json
        # Process with jq again to pretty-print the object, then redirect it to the
        # new file
        (jq '.' <<< "$json") \
            > "$OUTPUT_DIRECTORY"/"$group".json
    done
# List all the files in the input directory
# Run jq with --raw-input, so input is parsed as strings
# Create a changelog that includes everything in the input path
# Save the output to the desired output file
(jq \
    --raw-input \
    '{ databaseChangeLog: [
        { includeAll:
            { path: . }
        }
    ] }' \
    <<< "$OUTPUT_DIRECTORY"/) \
    > "$OUTPUT_FILE"
If you need to use XML changesets, you can try adapting this solution using an XML tool like XQuery instead.
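For anyone adapting this to another tool, the grouping logic of the jq filter can be sketched in a few lines of Python. This is only an illustration of the same idea, not part of the original solution: the function names change_group and split_changelog are made up, and it assumes a JSON changelog already loaded into a dict where each change is a single-key object such as {"createTable": {...}}.

```python
from collections import defaultdict


def change_group(change):
    """Map one change onto its group: the table name if present, else the change type."""
    change_type, body = next(iter(change.items()))
    return body.get("tableName") or body.get("baseTableName") or change_type


def split_changelog(changelog, author="unknown"):
    """Return {group name: changelog dict} with one changeSet per group."""
    groups = defaultdict(list)
    for entry in changelog["databaseChangeLog"]:
        change_set = entry.get("changeSet")
        if not change_set:
            continue  # skip entries such as include/includeAll
        for change in change_set.get("changes", []):
            groups[change_group(change)].append(change)
    return {
        name: {"databaseChangeLog": [
            {"changeSet": {"id": f"table_{name}", "author": author, "changes": changes}}
        ]}
        for name, changes in groups.items()
    }
```

Each value in the returned dict can then be written to its own file with json.dump, using the key as the file name.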

PhantomJS: exported PDF to stdout

Is there a way to trigger the PDF export feature in PhantomJS without specifying an output file with the .pdf extension? We'd like to use stdout to output the PDF.
You can output directly to stdout without a need for a temporary file.
page.render('/dev/stdout', { format: 'pdf' });
See here for history on when this was added.
If you want to get HTML from stdin and output the PDF to stdout, see here
Sorry for the extremely long answer; I have a feeling that I'll need to refer to this method several dozen times in my life, so I'll write "one answer to rule them all". I'll first babble a little about files, file descriptors, (named) pipes, and output redirection, and then answer your question.
Consider this simple C99 program:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
    if (argc < 2) {
        printf("Usage: %s file_name\n", argv[0]);
        return 1;
    }
    FILE* file = fopen(argv[1], "w");
    if (!file) {
        printf("No such file: %s\n", argv[1]);
        return 2;
    }
    fprintf(file, "some text...");
    fclose(file);
    return 0;
}
Very straightforward. It takes an argument (a file name) and prints some text into it. Couldn't be any simpler.
Compile it with clang write_to_file.c -o write_to_file.o or gcc write_to_file.c -o write_to_file.o.
Now, run ./write_to_file.o some_file (which prints into some_file). Then run cat some_file. The result, as expected, is some text...
Now let's get more fancy. Type (./write_to_file.o /dev/stdout) > some_file in the terminal. We're asking the program to write to its standard output (instead of a regular file), and then we're redirecting that stdout to some_file (using > some_file). We could've used any of the following to achieve this:
(./write_to_file.o /dev/stdout) > some_file, which means "use stdout"
(./write_to_file.o /dev/stderr) 2> some_file, which means "use stderr, and redirect it using 2>"
(./write_to_file.o /dev/fd/2) 2> some_file, which is the same as above; stderr is the third file descriptor assigned to Unix processes by default (after stdin and stdout)
(./write_to_file.o /dev/fd/5) 5> some_file, which means "use your sixth file descriptor, and redirect it to some_file"
In case it's not clear, we're using a Unix pipe instead of an actual file (everything is a file in Unix after all). We can do all sort of fancy things with this pipe: write it to a file, or write it to a named pipe and share it between different processes.
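The same experiment can be reproduced from Python, which makes it easy to check where the text actually ends up. This is a small sketch assuming a Unix-like system where /dev/stdout and /dev/stderr exist; the child program is a one-line stand-in for write_to_file.o, and run_writer is a made-up helper name.

```python
import subprocess
import sys

# Stand-in for the C program: open the path given as argument and write to it
CHILD = "import sys; open(sys.argv[1], 'w').write('some text...')"


def run_writer(target):
    """Run the child with `target` as its output path and capture both streams."""
    return subprocess.run(
        [sys.executable, "-c", CHILD, target],
        capture_output=True,
        check=True,
    )


if __name__ == "__main__":
    print(run_writer("/dev/stdout").stdout)  # text arrives on the child's stdout
    print(run_writer("/dev/stderr").stderr)  # text arrives on the child's stderr
```

The child only ever sees a file name; it is the operating system that maps that name back onto the process's own descriptors.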
Now, let's create a named pipe:
mkfifo my_pipe
If you type ls -l now, you'll see:
total 32
prw-r--r-- 1 pooriaazimi staff 0 Jul 15 09:12 my_pipe
-rw-r--r-- 1 pooriaazimi staff 336 Jul 15 08:29 write_to_file.c
-rwxr-xr-x 1 pooriaazimi staff 8832 Jul 15 08:34 write_to_file.o
Note the p at the beginning of second line. It means that my_pipe is a (named) pipe.
Now, let's specify what we want to do with our pipe:
gzip -c < my_pipe > out.gz &
It means: gzip what I put inside my_pipe and write the results in out.gz. The & at the end asks the shell to run this command in the background. You'll get something like [1] 10449 and the control gets back to the terminal.
Then, simply redirect the output of our C program to this pipe:
(./write_to_file.o /dev/fd/5) 5> my_pipe
Or
./write_to_file.o my_pipe
You'll get
[1]+ Done gzip -c < my_pipe > out.gz
which means the gzip command has finished.
Now, do another ls -l:
total 40
prw-r--r-- 1 pooriaazimi staff 0 Jul 15 09:14 my_pipe
-rw-r--r-- 1 pooriaazimi staff 32 Jul 15 09:14 out.gz
-rw-r--r-- 1 pooriaazimi staff 336 Jul 15 08:29 write_to_file.c
-rwxr-xr-x 1 pooriaazimi staff 8832 Jul 15 08:34 write_to_file.o
We've successfully gzipped our text!
Execute gzip -d out.gz to decompress this gzipped file. It will be deleted and a new file (out) will be created. cat out gets us:
some text...
which is what we expected.
Don't forget to remove the pipe with rm my_pipe!
Now back to PhantomJS.
This is a simple PhantomJS script (render.coffee, written in CoffeeScript) that takes two arguments: a URL and a file name. It loads the URL, renders it and writes it to the given file name:
system = require 'system'
renderUrlToFile = (url, file, callback) ->
  page = require('webpage').create()
  page.viewportSize = { width: 1024, height : 800 }
  page.settings.userAgent = 'Phantom.js bot'
  page.open url, (status) ->
    if status isnt 'success'
      console.log "Unable to render '#{url}'"
    else
      page.render file
    delete page
    callback url, file
url = system.args[1]
file_name = system.args[2]
console.log "Will render to #{file_name}"
renderUrlToFile "http://#{url}", file_name, (url, file) ->
  console.log "Rendered '#{url}' to '#{file}'"
  phantom.exit()
Now type phantomjs render.coffee news.ycombinator.com hn.png in the terminal to render Hacker News front page into file hn.png. It works as expected. So does phantomjs render.coffee news.ycombinator.com hn.pdf.
Let's repeat what we did earlier with our C program:
(phantomjs render.coffee news.ycombinator.com /dev/fd/5) 5> hn.pdf
It doesn't work... :( Why? Because, as stated on PhantomJS's manual:
render(fileName)
Renders the web page to an image buffer and save it
as the specified file.
Currently the output format is automatically set based on the file
extension. Supported formats are PNG, JPEG, and PDF.
It fails simply because neither /dev/fd/5 nor /dev/stdout ends in .png, .pdf, etc.
But no fear, named pipes can help you!
Create another named pipe, but this time use the extension .pdf:
mkfifo my_pipe.pdf
Now, tell it to simply cat its input to hn.pdf:
cat < my_pipe.pdf > hn.pdf &
Then run:
phantomjs render.coffee news.ycombinator.com my_pipe.pdf
And behold the beautiful hn.pdf!
Obviously you want to do something more sophisticated than just cating the output, but I'm sure it's clear now what you should do :)
TL;DR:
Create a named pipe, using ".pdf" file extension (so it fools PhantomJS to think it's a PDF file):
mkfifo my_pipe.pdf
Do whatever you want to do with the contents of the file, like:
cat < my_pipe.pdf > hn.pdf
which simply cats it to hn.pdf
In PhantomJS, render to this file/pipe.
Later on, you should remove the pipe:
rm my_pipe.pdf
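If the consumer of the PDF is itself a program, the steps above can be wrapped in a few lines. Here is a rough Python sketch of the same named-pipe trick, assuming a Unix-like system (os.mkfifo is not available on Windows). The names read_via_named_pipe and fake_renderer are hypothetical; fake_renderer merely stands in for invoking phantomjs render.coffee with the pipe's path.

```python
import os
import tempfile
import threading


def read_via_named_pipe(produce, suffix=".pdf"):
    """Create a FIFO whose name ends in `suffix`, let produce(path) write to it,
    and return everything that was written as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "my_pipe" + suffix)
        os.mkfifo(fifo_path)
        chunks = []

        def drain():
            # Opening a FIFO for reading blocks until a writer opens it
            with open(fifo_path, "rb") as fifo:
                chunks.append(fifo.read())

        reader = threading.Thread(target=drain, daemon=True)
        reader.start()
        produce(fifo_path)  # the writer sees an ordinary path ending in .pdf
        reader.join()
        return b"".join(chunks)


def fake_renderer(path):
    # Placeholder for: phantomjs render.coffee news.ycombinator.com <path>
    with open(path, "wb") as out:
        out.write(b"%PDF-1.4 fake")


if __name__ == "__main__":
    data = read_via_named_pipe(fake_renderer)
    print(len(data), "bytes received")
```

The temporary directory takes care of removing the pipe afterwards, which replaces the manual rm step.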
As pointed out by Niko, you can use renderBase64() to render the web page to an image buffer and return the result as a base64-encoded string. But for now this will only work for PNG, JPEG and GIF.
To write something from a phantomjs script to stdout just use the filesystem API.
I use something like this for images :
var base64image = page.renderBase64('PNG');
var fs = require("fs");
fs.write("/dev/stdout", base64image, "w");
I don't know if the PDF format for renderBase64() will be in a future version of PhantomJS, but as a workaround something along these lines may work for you:
page.render(output);
var fs = require("fs");
var pdf = fs.read(output);
fs.write("/dev/stdout", pdf, "w");
fs.remove(output);
Where output is the path to the pdf file.
I don't know if it would address your problem, but you may also check the new renderBase64() method added to PhantomJS 1.6: https://github.com/ariya/phantomjs/blob/master/src/webpage.cpp#L623
Unfortunately, the feature is not documented on the wiki yet :/