List in YAML metadata block not aligned after reformatting with pandoc - formatting

I take my notes in Markdown and put a YAML metadata block at the top of each file. I frequently use pandoc to reformat my note files. Unfortunately, it seems that lists are not aligned correctly in the process, at least according to my understanding of YAML. Example:
Before:
---
tags:
  - capitalism
  - democracy
  - 'post-2008'
---
# Nach der 2008/2009 Wirtschaftskrise werden westliche Demokratien zusehends autoritär
Generell hegt die Linke die These das Kapitalismus und Demokratie nicht
zwingend zusammen gehören (siehe [Demokratie Ergebnis von
Arbeiterkämpfen](Demokratisierung_ist_Ergebnis_Proteste_mit_Arbeiterklasse.md)).
After:
---
tags:
- capitalism
- democracy
- 'post-2008'
---
# Nach der 2008/2009 Wirtschaftskrise werden westliche Demokratien zusehends autoritär
Generell hegt die Linke die These das Kapitalismus und Demokratie nicht
zwingend zusammen gehören (siehe [Demokratie Ergebnis von
Arbeiterkämpfen](Demokratisierung_ist_Ergebnis_Proteste_mit_Arbeiterklasse.md)).
The indentation of the list entries in the YAML metadata block disappears completely.
The pandoc command I use is:
pandoc --standalone \
--atx-headers \
-f markdown-auto_identifiers+yaml_metadata_block \
-t markdown-simple_tables-multiline_tables-grid_tables-auto_identifiers-fenced_code_attributes+yaml_metadata_block

The output YAML is still valid, since the YAML spec says:
The “-”, “?” and “:” characters used to denote block collection entries are perceived by people to be part of the indentation. This is handled on a case-by-case basis by the relevant productions.
Since - is parsed as part of the indentation, the items in your list are more indented than the parent key tags:, which makes this valid YAML and equivalent to your input.
Your YAML looks different because pandoc parses it and then emits it again. YAML processing does not round-trip, so the original formatting cannot be preserved perfectly. For details, see this question.
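To make the equivalence concrete, here is a minimal Python sketch. It is a toy reader written only for this illustration (the name block_sequence is made up, and it is not a YAML parser): it collects the entries of one block sequence and treats leading spaces before the - marker as optional, as the quoted spec passage describes. A real YAML library should load both layouts to the same data.

```python
def block_sequence(text, key):
    """Collect the entries of the block sequence under `key` (toy reader, not a YAML parser)."""
    items = []
    inside = False
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if stripped == f"{key}:":
            inside = True
        elif inside and stripped.startswith("- "):
            # The "- " marker counts as indentation, so leading spaces are optional.
            items.append(stripped[2:].strip().strip("'\""))
        elif inside:
            break
    return items


indented = "tags:\n  - capitalism\n  - democracy\n  - 'post-2008'\n"
flush = "tags:\n- capitalism\n- democracy\n- 'post-2008'\n"

# Both layouts yield the same three entries.
print(block_sequence(indented, "tags") == block_sequence(flush, "tags"))
```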

How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file without an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
Pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but when clicking the second link it wants to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, Header, Link
import re
import sys
"""
Pandoc filter to convert internal links for multifile documents
"""
headerL1 = []


def fix_links(key, value, format, meta):
    global headerL1
    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)
        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)
    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            # Escape the dot so that only a literal ".md" is removed
            newlinkref = re.sub(r'\.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')


if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash
# Collect the Markdown sources (file names must not contain whitespace)
MD_INPUT=$(find . -type f -name '*.md' | sort)
# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
    echo "Filtering $file"
    pandoc \
        --filter fix-links.py \
        "$file" \
        -t json \
        -o "${file%.md}.json"
done
JSON_INPUT=$(find . -type f -name '*.json' | sort)
echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex
echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code the problem can be solved by replacing the generated \href directive with \hyperlink.
Once this is done the linking works as expected.
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.
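One thing that may be worth trying, offered as a sketch rather than a confirmed fix: pandoc treats a link as internal when its target starts with #, and the LaTeX writer then emits an in-document reference instead of \href. The filter above strips .md but never adds the #. The hypothetical helper internal_target below does both. It assumes that the file stem (0001-file-a) equals the identifier the filter assigns to that file's top-level header, which holds for the example files but not in general, and the handling of file.md#section targets is likewise an invented convention.

```python
import posixpath
from urllib.parse import urlsplit


def internal_target(linkref):
    """Map 'dir/0001-file-a.md' to '#0001-file-a'; leave every other target unchanged."""
    parts = urlsplit(linkref)
    if parts.scheme or parts.netloc or not parts.path.endswith(".md"):
        return linkref  # external link or not a Markdown file
    stem = posixpath.basename(parts.path)[: -len(".md")]
    # Assumed convention: "file.md#section" points at the "<file>-<section>" identifier.
    if parts.fragment:
        return f"#{stem}-{parts.fragment}"
    return f"#{stem}"


print(internal_target("0001-file-a.md"))
print(internal_target("www.stackoverflow.com"))
```

Inside the filter, the Link branch would then return Link(t1, linktext, [internal_target(linkref), t4]). Because internal targets now start with #, they can also be told apart from external ones after the filter has run.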

Package pdftex.def Error: File; What can I do so that LaTeX can display the image?

LaTeX reports: Error: File `Schreibtisch/BLOCKPRAKTIKUM
MESSTECHNIK/EE/3c in line 242. What do I need to change so that LaTeX
displays the picture? Unfortunately I am a total beginner with LaTeX. Any
help would be highly appreciated! I really don't know what to do. I tried putting the image into another folder, but it still isn't displayed.
\documentclass[a4paper,
pointlessnumbers,
%draft,
parskip=half,
automark
]{scrartcl}
\setlength{\parindent}{0pt}
\usepackage[a4paper, left=2.2cm, right=2.2cm, top=2.5cm, bottom=2.5cm,]{geometry}%müsste das Design sein
\usepackage{scrpage2}
\clearscrheadfoot
\pagestyle{scrheadings}
\usepackage[ngerman]{babel}
\usepackage[pdftex]{graphicx,color}
\usepackage[utf8]{inputenc}
\usepackage{amssymb,amsmath,amsthm, amsfonts}
\usepackage{latexsym}
\usepackage[decimalsymbol=comma]{siunitx}
\usepackage{booktabs}
\usepackage{tabulary}
\usepackage[dvipsnames]{xcolor}
\usepackage[centerlast,small,sc]{caption}
\usepackage{here}
\usepackage{siunitx}
\sisetup{per-mode = fraction, locale = DE}
\usepackage{titling}
\usepackage{subfigure}
\usepackage{float}
\usepackage{hyperref}
\usepackage{esvect}
%Mathe- Makros
\renewcommand{\i}{\mathrm{i}}
\newcommand{\e}{\mathrm{e}}
\newcommand{\diff}{\mathrm{d}}
\newcommand{\figref}[1]{Abb. \ref{#1}}
\newcommand{\ImNew}{\operatorname{Im}}
\newcommand{\ReNew}{\operatorname{Re}}
\newcommand{\xdot}{\! \, \cdot \! \,}
\newcommand{\funof}[1]{{\color{gray}(#1)}}
\section{Versuchsaufbau und Durchführung}
\subsection{Fadenstrahlrohr}
Eine gasgefüllte Glaskugel befindet sich in einem Helmholtz-Spulenpaar. Ein Elektronenstrahl wird durch die Lorentzkraft auf eine Kreisbahn gebracht, die durch Anpassung der Spannung an der Spule reguliert werden kann. Anschließend werden die jeweiligen Radien der Kreise gemessen.
\subsection{Milikan-Versuch}
Durch ein Mikroskop beobachtet man das Sinken oder das Steigen der Öltröpfchen in einem Plattenkondensator. Je nachdem wie der Kondensator gepolt ist, werden die Öltröpfchen entsprechend nach oben oder nach unten beschleunigt. Gemessen wird dann die Zeit, die die Öltröpfchen einmal zum Steigen und dann wieder zum Sinken benötigen.
\section{Auswertung}
\subsection{Fadenstrahlrohr}
Für 3 Kreisradien mit jeweils 5 Kombinationen aus Beschleunigungsspannnung und Spulenstrom kann man die spezifische Ladung des Elektrons bestimmen. Dazu wird die Spannung in Abhängigkeit des Stroms im Quadrat betrachtet.
\begin{figure}
\begin{align*}
\subfigure[\SI{3}{\centi\metre}]{\includegraphics[width=0.8\textwidth]{Desktop/BLOCKPRAKTIKUM MESSTECHNIK/EE/3cm.png}}
\end{align*}
\begin{align*}
\subfigure[\SI{4}{\centi\metre}]{\includegraphics[width=0.8\textwidth]{Desktop/BLOCKPRAKTIKUM MESSTECHNIK/EE/4cm.png}}
\end{align*}
\subfigure[\SI{5}{\centi\metre}]{\includegraphics[width=0.8\textwidth]{Desktop/BLOCKPRAKTIKUM MESSTECHNIK/EE/5cm.png}}
\caption{Diagramme für die Spannung in Abhängigkeit des Stroms im Quadrat für verschiedene Kreisradien}
\end{figure}
\end{document}
A couple of remarks about your code:
the documentclass option pointlessnumbers is outdated, use numbers=noenddot instead
the package scrpage2 is outdated and no longer even included in current TeX distributions. Use scrlayer-scrpage instead
\clearscrheadfoot is outdated, use \clearpairofpagestyles
the package option pdftex for graphicx is unnecessary and causes a lot of problems. Just remove it; graphicx detects by itself which driver it needs
if your TeX installation is not totally outdated, you don't need \usepackage[utf8]{inputenc}, that's now the default
the option decimalsymbol=comma for siunitx is obsolete, use output-decimal-marker={,} instead
you don't need the color package when you also load xcolor
don't load the same package multiple times (siunitx is loaded twice)
load hyperref last (with very few exceptions like cleveref)
instead of reinventing the wheel with \newcommand{\figref}[1]{Abb. \ref{#1}}, have a look at the cleveref package, which is much more flexible and powerful. For example, it won't produce an incorrectly large space like your macro does
\begin{document} is missing
unrelated to TeX, but style guides usually recommend spelling out numbers up to ten, so rather use drei Kreisradien instead of 3 Kreisradien
your figure is missing a float placement specifier, e.g. \begin{figure}[htbp]
remove the align* environments inside the figure, that's really the most bizarre code I've seen in a very long time
just use the file name without extension, e.g. \includegraphics[width=0.8\textwidth]{example-image-16x9}; LaTeX will automatically use the best available one if there are multiple versions
avoid special characters like spaces in the path to the image

pandoc to produce pdf with latex, "paper" document class: missing affiliation and date

I am new to pandoc and LaTeX. I cannot get the author affiliation to appear in the final PDF that I produce with the document class paper.
Let's assume I have the following source.md file:
---
title: my super document
subtitle: blablabla
author:
- me
- my friend
- my other friend
institute: alien space agency
date: <#today>
header-includes:
- \\usepackage{endnotes}
abstract: |
This document describes a super research project.
---
start of the writing blablabla...
<#today> is a gpp macro that translates to \today in the tex source.
I use pandoc like this:
cat source.md | gpp -H --include macros.gpp | pandoc -f markdown --variable documentclass=paper --standalone --smart --atx-headers --from=markdown+yaml_metadata_block -o document.pdf
In the produced document.pdf, there is no date and no affiliation for authors.
However, if I use document class article I have the correct date. But still
no authors institute.
How can I have both date and authors institute with paper document class ?
edit: more info...
If I produce a tex document here is what I get:
\title{my super document}
\providecommand{\subtitle}[1]{}
\subtitle{blablabla}
\author{me \and my friend \and my other friend}
\providecommand{\institute}[1]{}
\institute{alien space agency}
\date{\today}
Which looks ok to me. But there are only authors (without institute) and no date in the final document. I assume it is because of paper document class, changing to article shows the date but no institute...
My pandoc version is 1.19.2.1

Proxy file on snakemake code

I want to run an alignment with STAR, and I use a proxy file for the alignment step.
Without a proxy file, the STAR alignment also runs when the reference index does not exist yet. So if I make the presence of database.done an input constraint of the alignment rule, the alignment can only start once the index has been built.
How can I manage this situation?
rule star_index:
    input:
        config['references']['transcriptome_fasta']
    output:
        genome=config['references']['starindex_dir'],
        tp=touch("database.done")
    shell:
        'STAR --limitGenomeGenerateRAM 54760833024 --runMode genomeGenerate --genomeDir {output.genome} --genomeFastaFiles {input}'
rule star_map:
    input:
        dt="trim/{sample}/",
        forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
        reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
        forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
        reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
        t1p="database.done",
    output:
        out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
        out2="ALIGN/{sample}/",
        # out2=touch("Star.align.done")
    params:
        genomedir = config['references']['basepath'],
        sample="mitico",
        platform_unit=config['platform'],
        cente=config['center']
    threads: 12
    log: "ALIGN/log/{params.sample}_star.log"
    shell:
        'mkdir -p ALIGN/;STAR --runMode alignReads --genomeDir {params.genomedir} '
        r' --outSAMattrRGline ID:{params.sample} SM:{params.sample} PL:{config[platform]} PU:{params.platform_unit} CN:{params.cente} '
        '--readFilesIn {input.forward_paired} {input.reverse_paired} '
        '--readFilesCommand zcat '
        '--outWigType wiggle '
        '--outWigStrand Stranded --runThreadN {threads} --outFileNamePrefix {output.out2} 2> {log} '
How can I start a step only after all the previous ones have finished?
I mean: here I create the index, then I trim all my data, and then I start the alignment. After all these steps have finished for all samples, I want to start a new step, such as running FastQC. How can I express this in Snakemake?
Thanks so much for your patience and help.
Without any mention of the genome as a required input for "star_map", I believe the rule is starting too early.
Try moving the genome reference from being a "param" to being an "input" requirement for star_map. Snakemake doesn't wait for params, only for inputs. All reference genomes should be listed as inputs. In fact, all required files should be listed as inputs. Params are mostly a convenience for ad-hoc strings and values computed on the fly.
I'm not entirely sure how your files are connected; some of these references point to a YAML file you have not provided, so I cannot guarantee the code will work.
rule star_map:
    input:
        dt="trim/{sample}/",
        forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
        reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
        forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
        reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
        # Including the genome as a required input, so Snakemake knows to wait for it too.
        genomedir = config['references']['basepath'],
    output:
        out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
        out2="ALIGN/{sample}/",
Snakemake doesn't check what files your shell commands are touching and modifying. Snakemake only knows to coordinate the files described in the "input" and "output" directives.
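As a rough illustration of that last point, here is a toy model in plain Python. It is not Snakemake's scheduler, and the function name and file names are made up for the example. Dependencies are derived only from declared inputs and outputs, so a path that appears only under params creates no ordering between rules.

```python
def dependencies(rules):
    """Map each rule to the rules that produce its declared inputs (toy model)."""
    producer = {
        path: name
        for name, rule in rules.items()
        for path in rule.get("output", [])
    }
    return {
        name: sorted({producer[p] for p in rule.get("input", []) if p in producer})
        for name, rule in rules.items()
    }


# Hypothetical file names, loosely modelled on the rules above.
rules = {
    "star_index": {"input": ["genome.fa"], "output": ["star_index/"]},
    "star_map": {
        "input": ["trim/s1_forward_paired.fq.gz"],
        "params": ["star_index/"],  # only mentioned in params: invisible to the scheduler
        "output": ["ALIGN/s1/Aligned.sortedByCoord.out.bam"],
    },
}
print(dependencies(rules)["star_map"])  # [] -> star_map may start before star_index

# Declaring the index as an input creates the dependency.
rules["star_map"]["input"].append("star_index/")
print(dependencies(rules)["star_map"])  # ['star_index']
```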

Flatten FDF / XFDF forms to PDF in PHP with utf-8 characters

My scenario:
A PDF template with formfields: template.pdf
An XFDF file that contains the data to be filled in: fieldData.xfdf
Now I need to have these two files combined and flattened.
pdftk does the job easily within php:
exec("pdftk template.pdf fill_form fieldData.xfdf output flatFile.pdf flatten");
Unfortunately this does not work with full utf-8 support.
For example, Cyrillic and Greek letters get scrambled. I used Arial for this, with a Unicode character set.
How can I flatten my Unicode files?
Is there any other PDF tool that offers Unicode support?
Does pdftk have a Unicode switch that I am missing?
EDIT 1: As this question has not been solved for more than 9 months, I decided to start a bounty for it. In case there are options to sponsor a feature or a bugfix in pdftk, I'd be glad to donate.
EDIT 2: I am not working on this project anymore, so I cannot verify new answers. If anyone has a similar problem, I would be glad if they could verify the answers in my place.
I found that when I used Jon's template but built the document with DOMDocument, the numeric encoding was handled for me and it worked well. My slight variation is below:
$xml = new DOMDocument( '1.0', 'UTF-8' );
$rootNode = $xml->createElement( 'xfdf' );
$rootNode->setAttribute( 'xmlns', 'http://ns.adobe.com/xfdf/' );
$rootNode->setAttribute( 'xml:space', 'preserve' );
$xml->appendChild( $rootNode );
$fieldsNode = $xml->createElement( 'fields' );
$rootNode->appendChild( $fieldsNode );
foreach ( $fields as $field => $value )
{
    $fieldNode = $xml->createElement( 'field' );
    $fieldNode->setAttribute( 'name', $field );
    $fieldsNode->appendChild( $fieldNode );
    $valueNode = $xml->createElement( 'value' );
    $valueNode->appendChild( $xml->createTextNode( $value ) );
    $fieldNode->appendChild( $valueNode );
}
$xml->save( $file );
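For comparison, a minimal sketch of the same document structure in Python, using the standard library's xml.etree.ElementTree. The function name build_xfdf is made up for this example. Serializing with ElementTree's default ASCII encoding writes every non-ASCII character as a numeric character reference, so no manual encoding step is needed. Whether a given pdftk version accepts such references is a separate question that this sketch does not settle.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(fields):
    """Return an XFDF document as ASCII bytes for a dict of field name -> value."""
    root = ET.Element("xfdf", {"xmlns": XFDF_NS, "xml:space": "preserve"})
    fields_node = ET.SubElement(root, "fields")
    for name, value in fields.items():
        field = ET.SubElement(fields_node, "field", {"name": name})
        ET.SubElement(field, "value").text = value
    # With no explicit encoding, tostring() emits ASCII bytes and writes
    # non-ASCII characters as numeric character references (&#NNN;).
    return ET.tostring(root)


print(build_xfdf({"Nr": "ęó"}).decode("ascii"))
```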
You could try the trial version of http://www.adobe.com/products/livecycle/designer/ and see what PDF files it generates.
Another commercial product you could try is http://www.appligent.com/fdfmerge. See page 16 in http://146.145.110.1/docs/userguide/FDFMergeUserGuide.pdf for how it handles XFDF with UTF-8.
I also had a look at the XFDF specification http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
On page 12 it states:
Although XFDF is encoded in UTF-8, double byte characters are encoded as character references when
exported from Acrobat.
For example, the Japanese double byte characters あ, い, and う are exported to XFDF using
three character references. Here is an example of double byte characters in a form field:
...
<fields>
<field name="Text1">
<value>Here are 3 UTF-8 double byte
characters: あいう
</value>
</field>
</fields> ...
I looked through pdftk-1.44-dist/java/com/lowagie/text/pdf/XfdfReader.java. It doesn't seem to do anything special with the input.
Maybe pdftk will do what you want, when you encode the weird characters as character references in your xFDF input.
Using pdftk 1.44 on a Win7 machine, I encounter the same problems with XFDF files, whereas FDF works fine. I made an XFDF file without any special characters (only ANSI), but pdftk crashed again. I mailed the developer. Unfortunately, there has been no answer so far.
Unfortunately, UTF-8 encoding does not work with either decimal or hexadecimal character references for non-ASCII characters in the source .xfdf file (pdftk v1.44).
I made some progress on this. Starting with code from http://koivi.com/fill-pdf-form-fields/, I modified the value encoding to output numeric codes for any characters outside the ascii range.
Now with pitulski's special strings:
Poznań Śródmieście Ćwiartka Ósma outputs Pozna ródmiecie wiartka Ósma with some box shapes superimposed
ęóąśłżźćńĘÓĄŚŁŻŹĆŃ outputs óÓ with more box shapes. I think it may be that the box shapes are characters my server doesn't recognize.
I tried it with some French characters: ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ and they all came out OK, but some of them were overlapping.
--edit-- I just tried entering these manually into the form and got the same result minus the box shapes (using Evince). I then tried with a different form (created by someone else) - after entering ęóąśłżźćńĘÓĄŚŁŻŹĆŃ, ółÓŁ was displayed. It looks like it depends which characters are included in the document's embedded fonts.
/*
KOIVI HTML Form to FDF Parser for PHP (C) 2004 Justin Koivisto
Version 1.2.?
Last Modified: 2013/01/17 - Jon Hulka(jon dot hulka at gmail dot com)
- changed character encoding, all non-ascii characters get encoded as numeric character references
This library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version.
This library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this library; if not, write to the Free Software Foundation,
Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Full license agreement notice can be found in the LICENSE file contained
within this distribution package.
Justin Koivisto
justin dot koivisto at gmail dot com
http://koivi.com
*/
/**
 * createXFDF
 *
 * Takes values passed via associative array and generates XFDF file format
 * with that data for the pdf address supplied.
 *
 * @param string $file The pdf file - url or file path accepted
 * @param array $info data to use in key/value pairs no more than 2 dimensions
 * @param string $enc default UTF-8, match server output: default_charset in php.ini
 * @return string The XFDF data for acrobat reader to use in the pdf form file
 */
function createXFDF($file,$info,$enc='UTF-8'){
$data=
'<?xml version="1.0" encoding="'.$enc.'"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>';
foreach($info as $field => $val){
$data.='
<field name="'.$field.'">';
if(is_array($val)){
foreach($val as $opt)
//2013.01.17 - Jon Hulka - all non-ascii characters get character references
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($opt),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($opt,ENT_COMPAT,$enc).'</value>'."\n";
}else{
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($val),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($val,ENT_COMPAT,$enc).'</value>'."\n";
}
$data.='
</field>';
}
$data.='
</fields>
<ids original="'.md5($file).'" modified="'.time().'" />
<f href="'.$file.'" />
</xfdf>';
return $data;
}
While pdftk doesn't appear to support UTF-8 in the FDF file, I found that with
iconv -f utf-8 -t ISO_8859-1
in the pipeline converting that FDF file to ISO-Latin-1, then at least those characters that are in the Latin-1 code page will still be represented properly.
Which version of pdftk are you using?
I tried the same thing with Polish characters (UTF-8).
It does not work for me.
pdftk.exe, libiconv2.dll from: http://www.pdflabs.com/docs/install-pdftk/
Windows 7, cmd, file.pdf + file.fdf -> new.pdf
pdftk file.pdf fill_form file.xfdf output new.pdf flatten
Unhandled Java Exception:
java.lang.NoClassDefFoundError: gnu.gcj.convert.Input_UTF8 not found in [file:.\, core:/]
at 0x005a3abe (Unknown Source)
at 0x005a3fb2 (Unknown Source)
at 0x006119f4 (Unknown Source)
at 0x00649ee4 (Unknown Source)
at 0x005b4c44 (Unknown Source)
at 0x005470a9 (Unknown Source)
at 0x00549c52 (Unknown Source)
at 0x0059d348 (Unknown Source)
at 0x007323c9 (Unknown Source)
at 0x0054715a (Unknown Source)
at 0x00562349 (Unknown Source)
With an FDF file with the same content, pdftk ran without errors.
However, the characters in new.pdf are wrong.
pdftk file.pdf fill_form file.fdf output new.pdf flatten
---FDF---
%FDF-1.2
%âãÏÓ
1 0 obj<</FDF<</F(file.pdf)
/Fields[
<</T(Miejsce)/V(666 Poznań Śródmieście Ćwiartka Ósma)>>
<</T(Nr)/V(ęóąśłżźćńĘÓĄŚŁŻŹĆŃ)>>
]>>>>
endobj
trailer
<</Root 1 0 R>>
%%EOF
---XFDF---
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="file.pdf"/>
<fields>
<field name="Miejsce">
<value>666 Poznań Śródmieście Ćwiartka Ósma</value>
</field>
<field name="Nr">
<value>ęóąśłżźćńĘÓĄŚŁŻŹĆŃ</value>
</field>
</fields>
</xfdf>
---PDF---
Miejsce: 666 PoznaÅ— ÅıródmieÅłcie ăwiartka Ãfisma
Nr: ÄŽÃ³Ä–ÅłÅ‡Å¼ÅºÄ⁄Å—ÄŸÃfiÄ—ÅıņŻŹăÅ
You can introduce utf-8 characters by giving their unicode code in octal with \ddd
To solve this, I wrote PdfFormFillerUTF-8: http://sourceforge.net/projects/pdfformfiller2/
There is a drop-in replacement for the pdftk tool,
Mcpdf: https://github.com/m-click/mcpdf
that solves Unicode issues when filling forms. It works for me with CP1250 characters (Central Europe).
From project page:
the following command fills in form data from DATA.xfdf into FORM.pdf
and writes the result to RESULT.pdf. It also flattens the document to
prevent further editing:
java -jar mcpdf.jar FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
This corresponds exactly to the usual PDFtk command:
pdftk FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
Note that you need to have JRE installed.
I have managed to make it work with pdftk by creating an XFDF file with UTF-8 encoding.
It took several tries, but what made it work as expected was adding 'need_appearances'.
Here is an example:
pdftk source.pdf fill_form data.xfdf output output.pdf need_appearances
I struggled with this issue for a long time and finally found a solution!
So, let's start.
Download and install the latest version of pdftk:
# PDFTK
RUN apk add openjdk8 \
&& cd /tmp \
&& wget https://gitlab.com/pdftk-java/pdftk/-/jobs/1507074845/artifacts/raw/build/libs/pdftk-all.jar \
&& mv pdftk-all.jar pdftk.jar \
&& echo '#!/usr/bin/env bash' > pdftk \
&& echo 'java -jar "$0.jar" "$@"' >> pdftk \
&& chmod 775 pdftk* \
&& mv pdftk* /usr/local/bin \
&& pdftk -version
Open your PDF form in Adobe Acrobat Reader and look at the field options. You need to identify the font, for example Helvetica, and download it.
Fill the form with the flatten option:
/usr/local/bin/pdftk A=form.pdf fill_form xfdf.xml output out.pdf drop_xfa need_appearances flatten replacement_font /path/to/font/HelveticaRegular.ttf
xfdf.xml example:
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="Check Box 136">
<value>Your value | Значение (Cyrillic)</value>
</field>
</fields>
</xfdf>
Enjoy :)
pdftk supports encoding in UTF-16BE. It's not that difficult to convert from UTF-8 to UTF-16BE.
See: Weird characters when filling PDF with PDFTk
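A minimal sketch of that conversion in Python. PDF text strings may be written as UTF-16BE preceded by the byte order mark FE FF. Writing the value as a hex string in the FDF file (for example after /V) is my assumption for this illustration, and the function name pdf_text_string is made up.

```python
def pdf_text_string(value):
    """Encode text as a PDF hex string: UTF-16BE bytes preceded by the FE FF byte order mark."""
    raw = b"\xfe\xff" + value.encode("utf-16-be")
    return "<" + raw.hex().upper() + ">"


# A field entry could then look like: <</T(Miejsce)/V <FEFF...>>>
print(pdf_text_string("Ósma"))  # <FEFF00D30073006D0061>
```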