Flatten FDF / XFDF forms to PDF in PHP with UTF-8 characters

My scenario:
A PDF template with formfields: template.pdf
An XFDF file that contains the data to be filled in: fieldData.xfdf
Now I need to have these two files combined & flattened.
pdftk does the job easily within php:
exec("pdftk template.pdf fill_form fieldData.xfdf output flatFile.pdf flatten");
Unfortunately this does not work with full UTF-8 support.
For example, Cyrillic and Greek letters get scrambled. I used Arial for this, with a Unicode character set.
How can I flatten my Unicode files?
Is there another PDF tool that offers Unicode support?
Does pdftk have a Unicode switch that I am missing?
EDIT 1: As this question has not been solved for more than 9 months, I decided to start a bounty for it. In case there are options to sponsor a feature or a bugfix in pdftk, I'd be glad to donate.
EDIT 2: I am not working on this project anymore, so I cannot verify new answers. If anyone has a similar problem, I would be glad if they could confirm which answers work for them.

I found that by using Jon's template, but building the XFDF with DOMDocument, the numeric entity encoding was handled for me and it worked well. My slight variation is below:
$xml = new DOMDocument( '1.0', 'UTF-8' );
$rootNode = $xml->createElement( 'xfdf' );
$rootNode->setAttribute( 'xmlns', 'http://ns.adobe.com/xfdf/' );
$rootNode->setAttribute( 'xml:space', 'preserve' );
$xml->appendChild( $rootNode );
$fieldsNode = $xml->createElement( 'fields' );
$rootNode->appendChild( $fieldsNode );
foreach ( $fields as $field => $value )
{
    $fieldNode = $xml->createElement( 'field' );
    $fieldNode->setAttribute( 'name', $field );
    $fieldsNode->appendChild( $fieldNode );
    $valueNode = $xml->createElement( 'value' );
    $valueNode->appendChild( $xml->createTextNode( $value ) );
    $fieldNode->appendChild( $valueNode );
}
$xml->save( $file );
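As a usage sketch (not part of the original answer), assuming $file is the path the XFDF was just saved to and the template and output names come from the question, the flattening call could then be:
// $file was written by $xml->save() above; template.pdf and flatFile.pdf
// are the placeholder names from the question.
exec( 'pdftk template.pdf fill_form ' . escapeshellarg( $file ) . ' output flatFile.pdf flatten' );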

You could try the trial version of http://www.adobe.com/products/livecycle/designer/ and see what PDF files it generates.
Another commercial product you could try is http://www.appligent.com/fdfmerge. See page 16 of http://146.145.110.1/docs/userguide/FDFMergeUserGuide.pdf for how it handles XFDF with UTF-8.
I also had a look at the FDF specification http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
On page 12 it states:
Although XFDF is encoded in UTF-8, double byte characters are encoded as character references when
exported from Acrobat.
For example, the Japanese double byte characters あ, い, and う are exported to XFDF using
three character references. Here is an example of double byte characters in a form field:
...
<fields>
<field name="Text1">
<value>Here are 3 UTF-8 double byte
characters: &#12354;&#12356;&#12358;
</value>
</field>
</fields> ...
I looked through pdftk-1.44-dist/java/com/lowagie/text/pdf/XfdfReader.java. It doesn't seem to do anything special with the input.
Maybe pdftk will do what you want if you encode the problematic characters as character references in your XFDF input.

Using pdftk 1.44 on a Win7 machine I encounter the same problems with XFDF files, whereas FDF works fine. I made an XFDF file without any special characters (ANSI only), but pdftk crashed again. I mailed the developer; unfortunately no answer so far.

Unfortunately, UTF-8 character encoding works with neither decimal nor hexadecimal references of non-ASCII characters in the source .xfdf file (PDFTK v. 1.44).

I made some progress on this. Starting with code from http://koivi.com/fill-pdf-form-fields/, I modified the value encoding to output numeric codes for any characters outside the ascii range.
Now with pitulski's special strings:
Poznań Śródmieście Ćwiartka Ósma outputs Pozna ródmiecie wiartka Ósma with some box shapes superimposed
ęóąśłżźćńĘÓĄŚŁŻŹĆŃ outputs óÓ with more box shapes. I think it may be that the box shapes are characters my server doesn't recognize.
I tried it with some French characters: ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ and they all came out OK, but some of them were overlapping.
--edit-- I just tried entering these manually into the form and got the same result minus the box shapes (using Evince). I then tried with a different form (created by someone else) - after entering ęóąśłżźćńĘÓĄŚŁŻŹĆŃ, ółÓŁ was displayed. It looks like it depends which characters are included in the document's embedded fonts.
/*
KOIVI HTML Form to FDF Parser for PHP (C) 2004 Justin Koivisto
Version 1.2.?
Last Modified: 2013/01/17 - Jon Hulka(jon dot hulka at gmail dot com)
- changed character encoding, all non-ascii characters get encoded as numeric character references
This library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version.
This library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this library; if not, write to the Free Software Foundation,
Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Full license agreement notice can be found in the LICENSE file contained
within this distribution package.
Justin Koivisto
justin dot koivisto at gmail dot com
http://koivi.com
*/
/**
* createXFDF
*
* Takes values passed via an associative array and generates XFDF file format
* with that data for the PDF file supplied.
*
* @param string $file The pdf file - url or file path accepted
* @param array $info data to use in key/value pairs no more than 2 dimensions
* @param string $enc default UTF-8, match server output: default_charset in php.ini
* @return string The XFDF data for Acrobat Reader to use in the pdf form file
*/
function createXFDF($file,$info,$enc='UTF-8'){
$data=
'<?xml version="1.0" encoding="'.$enc.'"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>';
foreach($info as $field => $val){
$data.='
<field name="'.$field.'">';
if(is_array($val)){
foreach($val as $opt)
//2013.01.17 - Jon Hulka - all non-ascii characters get character references
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($opt),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($opt,ENT_COMPAT,$enc).'</value>'."\n";
}else{
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($val),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($val,ENT_COMPAT,$enc).'</value>'."\n";
}
$data.='
</field>';
}
$data.='
</fields>
<ids original="'.md5($file).'" modified="'.time().'" />
<f href="'.$file.'" />
</xfdf>';
return $data;
}

While pdftk doesn't appear to support UTF-8 in the FDF file, I found that with
iconv -f utf-8 -t ISO_8859-1
in the pipeline to convert the FDF file to ISO-Latin-1, at least those characters that fall within the Latin-1 code page will still be represented properly.
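A minimal PHP sketch of the same idea, assuming $fdf already holds the UTF-8 FDF content (the //TRANSLIT suffix is my assumption, used to approximate rather than fail on characters outside Latin-1):
// Convert the generated FDF from UTF-8 to Latin-1 before handing it to pdftk.
$latin1 = iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $fdf );
file_put_contents( 'fieldData.fdf', $latin1 );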

Which version of pdftk?
I tried the same thing with Polish characters (UTF-8).
It does not work for me.
pdftk.exe, libiconv2.dll from: http://www.pdflabs.com/docs/install-pdftk/
Windows 7, cmd, file.pdf + file.fdf -> new.pdf
pdftk file.pdf fill_form file.xfdf output new.pdf flatten
Unhandled Java Exception:
java.lang.NoClassDefFoundError: gnu.gcj.convert.Input_UTF8 not found in [file:.\, core:/]
at 0x005a3abe (Unknown Source)
at 0x005a3fb2 (Unknown Source)
at 0x006119f4 (Unknown Source)
at 0x00649ee4 (Unknown Source)
at 0x005b4c44 (Unknown Source)
at 0x005470a9 (Unknown Source)
at 0x00549c52 (Unknown Source)
at 0x0059d348 (Unknown Source)
at 0x007323c9 (Unknown Source)
at 0x0054715a (Unknown Source)
at 0x00562349 (Unknown Source)
But with an FDF file with the same content, it worked properly.
However, the characters in new.pdf are bad.
pdftk file.pdf fill_form file.fdf output new.pdf flatten
---FDF---
%FDF-1.2
%âãÏÓ
1 0 obj<</FDF<</F(file.pdf)
/Fields[
<</T(Miejsce)/V(666 Poznań Śródmieście Ćwiartka Ósma)>>
<</T(Nr)/V(ęóąśłżźćńĘÓĄŚŁŻŹĆŃ)>>
]>>>>
endobj
trailer
<</Root 1 0 R>>
%%EOF
---XFDF---
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="file.pdf"/>
<fields>
<field name="Miejsce">
<value>666 Poznań Śródmieście Ćwiartka Ósma</value>
</field>
<field name="Nr">
<value>ęóąśłżźćńĘÓĄŚŁŻŹĆŃ</value>
</field>
</fields>
</xfdf>
---PDF---
Miejsce: 666 PoznaÅ— ÅıródmieÅłcie ăwiartka Ãfisma
Nr: ÄŽÃ³Ä–ÅłÅ‡Å¼ÅºÄ⁄Å—ÄŸÃfiÄ—ÅıņŻŹăÅ

You can introduce UTF-8 characters by escaping each byte of their UTF-8 encoding in octal with \ddd.
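A hedged PHP sketch of that approach (the helper name and the choice to escape every non-ASCII byte are my assumptions):
// Escape every non-ASCII byte (and FDF string delimiters) as \ddd octal.
function fdf_octal_escape( $value ) {
    $out = '';
    foreach ( str_split( $value ) as $byte ) {
        $ord = ord( $byte );
        if ( $ord >= 0x20 && $ord <= 0x7E && !in_array( $byte, array( '(', ')', '\\' ) ) ) {
            $out .= $byte;                      // plain printable ASCII stays as-is
        } else {
            $out .= sprintf( '\\%03o', $ord );  // e.g. "ń" becomes \305\204
        }
    }
    return $out;
}
// Usage inside an FDF field entry: '<</T(Nr)/V(' . fdf_octal_escape( 'Poznań' ) . ')>>'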

To solve this, I wrote PdfFormFillerUTF-8: http://sourceforge.net/projects/pdfformfiller2/

There is a drop-in replacement for the pdftk tool,
Mcpdf: https://github.com/m-click/mcpdf
that solves the Unicode issues when filling forms. It works for me with CP1250 characters (Central Europe).
From project page:
the following command fills in form data from DATA.xfdf into FORM.pdf
and writes the result to RESULT.pdf. It also flattens the document to
prevent further editing:
java -jar mcpdf.jar FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
This corresponds exactly to the usual PDFtk command:
pdftk FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
Note that you need to have JRE installed.
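Since the question is about PHP, here is a hedged sketch of driving Mcpdf through proc_open; the jar path and file names are assumptions:
// Feed the XFDF on stdin and read the flattened PDF from stdout.
$spec = array( 0 => array( 'pipe', 'r' ), 1 => array( 'pipe', 'w' ), 2 => array( 'pipe', 'w' ) );
$proc = proc_open( 'java -jar /path/to/mcpdf.jar FORM.pdf fill_form - output - flatten', $spec, $pipes );
fwrite( $pipes[0], file_get_contents( 'DATA.xfdf' ) );
fclose( $pipes[0] );
file_put_contents( 'RESULT.pdf', stream_get_contents( $pipes[1] ) );
fclose( $pipes[1] );
fclose( $pipes[2] );
proc_close( $proc );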

I have managed to make it work with pdftk by creating an XFDF file with UTF-8 encoding.
It took several tries, but what made it work as expected was adding 'need_appearances'.
Here is an example:
pdftk source.pdf fill_form data.xfdf output output.pdf need_appearances
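From PHP this would be the same exec() call as in the question, just with the extra option (the file names here are placeholders):
exec( 'pdftk source.pdf fill_form data.xfdf output output.pdf need_appearances' );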

I had been working on this issue for a long time, and finally I found the solution!
So, let's start.
Download and install the latest version of pdftk (pdftk-java); the snippet below is taken from an apk-based Dockerfile:
# PDFTK
RUN apk add openjdk8 \
&& cd /tmp \
&& wget https://gitlab.com/pdftk-java/pdftk/-/jobs/1507074845/artifacts/raw/build/libs/pdftk-all.jar \
&& mv pdftk-all.jar pdftk.jar \
&& echo '#!/usr/bin/env bash' > pdftk \
&& echo 'java -jar "$0.jar" "$@"' >> pdftk \
&& chmod 775 pdftk* \
&& mv pdftk* /usr/local/bin \
&& pdftk -version
Open your PDF form in Adobe Acrobat Reader and look at the field options: you need to identify the font, for example Helvetica, and download that font.
Fill the form with the flatten option:
/usr/local/bin/pdftk A=form.pdf fill_form xfdf.xml output out.pdf drop_xfa need_appearances flatten replacement_font /path/to/font/HelveticaRegular.ttf
xfdf.xml example:
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="Check Box 136">
<value>Your value | Значение (Cyrillic)</value>
</field>
</fields>
</xfdf>
Enjoy :)

pdftk supports encoding in UTF-16BE. It's not that difficult to convert from UTF-8 to UTF-16BE.
See: Weird characters when filling PDF with PDFTk
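A hedged PHP sketch of that approach, building an FDF whose values are UTF-16BE strings with a byte-order mark (the helper name and the escaping of FDF delimiters are my assumptions):
// Encode one FDF field value as a UTF-16BE string with BOM.
function fdf_utf16_value( $value ) {
    $utf16 = "\xFE\xFF" . mb_convert_encoding( $value, 'UTF-16BE', 'UTF-8' );
    return '(' . addcslashes( $utf16, "()\\\r\n" ) . ')';
}
$fdf = "%FDF-1.2\n1 0 obj<</FDF<</Fields[\n"
     . "<</T(Nr)/V" . fdf_utf16_value( 'ęóąśłżźćń' ) . ">>\n"
     . "]>>>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF";
file_put_contents( 'file.fdf', $fdf );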

Related

unsupported format character error in youtube-dl

I've been trying to download a YouTube playlist using youtube-dl, but I had a problem with the output template. I used the following command to download the playlist in an organised way:
youtube-dl -f mp4 -o "Desktop/mainFolder/courses/%(playlist_title)s-%(playlist_uploader)/%(title)s.%(ext)s" --embed-thumbnail --add-metadata --mark-watched https://www.youtube.com/playlist?list=PLzH6n4zXuckpfMu_4Ff8E7Z1behQks5ba
but I kept getting the following error:
ERROR: Error in output template: unsupported format character '/' (0x2f) at index 66 (encoding: 'UTF-8')
It states that I used an unsupported character, '/'. Weirdly enough, I used almost the same output template format in a previous download:
youtube-dl -f mp4 -o "Desktop/mainFolder/courses/%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s" --add-metadata https://www.youtube.com/playlist?list=PL4C9296DF81B9EF13
and it worked just fine.
What did I do differently here so that the first command didn't work but the second one did?
If you see this error it probably means one of the format expressions doesn't end with an s.
In this case, it looks like you're missing one after %(playlist_uploader).
youtube-dl -f mp4 -o "Desktop/mainFolder/courses/%(playlist_title)s-%(playlist_uploader)s/%(title)s.%(ext)s" --embed-thumbnail --add-metadata --mark-watched https://www.youtube.com/playlist?list=PLzH6n4zXuckpfMu_4Ff8E7Z1behQks5ba

how to guess file encoding

I have a file (an author list from the Library of Congress) with lines like:
Arteaga, Ana Mar�ia
Corval�an-V�asquez, Oscar E.
(when printed to linux console)
I'd like to read those (either into a pandas dataframe or a set of lines)
df = pd.read_csv(fname, sep='\t', header='infer', lineterminator=None,encoding='latin1') #lineterminator \r\n hits error
or
with open(fname,'r',encoding='ISO-8859-1') as fp:
lines=fp.readlines()
but both are not quite right, giving me output like
Arteaga, Ana Marâia
(again when printed to console)
when I am pretty sure the actual name here should be María.
Does someone recognize this format?
OK, this seems to be the 'MARC-8' format.
yaz-iconv -f marc8 -t utf8 infile.txt > outfile.txt
took care of the conversion to UTF-8, with the sole hiccup being that yaz killed all the line terminators (for both the \r\n and \n versions of the file).
Those can be returned with something along the lines of
sed 's/\[/\n\[/g' outfile.txt > outfile_utf.txt
(for example in my case where each line starts with a '[' character)

sqlQuery in R fails when called via source() [duplicate]

The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
We talked about this a lot in the comments to my previous post, but I don't want this to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console and with input from a file, as in the following example:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that without the locale change the file input fails, and it points to the same character as the original poster's error (the one after the "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same; you just replace the locale with "chinese" (the naming is a bit inconsistent, consult the documentation).
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files with many non-ASCII characters in. But some characters cause it to fail. For example the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?
For me (on windows) I do:
source.utf8 <- function(f) {
l <- readLines(f, encoding="UTF-8")
eval(parse(text=l),envir=.GlobalEnv)
}
It works fine.
Building on crow's answer, this solution makes RStudio's Source button work.
When hitting that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
l <- readLines(f, encoding=encoding)
eval(parse(text=l),envir=.GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.
I encountered this problem when I tried to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.
Note that it's not enough to get the encoding right when you read or write a plain text file with non-ASCII content in RStudio (or R?). The locale setting counts too.
PS. the command is Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Please replace the locale value correspondingly.
On Windows, when you copy-paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII... depending on locale), the unknown bytes will be replaced by question marks. If I take the first 4 characters of your string and copy-paste them into e.g. Notepad and then save it, the file becomes, in hex:
52 3F 3F 3F 3F
What you have to do is find an editor which you can set to UTF-8 before copy-pasting the text into it; then the saved file (of your first 4 characters) becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
This will then be recognized as valid utf-8 by [R].
I used "Notepad2" for trying this, but i am sure there are many more.

Setting the photometric interpretation tag for a multi-page tiff

While trying to convert a multipage document from a tiff to a pdf, I encountered the following problem:
↪ tiff2pdf 0271.f1.tiff -o 0271.f1.pdf
tiff2pdf: No support for 0271.f1.tiff with no photometric interpretation tag.
tiff2pdf: An error occurred creating output PDF file.
Does anybody know what causes this and how to fix it?
This happens because one or more of the pages in the multi-page tiff does not have the photometric interpretation tag set. This is a required tag, so that means your tiffs are technically invalid (though I bet they work fine anyway).
To fix this, you must identify the page (or pages) that does not have the photometric interpretation set and fix it.
To identify the page, you can simply run something like:
↪ tiffinfo your-file.tiff
This will spit out the info for every page of your tiff. For each good page, you'll see something like:
TIFF Directory at offset 0x105c0 (67008)
Subfile Type: (0 = 0x0)
Image Width: 1760 Image Length: 2639
Resolution: 300, 300 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 4
**Photometric Interpretation: min-is-white**
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 2639
Planar Configuration: single image plane
Software: ScanFix(TM) Enhanced ImageGear Version: 11.00.024
DateTime: Mon Oct 31 15:11:07 2005
Artist: 1996-2001 AccuSoft Co., All rights reserved
If you have a bad page, it'll lack the photometric interpretation section, and you can fix it with:
↪ tiffset -d $page-number -s 262 0 your-file.tiff
Note that the value of zero is the default for the photometric interpretation key, which is tag 262. You can see the other possible values for this key in the TIFF specification.
If your tiff has a lot of pages (like mine does), you may not be able to easily identify the bad page by eye. In that case, you can take a brute force approach, setting the photometric interpretation for all pages to the default value.
# First, split the tiff into many one-page files
↪ tiffsplit your-file.tiff
# Then, set the photometric interpretation to the default for all pages
↪ find . -name '*.tiff' -exec tiffset -s 262 0 '{}' \;
# Then rejoin the pages
↪ tiffcp *.tiff -o out-file.tiff
Lot of dummy work, but gets the job done.

How to export printable text only(or any other packet property) in wireshark

Long story short - I'm capturing SQL statements sent from a vendor tool to an Oracle database by using Wireshark. It already has a decoder for the TNS protocol (which is great) and I can access the text of the SQL by
Right Click->Copy->Bytes(Printable Text Only).
The problem is that there are tons of packets, and doing a right-click on each of them would take ages. I was wondering if there is any way to export 'Printable Text Only' right from Wireshark. Ideally I want to have a text file with the statements.
Any help will be highly appreciated.
Finally found a way to do this. First, use tshark to capture the TNS packets:
tshark -R tcp.port==1521 -T fields -e data.data -d tcp.port==1521,tns > input.txt
Then you could use the home-brew Ruby script below to transform the bytes to text:
# Usage: ruby encode.rb input.txt [all]
file = ARGV[0]
print_all = ARGV[1]
File.open(file, "r").each { |line|
  # Each non-empty line from tshark is a list of hex byte values separated by ':' or ','.
  line.gsub(",", ":").split(':').each { |byte|
    chr = Integer('0x' + byte).chr
    # Print printable ASCII and newlines; with the 'all' argument, print every byte.
    print chr if ((' '..'~').include?(chr) or chr == "\n") or (print_all.downcase == 'all' if print_all)
  } if !line.chomp.empty?
}
Examples are:
encode.rb input.txt > output.txt
will export printable text only from input to output
encode.rb input.txt all > output.txt
will export all text from input to output
An easy way of looking at them all that has worked for me is just Right Click -> Follow TCP Stream.
A note: unprintable characters are displayed as .s. If there are a bunch of these interspersed between all the text you want to extract (as there was for me), switch the view to ASCII, save it and open it in your favourite text editor (vim for me), then run a search and replace similar to :%s/\.//g.
I don't know how to do it with TNS, but you can do something like this using tshark, for example to look at HTTP requests:
tshark -T fields -e http.request.uri
So if you can look at the options in the TNS decoder, you should be able to grab that field and redirect the output to a file.