I am trying to extract rivers from OSM. I downloaded the waterway GPKG, which I believe contains over 21 million entries (see link), with a file size of 19.9 GB.
I have tried using the Split Vector Layer tool in QGIS, but it would crash.
I was thinking of using GDAL's ogr2ogr, but I am having trouble putting the command line together.
I first isolated the MultiLineString features with the following command:
ogr2ogr -f gpkg water.gpkg waterway_EPSG4326.gpkg waterway_EPSG4326_line -nlt linestring
ogrinfo water.gpkg
INFO: Open of `water.gpkg' using driver `GPKG' successful.
1: waterway_EPSG4326_line (Line String)
I then tried the following command, but it is not working:
ogr2ogr -f GPKG SELECT * FROM waterway_EPSG4326_line - where waterway="river" river.gpkg water.gpkg
Please let me know what is missing, or if there is an easier way to perform the task. I tried opening the file with the R sf package, but it would not load even after a long time.
Thanks
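For reference, here is a sketch of how the attribute filter could be written with ogr2ogr's -where or -sql options (the layer and field names are taken from the commands above; this is untested against this particular file, and the quoting may need adjusting for your shell):

# Keep only features whose waterway attribute is 'river'
ogr2ogr -f GPKG -where "waterway = 'river'" rivers.gpkg water.gpkg waterway_EPSG4326_line

# Or filter the original download with an SQL statement (same idea)
ogr2ogr -f GPKG -sql "SELECT * FROM waterway_EPSG4326_line WHERE waterway = 'river'" rivers.gpkg waterway_EPSG4326.gpkg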
I am having trouble getting multi-line LaTeX equations to compile when using knitr in R Markdown. When I output to an HTML file it works, but when I attempt to output to a PDF I get errors and it will not compile.
The following sample code illustrates the problem. Just change output: html_document to output: pdf_document. You'll see that in HTML output both Test 1 and Test 2 work, but in PDF output only Test 1 compiles successfully.
---
title: "Test"
date: "1/16/2018"
output: html_document
---
## R Markdown
Testing to see if a multi-line latex equation works.
#### Test 1:
$$ f(x) = x^5 + x^3 + x $$
$$ g(x) = y^{x+1} $$
#### Test 2:
\begin{align}
f(x) &= x^5 + x^3 + x \\
g(x) &= y^{x+1}
\end{align}
I am running Mac OS X High Sierra and RStudio version 1.1.383.
The following is the error I got:
Rule 'pdflatex': File changes, etc:
Non-existent destination files:
'Untitled.pdf'
------------
Run number 1 of rule 'pdflatex'
------------
------------
Running 'pdflatex -halt-on-error -interaction=batchmode -recorder "Untitled.tex"'
------------
This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
=== TeX engine is 'pdfTeX'
Latexmk: Errors, so I did not complete making targets
Collected error summary (may duplicate other messages):
pdflatex: Command for 'pdflatex' gave return code 1
Refer to 'Untitled.log' for details
Latexmk: Use the -f option to force complete processing,
unless error was exceeding maximum runs of latex/pdflatex.
! Paragraph ended before \align was complete.
<to be read again>
\par
l.112
Error: Failed to compile Untitled.tex. See Untitled.log for more info.
Execution halted
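For reference, a workaround that is often suggested for this HTML/PDF mismatch (not verified against this exact document) is to put an aligned environment inside display math, which both MathJax and amsmath under pdflatex accept:

$$
\begin{aligned}
f(x) &= x^5 + x^3 + x \\
g(x) &= y^{x+1}
\end{aligned}
$$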
I want to convert my PDF files to TXT files and used the pdfminer3k module and pdf2txt.py; however, I got an error.
pdf2txt.py -o file.txt -t tag file.pdf
This is what I ran at the cmd prompt.
Traceback (most recent call last):
File "C:\Python36\lib\site.py", line 67, in
import os
File "C:\Python36\lib\os.py", line 409
yield from walk(new_path, topdown, onerror, followlinks)
^
SyntaxError: invalid syntax
This is the error message that I got.
Could you help me fix this problem?
Added for reference, a great resource:
http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
The -t flag is the type of output. The options are text, tag, xml, and html.
Tag refers to generating tagged output for XML. Replace tag with text in your command and try it.
The order of the optional arguments also matters.
You also must invoke python; your command line doesn't know what import means, yet some of your environment seems to be set up. My example is for the Windows cmd prompt, run from the Anaconda3\Scripts directory. If you're in a Jupyter notebook or a console, you should be able to import pdf2txt as a module (without the .py).
To set up your environment you need to append your PDF directory to the path, e.g. os.path.append(yourpdfdirectory); otherwise file.pdf will not be found.
Try python pdf2txt.py -t text -o file.txt file.pdf
Or, if you are brave, this is how to do it programmatically. The trouble with XML is that if you want to get the text, each character from the XML tree is returned in an arbitrary order. You can get it to work, but you need to build the string character by character, which is not that hard; it's just logically time-consuming.
# Imports for the older pdfminer / pdfminer3k API used below
# (in the newer pdfminer.six these classes live in different modules)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextBoxHorizontal, LTTextLine

filesin = 'file.pdf'       # path to the PDF to parse
obj_textbox_line = {}      # maps each layout object to its extracted text

fp = open(filesin, 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')         # empty password

rsrcmgr = PDFResourceManager(caching=False)
laparams = LAParams(all_texts=True)
laparams.boxes_flow = -0.2
laparams.paragraph_indent = 0.2
laparams.detect_vertical = False
#laparams.heuristic_word_margin = 0.03
laparams.word_margin = 0.2
laparams.line_margin = 0.3

outfp = open(filesin + ".out.tag", 'wb')   # opened as in the original; not used below
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
#process_pdf(rsrcmgr, device, pdfparse, pagenos, caching=c, check_extractable=True)

for p, page in enumerate(doc.get_pages()):
    if p == 0:  # temporary: only process page 1
        interpreter.process_page(page)
        layout = device.get_result()
        alltextinbox = ''
        # This is a rich environment, so categorization of the object hierarchy is needed
        for c, lt_obj in enumerate(layout):
            #print(type(lt_obj), "is the type of the", c, "th object on the", p, "th page")
            if isinstance(lt_obj, (LTTextBoxHorizontal, LTTextBox, LTTextLine)):
                print("Type", type(lt_obj), "and text ..", lt_obj.get_text())
                obj_textbox_line.update({lt_obj: lt_obj.get_text()})
    else:
        pass

fp.close()

#print(obj_textbox_line)
#call the column finder here
#check_matching("example", "example1")
#text_doc_df = pd.DataFrame(obj_textbox_line, columns=['text'])
#print(text_doc_df)
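If it helps, a minimal way to run the snippet above (assuming it is saved as extract_text.py, a name chosen only for this example, and that pdfminer3k is installed) would be:

pip install pdfminer3k
python extract_text.py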
I'm working on a generic row/column matcher. If you don't want to bother, you can buy commercial converter software that already does this for around 150 bucks.
Setup: here is sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchDVI_1.9 knitr_1.5
loaded via a namespace (and not attached):
[1] compiler_3.0.2 evaluate_0.5.1 formatR_0.9 highr_0.2.1 stringr_0.6.2
[6] tcltk_3.0.2 tools_3.0.2
I am trying to get Emacs and AUCTeX to synchronize my .Rnw source file with Evince, so that I can go from source to compiled text and back.
I have already checked that the synchronization works fine between a .tex source and a PDF.
My .Rnw file starts with :
\documentclass[a4paper,twoside,12pt]{article}
\synctex=1 %% Should force concordance generation
\pdfcompresslevel=0 %% Should force avoidance of PDF compression, which patchDVI does
\pdfobjcompresslevel=0 %% not handle
<<include=FALSE>>= %% Modification of what Sweave2knitr does
## opts_chunk$set(concordance=TRUE, self.contained=TRUE) ## No possible effect
opts_knit$set(concordance=TRUE, self.contained=TRUE) ## Seems reasonable
@
%% \SweaveOpts{concordance=TRUE} %% That's where inspiration came from
Consider the following log (irrelevant parts edited):
> options("knitr.concordance")
$knitr.concordance
[1] TRUE
> opts_knit$get("concordance")
[1] TRUE
> knit("IntroStat.Rnw")
processing file: IntroStat.Rnw
|...................... | 33%
ordinary text without R code
|........................................... | 67%
label: unnamed-chunk-1 (with options)
List of 1
$ include: logi FALSE
|.................................................................| 100%
ordinary text without R code
output file: IntroStat.tex
[1] "IntroStat.tex"
> system("pdflatex -synctex=1 IntroStat.tex")
[ Edited irrelevancies ]
SyncTeX written on IntroStat.synctex.gz.
Note: a concordance has *been* generated!!!
Transcript written on IntroStat.log.
Let's do that again to fix references :
> system("pdflatex -synctex=1 IntroStat.tex")
[ Edited irrelevancies ]
Output written on IntroStat.pdf (1 page, 136907 bytes).
SyncTeX written on IntroStat.synctex.gz.
Note: a concordance has *been* generated *again*!!!
Transcript written on IntroStat.log.
> patchDVI("IntroStat.pdf")
[1] "0 patches made. Did you set \\SweaveOpts{concordance=TRUE}?"
* This I do not understand *
> patchSynctex("IntroStat.synctex.gz")
[1] "0 patches made. Did you set \\SweaveOpts{concordance=TRUE}?"
* Ditto *
It appears that something in this set of tools does not work as advertised: either patchDVI does not recognize legal concordance \specials, or pdflatex does not generate them. It does generate something, however...
I checked that the resulting PDF lets Evince synchronize with the .tex file, but not with the .Rnw file. Furthermore, when the .Rnw file is open in Emacs, starting the viewer with 'C-c C-v View' in AUCTeX does start the viewer (after requesting to open a server, which I authorize), but the viewer is empty, and I get this:
"TeX-evince-sync-view: Couldn't find the Evince instance for file:///home/charpent/Boulot/Cours/ODF/Chapitres/Ch3-StatMath/IntroStat.Rnw.pdf"
in the "Messages" buffer.
So we have a second problem here.
A third problem would be to integrate all of this transparently into the AUCTeX production chain, but that is another story...
I'd really like to keep Emacs as my main tool for R/LaTeX/Sage work, rather than switch to RStudio, which probably won't play well with SageTeX and the other various tools I need on a daily/weekly basis...
Any thoughts?
Maybe this https://github.com/jan-glx/patchKnitrSynctex will help. I tried it on a simple file, and it does work.
As for the second and third problems, I have this script (note that I source the above code from jan-glx; modify path accordingly):
#!/bin/bash
# Knit the .Rnw with concordance on, run pdflatex with SyncTeX, then patch the SyncTeX data
FILE=$1
BASENAME=$(basename "$FILE" .Rnw)
Rscript -e 'library(knitr); opts_knit$set("concordance" = TRUE); knit("'$1'")'
pdflatex --synctex=1 --file-line-error --shell-escape "${1%.*}"
Rscript -e "source('~/Sources/patchKnitrSynctex.R'); patchKnitrSynctex('${1%.*}.tex')"
# Symlinks so the viewer finds the output under the .Rnw-based names
ln -s "$BASENAME.synctex.gz" "$BASENAME.Rnw.synctex.gz"
ln -s "$BASENAME.pdf" "$BASENAME.Rnw.pdf"
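Assuming the script is saved as knitsync.sh (a name chosen only for illustration) and made executable, calling it from a shell buffer would look like:

./knitsync.sh IntroStat.Rnw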
The symlinks are my kludgy way of getting around the "Couldn't find the Evince instance (...)" message.
If you have your .Rnw in an Emacs buffer, go to a shell buffer and call that script. When it finishes, C-c C-v from Emacs will open your configured PDF viewer (Okular in my case). In the PDF, shift + left mouse click (in Okular at least) will bring you to the right place in the Emacs .Rnw buffer.
This is not ideal: if you jump to an error, it goes to the .tex, not the .Rnw. And I'd like to be able to invoke it via C-c C-c or similar (but I don't know how; elisp ignorance).
My scenario:
A PDF template with form fields: template.pdf
An XFDF file that contains the data to be filled in: fieldData.xfdf
Now I need to have these two files combined and flattened.
pdftk does the job easily from within PHP:
exec("pdftk template.pdf fill_form fieldData.xfdf output flatFile.pdf flatten");
Unfortunately, this does not work with full UTF-8 support.
For example: Cyrillic and Greek letters get scrambled. I used Arial for this, with a Unicode character set.
How can I manage to flatten my Unicode files?
Is there any other PDF tool that offers Unicode support?
Does pdftk have a Unicode switch that I am missing?
EDIT 1: As this question has not been solved for more than 9 months, I decided to start a bounty for it. In case there are options to sponsor a feature or a bugfix in pdftk, I'd be glad to donate.
EDIT 2: I am not working on this project anymore, so I cannot verify new answers. If anyone has a similar problem, I'd be glad if they could report back on what works.
I found that by using Jon's template, but building the XML with DOMDocument, the numeric encoding was handled for me and it worked well. My slight variation is below:
// $fields is the associative array of form field names => values;
// $file is the path the XFDF will be saved to
$xml = new DOMDocument( '1.0', 'UTF-8' );
$rootNode = $xml->createElement( 'xfdf' );
$rootNode->setAttribute( 'xmlns', 'http://ns.adobe.com/xfdf/' );
$rootNode->setAttribute( 'xml:space', 'preserve' );
$xml->appendChild( $rootNode );
$fieldsNode = $xml->createElement( 'fields' );
$rootNode->appendChild( $fieldsNode );
foreach ( $fields as $field => $value )
{
    $fieldNode = $xml->createElement( 'field' );
    $fieldNode->setAttribute( 'name', $field );
    $fieldsNode->appendChild( $fieldNode );
    $valueNode = $xml->createElement( 'value' );
    $valueNode->appendChild( $xml->createTextNode( $value ) );
    $fieldNode->appendChild( $valueNode );
}
$xml->save( $file );
You could try the trial version of http://www.adobe.com/products/livecycle/designer/ and see what PDF files it generates.
Another commercial software you could try is http://www.appligent.com/fdfmerge. See page 16 in http://146.145.110.1/docs/userguide/FDFMergeUserGuide.pdf for how it handles xFDF with UTF-8.
I also had a look at the FDF specification http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
On page 12 it states:
Although XFDF is encoded in UTF-8, double byte characters are encoded as character references when exported from Acrobat.
For example, the Japanese double byte characters あ, い, and う are exported to XFDF using three character references. Here is an example of double byte characters in a form field:
...
<fields>
<field name="Text1">
<value>Here are 3 UTF-8 double byte
characters: あいう
</value>
</field>
</fields> ...
I looked through pdftk-1.44-dist/java/com/lowagie/text/pdf/XfdfReader.java. It doesn't seem to do anything special with the input.
Maybe pdftk will do what you want, when you encode the weird characters as character references in your xFDF input.
Using pdftk 1.44 on a Win7 machine I encounter the same problems with XFDF files, whereas FDF works fine. I made an XFDF file without any special characters (ANSI only), but pdftk crashed again. I mailed the developer; unfortunately, no answer so far.
Unfortunately, UTF-8 character encoding works with neither decimal nor hexadecimal references of non-ASCII characters in the source .xfdf file (PDFTK v. 1.44).
I made some progress on this. Starting with code from http://koivi.com/fill-pdf-form-fields/, I modified the value encoding to output numeric codes for any characters outside the ASCII range.
Now with pitulski's special strings:
Poznań Śródmieście Ćwiartka Ósma outputs Pozna ródmiecie wiartka Ósma with some box shapes superimposed
ęóąśłżźćńĘÓĄŚŁŻŹĆŃ outputs óÓ with more box shapes. I think it may be that the box shapes are characters my server doesn't recognize.
I tried it with some French characters: ùûüÿ€’“”«»àâæçéèêëïôœÙÛÜŸÀÂÆÇÉÈÊËÏÎÔ and they all came out OK, but some of them were overlapping.
--edit-- I just tried entering these manually into the form and got the same result minus the box shapes (using Evince). I then tried with a different form (created by someone else) - after entering ęóąśłżźćńĘÓĄŚŁŻŹĆŃ, ółÓŁ was displayed. It looks like it depends which characters are included in the document's embedded fonts.
/*
KOIVI HTML Form to FDF Parser for PHP (C) 2004 Justin Koivisto
Version 1.2.?
Last Modified: 2013/01/17 - Jon Hulka(jon dot hulka at gmail dot com)
- changed character encoding, all non-ascii characters get encoded as numeric character references
This library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version.
This library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this library; if not, write to the Free Software Foundation,
Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Full license agreement notice can be found in the LICENSE file contained
within this distribution package.
Justin Koivisto
justin dot koivisto at gmail dot com
http://koivi.com
*/
/**
 * createXFDF
 *
 * Takes values passed via an associative array and generates XFDF-format
 * data for the PDF file supplied.
 *
 * @param string $file The pdf file - url or file path accepted
 * @param array  $info data to use in key/value pairs no more than 2 dimensions
 * @param string $enc  default UTF-8, match server output: default_charset in php.ini
 * @return string The XFDF data for acrobat reader to use in the pdf form file
 */
function createXFDF($file,$info,$enc='UTF-8'){
$data=
'<?xml version="1.0" encoding="'.$enc.'"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>';
foreach($info as $field => $val){
$data.='
<field name="'.$field.'">';
if(is_array($val)){
foreach($val as $opt)
//2013.01.17 - Jon Hulka - all non-ascii characters get character references
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($opt),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($opt,ENT_COMPAT,$enc).'</value>'."\n";
}else{
$data.='
<value>'.mb_encode_numericentity(htmlspecialchars($val),array(0x0080, 0xffff, 0, 0xffff), 'UTF-8').'</value>';
// $data.='<value>'.htmlentities($val,ENT_COMPAT,$enc).'</value>'."\n";
}
$data.='
</field>';
}
$data.='
</fields>
<ids original="'.md5($file).'" modified="'.time().'" />
<f href="'.$file.'" />
</xfdf>';
return $data;
}
While pdftk doesn't appear to support UTF-8 in the FDF file, I found that with
iconv -f utf-8 -t ISO_8859-1
in the pipeline, converting that FDF file to ISO Latin-1, at least those characters that are in the Latin-1 code page will still be represented properly.
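Concretely, the pipeline could look something like this (a sketch with made-up file names; the pdftk arguments mirror the question above):

# Convert the form data to Latin-1 first; characters outside Latin-1 will make
# iconv stop unless you append //TRANSLIT to the target encoding
iconv -f utf-8 -t ISO_8859-1 fieldData_utf8.fdf > fieldData_latin1.fdf
pdftk template.pdf fill_form fieldData_latin1.fdf output flatFile.pdf flatten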
Which pdftk version is that?
I tried the same thing with Polish characters (UTF-8).
It does not work for me.
pdftk.exe, libiconv2.dll from: http://www.pdflabs.com/docs/install-pdftk/
Windows 7, cmd, file.pdf + file.fdf -> new.pdf
pdftk file.pdf fill_form file.xfdf output new.pdf flatten
Unhandled Java Exception:
java.lang.NoClassDefFoundError: gnu.gcj.convert.Input_UTF8 not found in [file:.\, core:/]
at 0x005a3abe (Unknown Source)
at 0x005a3fb2 (Unknown Source)
at 0x006119f4 (Unknown Source)
at 0x00649ee4 (Unknown Source)
at 0x005b4c44 (Unknown Source)
at 0x005470a9 (Unknown Source)
at 0x00549c52 (Unknown Source)
at 0x0059d348 (Unknown Source)
at 0x007323c9 (Unknown Source)
at 0x0054715a (Unknown Source)
at 0x00562349 (Unknown Source)
But with an FDF file containing the same content, the command ran without errors.
However, the characters in new.pdf are wrong.
pdftk file.pdf fill_form file.fdf output new.pdf flatten
---FDF---
%FDF-1.2
%âãÏÓ
1 0 obj<</FDF<</F(file.pdf)
/Fields[
<</T(Miejsce)/V(666 Poznań Śródmieście Ćwiartka Ósma)>>
<</T(Nr)/V(ęóąśłżźćńĘÓĄŚŁŻŹĆŃ)>>
]>>>>
endobj
trailer
<</Root 1 0 R>>
%%EOF
---XFDF---
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="file.pdf"/>
<fields>
<field name="Miejsce">
<value>666 Poznań Śródmieście Ćwiartka Ósma</value>
</field>
<field name="Nr">
<value>ęóąśłżźćńĘÓĄŚŁŻŹĆŃ</value>
</field>
</fields>
</xfdf>
---PDF---
Miejsce: 666 PoznaÅ— ÅıródmieÅłcie ăwiartka Ãfisma
Nr: ÄŽÃ³Ä–ÅłÅ‡Å¼ÅºÄ⁄Å—ÄŸÃfiÄ—ÅıņŻŹăÅ
You can introduce UTF-8 characters by giving their Unicode code in octal with \ddd.
To solve this, I wrote PdfFormFillerUTF-8: http://sourceforge.net/projects/pdfformfiller2/
There is a drop-in replacement for the pdftk tool,
Mcpdf: https://github.com/m-click/mcpdf
which solves the Unicode issues when filling forms. It works for me with CP1250 characters (Central European).
From project page:
the following command fills in form data from DATA.xfdf into FORM.pdf
and writes the result to RESULT.pdf. It also flattens the document to
prevent further editing:
java -jar mcpdf.jar FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
This corresponds exactly to the usual PDFtk command:
pdftk FORM.pdf fill_form - output - flatten < DATA.xfdf > RESULT.pdf
Note that you need to have JRE installed.
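A quick way to check that a JRE is available on the PATH (assuming a standard installation):

java -version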
I have managed to make it work with pdftk by creating an XFDF file with UTF-8 encoding.
It took several tries, but what made it work as expected was adding 'need_appearances'.
Here is an example:
pdftk source.pdf fill_form data.xfdf output output.pdf need_appearances
I have been solving this issue for a long time, and I have finally found the solution!
So, let's start.
Download and install the latest version of pdftk (the snippet below comes from an Alpine-based Docker build; adjust for your system):
# PDFTK
RUN apk add openjdk8 \
&& cd /tmp \
&& wget https://gitlab.com/pdftk-java/pdftk/-/jobs/1507074845/artifacts/raw/build/libs/pdftk-all.jar \
&& mv pdftk-all.jar pdftk.jar \
&& echo '#!/usr/bin/env bash' > pdftk \
&& echo 'java -jar "$0.jar" "$@"' >> pdftk \
&& chmod 775 pdftk* \
&& mv pdftk* /usr/local/bin \
&& pdftk -version
Open your PDF form in Adobe Acrobat Reader and look at the field options; you need to identify the font (for example Helvetica) and download that font.
Fill the form with the flatten option:
/usr/local/bin/pdftk A=form.pdf fill_form xfdf.xml output out.pdf drop_xfa need_appearances flatten replacement_font /path/to/font/HelveticaRegular.ttf
xfdf.xml example:
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
<field name="Check Box 136">
<value>Your value | Значение (Cyrillic)</value>
</field>
</fields>
</xfdf>
Enjoy :)
pdftk supports encoding in UTF-16BE. It's not that difficult to convert from UTF-8 to UTF-16BE.
See: Weird characters when filling PDF with PDFTk