I have a large PDF print file that contains 5544 pages and is about 36 MB in size. The file was created by MS Word 2010 and contains only text and a logo on each letter/document.
I split it into 5544 single-page files and merge them back into 2770 letters, based on keywords. Each letter is approx. 140-145 KB.
When I merge all the letters into a new PDF print file, still containing 5544 pages, the file size grows to 396 MB.
All text extraction, splitting and merging is done by calling the Apache PDFBox command-line tools from PHP, but the result is the same when run from a console.
Any idea how to reduce the file size of the letters and of the final print file?
It seems as if PDFBox has just appended each letter to the final print file instead of creating a new PDF document.
Merging all the documents into one final print file is only done in the testing phase; some of the documents will be sent by email.
I have also tried SAMBox (a fork of PDFBox), but with nearly the same result:
pdfinfo Original.pdf
Title: Printfile
Author: Claus Hjort Bube
Creator: Microsoft® Word 2010
Producer: Microsoft® Word 2010
CreationDate: Fri May 19 12:16:34 2017 CEST
ModDate: Fri May 19 12:16:34 2017 CEST
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 36092281 bytes
Optimized: no
PDF version: 1.5
pdfinfo PDFBox.pdf
Title: Printfile
Author: Claus Hjort Bube
Creator: Microsoft® Word 2010
Producer: Microsoft® Word 2010
CreationDate: Fri May 19 12:16:34 2017 CEST
ModDate: Fri May 19 12:16:34 2017 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 396622354 bytes
Optimized: no
PDF version: 1.4
pdfinfo SAMBox.pdf
Creator: Sejda Console 3.2.17
Producer: SAMBox 1.1.8 (www.sejda.org)
ModDate: Tue Jul 11 23:34:33 2017 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 5544
Encrypted: no
Page size: 595.32 x 841.92 pts (A4)
Page rot: 0
File size: 378779436 bytes
Optimized: no
PDF version: 1.7
That may sound sad, but it is correct. When splitting, each file gets the resources (e.g. fonts and the company logo graphic) it needs. When the files are merged back, PDFBox does not know that these resources may be identical across the whole document, so they end up duplicated many times over.
The only solution I see for you would be to use the PDFBox Java API to create the mailing files and the final print file in one step, i.e. without creating single files that are then merged back together.
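A minimal sketch of that one-step approach with the PDFBox 2.x Java API, assuming the letter boundaries have already been determined as page ranges (the letterRanges parameter below is a hypothetical stand-in for your keyword logic):

import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class LetterSplitter {

    // letterRanges: hypothetical list of {firstPage, lastPage} pairs (0-based, inclusive)
    // derived from your keyword search.
    public static void writeLetters(File original, List<int[]> letterRanges, File outDir)
            throws IOException {
        try (PDDocument source = PDDocument.load(original)) {
            int letterNo = 0;
            for (int[] range : letterRanges) {
                try (PDDocument letter = new PDDocument()) {
                    for (int p = range[0]; p <= range[1]; p++) {
                        // The imported page still references the resources (fonts, logo
                        // XObject) of the source document, so keep the source open
                        // until the letter has been saved.
                        letter.importPage(source.getPage(p));
                    }
                    letter.save(new File(outDir, String.format("letter-%04d.pdf", ++letterNo)));
                }
            }
            // For the final print file there is no need to merge the letters back:
            // the original already holds all 5544 pages with each font and the logo
            // stored only once, so reuse it (or save an untouched copy of 'source' here).
        }
    }
}

The individual mailing files still each carry their own copy of the fonts and the logo, which is unavoidable for stand-alone PDFs, but the print file never goes through a split-and-merge round trip, so nothing gets duplicated there.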
Related
I am trying to download some PDFs from the CAG website https://cag.gov.in/en/state-accounts-report?defuat_state_id=64. I only need the PDFs for the Monthly Key Indicators, so I am using the following code:
tabID="#tab-360"
for link in soup.select(f"{tabID} a[href$='.pdf']"):
filename=os.path.join(folder_location,link['href'].split('/')[-1])
with open(filename, 'wb') as f:
f.write(requests.get(urljoin(url,link['href'])).content)
This allows me to download the Monthly Key Indicators files, but I need to download only the PDF files from March 2018 to March 2022. How can I download just the March PDFs from 2018 to 2022?
The following code helped me get all the March files:
import requests
from bs4 import BeautifulSoup

tabID = "#tab-360"   # same tab selector as in the question
urllist = []
url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=79'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.select(f"{tabID} a[href$='.pdf']"):
    urllist.append(link)

final_listMah = []
list_year = ['March, 2022', 'March(Pre), 2022', 'March(Pre), 2021', 'March, 2021',
             'March(Pre), 2020', 'March(Pre), 2019', 'April, 2019']
for j in list_year:
    for i in range(len(urllist)):
        if urllist[i].text == j:
            print(urllist[i])
            final_listMah.append(urllist[i])
I'm creating an R Markdown document which outputs a PDF document. My YAML header is the following:
---
title: "Introduction"
author: "John Doe"
date: "August 26, 2018"
mainfont: Pancetta Pro
documentclass: book
output:
  pdf_document:
    number_sections: true
    df_print: kable
    fig_caption: yes
    fig_width: 6
    highlight: tango
    includes:
      in_header: preamble.tex
    latex_engine: xelatex
geometry: headheight=25pt, tmargin=25mm, bmargin=20mm, innermargin=20mm, outermargin=20mm
---
In the preamble.tex file I want to have the following LaTeX commands (which just modify the way the headers are displayed):
\renewcommand{\chaptermark}[1]{\markright{#1}}
\renewcommand{\sectionmark}[1]{}
\renewcommand{\subsectionmark}[1]{}
\usepackage{titlesec}
\titleformat{\chapter}[hang]{\Huge}{\bfseries\thechapter}{0.2pt}{\thicklines\thehook}[\vspace{0.5em}]
However, when these last lines are included in the preamble.tex I get an error when knitting the R Markdown file:
! Argument of \paragraph has an extra }.
<inserted text>
\par
l.1290 \ttl#extract\paragraph
Error: Failed to compile Template.tex
I can't figure out why it won't run. The contents of the preamble.tex file are the following:
% !TeX program = lualatex
\usepackage{relsize} % To make math slightly larger.
\newcommand{\thehook}{%
\hspace{.5em}%
\setlength{\unitlength}{1em}%
\raisebox{-.5em}{\begin{picture}(.4,1.7)
\put(0,0){\line(1,0){.2}}
\put(.2,0){\line(0,1){1.7}}
\put(.2,1.7){\line(1,0){.2}}
\end{picture}}%
\hspace{0.5em}%
} %This creates the "hook" symbol at the beginning of each chapter.
\usepackage{anyfontsize}
\usepackage{fontspec}
\setmainfont{Pancetta Pro}
% We set the font for the chapters:
\newfontfamily\chapterfont{Pancetta Pro}
% And now for the sections:
\newfontfamily\sectionfont{Pancetta Pro}
\usepackage{fancyhdr}
\fancyhead{}
\fancyfoot{}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancyhead[RO]{\large\sffamily\rightmark\thehook\textbf{\thepage}}
\fancyhead[LE]{\large\sffamily\textbf{\thepage}\thehook\rightmark}
\fancypagestyle{plain}{%
\fancyhf{}
}
\pagestyle{fancy}
\renewcommand{\chaptermark}[1]{\markright{#1}}
\renewcommand{\sectionmark}[1]{}
\renewcommand{\subsectionmark}[1]{}
\fontsize{12}{20}\selectfont
\usepackage{titlesec}
\titleformat{\chapter}[hang]{\Huge}{\bfseries\thechapter}{0.2pt}{\thicklines\thehook}[\vspace{0.5em}]
When the last six lines of the previous code are excluded, there is no error and the PDF is created.
If you want to use titlesec together with rmarkdown, you have to add
subparagraph: yes
at the top level of your YAML header (cf. several other answers).
The default LaTeX class used by rmarkdown is article, which has no chapters. You should add
documentclass: report
or
documentclass: book
to your YAML header.
I use ghostscript to optimize pdf files (mostly with respect to size), for which it does a great job. The command that I use is:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-dCompatibilityLevel=1.4 -sOutputFile=out.pdf in.pdf
However, it seems that this replaces fonts (or subsets them) and does not preserve their names; it replaces them with CairoFont. How can I get Ghostscript to preserve the font names?
Example:
A simple PDF file (created with Inkscape) with a single text element in it (Nimbus Roman) serves as the input (in.pdf), for which pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
PMLNBT+NimbusRomanNo9L Type 1 yes yes yes 5 0
However, after running Ghostscript over the file, pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
OEPSCM+CairoFont-0-0 Type 1C yes yes no 8 0
So, is there a way to have ghostscript (or libcairo?) preserve the name of the font?
The input file is uploaded here.
Ghostscript doesn't change the font name, but there are, in fact, several different font 'names' in a PDF file.
In the case of your file the PDF FontDescriptor object has a name
<<
/Type /FontDescriptor
/FontName /PMLNBT+NimbusRomanNo9L
/Flags 4
/FontBBox [ -168 -281 1031 924 ]
/ItalicAngle 0
/Ascent 924
/Descent -281
/CapHeight 924
/StemV 80
/StemH 80
/FontFile 7 0 R
>>
which refers to a FontFile stream
/FontFile 7 0 R
That stream contains the following:
%!PS-AdobeFont-1.0: NimbusRomNo9L-Regu 1.06
%%Title: NimbusRomNo9L-Regu
%Version: 1.06
%%CreationDate: Thu Aug 2 13:14:49 2007
%%Creator: frob
%Copyright: Copyright (URW)++,Copyright 1999 by (URW)++ Design &
%Copyright: Development; Cyrillic glyphs added by Valek Filippov (C)
%Copyright: 2001-2005
% Generated by FontForge 20070723 (http://fontforge.sf.net/)
%%EndComments
FontDirectory/NimbusRomNo9L-Regu known{/NimbusRomNo9L-Regu findfont dup/UniqueID known pop false {dup
/UniqueID get 5020931 eq exch/FontType get 1 eq and}{pop false}ifelse
{save true}{false}ifelse}{false}ifelse
11 dict begin
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0 ]readonly def
/FontName /CairoFont-0-0 def
Do you see the FontName in the actual font? It's called CairoFont-0-0.
This brings me back to a point which I reiterate frequently here and elsewhere: when you process a PDF file with Ghostscript and emit a new PDF file using the pdfwrite device, you are not 'optimising', 'converting', 'subsetting' or in any general sense manipulating the content of the original PDF file.
What Ghostscript does is interpret the PDF file; this produces a set of marking operations (such as 'stroke', 'fill', 'image', etc.) which it sends to the selected Ghostscript device. Most Ghostscript devices will then use the graphics library to render the operations to a bitmap and, when the page is complete, write the bitmap to a file. The 'high level' or 'vector' devices instead repackage the operations into another page description language. In the case of pdfwrite, that's a PDF file.
What this means in practice is that the emitted PDF file has nothing (apart from its appearance) in common with the original PDF file. In particular, the description of the objects may be different.
So in your case, the pdfwrite device doesn't know what the font was called in the original PDF object. It does know that the font that was defined was called CairoFont-0-0, so that's what it calls the font when it emits it.
Frankly, this is another piss-poor example from Cairo, to go along with it defining each page as containing transparency whether it does or not: the FontName in the Font object is supposed to be the same as the name in the FontFile stream.
It's pretty clear that the FontName has been altered, given the rest of the boilerplate there.
Hi everybody, sorry for my English level, but I'm not English/American.
My question is the following: I am trying to use the example code that was posted on this site (How to get font color using pdfbox). In that example, the author says that the code was tested, but when I tried it, it shows me this error:
jul 17, 2013 1:05:28 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
jul 17, 2013 1:05:29 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
DeviceGray
org.apache.pdfbox.pdmodel.graphics.color.PDColorState#481958
0.0
The PDF I was extracting from contains 3 letters (R, G, B), which are painted as follows:
R: painted in red color
G: painted in green color
B: painted in black color
Can somebody explain to me why this error occurs, or tell me how I can extract the text color from a PDF?
Thanks in advance for any comments.
Those log outputs are of level INFO only, not error:
jul 17, 2013 1:05:28 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: BDC
jul 17, 2013 1:05:29 PM org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation: EMC
All they say is that certain operators (BDC and EMC) were encountered in the page content for which no processor was registered. But as those operators are only of interest for analyzing marked content, they can be ignored for your task.
Thereafter you have output from the code you referred to:
DeviceGray
org.apache.pdfbox.pdmodel.graphics.color.PDColorState#481958
0.0
At least the first and the last line match that code: a color in the DeviceGray color space with gray value 0 was encountered, most likely your black B. (Could it be that you have added an additional output in between, e.g. of graphicState.getStrokingColor()?)
Thus, no error, all working fine.
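For reference, here is a sketch of the same idea against the later PDFBox 2.x API (the question and the linked answer use the 1.x PDFStreamEngine/PDColorState classes, which were replaced in 2.x); it registers the colour operators and prints the non-stroking colour of every glyph. The file name rgb.pdf is a placeholder for your three-letter test PDF:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PrintTextColors extends PDFTextStripper {

    public PrintTextColors() throws IOException {
        // Register the colour operators so the graphics state tracks fill and stroke colours.
        addOperator(new SetStrokingColorSpace());
        addOperator(new SetNonStrokingColorSpace());
        addOperator(new SetStrokingDeviceGrayColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetStrokingColor());
        addOperator(new SetNonStrokingColor());
        addOperator(new SetStrokingColorN());
        addOperator(new SetNonStrokingColorN());
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // The non-stroking (fill) colour is what normally paints the glyphs.
        PDColor color = getGraphicsState().getNonStrokingColor();
        System.out.println(text.getUnicode() + " -> " + color);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File("rgb.pdf"))) {
            PDFTextStripper stripper = new PrintTextColors();
            stripper.getText(doc);
        }
    }
}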
While trying to convert a multipage document from a tiff to a pdf, I encountered the following problem:
↪ tiff2pdf 0271.f1.tiff -o 0271.f1.pdf
tiff2pdf: No support for 0271.f1.tiff with no photometric interpretation tag.
tiff2pdf: An error occurred creating output PDF file.
Does anybody know what causes this and how to fix it?
This happens because one or more of the pages in the multi-page TIFF does not have the photometric interpretation tag set. This is a required tag, so your TIFFs are technically invalid (though I bet they work fine anyway).
To fix this, you must identify the page (or pages) that does not have the photometric interpretation set and fix it.
To identify the page, you can simply run something like:
↪ tiffinfo your-file.tiff
This will spit out the info for every page of your tiff. For each good page, you'll see something like:
TIFF Directory at offset 0x105c0 (67008)
Subfile Type: (0 = 0x0)
Image Width: 1760 Image Length: 2639
Resolution: 300, 300 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 4
**Photometric Interpretation: min-is-white**
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 2639
Planar Configuration: single image plane
Software: ScanFix(TM) Enhanced ImageGear Version: 11.00.024
DateTime: Mon Oct 31 15:11:07 2005
Artist: 1996-2001 AccuSoft Co., All rights reserved
If you have a bad page, it'll lack the photometric interpretation section, and you can fix it with:
↪ tiffset -d $page-number -s 262 0 your-file.tiff
Note that the value of zero is the default for the photometric interpretation key, which is 262. You can see the other values for this key at the link above.
If your tiff has a lot of pages (like mine does), you may not be able to easily identify the bad page by eye. In that case, you can take a brute force approach, setting the photometric interpretation for all pages to the default value.
# First, split the tiff into many one-page files
↪ tiffsplit your-file.tiff
# Then, set the photometric interpretation to the default for all pages
↪ find . -name '*.tiff' -exec tiffset -s 262 0 '{}' \;
# Then rejoin the pages
↪ tiffcp *.tiff -o out-file.tiff
A lot of grunt work, but it gets the job done.