PDFTK Output Same Size as Input Regardless of Cat'd Page Count

I'm running into a weird situation with a particular group of PDFs and am not sure where to start. If I burst a 25 MB, 600-page file, each burst file comes out at 25 MB. If I run pdftk input.pdf cat 1-100 output out.pdf, the size is also about 25 MB (25292 KB vs 25524 KB for the original). A page range of 1-5 results in a file size of 25040 KB.
Is there a flag that I can add to pdftk to handle this situation? Ghostscript can take a page range from this PDF and make an appropriately sized PDF, but gs doesn't seem to handle bursting as well, and it requires every font to be installed.

You are probably making the following assumption about PDF: if you have a PDF with a file size of 3000 KB and 10 pages, then splitting this PDF will result in 10 files of 300 KB each.
This assumption is wrong. Imagine a 3000 KB document with ten pages and the following objects:
- four font subsets used on every page, each about 50 KB
- ten images that figure on a single page, each about 200 KB (one image per page)
- four images that figure on every page, each about 50 KB
- ten pages with content streams of about 25 KB each
- about 350 KB for objects such as the catalog, the info dictionary, the page tree, the cross-reference table, etc.
A single page will need at least:
- the four font subsets: 4 times 50 KB
- the single image: 1 time 200 KB
- the four images: 4 times 50 KB
- a single content stream: 1 time 25 KB
- a slightly reduced cross-reference table, a slightly reduced page tree, an almost identical catalog, an info dictionary of identical size,... 200 KB
Together that's 825 KB. This means that you end up with 8250 KB (10 times 825 KB) if you split up a 10-page 3000 KB PDF document into 10 separate pages.
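The arithmetic can be checked with a short script. This is only a sketch of the example inventory above: the sizes are the illustrative figures from the list (with content streams at 25 KB each), and the constant and function names are made up for this illustration.

```python
# Sizes in KB, taken from the illustrative inventory above (not measured data).
FONT_SUBSETS = 4 * 50      # shared: needed by every page
SHARED_IMAGES = 4 * 50     # shared: needed by every page
PAGE_IMAGE = 200           # one image per page
CONTENT_STREAM = 25        # one content stream per page
ORIGINAL_OVERHEAD = 350    # catalog, info dictionary, page tree, xref table, ...
SPLIT_OVERHEAD = 200       # the same structures in a one-page file
PAGES = 10


def original_size():
    """Size of the unsplit document: shared objects are stored once."""
    per_page = PAGE_IMAGE + CONTENT_STREAM
    return FONT_SUBSETS + SHARED_IMAGES + PAGES * per_page + ORIGINAL_OVERHEAD


def single_page_size():
    """Size of one split-off page: shared objects are copied into every file."""
    return FONT_SUBSETS + SHARED_IMAGES + PAGE_IMAGE + CONTENT_STREAM + SPLIT_OVERHEAD


if __name__ == "__main__":
    print("original:", original_size(), "KB")
    print("one page:", single_page_size(), "KB")
    print("all split files:", PAGES * single_page_size(), "KB")
```

The larger the shared part is relative to the per-page part, the closer each split file gets to the size of the original.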
My guess is that the shared resources (resources that are used in every page, e.g. fonts) are huge in your PDF. E.g. if someone used a high-resolution image as the background of each page that takes about 25M, then each of your 600 pages will need those 25M.
Note that PdfTk is nothing more than a wrapper around an obsolete version of iText. You may want to try a more recent version of iText to find out if the problem persists.


Multithreading pdf workload slows the program down

I tried building a program that scans PDF files to find certain elements and then outputs a new PDF file with all the pages that contain the elements. It was originally single-threaded and a bit slow to run. It took about 36 seconds on a 6-Core-5600X. So I tried multiprocessing it with concurrent.futures:
    import PyPDF2

    def process_pdf(filename):
        # Open the PDF file
        f = open(filename, "rb")
        print("Searching: " + filename)
        # Create a PDF object
        pdf = PyPDF2.PdfReader(f)
        # Extract the text from each page in the PDF
        extracted_text = [page.extract_text() for page in pdf.pages]
        # Initialize a list
        matching_pages = []
        # Iterate through the extracted text
        for j, text in enumerate(extracted_text):
            # Search for the symbol in the text (symbol is defined elsewhere)
            if symbol in text:
                # If the symbol is found, get a new PageObject instance for the page
                page = pdf.pages[j]
                # Add the page to the list
                matching_pages.append(page)
        return matching_pages
Multiprocessing Block:
    import concurrent.futures
    import os

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Get a list of the file paths for all PDF files in the directory
        file_paths = [
            os.path.join(directory, filename)
            for filename in os.listdir(directory)
            if filename.endswith(".pdf")
        ]
        futures = [executor.submit(process_pdf, file_path) for file_path in file_paths]
        # Initialize a list to store the results
        results = []
        # Iterate through the futures as they complete
        for future in concurrent.futures.as_completed(futures):
            # Get the result of the completed future
            result = future.result()
            # Add the matching pages to the list
            results.extend(result)
    # Add each page to the new PDF file
    for page in results:
        # output_pdf is assumed to be a PyPDF2.PdfWriter created elsewhere
        output_pdf.add_page(page)
The multiprocessing works, as evident from the printed text, but it somehow doesn't scale at all. 1 thread ~ 35 seconds, 12 Threads ~ 38 seconds.
Why? Where's my bottleneck?
Tried using other libraries, but most were broken or slower.
Tried using re instead of in to search for the symbol, no improvement.
In general, PDF processing consumes much of a device's primary resources. Each device and system will be different, so by way of illustration here is one simple PDF task.
Search for text using poppler's pdftotext, a component commonly used by Python libraries (others may be faster, but the aim here is to show the usual attempts and issues).
As a ballpark yardstick, in 1 minute I scanned roughly 2500 pages of one file for the word "AMENDMENTS" and found 900 occurrences, such as:
page 82
page 83
page 84
page 85
To scan the whole file (13,234 pages) would take about 5 minutes 20 seconds,
and I know from testing that 4 CPUs could process 1/4 of that file (3,308 pages from 13,234) in under 1 minute (there is a gain from using smaller files).
So a 4-core device should do, say, 3000 pages per core in a short time. Well, let's see.
If I thread that as 3 x 1000 pages, one finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds.
Overall there is little or just a slight gain, because the one application spends one third of its time in each thread.
Overall about 3000 pages per minute.
Let's try being clever and use 3 applications on the one file. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds. No gain using 3 applications.
Overall about 3000 pages per minute.
Let's try being more clever and use 3 applications on 3 similar but different files. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds, and a third after about 70 seconds. No gain using 3 applications with 3 tasks.
Overall about 3000 pages per minute.
Whichever way I try, the resources on this device are constrained to about 3000 pages per minute overall.
I may just as well let 1 thread run unfettered.
So the basic answer is to use multiple devices, the same way graphics render farms do.
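To repeat this kind of comparison on your own machine, you can time the same batch of work with different worker counts and compare throughput. The following is a minimal sketch, not the measurement setup used above: measure_throughput and fake_page_scan are hypothetical names, and fake_page_scan is a pure-Python stand-in for scanning one page.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(task, items, workers):
    """Run task over items on `workers` threads and return items per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(task, items))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed if elapsed > 0 else float("inf")


def fake_page_scan(page_number):
    # Hypothetical stand-in for scanning one page: a bit of CPU-bound work.
    return sum(i * i for i in range(20_000)) + page_number


if __name__ == "__main__":
    pages = list(range(200))
    for workers in (1, 3):
        rate = measure_throughput(fake_page_scan, pages, workers)
        print(f"{workers} worker(s): {rate:.0f} pages/s")
```

If the rate barely changes between 1 and 3 workers, adding threads is not relieving whatever resource is the limit on that device.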

What does the log output from openstreetmap-tiles-update-expire/render_expired mean?

I am using the script from https://github.com/openstreetmap/mod_tile/blob/master/utils/openstreetmap-tiles-update-expire which is running inside the docker container of https://github.com/Overv/openstreetmap-tile-server.
I modified the script to always render expired tiles.
I changed the following line to:
render_expired --map=default --num-threads=8 --min-zoom=$EXPIRY_MINZOOM --max-zoom=$EXPIRY_MAXZOOM -s /run/renderd.sock <"$EXPIRY_FILE.$$" 2>&1 | tail -8 >>"$EXPIRYLOG";
Now I am confused by the long rendering times and the logs generated by this command.
The logs say "Wrote 2616400 entries to expired tiles list", and then it starts rendering with render_expired. After 11 hours of rendering it says:
Total for all tiles rendered
Meta tiles rendered: Rendered 225518 tiles in 46397.42 seconds (4.86 tiles/s)
Total tiles rendered: Rendered 14433152 tiles in 46397.42 seconds (311.08 tiles/s)
Total tiles in input: 2616400
Total tiles expanded from input: 248012
Total meta tiles deleted: 0
Total meta tiles touched: 0
Total tiles ignored (not on disk): 22494
Can someone explain these logs to me and tell me where the "Total tiles rendered" figure of 14433152 comes from? I guess this is the reason for the long rendering times.
My import was a merge of 4x .osm.pbf files and their .poly files (Germany, Switzerland, Austria, Liechtenstein).
I prerendered all tiles with render_list_geo.pl up to Z19 before starting the update script.
Thank you very much.
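The figures in the log are at least consistent with each other, which can be checked with a few lines of arithmetic. This sketch assumes that a meta tile is a block of 8 x 8 = 64 tiles (the usual mod_tile setting, which is an assumption about this particular setup); the variable names are made up for illustration.

```python
# Figures copied from the log above.
meta_tiles_rendered = 225518
total_tiles_rendered = 14433152
seconds = 46397.42
expanded_from_input = 248012
ignored_not_on_disk = 22494

# Assumption: one meta tile covers 8 x 8 tiles.
TILES_PER_META_TILE = 8 * 8

# Every rendered meta tile is counted as 64 tiles in the "Total tiles" line.
tiles_from_meta = meta_tiles_rendered * TILES_PER_META_TILE

# Meta tiles expanded from the input, minus those ignored, were rendered.
meta_tiles_expected = expanded_from_input - ignored_not_on_disk

meta_rate = round(meta_tiles_rendered / seconds, 2)
tile_rate = round(total_tiles_rendered / seconds, 2)

if __name__ == "__main__":
    print(tiles_from_meta == total_tiles_rendered)
    print(meta_tiles_expected == meta_tiles_rendered)
    print(meta_rate, tile_rate)
```

Under that assumption, the 14433152 figure is the 225518 rendered meta tiles multiplied by 64, not a count of separately requested tiles.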

required size of a configuration file for a HX1K (in "SPI slave" mode)

I am reworking the programmer for the Olimex iCE40HX1K board (targeted towards a STM32F103 ma), where I would also like to implement the "SPI Slave" mode to load an image directly into RAM without using the serial flash.
Looking at the Lattice "Programming and Configuration Guide" (page 11), table 8 notes that an EPROM for an iCE40-LP/HX1K must be at least 34112 bytes (which, I guess, means that configuration files can be up to that size).
However, all images I have created so far with the icestorm tools are 32220 octets.
I am a bit puzzled here.
Can somebody explain the difference between these two figures?
Does the HX1K need a configuration-file of 32220 or 34112 bytes?
I don't know how Lattice arrived at this number. A complete HX1K bin file with BRAM initialization but without comment and without multiboot header is 32220 bytes in size. The (optional) multiboot header would add another 160 bytes (32220 + 160 = 32380). The lattice tools usually add about 80 bytes to the comment field (32220 + 80 = 32300). Whatever I do, all numbers I have are more than 1000 short of 34112.
I don't know if there is a maximum length for the comment. Maybe there is and 34112 is the size of a bit stream with a comment of maximum length?
34112 - 32220 = 1892. Maybe someone decided to add 8kB (8192 bytes) just in case, but that person accidentally swapped the first two digits? Idk..
If you don't care about comments or multiboot headers, then iCE40 1K bit-streams have a fixed size, and that size is 32220 bytes.
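The sizes mentioned in this answer can be tabulated with a few lines of Python. This is just a sketch of the arithmetic: the constants are the figures quoted above, and is_bare_hx1k_image is a hypothetical helper that a programmer tool might use as a sanity check before sending an image over SPI.

```python
# All sizes in bytes, as quoted in the answer above.
BASE_SIZE = 32220          # HX1K bin file, no comment, no multiboot header
MULTIBOOT_HEADER = 160     # optional
TYPICAL_COMMENT = 80       # roughly what the Lattice tools add
GUIDE_MINIMUM = 34112      # figure from table 8 of the Lattice guide

with_multiboot = BASE_SIZE + MULTIBOOT_HEADER
with_comment = BASE_SIZE + TYPICAL_COMMENT
unexplained_gap = GUIDE_MINIMUM - BASE_SIZE


def is_bare_hx1k_image(data: bytes) -> bool:
    """True if data has the size of a bare HX1K bitstream (hypothetical check)."""
    return len(data) == BASE_SIZE


if __name__ == "__main__":
    print(with_multiboot, with_comment, unexplained_gap)
```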

Setting the photometric interpretation tag for a multi-page tiff

While trying to convert a multipage document from a tiff to a pdf, I encountered the following problem:
↪ tiff2pdf 0271.f1.tiff -o 0271.f1.pdf
tiff2pdf: No support for 0271.f1.tiff with no photometric interpretation tag.
tiff2pdf: An error occurred creating output PDF file.
Does anybody know what causes this and how to fix it?
This is caused because one or more of the pages in the multi-page tiff does not have the photometric interpretation tag set. This is a required tag, so that means your tiffs are technically invalid (though I bet they work fine anyway).
To fix this, you must identify the page (or pages) that does not have the photometric interpretation set and fix it.
To identify the page, you can simply run something like:
↪ tiffinfo your-file.tiff
This will spit out the info for every page of your tiff. For each good page, you'll see something like:
TIFF Directory at offset 0x105c0 (67008)
Subfile Type: (0 = 0x0)
Image Width: 1760 Image Length: 2639
Resolution: 300, 300 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 4
**Photometric Interpretation: min-is-white**
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 2639
Planar Configuration: single image plane
Software: ScanFix(TM) Enhanced ImageGear Version: 11.00.024
DateTime: Mon Oct 31 15:11:07 2005
Artist: 1996-2001 AccuSoft Co., All rights reserved
If you have a bad page, it'll lack the photometric interpretation section, and you can fix it with:
↪ tiffset -d $page-number -s 262 0 your-file.tiff
Note that 262 is the tag number for photometric interpretation and 0 is the value for min-is-white, used here as the default. You can see the other values for this key at the link above.
If your tiff has a lot of pages (like mine does), you may not be able to easily identify the bad page by eye. In that case, you can take a brute force approach, setting the photometric interpretation for all pages to the default value.
# First, split the tiff into many one-page files (named page-aaa.tif, page-aab.tif, ...)
↪ tiffsplit your-file.tiff page-
# Then, set the photometric interpretation to the default for all pages
↪ find . -name 'page-*.tif' -exec tiffset -s 262 0 '{}' \;
# Then rejoin the pages (tiffcp takes the output file as its last argument)
↪ tiffcp page-*.tif out-file.tiff
Lot of dummy work, but gets the job done.
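As an alternative to fixing every page, the bad pages can be located by scanning the tiffinfo output for directories that lack the tag. This is a minimal sketch that assumes the output looks like the sample above, with each page starting at a "TIFF Directory at offset" line; pages_missing_photometric is a hypothetical helper, and whether its 0-based indices match what tiffset -d expects is also an assumption you should verify.

```python
def pages_missing_photometric(tiffinfo_output):
    """Return 0-based indices of directories without a photometric tag line."""
    missing = []
    index = -1
    has_tag = True
    for line in tiffinfo_output.splitlines():
        if "TIFF Directory at offset" in line:
            # A new directory starts: record the previous one if it lacked the tag.
            if index >= 0 and not has_tag:
                missing.append(index)
            index += 1
            has_tag = False
        elif "Photometric Interpretation:" in line:
            has_tag = True
    # Do not forget the last directory in the output.
    if index >= 0 and not has_tag:
        missing.append(index)
    return missing


SAMPLE = """TIFF Directory at offset 0x105c0 (67008)
  Bits/Sample: 1
  Photometric Interpretation: min-is-white
TIFF Directory at offset 0x20b80 (134016)
  Bits/Sample: 1
TIFF Directory at offset 0x31140 (201024)
  Photometric Interpretation: min-is-white
"""

if __name__ == "__main__":
    print(pages_missing_photometric(SAMPLE))
```

The text to feed in can be captured with, for example, subprocess.run(["tiffinfo", "your-file.tiff"], capture_output=True, text=True).stdout.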

What is the maximum number of pages that Apache FOP can generate?

Hi, I am working with Apache FOP, and when the number of pages exceeds about 130, it cannot generate the PDF.
Is there any limit on the number of pages or the length of the XML file?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap
at java.io.BufferedReader.<init>(BufferedReader.java:80)
at java.io.BufferedReader.<init>(BufferedReader.java:91)
at org.apache.xml.dtm.ObjectFactory.findJarServiceProviderName(ObjectFac
at org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName(ObjectFactory
at org.apache.xml.dtm.ObjectFactory.lookUpFactoryClass(ObjectFactory.jav
at org.apache.xml.dtm.ObjectFactory.createObject(ObjectFactory.java:131)
at org.apache.xml.dtm.ObjectFactory.createObject(ObjectFactory.java:101)
at org.apache.xml.dtm.DTMManager.newInstance(DTMManager.java:135)
at org.apache.xpath.XPathContext.reset(XPathContext.java:350)
at org.apache.xalan.transformer.TransformerImpl.reset(TransformerImpl.ja
at org.apache.xalan.transformer.TransformerImpl.transformNode(Transforme
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImp
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImp
at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImp
at org.apache.fop.cli.InputHandler.transformTo(InputHandler.java:214)
at org.apache.fop.cli.InputHandler.renderTo(InputHandler.java:125)
at org.apache.fop.cli.Main.startFOP(Main.java:166)
at org.apache.fop.cli.Main.main(Main.java:197)
I've created reports from XML files that were several hundred thousand lines long. However, I have had some issues creating smaller reports filled with SVGs.
Your issue is probably that Java by default only allocates 32 MB of memory (if I recall correctly), so it's running out of memory.
In the fop.bat file (assuming you're running on Windows), add the following setting
rem Increase standard Java VM heap size, so that bigger reports get enough memory
set JAVAOPTS=-Xmx512M
and alter the execution line as follows
This should work with 0.95 at least