pdf2htmlEX: The HTML contains images; how can I get graphics as output instead of images?

I have tried every option found in the documentation; how can I get only the text as output, and not the images at all?
https://github.com/coolwanglu/pdf2htmlEX/wiki/Command-Line-Options.

I'm not sure what you are trying to achieve, as the question subject and details appear contradictory, but there are options to split the graphics and text out into separate files:
--embed <string>
--embed-css <0|1> (Default: 1)
--embed-font <0|1> (Default: 1)
--embed-image <0|1> (Default: 1)
--embed-javascript <0|1> (Default: 1)
--embed-outline <0|1> (Default: 1)
Specify which elements should be embedded into the output HTML
file.
If switched off, separated files will be generated along with
the HTML file for the corresponding elements.
--embed accepts a string as argument. Each letter of the string
must be one of `cCfFiIjJoO`, which corresponds to one of the
--embed-*** switches. Lower case letters for 0 and upper case
letters for 1. For example, `--embed cFIJo` means to embed
everything but CSS files and outlines.
--split-pages <0|1> (Default: 0)
If turned on, the content of each page is stored in a separated
file.
This switch is useful if you want pages to be loaded separately
& dynamically -- a supporting server might be necessary.
Also see --page-filename.
So if you use the --split-pages 1 and --embed-image 0 options, then you have one HTML page per PDF page, which does not include embedded images.
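For example, a command along these lines (input.pdf is a placeholder name) keeps the images out of the generated HTML and writes one HTML file per page:
pdf2htmlEX --split-pages 1 --embed-image 0 input.pdf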
If this isn't what you want then please include additional information in your question.

Related

Lucene query result is not correct when running official demo

I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs ., and the console output shows that pom.xml, *.java and *.class files are added to the index.
Then I tried SearchFiles with the arguments -index . -query "lucene AND main", and the console prints only IndexFiles.class, SearchFiles.class and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).
Your search results are correct (for the .java files, at least).
The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer.
The StandardTokenizer splits input text into tokens using the word-boundary rules described in Unicode Standard Annex #29 (Unicode Text Segmentation). For example, when you have text such as the following in the source files
org.apache.lucene.analysis.Analyzer
it is tokenized as a single token: there are no word boundaries within it.
Looking in the IndexFiles.java source file, there is the following text:
demonstrating simple Lucene indexing
This is tokenized into 4 separate tokens.
But in the SearchFiles.java source file, the text "lucene" only ever appears in text such as org.apache.lucene.analysis.Analyzer - and therefore the single token lucene is never created.
Your query therefore does not find any hits in the SearchFiles.java document, because the query matches exact tokens. Both source files contain the word "main", but only IndexFiles.java contains the token "lucene".
For the .class files, because these are compiled bytecode files, I would say they should not be indexed in the first place. Lucene works with text files, not binary files. Yes, the class files will contain fragments of text, but they will also typically contain unprintable control characters, which are not suitable to be indexed. I think indexing results could be unpredictable because of this.
You can explore the indexed data using Luke, which is bundled in the binary releases of Lucene.

The separator between keywords in PDF metadata

I cannot find any "official" documentation on whether the keywords and keyword phrases in the metadata of a PDF file should be separated by a comma or by a comma followed by a space.
The following example demonstrates the difference:
keyword,keyword phrase,another keyword phrase
keyword, keyword phrase, another keyword phrase
Any high-quality references?
The online sources I found are of low quality.
For example, an Adobe Press web page says "keywords must be separated by commas or semicolons", yet in its example we see a semicolon followed by a space before the first keyword and between each pair of neighbouring keywords, and the example contains no keyword phrases.
The keywords metadata field is a single text field - not a list. You can choose whatever is visually pleasing to you. The search engine which operates on the keyword data may have other preferences, but I would imagine that either comma or semicolon would work with most modern search engines.
Reference: PDF 32000-1:2008, page 550 (available from Adobe and from The Internet Archive).
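In the PDF document information dictionary the keywords are just one string value, for example (illustrative values):
/Keywords (keyword, keyword phrase, another keyword phrase)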
ExifTool, for example, parses for comma-separated values, but if it does not find a comma it will split on spaces:
# separate tokens in comma or whitespace delimited lists
my @values = ($val =~ /,/) ? split /,+\s*/, $val : split ' ', $val;
foreach $val (@values) {
    $et->FoundTag($tagInfo, $val);
}
I don't have a "high-quality reference", but when I generate a PDF using LaTeX I do it the following way:
I add the following line to my main.tex:
\usepackage[a-1b]{pdfx}
Then I write a file main.xmpdata and add these lines:
\Title{My Title}
\Author{My Name}
\Copyright{Copyright \copyright\ 2018 "My Name"}
\Keywords{KeywordA\sep
KeywordB\sep
KeywordC}
\Subject{My Short Description}
After generating the PDF with pdflatex, I used a Python script based on pdfminer.six to extract the metadata:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fp = open('main.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
if 'Metadata' in doc.catalog:
    metadata = resolve1(doc.catalog['Metadata']).get_data()
    print(metadata)  # The raw XMP metadata
The part with the keywords then looks like this:
...<rdf:Bag><rdf:li>KeywordA</rdf:li>\n <rdf:li>KeywordB...
Looking at the properties of main.pdf with Adobe Acrobat Reader DC, I find the following entry in the Keywords section:
;KeywordA;KeywordB;KeywordC
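If you also want to read the classic document-information value programmatically, pdfminer exposes it too; a small sketch building on the script above (assuming the file has an Info dictionary):
# doc.info is pdfminer's list of document-information dictionaries;
# /Keywords is a single string, here the semicolon-separated value
if doc.info:
    print(doc.info[0].get('Keywords'))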
CommonLook claim to be "a global leader in electronic document accessibility, providing software products and professional services enabling faster, more cost-efficient, and more reliable processes for achieving compliance with the leading PDF and document accessibility standards, including WCAG, PDF/UA and Section 508."
They provide the following advice on PDF metadata:
Pro Tip: When you’re entering Keywords into the metadata, separate
them with semicolons as opposed to commas.
although they give no further reasoning as to why this is the preferred choice.

80-characters / right margin line in Sublime Text 3

You can have an 80-character right-margin line in NetBeans, TextMate and probably many, many other IDEs. Is it possible to have it in Sublime Text 3 as well? Any option, plugin, etc.?
Yes, it is possible in Sublime Text 2, ST3, and ST4 (which you should really upgrade to if you haven't already). Select View → Ruler → 80 (there are several other options there as well). If you like to actually wrap your text at 80 columns, select View → Word Wrap Column → 80. Make sure that View → Word Wrap is selected.
To make your selections permanent (the default for all opened files or views), open Preferences → Settings and use any of the following rules in the right-side pane:
{
// set vertical rulers in specified columns.
// Use "rulers": [80] for just one ruler
// default value is []
"rulers": [80, 100, 120],
// turn on word wrap for source and text
// default value is "auto", which means off for source and on for text
"word_wrap": true,
// set word wrapping at this column
// default value is 0, meaning wrapping occurs at window width
"wrap_width": 80
}
These settings can also be used in a .sublime-project file to set defaults on a per-project basis, or in a syntax-specific .sublime-settings file if you only want them to apply to files written in a certain language (Python.sublime-settings vs. JavaScript.sublime-settings, for example). Access these settings files by opening a file with the desired syntax, then selecting Preferences → Settings—Syntax Specific.
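For instance, a hypothetical Python.sublime-settings file could contain just:
{
    // applies only to files using the Python syntax
    "rulers": [79],
    "wrap_width": 79
}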
As always, if you have multiple entries in your settings file, separate them with commas, except after the last one. The entire content should be enclosed in curly braces { }. Basically, make sure it's valid JSON.
If you'd like a key combo to automatically set the ruler at 80 for a particular view/file, or you are interested in learning how to set the value without using the mouse, please see my answer here.
Finally, as mentioned in another answer, you really should be using a monospace font in order for your code to line up correctly. Other types of fonts have variable-width letters, which means one 80-character line may not appear to be the same length as another 80-character line with different content, and your indentations will look all messed up. Sublime has monospace fonts set by default, but you can of course choose any one you want. Personally, I really like Liberation Mono. It has glyphs to support many different languages and Unicode characters, looks good at a variety of different sizes, and (most importantly for a programming font) clearly differentiates between 0 and O (digit zero and capital letter oh) and 1 and l (digit one and lowercase letter ell), which not all monospace fonts do, unfortunately. Version 2.0 and later of the font are licensed under the open-source SIL Open Font License 1.1 (here is the FAQ).
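If you want to set the font explicitly, it goes in the same settings file; a minimal sketch (the face name is just an example):
{
    // any installed monospace font will do
    "font_face": "Liberation Mono",
    "font_size": 11
}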
For this to work, your font also needs to be monospace; otherwise lines can't line up perfectly.
This answer is detailed at sublime text forum:
http://www.sublimetext.com/forum/viewtopic.php?f=3&p=42052
That answer has links for choosing an appropriate font for your OS, and addresses an edge case of fonts not lining up.
Another website that lists great monospaced free fonts for programmers:
http://hivelogic.com/articles/top-10-programming-fonts
On stackoverflow, see:
Michael Ruth's answer here:
How to make ruler always be shown in Sublime text 2?
MattDMo's answer here:
What is the default font of Sublime Text?
I have rulers set at the following:
30
50 (git commit message titles should be limited to 50 characters)
72 (git commit message details should be limited to 72 characters)
80 (Windows Command Console Window maxes out at 80 character width)
Other viewing environments that benefit from shorter lines:
GitHub: there is no word wrap when viewing a file online, so I try to keep .js, .md and other files at 70-80 characters.
Windows Console: 80 characters.

Extract text from PDF with respect to formatting (font size, type, etc.)

Is it possible to extract text from a PDF file according to specific font/font size/font colour etc.? I prefer Perl, Python or *nix command-line utilities. My goal is to extract all headlines from a PDF file so I will have a nice index of the articles contained in a single PDF.
Text and font/font size/position (but no colour, as far as I checked) can be obtained from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from MuPDF's mudraw with the -tt option. Then parse the XML-like output with e.g. Perl.
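For example (file names are placeholders, and the exact MuPDF invocation varies between versions):
# Ghostscript's txtwrite device; -dTextFormat=0 or 1 keeps font/position info
gs -sDEVICE=txtwrite -dTextFormat=1 -o out.txt input.pdf
# older MuPDF releases expose the XML-like styled-text output via mudraw -tt
mudraw -tt -o out.xml input.pdf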
I have working code which extracts text from a PDF together with the font size. I achieved this with the help of pdfminer, tested against many PDFs:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import os

path = r'path\to\your\pdf\folder'  # placeholder: folder containing the PDF files
os.chdir(path)

Extract_Data = []
for PDF_file in os.listdir():
    if PDF_file.endswith('.pdf'):
        for page_layout in extract_pages(PDF_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for text_line in element:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                Font_size = character.size
                                Extract_Data.append([Font_size, element.get_text()])
I have used fitz to accomplish the required task, as it is much faster compared to pdfminer. You can find my duplicate answer to a similar question here.
An example code snippet is shown below.
import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
        blocks = page_dict["blocks"]
        for block in blocks:
            if "lines" in block.keys():
                for line in block["lines"]:
                    for span in line["spans"]:
                        if keyword in span["text"].lower():  # only store font information of a specific keyword
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    pdf.close()
    return results
If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by understanding the structure of dictionary outputs that we obtain by using get_text("dict"), as mentioned in the documentation.
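A quick usage sketch (the keyword and file name are placeholders):
# collect every span containing the word "introduction" and print its font details
hits = scrape("introduction", "example.pdf")
for text, size, font in hits:
    print(size, font, text)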

Output filenames when extracting a range of pages from pdf into jpeg using Imagemagick

I am trying to extract a range of pages from a multipage PDF file into individual JPEGs using convert (ImageMagick). The extraction works fine. What I am stuck on is that if I want to extract the page range 10-20, I still get JPEG files named page-0.jpeg to page-9.jpeg, while I want them to be named page-10.jpeg to page-20.jpeg. Is there a way of specifying that on the command line?
I need this because I want to extract pages in chunks of 10 to avoid eating up too much memory for huge PDF files, and I don't want to keep renaming the files.
I remember having this working in an earlier project but can't figure out what I am missing now.
Finally managed to do this. Leaving an answer in case somebody else is looking for the same. The solution works with ImageMagick 6.5.1.
So we want to extract pages numbered i to j from a.pdf into individual JPEGs named a-i.jpeg through a-j.jpeg.
convert a.pdf[i-j] -set filename:page "%[fx:t+i]" a-%[filename:page].jpeg
This uses ImageMagick's fx expressions: fx:t gives the index of the current image in the sequence, and we add our offset i to it.
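For the concrete 10-20 range this becomes (the bracketed indices select the pages, and the +10 offset keeps the page numbers in the file names):
convert a.pdf[10-20] -set filename:page "%[fx:t+10]" a-%[filename:page].jpeg
This produces a-10.jpeg through a-20.jpeg.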
You can specify the first "page" number used by %d in the output filename by adding the -scene n parameter, e.g.:
convert a.pdf[0-9] -scene 10 a-%d.jpeg
will output a-10.jpeg, a-11.jpeg, etc.