Apache PDFBox Remove Spaces between characters - pdfbox

We are using PDFBox to extract text from PDFs.
The text of some PDFs can't be extracted correctly.
The following image shows part of the PDF:
After text extraction we get the following text:
3, 8 5 EU R 1 Netto 38,50 EUR 4,00
(Spaces are added between ',' and '8')
Here is our code:
PDDocument pdf = PDDocument.load(reuseableInputStream);
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(true);
String text = pdfStripper.getText(pdf);
We tried to play with the PDFTextStripper attributes 'AverageCharTolerance' and 'SpacingTolerance' with no positive effect.
The alternative library 'iText' extracts the text correctly, without spaces between the characters. But we can't use it because of license problems.
Any ideas? Thank you.
EDIT: We are using version 1.8.9. We also tried the 2.0.0 snapshot version, with no effect.

The cause
Inspecting the file provided by the OP it turns out that the issue is caused by extra spaces actually being there! There are multiple strings drawn from the same starting position; at every position at most one of those strings has a non-space character. Thus, the PDF viewer output looks good, but PDFBox as text extractor tries to make use of all characters found including those extra space characters.
The behavior can be reproduced using a PDF with this content stream with F0 being Courier:
BT
/F0 9 Tf
100 500 Td
( 2 Netto 5,00 EUR 3,00) Tj
0 0 Td
( 2882892 ENERGIZE LR6 Industrial 2,50 EUR 1) Tj
ET
In a PDF viewer this looks like this:
Copy & paste from Adobe Reader results in
2 2 8 8 2 8 9 2 E N E R G I Z E L R 6 I n d u s t r i a l 2 , 5 0 E U R 1 Netto 5,00 EUR 3,00
Regular extraction using PDFBox results in
2 2 8 8 2 89 2 E N E RG IZ E L R 6 I n du s t ri a l 2 ,5 0 EU R 1 Netto 5,00 EUR 3,00
Thus, PDFBox is not the only tool that has problems here: the two outputs differ, but the extra spaces are a problem either way.
I would propose telling the producer of those PDFs that they are difficult to post-process, even for widely-used software like Adobe Reader.
A work-around
To extract something sensible from this we have to somehow ignore the (actually existing!) extra spaces. As there is no way to ad hoc know which spaces can be used later on and which not, we simply remove all and hope PDFBox adds spaces where necessary:
String extractNoSpaces(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void processTextPosition(TextPosition text)
        {
            // getCharacter() is the PDFBox 1.8.x API (2.x renamed it to getUnicode()).
            String character = text.getCharacter();
            // Forward only non-whitespace glyphs; PDFBox then derives spaces from the gaps.
            if (character != null && character.trim().length() != 0)
                super.processTextPosition(text);
        }
    };
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}
(ExtractWithoutExtraSpaces.java)
Using this method with the test document we get:
2 2882892 ENERGIZE LR6 Industrial 2,50 EUR 1 Netto 5,00 EUR 3,00
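To see why dropping the space glyphs helps, here is a small Python sketch of the situation. It is not PDFBox code: it assumes a monospaced font so that every glyph sits in an integer column, and it models word-gap detection very crudely (a skipped column becomes one space). The function names and the two sample strings are made up for illustration.

```python
def place_glyphs(text):
    """Return (column, character) pairs for a string drawn from column 0."""
    return list(enumerate(text))


def extract(strings, drop_spaces):
    """Merge the glyphs of all strings by position, like a position-sorted extractor."""
    glyphs = [g for s in strings for g in place_glyphs(s)]
    if drop_spaces:
        # The workaround: ignore every whitespace glyph found in the content stream.
        glyphs = [g for g in glyphs if not g[1].isspace()]
    glyphs.sort(key=lambda g: g[0])  # stable sort: ties keep drawing order

    out = []
    prev_col = None
    for col, ch in glyphs:
        # Crude stand-in for word-gap detection: a skipped column becomes one space.
        if prev_col is not None and col - prev_col > 1:
            out.append(" ")
        out.append(ch)
        prev_col = col
    return "".join(out)


# Two strings drawn from the same origin; each column holds at most one visible character.
first = "2" + " " * 6 + "Netto"
second = " " * 2 + "2882" + " " * 6

print(repr(extract([first, second], drop_spaces=False)))
print(repr(extract([first, second], drop_spaces=True)))
```

With all glyphs kept, the padding spaces of one string land between the visible characters of the other, which is the effect seen in the extraction output above. With the spaces dropped, the second call yields "2 2882 Netto".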
Different text extractors
The alternative library 'iText' extracts the text correctly, without spaces between the characters
This is due to iText extracting text string by string, not character by character. This procedure has its own perils but in this case results in something more usable out-of-the-box.

On newer versions of PDFBox the workaround above doesn't work.
But you can work around the space problem and achieve the same result just by configuring your PDFTextStripper like this:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setWordSeparator("");


How to process mainframe numbers where "{" is the last character

I have data from a mainframe file like the one below:
000000720000{
I need to parse the data and load it into a Hive table like below:
72000
The field above is the income column, and the "{" sign denotes a positive amount.
Data type used while creating the table: income decimal(11,2)
In the layout.cob copybook it is defined as INCOME PIC S9(11)V99
Could someone help?
The number you want is 7200000 which would be 72000.00.
The conversion you are looking for is:
Positive numbers
{ = 0
A = 1
B = 2
C = 3
D = 4
E = 5
F = 6
G = 7
H = 8
I = 9
Negative numbers (this makes the whole value negative)
} = 0
J = 1
K = 2
L = 3
M = 4
N = 5
O = 6
P = 7
Q = 8
R = 9
Let's explain why.
Based on your question, the issue arises when packed decimal data is unpacked (UNPK) into character data. In packed form, the PIC S9(11)V99 field takes up 7 bytes of storage (13 digits plus a sign nibble) and looks like the picture below.
You'll see three lines. The top is the character representation (missing in the first picture because the hex values do not map to displayable characters) and the lines below are the hexadecimal values. Most significant digit on top and least below.
Note that in the rightmost byte the sign is stored as C, which means positive. For a negative value you would see a D instead.
When it is converted to character data it will look like this
Notice the C0 which is a consequence of the unpacking to preserve the sign. Be aware that this display is on z/OS which is EBCDIC. If the file has been transferred and converted to another code-page you will see the correct character but the hex values will be different.
Here are all the combinations you will likely see for positive numbers
and here for negative numbers
To make your life easy, if you see one of the first set of characters then you can replace it with the corresponding number. If you see something from the second set then it is a negative number.
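The replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete copybook parser: the function name decode_zoned and the scale parameter (the number of implied decimal places, 2 for V99) are assumptions made for this example.

```python
from decimal import Decimal

# Last-character codes as listed above, mapped to the digit they stand for.
POSITIVE = {ch: digit for digit, ch in enumerate("{ABCDEFGHI")}
NEGATIVE = {ch: digit for digit, ch in enumerate("}JKLMNOPQR")}


def decode_zoned(field, scale=2):
    """Convert a signed character field such as '000000720000{' to a Decimal."""
    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE[last]
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE[last]
    elif last.isdigit():
        # No sign code present: the last character is already a plain digit.
        sign, digit = 1, int(last)
    else:
        raise ValueError(f"unexpected sign character: {last!r}")
    if body and not body.isdigit():
        raise ValueError(f"non-digit characters in field: {field!r}")
    value = sign * int(body + str(digit))
    # Shift the implied decimal point 'scale' places to the left.
    return Decimal(value).scaleb(-scale)


print(decode_zoned("000000720000{"))
```

For the value from the question this prints 72000.00, which fits a decimal(11,2) column. The sketch assumes the file has already been converted from EBCDIC to a character set in which the sign codes appear as the letters and braces listed above.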

generating palindromes with John the Ripper

How can I configure John the Ripper to generate only mangled (Jumbo) palindromes from a word-list to crack a password hash?
(I've googled it but only found "how to avoid palindromes")
In john/john.conf, append the following rules at the end (this example is for 9- and 10-letter palindromes):
# End of john.conf file.
# Keep this comment, and blank line above it, to make sure a john-local.conf
# that does not end with \n is properly loaded.
[List.Rules:palindromes]
f
f D5
then run john with your wordlist plus the newly created "palindromes" rules:
$ john --wordlist=wordlist.lst --rules:palindromes hashfile.hash
rule f simply appends a reflection of itself to the current word from the wordlist, e.g. P4ss! -> P4ss!!ss4P
rule f D5 not only reflects the word but then deletes the 5th character, e.g. P4ss! -> P4ss!ss4P
I haven't found a way to "delete the middle character" so as of now, the rule has to be adjusted to the required length of palindromes, e.g. f D4 for length of 7, f D6 for length of 11 etc.
Edit: Possible solution for variable length (not tested yet):
f
Mr[6
M = Memorize current word, r = Reverse the entire word , [ = Delete first character, 6 = Prepend the word saved to memory to current word
With this approach the palindromes could additionally be "turned inside out" (word from wordlist at the end of the resulting palindrome instead of at beginning)
f
Mr[6
Mr]4
M = Memorize current word, r = Reverse the entire word , ] = Delete last character, 4 = Append the word saved to memory to current word
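To check what these rules should produce, the string operations they describe can be mimicked in plain Python. This is only a model of the rule semantics as explained above (reflect, delete at a 0-based position, memorize, reverse, delete first or last character, prepend or append the memorized word); it does not run John the Ripper, and the function names are invented for this sketch.

```python
def rule_f(word):
    """'f': append the reversed word to the word."""
    return word + word[::-1]


def rule_f_delete(word, pos):
    """'f DN': reflect, then delete the character at 0-based position N."""
    reflected = rule_f(word)
    return reflected[:pos] + reflected[pos + 1:]


def rule_m_r_first_6(word):
    """'Mr[6': memorize, reverse, delete first character, prepend memorized word."""
    memory = word
    current = word[::-1][1:]
    return memory + current


def rule_m_r_last_4(word):
    """'Mr]4': memorize, reverse, delete last character, append memorized word."""
    memory = word
    current = word[::-1][:-1]
    return current + memory


for candidate in (
    rule_f("P4ss!"),
    rule_f_delete("P4ss!", 5),
    rule_m_r_first_6("P4ss!"),
    rule_m_r_last_4("P4ss!"),
):
    # Every candidate should read the same forwards and backwards.
    print(candidate, candidate == candidate[::-1])
```

In this model, Mr[6 gives the same odd-length palindrome as f D5 does for a five-letter word, but without depending on the word length, and Mr]4 puts the original word at the end.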

Header and repeating time information removal from a GPS TEC rinex file

I have a RINEX file, shown here (an image of the first part of the file):
http://imageshack.us/photo/my-images/593/65961409.jpg
The data (AOPR Rinex file) is downloaded from the site after entering a year and a day.
http://www.naic.edu/aisr/GPSTEC/gpstec.html
I want to open this file as a matrix in MATLAB for further processing. The header ends at the 42nd line, and the time information is on the 43rd line. Then the data starts. But the time information appears again after some rows, for example on the 64th line, and should be discarded. The header should also be discarded. Also, the last column wraps onto a second row below the first column and should be moved back to the last column. In total there are 55700 rows. Kindly help me with this.
I suspect the last column appearing on the line below it is just an artifact of how large the window of your text reader is...
For the rest, I think a trial-and-error loop is in place here:
fid = fopen('test.txt','r');
C = {};
while ~feof(fid)
    % read lines with dictated format.
    D = textscan(fid, '%d %d %d %d');
    % this will fail on headerlines, empty lines, etc.
    if isempty(D{1})
        % in those cases, advance the file pointer by one line
        fgetl(fid);
    else
        % if that's not the case, save the lines thus read
        C = [C;D]; %#ok
    end
end
fclose(fid);
% Post-process: concatenate all sub-arrays into one
C = arrayfun(@(ii) cat(1, C{:,ii}), 1:size(C,2), 'UniformOutput', false);
This works, at least with my test.txt:
header
random
garbage
1 2 3 4
4 5 6 7
4 6 7 8
more random garbage
2 5 6 7
5 6 7 8
8 6 3 7
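For readers without MATLAB, the same idea of skipping whatever does not parse can be sketched in Python. The function name and the assumption of exactly four integer columns are illustrative and taken from the test.txt sample above, not from the real RINEX layout.

```python
def read_numeric_rows(lines, n_cols=4):
    """Keep only lines that consist of exactly n_cols integers."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != n_cols:
            continue  # header, blank line, or other non-data line
        try:
            rows.append([int(field) for field in fields])
        except ValueError:
            continue  # right number of fields, but not all numeric
    return rows


sample = """header
random
garbage
1 2 3 4
4 5 6 7
4 6 7 8
more random garbage
2 5 6 7
5 6 7 8
8 6 3 7"""

rows = read_numeric_rows(sample.splitlines())
print(len(rows), rows[0], rows[-1])
```

A real observation file would need a rule for the repeated time lines as well, since those may contain numbers too; the field-count check here is only a stand-in for such a rule.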
I suspect the last column appearing on the line below it is just an artifact of how large the window of your text reader is...
For the rest, I think a trial-and-error loop is in place here
Dear Rody, I don't have any MATLAB background and am just a beginner. It is actually a RINEX file, with 2780 epochs and 6 observables with 30 satellite values. Decoding it in MATLAB is tough; that is the problem. You can read sample code at
http://web.ics.purdue.edu/~tdauterm/EAS591/Lab7/read_rinexo.m
But the problem is that there are six observables, while the m-file handles only 5, and not in the correct order. I need C1 P2 L1 L2 S1 S2, but the code at the link gives L1 L2 C1 P1 P2. :( Could you correct that? It would be a great help.

Formatting source listings with listings & framed packages

I currently have the problem that the listings package cannot spread source files across multiple pages. The documentation says that the "framed" package should be used for various formatting options. Unfortunately, I did not find any docs for the "framed" package. My current source formatting looks like this for C# sources:
Source Formatting http://www.free.image.hosting.net/uploads/88987a1ef4.png
Unfortunately the image service no longer exists and I can't find that image, since the post was posted more than 5 years ago. What I remember is that the formatted source code part, which should be visible on the next page, was just truncated and did not show up at all.
My formatting for "listings" package is:
\newcommand{\sourceFormatterCSharp}
{
\lstset
{ language=[Sharp]C
, captionpos=b
%, frame=lines
, morekeywords={var, get, set}
, basicstyle=\footnotesize\ttfamily
, keywordstyle=\color{blue}
, commentstyle=\color{darkgreen}
, stringstyle=\color{darkred}
, backgroundcolor=\color{lightgrey}
, numbers=left
, numberstyle=\scriptsize
, stepnumber=2
, numbersep=5pt
, breaklines=true
, tabsize=2
, showstringspaces=false
, emph={double, bool, int, unsigned, char, true, false, void, get, set}
, emphstyle=\color{blue}
, emph={Assert, Test}
, emphstyle=\color{red}
, emph={[2]\#using, \#define, \#ifdef, \#endif}
, emphstyle={[2]\color{blue}}
, frame=shadowbox
, rulesepcolor=\color{grey}
, lineskip={-1.5pt} % single line spacing
}
}
% first optional param is placement
% param1 file name without extension
% param2 chapter number, e.g. 1 or 2 ...
% param3 caption to use
\newcommand{\embedCSharp}[4][htbp]
{
\sourceFormatterCSharp
\includeListing{#1}{#4}{#3:#2}{#3/#2.cs}
}
Can anybody help me achieve similar-looking results using the "framed" package, or any other, so that my source looks like this but can be spread across pages? An example of how to embed a listing in a frame would not be enough, since I have already got that far myself.
The listings package already supports splitting code across pages; see example below (sorry about the long listing). Note that you cannot have a float that breaks across pages, so you'll need to use the caption package (for example) to insert a caption at the beginning of the lstlisting environment.
\documentclass{article}
\usepackage[a5paper,landscape]{geometry}
\usepackage{xcolor,listings}
\begin{document}
\definecolor{lightgrey}{gray}{0.8}
\lstset
{
captionpos=b
, backgroundcolor=\color{lightgrey}
, numbers=left
, numberstyle=\scriptsize
, stepnumber=2
, numbersep=5pt
, frame=shadowbox
, rulesepcolor=\color{gray}
}
\begin{lstlisting}
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
\end{lstlisting}
\end{document}
The framed documentation is within the .sty file itself. Just use it like this:
\documentclass{article}
\usepackage{framed,lipsum}
\begin{document}
\begin{framed}
\lipsum[1-10]
\end{framed}
\end{document}
From the docs, you can also use:
framed -- ordinary frame box (\fbox) with edge at margin
shaded -- shaded background (\colorbox) bleeding into margin
snugshade -- similar
leftbar -- thick vertical line in left margin
Putting your listings instead of lipsum in the example above will allow multiple pages of code with a frame around it all; you won't be able to get identical output to listings, but should be able to tweak things to get things looking okay.

iTextSharp parse table

Using iTextSharp v5.5.13
I have a huge amount of PDF files I need to parse. About 5% of them have a table with data I also need.
The table looks like this:
Most of the time the line I need is parsed as
2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39
I can work with that. I split by space and it works.
But sometimes the month name has an extra space: janu ari
I know I can override the strategies to get rid of these extra spaces. I'm already using it with the rest of the pdf (ITextExtractionStrategy), but for this table, I'm using a rectangle strategy:
var rect = new System.util.RectangleJ(70, 425, 460, 200);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy =
new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(), filter);
var lines = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy).Split('\n');
My override looks like this:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
{
var dist = chunk.DistanceFromEndOf(previousChunk);
return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f;
}
}
I found this by Googling, but it doesn't solve my problem.
In the case of janu ari, dist is larger than -chunk.CharSpaceWidth and I'm not sure what to do next.
Please let me know if I should use a different approach than a rectangle strategy for this table.
If your data in this type of table is always going to be in the same format, then you could take a different approach: just accept whatever data your MyLocationTextExtractionStrategy is throwing at you, and then massage that data into a format that you can use.
In this case, your data is always:
2 groups of:
    1 or 2 digits (day of the month)
    some characters (name of the month)
4 groups of:
    Euro symbol
    some digits (at least one)
    comma
    2 digits
In 2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39 the spaces are separation characters, but with such well structured data, you don't even need spaces. So just drop them, and then your data becomes 2januari15januari€49,49€21,57€15,09€34,39.
Now you can use a regular expression with some capture groups, to massage your data into something palatable.
2 groups of:
    [0-9]{1,2}
    [a-z]*
4 groups of:
    €
    [0-9]{1,}
    ,
    [0-9]{2}
As you wrote yourself in the comments, one possible resulting regular expression could be:
new Regex(@"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})")
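The drop-the-spaces-then-match approach can be tried out quickly outside of C#. The following Python sketch uses the same pattern; the function name parse_row is an assumption for this example, and the input line is the problematic "janu ari" variant from the question.

```python
import re

# Same structure as the expression above: two day/month pairs, then four amounts.
AMOUNT = r"(€[0-9]{1,},[0-9]{2})"
ROW = re.compile(r"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)" + AMOUNT * 4)


def parse_row(line):
    """Strip all whitespace, then split the row into its eight fields."""
    compact = re.sub(r"\s+", "", line)
    match = ROW.fullmatch(compact)
    return match.groups() if match else None


fields = parse_row("2 janu ari 15 januari € 49,49 € 21,57 € 15,09 € 34,39")
print(fields)
```

Because every space is removed before matching, a stray space inside the month name no longer matters. Lines that do not have this exact shape (for example amounts with a thousands separator) return None and would need a wider pattern.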