Replace some text in an existing PDF using PDFBox

To achieve this, my approach is:
1) First, get the coordinates (X, Y) of the word I have to replace.
2) Then write over that word in append mode.
The problem: when I take the coordinates obtained in step 1 and enter them manually, I do get the appended file, but the text is written at a different location.
The coordinates of the last character are: s [(X=119.27521,Y=82.579956) height=5.52 width=4.3166428]
and I am writing with this code:
contentStream.newLineAtOffset((float) 119.27521, (float) 82.579956);

Related

Cannot move a text object (variable) outside a function

I am trying to first convert PDF credit card statements to text, then use regex to extract dates, amounts, and vendors from the individual lines. I can extract all the lines of text as they appear on the statement, but when I later refer to the variable holding the text, it only returns the last line.
I set the directory and read in the PDF credit card statement as "dfpdf".
I run this code:
with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
This returns all the lines in the statement, which is what I want. But if I later try to print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to create pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
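The loop variable is rebound on every iteration, so once the loop finishes it holds only the final value. A minimal sketch of one way around that is to append each line to a list. Here page_texts is a hypothetical stand-in for the strings that page.extract_text() returns, and the sample statement lines and the regex layout (date, vendor, amount) are assumptions for illustration, not taken from a real statement.

```python
import re

# Hypothetical stand-in for the text extracted from each page.
page_texts = [
    "01/05 COFFEE SHOP 4.50\n01/07 BOOK STORE 23.99",
    "01/09 GAS STATION 40.00",
]

all_lines = []  # outlives the loop, unlike the loop variable itself
for text in page_texts:
    for line in text.split("\n"):
        all_lines.append(line)

# Assumed line layout: MM/DD, vendor name, amount with two decimals.
pattern = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+(\d+\.\d{2})$")

records = []
for line in all_lines:
    match = pattern.match(line)
    if match:
        date, vendor, amount = match.groups()
        records.append((date, vendor, float(amount)))

print(records)
```

A list of tuples like records can then be handed to a data frame constructor if that is the end goal.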

Identify paragraphs of PDF files using iTextSharp

For some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of the iTextSharp coordinate system is the bottom-left corner of a page. I have found three features that define paragraph boundaries:
if the x-coordinate of the first word in a line is less than that of the typical lines;
if the leading between two consecutive lines is larger than the typical leading;
if a line ends with "." and the x-coordinate of its last word is less than that of the other lines.
However, I am stuck on the second one. How can I determine the typical leading between two lines in a paragraph? The gaps between consecutive lines vary, because some letters like 'f' and 'g' need more space than others like 'a' and 'n'.
Thanks for your help!
I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance "Extract font height and rotation from PDF files with iText/iTextSharp" to see how others have done this before you. A more elaborate article can be found here: "Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare".
Your question is: how can I calculate the leading? That is: how do I know the distance between the baselines of two consecutive lines?
When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the baseline of the text:
LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();
This Vector consists of different elements; see "Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp".
You need startpoint[Vector.I2], the y-coordinate. See also: "How to detect newline from PDF using iTextSharp".
The difference between that value for two consecutive lines gives you the leading in its modern meaning. In the old days of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you can work with the font size. The font size is an average size of the glyphs in a font. Some glyphs take up more height, some less, but by taking both the leading (distance between baselines) and the font size (average height of each glyph) into account, you can get a fair idea of the "space between the lines".
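Because the leading is measured between baselines, it does not depend on whether a line happens to contain tall or descending letters. A minimal sketch of using it to spot paragraph breaks, written in Python rather than C# for brevity: the baseline y-coordinates are made-up sample values (the kind of numbers read from the baseline start point), and both the use of the median as the "typical" leading and the 20% tolerance are assumptions.

```python
from statistics import median

# Hypothetical baseline y-coordinates, top line first.
# PDF y grows upward, so the values decrease down the page.
baseline_ys = [700.0, 686.0, 672.0, 650.0, 636.0, 622.0]

# Leading = distance between the baselines of consecutive lines.
leadings = [upper - lower for upper, lower in zip(baseline_ys, baseline_ys[1:])]

# Assumption: the median gap represents the typical leading inside a paragraph.
typical_leading = median(leadings)

# Assumption: a gap more than 20% above typical starts a new paragraph.
TOLERANCE = 1.2
paragraph_starts = [0] + [
    index + 1
    for index, gap in enumerate(leadings)
    if gap > typical_leading * TOLERANCE
]

print(leadings)
print(paragraph_starts)
```

With these sample values the larger gap falls before the fourth line, so the line indices 0 and 3 are reported as paragraph starts.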

Formatting a column to line up with data in a text file

I am trying to set up a text file so that the data lines up directly with its header. For instance, the file contains 7 headers (t, x(t), etc.).
np.savetxt('vel.dat', Velocity_Col, fmt='%.5e', delimiter=(' '), header = (' t x(t) y(t) z(t) vx(t) vy(t) vz(t)'))
The data appears under each header, but the columns gradually drift out of alignment.
https://imgflip.com/i/dq514
This is my first time posting, so sorry if I am doing this wrong. The picture upload is not good, but you can see the offset of the data.
Cheers!
The 5 in '%.5e' sets the number of digits displayed after the decimal point. You also want to control the total width of each field. That is controlled with a number before the decimal point in the format specification. (The number sets the minimum field width. More characters will be used if needed.) For example, you could use fmt='%15.5e' to ensure that each field uses 15 characters. You wouldn't need that long delimiter; the default delimiter would be fine. Then adjust header to match.
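A minimal sketch of the effect of the width field, using plain printf-style formatting with the same kind of format string that is passed as fmt. The sample values, the three column names, and the width of 15 are made up for illustration. One thing to keep in mind: by default np.savetxt prefixes the header with "# ", so in a real file the header text may need to be two characters shorter to line up.

```python
WIDTH = 15  # assumed minimum field width, matching '%15.5e'

names = ["t", "x(t)", "y(t)"]
row = [0.0, 1.5, -2.25]

# Right-align each header name in the same width as the data fields.
header_line = " ".join(name.rjust(WIDTH) for name in names)

# Each value takes at least 15 characters, with 5 digits after the decimal point.
data_line = " ".join("%15.5e" % value for value in row)

print(header_line)
print(data_line)
```

Both lines come out the same length, so each header sits directly above its column regardless of the sign or magnitude of the values.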

What does an /ActualText of FEFF0009 mean in a PDF?

I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
.. This seems incorrect as it's really a space.
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered at the position of the tab, but when you copy this text and paste it somewhere else, you get a tab character. I suppose the 'space' is only inserted to have something to copy.
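To confirm the reading of the hex string, here is a minimal sketch that decodes it in Python. It assumes the string starts with the UTF-16 big-endian byte order mark FE FF, as it does in the snippet from the question, and does not attempt to handle strings in any other encoding.

```python
import codecs

raw = bytes.fromhex("FEFF0009")

# FE FF is the UTF-16 big-endian byte order mark.
if not raw.startswith(codecs.BOM_UTF16_BE):
    raise ValueError("this sketch only handles UTF-16BE strings with a BOM")

# Drop the BOM and decode the remaining bytes.
actual_text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")

print(repr(actual_text))
print(ord(actual_text))
```

The single remaining code unit, 0009, is the horizontal tab.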

Preserve "long" spaces in PDFBox text extraction

I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple, and the columns are very widely spaced from each other.
This works really well, except that all kinds of horizontal space get converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart, so a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
There does not seem to be a setting for this, but I was able to modify the source for the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it builds the output line, it is possible to look at the x positions of the current and previous letter and, if the gap is large enough, do something special. PDFTextStripper has lots of protected methods, but it turned out to be not really all that extensible. I ended up having to copy the whole class to change a private method.
Looking at the code in there, I consider myself lucky that this simple approach was successful with this particular PDF. A more general solution seems very tricky.
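The idea of comparing x positions can be sketched independently of PDFBox. The following is a minimal illustration in Python, not the modified PDFTextStripper code: the glyph records (character, x position, width), the half-inch threshold, and the function name join_with_separators are all hypothetical.

```python
# Assumed threshold: half an inch, at 72 points per inch.
LONG_GAP = 36.0

def join_with_separators(glyphs, long_gap=LONG_GAP, separator="|"):
    """Join glyphs of one line, inserting a separator across long gaps."""
    parts = []
    previous_end = None
    for char, x, width in glyphs:
        # Gap between the end of the previous glyph and the start of this one.
        if previous_end is not None and x - previous_end > long_gap:
            parts.append(separator)
        parts.append(char)
        previous_end = x + width
    return "".join(parts)

# Hypothetical line: "No 1" near the left margin, "Jo" far to the right.
glyphs = [
    ("N", 10.0, 6.0),
    ("o", 16.0, 6.0),
    (" ", 22.0, 3.0),
    ("1", 25.0, 6.0),
    ("J", 200.0, 6.0),
    ("o", 206.0, 6.0),
]

print(join_with_separators(glyphs))
```

The ordinary word space inside "No 1" is left alone, and only the wide jump to the second column produces a separator.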
PDF text extraction is difficult.
If the text was output as one big string separated by spaces, such as:
PDFTextOut("Column 1        Column 2        Column 3");
and you are using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text, because each character is the same width. If the font is proportional, such as Arial, then the calculation is harder.
In reality, most PDFs are generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other character between columns. The text is just placed at an absolute position on the page.
PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");
In order to perform data extraction on PDF documents, you have to do a little more work to find and match column data, by using pixel locations as you have mentioned, by making some assumptions, and by having a little bit of luck.
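A minimal sketch of that matching step, in Python: fragments that share a y position are treated as one row, and each fragment is assigned to the nearest known column start. The fragment list, the COLUMN_STARTS values, and the assumption that y increases down the page are all made up for illustration; real documents usually need some tolerance on y as well.

```python
from collections import defaultdict

# Hypothetical positioned fragments: (x, y, text).
fragments = [
    (100, 100, "Column 1"),
    (250, 100, "Column 2"),
    (100, 120, "apple"),
    (250, 120, "3.50"),
]

# Assumption: the x positions where columns start are known in advance.
COLUMN_STARTS = [100, 250]

def column_index(x):
    """Return the index of the column start closest to x."""
    return min(range(len(COLUMN_STARTS)), key=lambda i: abs(COLUMN_STARTS[i] - x))

# Group fragments into rows by y, then into cells by nearest column.
rows = defaultdict(dict)
for x, y, text in fragments:
    rows[y][column_index(x)] = text

# Assumption: smaller y means higher on the page in this sample data.
table = [
    [rows[y].get(i, "") for i in range(len(COLUMN_STARTS))]
    for y in sorted(rows)
]

print(table)
```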