PDFBox extracting paragraphs

I am new to PDFBox and want to extract a paragraph that matches some particular words. I can extract the whole PDF to text (Notepad), but I have no idea how to extract a particular paragraph in my Java program. Can anyone help me with this, or at least point me to some tutorials or examples? Thank you so much.

Text in PDF documents is absolutely positioned. So instead of words, lines and paragraphs, one only has absolutely positioned characters.
Let's say you have a paragraph:
Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
Roughly speaking, in the PDF file it will be represented as the character N at some position, e a bit to the right of it, q, u, e further to the right, etc.
PDFBox tries to guess how the characters make up words, lines and paragraphs. So it will look for a lot of characters at approximately the same vertical position, and for groups of characters that are near to each other and similar, to try and find what you need. It does that by extracting the text from the entire page and then processing it character by character to create text (it can also try and extract text from just one rectangular area inside a page). See the appropriate class PDFTextStripper (or PDFTextStripperByArea). For usage, see ExtractText.java in the PDFBox sources.
That means that you cannot extract paragraphs easily using PDFBox. It also means that PDFBox can and sometimes will get it wrong when extracting text (there are a lot of very different PDF documents out there).
What you can do is extract text from the entire page and then try and find your paragraph by searching through that text. Regular expressions are usually well suited for such tasks (available in Java either through the Pattern and Matcher classes, or through convenience methods on the String class).
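A minimal sketch of that last step, assuming the page text has already been extracted into a String and that paragraphs in it are separated by blank lines (an assumption that will not hold for every PDF). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSearch {
    // Assumption: a paragraph break is a line break followed by at least one more
    // (possibly whitespace-only) line break.
    private static final Pattern BLANK_LINES =
            Pattern.compile("\\r?\\n[ \\t]*(?:\\r?\\n)+");

    // Returns every paragraph of pageText that contains the keyword (case-insensitive).
    static List<String> findParagraphs(String pageText, String keyword) {
        // Pattern.quote makes the keyword match literally, even if it contains regex characters.
        Pattern wanted = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String paragraph : BLANK_LINES.split(pageText)) {
            if (wanted.matcher(paragraph).find()) {
                hits.add(paragraph.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String pageText = "First paragraph about fonts.\nIt has two lines.\n\n"
                + "Neque porro quisquam est qui dolorem ipsum\nquia dolor sit amet.\n\n"
                + "Last paragraph.";
        System.out.println(findParagraphs(pageText, "dolorem"));
    }
}
```

If the extracted text has no blank lines between paragraphs, the splitting pattern has to be replaced by some other heuristic; only the keyword matching carries over unchanged.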

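import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static void main(String[] args) throws IOException {
    File file = new File("File Path");
    // try-with-resources closes the document when we are done
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // Marker written at the start of each paragraph PDFBox detects
        // (a literal slash followed by t, not a tab character)
        pdfStripper.setParagraphStart("/t");
        pdfStripper.setSortByPosition(true);
        // split() takes a regular expression, so quote the marker
        String marker = Pattern.quote(pdfStripper.getParagraphStart());
        for (String paragraph : pdfStripper.getText(document).split(marker)) {
            System.out.println(paragraph);
            System.out.println("********************************************************************");
        }
    }
}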
Please try the code above. It works with the PDFBox 2.0.8 jar.

I detected the start of a paragraph using the following approach. Read the page line by line. For each line:
Find the last index of '.' (period) in the line.
Compare this index with the length of the input line.
If the index is less than the position of the last character, the line does not end with a period, so the current paragraph has not ended.
If the period is the last character, the current paragraph has ended and the next line will be the beginning of a new paragraph.
Hope this helps.
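The steps above can be sketched roughly as follows. This is only an illustration with made-up names, and it assumes the lines have already been extracted from the page; trailing whitespace is trimmed before the comparison, which the description above does not mention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PeriodParagraphs {
    // True if the last period in the line is also its last character.
    static boolean endsParagraph(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && trimmed.lastIndexOf('.') == trimmed.length() - 1;
    }

    // Joins consecutive lines until one of them ends with a period.
    static List<String> group(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore empty lines
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.trim());
            if (endsParagraph(line)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString()); // text after the last period
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Text in PDF documents is", "absolutely positioned.",
                "PDFBox tries to guess", "the layout.");
        group(lines).forEach(System.out::println);
    }
}
```

A sentence that happens to end exactly at a line break is treated as a paragraph end here, so the heuristic splits too eagerly on text with short sentences.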

After extracting the text, paragraphs can be constructed programmatically by considering the following points:
Every line that starts with a lowercase letter should be joined with the previous line. A line that starts with a capital letter may also need to be joined with the previous line, e.g. inside a quoted expression.
A line ending with one of the characters . ? ! " may be the end of a paragraph, but not always.
Once a paragraph has been determined programmatically, test that it contains an even number of quotes. These may be simple double quotes or the Unicode opening and closing double quotes.
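As a rough sketch of two of these points (joining lines that start with a lowercase letter, and the quote check), assuming the extracted lines are available as a list of strings. The names are invented for illustration, and the capital-letter and end-punctuation cases are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineJoiner {
    static boolean startsWithLowercase(String line) {
        String t = line.trim();
        return !t.isEmpty() && Character.isLowerCase(t.charAt(0));
    }

    // Appends every line that starts with a lowercase letter to the previous entry.
    static List<String> joinLines(List<String> lines) {
        List<String> joined = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty()) {
                continue;
            }
            if (!joined.isEmpty() && startsWithLowercase(t)) {
                int last = joined.size() - 1;
                joined.set(last, joined.get(last) + " " + t);
            } else {
                joined.add(t);
            }
        }
        return joined;
    }

    // Straight double quotes must come in pairs; curly quotes must open and close equally often.
    static boolean hasBalancedQuotes(String paragraph) {
        int straight = 0, opening = 0, closing = 0;
        for (char c : paragraph.toCharArray()) {
            if (c == '"') straight++;
            else if (c == '\u201C') opening++;
            else if (c == '\u201D') closing++;
        }
        return straight % 2 == 0 && opening == closing;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The quick brown fox", "jumps over the lazy dog.", "Another line.");
        System.out.println(joinLines(lines));
        System.out.println(hasBalancedQuotes("He said \"hello\" and left."));
    }
}
```

A candidate paragraph that fails hasBalancedQuotes is a hint that it was cut in the middle of a quotation and should be merged with the next line.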

Try this:
private static String getParagraphs(String filePath, int linecount) throws IOException {
    // ParagraphDetector is not defined in this answer
    ParagraphDetector paragraphDetector = new ParagraphDetector();
    StringBuilder extracted = new StringBuilder();
    // LineIterator and IOUtils come from Apache Commons IO
    LineIterator it = IOUtils.lineIterator(new BufferedReader(new FileReader(filePath)));
    try {
        // Skip lines 0..linecount (inclusive)
        for (int lineNumber = 0; it.hasNext() && lineNumber <= linecount; lineNumber++) {
            it.nextLine();
        }
        // Append every remaining line (no separator is added between lines)
        while (it.hasNext()) {
            extracted.append(it.nextLine());
        }
    } finally {
        it.close();
    }
    return paragraphDetector.SentenceSplitter(extracted.toString());
}

You can first use PDFBox's getText function to get the text. Every line ends with '\n', so you cannot segment paragraphs simply by splitting on "\n". If a line satisfies the following condition:
line.length() > 2 && (int)line.charAt(line.length()-2) == 32
then this line is the last line of its paragraph. Here 32 is the Unicode value of the space character.
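A small sketch of how that condition might be applied, under the assumption that each line still carries its trailing '\n' (so the character at length-2 is the one just before the line break). The class and method names are hypothetical, and whether a PDF actually leaves a space there depends on how it was produced.

```java
import java.util.ArrayList;
import java.util.List;

class TrailingSpaceParagraphs {
    // The condition from the answer: the character before the final '\n' is a space (code 32).
    static boolean isLastLineOfParagraph(String line) {
        return line.length() > 2 && line.charAt(line.length() - 2) == ' ';
    }

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // The lookbehind splits after each '\n' and keeps the '\n' on the line.
        for (String line : text.split("(?<=\\n)")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.trim());
            if (isLastLineOfParagraph(line)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(split("first line\nends here \nnext para\n"));
    }
}
```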

Related

PDFBox getText not returning all of the visible text

I am using PDFBox to extract text from my PDF document. It retrieves the text, but not all of it (specifically, it seems the title/header and footer texts are missing). The parts that are missing are not images, and they are extracted when using the text view in Foxit Reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code I used for 2.0.2:
public static void main(String[] args) {
File file = new File("D:\\file.pdf");
try {
PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
//stripper.setSuppressDuplicateOverlappingText(false);
stripper.getText(doc);
} catch (Exception e) {
System.out.println("Exc errirs ");
}
}
Now I wonder: are there any settings I missed? Is PDFBox failing because the text is on top of some decorative elements (a rectangle under the text)?
Thanks
EDIT: link to file in question

As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true)
However, as mkl pointed out, if the text is in two columns, you won't like the result either.

iTextSharp - Is it possible to set a different alignment in the same cell for text

In one cell and on the same line, I must add two pieces of text (a name and a date).
The first snippet of text must be on the left side of the page, the second one on the right, and everything must be on one line.
I've tried using Paragraphs, Chunks and Phrases, but I don't know how to do it.

If you want to separate two pieces of text in the same Phrase or Paragraph, you have to create a Chunk I often refer to as glue:
Chunk glue = new Chunk(new VerticalPositionMark());
You can use this glue like this:
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
Chunk glue = new Chunk(new VerticalPositionMark());
PdfPTable table = new PdfPTable(1);
Phrase p = new Phrase();
p.add("Left");
p.add(glue);
p.add("Right");
table.addCell(p);
document.add(table);
document.close();
}
In the result, the special Chunk we've created separates the Strings "Left" and "Right".

Just use two paragraphs, chunks or phrases. If you are trying to do it with just one of the three, you are limited. Just define another text field to be added to the page. You can use any combination of three, and set the location on the page to reflect your requirements.

PDFBOX, Reading a pdf line by line and extracting text properties

I am using PDFBox to extract text from PDF files. I read the PDF document as follows:
PDFParser parser = null;
String text = "";
PDFTextStripper stripper = null;
PDDocument pdoc = null;
COSDocument cdoc = null;
File file = new File("path");
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
e.printStackTrace();
}
try {
parser.parse();
cdoc = parser.getDocument();
stripper = new PDFTextStripper();
pdoc = new PDDocument(cdoc);
stripper.setStartPage(1);
stripper.setEndPage(2);
text = stripper.getText(pdoc);
System.out.println(text);
} catch (IOException e) {
e.printStackTrace();
}
But what I want to do is read the document line by line and extract the text properties, such as bold and italic, from each line.
How can I achieve this with the PDFBox library?

extract the text properties such as bold,italic, from each line. How can I achieve this with pdfbox library
Properties such as bold and italic are not first-class properties in a PDF.
Bold or italic writing in PDFs is achieved either using
    different fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold or italic by
        looking at the font name: it may contain a substring "bold", "italic", "oblique"...
        looking at some optional properties of the font, e.g. font weight...
        inspecting the embedded font file.
    None of these methods is fool-proof; or
    using the same font as for non-bold, non-italic text but using special techniques to make it appear bold or italic (aka poor man's bold), e.g.
        not only filling the glyph contours but also drawing a thicker line along them for a bold impression,
        drawing the glyph twice, the second time slightly displaced, also for a bold impression,
        using a text or transformation matrix to slant the letters for an italic impression.
By overriding the PDFTextStripper methods with such tests accordingly, you may achieve a fairly good guess rate for styles during PDF text extraction.
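A minimal sketch of the font-name test only, written as plain string checks so it does not depend on a particular PDFBox version. The class and method names are invented; in PDFBox 2.x the name could, for example, be obtained from TextPosition.getFont().getName() inside an overridden PDFTextStripper method, but that wiring is not shown here.

```java
import java.util.Locale;

class FontStyleGuess {
    // Guess based on the font name alone; as noted above, this is not fool-proof.
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    static boolean looksItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    public static void main(String[] args) {
        // Example names only; subset fonts often carry a prefix such as "ABCDEF+".
        String[] names = {"ABCDEF+Arial-BoldMT", "Helvetica-Oblique", "TimesNewRomanPSMT"};
        for (String name : names) {
            System.out.println(name + " bold=" + looksBold(name) + " italic=" + looksItalic(name));
        }
    }
}
```

Fonts whose names do not follow these conventions, and the "poor man's bold" techniques listed above, will not be caught by a name check like this.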

Create new paragraph with Docx4j

I'm having a problem creating a paragraph with docx4j. Well, actually not the paragraph itself, but its contents. I'm putting together a new document from paragraphs (actually "blocks" made of paragraphs) and everything is working fine. I'm appending them to a list, and when all the needed paragraphs are there, I assemble the document. Now, between these blocks, I need new paragraphs with custom text added. I'm using this function to create the paragraph:
private P createParagraph(String content) {
P result = factory.createP();
R run = factory.createR();
Text text = factory.createText();
text.setValue(content);
run.getContent().add(text);
result.getContent().add(run);
System.out.println("HEADER : " + result.toString());
return result;
}
The print only prints "HEADER : "; result.toString() is an empty string. Why is that?
BONUS question: I did not want to open a new thread for this. Is it possible to add an id to a paragraph that will appear in the generated HTML (like <p id="xyz" ...>)?
Thank you very much!

If you want to see the XML your P object will become, use:
System.out.println(XmlUtils.marshaltoString(result, true, true));
org.docx4j.wml.P is a class generated by JAXB's xjc.
There are a couple of plugins listed at https://java.net/projects/jaxb2-commons/pages/Home which we could have used to generate a toString method, but we didn't.
If you want the text content of the paragraph, you can use TextUtils.

Justify text in SQL Reporting Services

Is there a way of fully-justifying text in SQL Reporting Services?
I've been searching around and it seems the feature is still not supported by Reporting Services, but are there any workarounds?
I know this question has been asked before, but maybe progress has been made in the meantime.

This is not possible, at least not in SSRS 2008 and below. The only options for aligning text are Left, Center and Right.
The only workaround I could think of was enabling HTML tags in a text box, but the styling for Justify alignment is just ignored. So there really aren't any suitable workarounds AFAIK, short of using a picture with justified text (~shudder!~).
You should keep an eye on the corresponding MS feedback item and perhaps vote on it as well. It used to have 527 votes, but was reset to 0 during the move from MS Connect to this new feedback site. I found the bug report through this social.msdn thread, which has been going on for quite some time.

'picture with justified text in SSRS': you can create an AdvRichTextBox control (see the code at http://geekswithblogs.net/pvidler/archive/2003/10/14/182.aspx ) and use it in SSRS by following these steps: http://binaryworld.net/Main/CodeDetail.aspx?CodeId=4049

Here's a possible workaround : Full Text Just
It makes use of RS utility and OLE Automation to do the job.

Out of the box, SSRS does not support justified text. There are ways to work around this:
1. Use a third-party control that does this. (I was not able to get one to work.)
2. Call a component such as Word via COM. (This is a security issue, but possible.)
3. Format the box as HTML and put small white spaces between the words. This can be done in a stored procedure.
Solution 3 takes very long to describe in detail, which is why I put my solution up for free download on my web page.
The advantage of my solution is that no installation is necessary.
Here is the link to my solution: http://www.rupert-spaeth.de/justify/
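To illustrate the idea behind the HTML workaround (this is not the downloadable solution, just a sketch in Java with made-up names), the following pads the gaps of one line with non-breaking spaces until it reaches a target width. It counts characters, i.e. it assumes a fixed-width font; a proportional font would require measuring the rendered text width instead.

```java
class JustifySketch {
    // Pads the gaps between words with "&nbsp;" so the line is `width` characters wide.
    static String justifyLine(String line, int width) {
        String[] words = line.trim().split("\\s+");
        if (words.length < 2) {
            return line.trim(); // nothing to distribute
        }
        int letters = 0;
        for (String word : words) {
            letters += word.length();
        }
        int gaps = words.length - 1;
        // At least one space per gap, even if the line is already too long.
        int spaces = Math.max(width - letters, gaps);
        int base = spaces / gaps;
        int extra = spaces % gaps; // the leftmost gaps get one more space
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            html.append(words[i]);
            if (i < gaps) {
                int count = base + (i < extra ? 1 : 0);
                for (int k = 0; k < count; k++) {
                    html.append("&nbsp;");
                }
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(justifyLine("Lorem Ipsum is simply dummy text", 40));
    }
}
```

The text first has to be broken into lines that fit the box; each line except the last of a paragraph would then be passed through justifyLine.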

If you use <p> try with:
$("[style*='padding-bottom:10pt']").css("text-align", "justify");

The following will work if you open the .rdl code file (which is xml).
You need a paragraph tag, if it doesn't already exist.
This formats a number to use commas (U.S. style) with two digits after the decimal point.
It is then right-aligned by setting the TextAlign tag to Right (I had been looking for a justify tag, but the tag is TextAlign).
<Paragraph>
<TextRuns>
<TextRun>
<Value>=Format( Sum(Fields!ourField.Value, "DataSet2") , "N2") </Value>
<Style>
<FontFamily />
<Color>White</Color>
</Style>
</TextRun>
</TextRuns>
<Style>
<TextAlign>Right</TextAlign>
</Style>
</Paragraph>

Actually, it is possible to justify text in an SSRS report if you pass the value as HTML and format the text into justified HTML beforehand. In my case I'm using .NET C# to format the passed string into justified HTML text.
But before that, we need to configure our SSRS report to accept HTML. For this we need to add a text box and create a placeholder.
To add a placeholder, click on the text box until it lets you write text into it, then right-click and choose "Create placeholder...".
After you have created the placeholder, you will be prompted to enter its properties. All you need to specify is Value and Markup type.
Be sure to select HTML as the Markup type, and for the value specify the variable that will hold the justified HTML text; in our case let's call it transformedHtml.
Now we need to create a function that transforms our string into justified HTML text:
/// <summary>
/// Breaks the text into lines of the given width and pads each line with extra spaces so that it appears justified.
/// </summary>
/// <param name="text">The text that we want to justify</param>
/// <param name="width">Justified text width in pixels</param>
/// <param name="useHtmlTagsForNewLines">if true, returns the output as justified HTML; if false, returns the output as a justified string</param>
/// <returns>Justified string</returns>
public string GetText(string text, int width, bool useHtmlTagsForNewLines = false)
{
var palabras = text.Split(' ');
var sb1 = new StringBuilder();
var sb2 = new StringBuilder();
var length = palabras.Length;
var resultado = new List<string>();
var graphics = Graphics.FromImage(new Bitmap(1, 1));
var font = new Font("Times New Roman", 11);
for (var i = 0; i < length; i++)
{
sb1.AppendFormat("{0} ", palabras[i]);
if (graphics.MeasureString(sb1.ToString(), font).Width > width)
{
resultado.Add(sb2.ToString());
sb1 = new StringBuilder();
sb2 = new StringBuilder();
i--;
}
else
{
sb2.AppendFormat("{0} ", palabras[i]);
}
}
resultado.Add(sb2.ToString());
var resultado2 = new List<string>();
string temp;
int index1, index2, salto;
string target;
var limite = resultado.Count;
foreach (var item in resultado)
{
target = " ";
temp = item.Trim();
index1 = 0; index2 = 0; salto = 2;
if (limite <= 1)
{
resultado2.Add(temp);
break;
}
while (graphics.MeasureString(temp, font).Width <= width)
{
if (temp.IndexOf(target, index2) < 0)
{
index1 = 0; index2 = 0;
target = target + " ";
salto++;
}
index1 = temp.IndexOf(target, index2);
temp = temp.Insert(temp.IndexOf(target, index2), " ");
index2 = index1 + salto;
}
limite--;
resultado2.Add(temp);
}
var res = string.Join(useHtmlTagsForNewLines ? "<br> " + Environment.NewLine : "\n", resultado2);
if (useHtmlTagsForNewLines)
res = $"<div>{res.Replace(" ", "&nbsp;").Replace("<br>&nbsp;", "<br>")}</div>";
return res;
}
By using this function we can transform any string into justified text, and we can choose whether the output should be HTML or a plain string.
We can then call this method like this:
string text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";
string transformedHtml = GetText(text, 350, true);
and we get the output as follows:
[Figure: output in C#]
[Figure: output in SSRS]
Now, this example mainly shows how to get justified text if you are passing the values from C# code to SSRS reports, but you could achieve the same result with an equivalent function in a stored procedure that formats any text the same way. Hope this helps someone.
