Tab-separated data is confused with tables when parsing PDF to text

I am using PDFMiner to convert a PDF to text. When there are tabs, the data is read column-wise instead of row-wise. For example, the snippet below in a PDF:
titel1 : text1
title2: text2
title title3: text3
is converted to:
titel1 : text1
title2:
title title3:
text2
text3
How can I get the text row by row, as it appeared in the original PDF?
P.S. I am using pdf2txt.py.

I faced the same issue. Try pdfplumber (https://pypi.org/project/pdfplumber/), which is built on top of PDFMiner.
This code worked perfectly fine for me:
import pdfplumber
def pdf2txt(path):
    # Open the PDF with vertical-text detection turned off
    with pdfplumber.open(path, laparams={"detect_vertical": False}) as pdf:
        text = ""
        for page in pdf.pages:
            # Guard against pages where no text is extracted
            text = text + "\n" + (page.extract_text() or "")
    return text

Related

How to get vue to display line breaks

I'm trying to display my product description, but when I render it, the parts appear side by side instead of on separate lines.
So for example I'm getting
description 1 description 2
and what I'm trying to get is
description 1
description 2
When I save my description I save it like this
$description = "$description1. \r\n .$description2"
$product->description = $description;
$product->save();
and this is how I'm trying to render it in vue
<p v-html="product.description"></p>

Have you tried using the <br> tag, similar to this:
$description = "$description1. <br> .$description2";
I haven't tested this, so the syntax may be slightly different.

Is there any way to insert a logo image or text before saving with to_html in pandas

I am saving pandas output using to_html().
Is there any way to add a logo or text at the top of the HTML page before saving?

to_html returns a string with the HTML if the first parameter, buf, is None. You can then prepend your image or text HTML to this string and write the resulting string to a file.
output = '<img src="logo.jpg" alt="logo"><br><b>some text</b><br>' + df.to_html()
with open('output.html', 'w') as f:
    f.write(output)

ExpertPDF - How to know page number based on content in HTML

Suppose I have an HTML document with some headings and text, like:
Heading 1
text......
Heading 2
text.....
Heading 3
text.....
Now I have to print this template to PDF. In the printout I have to add an index page that lists each heading with its page number. That is, the printout should look like this:
Heading 1 ....... 1 [page number]
Heading 2 ....... 2
Heading 3 ....... 3
Heading 1
text......
Heading 2
text.....
Heading 3
text.....
So my question is: how can I find the page number for a given piece of text in the HTML, for example which page Heading 1 ends up on, and likewise for the others?
Any suggestions or ideas are really appreciated.

pdfConverter.PdfFooterOptions.PageNumberTextFontSize = 10;
pdfConverter.PdfFooterOptions.ShowPageNumber = true;
It's done inside the body of this method:
private void AddFooter(PdfConverter pdfConverter)
{
    string thisPageURL = HttpContext.Current.Request.Url.AbsoluteUri;
    string headerAndFooterHtmlUrl = thisPageURL.Substring(0, thisPageURL.LastIndexOf('/')) + "/HeaderAndFooterHtml.htm";
    // enable footer
    pdfConverter.PdfDocumentOptions.ShowFooter = true;
    // set the footer height in points
    pdfConverter.PdfFooterOptions.FooterHeight = 60;
    // write the page number
    pdfConverter.PdfFooterOptions.TextArea = new TextArea(0, 30, "This is page &p; of &P; ",
        new System.Drawing.Font(new System.Drawing.FontFamily("Times New Roman"), 10, System.Drawing.GraphicsUnit.Point));
    pdfConverter.PdfFooterOptions.TextArea.EmbedTextFont = true;
    pdfConverter.PdfFooterOptions.TextArea.TextAlign = HorizontalTextAlign.Right;
    // set the footer HTML area
    pdfConverter.PdfFooterOptions.HtmlToPdfArea = new HtmlToPdfArea(headerAndFooterHtmlUrl);
    pdfConverter.PdfFooterOptions.HtmlToPdfArea.EmbedFonts = cbEmbedFonts.Checked;
}
See this page for more details:
http://www.expertpdf.net/expertpdf-html-to-pdf-converter-headers-and-footers/

This is actually a pretty tricky problem, and ExpertPDF would have to provide specific functionality to make it possible.
My solution (not ExpertPDF) was to calculate the layout of the PDF first, collect the text to be used in the index for each page, and then calculate the layout of the index page(s). Then I can number the pages (including the index pages) and update the page numbers in the index. This is the only way to handle template pages which span multiple pages themselves, index text which wraps to take up more than a single line, and indexes which span multiple pages.
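The two-pass idea described above could be sketched roughly as follows. This is only an illustration in Python, not ExpertPDF code: the function name `build_index`, the `entries_per_index_page` parameter, and the assumption that every index entry fits on exactly one line are all hypothetical simplifications. A real implementation would have to measure wrapped text instead.

```python
import math

def build_index(sections, entries_per_index_page=40):
    """sections: list of (heading, page_count) tuples in document order.

    Returns (index_page_count, [(heading, first_page_number), ...]).
    Assumption: each index entry occupies exactly one line.
    """
    # Pass 1: the body layout is already known (page_count per section),
    # so lay out the index to find out how many pages it needs.
    index_pages = max(1, math.ceil(len(sections) / entries_per_index_page))

    # Pass 2: number the pages. Body pages start after the index pages.
    next_page = index_pages + 1
    index = []
    for heading, page_count in sections:
        index.append((heading, next_page))
        next_page += page_count
    return index_pages, index
```

For example, `build_index([("Heading 1", 1), ("Heading 2", 3)], entries_per_index_page=2)` reserves one index page, so Heading 1 starts on page 2 and Heading 2 on page 3.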

Create a TextElement:
TextElement te = new TextElement(xPos, yPos, width, "Page &p; of &P;", footerFont);
footerTemplate.AddElement(te);
The library will automatically replace the &p; and &P; tokens.

Arabic Characters not connected while using pdfbox

I'm trying to insert Arabic text into a PDF using PDFBox:
File myFile = new File("src/arabic/arial.ttf");
PDFont font = PDType0Font.load(doc, myFile);
PDPageContentStream contentStream = new PDPageContentStream(doc, page,true,true);
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.newLineAtOffset(30, 40);
String arabicText = "عطي يونيكود رقما فريدا لكل حرف" ;
// System.setProperty("ste.encoding", "UTF-8");
contentStream.showText(arabicText);
contentStream.endText();
contentStream.close();
The Arabic text appears as disconnected text in the resultant pdf.

(This applies for PDFBox 2.0, not for earlier versions)
You have to do this yourself. I can't explain it for Arabic, but for "western" glyphs:
stream.showText("film \uFB01lm");
Create a PDF with that one, then try to mark the "f" or the "l" in the second word - you can't, because it is one entity.
The first word has "f" and "i" as separate characters, the second one has the latin small ligature fi (U+FB01). So you'd have to do some preprocessing yourself to replace such combinations when your font supports them. Good luck!
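The preprocessing step described above could look roughly like this. It is a minimal sketch in Python rather than Java, and it only covers the Latin f-ligatures. The `LIGATURES` table, the function name `apply_ligatures`, and the `supported` set (the ligature characters you have separately confirmed your font contains) are assumptions made for illustration. Arabic shaping is more involved and is not covered here.

```python
# Character sequences and the single Unicode ligature code point for each.
LIGATURES = {
    "ffi": "\uFB03",
    "ffl": "\uFB04",
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
}

def apply_ligatures(text, supported):
    """Replace sequences with ligature characters listed in `supported`."""
    # Try the longest sequences first so "ffi" wins over "ff" or "fi".
    for seq in sorted(LIGATURES, key=len, reverse=True):
        lig = LIGATURES[seq]
        if lig in supported:
            text = text.replace(seq, lig)
    return text
```

The returned string would then be passed to showText in place of the original text. Sequences whose ligature is not in `supported` are left untouched.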

Content templates rendering in TYPO3

I've got a strange problem connected with content rendering.
I use following code to grab the content:
lib.otherContent = CONTENT
lib.otherContent {
  table = tt_content
  select {
    pidInList = this
    orderBy = sorting
    where = colPos=0
    languageField = sys_language_uid
  }
  renderObj = COA
  renderObj {
    10 = TEXT
    10.field = header
    10.wrap = <h2>|</h2>
    20 = TEXT
    20.field = bodytext
    20.wrap = <div class="article">|</div>
  }
}
and everything works fine, except that I'd also like to use the predefined content element templates other than plain text (Text with image, Images only, Bullet list etc.).
The question is: what do I have to replace renderObj = COA and the rest between the braces with, so that TYPO3 displays them properly?
Thanks,
I.

The available cObjects are more or less listed in TSRef, chapter 8.
TypoScript for rendering Text w/image can be found in typo3/sysext/css_styled_content/static/v4.3/setup.txt at line 724, and in the neighborhood you'll find e.g. bullets (below) and image (above), which is referenced in textpic at line 731. Variants of these are what you'll write in your renderObj.
You will find more details in the file typo3/sysext/cms/tslib/class.tslib_content.php, where e.g. text w/image is found at or around line 897 and is called IMGTEXT (do a case-sensitive search). See also around line 403 in typo3/sysext/css_styled_content/pi1/class.cssstyledcontent_pi1.php, where the newer css-based rendering takes place.
