I have data in the form of PDFs and I want to convert it into text. I want to remove the images, headers and footers; the data that is left will then consist only of multi-line tables. Can you please suggest the best way to convert it? I tried Tabula and Apache Tika but the results were not what I wanted.
As you probably know, text in PDFs is arranged by x/y coordinates on the page. Headers are not stored/identified as such, like they are in MSWord, HTML etc.
Good to hear that you’ve tried tabula: https://github.com/tabulapdf/tabula . I’m sorry that it didn’t work for you.
If you’re working with journal articles, you might have luck with grobid https://wiki.apache.org/tika/GrobidJournalParser
To extract text by locations see: https://stackoverflow.com/a/35299074
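Since text in a PDF is positioned by coordinates, one way to drop headers and footers is to discard everything that falls in a band at the top or bottom of the page. The following is a minimal Python sketch of that idea, not a complete solution. It assumes you have already obtained word boxes from some PDF library as hypothetical (x, y, text) tuples, with y measured from the top of the page (native PDF coordinates start at the bottom, so you may need to flip them). The function name, margins and tolerance are illustrative values only.

```python
def body_lines(words, page_height, top_margin=50.0, bottom_margin=50.0,
               row_tolerance=3.0):
    """Drop header/footer words and group the rest into text lines.

    words: iterable of (x, y, text) tuples, y measured from the page top
    (an assumed layout; adapt it to whatever your PDF library returns).
    """
    # Keep only words that lie between the header band and the footer band.
    kept = sorted(
        (w for w in words
         if top_margin <= w[1] <= page_height - bottom_margin),
        key=lambda w: w[1],
    )

    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in kept:
        # Words whose y is close to the current row's y belong to that row.
        if rows and abs(y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])

    # Within each row, order the words left to right.
    return [" ".join(text for _, text in sorted(cells)) for _, cells in rows]
```

For multi-line table cells you would probably need an extra step that merges rows by column position; that depends heavily on the layout of your documents.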
http://pdftotext.com/
This works, but probably not the way you were looking for.
Related
My use case is pretty simple. I need to convert PDFs to images. I tried using Apache PDFBox and I am having some trouble converting PDFs which contain scanned images. When I convert a scanned image, the image clarity is lost due to compression/scaling. So I was trying to extract the image data from the PDF and then store it. But the problem is that I may get PDF files which contain both images and text, in which case I would need to fall back to image conversion mode. The problem is how to differentiate between the pages/documents having only an image and the ones with composite data. I was thinking I could use the ProcSet definition for this purpose, but it looks like it is marked as obsolete and unreliable according to the PDF specification. Another possibility is to check all the objects linked to that page and see if they contain anything other than images. Please let me know if there is an easier way of doing this.
Thanks
If your intention is to convert PDF to image, it is better to use ImageMagick for that. ImageMagick has a lot of options for changing the quality of the image, and converting a PDF to an image with it is pretty simple.
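As a rough illustration of the ImageMagick route, here is a small Python sketch that builds and runs the command line. It assumes ImageMagick 7 is installed and available as "magick" (older installations use "convert" instead) and that Ghostscript is present, since ImageMagick relies on it to rasterize PDFs. The function names, the 300 dpi default and the output pattern are illustrative choices, not recommendations from the answer.

```python
import subprocess


def build_convert_command(pdf_path, out_pattern, dpi=300, executable="magick"):
    """Build an ImageMagick command that rasterizes a PDF.

    -density is placed before the input file so that it controls the
    resolution at which the PDF pages are rendered.
    """
    return [executable, "-density", str(dpi), pdf_path, out_pattern]


def convert_pdf(pdf_path, out_pattern, dpi=300, executable="magick"):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(
        build_convert_command(pdf_path, out_pattern, dpi, executable),
        check=True,
    )
```

A call such as convert_pdf("scan.pdf", "page-%03d.png") would write one numbered PNG per page. Raising the density is usually the first thing to try when scanned pages come out blurry.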
I have a Flash file which displays a PDF file as a magazine. Because the magazine is in Hebrew, Google doesn't read it well.
Is there any way to display raw text instead of the Flash for the search engine crawlers?
I heard you can do that using SWFObject. Is that correct? And if so, how?
I need it to be SEO friendly...
Thanks in advance :)
SWFObject is still Flash, so you don't really change the situation by using that.
Recreate the PDF content in HTML and serve the .html file in place of the .pdf file. There are much better ways to do this, though, if you're simply loading content via Ajax.
I'm trying to convert an entire presentation to HTML, extracting all the embedded content etc along the way. I've got text, audio, narrations etc all working fine but am having trouble finding out how to export video content.
I'm looping through all slides in the presentation, then all shapes on the slide, looking for shapes of type msoMedia. If I find one, then I check its MediaType. If it's ppMediaTypeMovie, then I can find the source file of an externally linked video file using Shape.LinkFormat.SourceFullName, but I can't for the life of me find out how to access EMBEDDED content.
If I find a shape with a MediaType of ppMediaTypeSound then I can use Shape.SoundFormat.Export to export the audio. Does anybody know of an equivalent for VIDEO shapes? (There's no Shape.VideoFormat) I've spent days looking through every possible data member I can but to no avail.
It appears that Microsoft extracts the contents of the media file to a temporary folder anyway, and embedded videos still provide a LinkFormat.SourceFullName pointing to the extracted video:
?oshape.LinkFormat.SourceFullName
C:\Users\Alex\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.MSO\F26FF1D0.m4v
All that I need to do is fire this file through ffmpeg and I've got my video, in the format I want!
Thanks for your help :)
Note: You may find that the .Export method doesn't work for embedded sounds either in recent PPT versions.
Alex's suggestion is what I'd look into first; otherwise you can unzip the PPTX/PPSX/etc and find the videos in the media folder. Or you might try saving as an XML presentation; you might be able to parse the video out of that.
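The unzip approach can be sketched in a few lines of Python, since a PPTX file is an ordinary ZIP archive with its media stored under ppt/media/. This is only a sketch under assumptions: the list of video extensions and the function names are my own, and the actual file names inside the archive depend on the presentation.

```python
import zipfile

# Extensions treated as video here; an illustrative list, extend as needed.
VIDEO_EXTENSIONS = (".mp4", ".m4v", ".mov", ".wmv", ".avi")


def list_embedded_videos(pptx_file):
    """Return archive member names under ppt/media/ that look like videos."""
    with zipfile.ZipFile(pptx_file) as archive:
        return [
            name for name in archive.namelist()
            if name.startswith("ppt/media/")
            and name.lower().endswith(VIDEO_EXTENSIONS)
        ]


def extract_embedded_videos(pptx_file, target_dir):
    """Extract the video members into target_dir and return their names."""
    names = list_embedded_videos(pptx_file)
    with zipfile.ZipFile(pptx_file) as archive:
        archive.extractall(target_dir, members=names)
    return names
```

The extracted files can then be passed through ffmpeg in the same way as the temporary file mentioned above.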
I've been looking for 2 days for a solution to this problem but haven't found anything yet. I need to create a PDF file with only one page, which shows its content in a table (like a timetable). I'd like to fill the cells with string values that I am getting from an XML file.
PDFKit hasn't been useful so far, and Quartz 2D doesn't appear to be a solution because it's quite complicated to draw a table that I can change easily. Are there any libraries to just create a one-page PDF showing a table? Thanks
The easiest approach with the greatest flexibility would probably be to generate the table as HTML, use a WebView to display it and render it to PDF using NSView's dataWithPDFInsideRect: method.
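The first step of that approach, turning the XML into an HTML table, might look roughly like the sketch below. It is written in Python purely for illustration; in a Cocoa app you would build the same string in Objective-C and load it into the WebView. The XML layout (a timetable root with row and cell elements) is a hypothetical example, since the question does not show the real file.

```python
import xml.etree.ElementTree as ET
from html import escape


def timetable_to_html(xml_text):
    """Convert <timetable><row><cell>..</cell></row></timetable> to HTML.

    The element names are assumptions made for this example.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall("row"):
        # Escape cell text so characters like & or < stay valid HTML.
        cells = "".join(
            f"<td>{escape(cell.text or '')}</td>"
            for cell in row.findall("cell")
        )
        rows.append(f"<tr>{cells}</tr>")
    return '<table border="1">' + "".join(rows) + "</table>"
```

Because the table is plain HTML, changing its look later is a matter of editing markup or CSS rather than redrawing it by hand.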
I'm trying to understand how to make my site reachable by Google image search spiders.
I like last.fm's solution, and I thought of using a technique like the one their staff use to let Google find artist images on their pages.
When I search for an artist on Google image search, more often than not I find an image from the last.fm artist page. Here is an example:
If I search for the band Pure Reason Revolution, it brings me here, the artist's image page:
http://www.last.fm/music/Pure+Reason+Revolution/+images/4284073
Now if I take a look at the image file, I can see it's named:
http://userserve-ak.last.fm/serve/500/4284073/Pure+Reason+Revolution+4.jpg
So if I try to understand how the service works, I can break the URL down like this:
http://userserve-ak.last.fm/serve/ the server that serves the images
500/ the selected size for the image
4284073/ the image ID in the database
Pure+Reason+Revolution+4.jpg the image name
I find it hard to believe that the real filename of the image is Pure+Reason+Revolution+4.jpg, because of overwrite problems when a user uploads it. In fact, if I type:
http://userserve-ak.last.fm/serve/500/4284073.jpg
I probably find the real image location and filename.
I see this can be done with the mod_rewrite engine, but with this technique, will the image be easily reachable by search engines and easily archived?
My question is: is there a guide or tutorial on this kind of technique, or something similar?
In my opinion, the best resource for your question is Google itself.
One of its guides targets Google image search and provides some guidelines:
Don't embed text inside images
Tell us as much as you can about the image
Give your images detailed, informative filenames
Create great alt text
Anchor text
Provide good context for your image
Think about the best ways to protect your images
Create a great user experience
Source: Images - Webmaster Tools Help.
As for last.fm, one of the suggestions is:
Give your images detailed, informative filenames
The filename can give Google clues about the subject matter of the image. Try to make your filename a good description of the subject matter of the image. For example, my-new-black-kitten.jpg is a lot more informative than IMG00023.JPG. Descriptive filenames can also be useful to users: If we're unable to find suitable text in the page on which we found the image, we'll use the filename as the image's snippet in our search results.
So yes, last.fm uses mod_rewrite to give its images informative filenames, which Google likes.
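To make the mapping concrete, here is a small Python sketch of the kind of rewrite being discussed: a descriptive URL is mapped back to a storage path that depends only on the size and the image ID. This is a guess based on the URLs in the question, not last.fm's actual configuration, and the pattern and function name are my own.

```python
import re

# /serve/<size>/<image id>/<descriptive name>.<ext>
PRETTY_IMAGE_URL = re.compile(
    r"^/serve/(?P<size>\d+)/(?P<image_id>\d+)/[^/]+\.(?P<ext>jpg|png|gif)$"
)


def rewrite_image_path(path):
    """Map a descriptive image URL path to the assumed internal path.

    Paths that do not match the descriptive pattern are returned unchanged.
    """
    match = PRETTY_IMAGE_URL.match(path)
    if match is None:
        return path
    # The descriptive name is ignored; only size, ID and extension matter.
    return "/serve/{size}/{image_id}.{ext}".format(**match.groupdict())
```

With mod_rewrite the same idea would be expressed as a single RewriteRule; the point is that the descriptive part of the filename exists only for users and crawlers.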
There are a few more guides out there. None of them is official, but they can help you anyway:
http://www.tareeinternet.com/forum/seo/236-optimizing-google-image-search.html
http://www.doshdosh.com/how-to-optimize-for-google-images-for-more-traffic/
http://creativebits.org/webdev/optimize_your_site_for_google_image_search
http://www.pearsonified.com/2007/01/get_53_percent_more_searches_with_one_tweak.php
The article pointed out by Tim covers most of it but I'd like to add that the title attribute on <img> tags is important too (but don't abuse it!).
To sum up:
Name your files well. apple.jpg is better SEO-wise than PIC2346.jpg. For spaces in filenames use a dash (-) and not an underscore (_). See Dashes vs. underscores for more info.
Always fill in the alt attribute. Keep in mind that most screen readers for blind people will read this attribute.
Fill in the title attribute when useful. Use a short statement describing the image, not a whole paragraph!
The context of the image (the content around it) is very important too. If the image fits the surrounding content it will give you more SEO "points".
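The first three points above can be put together in a short Python sketch: one helper turns a description into a dash-separated filename, and another builds an img tag with alt and an optional title. The helper names and the exact slug rules are assumptions made for illustration.

```python
import re
from html import escape


def seo_filename(description, extension="jpg"):
    """Turn a short description into a lowercase, dash-separated filename."""
    # Replace every run of non-alphanumeric characters with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{slug}.{extension}"


def img_tag(src, alt, title=None):
    """Build an <img> tag; alt is always set, title only when provided."""
    attrs = f'src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}"'
    if title:
        attrs += f' title="{escape(title, quote=True)}"'
    return f"<img {attrs}>"
```

For example, seo_filename("My new black kitten") gives my-new-black-kitten.jpg, matching the filename Google uses in its own guideline.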