Cross platform RTF control? - cross-platform

Does anyone know of an RTF control that can be used on Linux/Windows/Mac? It's unfortunate that I have to mention it, but it actually has to be able to save and open rtf files... unlike wxWidgets wxRichTextCtrl for instance.
Edit: Thanks to HappySmileMan for his reply. Better still if it's more of a standalone and not a part of a large library that it would depend on.
Edit: ... and it doesn't look like it can open rtf files... ugh.

RTF is simply not that common; it's a messy format controlled by Microsoft, basically a text dump of the .doc format. The only open source RTF implementations I know of are in AbiWord, OpenOffice, and KWord. All are cross-platform, but probably none qualify as "controls" to your liking (though AbiWord has a Bonobo interface, and KWord has a KPart, so they can be embedded, albeit in a heavyweight fashion).

Qt's control is HTML, not RTF (though foobar may just mean rich text, in which case it would be fine)

It seems that what I want (cross platform rtf control that reads and writes actual rtf files) doesn't exist, at least not for free and open source.
...I'd accept this answer but it doesn't seem possible.

If I understand the question correctly, the feature you are looking for is in the Qt toolkit.
Some info on this can be found at https://doc.qt.io/qt-5/richtext.html

Related

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allowing you to visualize the structure and relationships.
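As a rough sketch of the page-restricted conversion described above: pdftops accepts `-f` (first page) and `-l` (last page), so a small wrapper can convert just the pages of interest. The function names and file names here are hypothetical, and the sketch assumes pdftops is installed and on the PATH.

```python
import shutil
import subprocess


def pdftops_command(pdf_path, ps_path, first_page, last_page):
    """Build the argument list for converting a page range to PostScript."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("invalid page range")
    # -f / -l select the first and last page to convert.
    return ["pdftops", "-f", str(first_page), "-l", str(last_page),
            pdf_path, ps_path]


def convert_pages(pdf_path, ps_path, first_page, last_page):
    """Run pdftops on the given page range (assumes poppler is installed)."""
    if shutil.which("pdftops") is None:
        raise RuntimeError("pdftops not found on PATH")
    subprocess.run(pdftops_command(pdf_path, ps_path, first_page, last_page),
                   check=True)
```

Calling `convert_pages("input.pdf", "page2.ps", 2, 2)` would write only page 2. This still yields whole pages rather than a selected area, so it is a starting point and not a complete answer.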
http://sourceforge.net/projects/itextrups/

Editing `ods` file in C++ code

I need to edit a LibreOffice Calc document programmatically in C++. I know that there is the odfkit library, which uses webodf, but it looks like it doesn't support editing .ods files.
Is there any alternative that offers this feature?
LibreOffice has an API, called UNO, for controlling it from another process. So if you need something more complicated, that would be the simplest route.
If you just need some simple transformation, the other option is to unpack the file with a plain old zip library (libzip, libarchive, ...) and modify the XML manually.
The opendocument site also mentions lpOD, but the website seems defunct, and while a search comes up with something that looks relevant, I am not sure whether there is anything usable.
See the SDK documentation, which has many examples.
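The unpack-and-edit route can be sketched as follows. The sketch is in Python for brevity (the same steps apply with libzip in C++), and the function names are hypothetical. It assumes the cell text to change sits directly in `text:p` elements of `content.xml`, and it keeps the `mimetype` entry first and uncompressed, as the OpenDocument packaging rules expect. One caveat: ElementTree may drop namespace declarations that are only referenced inside attribute values (formula prefixes, for example), so a real tool would probably want a more careful XML writer.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def _rewrite_content(xml_bytes, old, new):
    """Replace `old` with `new` in the text of every text:p element."""
    # Register the document's own prefixes so they are written back unchanged.
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes),
                                         events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    for paragraph in root.iter("{%s}p" % TEXT_NS):
        if paragraph.text and old in paragraph.text:
            paragraph.text = paragraph.text.replace(old, new)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def replace_text_in_ods(src_path, dst_path, old, new):
    """Copy an .ods archive, rewriting only content.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w") as dst:
        # infolist() preserves entry order, so mimetype stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                data = _rewrite_content(data, old, new)
            # Keep each entry's original compression (mimetype is stored).
            dst.writestr(info.filename, data,
                         compress_type=info.compress_type)
```

Anything beyond simple text substitution (formulas, styles, adding rows) gets fiddly quickly this way, which is where UNO becomes the more practical choice.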

How to open PDF and read it?

How can I open a PDF file and read some of its contents with Python (this language is preferred, however Ruby, Perl or PHP are fine too), in case the text is recognized (not just an image), or report that it's impossible without OCR? TIA
Update: thanks for the solutions, I'm sure some of them will suit me fine.
#RichH, I have a PDF file, and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, in case it's text-based, extract some of its contents.
For Perl, check out these modules:
PDF::API2
CAM::PDF
Parsing a PDF and making something useful out of it is hard, because the format is focused on preserving layout. Text can be stored so that each letter is positioned individually and, depending on the font, the text might also be stored as graphics.
Libraries I know of for reading PDFs include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for different languages.
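For the "is it text-based or just an image?" part of the question, one possible heuristic (an assumption on my part, not something the libraries above provide) is to run the `pdftotext` command-line tool on the first few pages and check whether any real text comes out. The function names and the `min_chars` threshold are hypothetical, and `pdftotext` is assumed to be installed and on the PATH.

```python
import shutil
import subprocess


def has_extractable_text(extracted, min_chars=20):
    """True if the text holds at least min_chars non-whitespace characters."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def pdf_is_text_based(pdf_path, last_page=3):
    """Extract the first pages with pdftotext and look for real text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-l N" stops after page N; "-" sends the text to standard output.
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), pdf_path, "-"],
        check=True, capture_output=True)
    text = result.stdout.decode("utf-8", errors="replace")
    return has_extractable_text(text)
```

A scanned document typically produces only form feeds and blank lines, so it falls below the threshold and would need OCR. A PDF with a hidden OCR text layer would count as text-based here.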

Is there a reliable way to determine if a PDF was generated from a Powerpoint file?

Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and other things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programmatically, so you'll need to find a library to read this metadata that works with your language. Here is a list of some XMP tools.
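As a rough illustration of reading that metadata without a dedicated library, the sketch below scans the raw bytes of a PDF for a `/Creator (...)` entry in the document information dictionary and for an `xmp:CreatorTool` element in the XMP packet. The function names are hypothetical. It is a naive approach: it will miss metadata held in compressed object streams, encrypted files, hex-encoded strings, and XMP written in attribute form, so a proper PDF or XMP library is still the more reliable route.

```python
import re

# /Creator (...) in the Info dictionary; allows backslash-escaped characters.
_INFO_CREATOR = re.compile(rb"/Creator\s*\(((?:\\.|[^\\)])*)\)")
# <xmp:CreatorTool>...</xmp:CreatorTool> in an uncompressed XMP packet.
_XMP_CREATOR = re.compile(rb"<xmp:CreatorTool>(.*?)</xmp:CreatorTool>", re.S)


def find_creator_strings(pdf_bytes):
    """Return every creator string found by the two naive patterns."""
    found = []
    for pattern in (_INFO_CREATOR, _XMP_CREATOR):
        for match in pattern.finditer(pdf_bytes):
            # Drop NUL bytes so UTF-16 encoded strings still compare sensibly.
            text = match.group(1).decode("latin-1").replace("\x00", "")
            found.append(text.strip())
    return found


def looks_like_powerpoint(pdf_bytes):
    """True if any creator string mentions PowerPoint."""
    return any("powerpoint" in s.lower()
               for s in find_creator_strings(pdf_bytes))
```

Typical use would be `looks_like_powerpoint(open(path, "rb").read())`. A negative result only means nothing was found in plain sight, not that the file did not come from PowerPoint.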
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator, and many others. It's up to the converters to embed specific information in the PDF file; even if you find a way to detect PowerPoint-sourced PDFs from one converter, the same method may not work for another.
Even longer answer:
No, I don't think so, because of the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only tool that produces overlapping text and images. I think it's much better to detect the actual layout of the PDF file. If images and text overlap, then you do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser such as iTextSharp or PDFNet. Using the library of your choice, find all image rectangles and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects, ignoring image-to-image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
Best of luck to you
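The overlap check described in this answer could look roughly like the sketch below. It assumes the text and image bounding boxes have already been extracted by whichever PDF library you use; the `Rect` type, the function names, and the 0.25 threshold are all hypothetical. For simplicity it compares every text box with every image box instead of sorting the rectangles first, which is fine for the handful of boxes on a typical page.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def area(r):
    """Area of a rectangle, treating inverted boxes as empty."""
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def intersection_area(a, b):
    """Area shared by two axis-aligned rectangles (0 if they are disjoint)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def text_over_image_ratio(text_rects, image_rects):
    """Fraction of the total text area that lies on top of images."""
    total = sum(area(t) for t in text_rects)
    if total == 0:
        return 0.0
    covered = 0.0
    for t in text_rects:
        overlap = sum(intersection_area(t, i) for i in image_rects)
        # Cap at the text box's own area in case images overlap each other.
        covered += min(area(t), overlap)
    return covered / total


def page_looks_sane(text_rects, image_rects, threshold=0.25):
    """Accept the page if at most `threshold` of the text sits on images."""
    return text_over_image_ratio(text_rects, image_rects) <= threshold
```

Image-to-image overlaps are ignored by construction, since only text-against-image pairs are ever compared. The threshold would need tuning against real documents.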
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
Some PPT-to-PDF converters preserve the creator in comments at the beginning of the PDF.
I think that PDFs generated from most applications look much the same. The file may have some metadata that you can read...

Read existing PDF file with all format information

I want to read an existing PDF file and get not only the text, but also the format information such as font (bold, italic...) and paragraphs. Is there a code library for doing this, and is it open source or commercial?
I am on Windows and favor C# libraries, but C/C++ is also acceptable.
I can very much recommend
pdflib (http://www.pdflib.com/).
It's commercial, but it also has a lite version which you can use for free privately. It contains a great deal of functionality and is available for all platforms.
I'd echo Mr. Meyers on this. There appear to be a number of them; search for "pdf parser library" (plus your language) in your favorite search engine.
A few top hits:
http://www.lowagie.com/iText/
http://metacpan.org/pod/PDF::Parse
http://podofo.sourceforge.net/
http://www.vicman.net/download/13733/ (several for .NET)
Note that if you're wanting to edit an existing PDF, you might want to read this:
http://1t3xt.info/tutorials/faq.php?branch=faq.pdf_in_general&node=replace_word
The Pdfium.Net SDK can also help you. Via this API you can get access to a collection of text, images and other objects and their properties.
Please note I work at the company that develops this API.