How to remove all text color attributes from a QTextDocument? - qtextdocument

I've got a QTextDocument read from an HTML file; given a QString of HTML data named topicFileData, I do topicFileTextDocument.setHtml(topicFileData);. I then want to strip off all of the color information, making the whole document just use the default foreground and background brush. (I do not want to explicitly set the text to be black text on a white background; I want to remove the color information from the document.) (Background info: the reason I need to do this is that there are spans within the document that are erroneously set with a black foreground color, rather than just having no color information set, and that causes those spans to display as black-on-black when my app is running in "dark mode", when Qt changes the default text background brush to be black instead of white.)
Here's what I tried:
QTextCursor tc(&topicFileTextDocument);
tc.select(QTextCursor::Document);
QTextCharFormat noColorFormat;
noColorFormat.clearForeground();
noColorFormat.clearBackground();
tc.mergeCharFormat(noColorFormat);
This does not work, unfortunately; it looks like mergeCharFormat() does not understand that I want the clearForeground() and clearBackground() actions to be merged in to strip off those attributes.
I can do tc.setCharFormat(noColorFormat); instead, of course, and that does strip off the color attributes correctly; but it also obliterates all of the other character format info (font, etc.), which is not acceptable.
So, ideally I'd like to find an API that lets me explicitly remove a given text attribute from a QTextDocument. Alternatively, I guess I need to loop through all the spans of the QTextDocument one by one, get the char format of the current span, remove the color attributes from the format, and set the modified format back onto the span. That would be fine; but I have no idea how to loop over spans in that way. Thanks for any help.

Instead of creating a new instance of QTextCharFormat, update the current format and reapply it on the QTextEdit;
default = QTextCharFormat()
charFormat = self.textCursor().charFormat()
charFormat.setBackground(default.background())
charFormat.setForeground(default.foreground())
self.textCursor().mergeCharFormat(charFormat)

A sub-optimal solution that I have found as a workaround is to actually edit the HTML data string before I create the QTextDocument, using a regex:
topicFileData.replace(QRegularExpression("(;? ?color: ?#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f])"), "");
This works for my situation, because all of the colors in my HTML file are set with color: #XXXXXX style attributes that can be stripped out of the HTML itself. This is fragile, however; colors specified in other ways would not be stripped, and if the body text of the HTML document happened to contain text that matched the regex, the regex would modify it and thus corrupt the content of the document. So I don't recommend this solution, and I won't be accepting it. If somebody can offer a better solution that would be preferable.

Related

PyQt5: Avoid reducing multiple spaces in QLabel / QPlainTextEdit

In PyQt5, both QLabel and QPlainTextEdit Widgets appear to shrink multiple whitespaces in to one. Say if I set the label text to: "abc abc abc", text shown is "abc abc abc".
Apparently, this is happening because of the default html formatting these widgets are using.
Is there a way to set these widgets to show the original text with multiple spaces keeping the original formatting ?
Edit:
QLabel- I was able to get this sorted by introducing <pre> tags. Can anyone help with the QPlainTextEdit ?
The QPlainTextEdit class does not reduce multiple whitespace, and it does not support html at all (as is suggested by the name). The QLabel class also does not reduce multiple whitespace unless you explicitly include html tags (and if you can't avoid that, you can simply use <pre> tags where necessary).
However, I suspect you may be using a variable-width font that has relatively narrow space characters, so the apparent whitespace reduction is actually illusory. If you use a fixed-width font instead, the illusion should disappear.

Tabulator - formatting print and PDF output

I am a relatively new user of Tabulator so please forgive me if I am asking anything that, perhaps, should be obvious.
I have a Tabulator report that I am able to print and create as a PDF, but the report's formatting (as shown on the screen) is not used in either output.
For printing I have used printAsHtml and printStyled=true, but this doesn't produce a printout that matches what is on the screen. I have formatted number fields (with comma separators) and these are showing correctly, but the number columns should be right-aligned but all of the columns appear as left-aligned.
I am also using Tree View where the tree rows are coloured differently to the main table, but when I print the report with a tree open it colours the whole table with the tree colours and not just the tree.
For the PDF none of the Tabulator formatting is being used. I've looked for anything similar to the printStyled option, but I can't see anything. I've also looked at the autoTable option, but I am struggling to find what to use.
I want to format the print and PDF outputs so that they look as close to the screen representation as possible.
Is there anywhere I could look that would provide examples of how to achieve the above? The Tabulator documentation is very good, but the provided examples don't appear to explain what I am trying to do.
Perhaps there are there CSS classes that I am missing or even mis-using? I have tried including .tabulator-print-table in my CSS, but I am probably not using it correctly. I also couldn't find anything equivalent for producing PDFs. Some examples would help immensely.
Thank you in advance for any advice or assistance.
Formatting is deliberately not included in these, below i will outline why:
Downloaders
Downloaded files do not contain formatted data, only the raw data, this is because a lot of the formatters create visual elements (progress bar, star formatter etc) that cannot be replicated sensibly in downloaded files.
If you want to change the format of data in the download you will need to use an accessor, the accessorDownload option is the one you want to use in this case. The accessors transform the data as it is leaving the table.
For instance we could create an accessor that prepended "Mr " to the front of every name in a column:
var mrAccessor= function(value, data, type, params, column, row){
return "Mr " + value;
}
Assign it to a columns definition:
{title:"Name", field:"name", accessorDownload:mrAccessor}
Printing
Printing also does not include the formatters, this is because when you print a Tabulator table, the whole table is actually rebuilt as a standard HTML table, which allows the printer to work out how to layout everything across multiple pages with column headers etc. The downside of this is that it is only loosely styled like a Tabulator and so formatted contents generated inside Tabulator cells will likely break when added to a normal td element.
For this reason there is also a accessorPrint option that works in the same way as the download accessor but for printing.
If you want to use the same accessor for both occasions, you can assign the function once to the accessor option and it will be applied in both instances.
Checkout the Accessor Documentation for full details.

Why is it so hard to convert PDF to plain text?

I needed to convert some PDF back to text. I tried many soft and online tools and result was always mediocre.
Why is it so difficult technically speaking ?
Let's not assume you are talking about PDFs which merely wrap some bitmap image because it should be clear that in that case you can only resort to OCR with all its restrictions.
Let's instead assume that text is drawn in the PDF at hand.
What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some setting the font to use by the instructions to come, some setting the text position and direction to use by the instructions to come, and some actually drawing text given by "string arguments".
Text extraction is the task of taking the sequence of instructions from a content stream and instead of drawing the text as indicated by the font and position setting instructions, to export it in a sensible order using a standard encoding, usually the encoding of the character type of the used programming language / platform.
The first problem is to understand the encoding of the string arguments of those text drawing instructions:
each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents, you always have to take the current font into account (some extremely simple text extractors ignore this and, therefore, fail pretty often to return something sensible);
there are a large number of predefined encodings, some reminding of encodings you know, e.g. WinAnsiEncoding, many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed-multibyte; so a text extractor must support very many encodings to start with;
encodings also may be completely ad-hoc and arbitrary; in particular in case of embedded subset fonts one often sees ad-hoc encodings generated by dealing out character codes from some starting value whenever one is needed; i.e. the first glyph in a given font used on a page is given the starting value as code, the next, different glyph is given the starting value plus one, the next, different one the starting value plus two, etc; "Hello World" and a starting value of 48 (ASCII value of '0') would result in "01223453627"; these fonts may contain a mapping to Unicode but they are not required to.
The next problem is to make sense out of the order of the strings:
the string drawing instructions may occur in an arbitrary order, e.g "Hello" might be drawn "lo" first, then after moving back "el", then after again moving back "H"; to extract the text one cannot ignore text positioning instructions and simply concatenate text strings, you always have to take the current position into account (some simple text extractors ignore this and, therefore, can fail to return something sensible);
multi-columnar text may present a difficulty, text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
Another problem is to recognize formatting or styling artifacts:
spaces between words need not be created by drawing a space glyph, it may also be done by text position changing instructions; text extractors not trying to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand the same technique can be used to draw adjacent glyphs at an optimal distance, aka kerning; text extractors trying to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;
sometimes selected words are printed s p a c e d o u t for extra emphasis; in the extracted text these gaps might be presented as space characters which automatic postprocessing of the text may see as word separators;
usually for bold text one uses a different, bold font program; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up having some duplicate characters in the output.
More problems arise due to incomplete or wrong extra information:
ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; there e.g. are many questions here on stack overflow dealing with incorrect ToUnicode maps for Indian writings; the text extraction results reflect these errors;
there even are PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; this is used by some PDF creators to allow correct copy&paste from some programs (preferring an ActualText entry in such a situation) while injecting errors in the output of other programs (preferring ToUnicode information then).
Yet another problem arises if you expect the text extractor to extract only text eventually visible in the page:
text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these in mind;
text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;
text may be drawn using the same color as the background; to recognize this, a text extractor can not only look at the current instruction and a few graphics state details, it has to take into account anything drawn beforehand in the location of the text;
text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;
text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must for each glyph keep track of the color its drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color can be interesting, e.g. some shading colors; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
Furthermore, text may be drawn where text extractors usually don't look:
some tools hide text from text extraction by putting it into a pattern and filling the page area with that pattern;
similarly there are type 3 fonts; each character in a type 3 font is represented by its own content stream; thus, a tool can draw all text in the content stream of a single type 3 font glyph and then draw that glyph on the page.
...
You surely have meanwhile gotten an idea why text extraction results can be less than optimal. And be assured, the list above is not complete, there still are more complications for text extraction.

Add bold formatting to localized string that has placeholders

I have a localization string with a placeholder:
Verb {0}
I use this string in my view-model to return a string to my view that, in turn, is displayed in a TextBlock. Easy enough. But a new requirement has arisen saying that the "Verb" portion (everything other than the placeholder's inserted value) be displayed in bold.
Using a string with placeholders seems like the typical and easiest way to indicate word order. So the first question, then, is: where should I parse the localization string in order to add the bold formatting? The parse operation will need knowledge of the original placeholder's location. So far, the view-model has been responsible for utilizing the localization strings by using string.Format to insert values and return its result to the view. If I leave this responsibility in the view-model, as is probably necessary, then the view-model also needs to return rich text.
Is binding to rich text even supported by RichTextBlock? Even if it is supported, I've never before had a view-model return formatted text before. It initially feels sacrilegious to a follower of MVVM-ism, but perhaps upon further consideration I may find it acceptable.
What's the best way to add bold formatting to a localized string that has placeholders? Is returning rich text from the view-model the best way?

Having trouble interperating ansi color codes

I have spent the weekend working on a personal project and got stuck here. Basically, I need to turn
[0;37m[33m o0==============================~o[0]o~==============================0o
into
o0==============================~o[0]o~==============================0o (only this text would be yellow now)
Using cocoa's regex functionality, I was able to find and capture the "[0;", "37m" and "[33m" individually. the "0;" indicates the server's desire for any previous text styling to be removed and returned to the default which is black background and white text. The "37m" indicates that the server would like for text to be colored white (not sure why this is here, but this is what the server sends). The final "33m" indicates that the server wants the text to be colored yellow. My code correctly finds, strips out, and identifies the requested color changes in the string, but I am having trouble applying these colors to the NSAttributedString I create. The ranges supplied by the regex searches are no longer valid once I strip out the color sequences in the final string, what is an effective way to figure out where the color changes should be applied to the stripped string? In this example, all the color codes are supplied at the beginning, but in other cases, the color codes could be in the middle to cause the string to change color mid-line. NSAttributedString can handle this if I could figure out the proper ranges to assign the requested colors to.
Now that Lion is out I can post the answer. Basically you can use the fancy regex abilities in Lion to figure out what is up. The code to do this (which needs to be refactored, but at least it works) can be found here:
https://github.com/sgoodwin/Turbo-Mud/blob/experiment/Turbo%20Mud/Turbo_MudAppDelegate.m