Soft PDF documents - pdf

In fact, we have two types of PDF documents:
Soft documents (converted from Word or LaTeX to PDF).
Hard documents (converted from scanned images to PDF).
I am only interested in soft documents.
I am trying to conceal information (using a specific steganography method) in an existing PDF document, and I want to embed the message by slightly modifying the positions of the characters. I know that within a line all characters share the same y-coordinate but have different x-coordinates, so I can insert some bits by slightly modifying the x-coordinate of each character. If I instead inserted bits by modifying the y-coordinates of characters on the same line, that would be easy to detect (because they are supposed to share the same y-coordinate). That is why I want to insert some bits by modifying the x-coordinates of characters within the same line, and other bits by modifying the y-coordinates of characters on different lines (one character per line; I do not know whether the gap between lines stays constant). I think this would make my method harder to detect.
But before attempting that, I would like answers to the following questions:
1) For a PDF generated by converting a Microsoft Word document: does the gap between lines remain the same, and is the gap between paragraphs constant?
2) Likewise, for a PDF generated from LaTeX: does the gap between lines remain the same, and is the gap between paragraphs constant? I would appreciate your opinion and a brief explanation.
3) When the text is justified, does the spacing within a given pair of letters remain the same? To be more precise, assume the PDF contains the text "happy new year and merry christmas, world is beautiful!". Is the space between "e" and "a" in "year" the same as in "beautiful"? In other words, if several words contain "ea", is the distance between "e" and "a" always the same in all of them? (Assume the font does not change anywhere in the PDF.)

You might have to explain more about what you want to do; that might make it easier to give good advice. Essentially it's important to understand the fundamental difference between applications such as Word (I'm hesitant to comment about Latex - I don't know enough about it) and PDF.
Word lives by words, sentences and paragraphs. Structured content is important and how that's laid out on the page is - almost - an afterthought. In fact, while recent versions of Word are much better at this, older versions of Word could produce a completely different layout (including pagination) simply by selecting a different printer. Trust me, I got bitten by that very badly at one point (stupid me).
PDF lives by page representation and structure is - literally - an afterthought. When a PDF file draws a paragraph, it draws individual characters or groups of characters. Sometimes in reading order, but possibly in a completely different order (depending on many factors). There is no concept of line height attributed to a character or paragraph style; the application generating the PDF simply moves the text pointer down a certain number of points and starts drawing the next characters.
So... to perhaps answer your question partially.
If you have Word documents generated by the same version of Word, on the same operating system, using the same font (not a font with the same name, the same font), you can generally assume that the basic text layout rules will be the same. So if you reproduce exactly the same text in both Word documents, you'll get exactly the same results.
However...
There are too many influencing parameters in Word to be absolutely certain. For example, line-height can be influenced by the actual words on a line. Having a bold word or a word in another font on a line (symbols can count!) can influence the amount of spacing between those particular lines. So while there might be overall the same distance between lines, individual lines may differ.
Also for example, word spacing is something that can quite easily be influenced with character styles and with text justification, as can inter-character spacing.
As for your question 3), apart from the fact that character spacing may change what you see, it's fair to assume that all things being equal the combination "ea" for example will always have the same distance. There are two types of fonts.
1) Those that define only character widths, which means that each combination of "ea" would logically always have the same width
2) Those that define character widths and specific kerning for specific character pairs. But because such kerning is for specific character pairs, the distance between "ea" would still always be the same.
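If you want to check this for a particular font yourself, here is a small Python sketch (using the fontTools library; the font file name is a placeholder) that prints the advance widths of "e" and "a" and any legacy 'kern' table adjustment for that pair. It does not look at kerning stored in a GPOS table.

from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")   # placeholder font file name
cmap = font.getBestCmap()       # code point -> glyph name
hmtx = font["hmtx"]             # glyph name -> (advance width, left side bearing)

for ch in "ea":
    glyph_name = cmap[ord(ch)]
    advance, lsb = hmtx[glyph_name]
    print(ch, glyph_name, "advance width:", advance)

# A legacy 'kern' table, if present, stores adjustments for specific glyph pairs.
if "kern" in font:
    pair = (cmap[ord("e")], cmap[ord("a")])
    for subtable in font["kern"].kernTables:
        value = getattr(subtable, "kernTable", {}).get(pair)
        if value is not None:
            print("kern adjustment for 'ea':", value)

If both the advance widths and the pair adjustment are fixed values, the "ea" distance is the same wherever it occurs (justification aside).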
I hope this makes sense, like I said, perhaps you need to share more about what you are trying to accomplish so that a better answer can be given...

@David's answer and @Jongware's comments to it already answered your explicit questions 1), 2), and 3). In essence, if you have an identical software setup (and at least in the case of MS Word this may include system resources not normally considered), a source document (Word or LaTeX) is likely to produce identical output concerning glyph positions. But small patches, maybe delivered as security updates from the manufacturer, may give rise to differences in this respect, most often minute but sometimes making lines or even pages break at different positions.
Thus, concerning your objective
to conceal information (by using a specific steganography method...) in an existing PDF document, [...] to insert the embedded message by slightly modifying the position of the characters.
Unless you want multiple identical software setups to be part of your security concept, I would propose that you do not try to hide the information as the difference between your manipulated PDF and the PDF without manipulations, but instead hide it in the less significant digits of the coordinates (e.g. encoding bits by making those digits odd or even, either before or after transformation with a given precision) in your manipulated documents, making comparisons with "originals" unnecessary.
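To illustrate the digit-parity idea with a minimal sketch (plain Python, operating on a coordinate value you have already extracted from the content stream; parsing and rewriting the stream itself is not shown, and the numbers are only illustrative):

# Encode one bit per coordinate by forcing its last digit (at a chosen
# precision) to be even or odd; decoding only needs the coordinate itself,
# not a comparison with an unmanipulated "original".

def embed_bit(coordinate, bit, precision=2):
    scaled = round(coordinate * 10 ** precision)
    if scaled % 2 != bit:
        scaled += 1                          # nudge the least significant digit
    return scaled / 10 ** precision

def extract_bit(coordinate, precision=2):
    return round(coordinate * 10 ** precision) % 2

x = 123.456                                  # illustrative glyph x coordinate
stego_x = embed_bit(x, 1)
print(stego_x, extract_bit(stego_x))         # -> 123.47 1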
For more definite propositions, please provide more information, e.g.
whom shall the information be concealed from: how knowledgeable and resourceful are they?
how shall the information extraction be possible: by visual comparison? By some small program runnable on any computer? By a very well-defined software setup?
what post-processing steps shall be possible without destroying the hidden information; shall e.g. signing by certain software packages be possible? Such post-processors sometimes introduce minor changes, for example by parsing numbers into float variables and writing them back from those floats later.

Related

Extracting text from a pdf...selectively

Interesting challenge... I'm looking for ways to extract data from a PDF, selectively. These are a collection of research abstracts which consistently have pieces of text I don't want (e.g. authors' names and email addresses, web links, cities, etc.).
The body of the text is what I want, and I'd looked at using stopwords as a way to solve the problem, but it quickly becomes counterproductive (many of the stopwords are actually necessary words within the text body I need).
So, is there a way to do almost the opposite of the stopword approach, based only on the large areas of text you do want? For example, where there is a title and a block of text (e.g. object, methods, results), could these sections of text be selectively extracted?
To add a bit of a challenge further, there isn't much consistency to the documents (so they don't all have the same headings or length).
If anyone has any experience or tips and recommendations it would be really helpful, as the alternative of manual copy and paste just isn't sustainable.
Many thanks,
Graham.

How to use borb and a Translate API to translate a PDF while maintaining formatting?

I found borb - a cool Python package to analyze and create PDFs.
And there are several translation APIs available, e.g. Google Translate and DeepL.
(I realize the length of translated text is likely different than the original text, but to first order I'm willing to ignore this for now).
But I'm not clear from the borb documentation how to replace all texts with their translations, while maintaining all formatting.
Disclaimer: I am Joris Schellekens, the author of borb.
I don't think it will be easy to replace the text in the PDF. That's generally something that isn't really possible in PDF.
The problem you are facing is called "reflowing the content", the idea that you may cause a line of text to be longer/shorter. And then the whole paragraph changes. And perhaps the paragraph is part of a table, and the whole table needs to change, etc.
There are a couple of quick hacks.
You could write new content on top of the pdf, in a separate layer. The PDF spec calls this "optional content groups".
There is code in borb that does this already (the code related to OCR).
Unfortunately, there is no easy free or foolproof way to translate pdf documents and maintain document formatting.
DeepL's new Python Library allows for full document translation in this manner:
import deepl

auth_key = "YOUR_AUTH_KEY"
translator = deepl.Translator(auth_key)
translator.translate_document_from_filepath(
    "path/to/original/file.pdf",
    "path/to/write/translation/to.pdf",
    target_lang="EN-US",
)
and the company now offers a free API with a character limit. If you have a few short pdfs you'd like to translate, this will probably be the way to go.
If you have many, longer pdfs and don't mind paying a base of $5.49/month + $25.00 per 1 million characters translated, the DeepL API is still probably the way to go.
EDIT: After attempting to use the DeepL full document translation feature with Mandarin text, I found this method is far from foolproof/accurate. At least with the Mandarin documents I examined, the formatting of each document varied significantly, and DeepL was unable to accurately translate full documents over a wide range of formatting. If you need only a rough translation of a document, I would still recommend using DeepL's doc translator. However, if you need a high degree of accuracy, there won't be an 'easy' way to do this (read the rest of the answer). Again, however, I have only tried this feature with Mandarin PDF files.
However, if you'd like to focus on text extraction, translation, and formatting without using DeepL's full document translation feature, and are able to sink some real time into building software that can do this, I would recommend using pdfplumber. While it has a steep learning curve, it is an incredibly powerful tool that provides data on each character in the PDF, provides image area information, offers visual debugging tools, and has table extraction tools. It is important to note that it only works on machine-generated PDFs and has no OCR feature.
Many of the PDFs I deal with are in the Mandarin language and have characters that are listed out of order, but using the data that pdfplumber provides on each character, it is possible to determine their position on the page... for instance, if character n's "distance of left side of character from left side of page" (see the char properties section of the docs) is less than the distance for character n+1, and each has the same "distance of top of character from bottom of page", then it can reasonably be assumed that they are on the same line.
Figuring out what looks the most typical for the body of pdfs that you typically work with is a long process, but performing the text extraction while maintaining line fidelity in this manner can be done with a high degree of accuracy. After extraction, passing the strings to DeepL and writing them in an outfile is an easy task.
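In case it helps, here is a rough sketch of that line-grouping idea with pdfplumber (the file name is a placeholder and the vertical rounding is a crude tolerance; real documents usually need more careful clustering):

import pdfplumber

# Bucket characters by their vertical position, then sort each bucket left to
# right: characters with (roughly) the same "top" and increasing "x0" are
# treated as one line.
with pdfplumber.open("example.pdf") as pdf:      # placeholder file name
    page = pdf.pages[0]
    lines = {}
    for char in page.chars:
        key = round(char["top"])                 # crude same-line tolerance
        lines.setdefault(key, []).append(char)
    for top in sorted(lines):
        line = sorted(lines[top], key=lambda c: c["x0"])
        print("".join(c["text"] for c in line))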
If you can provide one of the pdfs you work with for testing that would be helpful!

How do I ensure my pdf-generating application supports as many language fonts as possible?

I am developing an application that generates a PDF based on user input. One of the user inputs is a foreign postal address in the native script of that country, which could possibly be anything. I know I can't support all possible glyphs, but I want to cover as much as reasonably possible. My plan right now is to:
Find a 'default' font that handles the easier languages (left-to-right languages with few glyphs, like most Latin alphabets, Cyrillic, Greek). I am thinking of the Ubuntu font because it has a very liberal license.
Find fonts for common languages/language sets like CJK and Arabic.
When I need to add text to a pdf, I try to find a font in my set that can handle all the codepoints in the string, starting with the default.
Does that sound like a reasonable thing to do, or is there an easier way? Is there a list of top N languages/writing systems I should be supporting?
I also wonder how web browsers do such a good job in displaying any language correctly (I haven't seen a 'tofu' character for unknown codepoint in a while.)
Depending on how big your application can be, you could take a look at Noto, which will "support all languages with a harmonious look and feel." But be warned that covering every writing system on the planet will require at least a gigabyte of fonts.
Browsers support a large range of writing systems ("languages") by relying on different fallback fonts supplied by the operating system. Only when those are exhausted will you see a tofu.
Does that sound like a reasonable thing to do, or is there an easier way?
Essentially items 1 and 2 together mean "collect enough fonts to cover a large enough portion of the Unicode code points". That obviously is necessary.
As @RoelN mentions in his answer, Noto might be a set of fonts to consider.
Item 3, though,
When I need to add text to a pdf, I try to find a font in my set that can handle all the codepoints in the string, starting with the default.
does not make sense. Of course, if there is such a single font, you could use it. But what if there is not?
So I would propose not to count on the existence of such a font but instead split your string into substrings that each consists of characters covered by a single font from your list, and draw your full string piece-wise, changing the font between pieces.
Most likely you will not only have to split your string by font but also by direction (RTL vs. LTR), at least in an intermediate step.
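A minimal sketch of that splitting step in Python, using fontTools to read each font's cmap (the font file names are placeholders, and the RTL/LTR handling mentioned above is left out):

from fontTools.ttLib import TTFont

# Placeholder font files; each entry is (path, set of covered code points).
font_files = ["NotoSans-Regular.ttf", "NotoSansArabic-Regular.ttf"]
coverages = [(path, set(TTFont(path).getBestCmap())) for path in font_files]

def split_by_font(text):
    """Split text into (font_path, substring) runs; None means no font covers the char."""
    runs = []
    for ch in text:
        chosen = next((path for path, covered in coverages if ord(ch) in covered), None)
        if runs and runs[-1][0] == chosen:
            runs[-1][1] += ch
        else:
            runs.append([chosen, ch])
    return runs

for font_path, chunk in split_by_font("Hello مرحبا"):
    print(font_path, repr(chunk))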
Is there a list of top N languages/writing systems I should be supporting?
Which writing systems you should support clearly depends on your use case. As you want to cover as much as reasonably possible, you probably should simply start with a font family like Noto and extend your list of fonts appropriately whenever your application logs a lookup failure for some character.

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
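A splitter along those lines can be sketched with the pypdf library (this is only an illustrative sketch, not the utility mentioned above; file names are placeholders):

from pypdf import PdfReader, PdfWriter

# Split "big.pdf" into chunks of at most 10 pages: big_chunk_0.pdf, big_chunk_1.pdf, ...
reader = PdfReader("big.pdf")                          # placeholder input file
for start in range(0, len(reader.pages), 10):
    writer = PdfWriter()
    for index in range(start, min(start + 10, len(reader.pages))):
        writer.add_page(reader.pages[index])
    with open(f"big_chunk_{start // 10}.pdf", "wb") as out:
        writer.write(out)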
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details on a deeper level, and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

Displaying Korean Characters - iOS App

I am trying to display Korean text in my iPhone app. The app appends the Unicode of letters one by one to an NSMutableString and displays the string on the screen after each letter is appended.
I understand that there are some rules for conjoining letters (Jamo).
Is there a function for automatically applying all these rules to a string of letters or do I need to write code to make changes (e.g., changing a consonant to a tail consonant if there is a vowel before it)?
FCA. It's you who sent email to me, right? Because the more detailed question is here, I will try (my best) to answer here instead of replying to your email.
By reading the whole text you and people wrote here, I figured out that you are making a Korean handwriting recognition software. So, you would not enjoy the luxury of the Korean input method provided by Apple.
There are two things for me to say. Let's go one by one. (I believe you are already aware of one of the two things I'm going to explain.)
How to compose Hangul text.
So, from reading your inquiry, it should not be about a Unicode composed/decomposed Korean string (or just a series of Ja (consonants) and Mo (vowels)). Your question looks to be about how to determine whether a consonant (your term is tail consonant, right?) a user writes is the last consonant of the current syllable or the beginning consonant of the next syllable.
Best thing is to learn Korean, but let me briefly explain it.
Let's say you write 소방차 (a fire truck).
You are to write : ㅅㅗㅂㅏㅇㅊㅏ
(Again I'm not talking about the decomposed form of Unicode. It's about how people write Korean text.)
When you type ㅗ (the 2nd char), the display system temporarily displays 소 by attaching the ㅗ to the preceding ㅅ, and it looks the result up in a Korean syllable table. (Although Hangul is assembled in the JoHap style (조합형), the composition style, Korean standards also define tables of allowed syllables in the WanSung style (완성형), the precomposed style. So you test the "assembled" syllable against the table to see whether such a syllable exists.) You will find "소" in the table, so you display "소".
Now the next char, "ㅂ", is written. Here it becomes a little complicated. Because there is a syllable "솝" in the table, it will first attach ㅂ to the preceding syllable, so it will display "솝". However, things are not completely determined yet. The user writes the next char, "ㅏ". It is certain that there is no syllable without a first/beginning consonant (Ja). It will look up the table, but fail to find a syllable "ㅏ".
So, it will guess that the ㅂ (edited from ㅅ; it was a typo) attached to the previous syllable actually belongs to the 2nd syllable, and it should display "소바". Now, ㅇ is typed. Then it tries to attach the ㅇ to the second syllable, so it displays 소방. (At this moment it can also look up 방 in the table, and it is found.)
Now, "ㅊ" is typed. Probably internally it can test 소방ㅊ, where ㅇ and ㅊ would sit together under 바 (I can't write it, because there is no such syllable with ㅇ and ㅊ together under 바, the way two consonants sit together in 밝). However, there is no such syllable. So, it instantly determines that ㅊ belongs to the next syllable.
Then "ㅏ" is typed. It will assemble ㅊ and ㅏ to make 차. When you press the space key or return key or any other white space key, it will finish composing Hangul.
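As an aside, in Unicode the modern syllables are laid out algorithmically, so composing a candidate syllable from jamo indices is simple arithmetic (whether that syllable is also in a narrower WanSung table would still be a separate lookup). A tiny Python sketch of the formula, reusing the 소방차 example:

# Unicode Hangul composition: 0xAC00 + (L * 21 + V) * 28 + T, where L is the
# leading-consonant index (0-18), V the vowel index (0-20) and T the trailing-
# consonant index (0 = no trailing consonant, 1-27 otherwise).
def compose_syllable(lead, vowel, tail=0):
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

print(compose_syllable(9, 8))       # ㅅ(9) + ㅗ(8)          -> 소
print(compose_syllable(7, 0, 21))   # ㅂ(7) + ㅏ(0) + ㅇ(21) -> 방
print(compose_syllable(14, 0))      # ㅊ(14) + ㅏ(0)         -> 차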
The walkthrough above is the simple case. In Korean, there are more complicated syllables like 빨, 꼭, 헗, etc. For the first consonants, 복자음 (BokJaUm, double consonants) like the ㅃ and ㄲ in 빨 and 꼭, people type ㅂ and ㄱ while pressing the shift key, and it will display ㅃ and ㄲ. So, picking up how many consonants there are and determining where they belong (previous syllable or next syllable) can be easy if the user types on a keyboard. (However, there are some nice Korean input methods for Windows and Xterm that allow you to type ㅂ twice to make ㅃ. It's kind of an intelligent feature. But testing text like 빱빠라빱, 흙을 can be complicated, because you end up testing 3 or 4 consonants grouped like {1,3}, {2,2}, {3,1}.)
The bad news is... because you are writing handwriting recognition, you may need to handle such complicated cases if you input recognized Hangul characters one by one into a Korean input method engine. However, if you write your own input method in your app, you can maintain your own state machine, so it can be easier. But as you can see, it's a trade-off: depending on an existing input method engine and feeding each character into it, versus writing your own. (Hmmm... wait... Maybe the input method engine can handle those complicated cases too.)
FYI, I would like to introduce two open source projects. One is a Korean input method Finder module for Mac, and the other is an input method engine with which you can make a Korean input method. Also, there is a Korean input method for X-Windows hosted here. If you prefer Windows project to look up, here is one.
The latter two were hosted at KLDP.net, a Korean open source project hosting site, but they were moved to Google code. As far as I can remember, "SaeNaRu" and "Nabi" (butterfly) can support typing the same consonant twice to make a double consonant.
For more detailed information, you can look up libhangul and nabi. (I remember that the input method part of the code was almost the same between libhangul and nabi before. But at that time they were separated and expected to evolve independently. So, I guess that they are different.)
OK. The first thing is done.
Now let's move on to the second issue. (This is the part I said you may know about already. But just to complete my explanation, let me explain this also.)
It's about which characters to choose as input to your probable Korean input method state machine or an engine like libhangul. There are basically two representations of composed (on-display) Hangul characters: composed and decomposed. The composed one contains fully composed characters. For example, in 사랑합니다, each syllable, 사, 랑, 합, 니, 다, is saved as such. They are not stored as ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
That is the composed representation in Unicode. This representation is usually used by text editors, etc. The other representation is the decomposed one in Unicode. It's like ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
This representation is usually used by file systems. For example, if you put a file name in Hangul on Windows, and access the folder which contains it from Mac, it will be displayed like ㅅㅏㄹㅏㅇㅎㅏㅂㄴㅣㄷㅏ although it is displayed as 사랑합니다 on Windows.
However, if memory serves, there is another set of characters which is just a list of Hangul consonants and vowels. Although they may look the same as or similar to decomposed syllables, they are actually different in that they are drawn in the middle of the space where a character is drawn. Their purpose is to present Hangul characters in Korean alphabet tables and the like, for educational purposes (or any other purpose).
So, I'm not sure which characters (i.e. the decomposed ones or the characters from the list of Hangul consonants and vowels) to feed into the input method state machine or input method engine you choose or implement. If you implement it yourself, it's your choice, but if you use some external library for the engine, you need to figure it out.
Also, as I mentioned in my blog post, there are two variants in each composed and decomposed representation, which are all defined in Unicode standard. So, well.. yeah.. I agree. It's quite a bit of work.
As for me, I tried to make an input method for Mac (when Apple announced they would get rid of the Finder plugin architecture for security reasons), but at that time libhangul (yeah, I tried to use it) was being changed a lot. So, until it stabilized, I decided to hold off. But I became very busy at work and tired when I got home, so I didn't make progress on my own input method. I believe the state of the libhangul project is much better now than ever, so it's worth at least taking a look at it.
Also, if you don't have Windows, it would be good to try hanterm or any xterm derivatives which supports Hangul input in itself. The source code will be available at their hosting web site.
Good luck with your project, and if there are more things to ask me, please do so.
Check out these system-level text-input facilities. I never used these, but they look promising.
http://developer.apple.com/library/ios/#documentation/StringsTextFonts/Conceptual/TextAndWebiPhoneOS/CustomTextProcessing/CustomTextProcessing.html#//apple_ref/doc/uid/TP40009542-CH4-SW8
http://developer.apple.com/library/ios/#documentation/UIKit/Reference/UITextInput_Protocol/Reference/Reference.html#//apple_ref/occ/intf/UITextInput
Because iOS doesn't support system-wide keyboard customization, everybody just uses the system-default input facility. And the handling of Hangul composition differs across operating systems and platforms (MS/Apple/Samsung/LG and others). So the best way is to use a system-supplied facility such as UITextField, for consistency for users. Or you should accurately simulate how your platform OS does it. Of course you can make it yourself, but users won't like it.
Though I'm no expert on this topic (the Korean Hangul compositor), I don't think there's a simple algorithm without table lookup. Anyway, if you really want to implement it yourself, these are the core problems you have to handle.
Composing your visual symbols into the consonants and vowels defined in Unicode.
Determining initial consonants / final consonants by the placement of vowels.
It wouldn't be so hard, but the ability to modify the preceding character sequence is required. You cannot implement Korean input as a one-way stream only, unless you have separate keys for initial/final consonants that look the same.
Unicode defines the full set of valid Jamo components. Usually those components are too many to be presented on a device, and doing so is also inefficient. Most Korean input systems decompose those Jamo again and compose them once more before composing the final letter. You can also identify and decompose them visually, just like Korean people do.
After you get the initial/final consonants and vowels defined in the Unicode standard, the Unicode normalization feature (such as -[NSString precomposedStringWithCompatibilityMapping]) will do the rest of the job.
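For illustration, here is a tiny Python sketch of that composition step, using NFC normalization from the standard library and assuming the input is already a well-formed sequence of conjoining jamo from the U+1100 block:

import unicodedata

# Conjoining jamo for 소방차: ᄉ ᅩ ᄇ ᅡ ᆼ ᄎ ᅡ
jamo = "\u1109\u1169\u1107\u1161\u11BC\u110E\u1161"
print(unicodedata.normalize("NFC", jamo))   # -> 소방차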
libhangul (code.google.com/p/libhangul) does the conversion! It has several functions to handle different types of keyboards (i.e., keyboards with different layouts) and to convert the keys to the Unicode code points of the Hangul.
It also has several functions which combine the Hanguls to make syllables (they basically implement table lookups that Eonil has mentioned in his response).
Libhangul stores the Hanguls in its buffer as it receives them (it does not output them). After receiving enough Hanguls and successfully converting them into a syllable, it outputs the syllable. Unfortunately, this is quite confusing for the user. The way around this is displaying the buffer content on the screen. After receiving a new Hangul, what has been displayed must be erased. If a syllable has been successfully formed, then the syllable is displayed. Otherwise, the buffer content is displayed again. Note that you can’t just display the new Hangul on the screen. You must erase what you have displayed before and read the previous Hanguls and the new one from the buffer and display them on the screen again.
The reason is that Libhangul may change the code for the previous Hanguls stored in the buffer to make it possible to combine them with the new Hangul. This way, you will get the updated Hanguls.
Also note if the user changes the location of the cursor, the buffer must be emptied.
Additionally, if the user presses backspace, then, the last Hangul displayed on the screen must be erased and must be removed from the buffer.
Libhangul also has some features for correcting typos. For example, if you type ᅡ and ᄉ, it converts them into 사.
Thank you JongAm Park and Eonil for your help and thoughtful comments! Since my reputation is less than 15 at this point, I can’t upvote your answers, but I will do when I can.