How to convert non-English characters into Unicode (UTF-8) - pdf

I am trying to copy articles from a newspaper.
Below is a sample screenshot of the text from the PDF.
When I copy it from the PDF of the newspaper and paste it into either Microsoft Word or Excel, it gives the following characters:
·Æ°·C≥
I believe the font is the shree bangala font (not 100% sure).
I have seen other fonts, such as Nirmala, with which I didn't face this issue while working in UTF-8 encoding.
It would be very helpful if someone could guide me on how to convert the above text.

Related

Copy-Paste Paragraph from PDF to LaTeX with accents

I have a PDF file in Spanish with words like Hipergráficas or año. But when I tried to copy-paste them (e.g. into Word or Pages), those words look like Hipergr'aficas and a~no.
I've tried opening the file with many text editors, without any result.
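If the pasted text consistently puts the accent mark in front of the letter it belongs to, as in the two examples above, a post-processing step might be enough. Below is a minimal Python sketch under that assumption. The `recompose` function and the `COMBINING` table are hypothetical names made up for illustration, and only the two marks seen in the question are handled. It will also touch any genuine apostrophe or tilde that happens to sit directly before a letter, so check the output.

```python
import re
import unicodedata

# Assumption: the PDF emits a spacing accent BEFORE the base letter.
# Map each spacing accent to the matching Unicode combining mark.
COMBINING = {
    "'": "\u0301",  # combining acute accent
    "~": "\u0303",  # combining tilde
}

def recompose(text: str) -> str:
    """Turn accent+letter pairs such as 'a into precomposed letters such as á."""
    def fix(match: re.Match) -> str:
        accent, letter = match.group(1), match.group(2)
        # NFC normalisation merges letter + combining mark into one code point
        # when a precomposed character exists.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return re.sub(r"(['~])([A-Za-z])", fix, text)

print(recompose("Hipergr'aficas"))
print(recompose("a~no"))
```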

Converting a bunch of XLSX files in a folder into CSV

The catch is that the xlsx files contain some Korean text, which changes to "??" on conversion to CSV.
First, convert the contents of the xlsx file from Korean to English as shown in the link below:
https://www.microsoft.com/en-us/translator/excel.aspx
Then proceed to convert xlsx to csv.
You might consider simply coding a loop saving as Unicode Text (*.txt) file format, and then changing the file extension to .csv
UTF-16 is useful if your Excel data contains any Asian characters e.g. Korean.
Note:
UTF-16 is not compatible with ASCII and requires Unicode-aware programs to display it, so be careful when exporting outside of Excel.
The options are discussed here
To continue quoting from there:
How to convert an Excel file to CSV UTF-16
Exporting an Excel file as CSV UTF-16 is much quicker and easier than
converting to UTF-8. This is because Excel automatically employs the
UTF-16 format when saving a file as Unicode (.txt).
So, what you do is simply click File > Save As in Excel, select the
Unicode Text (*.txt) file format, and then change the file extension
to .csv in Windows Explorer. Done!
If you need a comma-separated or semicolon-separated CSV file, replace
all tabs with commas or semicolons, respectively, in a Notepad or any
other text editor of your choosing (see Step 6 above for full
details).
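The tab-to-comma replacement described in the quote can also be scripted. Here is a rough Python sketch, assuming the input is the UTF-16 tab-delimited text that Excel's "Unicode Text" option produces. The function name `tsv_utf16_to_csv_utf8` is made up for this example. Using the `csv` module rather than a blind find-and-replace means a field that itself contains a comma gets quoted.

```python
import csv
import io

def tsv_utf16_to_csv_utf8(src_bytes: bytes) -> bytes:
    """Decode UTF-16 tab-separated text and re-encode it as UTF-8 comma-separated CSV."""
    # The "utf-16" codec uses the byte-order mark to detect endianness.
    text = src_bytes.decode("utf-16")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=",", lineterminator="\r\n")
    writer.writerows(reader)
    # "utf-8-sig" prepends a BOM, which helps Excel recognise the file as UTF-8.
    return out.getvalue().encode("utf-8-sig")

# Stand-in for the bytes of a file saved as "Unicode Text (*.txt)".
sample = "이름\t도시\r\n김민준\t서울, 한국\r\n".encode("utf-16")
converted = tsv_utf16_to_csv_utf8(sample)
print(converted.decode("utf-8-sig"))
```

For real files, read the bytes with `open(path, "rb")` and write the result with `open(path, "wb")`.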
Every text file has a character encoding for a character set. You have to pick one.
If you pick one that doesn't support all the characters in the file, what would you like to happen? Replacing with ? is a commonly used option.
Picking UTF-8 is a good choice for text exported from an Excel workbook (and for almost all documents) because it can encode the whole Unicode character set, which is what a workbook uses (as does VBA, BTW).
In any case, for a text file you have to communicate which encoding you use. For a CSV text file, you also have to communicate whether there is a header row, what the field separator is, what the text qualifier (quoting) is, how the text qualifier is escaped, what the line separator characters are, and what the column types are. (All of these are questions that Excel's text import wizard asks. Your users need the answers.)
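To make that concrete, here is a small Python sketch (Python rather than VBA, purely for illustration). It shows where the "??" comes from when the chosen encoding can't represent the text, and then writes a CSV with every convention spelled out explicitly. The sample values are made up.

```python
import csv
import io

text = "서울"  # two Korean characters

# An encoding that lacks these characters must either fail or substitute.
lossy = text.encode("ascii", errors="replace")
print(lossy)  # b'??'

# UTF-8 can represent every Unicode character, so it round-trips losslessly.
assert text.encode("utf-8").decode("utf-8") == text

# Spell out the CSV conventions instead of relying on defaults:
# separator, text qualifier, qualifier escape (doubling), line separator.
buf = io.StringIO(newline="")
writer = csv.writer(
    buf,
    delimiter=",",
    quotechar='"',
    doublequote=True,
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
)
writer.writerow(["city", "note"])               # header row
writer.writerow([text, 'say "hi", please'])     # field needing quotes
payload = buf.getvalue().encode("utf-8")
print(payload.decode("utf-8"))
```

Whoever reads the file needs the same list of settings to parse it back correctly.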

Special formatting in SPSS (subscript, superscript)

Is it possible in SPSS to insert superscript or subscript characters in labels, specifically axis labels?
For example, µV²/Hz. LaTeX commands don't work at all ($\mu V^2$) and there don't appear to be any appropriate fields in the property editor.
Does SPSS have the capability? I'm using Version 20 if that is relevant.
Variable and value labels are plain text. It is possible to use html or rtf text in places in the Viewer or via the TEXT extension command.
I think the only way to get superscript or subscript in SPSS is to use Unicode characters. You can find them on this Wikipedia page. All numbers are there, but some letters are missing.
I prefer using Unicode subscript/superscript characters in regular text. The characters will not change if you lose text formatting, though only a limited number of fonts support them.
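If you have many labels to build, the Unicode characters can be generated outside SPSS and pasted in. A minimal Python sketch follows. The helper names `to_superscript` and `to_subscript` are made up for this example, and only digits and plus/minus are mapped.

```python
# Translation tables from ASCII digits and signs to their Unicode
# superscript / subscript counterparts.
SUPERSCRIPT = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")
SUBSCRIPT = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")

def to_superscript(s: str) -> str:
    """Replace digits and +/- with Unicode superscript characters."""
    return s.translate(SUPERSCRIPT)

def to_subscript(s: str) -> str:
    """Replace digits and +/- with Unicode subscript characters."""
    return s.translate(SUBSCRIPT)

# "\u00b5" is the micro sign.
label = "\u00b5V" + to_superscript("2") + "/Hz"
print(label)
print("H" + to_subscript("2") + "O")
```

The printed strings are plain text, so they can go wherever a label accepts Unicode.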

TextFrame.Characters.Font.Name doesn't change the font of Chinese characters in Excel shape

The title pretty much tells it all.
My code looks like this:
sh.TextFrame.Characters.Font.Name = "SimSun"
This code only changes the font for English text and single-byte symbols.
All double-byte symbols and Chinese characters stay in the default font.
I have tried TextFrame2 as well, with the same result.
I am on Excel 2007.
Anyone who can help? Thank you.
I found a Microsoft help desk article in Japanese explaining the issue.
In VBA, Excel keeps the font name for single-byte English characters and the font name for double-byte Chinese/Japanese characters as separate properties.
Solution is as follows.
sh.TextFrame2.TextRange.Font.NameFarEast = "SimSun"
sh.TextFrame2.TextRange.Characters.Font.Name = "SimSun"
The first line changes the font of all double-byte characters in the shape's text box, and the second line changes the font of all single-byte letters.
Far East... wow

New words after pdf copy-paste

I have a PDF file. I select and copy "K([2.2.2]crypt)]5[Co2Sn17".
But the clipboard contains "KACHTUNGTRENUNG([2.2.2]crypt)]5ACHTUNGTRENUNG[Co2Sn17".
Any idea what "ACHTUNGTRENUNG" is? Is it a kind of protection?
There are likely a few extra (invisible) characters in the file. When you copy the text, the application you copy from translates the characters in the PDF file into something that can be stored on the clipboard. Most likely it does that by translating every character into the Unicode string that the PDF file stores for that character in the font used.
For most normal characters the Unicode string should be the same as the character you visually see; here you probably have invisible spaces in the PDF file that are called "achtungtrenung" in the font.
If you have the PDF file available somewhere, I'll be happy to take a look and verify this is indeed what is happening.
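If that is what is happening, the marker can simply be removed from the pasted text afterwards. A minimal Python sketch, assuming the marker always appears exactly as "ACHTUNGTRENUNG" and never as part of the real content (the function name `strip_marker` is made up for this example):

```python
MARKER = "ACHTUNGTRENUNG"

def strip_marker(text: str, marker: str = MARKER) -> str:
    """Remove every occurrence of the stray marker from pasted text."""
    return text.replace(marker, "")

pasted = "KACHTUNGTRENUNG([2.2.2]crypt)]5ACHTUNGTRENUNG[Co2Sn17"
print(strip_marker(pasted))  # K([2.2.2]crypt)]5[Co2Sn17
```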
It's extra characters between lines.
You can try the PDF Copy Paste software and see whether the portion you want can be converted to text in the form you prefer.