New words after PDF copy-paste

I have a PDF file. When I select and copy "K([2.2.2]crypt)]5[Co2Sn17",
the clipboard contains "KACHTUNGTRENUNG([2.2.2]crypt)]5ACHTUNGTRENUNG[Co2Sn17" instead.
Any ideas what "ACHTUNGTRENUNG" is? Is it some kind of protection?

There are likely a few extra (invisible) characters in the file. When you copy text, the application you copy from translates the characters in the PDF file into something that can be stored on the clipboard. Most likely it does so by replacing every character with the Unicode string that the PDF file stores for that character in the font being used.
For most normal characters, the Unicode string is the same as the character you see; here you probably have invisible spaces in the PDF file that are called "ACHTUNGTRENUNG" in the font.
If you can make the PDF file available somewhere, I'll be happy to take a look and verify that this is indeed what is happening.
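The hypothesis above can be illustrated with a toy model in Python. This is only a sketch: the glyph names, the `TO_UNICODE` table, and the helpers `copy_text` and `strip_marker` are invented for illustration, and no real PDF is parsed. It shows how an invisible glyph that maps to a whole word would surface in the clipboard, and a simple workaround for text that has already been pasted.

```python
# Toy model of the hypothesis above. The glyph names and the table are
# invented for illustration; no real PDF is parsed here.
TO_UNICODE = {
    "glyph_K": "K",
    "glyph_open_paren": "(",
    "glyph_invisible": "ACHTUNGTRENUNG",  # draws nothing, but maps to a word
}

def copy_text(glyphs):
    """Concatenate the Unicode string stored for each glyph, as a viewer might."""
    return "".join(TO_UNICODE[g] for g in glyphs)

def strip_marker(pasted, marker="ACHTUNGTRENUNG"):
    """Workaround: remove the stray word from text that was already pasted."""
    return pasted.replace(marker, "")

copied = copy_text(["glyph_K", "glyph_invisible", "glyph_open_paren"])
print(copied)                # KACHTUNGTRENUNG(
print(strip_marker(copied))  # K(

pasted = "KACHTUNGTRENUNG([2.2.2]crypt)]5ACHTUNGTRENUNG[Co2Sn17"
print(strip_marker(pasted))  # K([2.2.2]crypt)]5[Co2Sn17
```

The cleanup step assumes the marker never occurs legitimately in the text being copied.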

These are extra characters between lines.
You can try the PDF Copy Paste software and see whether the portion you want can be converted to text the way you prefer.

Related

Converting a bunch of XLSX files in a folder into CSV

The catch is that the XLSX files contain some Korean text, which turns into "??" when converted to CSV.
First convert the contents of the XLSX file from Korean to English as shown at the link below:
https://www.microsoft.com/en-us/translator/excel.aspx
Then proceed to convert the XLSX file to CSV.
You might consider simply coding a loop that saves each file in the Unicode Text (*.txt) file format and then changes the file extension to .csv.
UTF-16 is useful if your Excel data contains any Asian characters, e.g. Korean.
Note: UTF-16 is not fully compatible with ASCII files and requires Unicode-aware programs to display it, so be careful when exporting outside of Excel.
Information on the options is discussed here.
To continue quoting from there:
How to convert an Excel file to CSV UTF-16
Exporting an Excel file as CSV UTF-16 is much quicker and easier than converting to UTF-8. This is because Excel automatically employs the UTF-16 format when saving a file as Unicode (.txt).
So, what you do is simply click File > Save As in Excel, select the Unicode Text (*.txt) file format, and then change the file extension to .csv in Windows Explorer. Done!
If you need a comma-separated or semicolon-separated CSV file, replace all tabs with commas or semicolons, respectively, in a Notepad or any other text editor of your choosing (see Step 6 above for full details).
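Replacing tabs by hand in a text editor breaks down when a cell itself contains a comma. As a hedged alternative, the Python sketch below re-saves a tab-delimited UTF-16 text file (as produced by the Unicode Text export) as a comma-separated UTF-8 file. The function name `tsv_utf16_to_csv_utf8` and the file names are assumptions for illustration; the `utf-8-sig` encoding writes a byte-order mark, which helps Excel recognise the file as UTF-8.

```python
import csv

def tsv_utf16_to_csv_utf8(src_path: str, dst_path: str) -> None:
    """Re-save a tab-delimited UTF-16 text file as a comma-separated UTF-8 CSV."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # default delimiter is a comma
        writer.writerows(reader)  # fields containing commas get quoted

# Hypothetical usage:
# tsv_utf16_to_csv_utf8("export.txt", "export.csv")
```

Because the `csv` module does the quoting, a cell such as "Kim, Cheolsu" stays a single field in the output.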
Every text file has a character encoding for a character set. You have to pick one.
If you pick one that doesn't support all the characters in the file, what would you like to happen? Replacing them with ? is a commonly used option.
Picking UTF-8 is a good choice for an Excel workbook (and almost all documents) because it is an encoding of the Unicode character set, which Excel uses (as does VBA, BTW).
In any case, for a text file you have to communicate which encoding you use. For a CSV text file, you also have to communicate whether there is a header row, what the field separator is, what the text qualifier (quoting) is, how the text qualifier is escaped, what the line separator characters are, and what the column types are. (These are all questions that Excel's text import wizard asks. Your users need the answers.)
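The effect of picking an encoding that does not cover the data can be reproduced in a few lines of Python. This is a small illustration, not what Excel does internally; the sample string is made up.

```python
text = "한국어,Korean"

# An encoding that cannot represent the characters must replace or reject them.
as_ascii = text.encode("ascii", errors="replace")
print(as_ascii)  # b'???,Korean'

# UTF-8 and UTF-16 both encode the Unicode character set, so nothing is lost.
for encoding in ("utf-8", "utf-16"):
    round_trip = text.encode(encoding).decode(encoding)
    print(encoding, round_trip == text)  # True for both
```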

How to convert non-English characters into Unicode (UTF-8)

I am trying to copy articles from a newspaper.
Below is a sample screenshot of the text from the PDF.
When I copy it from the PDF version of the newspaper and paste it into either Microsoft Word or Excel, I get the following characters:
·Æ°·C≥
I believe the font is the Shree Bangala font (not 100% sure).
I have seen other fonts, such as Nirmala, with which I did not face this issue while working in the UTF-8 encoding.
It would be very helpful if someone could guide me on how to convert the above text.

Testing for non-ascii characters copied from webpages

So, I'm finding lots of things about removing non-ascii characters, but not really adding them.
Basically, I have a text field that a user can type in; that string then gets processed, stored, and presented in certain contexts. I expect the user to sometimes just copy and paste text from other webpages, and I want to make sure that nothing the user enters in that field will break anything. (I know this is a potential problem because a user copying and pasting a ' that was not actually an ASCII ' already broke things once.)
This is NOT about removing non-ascii characters! I want a good list/file of possible problem characters I can copy and paste to verify that they get processed correctly. Or at the very least, a good way to find these potential copy paste 'impostor' characters.
Thank you, Tom Blodget. After sifting through and minimizing the text, the following is a list of all UTF-8 characters that can be copied and pasted. (Here are the UTF-16 and UTF-32 lists. I don't have time to copy these lists to a text file. If those links are broken, search Google for a UTF-16 table and a UTF-32 table.)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™
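For the other half of the question, finding the "impostor" characters in text that has already been pasted, one possible approach is to report every non-ASCII character together with its Unicode name. The Python sketch below uses the standard `unicodedata` module; `report_non_ascii` is a hypothetical helper name and the sample string is made up.

```python
import unicodedata

def report_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Curly quotes written as escapes so they are unambiguous in the source.
sample = "It\u2019s a \u201ctest\u201d"
for index, ch, name in report_non_ascii(sample):
    print(index, repr(ch), name)
# 2 '’' RIGHT SINGLE QUOTATION MARK
# 7 '“' LEFT DOUBLE QUOTATION MARK
# 12 '”' RIGHT DOUBLE QUOTATION MARK
```

Running this over real user input gives a concrete list of characters to add to a test file.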

Special formatting in SPSS (subscript, superscript)

Is it possible in SPSS to insert superscript or subscript characters in labels, specifically axis labels?
For example, µV²/Hz. LaTeX commands don't work at all ($\mu V^2$) and there don't appear to be any appropriate fields in the property editor.
Does SPSS have the capability? I'm using Version 20 if that is relevant.
Variable and value labels are plain text. It is possible to use HTML or RTF text in places in the Viewer or via the TEXT extension command.
I think the only way to get superscript or subscript in SPSS is to use Unicode characters. You can find them on this Wikipedia page. All numbers are there, but some letters are missing.
I prefer using Unicode subscript/superscript characters in regular text. The characters will not change if you lose text formatting, though only a limited number of fonts support them.
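If the labels are generated outside SPSS, the Unicode characters can be produced with a translation table. The Python sketch below covers only digits and the plus and minus signs; the helper names are assumptions, and whether the result displays correctly still depends on the font.

```python
SUPERSCRIPT = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")
SUBSCRIPT = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")

def to_superscript(s: str) -> str:
    """Replace digits and +/- with their Unicode superscript forms."""
    return s.translate(SUPERSCRIPT)

def to_subscript(s: str) -> str:
    """Replace digits and +/- with their Unicode subscript forms."""
    return s.translate(SUBSCRIPT)

# \u00b5 is the micro sign.
label = "\u00b5V" + to_superscript("2") + "/Hz"
print(label)  # µV²/Hz
```

The resulting string is plain text and can be pasted into a variable label or axis title.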

Convert text to image in Microsoft Word

I have a large book written in Microsoft Word and want to create a macro that will find all text using a predefined style and convert that text to an inline image. This text will be in Arabic and generally no longer than 4-5 lines. Is this possible?
UPDATE: Here's an example to show what I'm referring to:
I want to replace that entire line in Arabic with an image (as if I cropped this attached image to only include the Arabic and then replaced the line in Arabic with the image).
The reason I want a macro or script to do this is that there are hundreds of such lines; updating them one by one is cumbersome and would make later modifications difficult.
UPDATE2: I found an interesting option here: http://windowssecrets.com/forums/showthread.php/31344-Convert-Text-to-an-Image-of-Text-in-VBA-(Office-2000-Sr1a)
It looks like you can cut a piece of text and then "Paste Special" it as an image. So if there's a way to automate that, it might work.
This is not an answer although I hope it will grow into a community answer. At the moment it is an exploration of what is required to solve the problem.
I know from the discussion when this question was posted on Super User that Abdullah wishes to publish his book on Kindle. So the question is really about how to get a document in English and Arabic ready for publication as an e-Book.
The Kindle does not support Arabic. The number of languages it does support is slowly increasing but there is no evidence I can find that Amazon has plans to add Arabic in the foreseeable future.
The format behind an Amazon e-Book is a cut-down version of HTML. If a Word document containing Arabic letters is exported to HTML, the Arabic letters are included as character entities; for example: “&#64336; &#64337; &#64338; &#64339;”. Importing either the original Word document or the HTML version to Kindle results in the leading bits being discarded, so these characters are displayed as P, Q, R and S instead of “ﭐ ﭑ ﭒ ﭓ” (Alef Wasla isolated form, Alef Wasla final form, Beeh isolated form and Beeh final form).
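The P, Q, R and S observation is consistent with only the low 8 bits of each code point surviving. The Python sketch below checks that arithmetic for the entities 64336 to 64339; it assumes that this is what "leading bits being discarded" means and says nothing about how the Kindle actually processes the file.

```python
import html

entities = "&#64336; &#64337; &#64338; &#64339;"
letters = html.unescape(entities).split()  # decode the numeric entities

for ch in letters:
    code = ord(ch)
    low_byte = chr(code & 0xFF)  # what is left if only the low 8 bits survive
    print(f"U+{code:04X} -> {low_byte}")
# U+FB50 -> P
# U+FB51 -> Q
# U+FB52 -> R
# U+FB53 -> S
```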
I have tried Abdullah’s idea of saving some Arabic letters in a PNG file and creating an HTML file containing <p> … </p> <img src="Arabic.png"> <p> … </p>. The appearance of this file on my Kindle 2 is perfectly acceptable, so this has the potential to be a solution. The question is: how can the necessary conversions be performed?
We need to extract each Arabic string from either the Word document or its HTML equivalent and import it into a program that can convert them to PNG files.
The only way that I know of automating this would be to copy each string to a slide within PowerPoint. With PowerPoint’s SaveAs option it is possible to save each slide as a separate PNG file. The slides are named: SLIDE1.PNG, SLIDE2.PNG, SLIDE3.PNG and so on in sequence which would allow a macro to relate the results to the original strings. It would then be possible to replace the Arabic strings in the HTML file with the image elements. None of this would be too difficult to automate but there is a problem with the slides all being the size of the PowerPoint page. The page could be made smallish but what we need is for each slide to be cropped to just bigger than that slide’s text. I cannot think of any way of automating this cropping.
Does anyone have a better approach than converting each Arabic phrase to a PNG file?
I have been looking for PNG editors with some sort of command line interface but can find nothing that would be easier than using PowerPoint. Does anyone know of an alternative to PowerPoint?
Does anyone have any suggestions for automating the cropping of each image? When a string is placed in a PowerPoint slide it is possible to set its width to, say, 6.5cm (which looks good on my Kindle) and get the height determined by PowerPoint. This could be saved for later use if anyone knows how to use it.
Implementing solution
Pending any suggestions for improving the approach described above, the following outlines how I would implement it.
I would not attempt to process the Word document. I would save it as a Web Page, Filtered HTML file, which is a required step on the way to creating a Kindle eBook, and process that.
Within the HTML file created from my test document, the Arabic phrase comes out as:
<p class="MsoNormal"></p>
<p class="MsoNormal" align="center" style="text-align:center"><span dir="RTL"
style="font-size:24.0pt;font-family:Arial">
&#64336;&#64337;&#64338;&#64339;&#64340;&#64341;
&#64342;&#64343;&#65153;&#65154;&#65276;&#65275;
&#65274;&#65273;&#65246;&#65226;&#65227;&#65228;
</span><span style="font-size:24.0pt"></span></p>
<p class="MsoNormal"></p>
<p class="MsoNormal"></p>
I assume Abdullah's document will result in something similar. Note 1: the above is a random collection of Arabic letters. Note 2: they are held left-to-right in reading sequence even though, when displayed or printed, they are read right-to-left.
The whole of this block will have to be replaced with something like:
<br><img src="xxxx.png"><br>
where the file xxxx.png holds an image of the Arabic text.
The file names, such as xxxx.png, could be systematic (A001.png, A002.png, ...) but I would have thought that transliterating the first ten or twenty characters of the phrase from the Arabic to English alphabets and using the result, with a numeric suffix, as the file name would be more convenient.
I would hold the records necessary to manage the process in an Excel worksheet. I would place the VBA code in the same workbook.
The steps in the conversion process that I envisage are:
1. VBA macro to extract Arabic strings from latest HTML file and add new strings to the Excel worksheet. (More about the Excel worksheet later.)
2. VBA macro to create PowerPoint file, with one slide per new string, and use SaveAs in PNG format to create one PNG file per slide before discarding the PowerPoint file.
3. Human to crop each PNG file. (There appears to be no way of automating the cropping so this task will be minimised by use of data in the Excel worksheet.)
4. VBA macro to rename each slide from SLIDEnnn.PNG to its permanent name and to record the permanent name in the Excel worksheet.
5. VBA macro to update the latest HTML file by replacing the block containing the Arabic phrase with the appropriate HTML IMG element.
The Excel worksheet needs two columns: Arabic phrase and PNG file name. If there is any risk of the worksheet being sorted between steps 2 and 4, we may need a sequence number as well.
Macro 1 will extract an Arabic phrase from the HTML file, look down the list in the worksheet for this phrase and add the phrase at the bottom if it is not already present.
Macro 2 will look for phrases in the worksheet that do not have a PNG file name. These new phrases are the ones to be written to the PowerPoint presentation. That is, a phrase only goes into this process once.
Task 3, cropping each PNG file, will be a pain. All I can say is that it will only be once per phrase.
Macro 4 will assume that the SLIDE001.PNG, SLIDE002.PNG, … are in the sequence of phrases without PNG files in the worksheet. If this might not be true (because the worksheet has been sorted) we will either need a sequence number or to retain the PowerPoint file. The macro will assign a unique name to each new phrase, record this name in the worksheet and rename the PNG file.
Macro 5 creates a new copy of the latest HTML file using the contents of the worksheet to determine which phrase to replace with which PNG file.
This process is not ideal but it will achieve the desired result and has no obvious complications. Any suggestions for improving it?
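To make the text-processing part of Macros 1 and 5 concrete, here is a minimal sketch in Python rather than VBA. It is not the implementation described above: a dictionary stands in for the Excel worksheet, the systematic names (A001.png, ...) are assigned immediately instead of after cropping, and the regular expression assumes every Arabic phrase sits in its own paragraph whose first child is a span with dir="RTL", as in the sample HTML. The names `RTL_BLOCK` and `replace_rtl_blocks` are invented for illustration.

```python
import html
import re

# Assumption: each Arabic phrase is a <p> whose first child is
# <span dir="RTL" ...>, as in the sample HTML above.
RTL_BLOCK = re.compile(
    r'<p\b[^>]*>\s*<span\s+dir="RTL"[^>]*>(.*?)</span>.*?</p>',
    re.DOTALL | re.IGNORECASE,
)

def replace_rtl_blocks(page: str, names: dict[str, str]) -> str:
    """Swap each RTL paragraph for an <img>; `names` maps phrase -> PNG name."""
    def swap(match: re.Match) -> str:
        # Decode the entities and normalise whitespace to get a stable key.
        phrase = " ".join(html.unescape(match.group(1)).split())
        if phrase not in names:  # new phrase: give it the next free name
            names[phrase] = f"A{len(names) + 1:03d}.png"
        return f'<br><img src="{names[phrase]}"><br>'
    return RTL_BLOCK.sub(swap, page)

page = (
    '<p class="MsoNormal">Before</p>\n'
    '<p class="MsoNormal" align="center"><span dir="RTL"\n'
    'style="font-size:24.0pt">\n&#64336;&#64337;\n</span>'
    '<span style="font-size:24.0pt"></span></p>\n'
    '<p class="MsoNormal">After</p>'
)
names: dict[str, str] = {}
print(replace_rtl_blocks(page, names))
```

After the call, `names` holds one entry per distinct phrase, which is the information the worksheet is meant to carry between steps. A regular expression is fragile against other HTML layouts, so a real document may need an HTML parser instead.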
Before you begin these instructions, start the macro recorder in Microsoft Word so you can see the VBA code that it generates.
I'm wondering whether this would be easier if you converted the docx file to .rtf (Rich Text Format) and replaced that line with an image. Go to File > Save As..., name the file "old.rtf", then replace the line with an image, use Save As... again and name the file "new.rtf". Then download Beyond Compare or your favorite diff program to see what changed. It should be easy to do this programmatically if you choose to. I think working in text would be easier than working in Microsoft's binary format, unless you can find a good library to modify the doc or docx formats.
Sub CopySelPasteAsPicture()
    ' Take a picture of the selection and paste it at the
    ' end of the document
    With Selection
        .CopyAsPicture
    End With
    ActiveDocument.Content.Select
    With Selection
        .Collapse Direction:=wdCollapseEnd
        .TypeParagraph
        .TypeParagraph
        .PasteSpecial DataType:=wdPasteMetafilePicture
    End With
End Sub