Testing for non-ascii characters copied from webpages

So, I'm finding lots of things about removing non-ASCII characters, but not really about adding them.
Basically, I have a text field that a user can type in, and then that string gets processed, stored, and presented in certain contexts. I expect the user to sometimes just copy and paste text from other webpages, and I want to make sure that nothing the user enters in that field will break anything. (I know this is a potential problem because a user copying and pasting a ' that was not actually an ASCII ' already broke things once.)
This is NOT about removing non-ASCII characters! I want a good list/file of possible problem characters that I can copy and paste to verify that they get processed correctly. Or, at the very least, a good way to find these potential copy-paste 'impostor' characters.
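One way to get such a list is to build it yourself. Below is a minimal sketch in Python (the question names no language, so that choice is an assumption). The `IMPOSTORS` list is hand-picked and certainly incomplete, and both `IMPOSTORS` and `find_non_ascii` are hypothetical names used only for illustration.

```python
import unicodedata

# Hand-picked look-alike characters that commonly arrive via copy and paste.
# Illustrative only, not an exhaustive list.
IMPOSTORS = [
    "\u2018",  # LEFT SINGLE QUOTATION MARK
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u201c",  # LEFT DOUBLE QUOTATION MARK
    "\u201d",  # RIGHT DOUBLE QUOTATION MARK
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2026",  # HORIZONTAL ELLIPSIS
    "\u00a0",  # NO-BREAK SPACE
    "\u200b",  # ZERO WIDTH SPACE (invisible)
]


def find_non_ascii(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# A single string containing every impostor, ready to paste into the field.
test_input = "".join(IMPOSTORS)

# Report what is hiding in a sample string.
for index, ch, name in find_non_ascii("It\u2019s a test"):
    print(index, f"U+{ord(ch):04X}", name)
```

Pasting `test_input` into the field exercises the processing path, and running `find_non_ascii` on whatever comes out the other end shows which characters survived.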

Thank you, Tom Blodget. After sifting through and minimizing the text, the following is a list of all UTF-8 characters that can be copied and pasted. (Here are UTF-16 and UTF-32 lists. I don't have time to copy these lists to a text file. If those links are broken, Google "UTF-16 table" and "UTF-32 table".)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™

Related

Excel file contains invalid hidden characters that can't be removed

I have a peculiar problem with hidden characters in an Excel spreadsheet which uses VBA to create a text file. I've attached a link to a test version of the file, and I'll explain as best I can the issue.
The file creates a plain txt file that can be used to feed data into a System we use. It works well normally, however we've been supplied approximately 15,000 rows of data, and at random points throughout there are hidden characters.
In the test file, there's 1 row and it's cell B11 that has hidden characters at the beginning and end of the value. If you put your cursor at the end of it, and press the backspace key, it will look as if nothing has happened, but actually you've just deleted one of the characters.
As far as Excel is concerned, those hidden characters are question marks, but they're not: a text stream would write out real question marks, whereas here it throws an "Invalid procedure call" error.
I've tried using Excel's CLEAN formula, I've tried the VBA equivalent, tried using 'Replace', but nothing seems to recognise those characters. Excel is convinced they're just question marks, even an ASCII character call gives me the same answer (63), but replace doesn't replace them as question marks, it just omits them!
Any help on this, even if it's just a formula I could apply would be appreciated. In the interests of data protection the data in the file is fake by the way, it's nobody's real NI number.
The excel file with vba code is here

This VBA macro could be run on its own or in conjunction with the ClearFormatting macro. It did strip out the rogue unichars from the sample.
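Sub strip_Rogue_Unichars()
    Dim uc As Long
    ' Work on the block of cells around row 11, column 1
    With Cells(11, 1).CurrentRegion
        ' Try every code point from 8000 to 8390
        For uc = 8000 To 8390
            ' Remove each occurrence, wherever it sits in the cell text
            .Replace what:=ChrW(uc), replacement:=vbNullString, lookat:=xlPart
            DoEvents ' keep Excel responsive during the loop
        Next uc
    End With
End Sub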
There's probably a better way to do this and being able to restrict the scope of the Unicode characters to search and replace would obviously speed things up. Turning off .EnableEvents, .ScreenUpdating, etc would likewise help. I believe the calculation was already at manual. I intentionally left a DoEvents in the loop as my first run was several thousand different unichars.
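One way to restrict the scope is to look at which code points actually occur in the data before replacing anything. The following is only a sketch of that idea in Python, not a translation of the macro. The sample values are made up, U+202D and U+202C merely stand in for whatever the hidden characters really are, and the function names are hypothetical.

```python
def rogue_code_points(values):
    """Collect the distinct code points above 127 that occur in the data."""
    return sorted({ord(ch) for value in values for ch in value if ord(ch) > 127})


def strip_code_points(value, code_points):
    """Remove every character whose code point is in code_points."""
    return value.translate({cp: None for cp in code_points})


# Hypothetical cell values with a hidden character at each end of the first one.
cells = ["\u202dAB123456C\u202c", "plain"]

found = rogue_code_points(cells)
cleaned = [strip_code_points(value, found) for value in cells]
print([hex(cp) for cp in found], cleaned)
```

Review `found` before stripping: anything above 127 is reported, including legitimate accented letters that should stay.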

VBA: How to Reference Large Unicode Characters like Paperclip?

I know that similar questions have been asked many times before, but all I found was about characters up to 2 bytes long. I need:
MyString = "📎"
PAPERCLIP is U+1F4CE (http://www.fileformat.info/info/unicode/char/1f4ce/index.htm), and
ChrW(128206) 'throws an error
How do I reference Unicode characters that are longer than 2 bytes?

This is a job that your text editor ought to take care of. My memory of the VBA editor is hazy: I don't recall any way to force the text encoding of the source code file, and a quick try with the VBA editor in Excel 2013 looks very unpromising. It turns the UTF-16 surrogate pair into two question marks.
Switching to another editor could work; Notepad, for example, works fine with the Encoding setting in the Save As dialog forced to "Unicode". But that is a hardship, with high odds that the string gets mangled again when you continue editing with the VBA editor. The workaround is to specify the surrogate pair explicitly. Try:
MyString = ChrW(&HD83D) & ChrW(&HDCCE)
Google "utf16 surrogate pair calculator" if you need to do this more than once.
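If you would rather compute the pair than look it up, the arithmetic is short. Here is a sketch in Python (not VBA, purely to show the calculation); `surrogate_pair` is a hypothetical helper name.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F4CE)  # PAPERCLIP
print(f"ChrW(&H{high:04X}) & ChrW(&H{low:04X})")
# prints: ChrW(&HD83D) & ChrW(&HDCCE)
```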

When I paste a string into Xcode I always get an error and have to re-type it

This happens every time I paste a line of code containing a string into Xcode. For example, when I pasted this into Xcode:
simonLabel.text = @"Good Job!";
I received an error saying that there was an "unexpected '@' in program".
If I delete everything and retype it exactly the way I pasted it, I do not get an error.

There are many problems that can appear there:
" can be a different character than you expect
(space) can be a different character than you expect
Invisible characters: something you won't see in the editor at all, but which got there with the copy-paste.
All these problems can happen because:
You copy it from a website with a different encoding (or from a really badly written website)
You copy it from a smart editor (e.g. MS Word, Open Office) which replaces some of the characters to match the locale (e.g. quotation marks) or replaces/adds spaces based on grammar and word wrapping (e.g. non-breaking space).
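A pasted line can be cleaned up before it reaches the compiler by mapping the usual suspects back to ASCII. This is a rough sketch in Python; the mapping is illustrative rather than complete, and the sample line and the name `normalize_pasted` are made up for the example.

```python
# Characters that editors and web pages commonly substitute, mapped to the
# ASCII characters a compiler expects. None means "delete the character".
REPLACEMENTS = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u00a0": " ",   # no-break space
    "\u200b": None,  # zero width space
    "\ufeff": None,  # byte order mark / zero width no-break space
})


def normalize_pasted(line):
    """Replace look-alike characters and drop invisible ones."""
    return line.translate(REPLACEMENTS)


# Hypothetical pasted line with curly quotes and a no-break space.
pasted = "label.text\u00a0= \u201cGood Job!\u201d;"
print(normalize_pasted(pasted))
```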

Macro for adding spaces between merged words in Microsoft Word

Today I had to edit a terrible text in Microsoft Word that had many paragraphs where all the words were merged, so the text was impossible to read. A nightmare :)
So, I thought, maybe there is a solution to this?
I have this idea:
1) I go through the text (or paragraph), placing the cursor between words that should be separated (i.e. loremipsum should become lorem ipsum).
2) The macro remembers all the places where I placed the cursor.
3) The macro inserts the necessary spaces between the words.
But inserting a space as soon as I place the cursor would also be good.
Any suggestions?
Thanks!

User939536, welcome to SO. Before you build a full-fledged Add-In to Word, a good chunk of your functionality is already there.
If you go to Word Options->Proofing->Autocorrect Options (your version may have a different method), you can see the list of words to automatically replace with other words. You can put your most commonly combined words here and let Word do the work for you.
The other route of course is to build an Add-In that will:
Have a database of commonly replaced words (probably a text file that's read on startup)
A search function to search the document
An interface to present the user with the choice to Replace or Skip.
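The three parts of such an Add-In can be sketched outside Word. The Python below is only an illustration of the lookup-and-replace logic, not Word automation. The replacement table is hypothetical (the answer suggests loading it from a text file), and the interactive Replace/Skip step is reduced to replacing everything.

```python
import re

# Hypothetical replacement table: merged word -> corrected text.
REPLACEMENTS = {"loremipsum": "lorem ipsum", "dolorsit": "dolor sit"}


def find_merged_words(text, table):
    """Yield (start, end, replacement) for each known merged word in text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    for match in pattern.finditer(text):
        yield match.start(), match.end(), table[match.group(1)]


def replace_all(text, table):
    """Apply every replacement; an interactive tool would ask Replace or Skip here."""
    pieces, last = [], 0
    for start, end, replacement in find_merged_words(text, table):
        pieces.append(text[last:start])
        pieces.append(replacement)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


print(replace_all("loremipsum dolorsit amet", REPLACEMENTS))
```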

Removing invisible question mark from text - #E using vba

I have to read the text from the cells of a column in Excel and search for it in another sheet.
For example, the text in sheet1 column A is "Evoked Potential Amplitude N2 - P2." This has to be searched for in sheet2 column C. The search fails because a question mark appears before the "E", and that character is not present in the value in sheet2.
Both are representations of the same character in different applications. Maybe someone will recognize it.
In the Excel sheet I don't see any junk characters, but while handling it in the VB code I see a question mark before the word "Evoked".
This data was extracted from a SharePoint application, and this character (?) is not visible to the naked eye. The search and replace functions do not work in this case.

Unicode 8203 is a zero-width space. I'm not sure where it's coming from. It is probably a flaw in the way the data is imported into Excel which you haven't noticed before, but it might be worth fixing.
In the meantime, you can simply use the Mid() function in Excel VBA to remove the unwanted character. For example, instead of
x = Cells(1, 1).Value
use
x = Mid(Cells(1, 1).Value, 2)
which deletes the first character.
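Dropping the first character only works when the zero-width space is always at position 1. A more general approach removes it wherever it occurs. The sketch below shows that idea in Python rather than VBA; the extra code points in `ZERO_WIDTH` are an assumption about what else might turn up, and the names are hypothetical.

```python
# Zero-width code points to delete: ZERO WIDTH SPACE (8203), ZERO WIDTH
# NON-JOINER, ZERO WIDTH JOINER and the byte order mark. Only 8203 comes
# from the answer above; the rest are assumed extras.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])


def strip_zero_width(value):
    """Remove zero-width characters anywhere in the string."""
    return value.translate(ZERO_WIDTH)


cell = "\u200bEvoked Potential Amplitude N2 - P2."
print(strip_zero_width(cell) == "Evoked Potential Amplitude N2 - P2.")
# prints: True
```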
