simple input of diacritical marks, and superscripts - input

There are times when you need to input modified variables with diacritical marks, or superscripts.
Seems like declare_index_properties allows doing it at the stage of display print.
But it is neither simple, nor very useful in formulas.
Is there a simple way of adding hats, umlauts, and ' or " strokes on top of a symbol, making it distinguishable from the unmarked symbol both to the interpreter and to the human eye?

Maxima doesn't have a notion of declaring a symbol to have diacritical marks or other combining marks on it. However, Maxima allows Unicode characters in symbol names if the underlying Lisp implementation allows Unicode; almost all of them allow Unicode. GCL is the only Lisp implementation, so far as I know, which doesn't handle Unicode correctly.
wxMaxima appears to allow Unicode characters to be input. At least, it worked that way when I tried some examples. Command-line Maxima allows Unicode if the terminal it is running in allows Unicode.
I think any Unicode character should be OK in a string. For symbols, any character which passes ALPHA-CHAR-P (a built-in Lisp function) can be part of a symbol name. Also, any character which is declared to be alphabetic (via declare("x", alphabetic) where x is the character in question) can be part of a symbol name.
I think wxMaxima has some capability to allow the user to select characters with diacritical marks from a menu; I haven't tried it. When I want to use Unicode characters, I end up just pasting them from a web page or something. I have used https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html as a source of characters in the past.
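If pasting from a web page gets tedious, the characters can also be generated. Here is a minimal Python sketch (not Maxima code; the helper name with_mark is made up for illustration). It attaches a combining mark to a base letter and normalizes to NFC, so a single precomposed character is produced where Unicode has one. Python's isalpha() is only a rough stand-in for Lisp's ALPHA-CHAR-P; whether a given character is accepted in a symbol name still depends on the Lisp implementation.

```python
import unicodedata

def with_mark(base, mark_name):
    """Attach the named combining mark to base and compose if possible (NFC)."""
    mark = unicodedata.lookup(mark_name)
    return unicodedata.normalize("NFC", base + mark)

for base in ("a", "x"):
    for mark_name in ("COMBINING CIRCUMFLEX ACCENT", "COMBINING DIAERESIS"):
        ch = with_mark(base, mark_name)
        # len 1 means a precomposed character exists; len 2 means the
        # combining mark stayed separate (as it does for x with a hat).
        print(repr(ch), len(ch), ch.isalpha())
```

Letters with a precomposed form (such as a with a circumflex) come out as one alphabetic character. Letters without one (such as x with a circumflex) keep a separate combining mark, which may need declare(..., alphabetic) before it can appear in a symbol name.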

Where are named pdf characters defined like "f_f", "uni00D0" and "a204"?

I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters.
When I parse PDF32000_2008.pdf, there are also named characters like "f_f", "uni00D0" and "a204", which are missing in that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might get printed with a special glyph. There is a Unicode "Latin Small Ligature Ff" for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi', one glyph showing the 2 characters 'f' and 'i'. However, the pdf specification has 'fi' as named character "fi", so what is the point of having 2 named characters pointing to the same symbol?
I can imagine that "uni00D0" means the Unicode character 'Ð'. However, pdf already defines it as the named character "Eth".
What could "a204" be? Maybe ANSI 204 'Ì', which also already has a named character "Igrave"?
Why do they also use "a62", which would be just a '<'?
However, my main question is: Where can I find a specification for these additional named characters?
Of course, Adobe Acrobat understands them, but also Gmail seems not to have a problem with them. So I guess, their meaning must be specified somewhere.
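The two guesses above ("uni" plus hex digits means a code point, and an underscore joins the components of a ligature) can be written down as a small decoder. This is a minimal Python sketch; as far as I know these conventions match Adobe's glyph naming documentation (the Adobe Glyph List specification), but treat that as an assumption. The function name decode_glyph_name and the tiny KNOWN table are hypothetical stand-ins, not part of any library. Names such as "a204" are deliberately left undecoded here.

```python
import re

# Hypothetical, tiny stand-in for a real glyph-name table.
KNOWN = {"f": "f", "i": "i", "Eth": "\u00d0"}

def decode_glyph_name(name):
    """Decode names like 'f_f' or 'uni00D0'; return None if any part is unknown."""
    parts = []
    for component in name.split("_"):          # underscore joins ligature components
        m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
        if m:
            digits = m.group(1)                # one or more 4-digit hex code points
            parts.append("".join(chr(int(digits[i:i + 4], 16))
                                 for i in range(0, len(digits), 4)))
        elif component in KNOWN:
            parts.append(KNOWN[component])
        else:
            return None                        # e.g. 'a204': not covered by this sketch
    return "".join(parts)

for name in ("f_f", "f_i", "uni00D0", "Eth", "a204"):
    print(name, "->", repr(decode_glyph_name(name)))
```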

Parser not recognizing a dash

My program makes calculations on physics vectors and it allows copy/pasting from websites and then tries to parse them into the x, y, and z components automatically. I've come across one website (http://mathinsight.org/cross_product_examples) that has (3,−3,1). While that looks normal, that minus is actually not recognized by VB. Visually, it is longer than the normal minus (− and -), but returns the same Unicode of 45. This picture shows the Unicode for every character (I added a minus in front of the first 3 for comparison) in the Textbox. Also, from this website, I had to use Ctrl+C because right-clicking shows that this is not simple HTML.
One is valid (the first), but the second gives VB fits as shown below. Either it won't compile (shown by the blue line below) or a simple assignment (the second one) wreaks havoc on my form.
I have tried using
vectorString.Replace("–", "-")
and pasting in the longer dash for the target string and a normal keystroke dash as the replacement, but nothing happens. I'm guessing that is because they both have the same Unicode.
Is there some way to convert the longer, invalid dash into the one recognized by VB? I tried using the dash symbol that Word likes to replace the minus sign with and it comes up as Unicode 150. So, apparently there are at least three different kinds of dashes. Any thoughts?
The character from Math Insight is U+2212, minus sign. The character you tried using in your Replace call is U+2013, en dash. That's why your replace didn't work.
Beyond the standard ASCII hyphen-minus (-, U+002D, decimal 45), there are two common dashes: the en dash (–, U+2013) and the em dash (—, U+2014). There is also a figure dash (‒, U+2012), but it is not as common.
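The fix is to replace every dash-like character with the ASCII hyphen-minus before parsing. Here is a minimal sketch of the idea in Python rather than VB (the names normalize_dashes and parse_vector are made up for illustration); the same character-by-character replacement should carry over to a chain of Replace calls that use the actual code points.

```python
# Map the dash-like characters mentioned above to ASCII hyphen-minus (U+002D).
DASHES = str.maketrans({
    "\u2212": "-",  # minus sign (the one from Math Insight)
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2012": "-",  # figure dash
})

def normalize_dashes(text):
    return text.translate(DASHES)

def parse_vector(text):
    """Parse a string like '(3,-3,1)' into a list of floats."""
    cleaned = normalize_dashes(text).strip().strip("()")
    return [float(part) for part in cleaned.split(",")]

pasted = "(3,\u22123,1)"          # what the web page actually contains
print([hex(ord(c)) for c in pasted])
print(parse_vector(pasted))
```

Printing the code points, as in the first print line, is also a quick way to see which dash was actually pasted.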

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use): while the results contain both Perez and Pérez occurrences, I also get some results that apparently have nothing to do with the query.
In Spanish this happens quite a lot. Be it laziness or whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents, because it's assumed all these searches work with both options (I guess since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
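The accent stripping and the or-filter can be sketched in Python as follows. This is only an illustration of the scheme above: the function names are made up, the "{stripped}" prefix is the one from the sample object, and the name is assumed to contain no characters that need escaping in a filter.

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD and drop combining marks, so 'Pérez' becomes 'Perez'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stripped_value(name):
    """Second cn value to store next to the original one."""
    return "{stripped}" + strip_accents(name)

def build_filter(name):
    """Match either the name as typed or the stored accent-free value."""
    return "(|(cn=%s)(cn=%s))" % (name, stripped_value(name))

print(stripped_value("Pérez"))
print(build_filter("Pérez"))
print(build_filter("Perez"))
```

Because the query side strips accents too, a user typing Perez or Pérez ends up matching the same stored "{stripped}Perez" value.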
Search filters ("queries") are specified by RFC 2254, which has since been obsoleted by RFC 4515.
Encoding: the RFC (indirectly) requires the parts of a filter to be OCTET STRINGs, i.e. sequences of 8-bit bytes: AttributeValue is an OCTET STRING, and MatchingRuleId and AttributeDescription are LDAPStrings, where an LDAPString is a UTF-8 encoded OCTET STRING.
The standard on escaping: replace each special character with a backslash followed by its two-digit hexadecimal value (https://www.rfc-editor.org/rfc/rfc4515#page-4, examples at https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
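The escaping rule quoted above is small enough to sketch directly. This minimal Python version (the function name is made up; LDAP libraries usually ship their own helper, which is preferable in real code) replaces exactly the five characters named in the quote and leaves everything else, including "é", untouched.

```python
# The five characters RFC 4515 requires to be escaped in a filter value.
SPECIAL = {
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\\": r"\5c",
    "\x00": r"\00",
}

def escape_filter_value(value):
    """Escape user input before placing it inside an LDAP search filter."""
    return "".join(SPECIAL.get(ch, ch) for ch in value)

user_input = "Perez (Jr)*"
print("(cn=*%s*)" % escape_filter_value(user_input))
```

The wildcards that belong to the filter itself are added after escaping, so a literal asterisk typed by the user cannot turn into a wildcard.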

RegEx to find % symbols in a string that don't form the start of a legal two-digit escape sequence?

I would like a regular expression to find the %s in the source string that don't form the start of a valid two-hex-digit escaped character (defined as a % followed by exactly two hexadecimal digits, upper or lower case) that can be used to replace only these % symbols with %25.
(The motivation is to make the best guess attempt to create legally escaped strings from strings of various origins that may be legally percent escaped and may not, and may even be a mixture of the two, without damaging the data intent if the original string was already correctly encoded, e.g. by blanket re-encoding).
Here's an example input string.
He%20has%20a%2050%%20chance%20of%20living%2C%20but%20there%27s%20only%20a%2025%%20chance%20of%20that.
This doesn't conform to any encoding standard because it is a mix of valid escaped characters, e.g. %20, and two loose percentage symbols. I'd like to convert those %s to %25s.
My progress so far is to identify a regex %[0-9a-z]{2} that finds the % symbols that are legal but I can't work out how to modify it to find the ones that aren't legal.
%(?![0-9a-fA-F]{2})
Should do the trick. Use a look-ahead to find a % NOT followed by a valid two-digit hexadecimal value then replace the found % symbol with your %25 replacement.
(Hopefully this works with (presumably) NSRegularExpression, or whatever you're using)
%(?![a-fA-F0-9]{2})
That's a percent followed by a negative lookahead for two hex digits.
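To see the lookahead in action, here is a minimal sketch in Python (the question is presumably about NSRegularExpression, so this only demonstrates the pattern; the function name is made up). It uses the example string from the question.

```python
import re

# A percent sign NOT followed by two hexadecimal digits.
LOOSE_PERCENT = re.compile(r"%(?![0-9a-fA-F]{2})")

def fix_loose_percents(text):
    """Replace only the percent signs that do not start a valid escape."""
    return LOOSE_PERCENT.sub("%25", text)

source = ("He%20has%20a%2050%%20chance%20of%20living%2C%20but"
          "%20there%27s%20only%20a%2025%%20chance%20of%20that.")
print(fix_loose_percents(source))
```

Running the function a second time changes nothing, because every %25 it produced is itself a valid escape. One unavoidable ambiguity remains: a literal percent sign that happens to be followed by two hex digits (for example "50%20" meant as "fifty percent, twenty") cannot be told apart from a real escape.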

Collect a word between two spaces in objective c

I'm trying to implement stuff similar to spell check, but I need to get the word that is limited by a space. EX: "HI HOW R U", I need to collect HI, HOW and so on as they type. i.e. After user hits HI and space I need to collect HI and do a spell check.
Check the documentation for NSString. You want the message componentsSeparatedByString:.
I don't know objective-C, but I'm fairly sure it'll have a Regexp library - although it'd be straightforward to code it without one.
Regexp: \b([^\s])*\b
\b = word boundary (whitespace, comma, dot, exclamation-mark, etc.)
\s = whitespace character
[...] = character set
[^...] = negated character set (any character(s) EXCEPT ...)
() = grouping construct
* = zero or more times
So the suggested expression would start matching at any word boundary, then match every subsequent character that is not a whitespace character, then match a word boundary.
Your stated case is so simple you may just want to look for spaces (one char at a time) and get the substring, but RegExp is very widely used across a range of languages and platforms, and so it's fairly easy to find an expression when you need to - and one often does for common stuff like checking if zip codes, phone numbers, email addresses and so on are syntactically correct. So it's worth learning in any case. :)
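The "just look for spaces" approach can be sketched like this. It is written in Python rather than Objective-C, and the function name completed_word is made up; splitting on whitespace plays the role that componentsSeparatedByString: would play on an NSString.

```python
def completed_word(buffer):
    """Return the word just finished by a trailing space, or None if none was."""
    if not buffer or not buffer[-1].isspace():
        return None               # the user is still in the middle of a word
    words = buffer.split()        # split on runs of whitespace
    return words[-1] if words else None

# Simulate the user typing one character at a time.
typed = ""
for ch in "HI HOW R U ":
    typed += ch
    word = completed_word(typed)
    if word is not None:
        print("spell-check:", word)
```

Calling the check on every keystroke only hands a word to the spell checker once the space after it has been typed, which is the behaviour asked for.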