Parser not recognizing a dash - vb.net

My program makes calculations on physics vectors and it allows copy/pasting from websites and then tries to parse them into the x, y, and z components automatically. I've come across one website (http://mathinsight.org/cross_product_examples) that has (3,−3,1). While that looks normal, that minus is actually not recognized by VB. Visually, it is longer than the normal minus (− and -), but return the same Unicode of 45. This picture shows the Unicode for every character (I added a minus in front of the first 3 for comparison) in the Textbox. Also, from this website, I had to use Ctrl+c because right clicking shows that this is not simple HTML.
One is valid (the first), but the second gives VB fits as shown below. Either it won't compile (shown by the blue line below) or a simple assignment (the second one) wrecks havok on my form.
I have tried using
vectorString.Replace("–", "-")
and pasting in the longer dash for the target string and a normal keystroke dash as the replacement, but nothing happens. I'm guessing that since they both have the same Unicode.
Is there some way to convert the longer, invalid dash into the one recognized by VB? I tried using dash symbol that Word likes to replace the minus sign with and it comes up as Unicode 150. So, apparently there are at least three different kinds of dashes. Any thoughts?

The character from Math Insight is U+2212, minus sign. The character you tried using in your Replace call is U+2013, en dash. That's why your replace didn't work.
Beyond the standard ASCII hyphen (-, U+0045), there are two common dashes: the en dash (–, U+2013) and the em dash (—, U+2014). There is also a figure dash (‒, U+2012), but it is not as common.

Related

simple input of diacritical marks, and superscripts

There are times when you need to input modified variables with diacritical marks, or superscripts.
Seems like declare_index_properties allows doing it at the stage of display print.
But it is neither simple, nor very useful in formulas.
is there a simple way of adding hats, umlauts, and ', "strokes on top of a symbol, making it distinguishable from the symbol without such mark both to interpreter and to human eye?
Maxima doesn't have a notion of declaring a symbol to have diacritical marks or other combining marks on it. However, Maxima allows Unicode characters in symbol names if the underlying Lisp implementation allows Unicode; almost all of them allow Unicode. GCL is the only Lisp implementation, so far as I know, which doesn't handle Unicode correctly.
WxMaxima appears to allow Unicode characters to be input. At least, it worked that way when I tried some examples. Command-line Maxima allows Unicode if the terminal it is running in allows Unicode.
I think any Unicode character should be OK in a string. For symbols, any character which passes ALPHA-CHAR-P (a build-in Lisp function) can be part of a symbol name. Also, any character which is declared to be alphabetic (via declare("x", alphabetic) where x is the character in question) can be part of a symbol name.
I think wxMaxima has some capability to allow the user to select characters with diacritical marks from a menu; I haven't tried it. When I want to use Unicode characters, I end up just pasting them from a web page or something. I have used https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html as a source of characters in the past.

VBA replace certain carriage

All.
I am used to programming VBA in Excel, but am new to the structures in Word.
I am working through a library of text files to update them. Many of them are either OCR documents, or were manually entered.
Each has a recurring pattern, the most common of which is unnecessary carriage returns.
For example, I am looking at several text files where there is a double return after each line. A search and replace of all double carriage returns removes all paragraph distinctions.
However, each line is approximately 30 characters long, and if I manually perform the following logic, it gives me a functional document.
If there is a double carriage return after 30+ characters, I replace them with a space.
If there were less than 30 characters prior to the double return, I replace them with a single return.
Can anyone help me with some rudimentary code that would help me get started on that? I could then modify it for each "pattern" of text documents I have.
e.g.
In this case, there are more than
thirty characters per line. And I
will keep going to illustrate this
example.
This would be a new paragraph, and
would be separated by another of
the single returns.
I want code that would return:
In this case, there are more than thirty character returns. And I will keep going to illustrate this example.
This would be a new paragraph, and would be separated by another of the single returns.
Let me know if anyone can throw something out that I can play with!
You can do this without code (which RegEx requires), simply using Word's own wildcard Find/Replace tools, where:
Find = ([!^13]{30,})[^13]{1,}
Replace = \1^32
and, to clean up the residual multi-paragraph breaks:
Find = [^13]{2,}
Replace = ^p
You could, of course, record the above as a macro...
Here is a RegEx that might work for you:
(\n\n)(?<!\.(\n\n))
The substitution is just a plain space, you can try it out (and modify / tweak it) here: https://regex101.com/r/zG9GPw/4
This 'pattern' tells the RegEx engine to look for the newline character \n which occurs x2 like this \n\n (worth noting this is from your question and might be different in your files, e.g. could be \r\n) and it assumes that a valid line break will be proceeded by a full stop: \..
In RegEx the full stop symbol is a single character wild card so it needs to be escaped with the '\' (n and r are normal characters, escaping them tells the RegEx engine they represent newline and return characters).
So... the expression is looking for a group of x2 newline characters but then uses a negative look-behind to exclude any matches where the previous character was a full stop.
Anyway, it's all explained on the site:
Here is how you could do a RegEx find and replace using NotePad++ (I'm not sure if it comes with RegEx or if a plugin is needed, either way it is easy). But you can set a location, filters (to target specific file types), and other options (such as search in sub-directories).
Other than that, as #MacroPod pointed out you could also do this with MS Word, document by document, not using any code :)

Finding the ascii value for shift-in shift-out in sql?

I've some records which are really messed up.
My team lead told me to find out the position of characters with ascii value 14 and 15.
I've a query
SELECT CHARINDEX(CHAR(14),X_CUSTOMER_COMMENTS)
FROM vp_service_requests;
SELECT CHARINDEX(CHAR(15),X_CUSTOMER_COMMENTS)
FROM vp_service_requests;
which returns 0 because i wasn't able to find char with 14 and 15 ascii value after google
search i found 14 and 15 ascii value are for shift in and shift out
how this represents on keyboard so i can try for it with CHAR(14) function.
As a holdover from the old DOS days, Windows still allows you to enter certain old ASCII keys from the keyboard by pressing and holding the ALT key, followed by the three-digit code you wish to enter (from the 10-key pad, not the numeric row atop the keyboard), eg for 14, type ALT-014.
However, some of the lower-level codes are inherited from old terminal functions, eg ASCII 7 is a bell, 8 is a backspace, eg, and rather that typing a character, they cause the cursor to behave a certain way or induce an application to respond in a defined manner. You can embed a CHAR(XX) value for testing simply by concatenating the value into a string and INSERTing it into your test table.
It should be Ctrl-N and Ctrl-O although I doubt this will help.
Try loading the records into a good editor and look at them in HEX. Weird characters should stick out like a sore thumb

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use, because while the results contain both Perez and Pérez ocurrences, I also get some results that apparently have nothing to do with the query...
In Spanish this happens quite a lot... be it lazyness, be it whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents because it's assumend all these searches work with both options (I guess since Google allowes it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
Search filters ("queries") are specified by RFC2254.
Encoding:
RFC2254
actually requires filters (indirectly defined) to be an
OCTET STRING, i.e. ASCII 8-byte String:
AttributeValue is OCTET STRING,
MatchingRuleId
and AttributeDescription
are LDAPString, LDAPString is an OCTET STRING.
The standard on escaping: Use "<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote beings with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = #"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthly amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
#"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
#"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.