I am trying to understand the following piece of code in Smalltalk:
Character extend [
    isGraph [
        ^ (Character space < self) & (self <= $~)
    ]
    visible [
        self isGraph ifTrue: [^ '$', self asString]
                     ifFalse: [^ self asInteger printStringRadix: 16]
    ]
]
So basically this code extends Character by adding two new functions to it. isGraph returns a boolean value, but I don't understand its purpose. How do you interpret (Character space < self) & (self <= $~)? Somehow the message space is sent to Character, and that returns something which is compared to self, and then self is compared to $~. Can someone also explain the meaning of the symbol ',' in the ifTrue: block?
Welcome to StackOverflow.
First of all, the code adds two new methods, not functions, since this is object-oriented programming.
When you send the space message to the Character class, it returns an instance of that class which represents the space character. isGraph probably means "is graphical", because the characters that precede space in the ASCII table do not have a graphical representation (they are NULL, CR, ESC, etc.), and neither does the DEL character that follows ~. Thus with isGraph you check whether a character comes after space and no later than ~ in the ASCII table; note that space itself is excluded, because the first comparison is a strict <.
visible returns a visible representation of a character and relies on isGraph to decide whether to return the actual character or its integer ASCII code printed in base 16. The actual character is returned in Smalltalk's character literal format, e.g. $a is used for the character a and $3 for the character 3. Strings are concatenated with the , message.
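For readers more at home outside Smalltalk, here is a rough Python sketch of the same two methods. The function names are only illustrative, and the hexadecimal formatting is an assumption: the exact string that printStringRadix: produces (for example whether it carries a radix prefix) depends on the Smalltalk dialect.

```python
def is_graph(ch: str) -> bool:
    """Mirror of isGraph: true for characters after space, up to and including '~'."""
    return " " < ch <= "~"


def visible(ch: str) -> str:
    """Mirror of visible: '$' plus the character, or its code point in hexadecimal."""
    if is_graph(ch):
        return "$" + ch  # string concatenation, like the Smalltalk , message
    return format(ord(ch), "X")  # assumed formatting, see the note above


print(visible("a"))   # a graphical character
print(visible("\n"))  # a control character
```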
Actually, one of the main points of Smalltalk is understandability. Thus you should always be able to debug a small piece of code or look at the implementation of a message (like , in your case). But I suspect that you are using something like GNU Smalltalk, which lacks many of these features.
I'm trying to remove newlines and similar characters using regexp_replace. But when I open the result in the query result window, I can still see the newlines in the text. Even when I copy the result, I can see the newline characters. See the output below for an example; I copied it straight from the result.
Below is my query
select regexp_replace('abc123
/n
CHAR(10)
头疼,'||CHR(10)||'allo','[^[:alpha:][:digit:][ \t]]','') from dual;
I kept the / just to test characters.
Output:
abc123
/n
CHAR(10)
头疼,
allo
How to remove the new lines from the text?
Expected output:
abc123 /nCHAR(10)头疼,allo
There are two mistakes in your code. One of them causes the issue you noticed.
First, in Oracle regular expressions (which follow the POSIX standard), escape sequences are not recognized inside a bracket expression. You probably meant \t as the escape sequence for tab within the bracket expression. (Note also that Oracle regular expressions have no escape sequences like \t and \n anyway. If you must preserve tabs, it can be done, but not like that.)
Second, regardless of this, you include two character classes, [:alpha:] and [:digit:], and also [ \t] in the (negated) bracket expression. The last one is not a character class, so the [ as well as the space, the backslash and the letter t are interpreted as literal characters - they stand in for themselves. The closing bracket, on the other hand, has special meaning. The first of your two closing brackets is interpreted as the end of the bracket expression; and the second closing bracket is interpreted as being an additional, literal character that must be matched! Since there is no such literal closing bracket anywhere in the string, nothing is replaced.
To fix both mistakes, replace [ \t] with the [:blank:] character class, which consists exactly of space and tab, so that the pattern becomes '[^[:alpha:][:digit:][:blank:]]'. (Note also that [:alpha:][:digit:] can be written more compactly as [:alnum:].)
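The parsing problem can be reproduced outside Oracle. The sketch below uses Python's re module, which has no POSIX character classes, so the ASCII ranges A-Za-z0-9 stand in for [:alpha:][:digit:] and a space plus tab stand in for [:blank:]. Both patterns are approximations written for illustration, not the Oracle patterns themselves, and the sample text leaves out the non-ASCII characters.

```python
import re

text = "abc123\n/n\nCHAR(10)\nallo"

# Rough equivalent of '[^[:alpha:][:digit:][ \t]]' as the engine reads it:
# a negated bracket expression containing letters, digits, '[', space,
# backslash and 't', followed by a literal ']' that must also be present.
broken = re.compile(r"[^A-Za-z0-9\[ \\t]\]")

# Rough equivalent of the fixed pattern '[^[:alnum:][:blank:]]'.
fixed = re.compile(r"[^A-Za-z0-9 \t]")

print(repr(broken.sub("", text)))  # unchanged: the text contains no ']'
print(repr(fixed.sub("", text)))   # newlines and punctuation are removed
```

Note that the fixed pattern also removes the / and the parentheses, since they are neither alphanumeric nor blank; any characters that should survive have to be added to the bracket expression.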
I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[ and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that converts all named characters to their Unicode representation, and the other is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist, I need to tell the user about it. How do I propagate this error to the user and yet let ANTLR4 recover from the error and resume? Maybe I can somehow "pipe" lexers/parsers? (I am new to ANTLR.)
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:
idents : ident+ ;
ident  : NChar     // named character string
       | ID        // simple character string?
       | Literal   // something quoted?
       | ....
       ;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt, and the analysis of all ordering issues is isolated to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Packing both children into the same token would not be best practice: you would just have to manually break the token text into the two parts later, to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@header {
    import java.util.ArrayList;
}
@members {
    ArrayList<String> keyList = .... // list of named chars
    // true when the named character is in the list of known names
    public boolean inList(String id) {
        return keyList.contains(id) ;
    }
}
nChar   : known
        | unknown
        ;
known   : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse tree is a bit complex. It is easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both Unicode and ASCII) is likely worthwhile, to handle both representations as well as conversions and misspellings.
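As a language-neutral illustration of that validation pass, here is a minimal Python sketch that resolves named characters against a table, reports unknown names without aborting, and suggests a close match. The three-entry table, the function name and the warning wording are all assumptions made for the example. In a real ANTLR project this logic would live in a parse-tree visitor and the table would hold the full list.

```python
import difflib
import re

# Hypothetical, tiny subset of the named-character table (assumption).
NAMED_CHARS = {"Pi": "\u03c0", "Alpha": "\u03b1", "Beta": "\u03b2"}

# Either a named character such as \[Pi], or a run of plain ASCII letters.
# Anything else is simply ignored in this sketch.
PART = re.compile(r"\\\[([A-Za-z]+)\]|([A-Za-z]+)")


def resolve_identifier(source: str):
    """Return (resolved_text, warnings) for an identifier like \\[Pi]Squared."""
    resolved, warnings = [], []
    for match in PART.finditer(source):
        name, plain = match.group(1), match.group(2)
        if plain is not None:
            resolved.append(plain)
        elif name in NAMED_CHARS:
            resolved.append(NAMED_CHARS[name])
        else:
            # Keep the original text so later stages can still proceed.
            resolved.append(match.group(0))
            close = difflib.get_close_matches(name, list(NAMED_CHARS), n=1)
            hint = f"; did you mean \\[{close[0]}]?" if close else ""
            warnings.append(f"unknown named character \\[{name}]{hint}")
    return "".join(resolved), warnings


print(resolve_identifier(r"\[Pi]Squared"))
print(resolve_identifier(r"\[Pii]Squared"))
```

Collecting warnings instead of raising keeps the pass going after the first bad name, which matches the goal of reporting the error and still resuming.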
I'm writing a regular expression in Objective-C.
The escape sequence \w is illegal and emits a warning, so the regular expression /\w/ must be written as @"\\w"; the escape sequence \? is valid, apparently, and doesn't emit a warning, so the regular expression /\?/ must be written as @"\\?" (i.e., the backslash must be escaped).
Question marks aren't invisible like \t or \n, so why is \? a valid escape sequence?
Edit: To clarify, I'm not asking about the quantifier, I'm asking about a string escape sequence. That is, this doesn't emit a warning:
NSString *valid = @"\?";
By contrast, this does emit a warning ("Unknown escape sequence '\w'"):
NSString *invalid = @"\w";
It specifies a literal question mark. It is needed because of a little-known feature called trigraphs, where you can write a three-character sequence starting with question marks to substitute another character. If you have trigraphs enabled, in order to write "??" in a string, you need to write it as "?\?" in order to prevent the preprocessor from trying to read it as the beginning of a trigraph.
(If you're wondering "Why would anybody introduce a feature like this?": some keyboards or character sets didn't include commonly used symbols like {, so trigraphs were introduced so that you could write ??< instead.)
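To make the substitution concrete, here is a small Python sketch of what the trigraph-replacement step does to source text. The table lists the nine trigraphs defined by the C standard; the function name and the left-to-right scan are only an illustration, not how any particular compiler implements it.

```python
# The nine trigraph sequences defined by the C standard.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(source: str) -> str:
    """Scan left to right, replacing each trigraph with its single character."""
    out = []
    i = 0
    while i < len(source):
        chunk = source[i:i + 3]
        if chunk in TRIGRAPHS:
            out.append(TRIGRAPHS[chunk])
            i += 3
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


print(replace_trigraphs('"what??!"'))    # the trigraph ??! is replaced
print(replace_trigraphs('"what?\\?!"'))  # the escaped form is left alone
```

Writing ?\? breaks up the ?? pair, so the scan never sees a trigraph. The compiler later reads \? as a plain question mark.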
? in regex is a quantifier; it means 0 or 1 occurrences. When appended to the + or * quantifiers, it makes them "lazy".
For example, applied to the string foo?, the regex fo? matches fo: the f followed by one optional o.
However, the regex o\? matches o? in foo?, because the escaped question mark is a literal character to search for in the string, rather than a quantifier.
Laziness shows up once something precedes the quantified part: applied to foo?, the regex fo*? matches just f, whereas the greedy fo* matches foo.
More info on quantifiers here.
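The difference between the quantifier, the escaped literal and the lazy variant can be checked quickly. The sketch below uses Python's re module purely as a convenient test bench; the patterns are chosen for illustration and anchored on a leading f so that the first match is not an empty string.

```python
import re

subject = "foo?"

optional = re.search(r"fo?", subject).group()   # f plus at most one o
literal = re.search(r"o\?", subject).group()    # o plus a literal question mark
lazy = re.search(r"fo*?", subject).group()      # as few o's as possible
greedy = re.search(r"fo*", subject).group()     # as many o's as possible

print(optional, literal, lazy, greedy)
```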
I would like to use Xcode's Find in Project option to normalize the signatures of methods.
I wrote the find expression:
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
and the replacement expression:
\1 \(\2\3\)\4\6\7\8\9\10\11
The test string is:
+(NSString *) testFunction : (NSInteger ) arg1
and the desired result:
+ (NSString*)testFunction:(NSInteger)arg1
Unfortunately, Xcode isn't able to recognize the two-digit capture group \10 and translates it to \1 followed by a '0' character, and so on. How can I solve this problem or bug?
Thanks in advance,
Michał
I believe @trojanfoe is correct; regexes can only have nine capture groups. This is waaay more than you need for your particular example, though.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
\1 \(\2\3\)\4\6\7\8\9\10\11
The first thing I notice is that you're not using \5, so there's no reason to capture it at all. Next, I notice that \6 corresponds to the regex (:), so you can avoid capturing it and replace \6 with : in the output. \7 corresponds to (\(), so you can replace \7 with ( in the output. ...Iterating this approach yields a much simpler pair of regexes: one for zero-argument methods and one for one-argument methods.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)
\1 \(\2\3\)\4
^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)
\1:\(\2\3\)\4
Notice that I can capture the whole regex [+-] \(\w+\*?\)\w+ without all those noisy \s*s, because it's been normalized already by the first regex's pass.
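The two passes can be tried outside Xcode. This Python sketch applies the same two patterns in sequence with re.sub; the only change is that the parentheses in the replacement strings are written without backslashes, because Python's replacement syntax does not accept \( there. The function name is made up for the example.

```python
import re

# Pass 1 normalizes the return type and method name.
PASS_1 = re.compile(r"^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)")
# Pass 2 normalizes a single argument, relying on pass 1's output format.
PASS_2 = re.compile(
    r"^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)"
)


def normalize(signature: str) -> str:
    """Run both passes; a pattern that does not match leaves the text alone."""
    step = PASS_1.sub(r"\1 (\2\3)\4", signature)
    return PASS_2.sub(r"\1:(\2\3)\4", step)


print(normalize("+(NSString *) testFunction : (NSInteger ) arg1"))
print(normalize("-(void)reset"))
```

No pass needs more than four groups, so the nine-group limit never comes into play.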
However, this whole idea is a huge mistake. Consider the following Objective-C method declarations:
-(const char *)toString;
-(id)initWithA: (A) a andB: (B) b andC: (C) c;
-(NSObject **)pointerptr;
-(void)performBlock: (void (^)(void)) block;
-(id)stringWithFormat: (const char *) fmt, ...;
None of these are going to be parsed correctly by your regex. The first one contains a two-word type const char instead of a single word; the second has more than one parameter; the third has a double pointer; the fourth has a very complicated type instead of a single word; and the fifth has not only const char but a variadic argument list. I could go on, through out parameters and arrays and __attribute__ syntax, but surely you're beginning to see why regexes are a bad match for this problem.
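A quick check of the first and third declarations against the zero-argument pattern shows the failure directly. The sketch uses Python's re module as a stand-in for Xcode's engine, which is an assumption, though these constructs behave the same way in both.

```python
import re

# The zero-argument pattern from above.
PASS_1 = re.compile(r"^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)")

declarations = [
    "-(const char *)toString;",   # two-word type
    "-(NSObject **)pointerptr;",  # double pointer
]

for declaration in declarations:
    status = "matched" if PASS_1.search(declaration) else "not matched"
    print(declaration, "->", status)
```

Both declarations are silently left as they are, so a project-wide replace would normalize some signatures and skip others without any indication.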
What you're really looking for is an indent program (named after GNU indent, which unfortunately doesn't do Objective-C). The best-known and best-supported Objective-C indent program is called uncrustify; get it here.
I want to verify that a given file in a path is a text file, i.e. not binary, i.e. readable by a human. I guess I could read the first characters and check each one with:
isAlphaNumeric
isSpecial
isSeparator
isOctetCharacter ???
but joining all those testing methods with and: [ ... and: [ ... and: [ ] ] ] seems not to be very smalltalkish. Any suggestion for a more elegant way?
(There is a Python version in "How to identify binary and text files using Python?" which could be useful, but its syntax and implementation look like C.)
Only heuristics are possible; you can never be really certain.
For ascii, the following may do:
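| isPlausibleAscii numChecked isPossiblyText |
"Printable ASCII runs from 32 (space) to 126 (~); 127 is DEL. Separators cover tab, CR and LF."
isPlausibleAscii :=
    [:char |
        (char codePoint between: 32 and: 126)
            or: [char isSeparator]].
"Only inspect the first 1024 characters."
numChecked := text size min: 1024.
isPossiblyText := text from: 1 to: numChecked conform: isPlausibleAscii.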
For Unicode (UTF-8?) things become more difficult; you could then try to convert. If there is a conversion error, assume binary.
PS: if you don't have from:to:conform:, replace by (copyFrom:to:) conform:
PPS: if you don't have conform: , try allSatisfy:
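For comparison, here is the same heuristic as a Python sketch operating on raw bytes. The function names and the 1024-byte default are illustrative, and the set of whitespace bytes treated as separators (tab, LF, CR, form feed) is an assumption about what isSeparator covers.

```python
def is_plausible_ascii(byte: int) -> bool:
    """Printable ASCII (32..126) or a common whitespace control character."""
    return 32 <= byte <= 126 or byte in b"\t\n\r\f"


def is_possibly_text(data: bytes, limit: int = 1024) -> bool:
    """Check only the first `limit` bytes, as in the Smalltalk snippet."""
    return all(is_plausible_ascii(byte) for byte in data[:limit])


print(is_possibly_text(b"hello\nworld\t!"))
print(is_possibly_text(b"\x89PNG\r\n\x1a\n"))
```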
All text contains more space than you'd expect to see in a binary file, and some encodings (UTF16/32) will contain lots of 0's for common languages.
A smalltalky solution would be to hide the gory details in a method on Standard/MultiByte-FileStream; #isProbablyText would probably be a good choice.
It would essentially do the following:
- Store the current state if you intend to use it later, then reset to the start. (Set a Latin1 converter if you use a MultiByteStream.)
- Iterate over the next N characters (where N is an appropriate number).
- Encounter a non-printable ASCII char? It's probably binary, so return false. (There is no special selector for this; use a map, implement a new method on Character, or something similar.)
- Increase two counters as appropriate: one for space characters and another for zero characters.
- If the loop finishes, return whether either of the counters has reached a statistically significant count.
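The steps above can be sketched in Python as follows. It works on a byte string rather than a stream, reads each byte as one Latin-1 character, and exempts the zero byte from the non-printable check so that the zero counter has something to count. The sample size and both ratio thresholds are made-up placeholders, not statistically derived values.

```python
def is_probably_text(data: bytes, limit: int = 1024,
                     min_space_ratio: float = 0.05,
                     min_zero_ratio: float = 0.2) -> bool:
    """Heuristic text check; thresholds are illustrative assumptions."""
    sample = data[:limit]
    if not sample:
        return False  # nothing to judge; an arbitrary choice for this sketch
    spaces = zeros = 0
    for byte in sample:
        if byte == 0:
            zeros += 1   # frequent in UTF-16/32 text for common languages
        elif byte == 0x20:
            spaces += 1
        elif (byte < 0x20 and byte not in b"\t\n\r\f") or byte == 0x7F:
            return False  # non-printable ASCII: probably binary
    return (spaces / len(sample) >= min_space_ratio
            or zeros / len(sample) >= min_zero_ratio)


print(is_probably_text(b"the quick brown fox jumps over the lazy dog\n"))
print(is_probably_text("hello world".encode("utf-16-le")))
print(is_probably_text(b"\x89PNG\r\n\x1a\n"))
```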
TL;DR: Use a method to hide the gory details; otherwise it's pretty much the same.