skip to next non whitespace character in string array - objective-c

I have a problem...
In attempting to parse a file separated by nothing more than whitespace, I have an issue. I have decided the best way to do this is to tokenise the string I have. So far I have put all my lines into an array (by splitting on the newline character, so each entry in the array is one line of the file). So my array may contain 5 entries such as:
1)mary julia anne steve
2)alex james david katie
3)omegle yikes craxy horse
4)foo bar foobar matt maximus
5)capital or not smack
As you can see, the words in each entry may be separated by differing amounts of whitespace, which can be one or more tabs or any number of regular space characters.
I've considered looping through the string char by char until a non-whitespace character is detected, but this seems ugly...
Any help?
Thanks :)

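sscanf does it all for you:
#include <stdio.h>
const char *s = "\nmary julia anne \t steve";
char o[100];
int n = 0;
/* %99s skips leading whitespace and reads one token; %n records the characters consumed */
while (sscanf(s += n, "%99s%n", o, &n) == 1)
    puts(o);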

str += strspn(str, " \t\r\n" );

Use isspace() from <ctype.h>.
From man isspace:
isspace()
checks for white-space characters. In the "C" and "POSIX" locales, these are: space, form-feed ('\f'), newline ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
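The skip-whitespace-then-read-a-token loop described in these answers can be sketched as follows. It is written in Python rather than C purely for illustration, `split_on_whitespace` is a made-up helper name, and `str.isspace()` stands in for C's isspace():

```python
def split_on_whitespace(line):
    """Return the tokens of `line`, treating any run of whitespace as one separator."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        # Skip the run of whitespace before the next token (the strspn step).
        while i < n and line[i].isspace():
            i += 1
        # Scan forward to the end of the token.
        start = i
        while i < n and not line[i].isspace():
            i += 1
        if i > start:
            tokens.append(line[start:i])
    return tokens


print(split_on_whitespace("\nmary julia   anne \t steve"))
```

In Python itself, `line.split()` with no arguments gives the same result, so the explicit loop is only there to show the mechanics.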

Related

match multiline strings with regex

Is it possible to match multiline strings with match() function?
I tried to apply match(/(abc)\rdef/) to a cell containing 2 lines of text abc & def, but it does not work. Is there a way to get "abc" as result?
Simply use \n (newline) instead of \r (carriage return).
value.match(/(abc)\ndef/)
But you have to indicate where the newline is. match has no "multiline" parameter, so the dot (.) doesn't match line breaks.
Of course! Thanks, Ettore.
And I found a way to do what I wanted with value.match(/(.*?\n)*(def)\n?(.*?\n?)*/)
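The point about the dot and line breaks can be checked with a small sketch. It uses Python's `re` module, not the match() function from the question, so it only illustrates the general regex behaviour; the details of that tool may differ:

```python
import re

cell = "abc\ndef"  # two lines of text, separated by a newline

# Spell out the newline explicitly and capture the first line.
explicit = re.search(r"(abc)\ndef", cell)
print(explicit.group(1))

# By default the dot does not match a newline...
dot_default = re.search(r"abc.def", cell)
# ...unless the "dot matches everything" flag is switched on.
dot_all = re.search(r"abc.def", cell, re.DOTALL)
print(dot_default, dot_all is not None)
```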

How to separate words characters and non word characters?

Unicode has categories of characters. Some are alphanumeric. Some are punctuation.
What if I want to know whether a character belongs to a word or not?
For example,
A, a, b, c tend to belong to words. So do Ƈ, Ǝ, ǟ, and so do all Chinese characters.
Sentences like
Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©
Have keywords:
Hello
World
I
like
to
eat
ƇƎǟ
款
开
源
Here, the comma, the quotes, the parentheses and © are not word characters and hence should just be ignored.
© doesn't count as punctuation either. '©'.IsPunctuation returns false in VB.NET, but I want to get rid of that too.
Now I want to make a program that can split sentences into keywords. For that I need to know which characters are word characters and which are not.
Is there a vb.net function for that?
Do it the other way round: use IsLetter for your test. Or better yet, use regular expressions to split your string into words:
Dim str = "Hello World, I ""like"" (to) eat ƇƎǟ and 款开源 ©"
Dim wordPattern As New Regex("\p{L}+") ' Regex lives in System.Text.RegularExpressions
For Each match In wordPattern.Matches(str)
    Console.WriteLine(match)
Next
Here, \p{L} matches any Unicode letter. However, the above matches “款开源” as a single match rather than as separate matches, since there is no separator between the characters.
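The same letter-run idea can be sketched outside .NET. The following is an illustration in Python, treating every Unicode general category that starts with "L" as the equivalent of \p{L}; `is_letter` and `letter_runs` are hypothetical helper names:

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return each maximal run of consecutive letters in `text`."""
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]


sentence = 'Hello World, I "like" (to) eat ƇƎǟ and 款开源 ©'
words = letter_runs(sentence)
print(words)
```

As with the regex, 款开源 comes out as one run, because nothing separates the three characters.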
You need to deal with character codes.
For example, if you only want the letters [a-z],
then
if (c >= 'a' && c <= 'z') {
}
or
if (c >= 97 && c <= 122) {
}

Regex for letters, digits, no spaces

I'm trying to create a Regex to check for 6-12 characters, one being a digit, the rest being any characters, no spaces. Can Regex do this? I'm trying to do this in objective-c and I'm not familiar with Regex at all. I've been reading a couple tutorials, but most are for matching simple cases of a number, or a set of numbers, but not exactly what i'm looking for. I can do it with methods, but I was wondering if it that would be too slow and I figured I could try learning something new.
asdfg1 == ok
asdfg 1 != ok
asdfgh != ok
123456 != ok
asdfasgdasgdasdfasdf != ok
Use this regex: ^(?=.*\d)(?=.*[a-zA-Z])[^ ]{6,12}$
It seems that you mean "letter" when you say "character", right? And (thanks to burning_LEGION for pointing that out) there may be only one digit?
In that case, use
^(?=\D*\d\D*$)[^\W_]{6,12}$
Explanation:
^ # Start of string
(?=\D*\d\D*$) # Assert that there is exactly one digit in the string
[^\W_] # Match a letter or digit (explanation below)
{6,12} # 6-12 times
$ # End of string
[^\W_] might look a little odd. How does it work? Well, \w matches any letter, digit or underscore. \W matches anything that \w doesn't match. So [^\W] (meaning "match any character that is not not alphanumeric/underscore") is essentially the same as \w, but by adding _ to this character class, we can remove the underscore from the list of allowed characters.
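The pattern can be checked against the examples from the question. This sketch uses Python's `re` module as a stand-in, since the Objective-C regex API is not shown here, and `is_valid` is a hypothetical name:

```python
import re

# Exactly one digit, 6-12 letters or digits in total, no underscore or space.
PATTERN = re.compile(r"^(?=\D*\d\D*$)[^\W_]{6,12}$")


def is_valid(candidate):
    return PATTERN.match(candidate) is not None


for candidate in ["asdfg1", "asdfg 1", "asdfgh", "123456", "asdfasgdasgdasdfasdf"]:
    print(repr(candidate), is_valid(candidate))
```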
I didn't try it, but I think this is the answer:
^(?=[^\d\x20]*\d[^\d\x20]*$).{6,12}$
This is for one digit: ^[^\d\x20]{0,11}\d{1}[^\d\x20]{0,11}$, but I can't get it limited to a length of 6-12. You can use another function to check the length first and, if it is from 6 to 12, check it with the regex which I wrote.

Which Unicode characters are "composing" characters (whose sole purpose is to add an accent or tilde)?

This is related to
What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?
This is how I plan to do this:
Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into KD form.
Basically it'll turn most variations, such as superscripts, into the normal number. It also decomposes a character with a tilde or accent into 2 characters.
The next step would be to remove all characters whose sole purpose is to add a tilde or accent to another character.
How do I know which characters are like that? Which characters are just "composing characters"?
How do I find such characters? After I find those, how do I get rid of them? Should I scan character by character and remove all such "combining characters"?
For example:
Characters from 300 to 362 can be gotten rid of.
Then what?
Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.
You may be able to do it more efficiently using a regex, e.g. Regex.Replace(str, "\p{M}", "").
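Put together, the normalise-then-filter approach might look like this. It is a sketch in Python rather than VB.NET, with `unicodedata.category` playing the role of GetUnicodeCategory; `strip_combining_marks` is a hypothetical name:

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose to NFKD, then drop every character whose category is a Mark (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


print(strip_combining_marks("café ñandú x²"))
```

Filtering on the category rather than on a fixed code point range avoids having to decide by hand where the combining characters start and stop.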

Smalltalk, newline character

Does anybody know what's the newline delimiter for a string in smalltalk?
I'm trying to split a string in separate lines, but I cannot figure out what's the newline character in smalltalk.
ie.
string := 'smalltalk is
a lot of fun.
ok, it's not.'
I need to split it in:
line1: smalltalk is
line2: a lot of fun.
line3: ok, it's not.
I can split a line based on any letter or symbol, but I can't figure out what the newline delimiter is.
OK, here is how I'm splitting the string based on commas, but I cannot do it based on a newline.
The newline delimiter is typically the carriage return, i.e., Character cr, or as others mentioned, in a string, String cr. If you wanted to support all standard newline formats, just include both standard delimiters, for example:
string := 'smalltalk is
a lot of fun.'.
string findTokens: String cr, String lf.
Since you now mention you're using VisualWorks, the above won't work unless you have the "squeak-accessing" category loaded (which you probably won't unless you're using Seaside). You could use a regular expression match instead:
'foo
bar' allRegexMatches: '[^', (String with: Character cr), ']+'
A quick solution (I don't know if it is the best) is:
|array |
array := mystring findTokens: String cr
Where String cr is the carriage return character
As noted in this question: Character cr.
You can mark the line breaks with backslashes and then send the String>>withCRs message, which replaces each backslash with a carriage return, thus:
string := 'smalltalk is\
a lot of fun.\
ok, it's not.' withCRs.
It depends, of course, on the encoding. It could be cr, lf or crlf. For Unicode there are a few extra possibilities. See: Pharo linesDo:
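To illustrate why it helps to handle cr, lf and crlf together, here is a sketch in Python (not Smalltalk) that splits a string containing all three conventions:

```python
# One line ends with cr, one with crlf, one with lf.
text = "smalltalk is\ra lot of fun.\r\nok, it's not.\n"

# str.splitlines() treats cr, lf and crlf (and a few Unicode separators) as line boundaries.
lines = text.splitlines()
print(lines)
```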