Checking for different words in a value - openrefine

I need to check for several keywords in the cells of a column called item_description. If a cell contains any of the words, 1 should be returned; otherwise, 0 should be returned.
if(or(cells.item_description.value.contains("new"), cells.item_description.value.contains("5"), cells.item_description.value.contains("some")), "1", "0")
I expected it to return 1 or 0, but I got the error message:
Parsing error at offset 95: Missing number, string, identifier, regex,
or parenthesized expression

The problem probably comes from mixing different kinds of quotation marks (" is NOT the same as “, even though the two look very similar at first glance).
This version should work:
if(or(cells.item_description.value.contains("new"), cells.item_description.value.contains("5"), cells.item_description.value.contains("some")), "1", "0")
By the way, in OpenRefine 3 (and 3.1), contains now accepts a regex, so you can rewrite your if expression like this:
if(cells.item_description.value.contains(/new|5|some/), "1", "0")
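As a cross-check outside OpenRefine, the equivalence between the three chained contains calls and the single regex can be sketched in Python. This is only an analogy: the function names and sample values are made up for illustration, and Python's re module stands in for GREL's regex support.

```python
import re

# Keywords taken from the question; the regex is the alternation from the answer.
KEYWORDS = ("new", "5", "some")
PATTERN = re.compile(r"new|5|some")


def flag_with_or(value):
    """Mirror of if(or(contains(...), contains(...), contains(...)), "1", "0")."""
    return "1" if any(keyword in value for keyword in KEYWORDS) else "0"


def flag_with_regex(value):
    """Mirror of if(value.contains(/new|5|some/), "1", "0")."""
    return "1" if PATTERN.search(value) else "0"


# Hypothetical cell values, only for illustration.
samples = ["brand new item", "pack of 5", "old stock", "something else"]
for sample in samples:
    print(sample, flag_with_or(sample), flag_with_regex(sample))
```

Both helpers are substring tests, so "something else" is flagged because it contains "some"; the same holds for the GREL expressions.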

Related

Oracle REGEXP_LIKE doesn't work as expected

I was testing a regular expression in Oracle SQL and found something I could not understand:
-- NO MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof[^\s]*(\s|$)');
Above doesn't match, while the following matches:
-- MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof\S*(\s|$)');
In other regex flavors, the equivalents would be \bProf[^\s]*\b and \bProf\S*\b, and both give the same result. Note: Oracle SQL regex does not have \b or any other word-boundary anchor.
Question: Why don't [^\s]* and \S* work the same way in Oracle SQL?
I notice if I remove the (\s|$) at the end, the first regex will match.
In Oracle regular expressions, \s is indeed the escape sequence for a whitespace character, but NOT inside a matching character set (that is, [.....], or [^....] for excluding characters). Inside brackets the backslash has no special meaning, so [^\s] means "any character other than a backslash or the letter s"; in 'Professor' the run therefore stops at the first s, and the (\s|$) that follows cannot match. In a matching character set, only two characters have a special meaning: - for ranges and ] for closing the set. They can't be escaped. If needed in the set, ] must be the first character right after the opening [ (it is the ONLY position in which a closing ] stands for itself and does not end the set), and - must be first or last (best to always leave it at the end of the set); anywhere else it is read as a range marker. To include (or exclude, with the [^.....] syntax) a space, just type an actual space in the set.
Edit: What I said above is not entirely right. There is another special character in a matching set, namely ^. If it is used in the first position, it means "match any character OTHER THAN." In any other position it stands for itself. For example, '[^^]' will match any single character OTHER THAN ^ (the first ^ has special meaning, the second stands in for itself). And, a closing bracket ] stands for itself if it is the second character in brackets, if the first character is ^ (with its SPECIAL meaning). That is, to match any single character OTHER THAN ], we can use the matching pattern '[^]]'.
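The difference can be reproduced outside Oracle with a small Python sketch. Note the assumption: Python's re module does treat \s as whitespace inside brackets, so the backslash is escaped here to imitate the literal reading described above; this is an emulation, not Oracle's engine.

```python
import re

text = "Professor Frank"

# Imitates the literal reading: [^\\s] is "any character other than a
# backslash or the letter s" (the backslash is escaped on purpose).
literal_reading = re.compile(r"(^|\s)Prof[^\\s]*(\s|$)")

# \S outside brackets means "any non-whitespace character".
shorthand = re.compile(r"(^|\s)Prof\S*(\s|$)")

# Same literal reading, but without the trailing (\s|$) group.
literal_no_tail = re.compile(r"(^|\s)Prof[^\\s]*")

print(literal_reading.search(text))           # None: the run stops at the 's'
print(repr(shorthand.search(text).group(0)))  # 'Professor '
print(literal_no_tail.search(text).group(0))  # Profe
```

The last pattern also accounts for the observation in the question: once the trailing (\s|$) is removed, the first regex matches, but only as far as "Profe".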

Using RegExp in sql to find rows that only contain 'x'

How do I use a regexp to find only rows where the first name contains just one type of character, 'x', no matter how many characters there are?
So far I came up with:
REGEXP_LIKE(LOWER(fst_name),'^x+$'))
possible rows I am looking for:
'x'
'xx'
'xxx'
'xxxxxxxxx'
So I'm interpreting this as: find the rows where x is at the beginning and the end of the field, with only x's in between. Am I interpreting this correctly, or could it also match something like 'xxxxxxaxxxxx'?
Your regex is correct:
^x+$
^ is the "start" anchor
x is the character for which you are searching. I assume it isn't a regex metacharacter
+ is the "one or more" quantifier
$ is the "end" anchor
So I would interpret your regex to match all of the cases you supplied, and would not match something like 'xxxxaxxxx'. http://regex101.com/r/dE8vU6
It's been long enough since I used Oracle that I don't recall whether your REGEXP_LIKE syntax is correct there, but it seems right to me.
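The anchors and quantifier behave the same way in most regex engines, so the interpretation can be checked quickly in Python. This is a sketch using Python's re module rather than Oracle's REGEXP_LIKE; the candidate list extends the examples from the question with a few made-up values.

```python
import re

# Same pattern as in the question: start anchor, one or more x, end anchor.
only_x = re.compile(r"^x+$")

# The last three values are extra, hypothetical test cases.
candidates = ["x", "xx", "xxx", "xxxxxxxxx", "xxxxxxaxxxxx", "", "XX"]

for name in candidates:
    # lower() plays the role of LOWER(fst_name) in the SQL expression.
    matched = bool(only_x.search(name.lower()))
    print(repr(name), matched)
```

Only the strings made up entirely of x's (in either case, because of the lower-casing) match; 'xxxxxxaxxxxx' and the empty string are rejected.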

Fortran read statement reading beyond an end of line

Do you know whether the following statement is guaranteed to be true by any of the Fortran 90/95/2003 standards?
"Suppose a read statement for a character variable is given a blank line (i.e., one containing only whitespace and the newline character). If the format specifier is an asterisk (*), it continues to read the subsequent lines until a non-blank line is found. If the format specifier is '(A)', a blank string is assigned to the character variable."
For example, please look at the following minimal program and input file.
program code:
PROGRAM chk_read
  INTEGER, PARAMETER :: MAXLEN=30
  CHARACTER(len=MAXLEN) :: str1, str2
  str1='minomonta'
  read(*,*) str1
  write(*,'(3A)') 'str1_start|', str1, '|str1_end'
  str2='minomonta'
  read(*,'(A)') str2
  write(*,'(3A)') 'str2_start|', str2, '|str2_end'
END PROGRAM chk_read
input file:
----'input.dat' content is below this line----
yamanakako
kawaguchiko
----'input.dat' content is above this line----
Please note that there are four lines in 'input.dat' and the first and third lines are blank (contain only white spaces and new line characters). If I run the program as
$ ../chk_read < input.dat > output.dat
I get the following output
----'output.dat' content is below this line----
str1_start|yamanakako |str1_end
str2_start| |str2_end
----'output.dat' content is above this line----
The first read statement for the variable 'str1' seems to look at the first line of 'input.dat', find a blank line, move on to the second line, find the character value 'yamanakako', and store it in 'str1'.
In contrast, the second read statement for the variable 'str2' seems to be given the third line, which is blank, and store the blank line in 'str2', without moving on to the fourth line.
I tried compiling the program by Intel Fortran (ifort 12.0.4) and GNU Fortran (gfortran 4.5.0) and got the same result.
Some background on why I am asking: I am writing a subroutine to read a data file that uses a blank line as a separator between data blocks. I want to make sure that the blank line, and only the blank line, is thrown away while reading the data. The code also needs to be standard-conforming and portable.
Thanks for your help.
From Fortran 2008 standard draft:
List-directed input/output allows data editing according to the type
of the list item instead of by a format specification. It also allows
data to be free-field, that is, separated by commas (or semicolons) or
blanks.
Then:
The characters in one or more list-directed records constitute a
sequence of values and value separators. The end of a record has the
same effect as a blank character, unless it is within a character
constant. Any sequence of two or more consecutive blanks is treated as
a single blank, unless it is within a character constant.
This implicitly states that in list-directed input, blank lines are treated as blanks until the next non-blank value.
When reading with the fmt='(A)' format descriptor, blank lines are read into str. On the other hand, fmt=*, which implies list-directed (free-form) I/O, skips blank lines until it finds a non-blank character string. To test this, do something like:
PROGRAM chk_read
  INTEGER :: cnt
  INTEGER, PARAMETER :: MAXLEN=30
  CHARACTER(len=MAXLEN) :: str
  cnt=1
  do
    read(*,fmt='(A)',end=100) str
    write(*,'(I1,3A)') cnt, ' str_start|', str, '|str_end'
    cnt=cnt+1
  enddo
100 continue
END PROGRAM chk_read
$ cat input.dat
yamanakako
kawaguchiko
EOF
Running the program gives this output:
$ a.out < input.dat
1 str_start| |str_end
2 str_start| |str_end
3 str_start| |str_end
4 str_start|yamanakako |str_end
5 str_start| |str_end
6 str_start|kawaguchiko |str_end
On the other hand, if you use list-directed input:
read(*,fmt=*,end=100)str
You end up with this output:
$ a.out < input.dat
1 str_start|yamanakako |str_end
2 str_start|kawaguchiko |str_end
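The two behaviors can be modeled roughly in Python. This is a deliberately simplified sketch, not Fortran semantics: the function names are invented, records are plain strings, and list-directed input is reduced to "skip blank records, take the first whitespace-separated token" (commas, slashes, quoted strings and repeat counts are ignored).

```python
def read_list_directed(records, pos):
    """Rough model of read(*,*) str: blank records act like separators."""
    while pos < len(records):
        tokens = records[pos].split()
        pos += 1
        if tokens:
            return tokens[0], pos
    raise EOFError("no more non-blank records")


def read_a_format(records, pos):
    """Rough model of read(*,'(A)') str: take the next record as it is."""
    if pos >= len(records):
        raise EOFError("no more records")
    return records[pos], pos + 1


# The four-line input file described in the question (lines 1 and 3 blank).
records = ["", "yamanakako", "", "kawaguchiko"]

str1, pos = read_list_directed(records, 0)
str2, pos = read_a_format(records, pos)
print(repr(str1), repr(str2))
```

In this model the first read consumes the blank first record and the second record, and the '(A)' read then receives the blank third record, which mirrors the output reported in the question.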
This part of the F2008 standard draft probably addresses your problem:
10.10.3 List-directed input
7 When the next effective item is of type character, the input form
consists of a possibly delimited sequence of zero or more
rep-chars whose kind type parameter is implied by the kind of the
effective item. Character sequences may be continued from the end of
one record to the beginning of the next record, but the end of record
shall not occur between a doubled apostrophe in an
apostrophe-delimited character sequence, nor between a doubled quote
in a quote-delimited character sequence. The end of the record does
not cause a blank or any other character to become part of the
character sequence. The character sequence may be continued on as many
records as needed. The characters blank, comma, semicolon, and slash
may appear in default, ASCII, or ISO 10646 character sequences.

REGEX for complete word matching

OK, so I am confused (obviously).
I'm trying to return rows (from Oracle) where a text field contains a complete word, not just the substring.
A simple example is the word 'I'.
Show me all rows where the string contains the word 'I', but not simply where 'I' is a substring somewhere, as in '%I%'.
So I wrote what I thought would be a simple regex:
select REGEXP_INSTR(upper(description), '\bI\b') from mytab;
expecting that 'I' would be detected at word boundaries. I get no results (or rather, the result 0 for each row).
What I expect:
'I am the Administrator' -> 1
'I'm the administrator' -> 0
'Am I the administrator' -> 1
'It is the infamous administrator' -> 0
'The adminisrtrator, tis I' -> 1
Isn't \b supposed to find the contained string by word boundary?
tia
I believe that \b is not supported by your flavor of regex:
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm#i1007670
Therefore you could do something like:
(^|\s)word(\s|$)
This at least ensures that your "word" is delimited by whitespace or by the start or end of the string.
Oracle doesn't support word boundary anchors, but even if it did, you wouldn't get the desired result: \b matches between an alphanumeric character and a non-alphanumeric character. The exact definition of what an alnum is differs between implementations, but in most flavors, it's [A-Za-z0-9_] (.NET also considers Unicode letters/digits).
So there are two boundaries around the I in %I%.
If you define your word boundary as "whitespace before/after the word", then you could use
(^|\s)I(\s|$)
which would also work at the start/end of the string.
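To see how the two definitions of a "word" differ on the sample rows from the question, here is a small Python sketch. Python's re module does support \b, so it can show what a word-boundary anchor would have returned next to the whitespace-delimited pattern; it is a stand-in for illustration, not Oracle's engine.

```python
import re

# Sample rows copied from the question (including its original spelling).
rows = [
    "I am the Administrator",
    "I'm the administrator",
    "Am I the administrator",
    "It is the infamous administrator",
    "The adminisrtrator, tis I",
]

whitespace_delimited = re.compile(r"(^|\s)I(\s|$)")
word_boundary = re.compile(r"\bI\b")

for row in rows:
    by_space = int(bool(whitespace_delimited.search(row)))
    by_boundary = int(bool(word_boundary.search(row)))
    print(f"{row!r}: whitespace={by_space} boundary={by_boundary}")
```

The two patterns disagree only on "I'm the administrator": the apostrophe is a non-alphanumeric character, so \b sees a boundary after the I, while the whitespace-delimited pattern gives the 0 the question expects.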
Oracle native regex support is limited. \b or \< cannot be used as word delimiters. You may want Oracle Text for word search.

Comma, ')',or valid expression continuation expected

I need my VB.NET program to write a file containing the following line:
objWriter.WriteLine ("TEXTA " (FILEA) " TEXTB")
Unfortunately, the variable (FILEA) is causing problems. I now get the error:
Comma, ')', or valid expression continuation expected.
Could someone explain this please?
You're not concatenating (joining) the strings properly...
objWriter.WriteLine ("TEXTA " & FILEA & " TEXTB")
A better style to get into the habit of using is:
objWriter.WriteLine (String.Format("TEXTA {0} TEXTB", FILEA))
The FILEA variable replaces the {0} placeholder in the format string. Depending on which writer you're using, it may have an overload that takes a format string, so you could just do:
objWriter.WriteLine ("TEXTA {0} TEXTB", FILEA)
And since you asked for an explanation:
The compiler is asking you what exactly you want it to do. You've given it three expressions (string, variable, string) and haven't told it that you want to join them together. It's saying that after the first string "TEXTA ", there should be either the closing bracket (to end the method call), a comma (to pass another parameter to the method), or a "valid expression continuation", i.e. something that tells it what to do with the next bit. In this case, you want an expression continuation, specifically an ampersand to signify "concatenate with the next 'thing'".
Presumably you're looking for string concatenation? Try this:
objWriter.WriteLine("TEXTA " & FILEA & " TEXTB")
Note that FILEA isn't exactly a conventional variable name... which leads me to suspect there may be something else you're trying to achieve. Could you give more details?