Extract a substring using XPath where there might be no trailing delimiter in the field - vb.net

I'm trying to parse an XML file where the users (in their infinite wisdom) type a key value into a free-form field, <Description>. The values are normally typed in with returns (BR's?) between them. For instance:
<Description>
% Increase: 27%
Completion Date: 10-Aug-2015
</Description>
I need to look for and extract the date following the string "Completion Date:". Looking around here on SO I found something similar and adapted it to:
compdate = deal.SelectSingleNode("./Terms/Description[substring-before(substring-after(.,'Completion Date:'),'/')]")
The problem is that in the original question there was a trailing character that could be used to delimit the text, a /. In my case, there might be a BR of some sort, or it might be the last (as in this case) or only item on the line and thus there's no delimiter.
So... suggestions on how to extract the date? I can do it on the VB side, but I'd like to remain in the XPath world for code clarity - unless of course the resulting XPath is unreadable.

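If an XPath 2.0 solution is acceptable (note that .NET's built-in SelectSingleNode only supports XPath 1.0, so this needs a 2.0-capable processor), try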
./Terms/Description/tokenize(substring-after(.,'Completion Date: '), '\n')[1]
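If not, and the date format is always DD-Mon-YYYY (e.g. 01-Dec-2018, which is 11 characters), try the XPath 1.0 form below. It yields a string rather than a node, so evaluate it as an expression (for example with XPathNavigator.Evaluate) instead of passing it to SelectSingleNode:
substring(substring-after(./Terms/Description,'Completion Date: '), 1, 11)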
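For comparison, here is a minimal sketch of the host-language route the question mentions as a fallback: use the path only to locate the field, then take everything after the key up to the end of that line. It is written in Python rather than VB.NET, and the sample document, the Deal root element and the completion_date helper are assumptions made for illustration, not part of the original question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the data in the question.
XML = """<Deal><Terms><Description>
% Increase: 27%
Completion Date: 10-Aug-2015
</Description></Terms></Deal>"""


def completion_date(deal):
    """Return the text after 'Completion Date:' up to the end of that line, or None."""
    text = deal.findtext("./Terms/Description", default="")
    # [^\r\n]* stops at a line break OR at the end of the field,
    # so no trailing delimiter is required.
    match = re.search(r"Completion Date:[ \t]*([^\r\n]*)", text)
    return match.group(1).strip() if match else None


deal = ET.fromstring(XML)
print(completion_date(deal))
```

Because the pattern ends at either a line break or the end of the text, it covers both the "last item" and "only item" cases without assuming a fixed date width.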

Related

Get everything after a string pattern and before a ' ' in Databricks SQL

I got the following entry in my database with column name - properties_desc:
#Thu Sep 03 02:18:11 UTC 2020 cardType=MasterCard cardDebit=true cardUniqueNumber=f0b03da93bc70fbc194a5a4ef5879685
I want to trim the entry so I get: MasterCard
So basically, I want everything after 'cardType=' and before the next ' ' (space).
I tried following "Get everything after and before certain character in SQL Server",
but that works for a single special character and not for a string.
My try:
SUBSTRING(properties_desc, length(SUBSTRING(properties_desc, 0, length(properties_desc) - CHARINDEX ('cardType=', properties_desc))) + 1,
length(properties_desc) - length(SUBSTRING(properties_desc, 0, length(properties_desc) - CHARINDEX ('cardType=', properties_desc))) - length(SUBSTRING(
properties_desc, CHARINDEX (' ', properties_desc), length(properties_desc))))
But the above query does not work. Any help is appreciated.
How can I solve it?
You have tagged this question as both sql-server and databricks. Based on your use of length() instead of len(), I assume that you are using Databricks. In that case, you can make use of the regexp_extract() function.
Try: "regexp_extract(properties_desc, '(?<=cardType=)[^ ]*', 0)".
This is untested, as I am not a Databricks programmer. The trailing 0 asks for the whole match; if it is omitted, regexp_extract defaults to group 1, and this pattern has no capture group.
The "[^ ]*" in the above will match and extract a string of non-space characters after "cardType=". The "(?<=...)" is a "look-behind" construct that requires that the matched text be preceded by "cardType=", but does not include that text in the result. The end result is that the regex matches and extracts everything after "cardType=" up to the next space (or the end of the string).
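The look-behind pattern itself can be checked outside the database. The following is a small sketch using Python's re module, which supports the same "(?<=...)" construct. The card_type helper is a hypothetical name, and this only illustrates the regex, not Databricks' own function.

```python
import re


def card_type(properties_desc):
    """Return the run of non-space characters right after 'cardType=', or None."""
    # (?<=cardType=) requires the prefix but keeps it out of the match.
    match = re.search(r"(?<=cardType=)[^ ]*", properties_desc)
    return match.group(0) if match else None


row = ("#Thu Sep 03 02:18:11 UTC 2020 cardType=MasterCard "
       "cardDebit=true cardUniqueNumber=f0b03da93bc70fbc194a5a4ef5879685")
print(card_type(row))
```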
Regular expressions are a pretty powerful string matching tool. Well worth learning if you are not already familiar with them. (I wish SQL Server had them.)

How to include apostrophe in character set for REGEXP_SUBSTR()

The IBM i implementation of regex uses apostrophes (instead of e.g. slashes) to delimit a regex string, i.e.:
... where REGEXP_SUBSTR(MYFIELD,'myregex_expression')
If I try to use an apostrophe inside a [group] within the expression, it always errors - presumably thinking I am giving a closing quote. I have tried:
- escaping it: \'
- doubling it: '' (and tripling)
No joy. I cannot find anything relevant in the IBM SQL manual or by google search.
I really need this to, for instance, allow names like O'Leary.
Thanks to Wiktor Stribizew for the answer in his comment.
There are a couple of "gotchas" for anyone who might land on this question with the same problem. The first is that you have to give the (presumably Unicode) hex value rather than the EBCDIC value that you would use, e.g. in ordinary interactive SQL on the IBM i. So in this case it really is \x27 and not \x7D for an apostrophe. Presumably this is because the REGEXP_ ... functions are working through Unicode even for EBCDIC data.
The second thing is that it would seem that the hex value cannot be the last one in the set. So this works:
^[A-Z0-9_\+\x27-]+ ... etc.
But this doesn't:
^[A-Z0-9_\+-\x27]+ ... etc.
I don't know how to highlight text within a code sample, so I draw your attention to the fact that the hyphen is last in the first sample and second-to-last in the second sample.
If anyone knows why it has to not be last, I'd be interested to know. [edit: see Wiktor's answer for the reason]
btw, using double quotes as the string delimiter with an apostrophe in the set didn't work in this context.
A single quote can be defined with the \x27 notation:
^[A-Z0-9_+\x27-]+
          ^^^^
Note that a hyphen inside a character class/bracket expression forms a range when it sits between two characters; it is only taken literally at the start or end of the class (or when escaped). When you used ^[A-Z0-9_\+-\x27]+ you defined a range from + to ', which is an invalid range because + (\x2B) comes after ' (\x27) in the Unicode table.
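The same rule can be observed in other regex engines. As a sketch, Python's re module (not the IBM i engine, so treat it only as an illustration of the range rule) rejects the reversed range at compile time and accepts the version with the hyphen last. The GOOD and BAD names and the added $ anchor are assumptions for this example.

```python
import re

# Hyphen last: it is a literal, and \x27 is just the apostrophe.
GOOD = r"^[A-Z0-9_+\x27-]+$"
# Hyphen between + (0x2B) and \x27 (0x27): a reversed, invalid range.
BAD = r"^[A-Z0-9_+-\x27]+$"


def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(GOOD), compiles(BAD))
print(bool(re.match(GOOD, "O'LEARY")))
```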

How can I add a string character based on a position in OpenRefine?

I have a column in OpenRefine, and I would like to insert a character string into each of its rows at a given position in the existing string.
For example:
I have an 8-character numeric string, 85285296, and would like to add "-" after the fourth character: "8528-5296".
Can anyone help me find the specific function in OpenRefine?
Thanks
Tzipy
The simplest approach is to just use the expression language's built-in string indexing and concatenation:
value[0,4]+'-'+value[4,8]
or more generally, if you don't know that your value is exactly 8 characters long:
value[0,4]+'-'+value[4,999]
Possible solution (not sure if it's the most straightforward):
value.replace(/(\d{4})(.+)/, "$1-$2")
Here $1 stands for the content of the first parenthesized group in the regular expression and $2 for the content of the second, so each value in the column is replaced with $1-$2.
Some other options:
value.splitByLengths(4,4).join("-")
value.match(/(\d{4})(\d{4})/).join("-")
value.substring(0,4)+"-"+value.substring(4,8)
I think 'splitByLengths' is the neatest, but I might use 'match' instead because it fails with an error if your starting string isn't 8 digits. That way you don't accidentally process data that doesn't conform to your assumption about what is in the column. Alternatively, you could use a facet/filter to check this with any of the others.
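The trade-off between the lenient and the strict variants can be sketched outside OpenRefine. The Python below is only an analogy to the GREL expressions above; insert_at and insert_dash_strict are hypothetical helper names.

```python
import re


def insert_at(value, position, text="-"):
    """Lenient: insert text at a position, whatever the value looks like."""
    return value[:position] + text + value[position:]


def insert_dash_strict(value):
    """Strict: only accept exactly 8 digits, otherwise raise an error."""
    match = re.fullmatch(r"(\d{4})(\d{4})", value)
    if match is None:
        raise ValueError(f"expected exactly 8 digits, got {value!r}")
    return "-".join(match.groups())


print(insert_at("85285296", 4))
print(insert_dash_strict("85285296"))
```

The lenient version will happily transform a malformed value such as "8528529", while the strict one stops so the bad row can be inspected.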

Convert text with HTML character encoding to database characterset

Our application receives data from various sources. Some of these contain HTML character entities instead of regular characters. So instead of the string "â" we receive the string "&acirc;".
How can we convert "&acirc;" to a character in the database character set using SQL/PLSQL?
Our database is 10GR2.
I believe UNESCAPE_REFERENCE and ESCAPE_REFERENCE (in the UTL_I18N package) are what you're looking for:
UTL_I18N.UNESCAPE_REFERENCE('hello &lt; &#xe5;')
This returns 'hello <'||chr(229).
http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_i18n.htm#i998992
You can use the CHR() function to convert a character code to its character representation.
SELECT chr(226)
FROM dual;
CHR(226)
--------
â
For more information see: http://www.techonthenet.com/oracle/functions/chr.php
Hope it helps...
one solution
replace(your_test, '&acirc;', chr(226))
but you'd have to nest many replace functions, one for each entity you need to replace. This might be very slow if you have to replace many.
You can write your own function, searching for the ampersand and replacing when found.
Have you searched the Oracle Supplied Packages manual? I know they have a function that does the opposite for a few entities.
To convert a column in Oracle which contains HTML items to plain text, you could use:
trim(regexp_replace(UTL_I18N.unescape_reference(column_name), '<[^>]+>'))
It will replace HTML character entities as stated above, but will also remove HTML tags and remove leading and trailing spaces.
I hope it will help someone.
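To make the order of operations in that last expression concrete, here is a rough sketch of the same three steps (unescape, strip tags, trim) in Python. It uses the standard html.unescape rather than Oracle's UTL_I18N, so it illustrates the approach rather than Oracle's exact behaviour; to_plain_text is a hypothetical name.

```python
import html
import re


def to_plain_text(fragment):
    """Unescape entities, then drop anything that looks like a tag, then trim."""
    unescaped = html.unescape(fragment)           # '&acirc;' -> 'â', '&lt;' -> '<'
    without_tags = re.sub(r"<[^>]+>", "", unescaped)
    return without_tags.strip()


print(html.unescape("&acirc;"))
print(to_plain_text("  <p>caf&eacute; cr&egrave;me</p> "))
```

Note that because unescaping happens first, text that was escaped to look like a tag (for example "&lt;b&gt;") is also removed by the tag-stripping step.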

unwanted leading blank space on oracle number format

I need to pad numbers with leading zeros (total 8 digits) for display. I'm using oracle.
select to_char(1011,'00000000') OPE_NO from dual;
select length(to_char(1011,'00000000')) OPE_NO from dual;
Instead of '00001011' I get ' 00001011'.
Why do I get an extra leading blank space? What is the correct number formatting string to accomplish this?
P.S. I realise I can just use trim(), but I want to understand number formatting better.
@Eddie: I already read the documentation. And yet I still don't understand how to get rid of the leading whitespace.
@David: So does that mean there's no way but to use trim()?
Use FM (Fill Mode), e.g.
select to_char(1011,'FM00000000') OPE_NO from dual;
From that same documentation mentioned by EddieAwad:
"Negative return values automatically contain a leading negative sign and positive values automatically contain a leading space unless the format model contains the MI, S, or PR format element."
EDIT: The right way is to use the FM modifier, as answered by Steve Bosman. Read the section about Format Model Modifiers for more info.
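The idea of a reserved sign position is not unique to Oracle. As a loose analogy only (this is Python's format mini-language, not Oracle's TO_CHAR), a format that reserves a slot for the sign produces the same extra leading blank for positive numbers, and one that does not reserve it behaves like the FM version. The helper names are assumptions for this sketch.

```python
def pad_with_sign_slot(n):
    """Zero-pad to 8 digits and reserve one leading position for the sign."""
    return format(n, " 09d")   # ' ' flag: blank for positives, '-' for negatives


def pad_without_sign_slot(n):
    """Zero-pad to 8 characters with no reserved sign position."""
    return format(n, "08d")


print(repr(pad_with_sign_slot(1011)))
print(repr(pad_with_sign_slot(-1011)))
print(repr(pad_without_sign_slot(1011)))
```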