Matching a string which includes -,.$\/ with a regex - sql

I am trying to match a string which includes -,.$/ (and might include other special characters which I don't know yet) with a regex. I have to match the first 28 characters in the string.
The String is -->
Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00 NOT ADMINISTERED XX XX, XXXSFHIGSKF/XXXX PURPOSES ONLY
The regex I am using is ((([\w-,.$\/]+)\s){28}).*
Is there a better way to match special characters?
Also, I get an error if the string length is less than 28. What can I do to include a range so that the regex works even if the string is shorter than 28 characters?
The code looks something like this:
Select regexp_extract(Txn_Desc,'((([\w-,.$;!#\/%)^#<>&*(]+)\s){1,28}).*',1) as Transaction_Short_Desc,Txn_Desc
from Table x

It seems you are looking for 28 tokens.
Try
(\S+\s+){0,28}
or
([^ ]+ +){0,28}
This is the result for 8 tokens:
Received - Data Migration 1. Units, of UNITED
|        | |    |         |  |      |  |
1        2 3    4         5  6      7  8
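To try the token-based pattern outside the database, here is a minimal Python sketch. The helper name first_tokens is made up for illustration, and Python's re module stands in for whatever regex engine your SQL dialect uses, so treat it as a way to experiment rather than a drop-in replacement.

```python
import re

def first_tokens(text, n):
    """Return up to the first n whitespace-delimited tokens of text."""
    # Same idea as (\S+\s+){0,n}; the group is non-capturing because
    # only the overall match is needed here.
    pattern = r"(?:\S+\s+){0,%d}" % n
    match = re.match(pattern, text)
    # {0,n} also matches the empty string, so match is never None.
    return match.group(0).rstrip()

sample = ("Received - Data Migration 1. Units, of UNITED STATES $ CXXX "
          "CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248")
print(first_tokens(sample, 8))
```

Because of the lower bound 0, shorter strings no longer cause a failed match. Note that a final token with no whitespace after it is not included, since \s+ requires at least one whitespace character.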


extracting year from string using regexp_extract pyspark

This is the portion of my result :
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
return:
1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
<--this part for the last title is empty, and it is supposed to be empty
As you can see from the above, the two titles marked with <-- are given the wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title                                                       |uu  |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995)                                     |1995|
|Happy Anniversary (1959)                                    |1959|
|Paths (2017)                                                |2017|
|The Three Amigos - Outrageous! (2003)                       |2003|
|L'obsession de l'or (1906)                                  |1906|
|Babe Ruth Story, The (1948)                                 |1948|
|11'0901 - September 11 (2002)                               |2002|
|Blood Trails (2006)                                         |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009)                                      |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ]) assert that what precedes is ( or a space
\d{4} match a 4 digit year
(?=\S*\)|$) assert that ), possibly prefaced by non whitespace, follows
OR the end of the string follows
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\() means an open bracket before. (?=–|(–)|\)) means a closing bracket or an en dash (–) after; the dash appears twice because one of the two alternatives apparently stood for the mis-encoded form of that character in the original data. Once you have covered the dates in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.
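import pyspark.sql.functions as F
bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
# regexp_extract is called through F, and the column is named to match the output below
movies_DF\
    .withColumn('yearOfRelease', F.regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)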
+------------------------------------------------------+-------------+
|title                                                 |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995)                               |1995         |
|Death Note: Desu nôto (2006–2007)                     |2006         |
|Irwin & Fran 2013                                     |2013         |
|9500 Liberty (2009)                                   |2009         |
|test 1234 test 4567                                   |4567         |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952         |
|The Garden of Afflictions 2017                        |2017         |
|The Naked Truth (1957) (Your Past Is Showing)         |1957         |
|Conquest 1453 (Fetih 1453) (2012)                     |2012         |
|Commune, La (Paris, 1871) (2000)                      |2000         |
|1013 Briar Lane                                       |             |
+------------------------------------------------------+-------------+
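Also, the pattern works here without the r prefix because Python leaves unrecognized escapes such as \d and \( unchanged. Recent Python versions emit a warning for such escapes, though, so a raw string is still the safer choice.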
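If you want to check the bracket-or-end-of-string idea quickly without a Spark session, a plain-Python sketch like the one below may help. It assumes Python's re module behaves like Spark's Java regex engine for these particular constructs (fixed-width lookbehind, lookahead, alternation), and the helper name extract_year is made up for illustration.

```python
import re

# A 4-digit number right after "(" and right before ")" or an en dash,
# otherwise a 4-digit number at the very end of the title.
YEAR_RE = re.compile(r"(?<=\()\d{4}(?=[–)])|\d{4}$")

def extract_year(title):
    """Return the first match as a string, or "" when nothing matches."""
    match = YEAR_RE.search(title)
    return match.group(0) if match else ""

titles = [
    "Grumpier Old Men (1995)",
    "Death Note: Desu nôto (2006–2007)",
    "Irwin & Fran 2013",
    "Conquest 1453 (Fetih 1453) (2012)",
    "Commune, La (Paris, 1871) (2000)",
    "1013 Briar Lane",
]
for title in titles:
    print(repr(extract_year(title)), title)
```

Like regexp_extract, re.search returns the leftmost match, which is why numbers such as 1453 that are not directly preceded by ( are skipped.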
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing other numbers in the titles. Also, your regexp does not work because not all years are surrounded by brackets in your examples, and sometimes there are non-numeric characters inside the brackets.

DFA that contains 1011 as a substring

I have to draw a DFA that accepts set of all strings containing 1011 as a substring in it. I tried but could not come up with one. Can anyone help me please?
Thanks
The idea for a DFA that does this is simple: keep track of how much of that substring we have seen on the end of the input we've seen so far. If you eventually get to a point where the input you've seen so far ends with that substring, then you accept the whole input. If you get to the end of input before ever seeing a prefix that ends with your substring, you don't accept.
We can create the DFA by adding states as necessary to represent differing levels of match against the target substring. All DFAs need at least one state: let's call it q0.
---->q0
The implied alphabet of your language is {0, 1}, so we need transitions for both of these symbols on the state q0. Let's think about how much of the substring we will have seen in state q0. We can get to q0 with the empty string; that is, before consuming any input at all. After seeing the empty string we have seen zero of the four symbols that make up our substring. So, q0 should correspond to the case "the input I've seen up until now ends with a string that matches 0/4 of the target substring".
Given this, what transitions should we add for 0 and 1? If we see a 0 in state q0, that doesn't help at all, since the substring we're looking for begins with 1; so, seeing a 0 in q0 doesn't change the fact that the input we've seen so far matches 0/4 symbols. This means we can have the transition from q0 on 0 return to q0.
     /-\
   0 | |
     V /
---->q0
What about if we see a 1 in q0? Well, if we see a 1, then the input we've seen so far ends with a string that matches 1011 in exactly 1/4 places (the first 1); so, we need another state to represent the fact we're a little closer to the goal. Let's call this state q1.
     /-\
   0 | |
     V /
---->q0---->q1
         1
We repeat the process now for state q1. If we see a 0 in q1, we get a little closer to our target of 1011, so we can go to a new state, q2. If we see a 1 in q1, we don't get any closer to our goal, but we also don't fall back.
      0      1
     /-\    /-\
     | |    | |
     V /    V /
---->q0---->q1---->q2
         1      0
If we see a 0 in q2, that means we've seen the substring 00; that doesn't appear in 1011 at all, which means we are totally back to square one and must return to q0. If we see a 1 though, we get a little closer to our goal and must move to a new state; let's call this q3:
      0      1
     /-\    /-\
     | |    | |
     V /    V /
---->q0---->q1---->q2---->q3
      ^  1      0   |  1
      |             |
      \-------------/
             0
If we see a 0 in q3 then our input has ended with the substring 10, which puts us back at q2; if we see a 1, then we have seen the whole target 1011 and need to go to a new state to remember this fact.
      0      1        0
     /-\    /-\    /------\
     | |    | |    |      |
     V /    V /    V      |
---->q0---->q1---->q2---->q3---->q4
      ^  1      0   |  1      1
      |             |
      \-------------/
             0
Finally, in state q4, no matter what we see, we know we must accept the input since we've already seen the substring 1011 somewhere in the input. This means we should make q4 accepting and have both transitions go back to q4:
      0      1        0
     /-\    /-\    /------\        /---\
     | |    | |    |      |        |   |
     V /    V /    V      |        V   | 0,1
---->q0---->q1---->q2---->q3---->[q4]--/
      ^  1      0   |  1      1
      |             |
      \-------------/
             0
You can check some samples to convince yourself that this DFA accepts the language you want. We built it one state at a time by asking ourselves where the transitions had to go. We stopped adding new states when new transitions didn't demand them anymore.
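If you would rather check samples mechanically, the five states can be written down as a transition table and simulated. The following is a minimal Python sketch of the DFA built above; the names TRANSITIONS, accepts and agrees_with_substring_check are made up for illustration.

```python
from itertools import product

# (state, symbol) -> next state, copied from the construction above.
TRANSITIONS = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q1",
    ("q2", "0"): "q0", ("q2", "1"): "q3",
    ("q3", "0"): "q2", ("q3", "1"): "q4",
    ("q4", "0"): "q4", ("q4", "1"): "q4",
}

def accepts(word):
    """Run the DFA on a string over {0, 1}; q4 is the only accepting state."""
    state = "q0"
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state == "q4"

def agrees_with_substring_check(max_len):
    """Compare the DFA with a direct substring test on every word up to max_len."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            word = "".join(bits)
            if accepts(word) != ("1011" in word):
                return False
    return True

print(accepts("001011"), accepts("1010"))
print(agrees_with_substring_check(10))
```

The exhaustive comparison is only a sanity check on short inputs, not a proof, but it catches a wrong transition quickly.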
We want to construct a DFA that accepts strings containing 1011 as a substring, over the alphabet
Σ = {0,1}
so the accepted strings include
{0111011, 001011, 11001011, ...}
Every accepted string must contain 1011 as a substring.
As shown in the transition diagram, in the initial state q0, reading 1 moves to the next state q1; otherwise the DFA stays in q0.
If q1 reads 0, it moves to the next state q2; otherwise it stays in q1.
If q2 reads 1, it moves to q3; otherwise it moves back to q0, because the substring we want starts with 1, not 0.
If q3 reads 1, it moves to q4, which is the final state; reaching the final state means the string is accepted because it contains 1011 as a substring. If q3 reads 0, it goes back to q2.
After reaching the final state, the string need not end with 1011; there may be more symbols to read. For example, in 001011110 the symbols 110 are still left after 1011 and must also be accepted, which is why q4 stays in the same state on both 0 and 1.
DFA for accepting strings with a substring 1011.
There are four states, A, B, C and D. In every DFA construction we have to check that each state has a transition for both symbols; otherwise it is not a DFA. The DFA constructed here accepts strings with an odd number of 0's and an odd number of 1's, as described below.
A is the initial state; on 0 it goes to C and
on 1 it goes to B.
B goes to D on 0 and to A on 1. C goes to D on 1 and to A on 0.
D is the final state; it goes to B on 0 and to C on 1.
This process is shown in the figure below. Let us check the DFA with the example 1011, which has an odd number of 1's and an odd number of 0's: A on 1 gives B, B on 0 gives D, D on 1 gives C, and C on 1 gives D; hence it is the required DFA.
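The four-state machine described in this answer can also be simulated. The sketch below is a minimal Python version that only mirrors the transitions as stated; the names PARITY_DFA, trace and accepts_odd_odd are made up for illustration. Be aware that this machine tracks the parity of 0's and 1's, so it also accepts strings such as 10 that do not contain 1011.

```python
from itertools import product

# (state, symbol) -> next state, as described for A, B, C and D.
PARITY_DFA = {
    ("A", "0"): "C", ("A", "1"): "B",
    ("B", "0"): "D", ("B", "1"): "A",
    ("C", "0"): "A", ("C", "1"): "D",
    ("D", "0"): "B", ("D", "1"): "C",
}

def trace(word):
    """Return the list of states visited, starting from the initial state A."""
    states = ["A"]
    for symbol in word:
        states.append(PARITY_DFA[(states[-1], symbol)])
    return states

def accepts_odd_odd(word):
    """D is the only final state."""
    return trace(word)[-1] == "D"

print(trace("1011"))
```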

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
So far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, and it also does not provide the hierarchy.
I can of course do a path query (MATCH path = MATCH...), but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
WITH rec
MATCH (rec)-[r:canGenerate]-(g:generator)
return r.GenCost as GenCost, g.name AS GenName
}
return rec.name, GenCost , GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
input1
input2
input 3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?
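To make the desired result concrete, here is a minimal sketch outside Cypher, in Python, of the grouping being asked for: one list per combination of generator choices, each with its total cost. The data is hypothetical; in particular it assumes that TestGen3 and TestGen4 are the two alternatives for the same pattern (called input3 here), which is one reading of the listings above. It is not a Cypher solution, only an illustration of the output shape.

```python
from itertools import product

# Hypothetical data: pattern name -> list of (generator, cost) alternatives.
OPTIONS = {
    "Awesome": [("MainGen", 2.0)],
    "input1": [("TestGen1", 3.0)],
    "input2": [("TestGen2", 2.5)],
    "input3": [("TestGen3", 1.25), ("TestGen4", 1.4)],  # assumed alternatives
}

def alternatives(options):
    """Return one (plan, total_cost) pair per combination of generator choices."""
    names = list(options)
    plans = []
    for combo in product(*(options[name] for name in names)):
        plan = [(name, gen, cost) for name, (gen, cost) in zip(names, combo)]
        total = sum(cost for _, _, cost in plan)
        plans.append((plan, total))
    return plans

plans = alternatives(OPTIONS)
cheapest_plan, cheapest_cost = min(plans, key=lambda item: item[1])
for plan, total in plans:
    print(total, plan)
```

With n patterns that each have several generators, the number of plans is the product of the numbers of alternatives, so this brute-force enumeration only makes sense for small trees.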

Regex: extracting a house number from an address

I have following patterns:
13 R 2
48 B / 5
42 B
42B
303 Box 15
303 Bte 15
303 B Bt 15
and only want to have the following results (because Box 15, Bte 15 are the box numbers, and I only want the house nbr + potentially the letter attached to the house number):
13 R 2
48 B / 5
42 B
42B
303
303
303 B
Is this possible using a regular expression? I tried the following: REGEXP_SUBSTR(my_string_variable, '^\d+(\s*\w$)?'). This however only works for the patterns 3-5, and not for the first 2 and last patterns. Dropping the $ from the regex would incorrectly 'strip' the first letter for patterns 5 and 6.
I am basically assuming that if the letter behind the numeric is more than 1 character, that it belongs to the box number. For example, BTE is the French abbreviation for Boite which means Box. I realise this might be invalid if a house number has 2 letters (e.g.: 11 AA), but I would not know a solution for this and I don't think it occurs much.
This will remove a space followed by an uppercase letter, followed by at least one lowercase letter, followed by one or more whitespace characters, followed by one or more digits at the end of the string:
RegExp_Replace(house_number, '\s[A-Z][a-z]+\s+\d+$')
See regex101.com
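To check the pattern against all seven sample values, here is a minimal Python sketch. It uses Python's re.sub as a stand-in for RegExp_Replace, and the helper name strip_box is made up for illustration, so behaviour in your database may differ in details such as case sensitivity.

```python
import re

# Whitespace, one uppercase letter, one or more lowercase letters,
# whitespace, then digits up to the end of the string.
BOX_SUFFIX = re.compile(r"\s[A-Z][a-z]+\s+\d+$")

def strip_box(house_number):
    """Remove a trailing box designation such as ' Box 15' or ' Bte 15'."""
    return BOX_SUFFIX.sub("", house_number)

samples = ["13 R 2", "48 B / 5", "42 B", "42B",
           "303 Box 15", "303 Bte 15", "303 B Bt 15"]
for value in samples:
    print(repr(value), "->", repr(strip_box(value)))
```

In this sketch the match is case-sensitive, so an all-uppercase abbreviation such as BTE would not be removed.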

Splitting a composite variable into two variables

I have a string variable called country with a value which can be, for example, Afghanistan2008, but it can also be Brasil2012. I would like to create two new variables, one being the country part and one the year part.
Because there are always numbers at the end of the string, I do know the position the string should be split at from the right side but not from the left side.
Could I use something like:
gen(substr("country",-4,.))
If not, could anyone tell me how to split an entire column of such variables into a country and a year variable? I would also like to keep the original variable.
You can use a regular expression:
clear
set obs 2
generate string = ""
replace string = "Afghanistan2008" in 1
replace string = "Brasil2012" in 2
generate country = regexs(0) if regexm(string, "[a-zA-Z]+")
generate year = regexs(1) + regexs(2) if regexm(string, "(19|20)([0-9][0-9])")
list
+--------------------------------------+
| string country year |
|--------------------------------------|
1. | Afghanistan2008 Afghanistan 2008 |
2. | Brasil2012 Brasil 2012 |
+--------------------------------------+
Type help regex in Stata's command prompt for more information.
Alternatively you could do the following:
generate len = length(string) - 3
generate country2 = substr(string, 1, len - 1)
generate year2 = substr(string, len, .)
list country2 year2
+---------------------+
| country2 year2 |
|---------------------|
1. | Afghanistan 2008 |
2. | Brasil 2012 |
+---------------------+
For my specific situation the following makes a new year variable:
gen spyear = real(substr(country,-4,.))
I took the other part from @PearlySpencer:
generate len = length(country) - 3
generate spcountry = substr(country, 1, len - 1)
which creates an excess column to be removed.
EDIT (Nick Cox) This can be simplified to
gen spyear = real(substr(country, -4, 4))
gen spcountry = substr(country, 1, length(country) - 4)
showing that
There is no need to create a variable containing the string length.
The puzzling split 4 = 3 + 1 is not needed either.
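For comparison, the same "count four characters from the right" idea looks like this in Python. It is only a sketch to illustrate the logic of the simplified Stata lines, the function name split_country_year is made up, and it assumes every value ends in a four-digit year.

```python
def split_country_year(value):
    """Split e.g. 'Afghanistan2008' into ('Afghanistan', 2008)."""
    # Everything except the last four characters is the country;
    # the last four characters are the year, converted to a number.
    return value[:-4], int(value[-4:])

for original in ["Afghanistan2008", "Brasil2012"]:
    country, year = split_country_year(original)
    print(original, country, year)
```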