How can I find why re finditer is missing one item in the search - python-re

I have been trying to use re finditer to return four items, however it only return three. How should I modify the search in my re?
1). string:
'''contet_o= "ALIBRA Updated Wednesday (04 May 2022\nA S *Contactus
forrates/chartson scrubber & eco tonnage.\n\nPERIOD\n\nDRY TIME CHARTER
ESTIMATES ($/pdpr)\n\n4/6 MOS\n\nSIZE\nHANDY
(sexaw1)\nSMAX/ULTRA.\nPANA/KMAX\n\nCAPESIZE\n\nATL PAC ATL PAC ATL
PAC\n\n28,500 | 32,000 # 27,500/ 28,500) 23,000 24,000\n28,500 |32,000 |=
30,250 |= 22,000) 24,500) 19,250\n29,000 | 29,000 26,000/'¥ 26,000) 24,000|—=
24,000\n\n29,500 |31,000 | 28,000 | 28,000/= 24,000/= 24,000\n\n'''
2). search by re:
''' saved_list_1 = re.finditer(r"\n((.*)(\d+\,\d+)){6}\n",contet_o)
list_ddddd = []
for item in saved_list_1:
list_ddddd.append(item.group())
list_ddddd'''
3). It return: (missing the second one)
['\n28,500 | 32,000 # 27,500/ 28,500) 23,000 24,000\n',
"\n29,000 | 29,000 26,000/'¥ 26,000) 24,000|—= 24,000\n",
'\n29,500 |31,000 | 28,000 | 28,000/= 24,000/= 24,000\n']

Related

extracting year from string using regexp_extract pyspark

This is the portion of my result :
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
return:
1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
<--this part for last title is empty and supposed to be empty too
As you can see from the above,last 2 title is given wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ]) assert that what precedes is ( or a space
\d{4} match a 4 digit year
(?=\S*\)|$) assert that ), possibly prefaced by non whitespace, follows
OR the end of the string follows
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\() means an open bracket before. (?=–|(–)|\)) means a closing bracket after, or – or – which is the original character that was misencoded. Once you have covered the date in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$.
import pyspark.sql.functions as F
bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
.withColumn('uu', regexp_extract("title", bracket_regex + "|(\d{4}$)", 0))\
.show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r when you pass a regex to a spark function.
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,.

DFA that contains 1011 as a substring

I have to draw a DFA that accepts set of all strings containing 1011 as a substring in it. I tried but could not come up with one. Can anyone help me please?
Thanks
The idea for a DFA that does this is simple: keep track of how much of that substring we have seen on the end of the input we've seen so far. If you eventually get to a point where the input you've seen so far ends with that substring, then you accept the whole input. If you get to the end of input before ever seeing a prefix that ends with your substring, you don't accept.
We can create the DFA by adding states as necessary to represent differing levels of match against the target substring. All DFAs need at least one state: let's call it q0.
---->q0
The implied alphabet of your language is {0, 1}, so we need transitions for both of these symbols on the state q0. Let's think about how much of the substring we will have seen in state q0. We can get to q0 with the empty string; that is, before consuming any input at all. After seeing the empty string we have seen zero of the four symbols that make up our substring. So, q0 should correspond to the case "the input I've seen up until now ends with a string that matches 0/4 of the target substring".
Given this, what transitions should we add for 0 and 1? If we see a 0 in state q0, that doesn't help at all, since the substring we're looking for begins with 1; so, seeing a 0 in q0 doesn't change the fact that the input we've seen so far matches 0/4 symbols. This means we can have the transition from q0 on 0 return to q0.
/-\
0 | |
V /
---->q0
What about if we see a 1 in q0? Well, if we see a 1, then the input we've seen so far ends with a string that matches 1011 in exactly 1/4 places (the first 1); so, we need another state to represent the fact we're a little closer to the goal. Let's call this state q1.
/-\
0 | |
V /
---->q0---->q1
1
We repeat the process now for state q1. If we see a 0 in q1, we get a little closer to our target of 1011, so we can go to a new state, q2. If we see a 1 in q1, we don't get any closer to our goal, but we also don't fall back.
0 1
/-\ /-\
| | | |
V / V /
---->q0---->q1---->q2
1 0
If we see a 0 in q2, that means we've seen the substring 00; that doesn't appear in 1011 at all, which means we are totally back to square one and must return to q0. If we see a 1 though, we get a little closer to our goal and must move to a new state; let's call this q3:
0 1
/-\ /-\
| | | |
V / V /
---->q0---->q1---->q2---->q3
^ 1 0 | 1
| |
\-------------/
0
If we see a 0 in q3 then our input has ended with the substring 10, which puts us back at q2; if we see a 1, then we have seen the whole target 1011 and need to go to a new state to remember this fact.
0 1 0
/-\ /-\ /------\
| | | | | |
V / V / V |
---->q0---->q1---->q2---->q3---->q4
^ 1 0 | 1 1
| |
\-------------/
0
Finally, in state q4, no matter what we see, we know we must accept the input since we've already seen the substring 1011 somewhere in the input. This means we should make q4 accepting and have both transitions go back to q4:
0 1 0
/-\ /-\ /------\ /---\
| | | | | | | |
V / V / V | V | 0,1
---->q0---->q1---->q2---->q3---->[q4]--/
^ 1 0 | 1 1
| |
\-------------/
0
You can check some samples to convince yourself that this DFA accepts the language you want. We built it one state at a time by asking ourselves where the transitions had to go. We stopped adding new states when new transitions didn't demand them anymore.
We want to construct a DFA for a string which contains 1011 as a substring which means it language contain
L={0,1}
which means the strings may be
{0111011,001011,11001011,........}
A string must contains 1011 has a substring.
As we observed in the transition diagram at initial state if q0 accepts 1 then move to next state otherwise remains in the same state.
If q1 accepts 0 then move to next state q2 otherwise remains in the same state.
If q2 accepts 1 then move to q3 else move to q0 because we want to substring which starts with 1 not with 0.
If q3 accepts 1 then move to q4 else which is a final state if system reaches to a final state it means a string is accepted because it contains a 1011 as a substring , if q3 accepts then back to q2.
After reaching the final state a string may not end with 1011 but it have some more words or string to be taken like in 001011110 110 is left which have to accept that's why at q4 if it accepts 0 or 1 it remains in the same state.
DFA for accepting strings with a substring 1011.
They are four transitions A,B,C,D in every construction of DFA we have to check each transaction must have both transactions otherwise it is not a DFA so that it is given to construct a DFA that accept string of odd 0's and 1's that was as shown below
A is the initial state on transition of 0 it will goes to C and
On transition 1 it will give to B
B is another state gives transition of D on 0 and A on 1 and C is a state that will give transition of D and A on transition of 1 and 0
D is final state will give transition of B and C on 0 and 1
This is the process is been done on the below figure let us check the DFA with example 1011 it has odd no of 1'sand odd no of 0 so A on 1 it will give B and B on 0 it will give D and D on 1 it will give C and C on 1 it will give D hence it is the required DFA.

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, but it also does not provide the hierarchy
I can of course do a path query (MATCH path = MATCH...) but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
WITH rec
MATCH (rec)-[r:canGenerate]-(g:generator)
return r.GenCost as GenCost, g.name AS GenName
}
return rec.name, GenCost , GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
input1
input2
input 3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?

Matching a string which includes -,.$\/ with a regex

I am trying to match a string which includes -,.$/ ( and might include other special characters which I don't know yet( with a regex . I have to match first 28 characters in the string
The String is -->
Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00 NOT ADMINISTERED XX XX, XXXSFHIGSKF/XXXX PURPOSES ONLY
The regex I am using is ((([\w-,.$\/]+)\s){28}).*
Is there a better way to match special characters ?
Also I get an error if the string length is less than 28. What can I do to include the range so that the regex works even if the string is less than 28 characters
the code looks something like this
Select regexp_extract(Txn_Desc,'((([\w-,.$;!#\/%)^#<>&*(]+)\s){1,28}).*',1) as Transaction_Short_Desc,Txn_Desc
from Table x
It seems you are looking for 28 tokens.
Try
(\S+\s+){0,28}
or
([^ ]+ +){0,28}
This is the result for 8 tokens:
Received - Data Migration 1. Units, of UNITED
| | | | | | | |
1 2 3 4 5 6 7 8

Explain the Differential Evolution method

Can someone please explain the Differential Evolution method? The Wikipedia definition is extremely technical.
A dumbed-down explanation followed by a simple example would be appreciated :)
Here's a simplified description. DE is an optimisation technique which iteratively modifies a population of candidate solutions to make it converge to an optimum of your function.
You first initialise your candidate solutions randomly. Then at each iteration and for each candidate solution x you do the following:
you produce a trial vector: v = a + ( b - c ) / 2, where a, b, c are three distinct candidate solutions picked randomly among your population.
you randomly swap vector components between x and v to produce v'. At least one component from v must be swapped.
you replace x in your population with v' only if it is a better candidate (i.e. it better optimise your function).
(Note that the above algorithm is very simplified; don't code from it, find proper spec. elsewhere instead)
Unfortunately the Wikipedia article lacks illustrations. It is easier to understand with a graphical representation, you'll find some in these slides: http://www-personal.une.edu.au/~jvanderw/DE_1.pdf .
It is similar to genetic algorithm (GA) except that the candidate solutions are not considered as binary strings (chromosome) but (usually) as real vectors. One key aspect of DE is that the mutation step size (see step 1 for the mutation) is dynamic, that is, it adapts to the configuration of your population and will tend to zero when it converges. This makes DE less vulnerable to genetic drift than GA.
Answering my own question...
Overview
The principal difference between Genetic Algorithms and Differential Evolution (DE) is that Genetic Algorithms rely on crossover while evolutionary strategies use mutation as the primary search mechanism.
DE generates new candidates by adding a weighted difference between two population members to a third member (more on this below).
If the resulting candidate is superior to the candidate with which it was compared, it replaces it; otherwise, the original candidate remains unchanged.
Definitions
The population is made up of NP candidates.
Xi = A parent candidate at index i (indexes range from 0 to NP-1) from the current generation. Also known as the target vector.
Each candidate contains D parameters.
Xi(j) = The jth parameter in candidate Xi.
Xa, Xb, Xc = three random parent candidates.
Difference vector = (Xb - Xa)
F = A weight that determines the rate of the population's evolution.
Ideal values: [0.5, 1.0]
CR = The probability of crossover taking place.
Range: [0, 1]
Xc` = A mutant vector obtained through the differential mutation operation. Also known as the donor vector.
Xt = The child of Xi and Xc`. Also known as the trial vector.
Algorithm
For each candidate in the population
for (int i = 0; i<NP; ++i)
Choose three distinct parents at random (they must differ from each other and i)
do
{
a = random.nextInt(NP);
} while (a == i)
do
{
b = random.nextInt(NP);
} while (b == i || b == a);
do
{
c = random.nextInt(NP);
} while (c == i || c == b || c == a);
(Mutation step) Add a weighted difference vector between two population members to a third member
Xc` = Xc + F * (Xb - Xa)
(Crossover step) For every variable in Xi, apply uniform crossover with probability CR to inherit from Xc`; otherwise, inherit from Xi. At least one variable must be inherited from Xc`
int R = random.nextInt(D);
for (int j=0; j < D; ++j)
{
double probability = random.nextDouble();
if (probability < CR || j == R)
Xt[j] = Xc`[j]
else
Xt[j] = Xi[j]
}
(Selection step) If Xt is superior to Xi then Xt replaces Xi in the next generation. Otherwise, Xi is kept unmodified.
Resources
See this for an overview of the terminology
See Optimization Using Differential Evolution by Vasan Arunachalam for an explanation of the Differential Evolution algorithm
See Evolution: A Survey of the State-of-the-Art by Swagatam Das and Ponnuthurai Nagaratnam Suganthan for different variants of the Differential Evolution algorithm
See Differential Evolution Optimization from Scratch with Python for a detailed description of an implementation of a DE algorithm in python.
The working of DE algorithm is very simple.
Consider you need to optimize(minimize,for eg) ∑Xi^2 (sphere model) within a given range, say [-100,100]. We know that the minimum value is 0. Let's see how DE works.
DE is a population-based algorithm. And for each individual in the population, a fixed number of chromosomes will be there (imagine it as a set of human beings and chromosomes or genes in each of them).
Let me explain DE w.r.t above function
We need to fix the population size and the number of chromosomes or genes(named as parameters). For instance, let's consider a population of size 4 and each of the individual has 3 chromosomes(or genes or parameters). Let's call the individuals R1,R2,R3,R4.
Step 1 : Initialize the population
We need to randomly initialise the population within the range [-100,100]
G1 G2 G3 objective fn value
R1 -> |-90 | 2 | 1 | =>8105
R2 -> | 7 | 9 | -50 | =>2630
R3 -> | 4 | 2 | -9.2| =>104.64
R4 -> | 8.5 | 7 | 9 | =>202.25
objective function value is calculated using the given objective function.In this case, it's ∑Xi^2. So for R1, obj fn value will be -90^2+2^2+2^2 = 8105. Similarly it is found for all.
Step 2 : Mutation
Fix a target vector,say for eg R1 and then randomly select three other vectors(individuals)say for eg.R2,R3,R4 and performs mutation. Mutation is done as follows,
MutantVector = R2 + F(R3-R4)
(vectors can be chosen randomly, need not be in any order).F (scaling factor/mutation constant) within range [0,1] is one among the few control parameters DE is having.In simple words , it describes how different the mutated vector becomes. Let's keep F =0.5.
| 7 | 9 | -50 |
+
0.5 *
| 4 | 2 | -9.2|
+
| 8.5 | 7 | 9 |
Now performing Mutation will give the following Mutant Vector
MV = | 13.25 | 13.5 | -50.1 | =>2867.82
Step 3 : Crossover
Now that we have a target vector(R1) and a mutant vector MV formed from R2,R3 & R4 ,we need to do a crossover. Consider R1 and MV as two parents and we need a child from these two parents. Crossover is done to determine how much information is to be taken from both the parents. It is controlled by Crossover rate(CR). Every gene/chromosome of the child is determined as follows,
a random number between 0 & 1 is generated, if it is greater than CR , then inherit a gene from target(R1) else from mutant(MV).
Let's set CR = 0.9. Since we have 3 chromosomes for individuals, we need to generate 3 random numbers between 0 and 1. Say for eg, those numbers are 0.21,0.97,0.8 respectively. First and last are lesser than CR value, so those positions in the child's vector will be filled by values from MV and second position will be filled by gene taken from target(R1).
Target-> |-90 | 2 | 1 | Mutant-> | 13.25 | 13.5 | -50.1 |
random num - 0.21, => `Child -> |13.25| -- | -- |`
random num - 0.97, => `Child -> |13.25| 2 | -- |`
random num - 0.80, => `Child -> |13.25| 2 | -50.1 |`
Trial vector/child vector -> | 13.25 | 2 | -50.1 | =>2689.57
Step 4 : Selection
Now we have child and target. Compare the obj fn of both, see which is smaller(minimization problem). Select that individual out of the two for next generation
R1 -> |-90 | 2 | 1 | =>8105
Trial vector/child vector -> | 13.25 | 2 | -50.1 | =>2689.57
Clearly, the child is better so replace target(R1) with the child. So the new population will become
G1 G2 G3 objective fn value
R1 -> | 13.25 | 2 | -50.1 | =>2689.57
R2 -> | 7 | 9 | -50 | =>2500
R3 -> | 4 | 2 | -9.2 | =>104.64
R4 -> | -8.5 | 7 | 9 | =>202.25
This procedure will be continued either till the number of generations desired has reached or till we get our desired value. Hope this will give you some help.