Extracting year from string using regexp_extract in PySpark SQL

This is a portion of my data:
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
and this is what my code returns:
1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
<-- the result for the last title is empty, which is expected
As you can see above, the two marked titles return the wrong year.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.

You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+

Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ]) asserts that the preceding character is ( or a space
\d{4} matches a 4-digit year
(?=\S*\)|$) asserts that what follows is either a ), possibly preceded by non-whitespace characters, or the end of the string
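As a quick sanity check outside Spark, the same pattern can be tried with Python's re module against a few of the sample titles (an abridged list; the pattern is the one above plus the end-of-string fallback from the code):

```python
import re

# Titles taken from the question (abridged list).
titles = [
    "Grumpier Old Men (1995)",
    "Death Note: Desu nôto (2006–2007)",
    "Irwin & Fran 2013",
    "The Naked Truth (1957) (Your Past Is Showing)",
    "1013 Briar Lane",
]

# The bracket pattern above, combined with the end-of-string fallback
# exactly as in the Spark snippet.
pattern = r"(?<=[( ])\d{4}(?=\S*\)|$)|\d{4}$"

def extract_year(title):
    m = re.search(pattern, title)
    return m.group() if m else ""

years = [extract_year(t) for t in titles]
print(years)
```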

Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). The first line contains (1995), which is fine, but the other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\() means an open bracket before. (?=–|(–)|\)) means a closing bracket after, or – (or –, which is that same character mis-encoded). Once you have covered dates between brackets, you can cover dates at the end of the string without brackets: \d{4}$.
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract

bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
    .withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r when you pass a regex to a spark function.
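As an added sanity check (not part of the original answer), the same pattern can be verified with Python's re module against the two titles the question marks as wrong:

```python
import re

# Same pattern as above: a bracketed year (allowing the en dash and
# its mis-encoded form) or a bare year at the end of the string.
pattern = r"(?<=\()\d{4}(?=–|(–)|\))|\d{4}$"

def extract_year(title):
    m = re.search(pattern, title)
    return m.group() if m else ""

# The two problematic titles, plus two easier cases.
samples = [
    "Conquest 1453 (Fetih 1453) (2012)",
    "Commune, La (Paris, 1871) (2000)",
    "Death Note: Desu nôto (2006–2007)",
    "1013 Briar Lane",
]
years = [extract_year(t) for t in samples]
print(years)
```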

Here is a regexp that would work:
import pyspark.sql.functions as F

df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounded by brackets in your examples, and sometimes there are non-numeric characters inside the brackets.

Related

Use dictionary as part of replace_regexp in Pyspark

I am trying to use a dictionary like this:
mydictionary = {'AL':'Alabama', '(AL)': 'Alabama', 'WI':'Wisconsin','GA': 'Georgia','(GA)': 'Georgia'}
To go through a spark dataframe:
data = [
    {"ID": 1, "TheString": "On WI ! On WI !"},
    {"ID": 2, "TheString": "The state of AL is next to GA"},
    {"ID": 3, "TheString": "The state of (AL) is also next to (GA)"},
    {"ID": 4, "TheString": "Alabama is in the South"},
    {"ID": 5, "TheString": "Wisconsin is up north way"},
]
sdf = spark.createDataFrame(data)
display(sdf)
And replace the substring found in the value part of the dictionary with matching substrings to the key.
So, something like this:
for k, v in mydictionary.items():
    replacement_expr = regexp_replace(col("TheString"), '(\s+)'+k, v)
    print(replacement_expr)
sdf.withColumn("TheString_New", replacement_expr).show(truncate=False)
(this of course does not work; the regular expression being compiled is wrong)
A few things to note:
The abbreviation has either a space before and after, or left and right parentheses.
I think the big problem here is that I can't get the re to "compile" correctly across the dictionary elements. (And then also throw in the "space or parentheses" restriction noted.)
I realize I could get rid of the (GA) with parentheses keys (and just use GA with spaces or parentheses as boundaries), but it seemed simpler to have those cases in the dictionary.
Expected result:
On Wisconsin ! On Wisconsin !
The state of Alabama is next to Georgia
The state of (Alabama) is next to (Georgia)
Alabama is in the South
Wisconsin is way up north
Your help is much appreciated.
Some close solutions I've looked at:
Replace string based on dictionary pyspark
Use \b in the regex to specify a word boundary. Also, you can use functools.reduce to generate the replace expression from the dict items like this:
from functools import reduce
from pyspark.sql import functions as F

replace_expr = reduce(
    lambda a, b: F.regexp_replace(a, rf"\b{b[0]}\b", b[1]),
    mydictionary.items(),
    F.col("TheString")
)
sdf.withColumn("TheString", replace_expr).show(truncate=False)
# +---+------------------------------------------------+
# |ID |TheString |
# +---+------------------------------------------------+
# |1 |On Wisconsin ! On Wisconsin ! |
# |2 |The state of Alabama is next to Georgia |
# |3 |The state of (Alabama) is also next to (Georgia)|
# |4 |Alabama is in the South |
# |5 |Wisconsin is up north way |
# +---+------------------------------------------------+
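The same reduce trick can be sketched in plain Python with re.sub to see how the nesting works, one substitution per dictionary entry (a standalone sketch, not Spark code; it uses only the bare abbreviations, since \b already handles the parenthesized forms):

```python
import re
from functools import reduce

# Bare abbreviations only; \b matches between a parenthesis and a
# letter, so the '(AL)'-style keys from the question are not needed.
mydictionary = {'AL': 'Alabama', 'WI': 'Wisconsin', 'GA': 'Georgia'}

def replace_all(s):
    # Each reduce step wraps the accumulated value in one more
    # substitution, mirroring how the nested regexp_replace column
    # expression is built above.
    return reduce(lambda acc, kv: re.sub(rf"\b{kv[0]}\b", kv[1], acc),
                  mydictionary.items(), s)

print(replace_all("The state of (AL) is also next to (GA)"))
```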

How can I find why re finditer is missing one item in the search

I have been trying to use re.finditer to return four items, however it only returns three. How should I modify the search pattern?
1). string:
'''contet_o= "ALIBRA Updated Wednesday (04 May 2022\nA S *Contactus
forrates/chartson scrubber & eco tonnage.\n\nPERIOD\n\nDRY TIME CHARTER
ESTIMATES ($/pdpr)\n\n4/6 MOS\n\nSIZE\nHANDY
(sexaw1)\nSMAX/ULTRA.\nPANA/KMAX\n\nCAPESIZE\n\nATL PAC ATL PAC ATL
PAC\n\n28,500 | 32,000 # 27,500/ 28,500) 23,000 24,000\n28,500 |32,000 |=
30,250 |= 22,000) 24,500) 19,250\n29,000 | 29,000 26,000/'¥ 26,000) 24,000|—=
24,000\n\n29,500 |31,000 | 28,000 | 28,000/= 24,000/= 24,000\n\n'''
2). search by re:
saved_list_1 = re.finditer(r"\n((.*)(\d+\,\d+)){6}\n", contet_o)
list_ddddd = []
for item in saved_list_1:
    list_ddddd.append(item.group())
list_ddddd
3). It returns (missing the second one):
['\n28,500 | 32,000 # 27,500/ 28,500) 23,000 24,000\n',
"\n29,000 | 29,000 26,000/'¥ 26,000) 24,000|—= 24,000\n",
'\n29,500 |31,000 | 28,000 | 28,000/= 24,000/= 24,000\n']
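A plausible cause, worth verifying against the real data: re.finditer returns non-overlapping matches, and this pattern consumes the trailing \n, which is also the leading \n of the next row. A minimal sketch with made-up numbers and a smaller repetition count:

```python
import re

# Three rows of made-up figures; each row shares its boundary \n with
# the next row.
text = "\n1,111 2,222\n3,333 4,444\n5,555 6,666\n"

# Pattern in the question's style: both the leading and the trailing
# \n are consumed, so after the first match the second row has no
# leading \n left and gets skipped.
greedy = [m.group() for m in re.finditer(r"\n(.*\d+,\d+){2}\n", text)]

# A lookahead leaves the trailing \n unconsumed, so every row matches.
fixed = [m.group() for m in re.finditer(r"\n(.*\d+,\d+){2}(?=\n)", text)]

print(len(greedy), len(fixed))
```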

How to use pandas read_csv to read csv file having backward slash and double quotation

I have a CSV file like this (comma separated)
ID, Name,Context, Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 2"
I want to create DataFrame like this:
ID | Name |Context |Location
123| John |{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 1
234| Mike |{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 2
Could you help and show me how to use pandas read_csv doing it?
An answer - if you are willing to accept that the \ char gets stripped:
pd.read_csv(your_filepath, escapechar='\\')
ID Name Context Location
0 123 John {"Organization":{"Id":12345,"IsDefault":false}... Road 1
1 234 Mike {"Organization":{"Id":23456,"IsDefault":false}... Road 2
An answer if you actually want the backslashes in - using a custom converter:
def backslash_it(x):
    return x.replace('"', '\\"')

pd.read_csv(your_filepath, escapechar='\\', converters={'Context': backslash_it})
ID Name Context Location
0 123 John {\"Organization\":{\"Id\":12345,\"IsDefault\":... Road 1
1 234 Mike {\"Organization\":{\"Id\":23456,\"IsDefault\":... Road 2
escapechar on read_csv is used to actually read the CSV; the custom converter then puts the backslashes back in.
Note that I tweaked the header row to make the column name match easier:
ID,Name,Context,Location
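Putting the converter version together as a self-contained sketch (the inline string stands in for the real CSV file, with one shortened data row):

```python
import io
import pandas as pd

# Inline stand-in for the CSV file (one data row shown, shortened).
csv_text = (
    'ID,Name,Context,Location\n'
    '123,"John","{\\"Organization\\":{\\"Id\\":12345,\\"IsDefault\\":false}}","Road 1"\n'
)

def backslash_it(x):
    # Re-escape the quotes that escapechar stripped during parsing.
    return x.replace('"', '\\"')

df = pd.read_csv(io.StringIO(csv_text), escapechar='\\',
                 converters={'Context': backslash_it})
print(df.loc[0, 'Context'])
```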

Matching a string which includes -,.$\/ with a regex

I am trying to match a string which includes -,.$/ (and might include other special characters which I don't know yet) with a regex. I have to match the first 28 characters in the string.
The string is:
Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00 NOT ADMINISTERED XX XX, XXXSFHIGSKF/XXXX PURPOSES ONLY
The regex I am using is ((([\w-,.$\/]+)\s){28}).*
Is there a better way to match special characters?
Also I get an error if the string length is less than 28. What can I do to include the range so that the regex works even if the string is shorter than 28?
The code looks something like this:
SELECT regexp_extract(Txn_Desc, '((([\w-,.$;!#\/%)^#<>&*(]+)\s){1,28}).*', 1) AS Transaction_Short_Desc, Txn_Desc
FROM Table x
It seems you are looking for 28 tokens.
Try
(\S+\s+){0,28}
or
([^ ]+ +){0,28}
This is the result for 8 tokens:
Received - Data Migration 1. Units, of UNITED
|        | |    |         |  |      |  |
1        2 3    4         5  6      7  8
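The same bounded quantifier behaves identically in Python's re, which makes it easy to experiment (a sketch using 8 tokens to mirror the illustration above; {0,28} behaves the same, just with a larger cap):

```python
import re

text = "Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION"

# {0,8} is an upper bound, so strings with fewer tokens still match
# instead of failing like a fixed {8} repetition would.
m = re.match(r"(\S+\s+){0,8}", text)
tokens = m.group().split()
print(tokens)
```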

How does the Soundex function work in SQL Server?

Here's an example of Soundex code in SQL:
SELECT SOUNDEX('Smith'), SOUNDEX('Smythe');
-----  -----
S530   S530
How does 'Smith' become S530?
In this example, the first digit is S because that's the first character in the input expression, but how are the remaining three digits calculated?
Take a look at this article:
The first letter of the code corresponds to the first letter of the
name. The remainder of the code consists of three digits derived from
the syllables of the word according to the following code:
1 = B, F, P, V
2 = C, G, J, K, Q, S, X, Z
3 = D, T
4 = L
5 = M,N
6 = R
Adjacent letters with the same Soundex code are treated as one; A, E, I, O, U, H, W, Y, and some prefixes are disregarded...
So for Smith and Smythe the code is created like this:
S  S  ->  S
m  m  ->  5
i  y  ->  0
t  t  ->  3
h  h  ->  0
   e  ->  -
What is Soundex?
Soundex is:
a phonetic algorithm for indexing names by sound, as pronounced in English; first developed by Robert C. Russell and Margaret King Odell in 1918
How does it Work?
There are several implementations of Soundex, but most implement the following steps:
Retain the first letter of the name and drop all other occurrences of vowels and h,w:
|a, e, i, o, u, y, h, w | → "" |
Replace consonants with numbers as follows (after the first letter):
| b, f, p, v | → 1 |
| c, g, j, k, q, s, x, z | → 2 |
| d, t | → 3 |
| l | → 4 |
| m, n | → 5 |
| r | → 6 |
Replace identical adjacent numbers with a single value (if they were next to each other prior to step 1):
| M33 | → M3 |
Pad with zeros or cut to produce a 4-digit result:
| M3 | → M300 |
| M34123 | → M341 |
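The steps above can be sketched as a small Python function. This is one common reading of the algorithm; SQL Server's SOUNDEX may differ in edge cases such as prefix handling:

```python
def soundex(name):
    # Step 2's letter-to-digit table.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    name = name.lower()
    result = name[0].upper()            # retain the first letter
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:     # drop identical adjacent numbers
            result += digit
        if ch not in "hw":              # h/w don't separate duplicates
            prev = digit
    return (result + "000")[:4]         # pad with zeros or cut to 4

print(soundex("Smith"), soundex("Smythe"))
```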
In SQL Server, SOUNDEX is often used in conjunction with DIFFERENCE, which scores how many of the resulting digits are identical (just like the game Mastermind), with higher numbers matching more closely.
What are the Alternatives?
It's important to understand the limitations and criticisms of Soundex, and where people have tried to improve it: notably, it is rooted only in English pronunciation, and it discards a lot of data, resulting in more false positives.
Both Metaphone and Double Metaphone still focus on English pronunciations, but add much more granularity to the nuances of speech in English (e.g. PH → F).
Phil Factor wrote a Metaphone Function in SQL with the source on github
Soundex is most commonly used to identify similar names, and it will have a really hard time finding similar nicknames (e.g. Robert → Rob or Bob). Per this question on a database of common name aliases / nicknames of people, you could also incorporate a lookup against common nicknames in your matching process.
Here are a couple free lists of common nicknames:
SOEMPI - name_to_nick.csv | Github
carltonnorthern - names.csv | Github
Further Reading:
Fuzzy matching using T-SQL
SQL Server – Do You Know Soundex Functions?