I am trying to use a dictionary like this:
mydictionary = {'AL':'Alabama', '(AL)': 'Alabama', 'WI':'Wisconsin','GA': 'Georgia','(GA)': 'Georgia'}
To go through a spark dataframe:
data = [{"ID": 1, "TheString": "On WI ! On WI !"},
{"ID": 2, "TheString": "The state of AL is next to GA"},
{"ID": 3, "TheString": "The state of (AL) is also next to (GA)"},
{"ID": 4, "TheString": "Alabama is in the South"},
{"ID": 5, "TheString": 'Wisconsin is up north way'}
]
sdf = spark.createDataFrame(data)
display(sdf)
And replace any substring that matches a dictionary key with the corresponding value.
So, something like this:
for k, v in mydictionary.items():
    replacement_expr = regexp_replace(col("TheString"), '(\s+)'+k, v)
    print(replacement_expr)
sdf.withColumn("TheString_New", replacement_expr).show(truncate=False)
(this of course does not work; the regular expression being compiled is wrong)
A few things to note:
The abbreviation has either a space before and after, or left and right parentheses.
I think the big problem here is that I can't get the regex to "compile" correctly across the dictionary elements (and then also enforce the "space or parentheses" restriction noted above).
I realize I could drop the parenthesized keys like '(GA)' (and just use GA with spaces or parentheses as boundaries), but it seemed simpler to keep those cases in the dictionary.
Expected result:
On Wisconsin ! On Wisconsin !
The state of Alabama is next to Georgia
The state of (Alabama) is also next to (Georgia)
Alabama is in the South
Wisconsin is up north way
Your help is much appreciated.
Some close solutions I've looked at:
Replace string based on dictionary pyspark
Use \b in the regex to specify a word boundary. Also, you can use functools.reduce to generate the replace expression from the dict items like this:
from functools import reduce
from pyspark.sql import functions as F

replace_expr = reduce(
    lambda a, b: F.regexp_replace(a, rf"\b{b[0]}\b", b[1]),
    mydictionary.items(),
    F.col("TheString")
)

sdf.withColumn("TheString", replace_expr).show(truncate=False)
# +---+------------------------------------------------+
# |ID |TheString |
# +---+------------------------------------------------+
# |1 |On Wisconsin ! On Wisconsin ! |
# |2 |The state of Alabama is next to Georgia |
# |3 |The state of (Alabama) is also next to (Georgia)|
# |4 |Alabama is in the South |
# |5 |Wisconsin is up north way |
# +---+------------------------------------------------+
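As a side note (not part of the answer above), if you prefer to encode the "space or parentheses" boundary from the question explicitly instead of relying on \b, a sketch of the same reduce with negative lookarounds could look like the following. With these boundaries the parenthesized keys such as '(AL)' become redundant, since the bare keys already match inside parentheses.
import re
from functools import reduce
from pyspark.sql import functions as F

# Sketch: require each abbreviation to be bounded by whitespace, a
# parenthesis, or the start/end of the string (negative lookarounds).
replace_expr = reduce(
    lambda acc, kv: F.regexp_replace(
        acc, rf"(?<![^\s(]){re.escape(kv[0])}(?![^\s)])", kv[1]
    ),
    mydictionary.items(),
    F.col("TheString")
)

sdf.withColumn("TheString_New", replace_expr).show(truncate=False)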
This is a portion of my data (the title column):
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
and this is what my code returns:
1995
2006
2013
2009
1952
2017
1957
1453 <--
1871 <--
(the entry for the last title is empty, and it is supposed to be empty)
As you can see above, the last two years, marked with <--, are wrong.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
You can try using the following regex: r'(?<=\()(\d+)(?=\))', which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ])   assert that what precedes is ( or a space
\d{4}       match a 4-digit year
(?=\S*\)|$) assert that ), possibly prefaced by non-whitespace, follows, OR that the end of the string follows
Your regex can only work for the first line. \(\d{4}\) tries to match a (, 4 digits and a ). For the first line you have (1995) which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within parentheses. (?<=\() means an opening parenthesis before. (?=–|(–)|\)) means that a closing parenthesis or an en dash follows (the duplicated alternative covers the mis-encoded form of the dash as it appears in the raw data). Once you have covered dates between parentheses, you can cover dates at the end of the string without parentheses: \d{4}$.
import pyspark.sql.functions as F

bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"

movies_DF\
    .withColumn('yearOfRelease', F.regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))\
    .show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r when you pass a regex to a spark function.
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounded by parentheses in your examples, and sometimes there are non-numeric characters inside the parentheses.
I have dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
{'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31] |Mom called dad, and when he came home, he took moms car and drove to the store |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I needed to extract the words from the text column using the begin_end column array, like text[111:120+1]. In pandas, this could be done via zip:
df['new_col'] = [s[a:b+1] for s, (a,b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite this zip logic in PySpark to get new_col? Do I need to write a UDF for this?
You can do so by using substring in an expression. It expects the string you want to substring, a starting position and the length of the substring. An expression is needed as the substring function from pyspark.sql.functions doesn't take a column as starting position or length.
from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end| text| new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...| jumps bad|
| [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
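As a small aside, the Column.substr method accepts Column arguments for both the start position and the length, so (assuming the same s DataFrame) the logic can also be written without a SQL expression string; a minimal sketch:
from pyspark.sql import functions as F

# Column.substr takes (startPos, length); both may be Columns.
# substr is 1-based, hence the +1 on the start position.
s.withColumn(
    'new_col',
    F.col('text').substr(
        F.col('begin_end')[0] + 1,
        F.col('begin_end')[1] - F.col('begin_end')[0] + 1
    )
).show()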
Here is the scenario. Assuming I have the following table:
identifier      line
51169081604     2
00034886044     22
51168939455     52
The challenge is, for every single value of the line column, to select the next biggest value of line, which I have accomplished with the following SQL:
SELECT i1.line, i1.identifier,
       MAX(i1.line) OVER (
           ORDER BY i1.line ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS parent
FROM global_temp.documentIdentifiers i1
That partially solves the challenge; the problem is that when I execute this code on Spark, the performance is terrible. The warning message is very clear about it:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Partitioning by either of the two fields does not work; it breaks the result, of course, since each partition is unaware of the lines in the other partitions.
Does anyone have any clue on how I can "select the next biggest line" without these performance issues?
Thanks
Using your "next" approach, and assuming the data is generated in ascending line order, the following does work in parallel; whether it is actually faster you can tell me, as I do not know your volume of data. In any event, you cannot solve this with SQL (%sql) alone.
Here goes:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
case class X(identifier: Long, line: Long) // Too hard to explain, just gets around issues with df --> rdd --> df.
// Gen some more data.
val df = Seq(
(1000000, 23), (1200, 56), (1201, 58), (1202, 60),
(8200, 63), (890000, 67), (990000, 99), (33000, 123),
(33001, 124), (33002, 126), (33009, 132), (33019, 133),
(33029, 134), (33039, 135), (800, 201), (1800, 999),
(1801, 1999), (1802, 2999), (1800444, 9999)
).toDF("identifier", "line")
// Add partition so as to be able to apply parallelism - except for upper boundary record.
val df2 = df.as[X]
.rdd
.mapPartitionsWithIndex((index, iter) => {
iter.map(x => (index, x ))
}).mapValues(v => (v.identifier, v.line)).map(x => (x._1, x._2._1, x._2._2))
.toDF("part", "identifier", "line")
// Process per partition.
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("part").orderBy("line")
val df3 = df2.withColumn("next", lead("line", 1, null).over(w))
// Process upper boundary.
val df4 = df3.filter(df3("part") =!= 0).groupBy("part").agg(min("line").as("nxt")).toDF("pt", "nxt")
val df5 = df3.join(df4, (df3("part") === df4("pt") - 1), "outer" )
val df6 = df5.withColumn("next", when(col("next").isNull, col("nxt")).otherwise(col("next"))).select("identifier", "line", "next")
// Display. Sort accordingly.
df6.show(false)
returns:
+----------+----+----+
|identifier|line|next|
+----------+----+----+
|1000000 |23 |56 |
|1200 |56 |58 |
|1201 |58 |60 |
|1202 |60 |63 |
|8200 |63 |67 |
|890000 |67 |99 |
|990000 |99 |123 |
|33000 |123 |124 |
|33001 |124 |126 |
|33002 |126 |132 |
|33009 |132 |133 |
|33019 |133 |134 |
|33029 |134 |135 |
|33039 |135 |201 |
|800 |201 |999 |
|1800 |999 |1999|
|1801 |1999|2999|
|1802 |2999|9999|
|1800444 |9999|null|
+----------+----+----+
You can add additional sorting, etc. This relies on a narrow transformation when adding the partition index. How you load the data may be an issue. Caching is not considered.
If data is not ordered as stated above, a range partitioning needs to occur first.
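For reference, a rough PySpark sketch of the same idea (not a tested drop-in; it assumes a DataFrame df with identifier and line columns that is already range-partitioned in ascending line order, as above) could use spark_partition_id instead of mapPartitionsWithIndex:
from pyspark.sql import functions as F, Window

# Tag each row with its current partition id.
df2 = df.withColumn("part", F.spark_partition_id())

# lead() within each partition runs in parallel.
w = Window.partitionBy("part").orderBy("line")
df3 = df2.withColumn("next", F.lead("line", 1).over(w))

# Fix each partition's upper boundary with the minimum line of the
# following partition.
df4 = (df3.filter(F.col("part") != 0)
          .groupBy("part")
          .agg(F.min("line").alias("nxt"))
          .withColumnRenamed("part", "pt"))

df5 = df3.join(df4, df3["part"] == df4["pt"] - 1, "left")
df6 = (df5.withColumn("next", F.coalesce(F.col("next"), F.col("nxt")))
          .select("identifier", "line", "next"))

df6.show(truncate=False)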
I am working on a spark cluster and I have two dataframes. One contains text. The other is a look-up table. Both tables are huge (M and N both could easily exceed 100,000 entries). What is the best way to match them?
Doing a cross-join then filtering results based on matches seems like a crazy idea since I would most certainly run out of memory.
My dataframes look something like this:
df1:
text
0 i like apples
1 oranges are good
2 eating bananas is healthy
. ...
. ...
M tomatoes are red, bananas are yellow
df2:
fruit_lookup
0 apples
1 oranges
2 bananas
. ...
. ...
N tomatoes
I am expecting an output dataframe to look something like this:
output_df:
text extracted_fruits
0 i like apples ['apples']
1 oranges are good ['oranges']
2 eating bananas is healthy ['bananas']
. ...
. ...
M tomatoes are red, bananas are yellow ['tomatoes','bananas']
One way is to use CountVectorizerModel as 100K lookup words should be manageable for this model (default vocabSize=262144):
The basic idea is to create the CountVectorizerModel from a customized vocabulary list taken from df2 (the lookup table), split df1.text into an array column, and then transform this column into a SparseVector, which can then be mapped back into words:
Edit: in the split function, I adjusted the regex from \s+ to [\s\p{Punct}]+ so that all punctuation marks are removed. Change 'text' to lower(col('text')) if the lookup is case-insensitive.
from pyspark.ml.feature import CountVectorizerModel
from pyspark.sql.functions import split, udf, regexp_replace, lower
df2.show()
+---+------------+
| id|fruit_lookup|
+---+------------+
| 0| apples|
| 1| oranges|
| 2| bananas|
| 3| tomatoes|
| 4|dragon fruit|
+---+------------+
Edit-2: Added the following df1 pre-processing step to create an array column including all N-gram combinations. For each string with L words, N=2 adds (L-1) more items to the array, and N=3 adds (L-1)+(L-2) more items.
# max number of words in a single entry of the lookup table df2
N = 2
# Pre-process the `text` field up to N-grams,
# example: ngram_str('oranges are good', 3)
# --> ['oranges', 'are', 'good', 'oranges are', 'are good', 'oranges are good']
def ngram_str(s_t_r, N):
    arr = s_t_r.split()
    L = len(arr)
    for i in range(2, N+1):
        if L - i < 0: break
        arr += [ ' '.join(arr[j:j+i]) for j in range(L-i+1) ]
    return arr
udf_ngram_str = udf(lambda x: ngram_str(x, N), 'array<string>')
df1_processed = df1.withColumn('words_arr', udf_ngram_str(lower(regexp_replace('text', r'[\s\p{Punct}]+', ' '))))
Apply the model to the processed df1:
lst = [ r.fruit_lookup for r in df2.collect() ]
model = CountVectorizerModel.from_vocabulary(lst, inputCol='words_arr', outputCol='fruits_vec')
df3 = model.transform(df1_processed)
df3.show(20,40)
#+----------------------------------------+----------------------------------------+-------------------+
#| text| words_arr| fruits_vec|
#+----------------------------------------+----------------------------------------+-------------------+
#| I like apples| [i, like, apples, i like, like apples]| (5,[0],[1.0])|
#| oranges are good|[oranges, are, good, oranges are, are...| (5,[1],[1.0])|
#| eating bananas is healthy|[eating, bananas, is, healthy, eating...| (5,[2],[1.0])|
#| tomatoes are red, bananas are yellow|[tomatoes, are, red, bananas, are, ye...|(5,[2,3],[1.0,1.0])|
#| test| [test]| (5,[],[])|
#|I have dragon fruit and apples in my bag|[i, have, dragon, fruit, and, apples,...|(5,[0,4],[1.0,1.0])|
#+----------------------------------------+----------------------------------------+-------------------+
Then you can map fruits_vec back to the fruit names using model.vocabulary:
vocabulary = model.vocabulary
#['apples', 'oranges', 'bananas', 'tomatoes', 'dragon fruit']
to_match = udf(lambda v: [ vocabulary[i] for i in v.indices ], 'array<string>')
df_new = df3.withColumn('extracted_fruits', to_match('fruits_vec')).drop('words_arr', 'fruits_vec')
df_new.show(truncate=False)
#+----------------------------------------+----------------------+
#|text |extracted_fruits |
#+----------------------------------------+----------------------+
#|I like apples |[apples] |
#|oranges are good |[oranges] |
#|eating bananas is healthy |[bananas] |
#|tomatoes are red, bananas are yellow |[bananas, tomatoes] |
#|test |[] |
#|I have dragon fruit and apples in my bag|[apples, dragon fruit]|
#+----------------------------------------+----------------------+
Method-2: Since your datasets are not huge in terms of Spark, the following might work; it also handles lookup values containing multiple words, as per your comment:
from pyspark.sql.functions import expr, collect_set
df1.alias('d1').join(
df2.alias('d2')
, expr('d1.text rlike concat("\\\\b", d2.fruit_lookup, "\\\\b")')
, 'left'
).groupby('text') \
.agg(collect_set('fruit_lookup').alias('extracted_fruits')) \
.show()
+--------------------+-------------------+
| text| extracted_fruits|
+--------------------+-------------------+
| oranges are good| [oranges]|
| I like apples| [apples]|
|tomatoes are red,...|[tomatoes, bananas]|
|eating bananas is...| [bananas]|
| test| []|
+--------------------+-------------------+
Where \\\\b is a word boundary, so that the lookup values do not get mixed up with their surrounding context.
Note: you may need to clean up all punctuation marks and redundant spaces on both columns before the dataframe join.
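For instance, a minimal cleanup sketch (assuming punctuation should simply become spaces and matching is case-insensitive; the column names text_clean and fruit_clean are made up here) could be:
from pyspark.sql import functions as F

# Lower-case both sides and collapse punctuation / extra whitespace
# into single spaces before the rlike join.
df1_clean = df1.withColumn(
    "text_clean",
    F.trim(F.regexp_replace(F.lower("text"), r"[\s\p{Punct}]+", " "))
)
df2_clean = df2.withColumn("fruit_clean", F.lower("fruit_lookup"))
The join itself would then use text_clean and fruit_clean in place of text and fruit_lookup.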
I have a simple question.
“How do I read files with a fixed record length?” I have 2 fields in the record: name and state.
File Data-
John OHIO
VictorNEWYORK
Ron CALIFORNIA
File Layout-
Name String(6);
State String(10);
I just want to read it and create a DataFrame from this file. To elaborate a bit more on “fixed record length”: since “OHIO” is only 4 characters long, in the file it is padded with 6 trailing spaces (“OHIO      ”) to fill the State field.
The record length here is 16.
Thanks,
Sid
Read your input file:
val rdd = sc.textFile("your_file_path")
Then use substring to split the fields, and convert the RDD to a DataFrame using toDF().
val df = rdd.map(l => (l.substring(0, 6).trim(), l.substring(6, 16).trim()))
.toDF("Name","State")
df.show(false)
Result:
+------+----------+
|Name |State |
+------+----------+
|John |OHIO |
|Victor|NEWYORK |
|Ron |CALIFORNIA|
+------+----------+
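Alternatively (just a sketch, assuming PySpark and the same 6/10 column widths), the same fixed-width split can be done with the DataFrame reader and substring, avoiding the RDD round-trip:
from pyspark.sql import functions as F

# Each line is read into a single `value` column; substring is 1-based,
# so Name is chars 1-6 and State is chars 7-16.
df = (spark.read.text("your_file_path")
           .select(
               F.trim(F.substring("value", 1, 6)).alias("Name"),
               F.trim(F.substring("value", 7, 10)).alias("State")
           ))

df.show(truncate=False)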