Regex extraction in SQL

I have data in the following format, from which I'm trying to extract the id part:
{"memberurn"=urn:li:member:10000012}
This is my code:
CAST(regexp_extract(key.memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT) AS member_id
In the output, member_id is NULL.
What am I doing wrong here?

Try this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext
df = sc.parallelize([
    [""" {"memberurn"=urn:li:member:10000012}"""]]).toDF(["a"])
df.show(truncate=False)
+-------------------------------------+
|a |
+-------------------------------------+
| {"memberurn"=urn:li:member:10000012}|
+-------------------------------------+
df1 = df.withColumn("id", F.regexp_extract(F.col('a'),
                                            '(urn:li:member:)(\d+)', 2))
df2 = df1.withColumn("id", df1["id"].cast(LongType()))
df2.show()
+-------------------------------------+--------+
|a |id |
+-------------------------------------+--------+
| {"memberurn"=urn:li:member:10000012}|10000012|
+-------------------------------------+--------+
df2.printSchema()
root
|-- a: string (nullable = true)
|-- id: long (nullable = true)
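As a side note, the question's original pattern with capture group 1 should also work on this data; here is a minimal sketch (reusing the df built above, so the column is 'a' rather than key.memberurn):
import pyspark.sql.functions as F

# same extraction in one step, keeping group index 1 as in the question's SQL
df_alt = df.withColumn(
    "member_id",
    F.regexp_extract(F.col("a"), r"urn:li:member:(\d+)", 1).cast("bigint"),
)
df_alt.show(truncate=False)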

In Scala -
Using regexp_extract
val df = spark.range(1).withColumn("memberurn", lit("urn:li:member:10000012"))
df.withColumn("member_id",
expr("""CAST(regexp_extract(memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT)"""))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
Using substring_index
df.withColumn("member_id",
substring_index($"memberurn", ":", -1).cast("bigint"))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
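One caveat worth noting (my addition, sketched in pyspark): substring_index keeps everything after the last ':', so on the original braced value from the question the trailing '}' survives and the bigint cast yields null; the regex approach does not have this problem.
import pyspark.sql.functions as F

df = spark.createDataFrame([('{"memberurn"=urn:li:member:10000012}',)], ["memberurn"])
df.select(
    F.substring_index("memberurn", ":", -1).alias("tail"),                       # 10000012}
    F.substring_index("memberurn", ":", -1).cast("bigint").alias("as_bigint"),   # null
).show(truncate=False)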

Related

Spark- check intersect of two string columns

I have a dataframe below where colA and colB contain strings. I'm trying to check if colB contains any substring of the values in colA. The values can contain commas or spaces, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiters are not constant. Could someone please shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA | colB |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abcde | hj f |
+---+-------+-----------+
You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
{"id": 1, "A": "abc d", "B": "bc, z, d"},
{"id": 2, "A": "abc-d", "B": "acb, abc"},
{"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)
split_regex = "((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))
def mapper(a, b):
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
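If only a match/no-match flag is needed, the result array can be reduced to a boolean; a small follow-up sketch (not part of the original answer):
import pyspark.sql.functions as F

df = df.withColumn("has_overlap", F.size("result") > 0)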
You can use a custom UDF to implement the intersect logic as below -
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}
mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())

sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+

Fill in missing row values with previous and next non-missing values

I know you can forward/backward fill missing values with the next non-missing value using the last function combined with a window function.
But I have data that looks like:
Area,Date,Population
A, 1/1/2000, 10000
A, 2/1/2000,
A, 3/1/2000,
A, 4/1/2000, 10030
A, 5/1/2000,
In this example, for the May population I'd like to fill in 10030, which is easy. But for Feb and Mar, the value I'd like to fill in is the mean of 10000 and 10030, not 10000 or 10030.
Do you know how to implement this?
Thanks,
Get the next and previous value and compute the mean as below-
df2.show(false)
df2.printSchema()
/**
* +----+--------+----------+
* |Area|Date |Population|
* +----+--------+----------+
* |A |1/1/2000|10000 |
* |A |2/1/2000|null |
* |A |3/1/2000|null |
* |A |4/1/2000|10030 |
* |A |5/1/2000|null |
* +----+--------+----------+
*
* root
* |-- Area: string (nullable = true)
* |-- Date: string (nullable = true)
* |-- Population: integer (nullable = true)
*/
val w1 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
df2.withColumn("previous", last("Population", ignoreNulls = true).over(w1))
.withColumn("next", first("Population", ignoreNulls = true).over(w2))
.withColumn("new_Population", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
.drop("next", "previous")
.show(false)
/**
* +----+--------+----------+--------------+
* |Area|Date |Population|new_Population|
* +----+--------+----------+--------------+
* |A |1/1/2000|10000 |10000.0 |
* |A |2/1/2000|null |10015.0 |
* |A |3/1/2000|null |10015.0 |
* |A |4/1/2000|10030 |10030.0 |
* |A |5/1/2000|null |10030.0 |
* +----+--------+----------+--------------+
*/
Here is my try.
w1 and w2 are used to partition the window, and w3 and w4 are used to fill the preceding and following values. After that, you can give the condition to calculate how to fill the Population.
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w2 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow, Window.unboundedFollowing)
w3 = Window.partitionBy('Area', 'partition1').orderBy('Date')
w4 = Window.partitionBy('Area', 'partition2').orderBy(f.desc('Date'))
df.withColumn('check', f.col('Population').isNotNull().cast('int')) \
.withColumn('partition1', f.sum('check').over(w1)) \
.withColumn('partition2', f.sum('check').over(w2)) \
.withColumn('first', f.first('Population').over(w3)) \
.withColumn('last', f.first('Population').over(w4)) \
.withColumn('fill', f.when(f.col('first').isNotNull() & f.col('last').isNotNull(), (f.col('first') + f.col('last')) / 2).otherwise(f.coalesce('first', 'last'))) \
.withColumn('Population', f.coalesce('Population', 'fill')) \
.orderBy('Date') \
.select(*df.columns).show(10, False)
+----+--------+----------+
|Area|Date |Population|
+----+--------+----------+
|A |1/1/2000|10000.0 |
|A |2/1/2000|10015.0 |
|A |3/1/2000|10015.0 |
|A |4/1/2000|10030.0 |
|A |5/1/2000|10030.0 |
+----+--------+----------+
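A small caveat on both window solutions above: Date is a string here, so orderBy('Date') sorts lexicographically, which happens to work for this sample but may not for longer ranges. If needed, the windows can be ordered on a parsed date instead; a minimal sketch, assuming the format is d/M/yyyy:
import pyspark.sql.functions as F
from pyspark.sql import Window

# parse the string date so the window ordering is chronological
df = df.withColumn("date_parsed", F.to_date(F.trim("Date"), "d/M/yyyy"))
w1 = Window.partitionBy("Area").orderBy("date_parsed") \
           .rowsBetween(Window.unboundedPreceding, Window.currentRow)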

Split a dataframe string column by two different delimiters

The following is my dataset:
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
The following is the code I have been trying to use:
df.withColumn("item1", split(col("Itemcode"), "/").getItem(0)).withColumn("item2", split(col("Itemcode"), "/").getItem(1)).withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash between the first and second items, and also when there is a double slash between the second and third items.
Desired output is:
item1 item2 item3
DB9450 DB9450 AD9066
DA0002 DE2396 DF2345
HWC72
GG7183 EB6693
TA444 B9X8X4
You can first replace the // with /, then you can split. Please try the below and let us know if it worked.
Input
df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])
+--------------------+-------+
| reg|postime|
+--------------------+-------+
|DB9450//DB9450/AD...| a|
|DA0002/DE2396//DF...| a|
| HWC72| a|
| GG7183/EB6693| a|
| TA444/B9X8X4:7-2-| a|
+--------------------+-------+
Logic
import pyspark.sql.functions as F

df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1' , F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2' , F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3' , F.col('split_col').getItem(2))
Output
+--------------------+-------+--------------------+------+------+------+
| reg|postime| split_col| col1| col2| col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...| a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...| a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
| HWC72| a| [HWC72]| HWC72| null| null|
| GG7183/EB6693| a| [GG7183, EB6693]|GG7183|EB6693| null|
| TA444/B9X8X4:7-2-| a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4| null|
+--------------------+-------+--------------------+------+------+------+
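A possibly more general variant of the same idea (column names assumed as above): strip anything after ':' up front instead of hardcoding ':7-2-' for col2 only:
import pyspark.sql.functions as F

# drop everything from the first ':' onward, then normalise '//' to '/' and split
df_b = df_b.withColumn(
    "split_col",
    F.split(F.regexp_replace(F.regexp_replace("reg", ":.*", ""), "//", "/"), "/"),
)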
Processing the text as CSV works well for this.
First, let's read in the text, replacing double slashes along the way.
Edit: Also removing everything after a colon.
val items = """
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
""".replaceAll("//", "/").split(":")(0)
Get the max number of items in a row
to create an appropriate header
val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)
val header = (1 to numItems).map("Itemcode" + _).mkString("/")
Then we're ready to create a Data Frame
val df = spark.read
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "/")
.option("header", "true")
.csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
.filter("Itemcode1 <> 'Itemcode'")
df.show(false)
+---------+-----------+---------+
|Itemcode1|Itemcode2 |Itemcode3|
+---------+-----------+---------+
|DB9450 |DB9450 |AD9066 |
|DA0002 |DE2396 |DF2345 |
|HWC72 |null |null |
|GG7183 |EB6693 |null |
|TA444 |B9X8X4 |null |
+---------+-----------+---------+
Perhaps this is useful (Spark >= 2.4) -
The split and TRANSFORM Spark SQL functions will do the magic, as below -
Load the provided test data
val data =
"""
|Itemcode
|
|DB9450//DB9450/AD9066
|
|DA0002/DE2396//DF2345
|
|HWC72
|
|GG7183/EB6693
|
|TA444/B9X8X4:7-2-
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---------------------+
* |Itemcode |
* +---------------------+
* |DB9450//DB9450/AD9066|
* |DA0002/DE2396//DF2345|
* |HWC72 |
* |GG7183/EB6693 |
* |TA444/B9X8X4:7-2- |
* +---------------------+
*
* root
* |-- Itemcode: string (nullable = true)
*/
Use split and TRANSFORM (you can run this query directly in pyspark)
df.withColumn("item_code", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
.selectExpr("item_code[0] item1", "item_code[1] item2", "item_code[2] item3")
.show(false)
/**
* +------+------+------+
* |item1 |item2 |item3 |
* +------+------+------+
* |DB9450|DB9450|AD9066|
* |DA0002|DE2396|DF2345|
* |HWC72 |null |null |
* |GG7183|EB6693|null |
* |TA444 |B9X8X4|null |
* +------+------+------+
*/
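For reference, a rough Python DataFrame API equivalent of the TRANSFORM query; this assumes Spark 3.1+, where pyspark.sql.functions.transform is available:
import pyspark.sql.functions as F

items = F.transform(
    F.split(F.col("Itemcode"), "/+"),         # split on one or more slashes
    lambda x: F.split(x, ":").getItem(0),     # drop anything after a colon
)
df.select(
    items.getItem(0).alias("item1"),
    items.getItem(1).alias("item2"),
    items.getItem(2).alias("item3"),
).show(truncate=False)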

How to split a column by using length split and MaxSplit in Pyspark dataframe?

For example,
if I have a column as given below by reading and showing the CSV in Pyspark:
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
If I specify in my function values such as
Length = 2
Maxsplit = 3
then I should get results like
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, in Pyspark, with
Length = 3
Maxsplit = 2
it should provide me output such as
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is how it should look. Thank you.
Another way to go about this. Should be faster than any looping or udf solution.
from pyspark.sql import functions as F
def split(df, length, maxsplit):
    return df.withColumn('Names', F.split("Names", "(?<=\\G{})".format('.' * length)))\
             .select(*((F.col("Names")[x]).alias("Col_" + str(x + 1)) for x in range(0, maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
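For reference, the generated split pattern for Length = 2 relies on \G, which anchors at the end of the previous match, so the lookbehind makes split() cut every two characters. A quick sanity check of the pattern string (my addition):
# \G anchors at the end of the previous match, so the lookbehind splits
# the string into fixed-length chunks
print("(?<=\\G{})".format('.' * 2))   # prints: (?<=\G..)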
Try this,
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([("Raghu",1),("Ravi",2),("Rahul",3)],schema=["Name","val"])
def fn(split, max_n, tst):
    for i in range(max_n):
        tst_loop = tst.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split))
        tst = tst_loop
    return tst

tst_res = fn(3, 2, tst)
The for loop can also be replaced by a list comprehension or reduce, but I felt that in your case a for loop looked neater; they have the same physical plan anyway (a reduce-based sketch is shown after the results table below).
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+
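As mentioned above, a reduce-based sketch of the same loop (assuming the same tst dataframe) might look like this:
from functools import reduce
import pyspark.sql.functions as F

def fn_reduce(split, max_n, tst):
    # fold over the column indices, adding one substring column per step
    return reduce(
        lambda acc, i: acc.withColumn(
            "coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split)
        ),
        range(max_n),
        tst,
    )

tst_res = fn_reduce(3, 2, tst)   # same physical plan as the for-loop version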
Try this:
def split(data, length, maxSplit):
    # note: this example uses the answerer's own 'channel' column, as in the output below
    start = 1
    for i in range(0, maxSplit):
        data = data.withColumn(f'col_{start}-{start+length-1}', f.substring('channel', start, length))
        start = start + length
    return data
df = split(data,3,2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+
Perhaps this is useful-
Load the test data
Note: written in scala
val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
Split the string column as per the length and offset
val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))
val split = udf((str:String, length: Int, maxSplit: Int) =>{
val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
RowFactory.create(splits ++ Array.fill(maxSplit-splits.length)(null): _*)
}, schema)
val p = df
.withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
.selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
splits ++ Array.fill(Maxsplit-splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

Spark dataframe transverse of columns

I have a ticket transaction system. A sample dataframe looks like below. Every day has 2 records with how many tickets and what value of tickets were booked through each channel (only 2 channels are possible: passenger, agent).
date,channel,ticket_qty,ticket_amount
20011231,passenger,500,2500
20011231,agent,100,1100
20020101,passenger,450,2000
20020101,agent,120,1500
I want to make it a single record per date, removing channel, like below:
date,passenger_ticket_qty,passenger_ticket_amount,agent_ticket_qty,agent_ticket_amount
20011231,500,2500,100,1100
20020101,450,2000,120,1500
I have achieved it in the following way.
val pas_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "passenger")
val agent_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "agent")
val df = pas_df.as("pdf").join(agent_df.as("adf"), $"adf.date" === $"pdf.date")
  .select($"pdf.date" as "date",
    $"pdf.ticket_qty" as "passenger_ticket_qty",
    $"pdf.ticket_amount" as "passenger_ticket_amount",
    $"adf.ticket_qty" as "agent_ticket_qty",
    $"adf.ticket_amount" as "agent_ticket_amount")
This works fine, but it takes around 3 hrs since the file has 40 yrs of records.
Is there a better way to get this done without a join?
Thanks in advance.
Perhaps this is useful-
Load the data provided
val data =
"""
|date,channel,ticket_qty,ticket_amount
|20011231,passenger,500,2500
|20011231,agent,100,1100
|20020101,passenger,450,2000
|20020101,agent,120,1500
""".stripMargin
val stringDS = data.split(System.lineSeparator())
// .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+---------+----------+-------------+
* |date |channel |ticket_qty|ticket_amount|
* +--------+---------+----------+-------------+
* |20011231|passenger|500 |2500 |
* |20011231|agent |100 |1100 |
* |20020101|passenger|450 |2000 |
* |20020101|agent |120 |1500 |
* +--------+---------+----------+-------------+
*
* root
* |-- date: integer (nullable = true)
* |-- channel: string (nullable = true)
* |-- ticket_qty: integer (nullable = true)
* |-- ticket_amount: integer (nullable = true)
*/
Use pivot and first
df.groupBy("date")
.pivot("channel")
.agg(
first("ticket_qty").as("ticket_qty"),
first("ticket_amount").as("ticket_amount")
).show(false)
/**
* +--------+----------------+-------------------+--------------------+-----------------------+
* |date |agent_ticket_qty|agent_ticket_amount|passenger_ticket_qty|passenger_ticket_amount|
* +--------+----------------+-------------------+--------------------+-----------------------+
* |20011231|100 |1100 |500 |2500 |
* |20020101|120 |1500 |450 |2000 |
* +--------+----------------+-------------------+--------------------+-----------------------+
*/
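A pyspark sketch of the same pivot, listing the pivot values up front so Spark does not need an extra pass over the data to discover the distinct channels (column names assumed from the question):
import pyspark.sql.functions as F

result = (
    df.groupBy("date")
      .pivot("channel", ["passenger", "agent"])
      .agg(
          F.first("ticket_qty").alias("ticket_qty"),
          F.first("ticket_amount").alias("ticket_amount"),
      )
)
result.show(truncate=False)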