pyspark dataframe replace null in one column with another column by converting it from string to array

I would like to replace a null value of a pyspark dataframe column with another string column converted to array.
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))
new_customers.show(10, truncate=False)
But the output is:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[] |
|Cosimo|[d]|[d] |
+------+---+-------+
What I expect:
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
Did I miss something? Thanks.

The problem is that John's val is an array containing a single null element, not a null array. The column value itself is not null, so coalesce keeps it and never falls back to the name.
First, remove the null elements from the arrays:
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
+------+------+
|name |val |
+------+------+
|Karen |[a] |
|Penny |[b] |
|John |[null]|
|Cosimo|[d] |
+------+------+
new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
Then change the expression to check for an empty array instead of a null:
new_customers = new_customers.withColumn("new_val", F.when(F.size("val")>0, F.col("val")).otherwise(F.array(F.col("name"))))
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
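The difference between a null array and an array holding a null can be checked without a cluster. The following is a plain-Python model of the logic, not Spark code: the coalesce and fill_from_name helpers are hypothetical stand-ins for F.coalesce and for the filter-then-size expression above, and treating a missing array as empty is an assumption of the sketch.

```python
# Plain-Python model of the logic; no Spark session is needed to run it.

def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None

rows = [("Karen", ["a"]), ("Penny", ["b"]), ("John", [None]), ("Cosimo", ["d"])]

# [None] is a list object, not None, so coalesce never reaches the fallback.
naive = [coalesce(val, [name]) for name, val in rows]

def fill_from_name(name, val):
    """Drop null elements, then fall back to [name] if nothing is left."""
    # Assumption of this sketch: a missing array is treated like an empty one.
    cleaned = [x for x in (val or []) if x is not None]
    return cleaned if len(cleaned) > 0 else [name]

fixed = [fill_from_name(name, val) for name, val in rows]

print(naive)
print(fixed)
```

John's row stays unchanged in the naive list and becomes a one-element list holding the name in the fixed list, which mirrors the two Spark outputs shown above.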

Related

How to split a column by using length split and MaxSplit in Pyspark dataframe?

For example, suppose I read a CSV in PySpark and it contains the column shown below:
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
If I call my function with these parameters:
Length = 2
Maxsplit = 3
then I should get this result:
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, with
Length = 3
Maxsplit = 2
it should give me this output:
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is what it should look like. Thank you.

Another way to go about this. Should be faster than any looping or udf solution.
from pyspark.sql import functions as F
def split(df,length,maxsplit):
    return df.withColumn('Names',F.split("Names","(?<=\\G{})".format('.'*length)))\
        .select(*((F.col("Names")[x]).alias("Col_"+str(x+1)) for x in range(0,maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|Col_1|Col_2|Col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
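The expected tables from the question can also be reproduced in plain Python, which is handy for checking the chunking rule before running it on a cluster. This is only a sketch of the rule, not Spark code: split_fixed is a hypothetical helper name, and padding short rows with None follows the Null shown in the question's expected output.

```python
def split_fixed(text, length, maxsplit):
    """Cut text into pieces of `length` characters and return exactly
    `maxsplit` cells, padding with None when the text runs out."""
    pieces = [text[i:i + length] for i in range(0, len(text), length)]
    pieces = pieces[:maxsplit]  # ignore anything past the last requested column
    return pieces + [None] * (maxsplit - len(pieces))

names = ["Rahul", "Ravi", "Raghu", "Romeo"]

for name in names:
    print(split_fixed(name, 2, 3), split_fixed(name, 3, 2))
```

With Length = 2 and Maxsplit = 3, "Ravi" yields two pieces and one padded cell, while the other names fill all three columns.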

Try this,
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([("Raghu",1),("Ravi",2),("Rahul",3)],schema=["Name","val"])
def fn (split,max_n,tst):
    for i in range(max_n):
        tst_loop=tst.withColumn("coln"+str(i),F.substring(F.col("Name"),(i*split)+1,split))
        tst=tst_loop
    return(tst)
tst_res = fn(3,2,tst)
The for loop can also be replaced by a list comprehension or reduce, but in your case a for loop looked neater to me. They produce the same physical plan anyway.
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+

Try this
import pyspark.sql.functions as f
def split(data,length,maxSplit):
    start=1
    for i in range(0,maxSplit):
        data = data.withColumn(f'col_{start}-{start+length-1}',f.substring('channel',start,length))
        start+=length  # advance to the first character of the next chunk
    return data
df = split(data,3,2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+

Perhaps this is useful-
Load the test data
Note: written in scala
val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
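Split the string column by length, padding with nulls up to Maxsplit columns
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))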
val split = udf((str:String, length: Int, maxSplit: Int) =>{
val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
RowFactory.create(splits ++ Array.fill(maxSplit-splits.length)(null): _*)
}, schema)
val p = df
.withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
.selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
splits ++ Array.fill(Maxsplit-splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

Create empty sparse vectors in PySpark

I have a dataframe DF1 that looks like this:
+-------+------+
|user_id|meta |
+-------+------+
| 1| null|
| 11| null|
| 15| null|
+-------+------+
Schema:
root
|-- user_id: string (nullable = true)
|-- meta: string (nullable = true)
and I have another dataframe DF2 that looks like this
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 10| (2,[1],[1.0])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
+-------+------------------------------------+
Schema is:
[user_id: string, Vectorz: vector]
I want to inject all the user_ids from DF1 into DF2, but create empty sparse vectors for them since their "meta" column is all NULLs.
So, I want DF2 to finally be:
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 1| (,[],[])|
| 10| (2,[1],[1.0])|
| 11| (,[],[])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
| 15| (,[],[])|
+-------+------------------------------------+
Can somebody please help?
I am new to PySpark. So, sorry if I don't sound informed enough.

You can create an empty vector for every user_id whose meta is null.
You still need to decide what to do when the meta column is not null.
Sample code (Scala)
DF1
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes
val spark = sqlContext.sparkSession
val implicits = sqlContext.sparkSession.implicits
import implicits._
val df1 = sqlContext.range(1,4)
.withColumnRenamed("id", "user_id")
.withColumn("meta", lit(null).cast(DataTypes.StringType))
df1.show(false)
df1.printSchema()
+-------+----+
|user_id|meta|
+-------+----+
|1 |null|
|2 |null|
|3 |null|
+-------+----+
root
|-- user_id: long (nullable = false)
|-- meta: string (nullable = true)
DF2
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
val staticVector = udf(() => Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), SQLDataTypes.VectorType)
val df2 = sqlContext.range(5,8)
.withColumnRenamed("id", "user_id")
.withColumn("Vectorz", staticVector())
df2.show(false)
df2.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
Processed DF
val emptyVector = udf(() => Vectors.sparse(0, Array.empty[Int], Array.empty[Double]), SQLDataTypes.VectorType)
val processedDF =
// meta column shouldn't have any value
// for the safer side adding filter as meta is null
// need to decide what if meta is not null
// I'm assigning empty vector to that also
df1.where(col("meta").isNull)
.withColumn("Vectorz", when(col("meta").isNull, emptyVector()).otherwise(emptyVector()))
.drop("meta")
.unionByName(df2)
processedDF.show(false)
processedDF.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|1 |(0,[],[]) |
|2 |(0,[],[]) |
|3 |(0,[],[]) |
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)

How do I group records that are within a specific time interval using Spark Scala or sql?

I would like to group records in scala only if they have the same ID and their time is within 1 min of each other.
I am thinking conceptually something like this? But I am not really sure
HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
| ID | volume | Time |
|:-----------|------------:|:--------------------------:|
| 1 | 10 | 2019-02-17T12:00:34Z |
| 2 | 20 | 2019-02-17T11:10:46Z |
| 3 | 30 | 2019-02-17T13:23:34Z |
| 1 | 40 | 2019-02-17T12:01:02Z |
| 2 | 50 | 2019-02-17T11:10:30Z |
| 1 | 60 | 2019-02-17T12:01:57Z |
to this:
| ID | volume |
|:-----------|------------:|
| 1 | 50 | // (10+40)
| 2 | 70 | // (20+50)
| 3 | 30 |
df.groupBy($"ID", window($"Time", "1 minutes")).sum("volume")
The code above is one solution, but it always aligns the windows to whole minutes.
For example 2019-02-17T12:00:45Z will have a range of
2019-02-17T12:00:00Z TO 2019-02-17T12:01:00Z.
I am looking for this instead:
2019-02-17T11:45:00Z TO 2019-02-17T12:01:45Z.
Is there a way?

org.apache.spark.sql.functions provides three overloads of the window function.
1. window(timeColumn: Column, windowDuration: String): Generates tumbling time windows given a timestamp column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
2. window(timeColumn: Column, windowDuration: String, slideDuration: String): Bucketizes rows into one or more time windows given a timestamp column. Window starts are inclusive but window ends are exclusive, as above.
slideDuration specifies the sliding interval of the window, e.g. 1 minute. A new window is generated every slideDuration. It must be less than or equal to windowDuration.
The windows will look like:
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Same as the previous overload, with an additional startTime, the offset with respect to 1970-01-01 00:00:00 UTC at which window intervals start.
The windows will look like:
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
For example, to have hourly tumbling windows that start 15 minutes past the hour (12:15-13:15, 13:15-14:15, ...), provide startTime as 15 minutes. This is the overload that suits your requirement.
Working code is below.
import org.apache.spark.sql.SparkSession
object SparkWindowTest extends App {
val spark = SparkSession
.builder()
.master("local")
.appName("File_Streaming")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
//Prepare Test Data
val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
(3, 30, "2019-02-17 13:23:34"),(2, 50, "2019-02-17 11:10:30"),
(1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
.toDF("ID", "Volume", "TimeString")
df.show()
df.printSchema()
+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
//Converted String Timestamp into Timestamp
val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))
//Dropped String Timestamp from DF
val modifiedDF1 = modifiedDF.drop("TimeString")
modifiedDF.show(false)
modifiedDF.printSchema()
+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
|-- Time: timestamp (nullable = true)
modifiedDF1.show(false)
modifiedDF1.printSchema()
+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- Time: timestamp (nullable = true)
//Main logic
val modifiedDF2 = modifiedDF1.groupBy($"ID", window($"Time", "1 minutes","1 minutes","45 seconds")).sum("Volume")
//Renamed all columns of DF.
val newNames = Seq("ID", "WINDOW", "VOLUME")
val finalDF = modifiedDF2.toDF(newNames: _*)
finalDF.show(false)
+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+
}
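The window boundaries in the output above can be checked by hand. Below is a plain-Python sketch of the arithmetic for a tumbling window with a start offset, as I understand it: each timestamp is moved back to the most recent boundary that lies startTime past a multiple of the window length. The helper name window_start and the choice to read the timestamps as UTC are assumptions of the sketch, not part of the Spark API.

```python
from collections import defaultdict
from datetime import datetime, timezone

def window_start(ts, window_seconds, start_seconds):
    """Start of the tumbling window [start, start + window_seconds) holding ts.

    ts is a naive datetime read as UTC (an assumption of this sketch).
    """
    epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
    start = epoch - ((epoch - start_seconds) % window_seconds)
    return datetime.fromtimestamp(start, tz=timezone.utc).replace(tzinfo=None)

# Same test data as the Scala example above.
records = [
    (1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
    (3, 30, "2019-02-17 13:23:34"), (2, 50, "2019-02-17 11:10:30"),
    (1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"),
]

# Sum volume per (ID, window start): 1-minute windows offset by 45 seconds.
totals = defaultdict(int)
for rec_id, volume, time_string in records:
    ts = datetime.strptime(time_string, "%Y-%m-%d %H:%M:%S")
    totals[(rec_id, window_start(ts, 60, 45))] += volume

for (rec_id, start), volume in sorted(totals.items()):
    print(rec_id, start, volume)
```

The five groups it prints correspond to the five rows of the final Spark table, which makes it easy to see why 12:00:49 and 12:01:02 share a window while 12:01:57 does not.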

How to extract values from key value map?

I have a column of type map, where the key and value changes. I am trying to extract the value and create a new column.
Input:
+---------------+
|symbols |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[] |
|[] |
+---------------+
Expected output:
+-------+
|symbols|
+-------+
|3PEA |
|BA |
| |
| |
+-------+
Here is what I tried so far using a udf:
def map_value=udf((inputMap:Map[String,String])=> {inputMap.map(x=>x._2)
})
It fails with:
java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported

As of Spark 2.3 (Scala and SQL APIs) or 2.4 (PySpark API), you can use the Spark SQL function map_values.
The following is in PySpark; Scala would be very similar.
Setup (assuming a working SparkSession named spark):
from pyspark.sql import functions as F
df = (
spark.read.json(sc.parallelize(["""[
{"key": ["3pea"], "value": ["3PEA"] },
{"key": ["barello"], "value": ["BA"] }
]"""]))
.select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols") )
)
df.printSchema()
df.show()
root
|-- symbols: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------+
| symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
| 3PEA|
| BA|
+--------+
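The question's input also contains empty maps, which the setup above does not cover. As a plain-Python model of what "take the first value, or nothing" means, here is a small sketch; first_map_value is a hypothetical helper, and how Spark itself treats index 0 of an empty values array may depend on version and settings, so that part is worth verifying on your own cluster.

```python
def first_map_value(symbols):
    """Return the first value of a dict, or None for an empty or missing dict."""
    values = list(symbols.values()) if symbols else []
    return values[0] if values else None

# Rows modelled on the question's input, including the two empty maps.
rows = [{"3pea": "3PEA"}, {"barello": "BA"}, {}, {}]
extracted = [first_map_value(row) for row in rows]

print(extracted)
```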

import org.apache.spark.sql.functions._
import spark.implicits._
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))
val df = m.toDF("map_data")
df.show
// Simulate your data I think.
val df2 = df.withColumn("xxx", split(concat_ws("",$"map_data"), "-> ")).select($"xxx".getItem(1).as("map_val")).drop("xxx")
df2.show(false)
results in:
+--------------------+
| map_data|
+--------------------+
| [A -> abc]|
|[B -> 0.118567559...|
| [C -> kqcams]|
+--------------------+
+-------------------+
|map_val |
+-------------------+
|abc |
|0.11856755943424617|
|kqcams |
+-------------------+

Convert Pyspark Dataframe column from array to new columns

I've a Pyspark Dataframe with this structure:
root
|-- Id: string (nullable = true)
|-- Q: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pr: string (nullable = true)
| | |-- qt: double (nullable = true)
Something similar to:
+----+--------------------- ... --+
| Id | Q |
+----+---------------------- ... -+
| 001| [ [pr1,1.9], [pr3,2.0]...] |
| 002| [ [pr2,1.0], [pr9,3.9]...] |
| 003| [ [pr2,9.0], ... ] |
...
I would like to convert the Q array into columns (column name from pr, value from qt).
I would also like to avoid duplicated columns by merging (adding) values that share the same name.
+----+-----+-----+------+ ... ----+
| Id | pr1 | pr2 | pr3 | ... prn |
+----+-----+-----+------+ ... ----+
| 001| 1.9 | 0.0 | 2.0 | ... |
| 002| 0.0 | 1.0 | 0 | ... |
| 003| 0.0 | 9.0 | ... | ... |
...
How can I perform this transformation?
Thank you in advance!
Julián.

You can do this with a combination of explode and pivot:
import pyspark.sql.functions as F
# explode to get "long" format
df=df.withColumn('exploded', F.explode('Q'))
# get the name and the value in separate columns (struct fields are accessed by name)
df=df.withColumn('name', F.col('exploded').getField('pr'))
df=df.withColumn('value', F.col('exploded').getField('qt'))
# now pivot, summing values that share the same name
df.groupby('Id').pivot('name').agg(F.sum('value')).na.fill(0)
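To see what the explode-then-pivot reshaping produces, here is a plain-Python model of it. It is a sketch rather than Spark code: pivot_sum is a hypothetical helper, the sample rows are loosely based on the question with one made-up duplicate (pr2 twice for Id 002), and repeated names are summed because the question asks for them to be added together.

```python
from collections import defaultdict

# (Id, Q) rows; Q is a list of (pr, qt) pairs. The second pr2 entry is made up
# to show how duplicates are merged.
rows = [
    ("001", [("pr1", 1.9), ("pr3", 2.0)]),
    ("002", [("pr2", 1.0), ("pr9", 3.9), ("pr2", 0.5)]),
    ("003", [("pr2", 9.0)]),
]

def pivot_sum(rows):
    """Return (column names, {Id: values}) with one column per distinct pr."""
    totals = defaultdict(float)
    names = set()
    for row_id, items in rows:          # "explode": one step per array element
        for name, value in items:
            totals[(row_id, name)] += value
            names.add(name)
    columns = sorted(names)             # "pivot": one column per distinct name
    table = {
        row_id: [totals.get((row_id, c), 0.0) for c in columns]  # fill gaps with 0
        for row_id, _ in rows
    }
    return columns, table

columns, table = pivot_sum(rows)
print(columns)
for row_id, values in table.items():
    print(row_id, values)
```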

Very interesting question. This is how I approached it.
test.csv
001,pr1:0.9,pr3:1.2,pr2:2.0
002,pr3:5.2,pr4:0.99
PySpark
from pyspark.sql.functions import expr
file = sc.textFile("file:///test.csv")
# Split each line into (id, rest), then emit one (id, "name:value") pair per entry
# [(u'001', u'pr1:0.9'), ...]
rdd1 = file.map(lambda r: r.split(",", 1)).map(lambda r: (r[0],r[1])).flatMapValues(lambda r: r.split(','))
# Create a DF with 3 columns
# [(u'001', u'pr1', u'0.9'), ...]
rdd2 = rdd1.map(lambda r: (r[0],r[1].split(":"))).map(lambda r: (r[0],r[1][0],r[1][1]))
df = rdd2.toDF()
df.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
|001|pr1| 0.9|
|001|pr3| 1.2|
|001|pr2| 2.0|
|002|pr3| 5.2|
|002|pr4|0.99|
+---+---+----+
# Pivot: one column per name, 0 where an id has no value for that name
df.groupBy("_1").pivot("_2").agg(expr("coalesce(first(_3),0)")).show()
+---+---+---+---+----+
| _1|pr1|pr2|pr3| pr4|
+---+---+---+---+----+
|001|0.9|2.0|1.2| 0|
|002| 0| 0|5.2|0.99|
+---+---+---+---+----+
