Programmatically adding several columns to Spark DataFrame - sql

I'm using Spark with Scala.
I have a DataFrame with 3 columns: ID, Time, RawHexData.
I have a user-defined function which takes RawHexData and expands it into X more columns. For each row X is the same (the columns do not vary). However, before I receive the first data, I do not know what the columns are; once I have the head, I can deduce them.
I would like a second DataFrame with said columns: ID, Time, RawHexData, NewCol1, ..., NewColX.
The "easiest" method I can think of is:
1. serialize each row into JSON (every data type is serializable here),
2. add my new columns,
3. deserialize a new DataFrame from the altered JSON.
However, that seems like a waste, as it involves two costly and redundant JSON serialization steps. I am looking for a cleaner pattern.
Using case classes seems like a bad idea, because I don't know the number of columns or the column names in advance.

To extend your DataFrame dynamically, you can operate on the row RDD, which you obtain by calling dataFrame.rdd. Given a Row instance, you can access the RawHexData column and parse the contained data. Appending the newly parsed values to the resulting Row almost solves your problem. The only thing still needed to convert an RDD[Row] back into a DataFrame is the schema for your new columns. You can generate it by collecting a single RawHexData value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
object App {
  case class Person(name: String, age: Int)
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val dataFrame = input.toDF()
    dataFrame.show()
    // create the extended rows RDD: drop the last column (age) and
    // append the values derived from it
    val rowRDD = dataFrame.rdd.map { row =>
      val blob = row(1).asInstanceOf[Int]
      val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
      Row.fromSeq(row.toSeq.init ++ newColumns)
    }
    val schema = dataFrame.schema
    // we know that the new columns are all integers
    val newColumns = StructType(Seq(
      StructField("1", IntegerType),
      StructField("2", IntegerType),
      StructField("3", IntegerType)))
    // mirror the row change: drop the last field, then append the new ones
    val newSchema = StructType(schema.init ++ newColumns)
    val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
    newDataFrame.show()
  }
}

SELECT is your friend: it solves this without going back to the RDD.
import org.apache.spark.sql.functions.{col, lit}
case class Entry(Id: String, Time: Long)
val entries = Seq(
  Entry("x1", 100L),
  Entry("x2", 200L)
)
val newColumns = Seq("NC1", "NC2", "NC3")
// keep every existing column and append one null-valued column per new name
val df = spark.createDataFrame(entries)
  .select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
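The same select pattern can also carry real values instead of nulls. The following is only a sketch under stated assumptions: the `decode` function is a hypothetical stand-in for the asker's parser (here every two hex digits become one integer), the names `ExpandHexColumns`, `parsed` and `NewCol1..N` are made up for illustration, and the number of new columns is deduced from the first row, as the question describes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ExpandHexColumns {
  // Hypothetical decoder (an assumption): two hex digits -> one Int field.
  def decode(hex: String): Seq[Int] =
    hex.grouped(2).map(Integer.parseInt(_, 16)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("expand").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1L, "0A0B0C"), ("b", 2L, "010203")).toDF("ID", "Time", "RawHexData")

    // Look at the first row on the driver to learn how many fields the payload yields.
    val width = decode(df.select("RawHexData").head.getString(0)).size

    val decodeUdf = udf(decode _)
    val expanded = df
      .withColumn("parsed", decodeUdf(col("RawHexData")))
      // one output column per array position, named NewCol1..NewColN
      .select(col("*") +: (0 until width).map(i => col("parsed")(i).as(s"NewCol${i + 1}")): _*)
      .drop("parsed")

    expanded.show()
    spark.stop()
  }
}
```

Returning an array limits all new columns to one element type; if the decoded fields have mixed types, the RDD[Row] approach with an explicit schema shown earlier is the more general route.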


Scala + Spark: filter a dataset if it contains elements from a list

I have a dataset that I want to filter based on a column.
val test = Seq(
  ("1", "r2_test"),
  ("2", "some_other_value"),
  ("3", "hs_2_card"),
  ("4", "vsx_np_v2"),
  ("5", "r2_test"),
  ("2", "some_other_value2")
).toDF("id", "my_column")
I want to create a function that filters my DataFrame based on the elements of a list, using contains on "my_column" (a row should be kept if the column contains any of the elements as a substring).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def filteredElements(df: DataFrame): DataFrame = {
  val elements = List("r2", "hs", "np")
  df.filter($"my_column".contains(elements))
}
But written like this it does not work for a list, only for a single element.
How can I adapt it to use my list without having to apply multiple filters?
Below is the expected output when the function is applied:
val output = test.transform(filteredElements)
expected =
("1", "r2_test"), // contains "r2"
("3", "hs_2_card"), // contains "hs"
("4", "vsx_np_v2"), // contains "np"
("5", "r2_test"), // contains "r2"
You can do it in one line without defining a udf (simpler, although a Scala lambda passed to filter is as opaque to the optimizer as a udf is):
df.filter(col("my_column").isNotNull).filter(row => elements.exists(row.getAs[String]("my_column").contains)).show()
One way to solve this would be to use a UDF. I think there should be some way to solve this with Spark SQL functions that I'm not aware of. Anyway, you can define a udf to tell whether a String contains any of the values in your elements List or not:
import org.apache.spark.sql.functions._
val elements = List("r2", "hs", "np")
val isContainedInList = udf { (value: String) =>
  // null values never match
  value != null && elements.exists(e => value.indexOf(e) != -1)
}
You can use this udf in select, filter, basically anywhere you want:
def filteredElements(df: DataFrame): DataFrame = {
  df.filter(isContainedInList($"my_column"))
}
And the result is as expected:
+---+---------+
| id|my_column|
+---+---------+
| 1| r2_test|
| 3|hs_2_card|
| 4|vsx_np_v2|
| 5| r2_test|
+---+---------+
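The UDF answer above suspects that built-in Spark SQL functions could do the same job. One possible sketch, not taken from the answers: build one `Column.contains` predicate per list element and OR them together. The names `ContainsAny` and `containsAny` are hypothetical; the column name and list come from the question.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

object ContainsAny {
  val elements: List[String] = List("r2", "hs", "np")

  // OR together one `contains` predicate per needle.
  // Starting from lit(false) keeps an empty list well-defined (it matches nothing).
  def containsAny(column: Column, needles: Seq[String]): Column =
    needles.foldLeft(lit(false))((acc, needle) => acc || column.contains(needle))

  def filteredElements(df: DataFrame): DataFrame =
    df.filter(containsAny(col("my_column"), elements))
}
```

It plugs into the asker's call unchanged, as `test.transform(ContainsAny.filteredElements)`. Because the predicate is an ordinary column expression, the optimizer can see it, and rows where my_column is null are dropped by the filter without a separate null check.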

Optimization query for DataFrame Spark

I am trying to create a DataFrame from a Hive table, but I am not yet comfortable with the Spark API.
I need help optimizing the query in the method getLastSession, so that the two Spark jobs become one:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if(onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)
def getLastSession: Dataset[Row] = {
  // job 1: find the latest time_write
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  // job 2: find the session that wrote it
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  */
  dfByLastSession
}
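PS. My source table (for example):
+-------------+----------+-----------+------------------+-------+
|name_process |id_session|time_write |key               |value  |
+-------------+----------+-----------+------------------+-------+
|OtherClass   |jsdfsadfsf|43434883477|schema0.table0.csv|Success|
|OtherClass   |jksdfkjhka|23212123323|schema1.table1.csv|Success|
|OtherClass   |alskdfksjd|23343212234|schema2.table2.csv|Failure|
|ExternalClass|sdfjkhsdfd|34455453434|schema3.table3.csv|Success|
+-------------+----------+-----------+------------------+-------+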
You can use row_number with Window like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, max, row_number, struct}
val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn=1").drop("rn")
dfByLastSession.show()
However, a window without partitionBy moves all rows into a single partition, which can degrade performance on large data. Note also that rn=1 keeps only the single most recent row, not every row of the last session.
Another thing you can change in your code is to use struct ordering to get the id_session associated with the most recent time_write in one query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)
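If the goal is to avoid pulling anything back to the driver, the struct-ordering idea can be kept inside a single query by joining instead of collecting. This is a sketch, not part of the original answer: `LastSession` and `lastSessionRows` are hypothetical names, and the column names are the ones from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col, max, struct}

object LastSession {
  def lastSessionRows(df: DataFrame): DataFrame = {
    // One-row DataFrame holding the id_session of the greatest (time_write, id_session) pair.
    val last = df.agg(
      max(struct(col("time_write"), col("id_session"))).getField("id_session").as("id_session")
    )
    // A left semi join keeps only the rows of df whose id_session appears in `last`.
    df.join(broadcast(last), Seq("id_session"), "left_semi")
  }
}
```

The source is still read twice (once for the aggregate, once for the join) unless `df` is cached, but there is no `collect` and Spark plans the whole thing as one query. Unlike the row_number variant, every row of the last session is returned.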

How do I create a new DataFrame based on an old DataFrame?

I have a CSV file, dbname1.table1.csv:
+------------------+-------------+------------------------------------------+-------------+
|target            |source       |source_table                              |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|indirect     |
+------------------+-------------+------------------------------------------+-------------+
CSV format for this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a dataframe by reading it:
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
Now I need to create a new dataframe based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
The information for the fields of the new DataFrame is obtained from a csv file:
pseudocode:
schema_from = source_table.split("\\.")(0) // Example: custom_cib_ml_stg
table_from = source_table.split("\\.")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split("\\.")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split("\\.")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
I need to create a new dataframe. I can't cope on my own.
P.S. I need this dataframe to create a json file from it later.
Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
  "table_from":"p_overall_part_tend_cust",
  "column_from":"inn_num",
  "link_type":"direct",
  "schema_to":"dbname1",
  "table_to":"table1",
  "column_to":"avg_ensure_sum_12m"
 },
 {"schema_from":"custom_cib_ml36_stg",
  "table_from":"p_overall_part_tend_cust",
  "column_from":"protocol_dttm",
  "link_type":"direct",
  "schema_to":"dbname1",
  "table_to":"table1",
  "column_to":"avg_ensure_sum_12m"
 }]
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
val arrTableName = file.getPath.getName.split("\\.")
val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
//val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))
dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
row.getString(2).split("\\.")(1),
row.getString(1),
row.getString(3),
schemaTo,
tableTo,
row.getString(0)))
}
def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
}
You definitely don't want to collect; that defeats the point of using Spark here. As always with Spark, you have a lot of options. You could use RDDs, but I don't see a need to switch between modes here. You just want to apply custom logic to some columns and end up with a DataFrame that holds the resulting column alone.
First, define the function that you want to apply:
def convert(target: String, source: String, source_table: String, relation_type: String): DataLink =
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    "dbname1.table1.csv".split("\\.")(0),
    "dbname1.table1.csv".split("\\.")(1),
    target)
Then apply this function to all the relevant columns (making sure you wrap it in udf to make it a Spark function rather than a plain Scala function) and select the result:
df.select(udf(convert _)($"target", $"source", $"source_table", $"relation_type"))
If you want a DataFrame with 7 columns as your result:
df.select(
  split(col("source_table"), "\\.").getItem(0),
  split(col("source_table"), "\\.").getItem(1),
  col("source"),
  col("relation_type"),
  lit("dbname1"),
  lit("table1"),
  col("target")
)
You can also add .as("column_name") to each of these 7 columns.
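For illustration, here is the seven-column select with the aliases filled in so that the result matches the DataLink field names. It is a sketch under assumptions: `DataLinkColumns`, `schemaAndTable` and `toDataLinks` are hypothetical names, and the file name is passed in rather than hard-coded. The helper also shows why the dot has to be escaped: `String.split` takes a regular expression, so `split(".")` matches every character and yields an empty array.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, split}

object DataLinkColumns {
  // "dbname1.table1.csv" -> ("dbname1", "table1"); the dot is escaped because split takes a regex.
  def schemaAndTable(fileName: String): (String, String) = {
    val parts = fileName.split("\\.")
    (parts(0), parts(1))
  }

  def toDataLinks(df: DataFrame, fileName: String): DataFrame = {
    val (schemaTo, tableTo) = schemaAndTable(fileName)
    val sourceParts = split(col("source_table"), "\\.")
    df.select(
      sourceParts.getItem(0).as("schema_from"),
      sourceParts.getItem(1).as("table_from"),
      col("source").as("column_from"),
      col("relation_type").as("link_type"),
      lit(schemaTo).as("schema_to"),
      lit(tableTo).as("table_to"),
      col("target").as("column_to")
    )
  }
}
```

Since the work stays in column expressions, nothing is collected to the driver; the result can be written out with `write.json` or converted with `.as[DataLink]` if a typed Dataset is wanted.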
You can also use a Dataset directly, via Spark SQL:
import spark.implicits._
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target |source |source_table |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|indirect |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
| schema_from| table_from| column_from|link_type|schema_to|table_to| column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...| inn_num| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...| inn_num| indirect| dbname1| table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
My progress so far:
I can now create a new DataFrame, but it contains only one column.
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
// schemaTo and tableTo are taken from the file name, as in readDLFromHDFS
val convertCase = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    schemaTo,
    tableTo,
    target,
  )
val udfConvert = udf(convertCase)
val dfForJson = dfDL.select(udfConvert(col("target"),
  col("source"),
  col("source_table"),
  col("relation_type")))
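The single column produced by a UDF that returns a case class is a struct, and a struct can be expanded into one column per field by selecting `alias.*`. A minimal sketch of that step, assuming the DataLink case class from the question; `ExpandStruct`, `toLink`, `expand` and the alias `link` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class DataLink(schema_from: String, table_from: String, column_from: String,
                    link_type: String, schema_to: String, table_to: String, column_to: String)

object ExpandStruct {
  // Plain Scala conversion of one CSV row into a DataLink.
  def toLink(schemaTo: String, tableTo: String)
            (target: String, source: String, sourceTable: String, relationType: String): DataLink = {
    val parts = sourceTable.split("\\.")
    DataLink(parts(0), parts(1), source, relationType, schemaTo, tableTo, target)
  }

  def expand(spark: SparkSession, df: DataFrame, schemaTo: String, tableTo: String): Dataset[DataLink] = {
    import spark.implicits._
    val convert = udf(toLink(schemaTo, tableTo) _)
    df.select(convert(col("target"), col("source"), col("source_table"), col("relation_type")).as("link"))
      .select("link.*") // one column per DataLink field
      .as[DataLink]
  }
}
```

From there `expand(...).write.json(path)` or `expand(...).toJSON` produces one JSON object per row with the seven expected keys.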

Extract words from a string in spark hadoop with scala

I was using the code below to extract strings I needed in Spark SQL. But now I am working with more data in Spark Hadoop and I want to extract strings. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
// User-defined function that extracts all hashtags as a comma-separated string
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used. Your DataFrame holds a single row, so it isn't really partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...)
That way, the words can be spread over different partitions, and therefore over different executors and/or nodes (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to one partition, so there is neither parallelism nor distribution.
Second, you don't have to use a UDF (try to avoid those in general).
To find the rows that match your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Note that this keeps the matching rows; it does not extract the hashtags from them.
Hope it helps :-)
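If the hashtags themselves are needed and a UDF is acceptable, the extraction loop from the question can be written more compactly with Scala's own regex class. This is a sketch, not the answerer's code; `Hashtags` and `extract` are hypothetical names, and returning an empty string for null input is an assumption about the desired behaviour.

```scala
import scala.util.matching.Regex

object Hashtags {
  // Same pattern as in the question: '#' followed by one or more word characters.
  private val Hashtag: Regex = "#\\w+".r

  // Returns all hashtags joined by commas; null input yields an empty string.
  def extract(text: String): String =
    Option(text).map(t => Hashtag.findAllIn(t).mkString(",")).getOrElse("")
}
```

It can be wrapped exactly like the original function, for example `val Extract = udf(Hashtags.extract _)` followed by `df.select(Extract(col("words")))`.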

Scala/Apache Spark Converting DataFrame column values and type, multiple when otherwise

I have a primary SQL table that I am reading into Spark and modifying to write to CassandraDB. Currently I have a working implementation for converting a gender from 0, 1, 2, 3 (integers) to "Male", "Female", "Trans", etc. (strings). Though the method below does work, it seems very inefficient to turn a separate Array with those mappings into a DataFrame, join it into the main table/DataFrame, then remove, rename, etc.
I have seen:
.withColumn("gender", when(col("gender") === 1, "male").otherwise("female"))
that would allow me to continue method chaining on the primary table, but I have not been able to get it working with more than 2 options. Is there a way to do this? I have around 10 different columns on this table that each need their own custom conversion. Since this code will be processing TBs of data, is there a less repetitive and more efficient way to accomplish this? Thanks for any help in advance!
case class Gender(tmpid: Int, tmpgender: String)
private def createGenderDf(spark: SparkSession): DataFrame = {
  import spark.implicits._
  Seq(
    Gender(1, "Male"),
    Gender(2, "Female"),
    Gender(777, "Prefer not to answer")
  ).toDF
}
private def createPersonsDf(spark: SparkSession): DataFrame = {
  val genderDf = createGenderDf(spark)
  genderDf.show()
  val personsDf: DataFrame = spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "\t")
    .load(dataPath + "people.csv")
    .withColumnRenamed("ID", "id")
    .withColumnRenamed("name_first", "firstname")
  // look up the label for each gender code
  val personsDf1: DataFrame = personsDf
    .join(genderDf, personsDf("gender") === genderDf("tmpid"), "leftouter")
  // replace the numeric gender column with the label
  val personsDf2: DataFrame = personsDf1
    .drop("gender")
    .drop("tmpid")
    .withColumnRenamed("tmpgender", "gender")
  personsDf2
}
You can use nested when functions, which eliminates the need to create genderDf, join, drop, rename, etc. For your example you can do the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
personsDf.withColumn("gender", when(col("gender") === 1, "male").otherwise(when(col("gender") === 2, "female").otherwise("Prefer not to answer")).cast(StringType))
You can add more when functions to the nested structure above, and repeat the same for the other 10 columns as well.
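With around ten columns to convert, writing each nested when by hand gets repetitive. One way to cut the repetition, offered as a sketch rather than as the answerer's method, is to keep each mapping in a plain Scala Map and fold it into the same nested when/otherwise expression. `CodeMapping`, `decode`, `genderLabels` and the default label "Unknown" are hypothetical; the three gender codes come from the question.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

object CodeMapping {
  val genderLabels: Map[Int, String] =
    Map(1 -> "Male", 2 -> "Female", 777 -> "Prefer not to answer")

  // Builds when(c === k1, v1).otherwise(when(c === k2, v2).otherwise(... default)).
  def decode(column: Column, labels: Map[Int, String], default: String): Column =
    labels.foldLeft(lit(default)) { case (fallback, (code, label)) =>
      when(column === code, label).otherwise(fallback)
    }
}
```

Usage stays chainable, for example `personsDf.withColumn("gender", CodeMapping.decode(col("gender"), CodeMapping.genderLabels, "Unknown"))`, and each further column only needs its own Map.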