Optimizing a query for a Spark DataFrame

I am trying to create a DataFrame from a Hive table, but I don't know the Spark API well.
I need help optimizing the query in the method getLastSession, so that Spark runs one job instead of two:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if(onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)
def getLastSession: Dataset[Row] = {
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  +----------+----------------+------------------+-------+
  */
  dfByLastSession
}
PS. My source table (for example):
+-------------+----------+-----------+------------------+-------+
| name_process|id_session| time_write|               key|  value|
+-------------+----------+-----------+------------------+-------+
|   OtherClass|jsdfsadfsf|43434883477|schema0.table0.csv|Success|
|   OtherClass|jksdfkjhka|23212123323|schema1.table1.csv|Success|
|   OtherClass|alskdfksjd|23343212234|schema2.table2.csv|Failure|
|ExternalClass|sdfjkhsdfd|34455453434|schema3.table3.csv|Success|
+-------------+----------+-----------+------------------+-------+

You can use row_number with a Window, like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn=1").drop("rn")
dfByLastSession.show()
However, since the Window is not partitioned by any field, all rows are sorted within a single partition, which can degrade performance.
Another thing you can change in your code is to use struct ordering to get the id_session associated with the most recent time_write in a single query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)
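Putting it together, the method from the question could look like the following minimal sketch, based on the struct-ordering approach above (the DataFrame is passed as a parameter here only to keep the snippet self-contained):
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions.{col, max, struct}

def getLastSession(df: DataFrame): Dataset[Row] = {
  // One aggregation finds the id_session carrying the latest time_write,
  // then a single filter keeps only the rows of that session.
  val lastSession = df
    .select(max(struct(col("time_write"), col("id_session")))("id_session"))
    .first
    .getString(0)
  df.filter(col("id_session") === lastSession)
}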

Related

How to use a DataFrame, which is created from a DStream, outside of the foreachRDD block?

I've been trying to work with Spark Streaming. My problem is that I want to use wordCountsDataFrame again outside of the foreachRDD block.
I want to conditionally join wordCountsDataFrame with another DataFrame that is created from a DStream. Is there any way to do that, or another approach?
Thanks.
My Scala code block is below.
val Seq(projectId, subscription) = args.toSeq
val sparkConf = new SparkConf().setAppName("PubsubWordCount")
val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
val credential = SparkGCPCredentials.builder.build()
val pubsubStream: ReceiverInputDStream[SparkPubsubMessage] =
  PubsubUtils.createStream(ssc, projectId, None, subscription, credential, StorageLevel.MEMORY_AND_DISK_SER_2)
val stream1 = pubsubStream.map(message => new String(message.getData()))

stream1.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")
  wordsDataFrame.createOrReplaceTempView("words")

  val wordCountsDataFrame =
    spark.sql("select word, count(*) from words group by word")
  wordCountsDataFrame.show()
}
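One possible pattern, offered here only as a sketch and not taken from the original post: since the body of foreachRDD runs on the driver, the latest per-batch result can be kept in a driver-side variable and joined with the other DataFrame wherever that DataFrame is available (names such as latestWordCounts and otherDataFrame are assumptions):
import org.apache.spark.sql.DataFrame

// Driver-side holder for the most recent batch's word counts (assumed name).
var latestWordCounts: Option[DataFrame] = None

stream1.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  latestWordCounts = Some(rdd.toDF("word").groupBy("word").count())
}

// Elsewhere (e.g. inside another stream's foreachRDD), join conditionally:
// latestWordCounts.foreach(wc => otherDataFrame.join(wc, "word").show())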

How to use Lucene's DistinctValuesCollector?

My objective is to collect the distinct values of selected fields to provide them as filter options for the frontend. DistinctValuesCollector seems to be the tool for this, but since I haven't found a code sample or documentation other than the Javadocs, I currently can't construct this collector correctly. Can anyone provide an example?
This is my attempt, which doesn't deliver the desired distinct values of the field PROJEKTSTATUS.name.
val groupSelector = TermGroupSelector(PROJEKTSTATUS.name)
val searchGroup = SearchGroup<BytesRef>()
val valueSelector = TermGroupSelector(PROJEKTSTATUS.name)
val groups = mutableListOf(searchGroup)
val distinctValuesCollector = DistinctValuesCollector(groupSelector, groups, valueSelector)
That field is indexed as follows:
document.add(TextField(PROJEKTSTATUS.name, aggregat.projektstatus, YES))
document.add(SortedDocValuesField(PROJEKTSTATUS.name, BytesRef(aggregat.projektstatus)))
Thanks to @andrewJames's hint about a test class, I could figure it out:
fun IndexSearcher.collectFilterOptions(query: Query, field: String, topNGroups: Int = 128, mapper: Function<String?, String?> = Function { it }): Set<String?> {
    val firstPassGroupingCollector = FirstPassGroupingCollector(TermGroupSelector(field), Sort(), topNGroups)
    search(query, firstPassGroupingCollector)
    val topGroups = firstPassGroupingCollector.getTopGroups(0)
    val groupSelector = firstPassGroupingCollector.groupSelector
    val distinctValuesCollector = DistinctValuesCollector(groupSelector, topGroups, groupSelector)
    search(query, distinctValuesCollector)
    return distinctValuesCollector.groups.map { mapper.apply(it.groupValue.utf8ToString()) }.toSet()
}

How to set all fields to null in a Scala DataFrame except a few fields

I have a use case where I need to set all fields in a file to null except for a few, and then write the updated DataFrame back to the same file.
I am trying this, but it gives the error:
Exception in User Class: org.apache.spark.sql.catalyst.parser.ParseException :
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.lit

val excluded_list = List("date", "id")
var transformed_df = source_df
for (t <- source_df.dtypes) {
  println(t._1, t._2)
  if (excluded_list.contains(t._1) == false) {
    print(t._1)
    transformed_df = transformed_df.withColumn(t._1, lit(null).cast(t._2))
  }
}
Although this works:
val transformed_df = hudi_dataframe.withColumn("amount",lit(null).cast(IntegerType))
Could someone please help me with this? Is there any other way to achieve this?
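A likely cause, offered as an assumption rather than something stated in the thread: df.dtypes returns type names such as "StringType", which cast(String) cannot parse as a SQL type (hence the ParseException), whereas cast accepts DataType objects directly. A minimal sketch using the DataType objects from the schema of the original source_df:
import org.apache.spark.sql.functions.lit

val excluded_list = List("date", "id")

// Fold over the schema fields, nulling every column that is not excluded
// while keeping its original DataType.
val transformed_df = source_df.schema.fields
  .filterNot(f => excluded_list.contains(f.name))
  .foldLeft(source_df) { (df, f) =>
    df.withColumn(f.name, lit(null).cast(f.dataType))
  }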

Extract words from a string in Spark/Hadoop with Scala

I was using the code below to extract the strings I needed in Spark SQL. Now I am working with more data in Spark on Hadoop and I want to extract the same strings. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

// User-defined function to extract the hashtags
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used. Your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand", ...). That way, each bulk of words will be assigned to a different partition, and to different JVMs and/or machines (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to a single partition, so there's no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find the rows that match your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
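If the goal is to actually extract the hashtags rather than just filter the matching rows, a minimal sketch without a UDF (an addition to the answer above, reusing the df from the question) could split and explode the sentence into tokens:
import org.apache.spark.sql.functions.{col, explode, split}

// Split each sentence on whitespace, produce one token per row,
// and keep only the tokens that look like hashtags.
df.select(explode(split(col("words"), "\\s+")).as("token"))
  .filter(col("token").rlike("^#\\w+$"))
  .show()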
Hope it helps :-)

Programmatically adding several columns to Spark DataFrame

I'm using Spark with Scala.
I have a DataFrame with 3 columns: ID, Time, RawHexdata.
I have a user-defined function which takes RawHexdata and expands it into X more columns. It is important to state that X is the same for each row (the columns do not vary). However, before I receive the first data, I do not know what the columns are. But once I have the head, I can deduce them.
I would like a second DataFrame with said columns: Id, Time, RawHexdata, NewCol1, ..., NewCol3.
The "easiest" method I can think of to do this is:
1. serialize each row into JSON (every data type is serializable here),
2. add my new columns,
3. deserialize a new DataFrame from the altered JSON.
However, that seems like a waste, as it involves two costly and redundant JSON serialization steps. I am looking for a cleaner pattern.
Using case classes seems like a bad idea, because I don't know the number of columns or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD, which you can obtain by calling dataFrame.rdd. Given a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert an RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object App {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val dataFrame = input.toDF()
    dataFrame.show()

    // create the extended rows RDD
    val rowRDD = dataFrame.rdd.map { row =>
      val blob = row(1).asInstanceOf[Int]
      val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
      Row.fromSeq(row.toSeq.init ++ newColumns)
    }

    val schema = dataFrame.schema

    // we know that the new columns are all integers
    val newColumns = StructType {
      Seq(StructField("1", IntegerType), StructField("2", IntegerType), StructField("3", IntegerType))
    }
    val newSchema = StructType(schema.init ++ newColumns)

    val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
    newDataFrame.show()
  }
}
SELECT is your friend for solving it without going back to the RDD:
import org.apache.spark.sql.functions.{col, lit}

case class Entry(Id: String, Time: Long)

val entries = Seq(
  Entry("x1", 100L),
  Entry("x2", 200L)
)

val newColumns = Seq("NC1", "NC2", "NC3")

val df = spark.createDataFrame(entries)
  .select((col("*") +: newColumns.map(c => lit(null).as(c))): _*)

df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
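As a side note, and not part of the original answers: the same null columns could also be added by folding withColumn over the names, though the single select above keeps everything in one projection.
import org.apache.spark.sql.functions.lit

// Equivalent result via foldLeft/withColumn; each withColumn adds another
// projection to the plan, so the single select is usually preferable.
val df2 = newColumns.foldLeft(spark.createDataFrame(entries)) { (acc, c) =>
  acc.withColumn(c, lit(null))
}
df2.show(false)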