How to set all fields to null in a Scala DataFrame except a few fields

I have a use case where I need to set all fields in a file to null except a few, and write the updated DataFrame back to the same file.
I am trying this, but it gives the error:
Exception in User Class: org.apache.spark.sql.catalyst.parser.ParseException :
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val excluded_list = List("date", "id")

var transformed_df = source_df
for (t <- source_df.dtypes) {
  println(t._1, t._2)
  if (excluded_list.contains(t._1) == false) {
    print(t._1)
    transformed_df = transformed_df.withColumn(t._1, lit(null).cast(t._2))
  }
}
Although this works:
val transformed_df = hudi_dataframe.withColumn("amount",lit(null).cast(IntegerType))
Could someone please help me with this? Is there any other way to achieve it?
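For what it's worth, a likely cause of the ParseException is that dtypes returns type names such as "IntegerType", which cast(String) cannot parse as a SQL type. A minimal, untested sketch that uses the actual DataType objects from df.schema and builds a single select instead of repeated withColumn calls:
import org.apache.spark.sql.functions.{col, lit}

val excluded_list = List("date", "id")

// Keep excluded columns as-is; replace the rest with typed nulls.
val transformed_df = source_df.select(source_df.schema.fields.map { f =>
  if (excluded_list.contains(f.name)) col(f.name)
  else lit(null).cast(f.dataType).as(f.name)
}: _*)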

Related

How to use Lucene's DistinctValuesCollector?

My objective is to collect the distinct values of selected fields to provide them as filter options for the frontend. DistinctValuesCollector seems to be the tool for this; however, since I haven't found any code samples or documentation beyond the Javadocs, I currently can't construct this collector correctly. Can anyone provide an example?
This is my attempt, which doesn't deliver the desired distinct values of the field PROJEKTSTATUS.name.
val groupSelector = TermGroupSelector(PROJEKTSTATUS.name)
val searchGroup = SearchGroup<BytesRef>()
val valueSelector = TermGroupSelector(PROJEKTSTATUS.name)
val groups = mutableListOf(searchGroup)
val distinctValuesCollector = DistinctValuesCollector(groupSelector, groups, valueSelector)
That field is indexed as follows:
document.add(TextField(PROJEKTSTATUS.name, aggregat.projektstatus, YES))
document.add(SortedDocValuesField(PROJEKTSTATUS.name, BytesRef(aggregat.projektstatus)))
Thanks to @andrewJames's hint about a test class, I could figure it out:
fun IndexSearcher.collectFilterOptions(query: Query, field: String, topNGroups: Int = 128, mapper: Function<String?, String?> = Function { it }): Set<String?> {
    val firstPassGroupingCollector = FirstPassGroupingCollector(TermGroupSelector(field), Sort(), topNGroups)
    search(query, firstPassGroupingCollector)
    val topGroups = firstPassGroupingCollector.getTopGroups(0)
    val groupSelector = firstPassGroupingCollector.groupSelector
    val distinctValuesCollector = DistinctValuesCollector(groupSelector, topGroups, groupSelector)
    search(query, distinctValuesCollector)
    return distinctValuesCollector.groups.map { mapper.apply(it.groupValue.utf8ToString()) }.toSet()
}

Optimizing a query for a Spark DataFrame

I'm trying to create a DataFrame from a Hive table, but I don't know the Spark API well.
I need help optimizing the query in the method getLastSession so that the two Spark jobs are combined into one:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if (onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)

def getLastSession: Dataset[Row] = {
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  +----------+----------------+------------------+-------+
  */
  dfByLastSession
}
PS. My Source Table (for example):
+-------------+----------+-----------+------------------+-------+
| name_process|id_session| time_write|               key|  value|
+-------------+----------+-----------+------------------+-------+
|   OtherClass|jsdfsadfsf|43434883477|schema0.table0.csv|Success|
|   OtherClass|jksdfkjhka|23212123323|schema1.table1.csv|Success|
|   OtherClass|alskdfksjd|23343212234|schema2.table2.csv|Failure|
|ExternalClass|sdfjkhsdfd|34455453434|schema3.table3.csv|Success|
+-------------+----------+-----------+------------------+-------+
You can use row_number with Window like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn=1").drop("rn")
dfByLastSession.show()
However, since the window does not partition by any field, all rows are shuffled to a single partition, which can degrade performance.
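If what you actually need is the last session per group rather than one global last session, partitioning the window avoids that single-partition shuffle; a hedged sketch assuming name_process (from the sample table) is a meaningful grouping key:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

// Last session per name_process instead of one global last session
val dfByLastSessionPerProcess = df.withColumn(
  "rn",
  row_number().over(Window.partitionBy("name_process").orderBy(desc("time_write")))
).filter("rn=1").drop("rn")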
Another thing you can change in your code is using struct ordering to get the id_session associated with the most recent time_write in a single query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)

VarcharType mismatch Spark dataframe

I'm trying to change the schema of a DataFrame. Every time I have a column of string type, I want to change its type to VarcharType(max), where max is the maximum length of the strings in that column. I wrote the following code. (I want to export the DataFrame to SQL Server later and I don't want NVARCHAR in SQL Server, so I'm trying to limit it on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l: List[StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
  if (x.dataType == StringType) {
    val dataColName = x.name
    val maxLength = df.select(dataColName).reduce((x, y) => {
      if (x.getString(0).length >= y.getString(0).length) {
        x
      } else {
        y
      }
    }).getString(0).length
    val dataType = VarcharType(maxLength)
    l = l :+ StructField(dataColName, dataType)
  } else {
    l = l :+ x
  }
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a DataFrame column be of type VarcharType(n)?
The data type mapping between a database and a DataFrame happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from this and override getJDBCType to influence the data type mapping from a DataFrame to a table. Then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server); however, it can be done similarly.
// Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
  case StringType => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
  case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
  case _ => None
}
You can't use VarcharType because it is not a supported DataType. Also, you can't check the length of the actual data because it is not exposed; you only have access to dt: DataType, so you can set a default size for NVARCHAR if MAX is not acceptable.
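For illustration only, a minimal sketch of defining and registering such a dialect; the object name and the VARCHAR(4000) fallback are assumptions, and it extends JdbcDialect directly since MsSqlServerDialect may not be extensible from user code:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical custom dialect influencing the DataFrame-to-table type mapping
object MyMsSqlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    // Assumed fallback size; the real per-column maximum is not visible here
    case StringType => Some(JdbcType("VARCHAR(4000)", java.sql.Types.VARCHAR))
    case _ => None
  }
}

// Register before writing the DataFrame over JDBC
JdbcDialects.registerDialect(MyMsSqlDialect)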

Is it possible to parameterize queries or parameters for an Acolyte ScalaCompositeHandler?

Background:
I have attempted to accomplish what is described in the question linked here, and I have not been able to succeed. Acolyte requires you to define the queries and parameters you want to handle within a match expression, and the values used in match expressions must be known at compile time. (Note, however, that this StackOverflow answer appears to provide a way around that limitation.)
If this is indeed not possible, the inability to dynamically define the parameters and queries for Acolyte would be, for my use case, a severe limitation of the framework. I suspect this would be a limitation for others as well.
One SO user who has advocated for the use of Acolyte across a handful of questions stated in this comment that it is possible to dynamically define queries and their responses. So, I have opened this question as an invitation for someone to show that to be the case.
Question:
Using Acolyte, I want to be able to encapsulate the logic for matching queries and generating their responses. This is a desired feature because I want to keep my code DRY. In other words, I am looking for something like the following pseudo-code:
def generateHandler(query: String, accountId: Int, parameters: Seq[String]): ScalaCompositeHandler =
  AcolyteDSL.handleQuery {
    parameters.foreach(p =>
      // Tell the handler to handle this specific parameter
      case acolyte.jdbc.QueryExecution(query, ExecutedParameter(accountId) :: ExecutedParameter(p) :: Nil) =>
        someResultFunction(p)
    )
  }
Is this possible in Acolyte? If so, please provide an example.
It is indeed possible to parameterize queries and/or parameters by utilizing pattern matching.
See the code below for an example:
import java.sql.DriverManager

import acolyte.jdbc._
import acolyte.jdbc.Implicits._
import org.scalatest.FunSpec

class AcolyteTest extends FunSpec {

  describe("Using pattern matching to extract a query parameter") {

    it("should extract the parameter and make it usable for dynamic result returning") {
      val query = "SELECT someresult FROM someDB WHERE id = ?"
      val rows = RowLists.rowList1(classOf[String] -> "someresult")

      val handlerName = "testOneHandler"
      val handler = AcolyteDSL.handleQuery {
        case acolyte.jdbc.QueryExecution(`query`, ExecutedParameter(id) :: _) =>
          rows.append(id.toString)
      }
      Driver.register(handlerName, handler)

      val connection = DriverManager.getConnection(s"jdbc:acolyte:anything-you-want?handler=$handlerName")
      val preparedStatement = connection.prepareStatement(query)
      preparedStatement.setString(1, "hello world")

      val resultSet = preparedStatement.executeQuery()
      resultSet.next()

      assertResult(resultSet.getString(1))("hello world")
    }

    it("should support a slightly more complex example") {
      val firstResult = "The first result"
      val secondResult = "The second result"

      val query = "SELECT someresult FROM someDB WHERE id = ?"
      val rows = RowLists.rowList1(classOf[String] -> "someresult")

      val results: Map[String, RowList1.Impl[String]] = Map(
        "one" -> rows.append(firstResult),
        "two" -> rows.append(secondResult)
      )

      def getResult(parameter: String): QueryResult = {
        results.get(parameter) match {
          case Some(row) => row.asResult()
          case _ => acolyte.jdbc.QueryResult.Nil
        }
      }

      val handlerName = "testTwoHandler"
      val handler = AcolyteDSL.handleQuery {
        case acolyte.jdbc.QueryExecution(`query`, ExecutedParameter(id) :: _) =>
          getResult(id.toString)
      }
      Driver.register(handlerName, handler)

      val connection = DriverManager.getConnection(s"jdbc:acolyte:anything-you-want?handler=$handlerName")
      val preparedStatement = connection.prepareStatement(query)

      preparedStatement.setString(1, "one")
      val resultSetOne = preparedStatement.executeQuery()
      resultSetOne.next()
      assertResult(resultSetOne.getString(1))(firstResult)

      preparedStatement.setString(1, "two")
      val resultSetTwo = preparedStatement.executeQuery()
      resultSetTwo.next()
      assertResult(resultSetTwo.getString(1))(secondResult)
    }
  }
}

Spark sql Dataframe - import sqlContext.implicits._

I have a main that creates a Spark context:
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
It then creates a DataFrame and runs filters and validations on it.
val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

val df = sqlContext.read.schema(struct).format("com.databricks.spark.csv").load(args(0))
  // record length cannot be < 2
  .na.drop(3)
  // round to hours
  .withColumn("time", convertToHourly($"time"))
This works great.
BUT when I try moving my validations to another file by passing the DataFrame to
function ValidateAndTransform(df: DataFrame): DataFrame = {...}
which receives the DataFrame and does the validations and transformations, it seems like I need
import sqlContext.implicits._
to avoid the error "value $ is not a member of StringContext"
that happens on the line:
.withColumn("time", convertToHourly($"time"))
But to use import sqlContext.implicits._
I also need the sqlContext, either defined in the new file like so:
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
or passed to the
ValidateAndTransform(df: DataFrame): DataFrame = {...} function.
I feel like the separation I'm trying to do into two files (main & validation) is not done correctly...
Any idea how to design this? Or should I simply pass the sqlContext to the function?
Thanks!
You can work with a singleton instance of the SQLContext. You can take a look at this example in the Spark repository:
/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {

  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
...
// And wherever you want, you can do
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
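For example, a minimal sketch (the Validation object name and the filter body are illustrative assumptions) of how the separate validation file could obtain the implicits through that singleton:
import org.apache.spark.sql.DataFrame

object Validation {
  def validateAndTransform(df: DataFrame): DataFrame = {
    // Reuse the lazily created SQLContext instead of constructing a new one
    val sqlContext = SQLContextSingleton.getInstance(df.rdd.sparkContext)
    import sqlContext.implicits._
    // $"..." now compiles in this file too; the actual validations go here
    df.filter($"time".isNotNull)
  }
}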