Read parquet file having mixed data type in a column - apache-spark-sql

I want to read a parquet file using Spark SQL in which one column has a mixed data type (string and integer).
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")
This throws the exception: Failed to merge incompatible data types IntegerType and StringType
Is there a way to explicitly cast the column during the read?

The only way that I have found is to manually cast one of the fields so that they match. You can do this by reading the individual parquet files into a sequence and iteratively modifying them, as shown below:
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]) = {
  dfs.reduce { (x, y) =>
    // compare (name, dataType) pairs to find columns present in y but not in x
    def schemaTruncate(df: DataFrame) = df.schema.map(field => field.name -> field.dataType)
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      // try casting the existing column; if that fails, add the column as a typed null literal
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(error) => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}
The above function first finds the columns (by name and type) that are in Y but not in X. It then adds those columns to X, attempting to cast any existing column and, on failure, adding the column as a typed null literal. Finally it selects only Y's columns from the fixed X, in case X has columns that Y does not, and returns the union.
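For completeness, a minimal sketch of how you might call it, with hypothetical paths standing in for your individual parquet files:
// hypothetical example: read each part separately, then merge with unionReduce
val paths = Seq("/tmp/data/part-0", "/tmp/data/part-1")
val dfs = paths.map(p => sqlContext.read.parquet(p))
val merged = unionReduce(dfs)
merged.printSchema()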

Related

How to read Key Value pair in spark SQL?

How do I get this output using Spark SQL or Scala? I have a table with columns storing such values and need to split them into separate columns.
Input: a column containing JSON strings such as {"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
Output: separate columns Name, UploaddedById and UploadedByName holding the corresponding values
It pretty much depends on which libraries you want to use (as you mentioned, plain Scala or Spark).
Using Spark:
import spark.implicits._ // needed for .toDS on a local Seq

val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""
spark.read.json(Seq(rawJson).toDS)
Using common JSON libraries:
// play
import play.api.libs.json._

Json.parse(rawJson) match {
  case obj: JsObject =>
    val values = obj.values
    val keys = obj.keys
    // construct a dataframe having the keys and values
  case other => // handle other types (like JsArray, etc.)
}
// circe
import io.circe._, io.circe.parser._

parse(rawJson) match {
  case Right(json) => // fetch key values, construct df, much like above
  case Left(parseError) => ...
}
You can use almost any JSON library to parse your JSON object and then convert it to a Spark DataFrame quite easily.
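For example, with the Spark approach above the parsed keys already become columns, so splitting into separate columns is just a select. A minimal sketch, assuming an active SparkSession named spark and the rawJson value defined earlier:
import spark.implicits._

// each JSON key becomes a column in the parsed dataframe
val parsed = spark.read.json(Seq(rawJson).toDS)
parsed.select("Name", "UploaddedById", "UploadedByName").show()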

Scala Spark: Parse SQL string to Column

I have two functions, foo and bar, that I want to write as follows:
def foo(df: DataFrame, conditionString: String) = {
  val conditionColumn: Column = something(conditionString) // help me define "something"
  bar(df, conditionColumn)
}
def bar(df: DataFrame, conditionColumn: Column) = {
  df.where(conditionColumn)
}
Where conditionString is a SQL string like "person.age >= 18 AND person.citizen == true" or something.
For various reasons, I don't want to change the type signatures here. I feel this should work because, if I could change the type signatures, I could just write:
def foobar(df: DataFrame, conditionString: String) = {
  df.where(conditionString)
}
since .where is happy to accept a SQL string expression.
So, how can I turn a string representing a column expression into a Column? If the expression were just the name of a single column in df I could just do col(colName), but that doesn't seem to take the range of expressions that .where does.
If you need more context for why I'm doing this: I'm working on a Databricks notebook that can only accept string arguments (and needs to take a condition as an argument), which calls a library that I want to take column-typed arguments.
You can use functions.expr:
def expr(expr: String): Column
Parses the expression string into the column that it represents
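A minimal sketch of how expr slots into the original functions (the bodies below mirror the question's code):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

def foo(df: DataFrame, conditionString: String) = {
  // expr parses the SQL expression string into a Column
  val conditionColumn: Column = expr(conditionString)
  bar(df, conditionColumn)
}

def bar(df: DataFrame, conditionColumn: Column) = {
  df.where(conditionColumn)
}

// e.g. foo(df, "person.age >= 18 AND person.citizen == true")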

Extend groupBy to include multiple aggregations

I implemented a groupBy function which successfully groups columns based on a particular aggregation. The issue is that I am passing the chosen columns and aggregations as a Map[String,String], which means multiple aggregations cannot be performed on one column, for example sum, mean and max all on one column.
Below is what works so far:
groupByFunction(input, Map("someSignal" -> "mean"))

def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Map[String, String],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
Upon looking into it a bit more, the agg function can take a list of columns, like below:
userData
  .groupBy(
    window(
      (col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
      timeWindowInS.toString.concat(" seconds")
    ),
    col(sessionColumnName),
    col(signalColumnName)
  ).agg(
    mean("physicalSignalValue"),
    sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that; below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }
val result = signalIdColumn.map(tuple =>
  if (tuple._2 == "mean")
    mean(tuple._1)
  else if (tuple._2 == "sum")
    sum(tuple._1)
  else if (tuple._2 == "max")
    max(tuple._1))
Now I have a list of columns, which is still a problem for the agg function.
I was able to solve it by using a sequence of tuples, Seq[(String, String)], instead of Map[String, String]:
def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Seq[(String, String)],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(
      col(someSession),
      col(someSignal)
    ).agg(columnsWithOperation)
}
and then, with the information from the post below:
https://stackoverflow.com/a/34955432/2091294
userData
  .groupBy(
    col(someSession),
    col(someSignal)
  ).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
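Putting it together, a minimal sketch of the final function using the vararg overload of agg (same hypothetical column names as above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def groupByFunction(dataframeDummy: DataFrame,
                    columnsWithOperation: Seq[(String, String)],
                    someSession: String = "sessionId",
                    someSignal: String = "signalName"): DataFrame = {
  dataframeDummy
    .groupBy(col(someSession), col(someSignal))
    // the (String, String) vararg overload of agg allows repeating the same column
    .agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
}

// sum, mean and max all on one column:
// groupByFunction(input, Seq("someSignal" -> "sum", "someSignal" -> "mean", "someSignal" -> "max"))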

VarcharType mismatch Spark dataframe

I'm trying to change the schema of a dataframe. Every time I have a column of string type I want to change its type to VarcharType(max), where max is the maximum length of a string in that column. I wrote the following code. (I want to export the dataframe to SQL Server later and I don't want nvarchar in SQL Server, so I'm trying to limit it on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l: List[StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
  if (x.dataType == StringType) {
    val dataColName = x.name
    val maxLength = df.select(dataColName).reduce((x, y) => {
      if (x.getString(0).length >= y.getString(0).length) {
        x
      } else {
        y
      }
    }).getString(0).length
    val dataType = VarcharType(maxLength)
    l = l :+ StructField(dataColName, dataType)
  } else {
    l = l :+ x
  }
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a dataframe column be of type VarcharType(n)?
The data type mapping between a database and a dataframe happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from this and override getJDBCType to influence the data type mapping from a dataframe to a table. Then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server), but it can be done similarly.
// Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
  case StringType => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
  case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
  case _ => None
}
You can't use VarcharType because it is not a DataType that dataframes support (hence the MatchError). Also, you can't check the length of the actual data because it is not exposed; you only have access to dt: DataType, so you can set a default size for NVARCHAR if MAX is not acceptable.
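For SQL Server that could look roughly like the sketch below; the dialect object name and the NVARCHAR(255) default size are assumptions, not part of Spark's built-in MsSqlServerDialect:
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// hypothetical dialect mapping StringType to a bounded NVARCHAR instead of NVARCHAR(MAX)
object BoundedVarcharSqlServerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("NVARCHAR(255)", Types.NVARCHAR)) // assumed default size
    case _          => None // fall back to Spark's default mapping
  }
}

// register the dialect before writing the dataframe over JDBC
JdbcDialects.registerDialect(BoundedVarcharSqlServerDialect)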

Spark Dataframe size check on columns does not work as expected using vararg and if else - Scala

I do not want to use foldLeft or withColumn with when over all columns in a dataframe, but want a select as per https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, embellished with an if/else statement and cols as a vararg. All I want is to replace an empty array column in a Spark dataframe using Scala. I am using size but it never evaluates the zero (0) check correctly.
val resDF2 = aggDF.select(cols.map { col =>
  ( if (size(aggDF(col)) == 0) lit(null) else aggDF(col) ).as(s"$col")
}: _*)
if (size(aggDF(col)) == 0) lit(null) does not work here functionally, but it does run, and size(aggDF(col)) returns the correct length if I return that instead.
I am wondering what the silly issue is. Must be something I am obviously overlooking!
if/else won't work with the DataFrame API: if and else are plain Scala control flow evaluated on the driver, and size(aggDF(col)) == 0 compares a Column object with an Int using Scala's ==, which is never true, so the else branch is always taken. With DataFrames you need when/otherwise:
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) === 0, lit(null)).otherwise(aggDF(col)).as(s"$col")
}: _*)
This can further be simplified because when without otherwise automatically returns null (i.e. otherwise(lit(null)) is the default):
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) > 0, aggDF(col)).as(s"$col")
}: _*)
See also https://stackoverflow.com/a/48074218/1138523
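A quick way to sanity-check the behaviour with made-up data (the column names a and b and the sample values are assumptions, and spark is an active SparkSession):
import spark.implicits._
import org.apache.spark.sql.functions.{size, when}

// one empty array column and one non-empty array column
val aggDF = Seq((Seq.empty[Int], Seq(1, 2))).toDF("a", "b")
val cols = aggDF.columns.toSeq

val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) > 0, aggDF(col)).as(s"$col")
}: _*)

resDF2.show() // a becomes null, b keeps [1, 2]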