How to use SQL to query a CSV file with Scala?

I am new to Spark and Scala, and I am trying to run a SQL query on a CSV file and return the records. Below is what I have, but it is not working:
val file = sc.textFile("file:///data/home_data.csv")
val records = file.sqlContext("SELECT id FROM home_data WHERE yr_built < 1979")
combined.collect().foreach(records)
I get errors with the file.sqlContext function.
Thanks

You can use a case class to map the data to field names and data types, and then run your query:
case class Person(first_name: String, last_name: String, age: Int)
val pmap = p.map(line => line.split(","))
val personRDD = pmap.map(p => Person(p(0), p(1), p(2).toInt))
val personDF = personRDD.toDF()
Then query personDF. I don't know your schema, so I formulated it this way.
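For reference, here is a minimal end-to-end sketch of that idea applied to the original question. The column names and types (id, price, yr_built) are assumptions, since the real home_data.csv schema isn't shown; it registers the DataFrame as a temporary table so the SQL from the question can run unchanged:
import org.apache.spark.sql.SQLContext

case class Home(id: String, price: Double, yr_built: Int) // assumed schema

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val file = sc.textFile("file:///data/home_data.csv")
val homeDF = file
  .map(_.split(","))                                  // note: filter out a header line first if the file has one
  .map(h => Home(h(0), h(1).toDouble, h(2).toInt))
  .toDF()

homeDF.registerTempTable("home_data")                 // createOrReplaceTempView in Spark 2.x
val records = sqlContext.sql("SELECT id FROM home_data WHERE yr_built < 1979")
records.collect().foreach(println)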

Related

Optimizing a DataFrame query in Spark

I'm trying to create a DataFrame from a Hive table, but I don't know the Spark API well.
I need help optimizing the query in the getLastSession method so that Spark runs one job instead of two:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if(onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)
def getLastSession: Dataset[Row] = {
val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
val dfByLastSession = df.filter(col("id_session") === lastSession)
dfByLastSession.show()
/*
+----------+----------------+------------------+-------+
|id_session| time_write| key| value|
+----------+----------------+------------------+-------+
|alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
*/
dfByLastSession
}
PS. My Source Table (for example):
+---------------+------------+-------------+--------------------+---------+
| name_process  | id_session | time_write  | key                | value   |
+---------------+------------+-------------+--------------------+---------+
| OtherClass    | jsdfsadfsf | 43434883477 | schema0.table0.csv | Success |
| OtherClass    | jksdfkjhka | 23212123323 | schema1.table1.csv | Success |
| OtherClass    | alskdfksjd | 23343212234 | schema2.table2.csv | Failure |
| ExternalClass | sdfjkhsdfd | 34455453434 | schema3.table3.csv | Success |
+---------------+------------+-------------+--------------------+---------+
You can use row_number with Window like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn=1").drop("rn")
dfByLastSession.show()
However, since the window is not partitioned by any column, all rows go through a single partition, which can degrade performance.
Another thing you can change in your code is to use struct ordering to get the id_session associated with the most recent time_write in a single query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)
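Putting that together, here is a sketch of the whole method collapsed into a single driver-side lookup plus the filter, based on the snippets above:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, max, struct}

def getLastSession: Dataset[Row] = {
  // One aggregation finds the id_session with the most recent time_write,
  // because the struct is compared by its first field (time_write) first.
  val lastSession = df
    .select(max(struct(col("time_write"), col("id_session")))("id_session"))
    .first.getString(0)

  df.filter(col("id_session") === lastSession)
}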

How to write a One to Many Query with Scala Slick which returns something like this `(Model1, Option[Seq[Model2]])`

I know this question has been asked before, but I can't figure it out.
In my data model I have a news model containing an arbitrary number of images.
case class NewsDataModel(
newsId: Option[Long],
name: String,
description: String,
author: String,
creationDateTime: Option[OffsetDateTime]
)
case class Image(
id: Long,
url: String,
newsId: Long
)
Now I want to query my DB to get something like this: (NewsDataModel, Option[Seq[Image]])
My query is currently implemented as follows:
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
db.run(q)
This evaluates to Future[Seq[(NewsDataModel, Option[Image])]]. I guess the right way to solve this would be to use the groupBy function, but I don't know how to implement it, since this
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId)
.groupBy(_._1.newsId)
.result
db.run(q)
evaluates to Future[Seq[(Option[Long], Query[(NewsTable, Rep[Option[ImagesTable]]), (NewsDataModel, Option[Image]), Seq])]]
Slick won't automatically create that data structure for you. (I find it helpful to think of Slick in terms of rows and tables and what you can do in portable SQL, and not in terms of "object-relational mappers" or similar).
What you'll want to do is convert the rows into the format you want in Scala, after the database layer. There are many ways you can do that.
Here's one way to do that.
Given this example data...
scala> case class NewsDataModel(newsId: Long)
class NewsDataModel
scala> case class Image(id: Long)
class Image
scala> val results = Seq(
| ( NewsDataModel(1L), Some(Image(1L)) ),
| ( NewsDataModel(1L), Some(Image(10L)) ),
| ( NewsDataModel(1L), None ),
| ( NewsDataModel(2L), None ),
| ( NewsDataModel(3L), Some(Image(3L)) ),
| )
|
val results: Seq[(NewsDataModel, Option[Image])] = List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None), (NewsDataModel(2),None), (NewsDataModel(3),Some(Image(3))))
We can group by the key:
scala> val groups = results.groupBy { case (key, values) => key }
val groups: scala.collection.immutable.Map[NewsDataModel,Seq[(NewsDataModel, Option[Image])]] = HashMap(NewsDataModel(3) -> List((NewsDataModel(3),Some(Image(3)))), NewsDataModel(1) -> List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None)), NewsDataModel(2) -> List((NewsDataModel(2),None)))
And convert that into something like the type you want:
scala> val flat = groups.map { case (key, seq) => key -> seq.flatMap(_._2) }
val flat: scala.collection.immutable.Map[NewsDataModel,Seq[Image]] = HashMap(NewsDataModel(3) -> List(Image(3)), NewsDataModel(1) -> List(Image(1), Image(10)), NewsDataModel(2) -> List())
That flat result is a map, but you can turn it into (for example) a List with the type signature (close to) the type you want:
scala> flat.toList
val res18: List[(NewsDataModel, Seq[Image])] = List((NewsDataModel(3),List(Image(3))), (NewsDataModel(1),List(Image(1), Image(10))), (NewsDataModel(2),List()))
You can find lots of ways to do that, but the point is you're doing it in Scala, not Slick (SQL). Note, in particular, that the groupBy method I've used is the Scala one from the collections library, not the Slick one (which would produce a SQL GROUP BY clause). That is, I'm modifying the result of running the query, not modifying the query itself.
I'd suggest putting whatever conversion you want into a method and then applying it to the Slick action. For example:
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
??? // your implementation here
val action = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
val convertedAction = action.map(convert)
db.run(convertedAction)
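For completeness, here is one possible convert implementation, following the same groupBy/flatMap steps shown above (note that action.map requires an implicit ExecutionContext in scope):
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
  input
    .groupBy { case (news, _) => news }                        // group rows by the news record
    .map { case (news, rows) => news -> rows.flatMap(_._2) }   // keep only the defined images
    .toList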

VarcharType mismatch in a Spark DataFrame

I'm trying to change the schema of a DataFrame. Every time I have a column of string type, I want to change its type to VarcharType(max), where max is the maximum length of a string in that column. I wrote the following code. (I want to export the DataFrame to SQL Server later, and I don't want nvarchar in SQL Server, so I'm trying to limit the length on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l : List [StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
if (x.dataType == StringType) {
val dataColName = x.name
val maxLength = df.select(dataColName).reduce((x, y) => {
if (x.getString(0).length >= y.getString(0).length) {
x
} else {
y
}
}).getString(0).length
val dataType = VarcharType(maxLength)
l = l :+ StructField(dataColName, dataType)
} else {
l = l :+ x
}
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a DataFrame column be of type VarcharType(n)?
The data type mapping between a DataFrame and the database happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from this and override getJDBCType to influence the data type mapping from a DataFrame to a table, then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server), but it can be done similarly.
//Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
case StringType => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
case _ => None
}
You can't use VarcharType in the DataFrame schema because it is not a supported DataType there. You also can't check the length of the actual data, because it is not exposed to the dialect; you only have access to dt: DataType, so you can set a default size for NVARCHAR if MAX is not acceptable.
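As a sketch of the registration step: since the built-in MsSqlServerDialect may not be extensible from user code, one option is to extend JdbcDialect directly. The VARCHAR(4000) default below is an assumption to adapt to your data:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object CustomMsSqlDialect extends JdbcDialect {
  // Apply this dialect to SQL Server JDBC URLs.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType    => Some(JdbcType("VARCHAR(4000)", java.sql.Types.VARCHAR)) // assumed default length
    case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
    case BooleanType   => Some(JdbcType("BIT", java.sql.Types.BIT))
    case _             => None
  }
}

// Register the dialect before writing the DataFrame over JDBC.
JdbcDialects.registerDialect(CustomMsSqlDialect)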

Programmatically adding several columns to Spark DataFrame

I'm using Spark with Scala.
I have a DataFrame with 3 columns: ID, Time, RawHexdata.
I have a user-defined function which takes RawHexdata and expands it into X more columns. It is important to state that X is the same for every row (the columns do not vary). However, before I receive the first data, I do not know what the columns are; once I have the first row, I can deduce them.
I would like a second DataFrame with said columns: ID, Time, RawHexdata, NewCol1, ..., NewCol3.
The "Easiest" method I can think of to do this is:
1. serialize each row into JSON (every data type is serializable here),
2. add my new columns,
3. deserialize a new DataFrame from the altered JSON.
However, that seems like a waste, as it involves two costly and redundant JSON serialization steps. I am looking for a cleaner pattern.
Using case classes seems like a bad idea, because I don't know the number of columns or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert a RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object App {
case class Person(name: String, age: Int)
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
val dataFrame = input.toDF()
dataFrame.show()
// create the extended rows RDD
val rowRDD = dataFrame.rdd.map{
row =>
val blob = row(1).asInstanceOf[Int]
val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
Row.fromSeq(row.toSeq.init ++ newColumns)
}
val schema = dataFrame.schema
// we know that the new columns are all integers
val newColumns = StructType{
Seq(new StructField("1", IntegerType), new StructField("2", IntegerType), new StructField("3", IntegerType))
}
val newSchema = StructType(schema.init ++ newColumns)
val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
newDataFrame.show()
}
}
SELECT is your friend for solving this without going back to the RDD.
import org.apache.spark.sql.functions.{col, lit}

case class Entry(Id: String, Time: Long)
val entries = Seq(
Entry("x1", 100L),
Entry("x2", 200L)
)
val newColumns = Seq("NC1", "NC2", "NC3")
val df = spark.createDataFrame(entries)
.select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
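If the new values should actually be computed from RawHexdata rather than left as nulls, the same stay-in-DataFrame idea can be combined with a UDF that returns an array. This is only a hedged sketch, assuming a DataFrame rawDf with the question's ID, Time and RawHexdata columns; decode and its length-based example values are hypothetical placeholders for your real parsing logic:
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical decoder: expands one RawHexdata value into a sequence of new values.
val decode = udf { raw: String => Seq(raw.length, raw.length * 2, raw.length * 3) }

// Number of new columns, deduced (for example) by decoding one RawHexdata value on the driver.
val x = 3

val expanded = (0 until x).foldLeft(rawDf.withColumn("decoded", decode(col("RawHexdata")))) {
  case (acc, i) => acc.withColumn(s"NewCol${i + 1}", col("decoded")(i))  // pull each array element into its own column
}.drop("decoded")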

String concatenation in a Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select("1"+"s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know I have the concat function available, but it seems Spark SQL doesn't support this feature. Can somebody suggest a workaround for my issue?
Thank you.
Note:
I'm using language-integrated queries, but I could just use a "standard" Spark SQL query if that leads to a solution.
The output you add at the end does not seem to be part of your selection or your SQL logic, if I understand correctly. Why don't you format the output as a further step?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => ("1" + t(0), t(1))).collect()
It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
override type EvaluatedType = String
override def foldable: Boolean = children.forall(_.foldable)
def nullable: Boolean = children.exists(_.nullable)
def dataType: DataType = StringType
def eval(input: Row = null): EvaluatedType = {
children.map(_.eval(input)).mkString
}
}
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select(Concat("1", "s.codeA".attr), "e.name".attr)
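As a side note, and only if you can move to a later Spark release than the question targets (an assumption about your environment): concat and lit are available directly in org.apache.spark.sql.functions, so the same result can be expressed with the DataFrame API without a custom expression. A sketch, assuming sim and event are DataFrames:
import org.apache.spark.sql.functions.{col, concat, lit}

val result = sim.join(event, Seq("codeA"))
  .select(concat(lit("1"), col("codeA")).as("codeA"), col("name"))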