Output is empty - dataframe

I am new to Scala. I have written some sample code that reads from a CSV file, based on references from a few websites, and I am executing it in Databricks. Here is the sample:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import java.io.FileNotFoundException
import java.io.IOException
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object Sample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Read_Events").setMaster("yarn-cluster")
    val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

    // Creating the schema
    val schema = StructType(List(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("value", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = true)))

    try {
      val myDF = spark.read.schema(schema).option("header", "true").option("delimiter", ",").csv("dbfs/FileStore/shared_uploads/tru.csv")
      myDF.show()
    }
    catch {
      case ex: FileNotFoundException =>
        println("Input file not available in path")
      case ex: IOException =>
        println("IO Exception")
    }
  }
}
It seems that myDF.show() is not producing any output.
I seem to be doing something wrong. Any guidance would be great!

In this code, you have created an object called Sample, inside which you have the code for reading the CSV.
When I ran the same code, I saw the same behavior: myDF.show() did not give any output either.
Here, we defined only the Sample object. We need to run the main() function in this Sample object to get the output. You can use the following code:
%scala
Sample.main(Array())
NOTE: Alternatively, you can write the contents of the main() function directly in a Scala cell to get the output without defining an object at all.
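For example, here is a minimal sketch of that second approach. It assumes the notebook's built-in spark session; the dbfs:/ prefix on the path is also an assumption about where the uploaded file lives, so adjust it to the actual location of tru.csv:
%scala
import org.apache.spark.sql.types._

// Databricks notebooks already provide a SparkSession named `spark`,
// so there is no need to build one or wrap the code in an object.
val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = true)))

val myDF = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", ",")
  .csv("dbfs:/FileStore/shared_uploads/tru.csv")

myDF.show()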

Related

How to set all fields to null in scala dataframe except few fields

I have a use case where I need to set all fields in a file to null except for a few, and write the updated dataframe back to the same file.
I am trying the code below, but it gives the error:
Exception in User Class: org.apache.spark.sql.catalyst.parser.ParseException :
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.lit

val excluded_list = List("date", "id")
var transformed_df = source_df
for (t <- source_df.dtypes) {
  println(t._1, t._2)
  if (!excluded_list.contains(t._1)) {
    print(t._1)
    transformed_df = transformed_df.withColumn(t._1, lit(null).cast(t._2))
  }
}
Although this works:
val transformed_df = hudi_dataframe.withColumn("amount", lit(null).cast(IntegerType))
Could someone please help me with this? Is there any other way to achieve this?
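One likely cause (not part of the original post, just a sketch of a possible fix): DataFrame.dtypes returns the type as Spark's internal name, e.g. "StringType", and Column.cast(String) tries to parse that string as a SQL type, which raises the ParseException. Iterating over the schema instead gives a DataType object that cast accepts directly:
import org.apache.spark.sql.functions.lit

// Sketch only: source_df and excluded_list are the ones from the question.
// field.dataType is a DataType object, so no type-name string has to be parsed.
val transformed_df = source_df.schema.fields.foldLeft(source_df) { (df, field) =>
  if (excluded_list.contains(field.name)) df
  else df.withColumn(field.name, lit(null).cast(field.dataType))
}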

How to deserialize JSON from Kafka Consumer Record

I'm looking to access some fields on a Kafka consumer record. I'm able to receive the event data, which is a Java object, i.e. ConsumerRecord(topic = test.topic, partition = 0, leaderEpoch = 0, offset = 0, CreateTime = 1660933724665, serialized key size = 32, serialized value size = 394, headers = RecordHeaders(headers = [], isReadOnly = false), key = db166cbf1e9e438ab4eae15093f89c34, value = {"eventInfo":...}).
I'm able to access the eventInfo values, which come back as a JSON string. I'm fairly new to Kotlin and Kafka, so I'm not entirely sure this is correct, but I'm basically looking to access the fields in value. However, I can't get rid of the error that appears when I try to use mapper.readValue, which is:
None of the following functions can be called with the arguments supplied.
import com.afterpay.shop.favorites.model.Product
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import org.apache.avro.generic.GenericData.Record
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.kafka.support.Acknowledgment
import org.springframework.stereotype.Component

@Component
class KafkaConsumer {
    @KafkaListener(topics = ["test.topic"], groupId = "group-id")
    fun consume(consumerRecord: ConsumerRecord<String, Any>, ack: Acknowledgment) {
        val mapper = jacksonObjectMapper()
        val value = consumerRecord.value()
        val record = mapper.readValue(value, Product::class.java)
        println(value)
        ack.acknowledge()
    }
}
Is this the correct way to accomplish this?
First, change ConsumerRecord<String, Any> to ConsumerRecord<String, Product>, then change value.deserializer in your consumer config/factory to use Spring Kafka's JsonDeserializer.
Your consumerRecord.value() will then already be a Product instance, and you won't need an ObjectMapper at all.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#json-serde
Otherwise, if you use StringDeserializer, change Any to String so that the mapper.readValue argument types are correct.

Extract words from a string in spark hadoop with scala

I was using the code below to extract the strings I needed in Spark SQL. Now that I am working with more data in Spark on Hadoop and want to extract the same strings, I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

// User-defined function to extract the matching words
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used: your DataFrame isn't really distributed at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...). That way, each batch of words will be assigned to a different partition, and potentially to different JVMs and/or executors (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to a single partition, so there is no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Hope it helps :-)
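As a rough, self-contained sketch of that suggestion (assuming the list of single words proposed above and a local SparkSession; note that rlike keeps the rows whose words column matches the pattern rather than extracting substrings):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("myapp").master("local[*]").getOrCreate()
import spark.implicits._

val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand")
val df = values.toDF("words")

// Keep only the rows that look like hashtags.
df.filter(col("words") rlike "#\\w+").show()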

Issue with toDF, Value toDF is not a member of org.apache.spark.rdd.RDD

I have attached a code snippet for the error "value toDF is not a member of org.apache.spark.rdd.RDD". I am using Scala 2.11.8 and Spark 2.0.0.
Can you please help me resolve this issue with the toDF() API?
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._

object HHService {
  case class Services(
    uhid: String,
    locationid: String,
    doctorid: String,
    billdate: String,
    servicename: String,
    servicequantity: String,
    starttime: String,
    endtime: String,
    servicetype: String,
    servicecategory: String,
    deptname: String
  )

  def toService = (p: Seq[String]) => Services(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10))

  def main(args: Array[String]) {
    val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
    val spark = SparkSession
      .builder
      .appName(getClass.getSimpleName)
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext
    import spark.implicits._
    import sqlContext.implicits._

    val hospitalDataText = sc.textFile("D:/Books/bboks/spark/Intellipaat/Download/SparkHH/SparkHH/services.csv")
    val header = hospitalDataText.first()
    val hospitalData = hospitalDataText.filter(a => a != header)
    //val HData = hospitalData.map(_.split(",")).map(p => Services(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10)))
    val HData = hospitalData.map(_.split(",")).map(toService(_))
    val hosService = HData.toDF()
  }
}
1] You need to get the sqlContext as below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
This solved my issue. Earlier, the snippet below was used to get the sqlContext:
val sqlContext = spark.sqlContext
(That approach only worked with the spark-shell.)
2]
The case class needs to be defined outside the method. This is also mentioned in most of the blogs.
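A minimal sketch of that layout, with a hypothetical Person case class and a local SparkSession (not the code from the question), just to show the case class sitting outside any method so the implicit encoders can be derived:
import org.apache.spark.sql.SparkSession

// Case class declared at the top level, outside of any method.
case class Person(name: String, age: Int)

object ToDFExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ToDFExample").master("local[*]").getOrCreate()
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes

    val rdd = spark.sparkContext.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val df = rdd.toDF()
    df.show()
  }
}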
I got the same issue using notebooks in Databricks when converting my code to simple functions. I had to declare the case class outside the function and then everything worked well:
%scala
case class className(param1: String,
                     param2: String,
                     ...
                     lastParam: Double)

def myFunction(params) = {
  // a lot of code
  ...
  var myVarBasedOnClassDefinition = Seq(className("init", "init", "init", 0.0, 0.0, "init", 0.0))
  for (iteration <- iterator) myVarBasedOnClassDefinition = myVarBasedOnClassDefinition ++ additionalSequence
  display(myVarBasedOnClassDefinition.toDF())
}
Hope this helps. The statement "the case class needs to be outside the method" didn't really seem to apply to my case at the beginning of my search, since everything had been working fine when the code was written in a procedural style.

Spark sql Dataframe - import sqlContext.implicits._

I have a main that creates the Spark context:
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Then it creates a dataframe and applies filters and validations to it.
val convertToHourly = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")
val df = sqlContext.read.schema(struct).format("com.databricks.spark.csv").load(args(0))
// record length cannot be < 2
.na.drop(3)
// round to hours
.withColumn("time",convertToHourly($"time"))
This works great.
BUT when I try moving my validations to another file by passing the dataframe to
function ValidateAndTransform(df: DataFrame): DataFrame = {...}
which receives the DataFrame and does the validations and transformations, it seems that I need
import sqlContext.implicits._
to avoid the error "value $ is not a member of StringContext",
which happens on the line:
.withColumn("time", convertToHourly($"time"))
But to use import sqlContext.implicits._
I also need the sqlContext, either defined in the new file like so:
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
or passed to the
ValidateAndTransform(df: DataFrame): DataFrame = {...}
function.
I feel like the separation into two files (main & validation) that I'm trying to achieve is not being done correctly...
Any idea on how to design this? Or should I simply pass the sqlContext to the function?
Thanks!
You can work with a singleton instance of the SQLContext. You can take a look at this example in the Spark repository:
/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
...
// And wherever you want, you can do:
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
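As a rough sketch of how the separate validation file could then look (the object name and the df.rdd.sparkContext accessor are assumptions, not code from the original answer):
import org.apache.spark.sql.DataFrame

object Validation {
  def ValidateAndTransform(df: DataFrame): DataFrame = {
    // Reuse the lazily created SQLContext instead of constructing a new one here.
    val sqlContext = SQLContextSingleton.getInstance(df.rdd.sparkContext)
    import sqlContext.implicits._ // makes the $"..." column syntax available in this scope

    // validations and transformations from the question, e.g.:
    df.na.drop(3)
  }
}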