How can I write a Spark DataFrame to ClickHouse?

val df = spark.read.parquet(path)
val IP = "190.176.35.145"
val port = "9000"
val table = "table1"
val user = "default"
val password = "default"
I don't know how to write df directly into ClickHouse, and I haven't been able to find any similar answers.

Writing to a ClickHouse database is similar to writing to any other database through JDBC; just make sure to import the ClickHouseDriver class in your code.
The username and password are passed in through the ckProperties object.
The write command is as follows (you can replace the database name in the JDBC URL):
import java.util.Properties
import ru.yandex.clickhouse._

val jdbcUrl = "jdbc:clickhouse://190.176.35.145:9000/your_database_name"
val ckProperties = new Properties()
ckProperties.put("user", user)
ckProperties.put("password", password)
df.write.mode("append").option("driver", "ru.yandex.clickhouse.ClickHouseDriver").jdbc(jdbcUrl, table = "table1", ckProperties)

Related

How to use a Dataframe, which is created from Dstream, outside of foreachRDD block?

I've been trying to work with Spark Streaming. My problem is that I want to use wordCountsDataFrame again outside of the foreachRDD block.
I want to conditionally join wordCountsDataFrame with another DataFrame that is created from a DStream. Is there any way to do that, or another approach?
Thanks.
My Scala code block is below.
val Seq(projectId, subscription) = args.toSeq
val sparkConf = new SparkConf().setAppName("PubsubWordCount")
val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
val credential = SparkGCPCredentials.builder.build()
val pubsubStream: ReceiverInputDStream[SparkPubsubMessage] =
  PubsubUtils.createStream(ssc, projectId, None, subscription, credential, StorageLevel.MEMORY_AND_DISK_SER_2)
val stream1 = pubsubStream.map(message => new String(message.getData()))

stream1.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")
  wordsDataFrame.createOrReplaceTempView("words")

  val wordCountsDataFrame =
    spark.sql("select word, count(*) from words group by word")
  wordCountsDataFrame.show()
}
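No answer is recorded for this one, but since wordCountsDataFrame only exists for the duration of a single micro-batch, a common approach is to perform the conditional join inside foreachRDD rather than trying to use the DataFrame outside it. A minimal sketch of that idea, where otherDF and joinCondition are hypothetical placeholders for the second DStream-derived DataFrame and the condition, not part of the original code:

stream1.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  val wordCountsDataFrame = rdd.toDF("word").groupBy("word").count()

  // otherDF stands in for the DataFrame built from the other DStream's current batch;
  // joinCondition stands in for whatever condition decides whether to join.
  val result =
    if (joinCondition) wordCountsDataFrame.join(otherDF, Seq("word"), "left_outer")
    else wordCountsDataFrame

  result.show()
}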

Optimization query for DataFrame Spark

I am trying to create a DataFrame from a Hive table, but I don't know the Spark API well.
I need help optimizing the query in the getLastSession method, so that Spark runs one job instead of two:
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if (onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)

def getLastSession: Dataset[Row] = {
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString
  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  +----------+----------------+------------------+-------+
  */
  dfByLastSession
}
PS. My source table (for example):

+-------------+----------+-----------+------------------+-------+
| name_process|id_session| time_write|               key|  value|
+-------------+----------+-----------+------------------+-------+
|   OtherClass|jsdfsadfsf|43434883477|schema0.table0.csv|Success|
|   OtherClass|jksdfkjhka|23212123323|schema1.table1.csv|Success|
|   OtherClass|alskdfksjd|23343212234|schema2.table2.csv|Failure|
|ExternalClass|sdfjkhsdfd|34455453434|schema3.table3.csv|Success|
+-------------+----------+-----------+------------------+-------+
You can use row_number with Window like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

val dfByLastSession = df.withColumn(
  "rn",
  row_number().over(Window.orderBy(desc("time_write")))
).filter("rn = 1").drop("rn")
dfByLastSession.show()
However, since the Window is not partitioned by any field, all rows are shuffled to a single partition, which may degrade performance.
Another thing you can change in your code is to use struct ordering to get the id_session associated with the most recent time_write in a single query:
val lastSession = df.select(max(struct(col("time_write"), col("id_session")))("id_session")).first.getString(0)
val dfByLastSession = df.filter(col("id_session") === lastSession)
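To see why the struct trick works: structs compare field by field from left to right, so max(struct(time_write, id_session)) is the struct with the largest time_write, and extracting id_session from it yields the session of the most recent write. A minimal self-contained sketch using the sample rows above (assumes a SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.functions.{col, max, struct}

val sample = Seq(
  ("jsdfsadfsf", 43434883477L),
  ("jksdfkjhka", 23212123323L),
  ("alskdfksjd", 23343212234L)
).toDF("id_session", "time_write")

// max over the struct picks the row with the largest time_write;
// ("id_session") then extracts that row's session id.
val lastSession = sample
  .select(max(struct(col("time_write"), col("id_session")))("id_session"))
  .first.getString(0)
// lastSession == "jsdfsadfsf"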

Extract words from a string in spark hadoop with scala

I was using the code below to extract the strings I needed in Spark SQL. Now I am working with more data in Spark on Hadoop and want to extract strings the same way, but the same code does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
//User Defined function to extract
def toExtract(str: String) = {
val pattern = Pattern.compile("#\\w+")
val tmplst = scala.collection.mutable.ListBuffer.empty[String]
val matcher = pattern.matcher(str)
while (matcher.find()) {
tmplst += matcher.group()
}
tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're using Spark not the way it's meant to be used: your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...)
That way, each batch of words is assigned to a different partition, and hence potentially to different JVMs and/or executors (depending on the total number of partitions and the size of the data). In your solution, the entire sentence lands in a single partition, so there is no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid those in general). To match your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Hope it helps :-)

Get or Insert within a Transaction on Doobie in Scala

I'm reading through the Doobie documentation and trying to do a simple get-or-create within a transaction. I get an Option from the first query and attempt a getOrElse that runs an insert in the else branch, but I keep getting "value map is not a member of Any" inside the getOrElse call. What's the correct way to either get an existing row or create a new row in instances, and return that result within a transaction?
import doobie._
import doobie.implicits._
import cats._
import cats.effect._
import cats.implicits._
import org.joda.time.DateTime

import scala.concurrent.ExecutionContext

case class Instance(id: Int, hostname: String)
case class User(id: Int, instanceId: Int, username: String, email: String, created: DateTime)

class Database(dbUrl: String, dbUser: String, dbPass: String) {
  implicit val cs = IO.contextShift(ExecutionContext.global)

  val xa = Transactor.fromDriverManager[IO](
    "org.postgresql.Driver", dbUrl, dbUser, dbPass
  )

  def getOrCreateInstance(hostname: String) = for {
    existingInstance <- sql"SELECT id, hostname FROM instances i WHERE i.hostname = $hostname".query[Instance].option
    ensuredInstance <- existingInstance.getOrElse(sql"INSERT INTO instances(hostname) VALUES(?)".update.withGeneratedKeys[Instance]("id", "hostname"))
  } yield ensuredInstance
}
I got the following answer thanks to the people in the #scala channel on Freenode. I'm posting it here for completeness, and for anyone interested in doing this without the for comprehension used in the other answer.
import cats.data.OptionT

def getOrCreateInstance(hostname: String): ConnectionIO[Instance] =
  OptionT(sql"SELECT id, hostname FROM instances i WHERE i.hostname = $hostname".query[Instance].option)
    .getOrElseF(sql"INSERT INTO instances(hostname) VALUES($hostname)".update.withGeneratedKeys[Instance]("id", "hostname").compile.lastOrError)
I believe something like this should work for you,
def getOrCreateInstance(hostname: String): ConnectionIO[Instance] = for {
  existingInstance <- sql"SELECT id, hostname FROM instances i WHERE i.hostname = $hostname".query[Instance].option
  ensuredInstance <- existingInstance.fold(sql"INSERT INTO instances(hostname) VALUES($hostname)".update.withGeneratedKeys[Instance]("id", "hostname").take(1).compile.lastOrError)(_.pure[ConnectionIO])
} yield ensuredInstance
where you are compiling the fs2 Stream and also lifting the existing instance into a ConnectionIO in the case that it does already exist.
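Either version returns a ConnectionIO[Instance], so the whole get-or-create runs in a single transaction once it is passed through the transactor. A minimal usage sketch, assuming the corrected getOrCreateInstance is placed back inside the Database class and that the connection parameters below are placeholders:

import doobie.implicits._

val db = new Database("jdbc:postgresql://localhost:5432/mydb", "db_user", "db_pass")

// transact interprets the ConnectionIO against the transactor as one transaction;
// unsafeRunSync() is only for demonstration at the end of the program.
val instance: Instance =
  db.getOrCreateInstance("example.com").transact(db.xa).unsafeRunSync()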

How to use SQL to query a csv file with Scala?

I am new to Spark and Scala, and I am trying to run a SQL query on a CSV file and return the records. Below is what I have, but it is not working:
val file = sc.textFile("file:///data/home_data.csv")
val records = file.sqlContext("SELECT id FROM home_data WHERE yr_built < 1979")
combined.collect().foreach(records)
I get errors with the file.sqlContext function.
Thanks
You can use a case class to map the data to the respective field names and data types, and then run your query:

case class Person(first_name: String, last_name: String, age: Int)

val pmap = file.map(line => line.split(","))
val personRDD = pmap.map(p => Person(p(0), p(1), p(2).toInt))
val personDF = personRDD.toDF()

Then query personDF.
I don't know your schema, so I formulated it this way.
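For completeness, a minimal sketch of the full flow using the question's own query, assuming a SparkContext sc and a SQLContext sqlContext are already available and that home_data.csv has id and yr_built columns (the case class below is a guess at the schema, not the real one):

import sqlContext.implicits._

// Hypothetical schema: adjust the case class to the real columns of home_data.csv.
case class Home(id: String, yr_built: Int)

val file = sc.textFile("file:///data/home_data.csv")
val homeDF = file
  .map(_.split(","))
  .map(h => Home(h(0), h(1).toInt))
  .toDF()

// Register a temp view so the SQL from the question can run against it.
homeDF.createOrReplaceTempView("home_data")

val records = sqlContext.sql("SELECT id FROM home_data WHERE yr_built < 1979")
records.collect().foreach(println)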