I have the following data:
53,Male,11th,<=50K
53,Male,11th,<=50K
53,Male,11th,<=50K
20,Female,Masters,>50K
20,Female,Masters,>50K
33,Male,Bachelors,<=50K
Next I need to group the above data using select and group by, so it will be something like:
53,Male,11th,<=50K,3
20,Female,Masters,>50K,2
33,Male,Bachelors,<=50K,1
where the last number shows the count of similar records. Now I need to filter the rows whose count of equivalent records is > 2 and store them in a separate file.
I have grouped the data with SQL queries in Scala. To ungroup the data, I thought we could create a table and add the grouped data to it line by line with an INSERT command. It works, but it is very, very slow and took around one hour for a few records. Are there any ideas on how to do this in Scala? Many thanks.
The command is shown below:
import spark.sqlContext.implicits._
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
case class Rating(age: Double,edu: String, sex: String, salary: String)
val Result = sc.textFile("hdfs://NameNode01:9000/input/adult.csv").map(_.split(",")).map(p => Rating(p(0).trim.toDouble,p(1),p(2),p(3))).toDF()
Result.registerTempTable("Start")
val sal1=spark.sqlContext.sql("SELECT age,edu,sex,salary,count(*) as cnt from Start group by age,edu,sex,salary")
sal1.registerTempTable("adult")
val sal2=spark.sqlContext.sql("SELECT age,edu,sex,salary,cnt from adult WHERE cnt>3")
sal2.registerTempTable("adult2")
var ag=sal2.map(age => ""+age(0)).collect()
var ed=sal2.map(edu => ""+edu(1)).collect()
var se=sal2.map(sex => ""+sex(2)).collect()
var sa=sal2.map(salary => ""+salary(3)).collect()
var cn=sal2.map(cnt => ""+cnt(4)).collect()
//convert age to double
val ages= ag.map(_.toDouble)
//convert the cnt to integer
val counts= cn.map(_.toInt)
//length of the array
var cnt_length=counts.size
//create a table and add the sal2 records in it
val adlt2=spark.sqlContext.sql("CREATE TABLE adult3 (age double, edu string, sex string, salary string)")
//loop and enter the number of cn
var sql_querys="query"
var i=0
var j=0
var loop_cnt=0
for (i <- 0 to cnt_length - 1) {
  loop_cnt = counts(i)
  for (j <- 0 to loop_cnt - 1) {
    sql_querys = "INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')"
    val adlt3 = spark.sqlContext.sql("INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')")
  }
}
The main part is the loop at the end of the code.
Here is a shorter solution which uses only RDDs:
val result = sc
.textFile("hdfs://NameNode01:9000/input/adult.csv")
.map({ (line: String) =>
val p = line.split(",")
(Rating(p(0).trim.toDouble,p(1),p(2),p(3)), 1)
})
.reduceByKey(_ + _)
.filter(_._2 > 2)
.flatMap(rating => Array.fill(rating._2)(rating._1))
It works as follows:
textFile loads the RDD from the file
map transforms the lines into pairs of the form (rating, 1)
reduceByKey groups the pairs by rating and sums the 1s (i.e. counts the occurrences of each rating)
filter discards the ratings which appear fewer than 3 times
flatMap repeats each rating as many times as its count and then flattens all the results into a single RDD
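Since the question also asks to store the result in a separate file, here is a minimal sketch of writing the ungrouped RDD back out as CSV (the output path is just a placeholder):
// Format each Rating back into a CSV line; Spark writes one part file per partition
result
  .map(r => s"${r.age},${r.edu},${r.sex},${r.salary}")
  .saveAsTextFile("hdfs://NameNode01:9000/output/adult_filtered")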
Here are some reasons why the initial approach is not performant:
collect is used on the dataframe to read its contents onto the local machine. This means that you immediately lose all the parallelization and clustering benefits of Spark.
The for loop executes single-row insertions into the table. The available transformations on Spark objects (e.g. map, filter, reduce, single SQL queries) are highly optimised to perform those actions in a distributed fashion. By using a for loop to perform single-row actions, you lose this advantage and additionally risk extreme overhead from the dataframes that are copied on each iteration of the loop.
(minor) Converting an RDD to a dataframe adds some additional computational cost. So unless you plan to perform several operations that would benefit from the performance features of dataframes or datasets, I would suggest keeping it simple by using just RDDs.
You might want to consider ungrouping your dataframe using explode, together with the groupBy count:
import org.apache.spark.sql.functions._
case class Rating(age: Double, edu: String, sex: String, salary: String)
val Result = sc.textFile("/Users/leo/projects/spark/files/testfile.csv").
map(_.split(",")).
map(p => Rating(p(0).trim.toDouble, p(1).trim, p(2).trim, p(3).trim)).
toDF
val saDF1 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt")
val saDF2 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt").where($"cnt" > 2)
// Create a UDF to fill array of 1's to be later exploded
val fillArr = (n: Int) => Array.fill(n)(1)
val fillArrUDF = udf(fillArr)
val expandedDF1 = saDF1.withColumn("arr", fillArrUDF($"cnt"))
expandedDF1.show
+----+------+---------+------+---+---------+
| age| edu| sex|salary|cnt| arr|
+----+------+---------+------+---+---------+
|33.0| Male|Bachelors| <=50K| 1| [1]|
|20.0|Female| Masters| >50K| 2| [1, 1]|
|53.0| Male| 11th| <=50K| 3|[1, 1, 1]|
+----+------+---------+------+---+---------+
// Ungroup dataframe using explode
val ungroupedDF1 = expandedDF1.withColumn("a", explode($"arr")).
select("age", "edu", "sex", "salary")
ungroupedDF1.show
+----+------+---------+------+
| age| edu| sex|salary|
+----+------+---------+------+
|33.0| Male|Bachelors| <=50K|
|20.0|Female| Masters| >50K|
|20.0|Female| Masters| >50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
+----+------+---------+------+
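As a side note, on Spark 2.4+ you could probably skip the helper UDF and build the repeat array with the built-in array_repeat function instead (a sketch under that assumption):
// array_repeat(1, cnt) builds an array of cnt ones, which explode then expands into cnt rows
val ungroupedDF2 = saDF1
  .withColumn("a", explode(expr("array_repeat(1, cnt)")))
  .select("age", "edu", "sex", "salary")
ungroupedDF2.show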
From what I understood of your question, you want to filter out the similar records whose count is greater than 2 and write the result to a file. If that is so, the following can be your solution.
You must already have the original dataframe as
+----+------+---------+------+
|age |edu |sex |salary|
+----+------+---------+------+
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|20.0|Female|Masters |>50K |
|20.0|Female|Masters |>50K |
|33.0|Male |Bachelors|<=50K |
+----+------+---------+------+
You don't need to write complex SQL queries to find the count; you can just use the built-in functions:
val columnNames = Result.columns
val finalTemp = Result.groupBy(columnNames.map(col): _*).agg(count("salary").as("similar records"))
This should give output as
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|53.0|Male |11th |<=50K |3 |
+----+------+---------+------+---------------+
Now, to filter, you can just use the filter function:
val finalTable = finalTemp.filter($"similar records" < 3)
final output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
you can save it to a file
finalTable.write.format("com.databricks.spark.csv").save("output path")
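If you are on Spark 2.0 or later, the built-in CSV writer works as well, without the external spark-csv package (the path below is just a placeholder):
finalTable.write.option("header", "true").csv("output path")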
If you want the original (ungrouped) rows after the filtering, you can simply use a join:
Result.join(finalTable, Seq(columnNames: _*)).show(false)
output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file as above
Note: you will need the following import for the above functions to work
import org.apache.spark.sql.functions._
Related
I have an array of strings defined in a variable which contains the names of the columns. I would like to perform a group by and get the count.
I am trying the code below but it throws an error.
val keys = Array("Col1", "Col2")
val grouppedByDf = myDf.groupBy(keys.mkString(",").count
Can you please guide me on what I am doing wrong here?
import spark.implicits._
val df = Seq(("βήτα", "άλφα", 20), ("άλφα", "βήτα", 10), ("άλφα", "βήτα", 20), ("βήτα", "άλφα", 10)).toDF("α", "β", "ω")
val keys = Array("α", "β")
df
.groupBy(keys.map(col(_)): _*)
.count()
.show()
+----+----+-----+
| α| β|count|
+----+----+-----+
|βήτα|άλφα| 2|
|άλφα|βήτα| 2|
+----+----+-----+
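An equivalent form, if you prefer the String-based overload of groupBy, is to split the array into head and tail:
df
  .groupBy(keys.head, keys.tail: _*)
  .count()
  .show()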
I've been studying Spark for a while, but today I got stuck. I'm working on a recommendation model using the Audioscrobbler dataset.
I have my model based on ALS and the following definition for making the recommendations:
def makeRecommendations(model: ALSModel, userID: Int,howMany: Int): DataFrame = {
val toRecommend = model.itemFactors.select($"id".as("artist")).withColumn("user", lit(userID))
model.transform(toRecommend).
select("artist", "prediction", "user").
orderBy($"prediction".desc).
limit(howMany)
}
It's generating the expected output, but now I would like to create a new list of DataFrames using Predictions DF and User Data DF.
DataFrame Example
A new list of DFs consisting of the predicted value from the "Predictions DF" and a "Listened" column that will be 0 if the user didn't listen to the artist or 1 if the user did, something like this:
Expected DF
I tried the following solution:
val recommendationsSeq = someUsers.map { userID =>
//Gets the artists from user in testData
val artistsOfUser = testData.where($"user".===(userID)).select("artist").rdd.map(r => r(0)).collect.toList
// Recommendations for each user
val recoms = makeRecommendations(model, userID, numRecom)
//Insert a column 'listened' with 1 if the artist is in the test set for the user and 0 otherwise
val recomOutput = recoms.withColumn("listened", when($"artist".isin(artistsOfUser: _*), 1.0).otherwise(0.0)).drop("artist")
(recomOutput)
}.toSeq
But it's very time-consuming when the recommendation covers more than 30 users. I believe there's a better way to do it.
Could someone give me some ideas?
Thanks,
You can try joining the dataframes, then groupBy and count:
scala> val df1 = Seq((1205,0.9873411,1000019)).toDF("artist","prediction","user")
scala> df1.show()
+------+----------+-------+
|artist|prediction| user|
+------+----------+-------+
| 1205| 0.9873411|1000019|
+------+----------+-------+
scala> val df2 = Seq((1000019,1205,40)).toDF("user","artist","playcount")
scala> df2.show()
+-------+------+---------+
| user|artist|playcount|
+-------+------+---------+
|1000019| 1205| 40|
+-------+------+---------+
scala> df1.join(df2,Seq("artist","user")).groupBy('prediction).count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 0.9873411| 1|
+----------+-----+
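If the main cost is the per-user loop itself (one collect and one transform per user), here is a hedged sketch of doing it in a single pass instead; it assumes someUsers is a Seq[Int], reuses the imports from your snippet, and leaves out the per-user top-howMany cut (that would need a window function):
// Score every (artist, user) pair at once instead of looping over users
val allUsers = someUsers.toSeq.toDF("user")
val toRecommendAll = model.itemFactors.select($"id".as("artist")).crossJoin(allUsers)
val scored = model.transform(toRecommendAll).select("artist", "user", "prediction")
// Mark the artists that actually appear in the user's test data, then left-join and fill the rest with 0
val listened = testData.select($"user", $"artist").distinct.withColumn("listened", lit(1.0))
val recomOutput = scored.join(listened, Seq("user", "artist"), "left")
  .na.fill(0.0, Seq("listened"))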
Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
import spark.implicits._
case class ABC(a: Int, b: Int, c: Int)
val first = Seq(
ABC(1, 2, 3),
ABC(1, 3, 4),
ABC(2, 4, 5),
ABC(2, 5, 6)
).toDF("a", "b", "c")
val second = Seq(
(1, 2, (Date.valueOf("2018-01-02"), 30)),
(1, 3, (Date.valueOf("2018-01-01"), 20)),
(2, 4, (Date.valueOf("2018-01-02"), 50)),
(2, 5, (Date.valueOf("2018-01-01"), 60))
).toDF("a", "b", "c")
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b")).groupBy("a").agg(sort_array(collect_list("c2")))
.show(false)
}
Spark produces the following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark is sorting an array by date (since it is the first field), but I want to instruct Spark to sort by specific field from that nested struct.
I know I can reshape the array to (value, date), but that seems inconvenient; I want a general solution (imagine I have a big nested struct, 5 layers deep, and I want to sort that structure by a particular column).
Is there a way to do that? Am I missing something?
According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
This means that the array will be sorted lexicographically, which holds true even for complex data types.
Alternatively, you can create a UDF to sort it (and witness performance degradation) based on the second element:
val sortUdf = udf { (xs: Seq[Row]) => xs.sortBy(_.getAs[Int](1) )
.map{ case Row(x:java.sql.Date, y: Int) => (x,y) }}
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(sortUdf(collect_list("c2")))
.show(false)
//+---+----------------------------------+
//|a |UDF(collect_list(c2, 0, 0)) |
//+---+----------------------------------+
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|
//+---+----------------------------------+
For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
.groupBy("a")
.agg(collect_list("c2").alias("list"))
val df2 = df.withColumn(
"list",
expr(
"array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)"
)
)
df2.show(false)
//+---+------------------------------------+
//|a |list |
//+---+------------------------------------+
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
//+---+------------------------------------+
Where _2 is the name of the struct field you want to use for sorting.
If you have complex objects, it is much better to use a statically typed Dataset.
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
.groupByKey(_.a)
.mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
.show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to use a udf, but this leads to inefficient and fragile code in general and quickly gets out of control as the complexity of the objects grows.
If I have a dataframe with fields ['did','doc'] such as
data = sc.parallelize(['This is a test',
'This is also a test',
'These sentence are tests',
'This tests these sentences'])\
.zipWithIndex()\
.map(lambda x: (x[1],x[0]))\
.toDF(['did','doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+
and I do some transformations on that document like tokenizing and finding 2-grams:
data = Tokenizer(inputCol = 'doc',outputCol = 'words').transform(data)
data = NGram(n=2,inputCol = 'words',outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did| doc| words| grams|
+---+--------------------+--------------------+--------------------+
| 0| This is a test| [this, is, a, test]|[this is, is a, a...|
| 1| This is also a test|[this, is, also, ...|[this is, is also...|
| 2|These sentence ar...|[these, sentence,...|[these sentence, ...|
| 3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+
then at the end I want to combine the two-grams and words into a single column of features with a VectorAssembler:
data = VectorAssembler(inputCols=['words','grams'],
                       outputCol='features').transform(data)
then I get the following error:
Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.
because the VectorAssembler doesn't like to work with lists of strings. To get around that I can drop the dataframe to an rdd, map the rdd to appropriate rows, and rezip it back up into a dataframe, a la
data = data.rdd.map(lambda x: Row(did=x['did'],
                                  features=x['words'] + x['grams'])).toDF(['did', 'features'])
This is not a problem for this tiny dataset, but it is prohibitively expensive for a large one.
Is there any way to achieve this more efficiently than the above?
You can use a udf to create the features column like this
import pyspark.sql.functions as f
import pyspark.sql.types as t
udf_add = f.udf(lambda x,y: x+y, t.ArrayType(t.StringType()))
data.withColumn('features', udf_add('words','grams')).select('features').collect()
[Row(features=['this', 'is', 'a', 'test', 'this is', 'is a', 'a test']),
Row(features=['this', 'is', 'also', 'a', 'test', 'this is', 'is also', 'also a', 'a test']),
Row(features=['these', 'sentence', 'are', 'tests', 'these sentence', 'sentence are', 'are tests']),
Row(features=['this', 'tests', 'these', 'sentences', 'this tests', 'tests these', 'these sentences'])]
I would like to read a .csv file with Spark and associate the columns with fitting Types.
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("date", DateType, true),
StructField("time",StringType, true),
StructField("am", DoubleType, true),
StructField("hum", DoubleType, true),
StructField("temp", DoubleType, true)
))
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.schema(customSchema)
.load("data.csv")
A line of the .csv I am reading looks like this
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark will read the .csv and associate the types correctly if I set the type for the date column to String. If I keep the customSchema as in the code shown above, Spark will throw an exception due to the wrong date format (DateType expects yyyy-MM-dd while mine is dd.MM.yyyy).
Is there a way to re-format the date strings to yyyy-MM-dd and apply the schema afterwards? Or can I alter the DateType given by Spark by adding parameters?
Thanks in advance
Use dateFormat option:
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.option("dateFormat", "dd.MM.yyyy")
.schema(customSchema)
.load("data.csv")
I recommend parsing the dates afterwards. Also refer to this.
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
and, to format a parsed java.time date back into a string:
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01
Also, as a side note: since Spark 2.0 you can use the SparkSession object only, and use encoders to infer the schema (instead of sc, sqlContext, etc.). Something like this:
val spark = SparkSession.builder().getOrCreate()
case class User(id:Int, city:String, loc:Array[Double], pop:Long, state:String)
val users = (spark.read.option("inferSchema","true").option("header","true").csv("data/users1.csv").as[User])
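Tying this back to the original question, the same customSchema and dateFormat options work with the session-based reader too (a minimal sketch, assuming the customSchema defined above):
val df = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .option("dateFormat", "dd.MM.yyyy") // matches the dd.MM.yyyy dates in the file
  .schema(customSchema)
  .csv("data.csv")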