I have a dataset of type ExpenseEntry. ExpenseEntry is a basic data structure that keeps track of the amount spent on each category:
case class ExpenseEntry(
  name: String,
  category: String,
  amount: BigDecimal
)
Example values -
ExpenseEntry("John", "candy", 0.5)
ExpenseEntry("Tia", "game", 0.25)
ExpenseEntry("John", "candy", 0.15)
ExpenseEntry("Tia", "candy", 0.55)
The expected answer is:
category - name - amount
candy - John - 0.65
candy - Tia - 0.55
game - Tia - 0.25
What I want to do is get the total amount spent per category per name. So, I have the dataset query below:
dataset.groupBy("category", "name").agg(sum("amount"))
The query seems right to me in theory. However, the sum shows up as 0E-18, which is 0. I am guessing the amount is getting typecast to an int inside the sum function. How do I cast it to BigInt? Is my understanding of the issue right?
package spark
import org.apache.spark.sql.{DataFrame, SparkSession}
object SumBig extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class ExpenseEntry(
    name: String,
    category: String,
    amount: BigDecimal
  )

  val df = Seq(
    ExpenseEntry("John", "candy", 0.5),
    ExpenseEntry("Tia", "game", 0.25),
    ExpenseEntry("John", "candy", 0.15),
    ExpenseEntry("Tia", "candy", 0.55)
  ).toDF()
  df.show(false)

  val r = df.groupBy("category", "name").sum("amount")
  r.show(false)

  // +--------+----+--------------------+
  // |category|name|sum(amount)         |
  // +--------+----+--------------------+
  // |game    |Tia |0.250000000000000000|
  // |candy   |John|0.650000000000000000|
  // |candy   |Tia |0.550000000000000000|
  // +--------+----+--------------------+
}
You can use bround() to limit the decimals.
Sum won't change the data type of the column from decimal to int.
import org.apache.spark.sql.functions.{bround, col, sum}

df.groupBy("category", "name").agg(sum(bround(col("amount"), 2)).as("sum_amount")).show()
I've tried replacing int with Decimal in Spark, but I'm getting the error below.
spark.createDataFrame([{'abc': 7010930}], StructType([StructField('abc', DecimalType(20, 0), True)]))
field abc: DecimalType(20,0) can not accept object 7010930 in type <class 'int'>
I understand that if the precision is 20 it should be able to hold 7 digits, so where am I going wrong?
You can pass a Decimal object instead of an int by using the Python standard-library class decimal.Decimal:
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

df = spark.createDataFrame(
    [(Decimal(7010930),)],
    StructType([StructField('abc', DecimalType(20, 0), True)])
)
df.printSchema()
#root
# |-- abc: decimal(20,0) (nullable = true)
I am new to Scala and I desperately need some guidance on the following problem:
I have a dataframe like the one below (some elements may be NULL):
val dfDouble = Seq(
  (1.0, 1.0, 1.0, 3.0),
  (1.0, 2.0, 0.0, 0.0),
  (1.0, 3.0, 1.0, 1.0),
  (1.0, 4.0, 0.0, 2.0)).toDF("m1", "m2", "m3", "m4")
dfDouble.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|1.0|1.0|1.0|3.0|
|1.0|2.0|0.0|0.0|
|1.0|3.0|1.0|1.0|
|1.0|4.0|0.0|2.0|
+---+---+---+---+
I need to get the following statistics out of this dataframe:
a vector that contains the mean of each column (some elements might be NULL, and I want to calculate the mean using only the non-NULL elements); I would also like to refer to each element of the vector by name, so that, for example, vec_mean["m1_mean"] returns the first element
vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
a variance-covariance matrix that is (4 x 4), where the diagonal entries are var(m1), var(m2), ..., and the off-diagonal entries are cov(m1,m2), cov(m1,m3), ...; here, too, I would like to use only the non-NULL elements in the variance-covariance calculation
A vector that contains the number of non-NULL values for each column
vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
A vector that contains the standard deviation of each column
vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
In R I would convert everything to a matrix and then the rest is easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple matrix types, which is confusing (DenseMatrix, IndexedRowMatrix, etc.)
Edited: apparently it makes a difference whether the content of the dataframe is Double or Int, so I revised the elements to be Double.
I used the following command from the suggested answer and it worked!
val rdd = dfDouble0.rdd.map { case a: Row =>
  (0 until a.length).foldRight(Array[Double]()) { (b, acc) =>
    // a primitive Double can never be null, so check the Row itself for NULL cells
    val k = if (a.isNullAt(b)) 0.0 else a.getAs[Double](b)
    acc.+:(k)
  }
}
You can work with Spark's RowMatrix. It supports exactly this kind of operation: computing the covariance matrix using each row as an observation, column means, variances, etc. The only thing you have to know is how to build it from a DataFrame.
It turns out that a DataFrame in Spark carries a schema describing the type of information that can be stored in it, and it is not just an array of floating-point numbers. So the first thing is to transform this DF into an RDD of vectors (dense vectors in this case).
Having this DF:
val df = Seq(
  (1, 1, 1, 3),
  (1, 2, 0, 0),
  (1, 3, 1, 1),
  (1, 4, 0, 2),
  (1, 5, 0, 1),
  (2, 1, 1, 3),
  (2, 2, 1, 1),
  (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")
Convert it to an RDD[DenseVector] representation. There must be dozens of ways of doing this. One could be:
import org.apache.spark.sql.Row

val rdd = df.rdd.map { case a: Row =>
  (0 until a.length).foldRight(Array[Int]()) { (b, acc) =>
    // a primitive Int can never be null, so check the Row itself for NULL cells
    val k = if (a.isNullAt(b)) 0 else a.getAs[Int](b)
    acc.+:(k)
  }.map(_.toDouble)
}
As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD[DenseVector]. It is as simple as doing:
import org.apache.spark.mllib.linalg.Vectors

val rowsRdd = rdd.map(Vectors.dense(_))
And now you can build your Matrix:
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = new RowMatrix(rowsRdd)
Once you have the matrix, you can easily compute the different metrics per column:
println("Mean: " + mat.computeColumnSummaryStatistics().mean)
println("Variance: " + mat.computeColumnSummaryStatistics().variance)
It gives:
Mean: [1.375,2.625,0.5,1.375]
Variance:
[0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api
You can also compute the covariance matrix, do the SVD, etc.
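As a quick follow-up sketch, reusing the mat built above and staying within the RDD-based mllib API, the covariance matrix and a couple of the other statistics asked about in the question look like this:

// 4 x 4 covariance matrix: variances on the diagonal, covariances off the diagonal
val cov = mat.computeCovariance()
println("Covariance matrix:\n" + cov)

// one pass over the data for the per-column summary statistics
val summary = mat.computeColumnSummaryStatistics()
println("Std dev per column: " + summary.variance.toArray.map(math.sqrt).mkString(", "))
// note: numNonzeros counts non-zero entries; NULLs were already mapped to 0 when the RDD was built
println("Non-zero count per column: " + summary.numNonzeros)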
I am trying to get the result of a Spark SQL query and do some calculations on it in Scala.
val sql_DF = spark.sql("SELECT count(distinct(my_id)) total_id FROM some_ids_table ")
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
val sql_DF01 = spark.sql("SELECT count(distinct(other_id)) other_ids FROM some_ids_table where column_value1 = 1")
val other_id_1 = sql_DF01.select("other_ids").first().toSeq.asInstanceOf[Seq[Double]][0]
println(other_id_1/total_ids)
I got error:
error: identifier expected but integer literal found.
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
How do I convert the value in the SQL query result row to a double so that I can do some math calculations on it?
e.g.
other_ids / total_ids
thanks
You can use the CAST function as below
scala> val sql_DF = spark.sql("SELECT count(distinct(eid)) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: bigint]
Using the CAST function, the bigint is converted to double:
scala> val sql_DF = spark.sql("SELECT CAST(count(distinct(eid)) AS DOUBLE) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: double]
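Once the column is a double, pulling it out as a Scala Double is just a matter of reading the first Row; a small sketch (the second variant assumes import spark.implicits._ is in scope):

val totalId: Double = sql_DF.first().getDouble(0)
// or, going through a typed Dataset
val totalIdTyped: Double = sql_DF.as[Double].first()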
The question didn't seem quite clear to me. Here's how you would generally handle the situation below.
import spark.implicits._
val df = Seq(("a",90.0),("b",30.0)).toDF("name","marks")
Collect the data in column "marks", which will be of type Array[Row]:
val rows: Array[Row] = df.select("marks").collect()
You then need to extract each element using map, reading it with the type that matches the column's data type:
val values: Array[Double] = rows.map(row => row.getAs[Double](0)/*row.getDouble(0)*/)
There are a few different points in your question. If you want to get the count of distinct my_id values and assign it to a value, you can do it directly with:
val total_ids = sql_DF.dropDuplicates("my_id").count()
The error that you are getting is due to the way you are accessing the data from the DF (you need to access the Row object: https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html).
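For example, a minimal sketch using the column name from your query (count(distinct ...) comes back as a bigint, i.e. a Long):

val total_ids: Long = sql_DF.select("total_id").first().getLong(0)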
Alternatively, you can cast your DF directly using .as[DataType], like this:
val total_id = sql_DF.selectExpr("count(distinct(my_id)) as total_id").as[Long].first()
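Putting it together for the division you wanted, something along these lines should work (a sketch, reusing the table and column names from your question):

val total_ids = spark.sql("SELECT count(distinct(my_id)) total_id FROM some_ids_table").as[Long].first()
val other_id_1 = spark.sql("SELECT count(distinct(other_id)) other_ids FROM some_ids_table WHERE column_value1 = 1").as[Long].first()
println(other_id_1.toDouble / total_ids)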
Given following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
import spark.implicits._
case class ABC(a: Int, b: Int, c: Int)
val first = Seq(
  ABC(1, 2, 3),
  ABC(1, 3, 4),
  ABC(2, 4, 5),
  ABC(2, 5, 6)
).toDF("a", "b", "c")

val second = Seq(
  (1, 2, (Date.valueOf("2018-01-02"), 30)),
  (1, 3, (Date.valueOf("2018-01-01"), 20)),
  (2, 4, (Date.valueOf("2018-01-02"), 50)),
  (2, 5, (Date.valueOf("2018-01-01"), 60))
).toDF("a", "b", "c")
first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(sort_array(collect_list("c2")))
  .show(false)
}
Spark produces following result:
+---+----------------------------------+
|a |sort_array(collect_list(c2), true)|
+---+----------------------------------+
|1 |[[2018-01-01,20], [2018-01-02,30]]|
|2 |[[2018-01-01,60], [2018-01-02,50]]|
+---+----------------------------------+
This implies that Spark is sorting the array by date (since it is the first field), but I want to instruct Spark to sort by a specific field of that nested struct.
I know I can reshape the array to (value, date), but that seems inconvenient; I want a general solution (imagine I have a big nested struct, 5 layers deep, and I want to sort that structure by a particular column).
Is there a way to do that? Am I missing something?
According to the Hive Wiki:
sort_array(Array<T>) : Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).
This means that the array will be sorted lexicographically which holds true even with complex data types.
Alternatively, you can create a UDF to sort it (and witness performance degradation) based on the second element:
val sortUdf = udf { (xs: Seq[Row]) =>
  xs.sortBy(_.getAs[Int](1))
    .map { case Row(x: java.sql.Date, y: Int) => (x, y) }
}

first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(sortUdf(collect_list("c2")))
  .show(false)
//+---+----------------------------------+
//|a |UDF(collect_list(c2, 0, 0)) |
//+---+----------------------------------+
//|1 |[[2018-01-01,20], [2018-01-02,30]]|
//|2 |[[2018-01-02,50], [2018-01-01,60]]|
//+---+----------------------------------+
For Spark 3+, you can pass a custom comparator function to array_sort:
The comparator will take two arguments representing two elements of
the array. It returns -1, 0, or 1 as the first element is less than,
equal to, or greater than the second element. If the comparator
function returns other values (including null), the function will fail
and raise an error.
val df = first
  .join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
  .groupBy("a")
  .agg(collect_list("c2").alias("list"))

val df2 = df.withColumn(
  "list",
  expr(
    "array_sort(list, (left, right) -> case when left._2 < right._2 then -1 when left._2 > right._2 then 1 else 0 end)"
  )
)
df2.show(false)
//+---+------------------------------------+
//|a |list |
//+---+------------------------------------+
//|1 |[[2018-01-01, 20], [2018-01-02, 30]]|
//|2 |[[2018-01-02, 50], [2018-01-01, 60]]|
//+---+------------------------------------+
Where _2 is the name of the struct field you want to use for sorting.
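The same comparator also works with dotted field access when the sort key sits deeper in a named struct. A sketch reusing the df above, where the nested field item.price is a hypothetical name standing in for whatever the real schema has:

// hypothetical schema: "list" is array<struct<item: struct<price: int, ...>, ...>>
val sortedByNested = df.withColumn(
  "list",
  expr("array_sort(list, (l, r) -> case when l.item.price < r.item.price then -1 when l.item.price > r.item.price then 1 else 0 end)")
)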
If you have complex objects, it is much better to use a statically typed Dataset.
case class Result(a: Int, b: Int, c: Int, c2: (java.sql.Date, Int))
val joined = first.join(second.withColumnRenamed("c", "c2"), Seq("a", "b"))
joined.as[Result]
  .groupByKey(_.a)
  .mapGroups((key, xs) => (key, xs.map(_.c2).toSeq.sortBy(_._2)))
  .show(false)
// +---+----------------------------------+
// |_1 |_2 |
// +---+----------------------------------+
// |1 |[[2018-01-01,20], [2018-01-02,30]]|
// |2 |[[2018-01-02,50], [2018-01-01,60]]|
// +---+----------------------------------+
In simple cases it is also possible to use a udf, but in general that leads to inefficient and fragile code and quickly gets out of control as the complexity of the objects grows.
For the following code, I want to generate random values, but the values in dfCombined["TotalFreeSize"] can be up to 9.941458e+11, and that throws an error for randint. What should I do?
And I can't find a randlong function.
# get average, std, and number of NaN values in col
average_age_test = dfCombined["TotalFreeSize"].mean()
std_age_test = dfCombined["TotalFreeSize"].std()
count_nan_age_test = dfCombined["TotalFreeSize"].isnull().sum()
rand_1 = np.random.randint(average_age_test - std_age_test, average_age_test + std_age_test, size = count_nan_age_test)
randint takes two values as parameters, both ints: a lower bound and a higher bound. It will then return an int between those values.
mean() and std(), on the other hand, return float values, being the mean and std of a set of values.
What you should do is pass ints as parameters to the randint function.
You can do:
average_age_test = int(dfCombined["TotalFreeSize"].mean())
std_age_test = int(dfCombined["TotalFreeSize"].std())
hope that helps! if it does please upvote :)
sample:
input:
A = int (9.941458e+11)
random.randint(0, A)
output:
153271550649L (type: long)
You can simply map random values from 0 to 1 (generated with random) to your desired max value, and Python will cast it appropriately. This seems to work even for very large max values:
import random
max = 9.941458e91 # notice the '91' exponent here!
for i in xrange(10):
    val = long(random.random() * max)
    print val
    print type(val)
    print
Sample output (single value):
45525909158271540655933151699075210889481964268830820171688052775301692920778132706971090944
<type 'long'>