Spark returns garbage values for decimal fields when I query an external Hive table stored as Parquet through Spark SQL.
In my application flow, a Spark process writes data to Parquet files directly in HDFS, and an external Hive table is defined on top of that location. A second Spark process then reads from the Hive table using Spark SQL and fetches incorrect data.
Scenario steps: This is a simple demo reproducing the issue:
Write to Parquet: I write data to a Parquet file in HDFS; Spark itself infers the decimal field as Decimal(28,26).
scala> val df = spark.sql("select 'dummy' as name, 10.70000000000000000000000000 as value")
df: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,false), StructField(value,DecimalType(28,26),false))
scala> df.show
+-----+--------------------+
| name| value|
+-----+--------------------+
|dummy|10.70000000000000...|
+-----+--------------------+
scala> df.write.mode("overwrite").parquet("/my/hdfs/location/test")
Read parquet file: to see if value is correctly written.
scala> val df_parq = spark.read.option("spark.sql.decimalOperations.allowPrecisionLoss",false).parquet("/my/hdfs/location/test")
df_parq: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
scala> df_parq.show
+-------+--------------------+
| name| value|
+-------+--------------------+
| dummy|10.70000000000000...|
+-------+--------------------+
Create external hive table: on top of parquet location with Decimal field as Decimal(18,6).
hive> create external table db1.test_precision(name string, value Decimal(18,6)) STORED As PARQUET LOCATION '/my/hdfs/location/test';
Run Hive query in beeline: to verify that correct data is returned.
hive> select * from db1.test_precision;
+----------------------+-----------------------+--+
| test_precision.name | test_precision.value |
+----------------------+-----------------------+--+
| dummy | 10.7 |
+----------------------+-----------------------+--+
Run same query using Spark Sql: Incorrect decimal values are produced.
scala> val df_hive = spark.sql("select * from db1.test_precision")
df_hive: org.apache.spark.sql.DataFrame = [name: string, value: decimal(18,6)]
scala> df_hive.show
+-----+-----------+
| name| value|
+-----+-----------+
|dummy|-301.989888|
+-----+-----------+
Note: I am aware that storing the value to Parquet with an explicit cast(value as Decimal(18,6)) in the first step fixes the issue, but I already have historical data that I can't reload right away.
Is there a way I can fix this while reading the value at step 5?
I reproduced your example completely, except for step 3. You should keep the same precision and scale when you declare the Decimal column of the table.
In your case, the DataFrame was written as Decimal(28,26):
df: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
so you should create the table with the same precision and scale for the decimal type.
hive> CREATE EXTERNAL TABLE test.test_precision(name string, value Decimal(28,26)) STORED AS PARQUET LOCATION 'hdfs://quickstart.cloudera:8020/user/cloudera/test_decimal';
/**AND NOT**/
hive> create external table db1.test_precision(name string, value Decimal(18,6)) STORED As PARQUET LOCATION '/my/hdfs/location/test';
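For what it's worth, the -301.989888 from the question can be reproduced with plain integer arithmetic. The sketch below is plain Python (no Spark) and rests on an assumption that I have not confirmed against the Spark source: that the reader takes the stored unscaled integer, keeps only its low 32 bits as a signed number, and then applies the table's scale of 6. The variable names are made up for the illustration.

```python
from decimal import Decimal

# 10.7 written as decimal(28,26) is stored as the unscaled integer 10.7 * 10**26.
unscaled = int(Decimal("10.7").scaleb(26))

# Assumption: the value is misread through a narrower integer, so only the
# low 32 bits survive and are interpreted as a signed number.
low32 = unscaled & 0xFFFFFFFF
if low32 >= 2**31:
    low32 -= 2**32

# The Hive table declares decimal(18,6), so a scale of 6 is applied on read.
misread = Decimal(low32).scaleb(-6)
print(misread)  # -301.989888
```

Whatever the exact code path, the point is the same: the bytes in the file only make sense together with the precision and scale they were written with.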
scala> val df = spark.sql("select 'dummy' as name, 10.70000000000000000000000000 as value")
df: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
scala> df.show()
+-----+--------------------+
| name| value|
+-----+--------------------+
|dummy|10.70000000000000...|
+-----+--------------------+
scala> df.printSchema()
root
|-- name: string (nullable = false)
|-- value: decimal(28,26) (nullable = false)
scala> df.write.mode("overwrite").parquet("hdfs://quickstart.cloudera:8020/user/cloudera/test_decimal")
scala> val df_parq = spark.read.option("spark.sql.decimalOperations.allowPrecisionLoss",false).parquet("hdfs://quickstart.cloudera:8020/user/cloudera/test_decimal")
df_parq: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
scala> df_parq.printSchema
root
|-- name: string (nullable = true)
|-- value: decimal(28,26) (nullable = true)
scala> df_parq.show
+-----+--------------------+
| name| value|
+-----+--------------------+
|dummy|10.70000000000000...|
+-----+--------------------+
hive> CREATE EXTERNAL TABLE test.test_precision(name string, value Decimal(28,26)) STORED AS PARQUET LOCATION 'hdfs://quickstart.cloudera:8020/user/cloudera/test_decimal';
hive> select * from test_precision;
+----------------------+-----------------------+--+
| test_precision.name | test_precision.value |
+----------------------+-----------------------+--+
| dummy | 10.7 |
+----------------------+-----------------------+--+
scala> val df_hive = spark.sql("select * from test.test_precision")
df_hive: org.apache.spark.sql.DataFrame = [name: string, value: decimal(28,26)]
scala> df_hive.show
+-----+--------------------+
| name| value|
+-----+--------------------+
|dummy|10.70000000000000...|
+-----+--------------------+
scala> df_hive.printSchema
root
|-- name: string (nullable = true)
|-- value: decimal(28,26) (nullable = true)
I would like to aggregate the values of a JSON column in a Spark DataFrame and a Hive table.
e.g.
year, month, val (json)
2010 01 [{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"}]
2010 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]
2008 10 [{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"}]
2008 10 [{"a_id":"fvds"},{"a_id":"yjndf"},{"a_id":"yesva"}]
I need:
year, month, val (json), num (int)
2010 01 [{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"},{"a_id":"uktf"},{"a_id":"ohcwa"}] 5
2008 10 [{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"},{"a_id":"fvds"},{"a_id":"yesva"}] 5
I need to remove the duplicates and also find the size of the resulting JSON array (the number of "a_id" entries in it).
The data is saved as a Hive table, so would it be better to work on it with PySpark SQL?
I would also like to know how to work on it if it is saved as a Spark DataFrame.
I have tried:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType(
[
StructField('a_id', StringType(), True)
]
)
df.withColumn("val", from_json("val", schema))\
.select(col('year'), col('month'), col('val.*'))\
.show()
But all values in "val" are null.
thanks
UPDATE
my hive version:
%sh
ls /databricks/hive | grep "hive"
spark--maven-trees--spark_1.4_hive_0.13
My DDL:
from functools import reduce
import pyspark.sql.functions as F
import pyspark.sql.types as T
# Concatenate the one-field rows of the array into a single flat sequence of strings
def concate_elements(val):
    return reduce(lambda x, y: x + y, val)
flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
# Drop duplicates (note that set does not preserve the original order)
remove_duplicates = F.udf(lambda row: list(set(row)), T.ArrayType(T.StringType()))
#final results
df.select("year", "month", flatten_array("val").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", F.size("uniquevalues")).show()
Consider the following input data, in the JSON file json-input.json:
{"year":"2010","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]}
{"year":"2011","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]}
Approach 1. Read data from hive
1. insert data into hive
ADD JAR /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE json_table (
year string,
month string,
value array<struct<a_id:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
load data local inpath '/home/sathya/json-input.json' into table json_table;
select * from json_table;
OK
2010 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]
2011 01 [{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]
2. Read data from spark:
pyspark --jars /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar --driver-class-path /home/sathya/Downloads/json-serde-1.3.7-jar-with-dependencies.jar
df=spark.sql("select * from default.json_table")
df.show(truncate=False)
'''
+----+-----+----------------------------------+
|year|month|value |
+----+-----+----------------------------------+
|2010|01 |[[caes], [uktf], [ohcwa]] |
|2011|01 |[[caes], [uktf], [uktf], [sathya]]|
+----+-----+----------------------------------+
'''
#UDFs for concatenating the array elements & removing duplicates in an array
from functools import reduce
import pyspark.sql.functions as F
import pyspark.sql.types as T
def concate_elements(val):
    return reduce(lambda x, y: x + y, val)
flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
remove_duplicates = F.udf(lambda row: list(set(row)), T.ArrayType(T.StringType()))
#final results
df.select("year", "month", flatten_array("value").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", F.size("uniquevalues")).show()
'''
+----+-----+--------------------------+--------------------+----+
|year|month|flattenvalues |uniquevalues |size|
+----+-----+--------------------------+--------------------+----+
|2010|01 |[caes, uktf, ohcwa] |[caes, uktf, ohcwa] |3 |
|2011|01 |[caes, uktf, uktf, sathya]|[caes, sathya, uktf]|3 |
+----+-----+--------------------------+--------------------+----+
'''
Approach 2 - direct read from input Json file json-input.json
{"year":"2010","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]}
{"year":"2011","month":"01","value":[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"uktf"},{"a_id":"sathya"}]}
code for your scenario is:
import os
import logging
from pyspark.sql import SQLContext,SparkSession
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql import functions as F
import pyspark.sql.types as T
df=spark.read.json("file:///home/sathya/json-input.json")
df.show(truncate=False)
'''
+-----+----------------------------------+----+
|month|value |year|
+-----+----------------------------------+----+
|01 |[[caes], [uktf], [ohcwa]] |2010|
|01 |[[caes], [uktf], [uktf], [sathya]]|2011|
+-----+----------------------------------+----+
'''
#UDFs for concatenating the array elements & removing duplicates in an array
from functools import reduce
def concate_elements(val):
    return reduce(lambda x, y: x + y, val)
flatten_array = F.udf(concate_elements, T.ArrayType(T.StringType()))
remove_duplicates = F.udf(lambda row: list(set(row)), T.ArrayType(T.StringType()))
#final results
df.select("year", "month", flatten_array("value").alias("flattenvalues")).withColumn("uniquevalues", remove_duplicates("flattenvalues")).withColumn("size", F.size("uniquevalues")).show()
'''
+----+-----+--------------------------+--------------------+----+
|year|month|flattenvalues |uniquevalues |size|
+----+-----+--------------------------+--------------------+----+
|2010|01 |[caes, uktf, ohcwa] |[caes, uktf, ohcwa] |3 |
|2011|01 |[caes, uktf, uktf, sathya]|[caes, sathya, uktf]|3 |
+----+-----+--------------------------+--------------------+----+
'''
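One thing to be aware of: list(set(...)) does not keep the original order, which is why uniquevalues comes back as [caes, sathya, uktf] in the output above. If the order matters, the same flatten-and-deduplicate logic can be written so that the first occurrence wins. This is only a plain-Python sketch; flatten_unique is a hypothetical helper name, and it could be wrapped in a UDF the same way as the functions above.

```python
def flatten_unique(rows):
    """Flatten a list of one-field rows and drop duplicates, keeping first-seen order."""
    flat = [value for row in rows for value in row]
    # dict keys keep insertion order (Python 3.7+), unlike a set
    return list(dict.fromkeys(flat))


# One array value as it arrives in the UDF: a list of one-field rows (tuples)
rows = [("caes",), ("uktf",), ("uktf",), ("sathya",)]
unique = flatten_unique(rows)
print(unique, len(unique))  # ['caes', 'uktf', 'sathya'] 3
```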
Here is a solution that'll work in Databricks:
#Import libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
#Define schema
schema1=StructType([
StructField('year',IntegerType(),True),
StructField('month',StringType(),True),
StructField('val',ArrayType(StructType([
StructField('a_id',StringType(),True)
])))
])
#Test data
rowsArr=[
[2010,'01',[{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"}]],
[2010,'01',[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]],
[2008,'10',[{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"}]],
[2008,'10',[{"a_id":"fvds"},{"a_id":"yjndf"},{"a_id":"yesva"}]]
]
#Create dataframe
df1=(spark
.createDataFrame(rowsArr,schema=schema1)
)
#Create database
spark.sql('CREATE DATABASE IF NOT EXISTS testdb')
#Dump it into hive table
(df1
.write
.mode('overwrite')
.saveAsTable('testdb.testtable')
)
#read from hive table
df_ht=(spark
.sql('select * from testdb.testtable')
)
#Perform transformation
df2=(df_ht
.groupBy('year','month')
.agg(array_distinct(flatten(collect_list('val'))).alias('val'))
.withColumn('num',size('val'))
)
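To sanity-check what the groupBy / collect_list / flatten / array_distinct / size chain should produce, here is the same logic in plain Python on the four rows from the question, with val kept as JSON text. It is only a sketch for checking the expected output, not a replacement for the Spark code. Note that each val string is a JSON array of objects, so it parses to a list rather than a single object.

```python
import json
from collections import defaultdict

# (year, month, val) rows, with val as the JSON text shown in the question
rows = [
    ("2010", "01", '[{"a_id":"caes"},{"a_id":"rgvtsa"},{"a_id":"btbsdv"}]'),
    ("2010", "01", '[{"a_id":"caes"},{"a_id":"uktf"},{"a_id":"ohcwa"}]'),
    ("2008", "10", '[{"a_id":"rfve"},{"a_id":"yjndf"},{"a_id":"onbds"}]'),
    ("2008", "10", '[{"a_id":"fvds"},{"a_id":"yjndf"},{"a_id":"yesva"}]'),
]

merged = defaultdict(list)
for year, month, val in rows:
    # json.loads returns a list of dicts here, because the top level is an array
    merged[(year, month)].extend(item["a_id"] for item in json.loads(val))

# Drop duplicates while keeping first-seen order, then count
result = {key: list(dict.fromkeys(ids)) for key, ids in merged.items()}
for (year, month), ids in result.items():
    print(year, month, ids, len(ids))
```

On this input both groups end up with five distinct ids.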
I have a dataframe:
DF:
1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
How can I keep only the latest record for each group? (There are three groups above: 1, 2 and 3.)
Result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
I am also trying to make this efficient (e.g. finishing within a few minutes on a moderate cluster with 100 million records), so any sorting or ordering, if required, should be done in the most efficient and correct manner.
You have to use a window function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
Partition the window by the group and order it by time. The PySpark script below does the work:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
schema = "Group int,time timestamp "
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank',dense_rank().over(w))
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group| time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+
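One caveat with dense_rank: rows that tie on the latest time all get rank 1, so more than one row per group can survive the filter, whereas row_number numbers tied rows 1, 2, ... and keeps exactly one. A small plain-Python sketch of how the two numberings differ, using made-up timestamps with a tie (not Spark code, just the ranking rules):

```python
# Hypothetical times for one group, with a tie on the latest value
times = ["2016-10-12 22:24:25", "2016-10-12 22:24:25", "2016-10-12 20:24:25"]
ordered = sorted(times, reverse=True)  # same order as orderBy(desc('time'))

# row_number: strictly increasing, ties broken arbitrarily
row_number = list(range(1, len(ordered) + 1))

# dense_rank: equal values share a rank, and the next distinct value gets rank + 1
dense_rank = []
for i, t in enumerate(ordered):
    if i == 0:
        dense_rank.append(1)
    elif t == ordered[i - 1]:
        dense_rank.append(dense_rank[-1])
    else:
        dense_rank.append(dense_rank[-1] + 1)

print(row_number)  # [1, 2, 3]
print(dense_rank)  # [1, 1, 2]
```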
You can use window functions as described here for cases like this:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("id").orderBy($"ts".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+
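If only the latest timestamp per group is needed (and no other columns have to be carried along), the result can also be expressed as a per-group maximum, which does not need to rank every row. Below is a plain-Python sketch of that idea on the question's data. It is not Spark code; in Spark the analogous approach would be a groupBy on the id with a max aggregate on the timestamp.

```python
from datetime import datetime

records = [
    (1, "2016-10-12 18:24:25"),
    (1, "2016-11-18 14:47:05"),
    (2, "2016-10-12 21:24:25"),
    (2, "2016-10-12 20:24:25"),
    (2, "2016-10-12 22:24:25"),
    (3, "2016-10-12 17:24:25"),
]

# Single pass: remember the largest timestamp seen so far for each group
latest = {}
for group, ts in records:
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    if group not in latest or parsed > latest[group]:
        latest[group] = parsed

for group in sorted(latest):
    print(group, latest[group])
```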
Is there an easy way to run the same SQL on multiple columns in Spark SQL?
For example, let's say I have a query that should be applied to most columns:
select
min(c1) as min,
max(c1) as max,
max(c1) - min(c1) range
from tb1
If there are multiple columns, is there a way to execute the query for all of them and get the results in one go?
Similar to what df.describe does.
Use the metadata included in your DataFrame (the columns, in this case) to get the column names; if the DataFrame is not already in scope, you can get it via spark.table("<table_name>"). Then apply the functions you want and pass the result to df.select (or df.selectExpr).
Build some test data:
scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> val r = new scala.util.Random
scala> (1 to 1000).foreach(n => { seq = seq :+ (n,r.nextInt,r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Setup the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and functions to apply to populate the select_columns Seq:
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => {select_columns = select_columns :+ expr(s"$f($col)")})})
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
| 1| 1000| -2143898568| 2147289642| 1.8781424E-4| 0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
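The same loop can also generate the SQL text from the question, including the range column. Here is a plain-Python sketch that builds one query string for a list of columns; build_stats_query and the <column>_min / _max / _range alias pattern are made up for the illustration, and the resulting string would then be handed to whatever runs your SQL.

```python
def build_stats_query(table, columns):
    """Build one SELECT that computes min, max and range for every column."""
    parts = []
    for c in columns:
        parts.append(f"min({c}) as {c}_min")
        parts.append(f"max({c}) as {c}_max")
        parts.append(f"max({c}) - min({c}) as {c}_range")
    return "select " + ", ".join(parts) + " from " + table


query = build_stats_query("tb1", ["c1", "c2"])
print(query)
```

Giving every expression its own alias keeps the output columns distinguishable, which the single min / max / range aliases in the original query would not.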
Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!
You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in spark-shell and the pyspark shell (as of now sparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hive Context")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._
    val df = sc.parallelize(
      ("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
    ).toDF("k", "v")
    val w = Window.partitionBy($"k").orderBy($"v")
    df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
  }
}
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group', that won't be a great idea.
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.
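For reference, here is the group / sort / index idea written out in plain Python on the question's data, flattened back to rows with a 1-based counter. It is only a sketch of the logic (not Spark code), and it relies on the dates being ISO-formatted strings so that string order matches date order.

```python
from itertools import groupby

rows = [
    (1, "2015-09-01"),
    (2, "2015-09-01"),
    (1, "2015-09-03"),
    (1, "2015-09-04"),
    (2, "2015-09-04"),
]

# Sort by id, then by date, so each group is contiguous and already ordered
rows_sorted = sorted(rows)

result = []
for _, group in groupby(rows_sorted, key=lambda r: r[0]):
    # enumerate plays the role of zipWithIndex, starting at 1 instead of 0
    for counter, (gid, date) in enumerate(group, start=1):
        result.append((gid, date, counter))

for row in result:
    print(row)
```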
I totally agree that window functions for DataFrames are the way to go if you have Spark 1.5 or later. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this:
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
.toDF("id", "date")
val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")
val dfWithCounter = df.join(dfDuplicate,$"id"===$"idDup")
.where($"date">=$"dateDup")
.groupBy($"id", $"date")
.agg($"id", $"date", count($"idDup").as("counter"))
.select($"id",$"date",$"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. You can also reverse the ordering of the counter by changing the >= to <= in the where statement.
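The idea behind the self-join is that, for the counter shown in the table, a row's value equals the number of rows in the same id group whose date is on or before its own. A plain-Python sketch of just that counting rule (not Spark code), on the same five rows:

```python
rows = [
    (1, "2015-09-01"),
    (2, "2015-09-01"),
    (1, "2015-09-03"),
    (1, "2015-09-04"),
    (2, "2015-09-04"),
]

# For every row, count the rows with the same id and a date on or before its own.
# ISO-formatted date strings compare correctly as plain strings.
with_counter = [
    (i, d, sum(1 for j, e in rows if j == i and e <= d))
    for i, d in rows
]

for row in with_counter:
    print(row)
```

Two things follow from this rule: the work grows quadratically with the size of a group, and rows that share the same id and date would get the same counter, so it only behaves like a row number when dates are unique within a group.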