I am trying to write a Spark DataFrame into HBase, but when I perform any action or call a write/save method on that DataFrame, it throws the following exception:
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.log(HBaseFilter.scala:121)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.buildFilters(HBaseFilter.scala:124)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:60)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
----------
Here is my code:
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
def catalog = s"""{
    |"table":{"namespace":"default", "name":"Contacts"},
    |"rowkey":"key",
    |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
    |"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
    |"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
    |"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
    |}
    |}""".stripMargin

def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
val df = withCatalog(catalog)
I was able to create the DataFrame, but when I call
df.show()
it gives me this error:
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.log(HBaseFilter.scala:121)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.buildFilters(HBaseFilter.scala:124)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:60)
Please give some suggestions. I am importing a table from HBase, creating a catalog, and building a DataFrame on top of it, using:
Spark 1.6
HBase 1.2.0-cdh5.13.3
Cloudera
I encountered the same problem; I was using hbase-spark 1.2.0-cdh5.8.4.
I tried compiling it against version 1.2.0-cdh5.13.0, and after that the error no longer occurred. You should try recompiling the source code or using a higher version.
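For reference, pinning the connector to the newer CDH build in an sbt project might look like the sketch below; the artifact coordinates, versions, and repository URL are assumptions to verify against your own CDH release:
// build.sbt -- a minimal sketch, assuming the Cloudera-packaged hbase-spark module;
// adjust the versions and repository to match your cluster before using it.
resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "1.6.0"           % "provided",
  "org.apache.hbase" %  "hbase-spark" % "1.2.0-cdh5.13.0"
)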
Related
I have a CSV like that:
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
I want to load it having the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV as per structure below:
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
The problem I'm facing is that whenever I load it, the numbers turn into scientific notation, and I cannot persist it back without specifying the precision and scale of my data (I want to use whatever precision is already in the file; I can't infer it).
Here's what I have tried:
Loading it with DoubleType() gives me scientific notation:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
StructField('COL', StringType()),
StructField('VAL', DoubleType())
])
csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
.option("sep",",")
.option("header", "true")
.schema(schema)
.load(csv_file))
df2.show()
+-----+--------------------+
| COL| VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2| 2.000000001234E8|
|TEST3| 9999.1234679123|
+-----+--------------------+
Loading it with DecimalType() requires me to specify precision and scale, otherwise I lose the decimals after the dot. However, when I do specify them, besides the risk of not getting the correct value (my data might be rounded), I get trailing zeros after the dot:
For example, using: StructField('VAL', DecimalType(38, 18)) I get:
[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]
Notice that in this case I get trailing zeros on the right that I don't want in my new file.
The only way I found to address it was a UDF where I first use float() to remove the scientific notation and then convert the value to a string to make sure it is persisted the way I want:
from pyspark.sql.functions import udf

to_decimal = udf(lambda n: str(float(n)))
df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))
Is there any way to achieve the same result without the UDF trick?
Thank you!
The best way I found to address it is below. It still uses a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the accepted answer yet, because I'm still hoping someone will come up with a solution without a UDF (or a good explanation of why it isn't possible).
The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
Load the CSV with a wide DecimalType precision and scale (38, 18):
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
StructField('COL', StringType()),
StructField('VAL', DecimalType(38, 18))
])
csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
.option("sep",",")
.option("header", "true")
.schema(schema)
.load(csv_file))
df2.show(truncate=False)
output:
+-----+----------------------------+
|COL |VAL |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000 |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
When you are ready to report it (print it or save it to a new file), you strip the trailing zeros:
import decimal
import pyspark.sql.functions as F

# Decimal.normalize() strips trailing zeros; with no return type given,
# the UDF produces a StringType column.
normalize_decimals = F.udf(lambda dec: dec.normalize())
(df2
.withColumn('VAL', normalize_decimals(F.col('VAL')))
.show(truncate=False))
output:
+-----+------------------+
|COL |VAL |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234 |
|TEST3|9999.1234679123 |
|TEST4|123456789.01234567|
+-----+------------------+
You can use Spark to do that with a SQL query:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
val sparkConf: SparkConf = new SparkConf(true)
.setAppName(this.getClass.getName)
.setMaster("local[*]")
implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")
val query = "select cast(VAL as decimal(38,18)) as VAL, COL from table"
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 1 year ago.
I have a dataframe:
DF:
1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
How do I keep only the latest record for each group? (There are 3 groups above: 1, 2, 3.)
Result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
I'm also trying to make it efficient (e.g. to finish within a few minutes on a moderate cluster with 100 million records), so any sorting/ordering should be done, if required, in the most efficient and correct manner.
You have to use a window function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
Partition the window by the group and order by time; the PySpark script below does the work:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
schema = "Group int,time timestamp "
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank',dense_rank().over(w))
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group| time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+
You can use window functions as described here for cases like this:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions.row_number
scala> val win = Window.partitionBy("id").orderBy($"ts".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+
When I do a full outer join in PySpark, it does not give any output.
from __future__ import print_function
import sys
import json
import os
from pyspark.conf import SparkConf
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SPARK-TEST") \
    .config("spark.sql.shuffle.partitions", u"00") \
    .config("spark.speculation.interval", "2000ms") \
    .getOrCreate()

col_join = "yearmonth"
intpath = 's3://....'
awspath = 's3://.....'

DF1 = spark.read.option("inferSchema", True).option("header", True).csv(intpath)
DF1.show(10, False)
count = DF1.count()
print("int count is " + str(count))

DF2 = spark.read.option("inferSchema", True).option("header", True).csv(awspath)
DF2.show(10, False)
count = DF2.count()
print("aws count is " + str(count))

DF1.createOrReplaceTempView("int")
DF2.createOrReplaceTempView("aws")

# final_result_notmatching = DF1.join(DF2, DF1[col_join] == DF2[col_join], "FullOuter")
final_result_notmatching = spark.sql(
    "select * from int full outer join aws ON int.{} = aws.{}".format(col_join, col_join))
The output is empty when I do the full outer join. What is wrong with this? Inner and left outer joins are fine.
+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count |
+---------+---------+--------------------+
|2018-10 |1160863 |1160863 |
|2018-11 |1042284 |1042284 |
|2019-01 |172704 |172704 |
|2018-12 |952276 |952276 |
|2018-06 |1177168 |1177168 |
|2018-07 |1183703 |1183703 |
|2018-08 |1183003 |1183003 |
|2018-09 |1176182 |1176182 |
+---------+---------+--------------------+
int count is 8
+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count |
+---------+---------+--------------------+
|2018-06 |1154341 |1154341 |
|2018-08 |1112278 |1112278 |
|2018-11 |6794 |6794 |
|2018-07 |1155195 |1155195 |
|2018-09 |1059808 |1059808 |
|2018-10 |988629 |988629 |
+---------+---------+--------------------+
aws count is 6
The full outer join output is empty:
+---------+---------+--------------------+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count |yearmonth|total_rec|total_unique_count |
+---------+---------+--------------------+---------+---------+--------------------+
+---------+---------+--------------------+---------+---------+--------------------+
Am I doing something wrong? Is there any issue with the query?
I have tried both the DataFrame API and SQL; both give the same result.
Do I need to add any null conditions?
I have an external partitioned Hive table whose underlying files use ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'.
Reading the data via Hive directly is fine, but when using Spark's DataFrame API the delimiter '|' is not taken into account.
Create external partitioned table:
hive> create external table external_delimited_table(value1 string, value2 string)
partitioned by (year string, month string, day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location '/client/edb/poc_database/external_delimited_table';
Create a data file containing just one row and place it in the external partitioned table's location:
shell>echo "one|two" >> table_data.csv
shell>hadoop fs -mkdir -p /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
shell>hadoop fs -copyFromLocal table_data.csv /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
Make partition active:
hive> alter table external_delimited_table add partition (year='2016',month='08',day='20');
Sanity check:
hive> select * from external_delimited_table;
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| external_delimited_table.value1 | external_delimited_table.value2 | external_delimited_table.year | external_delimited_table.month | external_delimited_table.day |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| one                              | two                              | 2016                           | 08                              | 20                            |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
Spark code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
object TestHiveContext {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Test Hive Context")
val spark = new SparkContext(conf)
val hiveContext = new HiveContext(spark)
val dataFrame: DataFrame = hiveContext.sql("SELECT * FROM external_delimited_table")
dataFrame.show()
spark.stop()
  }
}
dataFrame.show() output:
+-------+------+----+-----+---+
| value1|value2|year|month|day|
+-------+------+----+-----+---+
|one|two| null|2016| 08| 20|
+-------+------+----+-----+---+
This turned out to be a problem with Spark version 1.5.0. In version 1.6.0 the issue does not occur:
scala> sqlContext.sql("select * from external_delimited_table")
res2: org.apache.spark.sql.DataFrame = [value1: string, value2: string, year: string, month: string, day: string]
scala> res2.show
+------+------+----+-----+---+
|value1|value2|year|month|day|
+------+------+----+-----+---+
| one| two|2016| 08| 20|
+------+------+----+-----+---+
Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index, grouped by id and sorted by date within that group, thus:
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with a window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!
You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (for now, SparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Hive Context")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(
("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
).toDF("k", "v")
val w = Window.partitionBy($"k").orderBy($"v")
df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
}
}
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap (a sketch follows below), but if you need to perform anything on the 'group', that won't be a great idea.
The downside to using RDDs like this is that it's tedious to convert from a DataFrame to an RDD and back again.
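For completeness, a minimal sketch of the flatMap flattening mentioned above, building on the df defined earlier and producing flat (id, date, counter) rows with a 1-based counter:
// Flatten each sorted, zipped group back into plain (id, date, counter) tuples.
// The 1-based counter matches the desired output in the question.
val flattened = df.rdd
  .groupBy(r => r(0))
  .flatMap { case (_, rows) =>
    rows.toList
      .sortBy(row => row(1).toString)
      .zipWithIndex
      .map { case (row, idx) => (row(0), row(1).toString, idx + 1) }
  }
flattened.collect().foreach(println)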
I totally agree that window functions for DataFrames are the way to go if you have Spark version >= 1.5. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve it:
import sqlContext.implicits._
import org.apache.spark.sql.functions.count

val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
  .toDF("id", "date")

// Self-join on id, keep only pairs whose duplicate date is <= the row's date,
// then count those pairs per (id, date): that count is the running counter.
val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")
val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
  .where($"date" >= $"dateDup")
  .groupBy($"id", $"date")
  .agg(count($"idDup").as("counter"))
  .select($"id", $"date", $"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. You can also reverse the direction of the counter by changing the >= to <= in the where clause.
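If you also want the rows displayed in date order within each id, a simple sort before showing is enough, for example:
// Order the final result by id and date for display purposes.
dfWithCounter.orderBy($"id", $"date").show()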