How to read a key-value pair in Spark SQL?

How do I get this output using Spark SQL or Scala? I have a table with columns storing such values and need to split them into separate columns.
Input: a column containing JSON values such as {"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
Output: separate columns Name, UploaddedById and UploadedByName holding those values

It pretty much depends on which libraries you want to use (as you mentioned, Scala or Spark).
Using Spark:
val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""

import spark.implicits._ // needed for .toDS
val df = spark.read.json(Seq(rawJson).toDS)
df.show() // one column per JSON key
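Since the question mentions the JSON lives in a table column rather than a raw file, a minimal sketch using from_json (Spark 2.1+) could look like this; tableDF and the column name value are assumptions:

import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// assumed schema of the JSON stored in the column
val jsonSchema = StructType(Seq(
  StructField("Name", StringType),
  StructField("UploaddedById", StringType),
  StructField("UploadedByName", StringType)
))

// tableDF is assumed to be the existing table, with the JSON held in a string column named "value"
val split = tableDF
  .withColumn("parsed", from_json(col("value"), jsonSchema))
  .select("parsed.*")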
Using common JSON libraries:
// play
import play.api.libs.json._

Json.parse(rawJson) match {
  case obj: JsObject =>
    val values = obj.values
    val keys = obj.keys
    // construct a DataFrame from the keys and values (see the sketch below)
  case other => // handle other types (JsArray, etc.)
}
// circe
import io.circe._, io.circe.parser._

parse(rawJson) match {
  case Right(json) => // fetch keys and values, construct a DataFrame much like above
  case Left(parseError) => ...
}
You can use almost any JSON library to parse your JSON object and then convert it to a Spark DataFrame quite easily.
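For the "construct a DataFrame" step that both snippets above leave as a comment, here is a rough sketch, assuming a SparkSession named spark and that all JSON values are strings (as in the example above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import play.api.libs.json.{Json, JsObject}

val obj = Json.parse(rawJson).as[JsObject]
val fields = obj.fields.toSeq.map { case (k, v) => k -> v.as[String] }
val schema = StructType(fields.map { case (k, _) => StructField(k, StringType) })
val row = Row(fields.map(_._2): _*)
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)
df.show() // Name, UploaddedById, UploadedByName as separate columns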

Related

Spark Dataframe size check on columns does not work as expected using vararg and if else - Scala

I do not want to use foldLeft or withColumn with when over all columns in a dataframe, but want a select as per https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, embellished with an if-else statement and cols as a vararg. All I want is to replace empty array columns in a Spark dataframe using Scala. I am using size, but it never evaluates the zero (0) case correctly.
val resDF2 = aggDF.select(cols.map { col =>
  (if (size(aggDF(col)) == 0) lit(null) else aggDF(col)).as(s"$col")
}: _*)
if (size(aggDF(col)) == 0) lit(null) does not work here functionally, but it does run, and size(aggDF(col)) returns the correct length if I return that instead.
I am wondering what the silly issue is. Must be something I am obviously overlooking!
if/else won't work with the DataFrame API; it is plain Scala control flow evaluated once on the driver and does not build a Column expression. With DataFrames you need when/otherwise:
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) === 0, lit(null)).otherwise(aggDF(col)).as(s"$col")
}: _*)
This can be simplified further because when without otherwise automatically returns null (i.e. otherwise(lit(null)) is the default):
val resDF2 = aggDF.select(cols.map { col =>
  when(size(aggDF(col)) > 0, aggDF(col)).as(s"$col")
}: _*)
See also https://stackoverflow.com/a/48074218/1138523
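For reference, a minimal, self-contained version of this when/otherwise approach (the sample data and column names below are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{when, size}

object NullifyEmptyArrays extends App {
  val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
  import spark.implicits._

  // two array columns, some of them empty
  val aggDF = Seq(
    (Seq("a", "b"), Seq.empty[String]),
    (Seq.empty[String], Seq("c"))
  ).toDF("arr1", "arr2")

  val cols = aggDF.columns
  // empty arrays become null, non-empty arrays are kept as-is
  val resDF2 = aggDF.select(cols.map { c =>
    when(size(aggDF(c)) > 0, aggDF(c)).as(c)
  }: _*)

  resDF2.show()
}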

Read parquet file having mixed data type in a column

I want to read a parquet file using Spark SQL in which one column has a mixed data type (string and integer).
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")
This throws the exception: Failed to merge incompatible data types IntegerType and StringType.
Is there a way to explicitly cast the column during the read?
The only way that I have found is to manually cast one of the fields so that they match. You can do this by reading the individual parquet files into a sequence and iteratively modifying them, like so:
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]) = {
  dfs.reduce { (x, y) =>
    // pair each column name with its data type
    def schemaTruncate(df: DataFrame) = df.schema.map(f => f.name -> f.dataType)
    // columns (name, type) present in y but not in x
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf // the column exists in x: cast it to y's type
        case Failure(error) => df.withColumn(name, lit(null).cast(dataType)) // missing in x: add it as null
      }
    }
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}
The function above first finds the columns (differing by name or type) that are in Y but not in X. It then adds those columns to X by attempting to cast the existing columns and, on failure, adding the column as a literal null. Finally it selects only Y's columns from the fixed X (in case X has columns that are not in Y) and returns the result of the union.
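For illustration, this is roughly how it could be called, assuming the individual part files under /tmp/data are read one by one (the file names below are made up):

import org.apache.spark.sql.DataFrame

val parts: Seq[DataFrame] = Seq(
  sqlContext.read.parquet("/tmp/data/part-00000.parquet"),
  sqlContext.read.parquet("/tmp/data/part-00001.parquet")
)
val merged: DataFrame = unionReduce(parts)
merged.printSchema() // the mixed column now follows a single schema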

Spark Streaming + Spark SQL

I am trying to process logs via Spark Streaming and Spark SQL. The main idea is to keep a "compacted" dataset in Parquet format for "old" data and convert it to a DataFrame as needed for queries. The compacted dataset is loaded with:
SQLContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame compact = null;
compact = sqlContext.parquetFile("hdfs://auto-ha/tmp/data/logs");
As the uncompacted dataset (I compact the dataset daily) is composed of many files, I would like to keep the current day's data in a DStream in order to make those queries fast.
I have tried the DataFrame approach without results:
DataFrame df = JavaSQLContextSingleton.getInstance(sc.sc()).createDataFrame(lastData, schema);
df.registerTempTable("lastData");
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
    @Override
    public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
        DataFrame df = JavaSQLContextSingleton.getInstance(v1.context()).createDataFrame(v1, schema);
        // ...drop old data from the lastData table
        df.insertInto("lastData");
    }
});
Using this approach I do not get any results if I query the temp table from a different thread, for example.
I have also tried to use the RDD transform method; more specifically, I tried to follow the Spark example where I create an empty RDD and then union the DStream's RDD contents with the empty RDD:
JavaRDD<Row> lastData = sc.emptyRDD();
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
    @Override
    public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
        lastData.union(v1).filter(/* keep only recent data */);
    }
});
This approach does not work either, as I do not get any contents in lastData.
Could I use windowed computations or updateStateByKey for this purpose?
Any suggestions?
Thanks for your help!
Well, I finally got it.
I use an updateState function and drop the state (return Optional.absent()) if the timestamp is older than 24 hours, like this:
final static Function2<List<Long>, Optional<Long>, Optional<Long>> RETAIN_RECENT_DATA
        = (List<Long> values, Optional<Long> state) -> {
    Long newSum = state.or(0L);
    for (Long value : values) {
        newSum += value;
    }
    // currentTimeMillis uses UTC
    if (System.currentTimeMillis() - newSum > 86400000L) {
        return Optional.absent(); // older than 24 hours: drop the state
    } else {
        return Optional.of(newSum);
    }
};
Then on each batch I register the DataFrame as a temp table:
finalsum.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
    if (!rdd.isEmpty()) {
        HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
        if (sqlContext1.cacheManager().isCached("alarm_recent")) {
            sqlContext1.uncacheTable("alarm_recent");
        }
        DataFrame wordsDataFrame = sqlContext1.createDataFrame(rdd, schema);
        wordsDataFrame.registerTempTable("alarm_recent");
        wordsDataFrame.cache();
        wordsDataFrame.first(); // force evaluation so the cache is populated
    }
    return null;
});
You could use mapWithState with Spark 1.6.
The mapWithState function is much more efficient and easier to implement.
Take a look at this link.
mapWithState supports useful functionality like state timeouts and an initialRDD, which come in handy when maintaining a stateful DStream.
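For illustration only, here is a minimal Scala sketch of mapWithState with a state timeout (the question uses the Java API, and the key/value types and stream name below are assumptions):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// keep a running sum per key, dropping state that has been idle for 24 hours
def trackState(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    None // the state expired: emit nothing for this key
  } else {
    val newSum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
    state.update(newSum)
    Some((key, newSum))
  }
}

val spec = StateSpec.function(trackState _).timeout(Minutes(24 * 60))
// keyedStream is assumed to be a DStream[(String, Long)]
val recent = keyedStream.mapWithState(spec)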
Thanks
Manas

Strings concatenation in Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
  .join(
    event.as('e),
    Inner,
    Option("s.codeA".attr === "e.codeA".attr))
  .select("1" + "s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know I have the concat function available, but it seems Spark SQL doesn't support this feature. Can somebody suggest a workaround for my issue?
Thank you.
Note:
I'm using Language Integrated Queries, but I could also use a "standard" Spark SQL query if that leads to a solution.
The output you add at the end does not seem to be part of your selection or your SQL logic, if I understand correctly. Why don't you format the output as a further step?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => ("1" + t(0), t(1))).collect()
It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
  override type EvaluatedType = String
  override def foldable: Boolean = children.forall(_.foldable)
  def nullable: Boolean = children.exists(_.nullable)
  def dataType: DataType = StringType
  // evaluate each child expression and concatenate the results
  def eval(input: Row = null): EvaluatedType = {
    children.map(_.eval(input)).mkString
  }
}
val result = sim.as('s)
  .join(
    event.as('e),
    Inner,
    Option("s.codeA".attr === "e.codeA".attr))
  .select(Concat("1", "s.codeA".attr), "e.name".attr)
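As a side note, on newer Spark versions the built-in concat function from org.apache.spark.sql.functions (available since around Spark 1.5) covers this directly, so the custom Expression is only needed on the older version the question targets. A rough sketch, assuming sim and event are DataFrames with the columns from the question:

import org.apache.spark.sql.functions.{concat, lit, col}

val result = sim.join(event, Seq("codeA"))
  .select(concat(lit("1"), col("codeA")).as("codeA"), col("name"))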

Spark JavaPairRDD iteration

How can I iterate over a JavaPairRDD? I have done a groupBy and got back an RDD as below: a JavaPairRDD with a Tuple7 of Strings as the key and a List of objects as the value.
Now I have to iterate over this RDD and do some calculations, like FOREACH in Pig.
Basically I would like to iterate over the key and the list of values, do some operations, and then return a JavaPairRDD?
JavaPairRDD<Tuple7<String, String, String, String, String, String, String>, List<Records>> sizes =
    piTagRecordData.groupBy(new Function<Records, Tuple7<String, String, String, String, String, String, String>>() {
        private static final long serialVersionUID = 2885738359644652208L;

        @Override
        public Tuple7<String, String, String, String, String, String, String> call(Records row) throws Exception {
            Tuple7<String, String, String, String, String, String, String> compositeKey =
                new Tuple7<String, String, String, String, String, String, String>(
                    row.getAsset_attribute_id(), row.getDate_time_value(), row.getOperation(), row.getPi_tag_count(),
                    row.getAsset_id(), row.getAttr_name(), row.getCalculation_type());
            return compositeKey;
        }
    });
After this I want to perform an operation FOR EACH member of sizes (the JavaPairRDD), something like:
rejected_records = FOREACH sizes GENERATE FLATTEN(Java function on the List of Records based on the group key)
I am using Spark 0.9.0
Even though you are talking about "FOR EACH", it really sounds like you want the flatMap operation, since you want to produce new values and flatten them. This is available for Java RDDs, including a JavaPairRDD.
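A toy Scala sketch of the flatMap idea (the question uses the Java API, where the equivalent is flatMap with a FlatMapFunction; the data below is invented and the grouped values are simplified):

// sc is an existing SparkContext; the grouped pair RDD is simplified to (String, Seq[Int])
val sizes = sc.parallelize(Seq(
  ("keyA", Seq(1, 2, 3)),
  ("keyB", Seq(4, 5))
))

// each group can emit zero or more output records, similar to Pig's FOREACH ... GENERATE FLATTEN
val rejected = sizes.flatMap { case (key, records) =>
  records.filter(_ % 2 == 0).map(r => (key, r))
}
rejected.collect().foreach(println)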
You can use the void foreach(VoidFunction<T> f) method. More info and methods: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#foreach(org.apache.spark.api.java.function.VoidFunction)
If you want to view some values of a JavaPairRDD, I would do it like this:
for (Tuple2<String, String> test : pairRdd.take(10)) { // or pairRdd.collect()
    System.out.println(test._1);
    System.out.println(test._2);
}
Note: Tuple2<String, String> assumes you have strings inside the JavaPairRDD; change the type parameters according to the data types stored in the JavaPairRDD.