Spark: Select specific Index of value of type Any - dataframe

So I have a DataFrame in which one of the columns is of type WrappedArray(JSON). In each of these, the JSON has the format [String, String].
I have successfully accessed the inside of the array, and now I have a column of type Any in which each value holds a [String, String].
The point is: I just want to take the first of these two strings, but if I try something like column(0), it raises an error since Any has no index. How can I access this value?
My code for now is:
val schema = StructType(Seq(
  StructField("productId", StringType, true),
  StructField("name", StringType, true)
))

df.withColumn("column", from_json($"column"(0), schema))
And the schema of my df:
root
|-- customerId: string (nullable = true)
|-- column: struct (nullable = true)
| |-- productId: string (nullable = true)
| |-- name: string (nullable = true)
|-- date: date (nullable = true)

I managed to solve the problem myself. The answer was quite obvious: instead of creating a column of struct type containing the two values, I created a MapType with the same values.
My final code:
df.withColumn("column", from_json($"column"(0), MapType(StringType, StringType)) )
And then, for accessing the keys and values of the new column:
.select("column.productId", "column.name")

Related

In PySpark/Spark dataframe join, how to update column values only when there is a match found?

I am joining two dataframes: customer, price
customer dataframe schema:
customer:
root
|-- NAME: string (nullable = true)
|-- C_CODE: string (nullable = true)
|-- C_OPTION: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- CUSTOMER_EXPENSES: string (nullable = true)
|-- CUSTOMER_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
price dataframe schema:
price:
root
|-- PRICE_ID: string (nullable = true)
|-- PRICE_CODE: string (nullable = true)
|-- PRICE_RANGE: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- PRICE_DEFAULT: int (nullable = true)
|-- PRODUCT_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
I have the keys price.PRICE_CODE == customer.C_CODE & price.PRODUCT_LOCATION == customer.CUSTOMER_LOCATION to join the tables.
case 1: If there is a matching record in the price dataset per PRICE_CODE, PRODUCT_LOCATION then use CUSTOMER_EXPENSES from the customer dataset in a new column.
case 2: If there is no match, then use average of the column: PRICE_DEFAULT in the same new column.
I used a left join to join the price and customer tables, so the keys of the customer table (C_CODE & CUSTOMER_LOCATION) will be NULL if there is no match.
I joined them in the below fashion:
join_df = price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')
I have taken the average of PRICE_DEFAULT from the price table as below:
price_avg = str(join_df.select(avg('PRICE_DEFAULT')).collect()[0][0])
Now I will have all key entries from price table and NULLs from customer table that are not matching.
Case 1 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
        when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES))
Case 2 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
        when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES)
        .when((col('C_CODE').isNull()) & (col('CUSTOMER_LOCATION').isNull()), price_avg))
I am facing a problem here.
When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I see the values in the new column derived as expected, i.e. the value of customer.CUSTOMER_EXPENSES.
But when the columns C_CODE & CUSTOMER_LOCATION are NULL, I still see the value of customer.CUSTOMER_EXPENSES instead of the average value.
Sample output:
The rows marked in yellow (rows 18-24) come out fine, but the rows marked in red (rows 2-17) still contain the same values of PRICE_DEFAULT instead of price_avg = AVERAGE(PRICE_DEFAULT).
Am I applying the JOIN+WHEN conditions correctly for cases 1 & 2?
Could anyone let me know what the mistake is so I can fix it? Any suggestions or help is really appreciated.
I can see a few issues here:
The combination of price.PRICE_CODE and price.PRICE_DEFAULT is apparently not unique (judging from your picture). This will lead to duplication in the case of a successful join, and I'm not sure whether that's intended.
"I am facing a problem here. When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I am seeing the values in the new column derived as expected. i.e. value of customer.CUSTOMER_EXPENSES" The customer.CUSTOMER_EXPENSES column is not visible in the image you provided. However, in the rows marked red, if there are no matching rows on the right hand side of the join, where would the values of customer.CUSTOMER_EXPENSES even come from? Something is off. It almost seems like in the image the derived column is actually derived from price.PRICE_DEFAULT, not from customer.CUSTOMER_EXPENSES. Are you sure that your posted code produces these exact results? Since you're writing "are still containing the same values of PRICE_DEFAULT" (which should read CUSTOMER_EXPENSES), perhaps the column names get mixed up somewhere in the process?
The duplicate when() statements are not required; an otherwise() in place of the second when() is sufficient in your case. However, there's an even simpler solution available using coalesce(). Using join_df from your definition, simply do:
from pyspark.sql import functions as f

price_avg = join_df.select(f.avg("PRICE_DEFAULT")).collect()[0][0]  # no string conversion required
result = join_df.drop(price.PRODUCT_NAME).withColumn(
    "derived", f.coalesce(f.col("CUSTOMER_EXPENSES"), f.lit(price_avg))
)  # coalesce takes the first non-null value, i.e. the price_avg literal for rows with no match

Remove duplicate column name in a Pyspark Dataframe from a json column nested object

So I have a dataframe that I gather from a table in my database. When I read it from the database, the json column becomes a string in my dataframe; no problem, I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe (I have removed a lot of irrelevant properties from the json, but trust me, there are a lot):
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside of the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the json also has a property called Id, I would like for that to be untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used a regex substitution on JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1'))
This removes the quotes on each side of the Id value (the $1 backreference keeps the digits), so it will be interpreted as a long instead of a string.
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)

Spark elasticsearch connector: how to select _id field?

I'm using the spark elasticsearch connector and I want to extract from ES some fields plus the _id.
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
Unfortunately, the _id field is not recognized:
AnalysisException: u'cannot resolve \'`_id`\' given input columns:
[query, size, #version, #timestamp,
sourceinfo, signaletic, document, metadata, fields, aggs]
With this mapping I can select, for example, document.{fieldA}, sourceinfo.{fieldB}, etc., but not _id. It's as if it were at a level too high in the mapping.
Any idea how to get this particular field?
You can access _id and the rest of the metadata by setting es.read.metadata to true.
For example, in your code:
val myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("query", myquery)
  .option("pushdown", "true")
  .option("es.read.metadata", "true") // set this to true
  .load("myindex/mytype")
  .limit(10)
  .select("myfield", "_metadata") // _id is now available inside the _metadata map
When you do
df.printSchema()
it will print something like this (in my case):
|-- user_id: string (nullable = true)
|-- user_rt: string (nullable = true)
|-- user_rt_id: string (nullable = true)
|-- username: string (nullable = true)
|-- video: long (nullable = true)
|-- _metadata: map (nullable = true) # this map column will be added
| |-- key: string
| |-- value: string (valueContainsNull = true)
You can further extract properties like _index, _type, _id, _score, and sort from the _metadata column.
To access the _id value from the _metadata column (map type) I have used the following; you can use whatever you think works best:
df.createOrReplaceTempView('temp_table_name')
spark.sql("""
SELECT
_metadata._id as reference_id
FROM temp_table_name
""")
Hope this solves your problem
Some special characters, like _ and ₹, are not supported in Spark SQL. Please try replacing the special characters with another name, like "Id".

Issue while creating spark datasets

I am trying to read a JSON file into a dataframe and then convert it to a dataset.
I am facing the issue below. Could someone quickly help me?
The personDF dataframe was created successfully, and below is its schema.
scala> personDF.printSchema();
root
|-- personDF: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- line1: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- state: string (nullable = true)
| |-- type1: string (nullable = true)
I created a case class to capture the above DF.
scala> case class address1(city:String,line1:String,postalCode:String,state:String,type1:String)
defined class address1
Below is the data that personDF has at the moment
scala> personzDF.show()
+--------------------+
| personDF|
+--------------------+
|[CENTERPORT,5 PRO...|
|[HUNTINGTON,94 JA...|
|[RIVERHEAD,9 PATT...|
|[NORTHPORT,50 LIS...|
|[NORTHPORT,24 LAU...|
|[NORTHPORT,340 SC...|
|[GREENLAWN,166 BR...|
|[MELVILLE,1 MERID...|
+--------------------+
Finally while I am creating dataset, I am getting the below error.
scala> val ds = personDF.as[address1]
<console>:32: error: overloaded method value as with alternatives:
(alias: Symbol)org.apache.spark.sql.DataFrame <and>
(alias: String)org.apache.spark.sql.DataFrame
does not take type parameters
val ds = personDF.as[address1]
I did a Google search and could not find the reason.
Thanks,
Sivaram
Actually, personDF is an array of the struct type personDF, but it is not in the structure you are expecting from your printSchema():
|-- personDF: struct (nullable = true)
Are you trying to convert personzDF to the address1 type? Then try something like this:
val ds = personzDF.map(rec => rec.split(","))
  .map(rec => address1(rec(0), rec(1), rec(2), rec(3), rec(4)))
  .toDF()
Hope it helps.
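If personDF really does hold the struct column shown in your printSchema(), another option (a sketch that assumes Spark 2.x with a SparkSession in scope) is to flatten the struct first and then call as[address1]:
import spark.implicits._ // provides the Encoder for the case class

// Flatten the nested struct so the column names line up with address1's fields,
// then convert to a typed Dataset.
val ds = personDF
  .select($"personDF.city", $"personDF.line1", $"personDF.postalCode",
    $"personDF.state", $"personDF.type1")
  .as[address1]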

How to query JSON data column using Spark DataFrames?

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using Spark and the spark-cassandra-connector:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, the schema can be determined using the schema_of_json function (please note that this assumes an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("k", StringType, true),
  StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields as individual strings which can be further cast to the expected types.
The path argument is expressed using dot syntax, with a leading $. denoting the document root (since the code above uses string interpolation, $ has to be escaped, hence $$.).
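For example, the extracted v string could be cast back to a double (a minimal sketch):
df.select(get_json_object($"jsonData", "$.v").cast("double").alias("v"))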
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map {
  case Row(key: String, json: String) =>
    s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
import org.apache.spark.sql.functions.udf

case class KV(k: String, v: Int)

val parseJson = udf((s: String) => {
  implicit val formats = net.liftweb.json.DefaultFormats
  parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
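For example, sampling roughly 10% of the rows for schema inference might look like this in Scala (a sketch; the Python reader accepts the same samplingRatio option):
val json_schema = spark.read
  .option("samplingRatio", "0.1") // infer the schema from ~10% of the JSON rows
  .json(df.select($"jsonData").as[String])
  .schema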
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
// You can define whatever struct type your json contains
val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra.
sqlContext.read.json(rdd)
  .select("column_name1", "column_name2", "column_name3") // or whichever field names appear in the JSON
  .write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
  .mode(SaveMode.Append)
  .save()
I use the following (available since Spark 2.2.0; I am assuming that your JSON string column is at column index 0):
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}

def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0))(Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
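Usage would then look roughly like this (a hypothetical call, assuming jsonData is the column holding the JSON strings):
val jsonDf = parse(df.select("jsonData"), spark)
jsonDf.printSchema()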