In PySpark/Spark dataframe join, how to update column values only when there is a match found?

I am joining two dataframes: customer, price
customer dataframe schema:
customer:
root
|-- NAME: string (nullable = true)
|-- C_CODE: string (nullable = true)
|-- C_OPTION: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- CUSTOMER_EXPENSES: string (nullable = true)
|-- CUSTOMER_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
price dataframe schema:
price:
root
|-- PRICE_ID: string (nullable = true)
|-- PRICE_CODE: string (nullable = true)
|-- PRICE_RANGE: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- PRICE_DEFAULT: int (nullable = true)
|-- PRODUCT_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
I have the keys price.PRICE_CODE == customer.C_CODE & price.PRODUCT_LOCATION == customer.CUSTOMER_LOCATION to join the tables.
Case 1: If there is a matching record in the price dataset per PRICE_CODE and PRODUCT_LOCATION, then use CUSTOMER_EXPENSES from the customer dataset in a new column.
Case 2: If there is no match, then use the average of the PRICE_DEFAULT column in the same new column.
I used a left join to join the price and customer tables, so the customer-table keys C_CODE and CUSTOMER_LOCATION will be NULL if there is no match.
I joined them in the below fashion:
join_df = price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')
I have taken the average of PRICE_DEFAULT from the price table as below.
price_avg = str(join_df.select(avg('PRICE_DEFAULT')).collect()[0][0])
Now I have all key entries from the price table, and NULLs in the customer-table columns where there is no match.
Case 1 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
                when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES))
Case 2 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
                when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES)
                .when((col('C_CODE').isNull()) & (col('CUSTOMER_LOCATION').isNull()), price_avg))
I am facing a problem here.
When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I am seeing the values in the new column: derived as expected. i.e. value of customer.CUSTOMER_EXPENSES
But when the columns C_CODE & CUSTOMER_LOCATION are NULL, instead of the average value, I still see the value of customer.CUSTOMER_EXPENSES
Sample output:
The rows marked in yellow (rows 18-24) are coming out fine. But the rows marked in red (rows 2-17) are still containing the same values of PRICE_DEFAULT instead of the price_avg = AVERAGE(PRICE_DEFAULT)
Am I applying the JOIN+WHEN conditions correctly for cases 1 & 2?
Could anyone let me know what the mistake is so I can fix it? Any suggestions or help are really appreciated.

I can see a few issues here:
The combination of price.PRICE_CODE and price.PRICE_DEFAULT is apparently not unique (judging from your picture). This will lead to duplication in the case of a successful join, and I'm not sure whether that's intended.
"I am facing a problem here. When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I am seeing the values in the new column derived as expected. i.e. value of customer.CUSTOMER_EXPENSES" The customer.CUSTOMER_EXPENSES column is not visible in the image you provided. However, in the rows marked red, if there are no matching rows on the right hand side of the join, where would the values of customer.CUSTOMER_EXPENSES even come from? Something is off. It almost seems like in the image the derived column is actually derived from price.PRICE_DEFAULT, not from customer.CUSTOMER_EXPENSES. Are you sure that your posted code produces these exact results? Since you're writing "are still containing the same values of PRICE_DEFAULT" (which should read CUSTOMER_EXPENSES), perhaps the column names get mixed up somewhere in the process?
The duplicate when() statements are not required; an otherwise() in place of the second when() is sufficient in your case. However, there's an even simpler solution available using coalesce(). Using join_df from your definition, simply do:
import pyspark.sql.functions as f

price_avg = join_df.select(f.avg("PRICE_DEFAULT")).collect()[0][0]  # no string conversion required
result = join_df.drop(price.PRODUCT_NAME).withColumn(
    "derived", f.coalesce(f.col("CUSTOMER_EXPENSES"), f.lit(price_avg))
)  # coalesce takes the first non-null value, i.e. the price_avg literal for rows with no match
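If you want to stay with when(), the otherwise() variant described above could look like the following minimal sketch (assuming join_df, price and price_avg as defined here); note that the null check has to be done on a column coming from the customer side of the left join, since the price-side columns are never null after a left join:

import pyspark.sql.functions as f

result = join_df.drop(price.PRODUCT_NAME).withColumn(
    "derived",
    f.when(f.col("CUSTOMER_EXPENSES").isNotNull(), f.col("CUSTOMER_EXPENSES"))  # case 1: a match was found
     .otherwise(f.lit(price_avg))  # case 2: no match, fall back to the average
)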

Related

Spark: Select specific Index of value of type Any

So I have a DataFrame in which one of the columns is of type WrappedArray(JSON). In each of these, the JSON has a format [String, String]
I have successfully accessed the inside of the array and now I have a column of type Any in which each value has a [String, String].
Point is: I just want to take the first value of these two strings, but if I try something like column(0), it raises an error since Any has no index. How can I access this value?
My code for now is:
val schema = StructType(Seq(
  StructField("productId", StringType, true),
  StructField("name", StringType, true)
))
df.withColumn("column", from_json($"column"(0), schema))
And the schema of my df:
root
|-- customerId: string (nullable = true)
|-- column: struct (nullable = true)
| |-- productId: string (nullable = true)
| |-- name: string (nullable = true)
|-- date: date (nullable = true)
I managed to solve the problem myself. The answer was quite obvious: instead of creating a column of struct type containing the two values, I created a MapType with the same values.
My final code:
df.withColumn("column", from_json($"column"(0), MapType(StringType, StringType)) )
And then, for accessing the keys and values of the new column:
.select("column.productId", "column.name")

Remove duplicate column name in a Pyspark Dataframe from a json column nested object

So I have a dataframe that I gather from a table in my database. When I read it from the database, the json column becomes a string in my dataframe; no problem, I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe. I have removed a lot of irrelevant properties from the json, but trust me, there are a lot:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the json also has a property called Id; I would like that one to be left untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used regex substitution on the JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
This removes the " on each side of the Id, which means they will be interpreted as long instead of as a string.
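As a quick illustration of the substitution (a hypothetical one-row sample, assuming an active spark session), the quotes are dropped while the captured digits are kept:

from pyspark.sql import functions as F

sample = spark.createDataFrame([('{"object":{"Id":"123","Name":"x"}}',)], ["JsonCol"])
sample.withColumn(
    "JsonCol", F.regexp_replace(F.col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1')
).show(truncate=False)
# prints {"object":{"Id":123,"Name":"x"}}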
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)

Cannot resolve column from input columns after second join

I am trying to join a dataframe with 2 separate dataframes which have a common parent, but I'm getting an error.
This is the schema
demoDf
root
|-- id: integer (nullable = true)
parentDf
root
|-- id: integer (nullable = true)
|-- value: string (nullable = true)
This is the scala code
val demoDf: DataFrame
val parentDf: DataFrame
val childDf1 = parentDf.filter(col("id") > 50)
val childDf2 = parentDf.filter(col("id") < 50)
demoDf.alias("demoDf")
.join(childDf1.alias("childDf1"), $"demoDf.id" === $"childDf1.id")
.join(childDf2.alias("childDf2"), $"demoDf.id" === $"childDf2.id")
The error I get is the following; as you can see, demoDf.id is present:
org.apache.spark.sql.AnalysisException: cannot resolve '`demoDf.id`' given input columns: [demoDf.id, childDf1.id, childDf1.value, childDf2.id, childDf2.value];;
The expected output should be a proper join. I wonder what I'm missing. On a side note, if I simply join childDf1 with demoDf it works; this error only comes up after I add the second join.

Spark elasticsearch connector: how to select _id field?

I'm using the spark elasticsearch connector and I want to extract from ES some fields plus the _id.
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
Unfortunately, the _id field is not recognized:
AnalysisException: u'cannot resolve \'`_id`\' given input columns:
[query, size, #version, #timestamp,
sourceinfo, signaletic, document, metadata, fields, aggs]
With this mapping I can select, for example, document.{fieldA}, sourceinfo.{fieldB}, etc., but not _id. It's as if it sits at a level too high in the mapping.
Any idea how to get this particular field?
You can access _id and the rest of the metadata by setting es.read.metadata to true.
For example in your code:
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.option("es.read.metadata",True) # Set it to True
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
When you do
df.printSchema()
it will print something like (in my case):
|-- user_id: string (nullable = true)
|-- user_rt: string (nullable = true)
|-- user_rt_id: string (nullable = true)
|-- username: string (nullable = true)
|-- video: long (nullable = true)
|-- _metadata: map (nullable = true) # this map column will be added
| |-- key: string
| |-- value: string (valueContainsNull = true)
You can further extract properties like _index, _type, _id, _score, and sort from the _metadata column.
To access the _id column from _metadata (a map type) I have used the following; you can use whatever you think is best:
df.createOrReplaceTempView('temp_table_name')
spark.sql("""
SELECT
_metadata._id as reference_id
FROM temp_table_name
""")
Hope this solves your problem
Some of the special characters like _ and ₹ are not supported in Spark SQL. Please try to replace the special characters with another character, like "Id".

Spark SQL - How to select on dates stored as UTC millis from the epoch?

I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as:
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
The complete schema is as follows:
scala> accEvt.printSchema
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- appId: integer (nullable = true)
|-- cId: long (nullable = true)
|-- data: struct (nullable = true)
| |-- expires: struct (nullable = true)
| | |-- $date: long (nullable = true)
| |-- metadata: struct (nullable = true)
| | |-- another key: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- field: string (nullable = true)
| | |-- flavors: string (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- location1: string (nullable = true)
| | |-- location2: string (nullable = true)
| | |-- test: string (nullable = true)
| | |-- testKey: string (nullable = true)
| | |-- testKey2: string (nullable = true)
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: integer (nullable = true)
|-- originationDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- processedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- receivedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
and my goal is to write queries along the lines of:
SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]
My process thus far has been:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@29200d25
scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")
...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:103
scala> accEvt.registerAsTable("accomplishmentEvent")
(At this point the following base line query executes successfully)
scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]
Now, the voodoo that I cannot get right is how to form my select statement to reason about the dates. For example the following executes w/o error, but returns zero rather than the count of all records as it should (74475).
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]
I have also tried some ugliness like:
scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014
scala> val today = now.getTime
today: Long = 1414613115743
scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)
As recommended, I've selected on a named field to ensure that works. So:
scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)
returns:
[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]
Then extending to try and get some kind of dates working I have tried:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)
Results in the error:
java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...
Prefixing our field name with $ as also suggested results in a different kind of error:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found
select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5
Clearly I am not getting how to select on dates stored in this way - can anyone help me fill in this gap?
I am newer to both Scala and Spark, so forgive me if this is an elementary question, but my searches have turned up empty on the forums and Spark documentation.
Thank you.
Your JSON is not flat, so the fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you'll need to do numerical comparisons on them and it looks like you were on the right track for doing those.
An additional problem is that your field names have "$" characters, and Spark SQL won't let you query on them. One solution is that instead of reading the JSON directly as a SchemaRDD (as you have done) you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.
val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
I've tested this with Spark 1.1.0.
For reference, the lack of quoting capability in Spark SQL has been noted in this bug report and perhaps others, and it seems that the fix was recently checked in, but it will take some time to make it into a release.