Cannot resolve column from input columns after second join - sql

I am trying to join a DataFrame with two separate DataFrames that share a common parent, but I'm getting an error.
These are the schemas:
demoDf
root
|-- id: integer (nullable = true)
parentDf
root
|-- id: integer (nullable = true)
|-- value: string (nullable = true)
This is the Scala code:
val demoDf: DataFrame
val parentDf: DataFrame
val childDf1 = parentDf.filter(col("id") > 50)
val childDf2 = parentDf.filter(col("id") < 50)
demoDf.alias("demoDf")
.join(childDf1.alias("childDf1"), $"demoDf.id" === $"childDf1.id")
.join(childDf2.alias("childDf2"), $"demoDf.id" === $"childDf2.id")
The error I get is the following. As you can see, demoDf.id is present in the input columns:
org.apache.spark.sql.AnalysisException: cannot resolve '`demoDf.id`' given input columns: [demoDf.id, childDf1.id, childDf1.value, childDf2.id, childDf2.value];;
The expected output should be a proper join. I wonder what I'm missing. On a side note, if I simply join childDf1 with demoDf it works; this error only comes up after I add the second join.
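A workaround that usually sidesteps this kind of alias-resolution failure (a hedged sketch rather than a root-cause fix; the c1_/c2_ names are made up for illustration) is to give the child DataFrames distinct column names before joining, so no qualified names are needed at all:

import org.apache.spark.sql.functions.col

val c1 = childDf1.select(col("id").as("c1_id"), col("value").as("c1_value"))
val c2 = childDf2.select(col("id").as("c2_id"), col("value").as("c2_value"))

demoDf
  .join(c1, col("id") === col("c1_id"))
  .join(c2, col("id") === col("c2_id"))

With unique column names the analyzer has nothing left to mis-resolve; the cost is renaming the columns back afterwards if the original names are needed.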

In PySpark/Spark dataframe join, how to update column values only when there is a match found?

I am joining two dataframes: customer, price
customer dataframe schema:
customer:
root
|-- NAME: string (nullable = true)
|-- C_CODE: string (nullable = true)
|-- C_OPTION: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- CUSTOMER_EXPENSES: string (nullable = true)
|-- CUSTOMER_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
price dataframe schema:
price:
root
|-- PRICE_ID: string (nullable = true)
|-- PRICE_CODE: string (nullable = true)
|-- PRICE_RANGE: string (nullable = true)
|-- C_MATERIAL: string(10,0) (nullable = true)
|-- CID: string (nullable = true)
|-- PRICE_DEFAULT: int (nullable = true)
|-- PRODUCT_LOCATION: string (nullable = true)
|-- PRODUCT_NAME: string (nullable = true)
I have the keys price.PRICE_CODE == customer.C_CODE & price.PRODUCT_LOCATION == customer.CUSTOMER_LOCATION to join the tables.
case 1: If there is a matching record in the price dataset per PRICE_CODE, PRODUCT_LOCATION then use CUSTOMER_EXPENSES from the customer dataset in a new column.
case 2: If there is no match, then use average of the column: PRICE_DEFAULT in the same new column.
I used left join to join price and customer tables. So the keys of customer table: C_CODE & CUSTOMER_LOCATION will be NULL if there is no match.
I joined them in the below fashion:
join_df = price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')
I have taken the average of PRICE_DEFAULT from price table as below.
price_avg = str(join_df.select(avg('PRICE_DEFAULT')).collect()[0][0])
Now I will have all key entries from price table and NULLs from customer table that are not matching.
case 1 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
        when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES))
Case 2 implementation:
price.join(customer, on=(price['PRICE_CODE'] == customer['C_CODE']) & (price['PRODUCT_LOCATION'] == customer['CUSTOMER_LOCATION']), how='left')\
    .drop(price.PRODUCT_NAME)\
    .withColumn('derived',
        when((col('PRICE_CODE').isNotNull()) & (col('PRODUCT_LOCATION').isNotNull()), customer.CUSTOMER_EXPENSES)
        .when((col('C_CODE').isNull()) & (col('CUSTOMER_LOCATION').isNull()), price_avg))
I am facing a problem here.
When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I am seeing the values in the new column: derived as expected. i.e. value of customer.CUSTOMER_EXPENSES
But when the columns C_CODE & CUSTOMER_LOCATION are NULL, instead of the average value, I still see the value of customer.CUSTOMER_EXPENSES
Sample output:
The rows marked in yellow (rows 18-24) are coming out fine. But the rows marked in red (rows 2-17) are still containing the same values of PRICE_DEFAULT instead of the price_avg = AVERAGE(PRICE_DEFAULT)
Am I applying the JOIN+WHEN conditions correctly for cases 1 & 2?
Could anyone point out the mistake so I can fix it? Any suggestions or help are really appreciated.
I can see a few issues here:
The combination of price.PRICE_CODE and price.PRICE_DEFAULT is apparently not unique (judging from your picture). This will lead to duplication in the case of a successful join, and I'm not sure whether that's intended.
"I am facing a problem here. When the columns PRICE_CODE & PRODUCT_LOCATION are not null, I am seeing the values in the new column derived as expected. i.e. value of customer.CUSTOMER_EXPENSES" The customer.CUSTOMER_EXPENSES column is not visible in the image you provided. However, in the rows marked red, if there are no matching rows on the right hand side of the join, where would the values of customer.CUSTOMER_EXPENSES even come from? Something is off. It almost seems like in the image the derived column is actually derived from price.PRICE_DEFAULT, not from customer.CUSTOMER_EXPENSES. Are you sure that your posted code produces these exact results? Since you're writing "are still containing the same values of PRICE_DEFAULT" (which should read CUSTOMER_EXPENSES), perhaps the column names get mixed up somewhere in the process?
The duplicate when() statements are not required; an otherwise() in place of the second when() is sufficient in your case. However, there's an even simpler solution available using coalesce(). Using join_df from your definition, simply do:
import pyspark.sql.functions as f

price_avg = join_df.select(f.avg("PRICE_DEFAULT")).collect()[0][0]  # no string conversion required
result = join_df.drop(price.PRODUCT_NAME).withColumn(
    "derived", f.coalesce(f.col("CUSTOMER_EXPENSES"), f.lit(price_avg))
)  # coalesce takes the first non-null value, i.e. the price_avg literal for rows with no match

Spark: Select specific Index of value of type Any

So I have a DataFrame in which one of the columns is of type WrappedArray(JSON). In each of these, the JSON has the format [String, String].
I have successfully accessed the inside of the array and now I have a column of type Any in which each value has a [String, String].
Point is: I just want to take the first value of these two strings, but if I try something like column(0), it raises an error since Any has no index. How can I access this value?
My code for now is:
val schema = StructType(Seq(
  StructField("productId", StringType, true),
  StructField("name", StringType, true)
))
df.withColumn("column", from_json($"column"(0), schema))
And the schema of my df:
root
|-- customerId: string (nullable = true)
|-- column: struct (nullable = true)
| |-- productId: string (nullable = true)
| |-- name: string (nullable = true)
|-- date: date (nullable = true)
I managed to solve the problem myself. The answer was quite obvious: Instead of creating a column of type struct which contains two values, I created a MapType which had the same values.
My final code:
df.withColumn("column", from_json($"column"(0), MapType(StringType, StringType)) )
And then, for accessing the keys and values of the new column:
.select("column.productId", "column.name")

Remove duplicate column name in a Pyspark Dataframe from a json column nested object

So I have a dataframe that I gather from a table I have in my database. When I read it from the database, the JSON column becomes a string in my dataframe; no problem, I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe. I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the json also has a property called Id, I would like for that to be untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used a regex substitution on the JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1'))
This strips the quotes around the Id value, which means it will be interpreted as a long instead of a string.
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)

Spark dataframe statements work differently inside class definition

I'm trying to create a Spark Scala class for working with a calendar reference table.
I use SQL against a Hadoop database to create a Spark dataframe:
scala> val dfCyccal = sql(sqlCyccal)
dfCyccal: org.apache.spark.sql.DataFrame = [DT_WORKDAY: date, NWKD: int ... 4 more fields]
scala> dfCyccal.printSchema
root
|-- DT_WORKDAY: date (nullable = true)
|-- NWKD: integer (nullable = true)
|-- DT_PREV_WD: date (nullable = true)
|-- DT_NEXT_WD: date (nullable = true)
|-- DT_MNTHEND: date (nullable = true)
|-- ACCTG_MNTH: date (nullable = true)
scala> dfCyccal.show(5)
+----------+----+----------+----------+----------+----------+
|DT_WORKDAY|NWKD|DT_PREV_WD|DT_NEXT_WD|DT_MNTHEND|ACCTG_MNTH|
+----------+----+----------+----------+----------+----------+
|2004-01-29| 20|2003-12-30|2004-02-27|2004-01-29|2004-01-01|
|2004-01-30| 1|2003-12-31|2004-03-02|2004-02-27|2004-02-01|
|2004-02-02| 2|2004-01-02|2004-03-03|2004-02-27|2004-02-01|
|2004-02-03| 3|2004-01-05|2004-03-04|2004-02-27|2004-02-01|
|2004-02-04| 4|2004-01-06|2004-03-05|2004-02-27|2004-02-01|
+----------+----+----------+----------+----------+----------+
only showing top 5 rows
I then set reference constants for the extract:
scala> val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
MIN_DT_WORKDAY: java.sql.Date = 2004-01-29
scala> val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
MAX_DT_WORKDAY: java.sql.Date = 2020-12-01
Problem is, when I try to encapsulate this in a class definition, I get a different result:
class CYCCAL(parameters for SQL) {
...
val dfCyccal = sql(sqlCyccal).persist;
<console>:143: error: not found: value min
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
^
<console>:144: error: not found: value max
val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
}; // end of CYCCAL
How does the Class setup change the operations on the DataFrame?
They work the same. spark-shell just imports many objects by default including:
import org.apache.spark.sql.functions._
which are missing in your own code.
This worked. I had to add the following INSIDE the Class definition:
import org.apache.spark.sql.functions.{min,max};
I also had to change the notation on the column from
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
to
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min("DT_WORKDAY")).first().getDate(0)
The original was being processed as a Symbol, and you could not apply the functions to it.
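Putting both changes together, a minimal sketch of the class (assuming the SparkSession and the SQL string are passed in as parameters; the parameter names are illustrative):

import org.apache.spark.sql.SparkSession

class CYCCAL(spark: SparkSession, sqlCyccal: String) {
  // not imported automatically outside spark-shell
  import org.apache.spark.sql.functions.{min, max}

  val dfCyccal = spark.sql(sqlCyccal).persist()

  // string column names instead of Symbols
  val MIN_DT_WORKDAY: java.sql.Date = dfCyccal.agg(min("DT_WORKDAY")).first().getDate(0)
  val MAX_DT_WORKDAY: java.sql.Date = dfCyccal.agg(max("DT_WORKDAY")).first().getDate(0)
}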

Spark elasticsearch connector: how to select _id field?

I'm using the Spark Elasticsearch connector and I want to extract some fields plus the _id from ES.
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
Unfortunately, the _id field is not recognized:
AnalysisException: u'cannot resolve \'`_id`\' given input columns:
[query, size, #version, #timestamp,
sourceinfo, signaletic, document, metadata, fields, aggs]
With this mapping I can select, for example, document.{fieldA}, sourceinfo.{fieldB}, etc., but not _id. It's as if it's at a level too high in the mapping.
Any idea how to get this particular field?
You can access _id and the rest of the metadata by setting es.read.metadata to true.
For example in your code:
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.option("es.read.metadata",True) # Set it to True
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
When you do
df.printSchema()
it will print something like (in my case):
|-- user_id: string (nullable = true)
|-- user_rt: string (nullable = true)
|-- user_rt_id: string (nullable = true)
|-- username: string (nullable = true)
|-- video: long (nullable = true)
|-- _metadata: map (nullable = true) # this map column will be added
| |-- key: string
| |-- value: string (valueContainsNull = true)
You can further extract properties like _index, _type, _id, _score, and sort from the _metadata column.
To access the _id field from _metadata (a map type) I have used the following; you can use whatever you think is best:
df.createOrReplaceTempView("temp_table_name")
spark.sql("""
SELECT
_metadata._id as reference_id
FROM temp_table_name
""")
Hope this solves your problem
Some special characters, like _ and ₹, are not supported in Spark SQL. Please try replacing the special characters with another name, like "Id".