Cannot cast dataframe column containing an array to String

I have a dataframe whose results column holds an array of structs (the full schema appears in the error below). I would like to transform the results column into another dataframe.
This is the code I am trying to execute:
val JsonString = df.select(col("results")).as[String]
val resultsDF = spark.read.json(JsonString)
But the first line returns this error:
AnalysisException: Cannot up cast `results` from array<struct<auctions:bigint,bid_price_sum:double,bid_selected_price_sum:double,bids_cancelled:bigint,bids_done:bigint,bids_fail_currency:bigint,bids_fail_parsing:bigint,bids_failed:bigint,bids_filtered_blockrule:bigint,bids_filtered_duration:bigint,bids_filtered_floor_price:bigint,bids_lost:bigint,bids_selected:bigint,bids_timeout:bigint,clicks:bigint,content_owner_id:string,content_owner_name:string,date:bigint,impressions:bigint,intext_inventory:bigint,ivt_blocked:struct<blocked_reason_automated_browsing:bigint,blocked_reason_data_center:bigint,blocked_reason_false_representation:bigint,blocked_reason_irregular_pattern:bigint,blocked_reason_known_crawler:bigint,blocked_reason_manipulated_behavior:bigint,blocked_reason_misleading_uer_interface:bigint,blocked_reason_undisclosed_classification:bigint,blocked_reason_undisclosed_classification_ml:bigint,blocked_reason_undisclosed_use_of_incentives:bigint,ivt_blocked_requests:bigint>,no_bid:bigint,requests:bigint,requests_country:bigint,revenue:double,vtr0:bigint,vtr100:bigint,vtr25:bigint,vtr50:bigint,vtr75:bigint>> to string.
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object

This means that results is not a String.
For example, the following code
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark app")
    .getOrCreate()

  import spark.implicits._

  case class MyClass(auctions: Int, bid_price_sum: Double)

  val df: DataFrame =
    Seq(
      ("xxx", "yyy", 1, """[{"auctions":9343, "bid_price_sum":1.062}, {"auctions":1225, "bid_price_sum":0.153}]"""),
      ("xxx1", "yyy1", 2, """{"auctions":1111, "bid_price_sum":0.111}"""),
    ).toDF("col1", "col2", "col3", "results")

  df.show()

  val JsonString = df.select(col("results")).as[String]
  val resultsDF = spark.read.json(JsonString)
  resultsDF.show()
}
produces
+----+----+----+--------------------+
|col1|col2|col3| results|
+----+----+----+--------------------+
| xxx| yyy| 1|[{"auctions":9343...|
|xxx1|yyy1| 2|{"auctions":1111,...|
+----+----+----+--------------------+
+--------+-------------+
|auctions|bid_price_sum|
+--------+-------------+
| 9343| 1.062|
| 1225| 0.153|
| 1111| 0.111|
+--------+-------------+
while
// ........................
  val df: DataFrame =
    Seq(
      ("xxx", "yyy", 1, Seq(MyClass(9343, 1.062), MyClass(1225, 0.153))),
      ("xxx1", "yyy1", 2, Seq(MyClass(1111, 0.111))),
    ).toDF("col1", "col2", "col3", "results")
// ........................
produces your exception
org.apache.spark.sql.AnalysisException: Cannot up cast results
from "ARRAY<STRUCT<auctions: INT, bid_price_sum: DOUBLE>>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data
or choose a higher precision type of the field in the target object
You can fix the exception with
df.select(col("results")).as[Seq[MyClass]]
instead of
df.select(col("results")).as[String]
Similarly,
val df = spark.read.json("src/main/resources/file.json")
in the above code produces the correct result for the following file.json
{"col1": "xxx", "col2": "yyy", "col3": 1, "results": "[{\"auctions\":9343, \"bid_price_sum\":1.062}, {\"auctions\":1225, \"bid_price_sum\":0.153}]"}
{"col1": "xxx1", "col2": "yyy1", "col3": 2, "results": "{\"auctions\":1111, \"bid_price_sum\":0.111}"}
but throws for the following file
{"col1": "xxx", "col2": "yyy", "col3": 1, "results": [{"auctions":9343, "bid_price_sum":1.062}, {"auctions":1225, "bid_price_sum":0.153}]}
{"col1": "xxx1", "col2": "yyy1", "col3": 2, "results": [{"auctions":1111, "bid_price_sum":0.111}]}

Spark: convert string to Date

I'm using Spark/Scala.
I have a dataframe. There are year/month/day columns with values, e.g. 2020/9/2. How can I add a column to the same dataframe with those values converted to a date (yyyy-MM-dd)?
I found how to convert a date from String to Date format, but I can't find a solution for how to combine the values and convert them.
Thanks for any advice or hint.
You can use the to_date function.
import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._ // for toDF

val df1 = Seq(
  "2020/9/2",
  "2020/9/15",
  "2020/9/30"
).toDF("str")

val df2 = df1.withColumn("dt", to_date(col("str"), "y/M/d"))
df2.show()
I did some tests; I think you can use my example to convert the date. I hope this helps.
package com.jackpan.spark.examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object SomeExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SomeExamples")
      .getOrCreate()

    val dataDF = spark.createDataFrame(Seq(("2022", "12", "09"), ("2022", "12", "19"),
      ("2022", "12", "15"))).toDF("year", "month", "day")

    dataDF.withColumn("dateStr",
        concat(col("year"), lit("-"), col("month"), lit("-"), col("day")))
      .withColumn("date", to_date(col("dateStr"), "yyyy-MM-dd"))
      .show(false)
  }
}
Running this displays the result shown below:
+----+-----+---+----------+----------+
|year|month|day|dateStr |date |
+----+-----+---+----------+----------+
|2022|12 |09 |2022-12-09|2022-12-09|
|2022|12 |19 |2022-12-19|2022-12-19|
|2022|12 |15 |2022-12-15|2022-12-15|
+----+-----+---+----------+----------+
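If you are on Spark 3.0 or later, a shorter alternative (just a sketch, assuming the year/month/day columns can be cast to integers) is the built-in make_date function, which builds a date directly from the three columns:
import org.apache.spark.sql.functions.{col, make_date}

dataDF
  .withColumn("date", make_date(col("year").cast("int"), col("month").cast("int"), col("day").cast("int")))
  .show(false)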

Vectorized pandas udf in pyspark with dict lookup

I'm trying to learn to use pandas_udf in pyspark (Databricks).
One of the assignments is to write a pandas_udf to sort by day of the week. I know how to do this using spark udf:
from pyspark.sql.functions import *
data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
print('Original')
df.show()
@udf()
def udf(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day] + '-' + day
print('with spark udf')
final_df = df.select(col('avg_users'), udf(col('day')).alias('day')).sort('day')
final_df.show()
Prints:
Original
+---+-----------+
|day| avg_users|
+---+-----------+
|Sun| 282905.5|
|Mon| 238195.5|
|Thu| 264620.0|
|Sat| 278482.0|
|Wed| 227214.0|
+---+-----------+
with spark udf
+-----------+-----+
| avg_users| day|
+-----------+-----+
| 238195.5|1-Mon|
| 227214.0|3-Wed|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 282905.5|7-Sun|
+-----------+-----+
Trying to do the same with pandas_udf
import pandas as pd

@pandas_udf('string')
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day.str] + '-' + day.str

p_final_df = df.select(df.avg_users, p_udf(df.day))
print('with pandas udf')
p_final_df.show()
I get KeyError: <pandas.core.strings.accessor.StringMethods object at 0x7f31197cd9a0>. I think it's coming from dow[day.str], which kinda makes sense.
I also tried:
return dow[day.str.__str__()] + '-' + day.str # KeyError: .... StringMethods
return dow[str(day.str)] + '-' + day.str # KeyError: .... StringMethods
return dow[day.str.upper()] + '-' + day.str # TypeError: unhashable type: 'Series'
return f"{dow[day.str]}-{day.str}" # KeyError: .... StringMethods (but I think this is logically
# wrong, returning a string instead of a Series)
I've read:
API reference
PySpark equivalent for lambda function in Pandas UDF
How to convert Scalar Pyspark UDF to Pandas UDF?
Pandas UDF in pyspark
Using the .str accessor alone, without any actual vectorized transformation, is what gives you the error. Also, you cannot use the whole Series as a key for your dow dict. Use the map method of pandas.Series:
from pyspark.sql.functions import *
import pandas as pd

data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)

@pandas_udf("string")
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return day.map(dow) + '-' + day

df.select(df.avg_users, p_udf(df.day).alias("day")).show()
+---------+-----+
|avg_users| day|
+---------+-----+
| 282905.5|7-Sun|
| 238195.5|1-Mon|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 227214.0|3-Wed|
+---------+-----+
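Since the original goal was to sort by day of the week, a small follow-up using the same p_udf (assuming the '1-Mon' ... '7-Sun' prefixes are the ordering you want) is:
# the single-digit numeric prefix makes a plain lexicographic sort give weekday order
final_df = df.select(df.avg_users, p_udf(df.day).alias("day")).sort("day")
final_df.show()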
What about returning a dataframe from grouped data (applyInPandas) and ordering after you apply the UDF? Pandas sort_values is quite problematic within UDFs.
Basically, in the UDF I generate the numbers using Python and then concatenate them back onto the day column.
from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import *
import calendar

def sortdf(pdf):
    day = pdf.day
    pdf = pdf.assign(day=(day.map(dict(zip(calendar.day_abbr, range(7)))) + 1).astype(str) + '-' + day)
    return pdf

df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).show()
+-----+---------+
| day|avg_users|
+-----+---------+
|3-Wed| 227214.0|
|1-Mon| 238195.5|
|4-Thu| 264620.0|
|6-Sat| 278482.0|
|7-Sun| 282905.5|
+-----+---------+

pandas_udf to extract a value from a column containing maps

I have the following spark df
id | country
------------------
1 | Null
2 | {"date": null, "value": "BRA", "context": "nationality", "state": null}
3 | {"date": null, "value": "ITA", "context": "residence", "state": null}
4 | {"date": null, "value": null, "context": null, "state": null}
And I want to create a pandas user-defined function that, when run like below, would output the df shown further down.
(I'm working in Databricks notebooks; the display function simply prints to the console the output of the command within the parens.)
display(df.withColumn("country_context", get_country_context(col("country"))))
would output
id | country | country_context
-----------------------------------
1 | Null | null
2 | {"date": n...| nationality
3 | {"date": n...| residence
4 | {"date": n...| null
The pandas_udf I created is the following:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country_series: pd.Series) -> pd.Series:
    return country_series.map(lambda d:
                              d.get("context", "Null")
                              if d else "Null")

display(df
        .withColumn("country_context", get_country_context(col("country"))))
I get the following error:
PythonException: 'AttributeError: 'DataFrame' object has no attribute 'map''
I know I don't need a udf, nor a pandas_udf, for this - but I would like to understand why my function doesn't work.
I changed the signature from Series -> Series to Iterator[Series] -> Iterator[Series] and it works. Not sure why, but it does.
from typing import Iterator

@pandas_udf('string')
def my_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda d: d.get("context", "Null"), iterator)
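A likely explanation (my reading of the pandas_udf type-hint rules, not verified against the asker's exact setup): country is a StructType column, and Spark passes struct columns to a pandas UDF as a pandas DataFrame rather than a Series, so country_series.map fails with 'DataFrame' object has no attribute 'map', while DataFrame.get("context", ...) in the iterator version happens to work. Under that assumption, a non-iterator sketch would hint the struct as a DataFrame:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country: pd.DataFrame) -> pd.Series:
    # a struct column arrives as a pandas DataFrame with one column per struct field
    return country["context"]

display(df.withColumn("country_context", get_country_context(col("country"))))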

Replacing a column in a dataframe raises ValueError: Columns must be same length as key

There is a description of using the replace method here:
https://www.geeksforgeeks.org/replace-values-of-a-dataframe-with-the-value-of-another-dataframe-in-pandas/
Unfortunately, I get an error when replacing a column of one dataframe with a column from another dataframe.
import pandas as pd

# initialise data of lists
colors = {
    "first_set": ["99", "88", "77", "66", "55", "44", "33", "22"],
    "second_set": ["1", "2", "3", "4", "5", "6", "7", "8"],
}
color = {
    "first_set": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "second_set": ["VI", "IN", "BL", "GR", "YE", "OR", "RE", "WI"],
}

# Calling DataFrame constructor on dict
df = pd.DataFrame(colors, columns=["first_set", "second_set"])
df1 = pd.DataFrame(color, columns=["first_set", "second_set"])

# Display the Output
display(df)
display(df1)
Here is the code that raises the error:
# replace column of one DataFrame with
# the column of another DataFrame
ser1 = df1["first_set"]
ser2 = df["second_set"]
print(ser1)
print(ser2)
df["second_set"] = df1.replace(to_replace=ser1, value=ser2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28648/2104797653.py in <module>
      5 print(ser1)
      6 print(ser2)
----> 7 df['second_set'] = df1.replace(to_replace=ser1, value=ser2)

~\.virtualenvs\01_python_packages-rD-UbwAe\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3600             self._setitem_array(key, value)
   3601         elif isinstance(value, DataFrame):
-> 3602             self._set_item_frame_value(key, value)
   3603         elif (
   3604             is_list_like(value)

~\.virtualenvs\01_python_packages-rD-UbwAe\lib\site-packages\pandas\core\frame.py in _set_item_frame_value(self, key, value)
   3727         len_cols = 1 if is_scalar(cols) else len(cols)
   3728         if len_cols != len(value.columns):
-> 3729             raise ValueError("Columns must be same length as key")
   3730
   3731         # align right-hand-side columns if self.columns

ValueError: Columns must be same length as key
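For what it's worth, the traceback itself points at the cause: df1.replace(...) returns the whole two-column DataFrame, and assigning a two-column DataFrame to the single key df['second_set'] is what raises "Columns must be same length as key". Assuming the goal is simply to overwrite one column of df with a column from df1 (my reading of the intent), a minimal sketch is a direct column assignment instead of replace:
# put df1's first_set into df's second_set; assignment aligns on the index
df["second_set"] = df1["first_set"]
# or, if the two frames have different indexes, assign positionally
df["second_set"] = df1["first_set"].to_numpy()
print(df)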

pyspark pass multiple options in dataframe

I am new to Python and PySpark. I would like to know how I can write the Spark dataframe code below in PySpark:
val df = spark.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:someDB",
    "user" -> "root",
    "password" -> "password",
    "dbtable" -> "tableName",
    "driver" -> "someDriver")).load()
I tried to write it as below in PySpark, but I am getting a syntax error:
df = spark.read.format("jdbc").options(
    map(lambda : ("url","jdbc:someDB"), ("user","root"), ("password","password"), ("dbtable","tableName"), ("driver","someDriver"))).load()
Thanks in Advance
In PySpark, pass the options as keyword arguments:
df = spark.read\
    .format("jdbc")\
    .options(
        url="jdbc:someDB",
        user="root",
        password="password",
        dbtable="tableName",
        driver="someDriver",
    )\
    .load()
Sometimes it's handy to keep them in a dict and unpack them later using the splat operator:
options = {
    "url": "jdbc:someDB",
    "user": "root",
    "password": "password",
    "dbtable": "tableName",
    "driver": "someDriver",
}

df = spark.read\
    .format("jdbc")\
    .options(**options)\
    .load()
Regarding the code snippets from your question: you happened to mix up two different concepts of "map":
Map in Scala is a data structure also known as "associative array" or "dictionary", equivalent to Python's dict
map in Python is a higher-order function you can use for applying a function to an iterable, e.g.:
In [1]: def square(x: int) -> int:
   ...:     return x**2
   ...:
In [2]: list(map(square, [1, 2, 3, 4, 5]))
Out[2]: [1, 4, 9, 16, 25]
In [3]: # or just use a lambda
In [4]: list(map(lambda x: x**2, [1, 2, 3, 4, 5]))
Out[4]: [1, 4, 9, 16, 25]
Try to use option() instead:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:someDB") \
    .option("user", "root") \
    .option("password", "password") \
    .option("dbtable", "tableName") \
    .option("driver", "someDriver") \
    .load()
To load a CSV file with multiple parameters, pass the arguments to load():
df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=":", inferSchema="true", header="true")
Here's the documentation for that.
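Applied to the JDBC source from the question, that same pattern would look roughly like this (a sketch; the url/user/password/dbtable/driver values are the placeholders from the question):
df = spark.read.load(
    format="jdbc",
    url="jdbc:someDB",
    user="root",
    password="password",
    dbtable="tableName",
    driver="someDriver",
)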