pandas_udf to extract a value from a column containing maps - pandas

I have the following spark df
id | country
------------------
1 | Null
2 | {"date": null, "value": "BRA", "context": "nationality", "state": null}
3 | {"date": null, "value": "ITA", "context": "residence", "state": null}
4 | {"date": null, "value": null, "context": null, "state": null}
I want to create a pandas user-defined function that, when run as shown below, outputs the df shown below.
(I'm working in Databricks notebooks; the display function simply prints to the console the output of the command within the parentheses.)
display(df.withColumn("country_context", get_country_context(col("country"))))
would output
id | country | country_context
-----------------------------------
1 | Null | null
2 | {"date": n...| nationality
3 | {"date": n...| residence
4 | {"date": n...| null
The pandas_udf I created is the following:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country_series: pd.Series) -> pd.Series:
    return country_series.map(lambda d:
                              d.get("context", "Null")
                              if d else "Null")

display(df
        .withColumn("country_context", get_country_context(col("country"))))
I get the following error:
PythonException: 'AttributeError: 'DataFrame' object has no attribute 'map''
I know I don't need a UDF, nor a pandas_udf, for this - but I would like to understand why my function doesn't work.
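For reference, the UDF-free route could be as simple as the following sketch (assuming country is a struct column; the same bracket access also works for a MapType column via getItem):
from pyspark.sql.functions import col

# No UDF: pull the context field straight out of the country column.
display(df.withColumn("country_context", col("country")["context"]))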

I changed the signature from Series -> Series to Iterator[Series] -> Iterator[Series] and it works. Not sure why, but it does.
from typing import Iterator

@pandas_udf('string')
def my_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda d: d.get("context", "Null"), iterator)
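A likely explanation (not confirmed in the original post): Spark passes a StructType column into a pandas UDF as a pandas DataFrame, one column per struct field, and that DataFrame has no .map method, hence the AttributeError. In the Iterator version, d.get("context", "Null") is DataFrame.get, which returns the context column. With that in mind, the Series-to-Series style can be kept by typing the input as a DataFrame; a minimal sketch, assuming country is indeed a struct:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def get_country_context(country: pd.DataFrame) -> pd.Series:
    # The struct arrives as a DataFrame with columns date, value, context, state;
    # for rows where the whole struct is null the fields are simply null.
    return country["context"]

# display(df.withColumn("country_context", get_country_context(col("country"))))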

Cannot cast dataframe column containing an array to String

I have a dataframe with a results column.
I would like to transform the results column into another dataframe.
This is the code I am trying to execute:
val JsonString = df.select(col("results")).as[String]
val resultsDF = spark.read.json(JsonString)
But the first line returns this error:
AnalysisException: Cannot up cast `results` from array<struct<auctions:bigint,bid_price_sum:double,bid_selected_price_sum:double,bids_cancelled:bigint,bids_done:bigint,bids_fail_currency:bigint,bids_fail_parsing:bigint,bids_failed:bigint,bids_filtered_blockrule:bigint,bids_filtered_duration:bigint,bids_filtered_floor_price:bigint,bids_lost:bigint,bids_selected:bigint,bids_timeout:bigint,clicks:bigint,content_owner_id:string,content_owner_name:string,date:bigint,impressions:bigint,intext_inventory:bigint,ivt_blocked:struct<blocked_reason_automated_browsing:bigint,blocked_reason_data_center:bigint,blocked_reason_false_representation:bigint,blocked_reason_irregular_pattern:bigint,blocked_reason_known_crawler:bigint,blocked_reason_manipulated_behavior:bigint,blocked_reason_misleading_uer_interface:bigint,blocked_reason_undisclosed_classification:bigint,blocked_reason_undisclosed_classification_ml:bigint,blocked_reason_undisclosed_use_of_incentives:bigint,ivt_blocked_requests:bigint>,no_bid:bigint,requests:bigint,requests_country:bigint,revenue:double,vtr0:bigint,vtr100:bigint,vtr25:bigint,vtr50:bigint,vtr75:bigint>> to string.
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
This means that results is not a String.
For example
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark app")
    .getOrCreate()

  import spark.implicits._

  case class MyClass(auctions: Int, bid_price_sum: Double)

  val df: DataFrame =
    Seq(
      ("xxx", "yyy", 1, """[{"auctions":9343, "bid_price_sum":1.062}, {"auctions":1225, "bid_price_sum":0.153}]"""),
      ("xxx1", "yyy1", 2, """{"auctions":1111, "bid_price_sum":0.111}"""),
    ).toDF("col1", "col2", "col3", "results")

  df.show()

  val JsonString = df.select(col("results")).as[String]
  val resultsDF = spark.read.json(JsonString)
  resultsDF.show()
}
produces
+----+----+----+--------------------+
|col1|col2|col3| results|
+----+----+----+--------------------+
| xxx| yyy| 1|[{"auctions":9343...|
|xxx1|yyy1| 2|{"auctions":1111,...|
+----+----+----+--------------------+
+--------+-------------+
|auctions|bid_price_sum|
+--------+-------------+
| 9343| 1.062|
| 1225| 0.153|
| 1111| 0.111|
+--------+-------------+
while
// ........................
val df: DataFrame =
  Seq(
    ("xxx", "yyy", 1, Seq(MyClass(9343, 1.062), MyClass(1225, 0.153))),
    ("xxx1", "yyy1", 2, Seq(MyClass(1111, 0.111))),
  ).toDF("col1", "col2", "col3", "results")
// ........................
produces your exception
org.apache.spark.sql.AnalysisException: Cannot up cast results
from "ARRAY<STRUCT<auctions: INT, bid_price_sum: DOUBLE>>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data
or choose a higher precision type of the field in the target object
You can fix the exception with
df.select(col("results")).as[Seq[MyClass]]
instead of
df.select(col("results")).as[String]
Similarly,
val df = spark.read.json("src/main/resources/file.json")
in the above code produces correct result for the following file.json
{"col1": "xxx", "col2": "yyy", "col3": 1, "results": "[{\"auctions\":9343, \"bid_price_sum\":1.062}, {\"auctions\":1225, \"bid_price_sum\":0.153}]"}
{"col1": "xxx1", "col2": "yyy1", "col3": 2, "results": "{\"auctions\":1111, \"bid_price_sum\":0.111}"}
but throws for the following file
{"col1": "xxx", "col2": "yyy", "col3": 1, "results": [{"auctions":9343, "bid_price_sum":1.062}, {"auctions":1225, "bid_price_sum":0.153}]}
{"col1": "xxx1", "col2": "yyy1", "col3": 2, "results": [{"auctions":1111, "bid_price_sum":0.111}]}

Vectorized pandas udf in pyspark with dict lookup

I'm trying to learn to use pandas_udf in pyspark (Databricks).
One of the assignments is to write a pandas_udf to sort by day of the week. I know how to do this using spark udf:
from pyspark.sql.functions import *
data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
print('Original')
df.show()
@udf()
def udf(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day] + '-' + day
print('with spark udf')
final_df = df.select(col('avg_users'), udf(col('day')).alias('day')).sort('day')
final_df.show()
Prints:
Original
+---+-----------+
|day| avg_users|
+---+-----------+
|Sun| 282905.5|
|Mon| 238195.5|
|Thu| 264620.0|
|Sat| 278482.0|
|Wed| 227214.0|
+---+-----------+
with spark udf
+-----------+-----+
| avg_users| day|
+-----------+-----+
| 238195.5|1-Mon|
| 227214.0|3-Wed|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 282905.5|7-Sun|
+-----------+-----+
Trying to do the same with pandas_udf
import pandas as pd
@pandas_udf('string')
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day.str] + '-' + day.str
p_final_df = df.select(df.avg_users, p_udf(df.day))
print('with pandas udf')
p_final_df.show()
I get KeyError: <pandas.core.strings.accessor.StringMethods object at 0x7f31197cd9a0>. I think it's coming from dow[day.str], which kinda makes sense.
I also tried:
return dow[day.str.__str__()] + '-' + day.str # KeyError: .... StringMethods
return dow[str(day.str)] + '-' + day.str # KeyError: .... StringMethods
return dow[day.str.upper()] + '-' + day.str # TypeError: unhashable type: 'Series'
return f"{dow[day.str]}-{day.str}" # KeyError: .... StringMethods (but I think this is logically
# wrong, returning a string instead of a Series)
I've read:
API reference
PySpark equivalent for lambda function in Pandas UDF
How to convert Scalar Pyspark UDF to Pandas UDF?
Pandas UDF in pyspark
Using the .str accessor alone, without any actual vectorized transformation, was giving you the error. Also, you cannot use the whole Series as a key for your dow dict. Use the map method of pandas.Series:
from pyspark.sql.functions import *
import pandas as pd
data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
@pandas_udf("string")
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return day.map(dow) + '-' + day
df.select(df.avg_users, p_udf(df.day).alias("day")).show()
+---------+-----+
|avg_users| day|
+---------+-----+
| 282905.5|7-Sun|
| 238195.5|1-Mon|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 227214.0|3-Wed|
+---------+-----+
What about returning a dataframe using grouped data and ordering after you do the UDF? Pandas sort_values is quite problematic within UDFs.
Basically, in the UDF I generate the numbers using Python and then concatenate them back to the day column.
from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import *
import calendar
def sortdf(pdf):
    day = pdf.day
    pdf = pdf.assign(day=(day.map(dict(zip(calendar.day_abbr, range(7)))) + 1).astype(str) + '-' + day)
    return pdf
df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).show()
+-----+---------+
| day|avg_users|
+-----+---------+
|3-Wed| 227214.0|
|1-Mon| 238195.5|
|4-Thu| 264620.0|
|6-Sat| 278482.0|
|7-Sun| 282905.5|
+-----+---------+

Update pyspark dataframe from a column having the target column values

I have a dataframe which has a column ('target_column' in this case) naming the column to update, and I need to update those target columns with the 'val' column values.
I have tried using UDFs and .withColumn, but they all expect a fixed column name; in my case it can vary per row. Using RDD map transformations didn't work either, as RDDs are immutable.
from pyspark.sql import SparkSession

def test():
    data = [("jose_1", 'mase', "firstname", "jane"), ("li_1", "ken", 'lastname', 'keno'), ("liz_1", 'durn', 'firstname', 'liz')]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()
Input:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jose_1| mase| firstname|jane|
| li_1| ken| lastname|keno|
| liz_1| durn| firstname| liz|
+---------+--------+-------------+----+
Expected output:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jane| mase| firstname|jane|
| li_1| keno| lastname|keno|
| liz| durn| firstname| liz|
+---------+--------+-------------+----+
For example, in the first row of the input, target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.
Thanks
You can do a loop over all your columns:
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
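Equivalently (a sketch, not part of the original answer), the same logic can be written as a single select with a list comprehension, which avoids chaining many withColumn calls:
from pyspark.sql import functions as F

# source_df is the dataframe built in the question's test() function
updated_df = source_df.select(
    *[
        F.when(F.col("target_column") == F.lit(c), F.col("val"))
         .otherwise(F.col(c))
         .alias(c)
        for c in source_df.columns
    ]
)
updated_df.show()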

Iterate over one dataframe and add values to another dataframe without "append" or "concat"?

I have a dataframe "df_edges" that I want to iterate over.
Inside the iteration is an if/else and a string split. I need to add the values from the if/else into a new dataframe (each iteration = one new row in the other dataframe).
Example data of "df_edges":
+-----------------------------------------+
| channelId ... featuredChannelsUrlsCount |
+-----------------------------------------+
| 0 UC-ry8ngUIJHTMBWeoARZGmA ... 1 |
| 1 UC-zK3cJdazy01AKTu8g_amg ... 6 |
| 2 UC05_iIGvXue0sR01JNpRHzw ... 10 |
| 3 UC141nSav5cjmTXN7B70ts0g ... 0 |
| 4 UC1cQzKmbx9x0KipvoCt4NJg ... 0 |
+-----------------------------------------+
# new empty dataframe where I want to add the values
df_edges_to_db = pd.DataFrame(columns=["Source", "Target"])

# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": fc}, ignore_index=True)
    else:
        writer.writerow([row[1], row[1]])
        df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": row[1]}, ignore_index=True)
This seems to work. But the documentation says (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html):
The following, while not recommended methods for generating DataFrames
So, is there a more "best practice" way (besides append/concat) to add the rows with the values?
It is possible to create a list of dictionaries using Python's list append (not DataFrame.append like in your solution) and then call the DataFrame constructor only once:
L = []

# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            L.append({"Source": row[1], "Target": fc})
    else:
        writer.writerow([row[1], row[1]])
        L.append({"Source": row[1], "Target": row[1]})

df_edges_to_db = pd.DataFrame(L)
Actually, I am not clear on what your df_edges DataFrame looks like. Looking at your code, I would suggest replacing the body of the outer for-loop with something like this:
new_list = [someOperationOn(x) if x == 0 else otherOperationOn(x) for x in mylist]
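A sketch of that idea applied to the loop from the question (same df_edges columns and positional indexing as above; the writer.writerow calls are left out):
import pandas as pd

# One dict per output row; a single-element list stands in for the else branch.
rows = [
    {"Source": row[1], "Target": target}
    for row in df_edges.itertuples()
    for target in (row[2].split(',') if row.featuredChannelsUrlsCount != 0 else [row[1]])
]
df_edges_to_db = pd.DataFrame(rows, columns=["Source", "Target"])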

Arrays to row in pandas

I have the following dict which I want to convert into pandas. This dict has a nested list which can appear for one node but not another.
dis = {"companies": [{"object_id": 123,
                      "name": "Abd ",
                      "contact_name": ["xxxx", "yyyy"],
                      "contact_id": [1234, 33455]},
                     {"object_id": 654,
                      "name": "DDSPP"},
                     {"object_id": 987,
                      "name": "CCD"}]}
AS
object_id, name, contact_name, contact_id
123,Abd,xxxx,1234
123,Abd,yyyy,
654,DDSPP,,
987,CCD,,
How can I achieve this?
I was trying to do something like
abc = pd.DataFrame(dis).set_index['object_id','contact_name']
but it says
'method' object is not subscriptable
This is inspired by @jezrael's answer in this link: Splitting multiple columns into rows in pandas dataframe
Use:
import pandas as pd

s = {"companies": [{"object_id": 123,
                    "name": "Abd ",
                    "contact_name": ["xxxx", "yyyy"],
                    "contact_id": [1234, 33455]},
                   {"object_id": 654,
                    "name": "DDSPP"},
                   {"object_id": 987,
                    "name": "CCD"}]}

df = pd.DataFrame(s)                   # convert into a DataFrame
df = df['companies'].apply(pd.Series)  # split the internal keys and values into columns

split1 = df.apply(lambda x: pd.Series(x['contact_id']), axis=1).stack().reset_index(level=1, drop=True)
split2 = df.apply(lambda x: pd.Series(x['contact_name']), axis=1).stack().reset_index(level=1, drop=True)
df1 = pd.concat([split1, split2], axis=1, keys=['contact_id', 'contact_name'])

pd.options.display.float_format = '{:.0f}'.format
print(df.drop(['contact_id', 'contact_name'], axis=1).join(df1).reset_index(drop=True))
Output with regular index:
name object_id contact_id contact_name
0 Abd 123 1234 xxxx
1 Abd 123 33455 yyyy
2 DDSPP 654 nan NaN
3 CCD 987 nan NaN
Is this something you were looking for?
If you have just one column that needs converting, then you can use something shorter, like this:
df = pd.DataFrame(dis['companies'])
d = df.loc[0].apply(pd.Series)
d[1].fillna(d[0], inplace=True)
df.drop([0], axis=0).append(d.T)
Otherwise, if you need to do this with more than one row, it can still be used, but it has to be modified.
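As a side note (not part of the original answers): on pandas 1.3 or newer, the same reshaping can also be done directly with a multi-column explode; a sketch, assuming the dis dict from the question:
import pandas as pd

df = pd.DataFrame(dis["companies"])
# Companies without contact_name/contact_id keep a single row with NaN in those columns.
out = df.explode(["contact_name", "contact_id"], ignore_index=True)
print(out)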