Adding an element to an existing struct in a Spark dataframe

I have a PySpark dataframe and the schema looks like this:
root
|-- useragent: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- device: string (nullable = true)
I am using a UDF with withColumn to extract more info from the useragent, but I can only store each additional property I create in a new column. Is there a way to append it to the struct itself?
I tried changing the schema, but that does not carry over the data from the properties that already exist:
df = df.schema['properties'].dataType.add(StructField('type', StringType()))
Would it make more sense to convert this to an RDD?

You will not get any data from that piece of code, because you are not selecting any data.
All that line does is access the schema property, modify it, and assign the result to the variable df, so df is no longer a dataframe at all but a StructType object (there also appears to be a syntax error there, please check).
What you should do instead is transform the dataframe itself through select or withColumn with suitable functions: build a new struct column containing the required fields and write it back over the existing properties column, as in the sketch below.
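For example, a minimal sketch that rebuilds the struct, assuming the new value comes from a UDF named extract_type_udf (a hypothetical name, standing in for whatever UDF you already have):
from pyspark.sql import functions as F

# extract_type_udf is a hypothetical UDF that pulls the "type" out of the useragent
df = df.withColumn(
    "properties",
    F.struct(
        F.col("properties.browser").alias("browser"),
        F.col("properties.device").alias("device"),
        extract_type_udf(F.col("useragent")).alias("type"),
    ),
)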

withField can be used:
F.col("properties").withField("type", extract_function)
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("type=type1,mode=5", ("Chrome", "PC1"))],
    "useragent string, properties struct<browser:string,device:string>"
)
df.printSchema()
# root
# |-- useragent: string (nullable = true)
# |-- properties: struct (nullable = true)
# | |-- browser: string (nullable = true)
# | |-- device: string (nullable = true)
df.show()
# +-----------------+-------------+
# | useragent| properties|
# +-----------------+-------------+
# |type=type1,mode=5|{Chrome, PC1}|
# +-----------------+-------------+
df = df.withColumn("properties", F.col("properties").withField(
"type",
F.regexp_extract("useragent", r"type=(\w+)", 1)
))
df.printSchema()
# root
# |-- useragent: string (nullable = true)
# |-- properties: struct (nullable = true)
# | |-- browser: string (nullable = true)
# | |-- device: string (nullable = true)
# | |-- type: string (nullable = true)
df.show()
# +-----------------+--------------------+
# | useragent| properties|
# +-----------------+--------------------+
# |type=type1,mode=5|{Chrome, PC1, type1}|
# +-----------------+--------------------+
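Note: Column.withField is available from Spark 3.1 onwards; on older versions you can rebuild the struct with F.struct, as in the sketch in the previous answer.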

Related

How to convert datetime to int on pyspark

df['DATE'].apply(lambda x: x.strftime("%Y%m%d")).astype('float64')
This produces the error:
TypeError: 'Column' object is not callable
How would I convert this syntax to comply with pyspark?
A simple way to reformat 'yyyy-MM-dd' to 'yyyyMMdd' is to strip the dashes and cast:
from pyspark.sql.functions import col, regexp_replace

data = [
    ('2022-08-10', 1),
    ('2022-08-09', 2),
]
df = spark.createDataFrame(data, ['DATE', 'idx'])
df.printSchema()
# root
# |-- DATE: string (nullable = true)
# |-- idx: long (nullable = true)
df = df.withColumn('DATE', regexp_replace(col('DATE'), '-', '').cast('long'))
df.printSchema()
# root
# |-- DATE: long (nullable = true)
# |-- idx: long (nullable = true)
df.show(10, False)
# +--------+---+
# |DATE |idx|
# +--------+---+
# |20220810|1 |
# |20220809|2 |
# +--------+---+
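If you prefer to parse the value as a date rather than treat it as a plain string, a sketch using to_date and date_format (assuming the input really is a 'yyyy-MM-dd' string) gives the same result:
from pyspark.sql.functions import col, date_format, to_date

# Parse the string into a date, render it as 'yyyyMMdd', then cast to a number
df = df.withColumn(
    'DATE',
    date_format(to_date(col('DATE'), 'yyyy-MM-dd'), 'yyyyMMdd').cast('long')
)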

How to handle different date formats in a CSV file while reading a DataFrame in Spark using option("dateFormat")?

I have the input file below:
INPUTFILE_CNTRY_CODE|INPUTFILE_CTY_CODE|INPUTFILE_ID|INPUTFILE_LTY_ID|INPUTFILE_CNSM_ID|INPUTFILE_DATE|INPUTFILE_TIME|INPUTFILE_TRDATE
GBR|263|326735246||I034867789V|15/11/30|2015-11-30 00:00:00.000000|2016-22-06
GBR|263|397802068|PC7135361|PC7135361|16/05/20|2016-10-06 11:50:05.000000|2016-22-07
and I am trying to read it as follows:
val registeration_schema = StructType(List(
  StructField("INPUTFILE_CNTRY_CODE", StringType),
  StructField("INPUTFILE_CTY_CODE", IntegerType),
  StructField("INPUTFILE_ID", IntegerType),
  StructField("INPUTFILE_LTY_ID", StringType),
  StructField("INPUTFILE_CNSM_ID", StringType),
  StructField("INPUTFILE_DATE", DateType),
  StructField("INPUTFILE_TIME", TimestampType),
  StructField("INPUTFILE_TRDATE", DateType)
))
val registerationDF = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "FAILFAST")
  .schema(registeration_schema)
  .option("dateFormat", "yy/M/d")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
  .csv("registration2.csv")
And I am getting the error below:
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: java.time.format.DateTimeParseException: Text '2016-22-06' could not be parsed at index 2
at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:262)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$doParse$2(UnivocityParser.scala:200)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.parse(UnivocityParser.scala:207)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:347)
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
... 27 more
Caused by: java.time.format.DateTimeParseException: Text '2016-22-06' could not be parsed at index 2
This is happening because of the different date formats, and I have specified only one of them while loading the DataFrame. Can someone explain how to handle more than one date format while reading a CSV into a dataframe?
You cannot define more than one format for DateType while loading a CSV, but you can achieve this with the date_format() and to_date() functions available in Spark 2.2+.
The steps at a high level are as follows:
Define one of the two date columns as a string in the original schema. I chose to define INPUTFILE_DATE as a string in my demonstration.
Reformat INPUTFILE_DATE into an appropriate format using date_format() and to_date(), converting its datatype to date.
Define your original schema with INPUTFILE_DATE --> StringType:
val registeration_schema = StructType(List(
  StructField("INPUTFILE_CNTRY_CODE", StringType),
  StructField("INPUTFILE_CTY_CODE", IntegerType),
  StructField("INPUTFILE_ID", IntegerType),
  StructField("INPUTFILE_LTY_ID", StringType),
  StructField("INPUTFILE_CNSM_ID", StringType),
  StructField("INPUTFILE_DATE", StringType),
  StructField("INPUTFILE_TIME", TimestampType),
  StructField("INPUTFILE_TRDATE", DateType)
))
val registerationDF = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "FAILFAST")
  .schema(registeration_schema)
  .option("dateFormat", "yyyy-dd-MM")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
  .csv("registration2.csv")
The core part of this solution is:
val targetDF = registerationDF.withColumn(
  "INPUTFILE_DATE",
  to_date(date_format(to_date(col("INPUTFILE_DATE"), "yy/MM/dd"), "yyyy-dd-MM"), "yyyy-dd-MM")
)
End result:
scala> targetDF.printSchema()
root
|-- INPUTFILE_CNTRY_CODE: string (nullable = true)
|-- INPUTFILE_CTY_CODE: integer (nullable = true)
|-- INPUTFILE_ID: integer (nullable = true)
|-- INPUTFILE_LTY_ID: string (nullable = true)
|-- INPUTFILE_CNSM_ID: string (nullable = true)
|-- INPUTFILE_DATE: date (nullable = true)
|-- INPUTFILE_TIME: timestamp (nullable = true)
|-- INPUTFILE_TRDATE: date (nullable = true)
scala> targetDF.show()
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+
|INPUTFILE_CNTRY_CODE|INPUTFILE_CTY_CODE|INPUTFILE_ID|INPUTFILE_LTY_ID|INPUTFILE_CNSM_ID|INPUTFILE_DATE| INPUTFILE_TIME|INPUTFILE_TRDATE|
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+
| GBR| 263| 326735246| null| I034867789V| 2015-11-30|2015-11-30 00:00:00| 2017-10-06|
| GBR| 263| 397802068| PC7135361| PC7135361| 2016-05-20|2016-10-06 11:50:05| 2017-10-07|
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+

Need to convert this SQL code to Spark syntax

I am having a hard time converting this code to plain PySpark syntax, without using spark.sql:
TableD = spark.sql("SELECT Region, SUM(moneyspent)/count(distinct id) as avgspent FROM finaldatatable as Data GROUP BY Region ORDER BY avgspent desc ")
TableD.show()
Please help, really stuck here. Thanks.
Check the code below.
>>> from pyspark.sql import functions as F
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- region: string (nullable = true)
|-- moneyspent: double (nullable = true)
>>> df \
...     .groupBy(F.col("region")) \
...     .agg((F.sum(F.col("moneyspent")) / F.countDistinct(F.col("id"))).alias("avgspent")) \
...     .orderBy(F.col("avgspent"), ascending=False) \
...     .show()

Pyspark - How to set the schema when reading parquet file from another DF?

I have DF1 with schema:
df1 = spark.read.parquet(load_path1)
df1.printSchema()
root
|-- PRODUCT_OFFERING_ID: string (nullable = true)
|-- CREATED_BY: string (nullable = true)
|-- CREATION_DATE: string (nullable = true)
and DF2:
df2 = spark.read.parquet(load_path2)
df2.printSchema()
root
|-- PRODUCT_OFFERING_ID: decimal(38,10) (nullable = true)
|-- CREATED_BY: decimal(38,10) (nullable = true)
|-- CREATION_DATE: timestamp (nullable = true)
Now I want to union these 2 dataframes.
Sometimes it gives errors when I try to UNION the 2 DFs because of the different schemas.
How can I make DF2 have exactly the same schema as DF1 (at load time)?
I tried with:
df2 = spark.read.parquet(load_path2).schema(df1.schema)
Getting error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not callable
Or should I CAST it instead (once DF2 is read)?
Thanks.
Move .schema() before .parquet(); Spark will then read the parquet file with the specified schema:
df2 = spark.read.schema(df1.schema).parquet(load_path2)
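If you would rather cast after reading, as the question suggests, here is a minimal sketch that selects DF2's columns cast to DF1's types (it assumes both files have the same column names):
from pyspark.sql import functions as F

df2 = spark.read.parquet(load_path2)
# Cast each DF2 column to the corresponding type from DF1's schema
df2_casted = df2.select(
    [F.col(f.name).cast(f.dataType) for f in df1.schema.fields]
)
df_union = df1.unionByName(df2_casted)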

How do you get a row back into a dataframe

This was supposed to be a simple test to move the first row of my dataframe into a new dataframe.
The first issue: df.first() returns a Row, not a dataframe.
The next problem: when I try to use spark.createDataFrame(df.first()), it tells me it cannot infer the schema.
The next problem: spark.createDataFrame(df.first(), df.schema) does not work either.
So, for the original schema below:
root
|-- entity_name: string (nullable = true)
|-- field_name: array (nullable = true)
| |-- element: string (containsNull = true)
|-- data_row: array (nullable = true)
| |-- element: string (containsNull = true)
|-- data_schema: array (nullable = true)
| |-- element: string (containsNull = true)
I defined the schema in code thus:
xyz_schema = StructType([
    StructField('entity_name', StringType(), True),
    StructField('field_name', ArrayType(StringType(), True), True),
    StructField('data_row', ArrayType(StringType(), True), True),
    StructField('data_schema', ArrayType(StringType(), True), True)
])
print(xyz.first())
xyz_1stRow = spark.createDataFrame(xyz.first(), xyz_schema)
The above does not work! I get the following error:
"TypeError: StructType can not accept object 'parquet/assignment/v1' in type <class 'str'>"
this is what the print shows me...
Row(entity_name='parquet/assignment/v1',
field_name=['Contract_ItemNumber', 'UPC', 'DC_ID', 'AssignDate',
'AssignID', 'AssignmentQuantity', 'ContractNumber', 'MaterialNumber',
'OrderReason', 'RequirementCategory', 'MSKU'], data_row=['\n
350,192660436296,2001,10/1/2019,84009248020191000,5,840092480,1862291010,711,V1\n\t\t\t\t\t',
'\n
180,191454773838,2001,10/1/2019,84009248020191000,6,840092480,1791301010,711,V1\n\t\t\t\t\t'],
data_schema=['StringType', 'StringType', 'StringType', None,
'StringType', 'IntegerType', 'StringType', 'StringType', 'StringType',
'StringType', 'StringType'])
What am I doing wrong? Why does a StringType not accept a string?
I'm working in pyspark (current version) with Azure databricks. I'd prefer to stay with pyspark, not R, not Scala, and not have to convert to pandas and risk my data being corrupted converting between all these languages.
According to the documentation, the createDataFrame function takes an RDD, a list, or a pandas.DataFrame and creates a dataframe from it. Therefore you have to wrap the result of df.first() in square brackets to make it a list. Have a look at the example below:
df = spark.createDataFrame(
    [('Galaxy', 2017, 27841, 17529),
     ('Galaxy', 2017, 29395, 11892),
     ('Novato', 2018, 35644, 22876),
     ('Novato', 2018, 8765, 54817)],
    ['model', 'year', 'price', 'mileage']
)
bla = spark.createDataFrame([df.first()])
bla.show()
Output:
+------+----+-----+-------+
| model|year|price|mileage|
+------+----+-----+-------+
|Galaxy|2017|27841| 17529|
+------+----+-----+-------+