How to convert datetime to int on a PySpark dataframe

df['DATE'].apply(lambda x: x.strftime("%Y%m%d")).astype('float64')
This raises the error:
TypeError: 'Column' object is not callable
How would I convert this syntax to work with PySpark?

A simple way is to reformat 'yyyy-MM-dd' to 'yyyyMMdd' by stripping the dashes and casting:
from pyspark.sql.functions import col, regexp_replace

data = [
    ('2022-08-10', 1),
    ('2022-08-09', 2),
]
df = spark.createDataFrame(data, ['DATE', 'idx'])
df.printSchema()
# root
# |-- DATE: string (nullable = true)
# |-- idx: long (nullable = true)
df = df.withColumn('DATE', regexp_replace(col('DATE'), '-', '').cast('long'))
df.printSchema()
# root
# |-- DATE: long (nullable = true)
# |-- idx: long (nullable = true)
df.show(10, False)
# +--------+---+
# |DATE |idx|
# +--------+---+
# |20220810|1 |
# |20220809|2 |
# +--------+---+
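If the DATE column is (or can be cast to) an actual date or timestamp rather than a plain string, an alternative sketch using date_format would be (assuming the values cast cleanly to date):
from pyspark.sql.functions import col, date_format

# Sketch: format the date as yyyyMMdd and cast the result to a numeric type.
df = df.withColumn('DATE', date_format(col('DATE').cast('date'), 'yyyyMMdd').cast('long'))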


Passing schema to construct DataFrame

I'm working on migrating a job. As a part of this, I need to pass column datatypes when constructing dataframes.
I was able to construct a dictionary that holds the table name as the key and the schema definition for that table as the value. When I try to pass the value as the dataframe schema, it complains that the schema should be a struct type, not a string.
Dictionary I'm creating:
{'table1': StructType([
    StructField("column1", varchar(), True),
    StructField("column2", numeric(), True),
    StructField("column3", numeric(), True),
    StructField("column4", timestamp(), True),
    StructField("column5", timestamp(), True),
    StructField("column6", timestamp(), True)
])}
I'm aware the datatypes above may be wrong, but this is just an example.
Error: expecting a Struct not a string literal for schema definition.
I'm not sure how you use your dictionary, but the following way of passing the schema as a dict value works well:
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, TimestampType)

my_dict = {'table1': StructType([
    StructField("column1", StringType(), True),
    StructField("column2", LongType(), True),
    StructField("column3", LongType(), True),
    StructField("column4", TimestampType(), True),
    StructField("column5", TimestampType(), True),
    StructField("column6", TimestampType(), True)
])}
df = spark.createDataFrame([], my_dict["table1"])
df.printSchema()
# root
# |-- column1: string (nullable = true)
# |-- column2: long (nullable = true)
# |-- column3: long (nullable = true)
# |-- column4: timestamp (nullable = true)
# |-- column5: timestamp (nullable = true)
# |-- column6: timestamp (nullable = true)
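If the goal is to create one (empty) DataFrame per table, a minimal sketch iterating over the dictionary defined above would be:
# Sketch: build one empty DataFrame per table name, keyed the same way as my_dict.
dfs = {table: spark.createDataFrame([], schema) for table, schema in my_dict.items()}
dfs["table1"].printSchema()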

How to handle different date formats in a CSV file while reading a DataFrame in Spark using option("dateFormat")?

I have the input file below:
INPUTFILE_CNTRY_CODE|INPUTFILE_CTY_CODE|INPUTFILE_ID|INPUTFILE_LTY_ID|INPUTFILE_CNSM_ID|INPUTFILE_DATE|INPUTFILE_TIME|INPUTFILE_TRDATE
GBR|263|326735246||I034867789V|15/11/30|2015-11-30 00:00:00.000000|2016-22-06
GBR|263|397802068|PC7135361|PC7135361|16/05/20|2016-10-06 11:50:05.000000|2016-22-07
and I am trying to read it as below:
import org.apache.spark.sql.types._

val registeration_schema = StructType(List(
  StructField("INPUTFILE_CNTRY_CODE", StringType),
  StructField("INPUTFILE_CTY_CODE", IntegerType),
  StructField("INPUTFILE_ID", IntegerType),
  StructField("INPUTFILE_LTY_ID", StringType),
  StructField("INPUTFILE_CNSM_ID", StringType),
  StructField("INPUTFILE_DATE", DateType),
  StructField("INPUTFILE_TIME", TimestampType),
  StructField("INPUTFILE_TRDATE", DateType)
))
val registerationDF = spark.read
.option("header", "true")
.option("delimiter", "|")
.option("mode", "FAILFAST")
.schema(registeration_schema)
.option("dateFormat", "yy/M/d")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
.csv("registration2.csv")
And I am getting the error below:
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: java.time.format.DateTimeParseException: Text '2016-22-06' could not be parsed at index 2
at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:262)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$doParse$2(UnivocityParser.scala:200)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.parse(UnivocityParser.scala:207)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:347)
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
... 27 more
Caused by: java.time.format.DateTimeParseException: Text '2016-22-06' could not be parsed at index 2
This is happening because of the different date formats, and I have specified only one of them while loading the DataFrame. Can someone explain how to handle more than one date format while reading a CSV into a dataframe?
You cannot define more than one format for DateType while loading a CSV, but you can achieve this by combining the date_format() and to_date() functions available in Spark 2.2+.
The high-level steps are as follows:
Define one of the two date columns as a string in the original schema. I chose to define INPUTFILE_DATE as a string in my demonstration.
Convert INPUTFILE_DATE to the appropriate format using the date_format() and to_date() functions, so that its datatype becomes date.
Define your original schema with INPUTFILE_DATE --> StringType
val registeration_schema = StructType(List(
  StructField("INPUTFILE_CNTRY_CODE", StringType),
  StructField("INPUTFILE_CTY_CODE", IntegerType),
  StructField("INPUTFILE_ID", IntegerType),
  StructField("INPUTFILE_LTY_ID", StringType),
  StructField("INPUTFILE_CNSM_ID", StringType),
  StructField("INPUTFILE_DATE", StringType),
  StructField("INPUTFILE_TIME", TimestampType),
  StructField("INPUTFILE_TRDATE", DateType)
))
val registerationDF = spark.read
.option("header", "true")
.option("delimiter", "|")
.option("mode", "FAILFAST")
.schema(registeration_schema)
.option("dateFormat", "yyyy-dd-MM")
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
.csv("registration2.csv")
The core part of this solution is:
import org.apache.spark.sql.functions.{col, date_format, to_date}

val targetDF = registerationDF.withColumn("INPUTFILE_DATE",to_date((date_format(to_date(col("INPUTFILE_DATE"),"yy/MM/dd"),"yyyy-dd-MM")),"yyyy-dd-MM"))
End result:
scala> targetDF.printSchema()
root
|-- INPUTFILE_CNTRY_CODE: string (nullable = true)
|-- INPUTFILE_CTY_CODE: integer (nullable = true)
|-- INPUTFILE_ID: integer (nullable = true)
|-- INPUTFILE_LTY_ID: string (nullable = true)
|-- INPUTFILE_CNSM_ID: string (nullable = true)
|-- INPUTFILE_DATE: date (nullable = true)
|-- INPUTFILE_TIME: timestamp (nullable = true)
|-- INPUTFILE_TRDATE: date (nullable = true)
scala> targetDF.show()
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+
|INPUTFILE_CNTRY_CODE|INPUTFILE_CTY_CODE|INPUTFILE_ID|INPUTFILE_LTY_ID|INPUTFILE_CNSM_ID|INPUTFILE_DATE| INPUTFILE_TIME|INPUTFILE_TRDATE|
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+
| GBR| 263| 326735246| null| I034867789V| 2015-11-30|2015-11-30 00:00:00| 2017-10-06|
| GBR| 263| 397802068| PC7135361| PC7135361| 2016-05-20|2016-10-06 11:50:05| 2017-10-07|
+--------------------+------------------+------------+----------------+-----------------+--------------+-------------------+----------------+
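For anyone doing the same in PySpark, a rough sketch of just the core conversion (assuming a registerationDF read the same way with the PySpark reader; the double to_date/date_format round trip above can be collapsed into a single to_date call):
from pyspark.sql.functions import col, to_date

# Sketch: parse INPUTFILE_DATE (read as a string) with its own format.
registerationDF = registerationDF.withColumn("INPUTFILE_DATE", to_date(col("INPUTFILE_DATE"), "yy/MM/dd"))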

Pyspark - How to set the schema when reading parquet file from another DF?

I have DF1 with schema:
df1 = spark.read.parquet(load_path1)
df1.printSchema()
root
|-- PRODUCT_OFFERING_ID: string (nullable = true)
|-- CREATED_BY: string (nullable = true)
|-- CREATION_DATE: string (nullable = true)
and DF2:
df2 = spark.read.parquet(load_path2)
df2.printSchema()
root
|-- PRODUCT_OFFERING_ID: decimal(38,10) (nullable = true)
|-- CREATED_BY: decimal(38,10) (nullable = true)
|-- CREATION_DATE: timestamp (nullable = true)
Now I want to union these 2 dataframes.
Sometimes it gives errors when I try to UNION these 2 DFs because of the different schemas.
How can I make DF2 have exactly the same schema as DF1 (at load time)?
I tried with:
df2 = spark.read.parquet(load_path2).schema(df1.schema)
Getting error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not callable
Or should I CAST it instead (once DF2 is read)?
Thanks.
Move .schema() before .parquet(); Spark will then read the parquet file with the specified schema:
df2 = spark.read.schema(df1.schema).parquet(load_path2)
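If you would rather read DF2 as-is and cast afterwards (the CAST route mentioned in the question), a minimal sketch assuming both DataFrames share the same column names:
from pyspark.sql.functions import col

# Sketch: cast every df2 column to the type of the matching df1 column.
df2_cast = df2.select([col(f.name).cast(f.dataType) for f in df1.schema.fields])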

How to change the column names of a dataframe with respect to another dataframe

I have a requirement to change the column names of a dataframe df based on another dataframe df_col, using PySpark.
df
+----+---+----+----+
|code| id|name|work|
+----+---+----+----+
| ASD|101|John| DEV|
| klj|102| ben|prod|
+----+---+----+----+
df_col
+-----------+-----------+
|col_current|col_updated|
+-----------+-----------+
| id| Row_id|
| name| Name|
| code| Row_code|
| Work| Work_Code|
+-----------+-----------+
If a df column matches col_current, that column should be renamed to the corresponding col_updated. For example: if df.id matches df_col.col_current, df.id should be renamed to Row_id.
expected output
Row_id,Name,Row_code,Work_code
101,John,ASD,DEV
102,ben,klj,prod
Note: I want this process to be dynamic.
Just collect df_col as a dictionary:
df = spark.createDataFrame(
    [("ASD", "101", "John", "DEV"), ("klj", "102", "ben", "prod")],
    ("code", "id", "name", "work")
)
df_col = spark.createDataFrame(
    [("id", "Row_id"), ("name", "Name"), ("code", "Row_code"), ("Work", "Work_Code")],
    ("col_current", "col_updated")
)
name_dict = df_col.rdd.collectAsMap()
and use select with list comprehension:
df.select([df[c].alias(name_dict.get(c, c)) for c in df.columns]).printSchema()
# root
# |-- Row_code: string (nullable = true)
# |-- Row_id: string (nullable = true)
# |-- Name: string (nullable = true)
# |-- work: string (nullable = true)
where name_dict is a standard Python dictionary:
{'Work': 'Work_Code', 'code': 'Row_code', 'id': 'Row_id', 'name': 'Name'}
name_dict.get(c, c) returns the new name for a given current name, or the current name itself if there is no match:
name_dict.get("code", "code")
# 'Row_code'
name_dict.get("work", "work") # Case sensitive
# 'work'
and alias simply renames the column (df[c]) to the name returned by name_dict.get.
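An equivalent sketch using withColumnRenamed in a loop (same name_dict as above; this only renames, so the original column order is preserved):
# Sketch: rename each column if it appears in name_dict, otherwise keep it unchanged.
renamed_df = df
for c in df.columns:
    renamed_df = renamed_df.withColumnRenamed(c, name_dict.get(c, c))
renamed_df.printSchema()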

Adding an element to an existing struct in a Spark dataframe

I have a PySpark dataframe and the schema looks like this:
root
|-- useragent: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- device: string (nullable = true)
I am using a udf + withColumn to extract more info out of the useragent, but I am only able to store any additional property I create in a new column. Is there a way I can append it to the struct itself?
I've tried changing the schema, but it doesn't carry over the data from the properties that already exist:
df = df.schema['properties'].dataType.add(StructField('type', StringType()))
Would it make more sense to convert this to an RDD?
You will not get any data using this piece of code, since you are not selecting any data.
What this operation does is merely access the schema property, modify it in some way, and assign the result to the variable df; that is not the modified dataframe you are expecting, but just a StructType object (and there also seems to be a syntax error in it, please check).
What you should do instead is modify the dataframe through select and suitable transformation functions. You may consider building a new struct column with the required fields and adding it back with the withColumn function, as in the sketch below.
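A minimal sketch of that approach, rebuilding the struct from its existing fields plus the new one (the regexp_extract expression is only illustrative; this also works on Spark versions that lack withField):
from pyspark.sql import functions as F

# Sketch: recreate the properties struct from its current fields plus a new "type" field.
existing_fields = [F.col(f"properties.{f.name}").alias(f.name)
                   for f in df.schema["properties"].dataType.fields]
new_field = F.regexp_extract("useragent", r"type=(\w+)", 1).alias("type")
df = df.withColumn("properties", F.struct(*(existing_fields + [new_field])))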
Alternatively, withField (available since Spark 3.1) can be used:
F.col("properties").withField("type", extract_function)
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("type=type1,mode=5", ("Chrome", "PC1"))],
    "useragent string, properties struct<browser:string,device:string>"
)
df.printSchema()
# root
# |-- useragent: string (nullable = true)
# |-- properties: struct (nullable = true)
# | |-- browser: string (nullable = true)
# | |-- device: string (nullable = true)
df.show()
# +-----------------+-------------+
# | useragent| properties|
# +-----------------+-------------+
# |type=type1,mode=5|{Chrome, PC1}|
# +-----------------+-------------+
df = df.withColumn("properties", F.col("properties").withField(
"type",
F.regexp_extract("useragent", r"type=(\w+)", 1)
))
df.printSchema()
# root
# |-- useragent: string (nullable = true)
# |-- properties: struct (nullable = true)
# | |-- browser: string (nullable = true)
# | |-- device: string (nullable = true)
# | |-- type: string (nullable = true)
df.show()
# +-----------------+--------------------+
# | useragent| properties|
# +-----------------+--------------------+
# |type=type1,mode=5|{Chrome, PC1, type1}|
# +-----------------+--------------------+