I have an Integer column called birth_date in this format: 20141130
I want to convert that to 2014-11-30 in PySpark.
This converts the date incorrectly:
.withColumn("birth_date", F.to_date(F.from_unixtime(F.col("birth_date"))))
This gives an error: argument 1 requires (string or date or timestamp) type, however, 'birth_date' is of int type
.withColumn('birth_date', F.to_date(F.unix_timestamp(F.col('birth_date'), 'yyyyMMdd').cast('timestamp')))
What is the best way to convert it to the date I want?
Convert the birth_date column from Integer to String before you pass it to the to_date function:
from pyspark.sql import functions as F
df.withColumn("birth_date", F.to_date(F.col("birth_date").cast("string"), "yyyyMMdd")).show()
+----------+
|birth_date|
+----------+
|2014-11-30|
+----------+
I hope the following code works for you:
import pyspark.sql.functions as psf
raw_df.select('timestamp')\
.withColumn("as_date", psf.to_utc_timestamp(psf.from_unixtime(psf.col("timestamp"),'yyyy-MM-dd HH:mm:ss'),'IST')).show(20)
I have a dataframe with a date column and an integer column and I'd like to add months based on the integer column to the date column. I tried the following, but I'm getting an error:
from pyspark.sql import functions as f
withColumn('future', f.add_months('cohort', col('period')))
Where 'cohort' is my date column and period is an integer. I'm getting the following error:
TypeError: Column is not iterable
Use expr to pass a column as second parameter for add_months function:
df.withColumn('future', F.expr("add_months(cohort, period)"))
I have dates in the format '6/30/2020'. It is a string and I want to convert it into date format.
List of methods I have tried
Cast('6/30/2020' as date) #returns null
to_date('6/30/2020','yyyy/MM/dd') #returns null
I also tried splitting the string and then concatenating it back into a date.
After trying all this and putting all the possible combinations in the to_date function, I am still getting the answer as null.
Now I am confused as I have used all the functions to convert string to date.
Thanks in advance for your help.
The date format you used was incorrect. Try this:
select to_date('6/30/2020', 'M/dd/yyyy')
If you want to format your result, you can use date_format:
select date_format(to_date('6/30/2020', 'M/dd/yyyy'), 'yyyy/MM/dd')
Note that to_date converts a given string from the given format, while date_format converts a given date to the given format.
I have a column of string data type that represents a date. The date format is 'yyyy-MM-dd HH:mm:ss.SSS'. I want to truncate this date to the start of the day. So, for example,
2011-07-19 12:44:42.453 should become 2011-07-19 00:00:00.0
I have tried trunc(record_timestamp, 'DD'), but it just gives me a blank string.
I also tried date_trunc(record_timestamp, 'DD') but I got the following exception:
java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: Undefined function: 'date_trunc'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
Any help is appreciated.
Try this -
scala> spark.sql(s""" select "2011-07-19 12:44:42.453" as TS, concat(substring("2011-07-19 12:44:42.453", 0,10), " 00:00:00.0") as TS_Start_Of_Day """).show(false)
+-----------------------+---------------------+
|TS |TS_Start_Of_Day |
+-----------------------+---------------------+
|2011-07-19 12:44:42.453|2011-07-19 00:00:00.0|
+-----------------------+---------------------+
I'm new to Spark SQL and am trying to convert a string to a timestamp in a spark data frame. I have a string that looks like '2017-08-01T02:26:59.000Z' in a column called time_string
My code to convert this string to timestamp is
CAST (time_string AS Timestamp)
But this gives me a timestamp of 2017-07-31 19:26:59
Why is it changing the time? Is there a way to do this without changing the time?
Thanks for any help!
CAST interprets the trailing Z as UTC and converts the value into your session time zone, which is why the time shifts. You could instead use the unix_timestamp function with the Z quoted as a literal, which keeps the clock time unchanged:
// spark.implicits._ is already in scope in spark-shell; add these two imports as well
import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

val df2 = Seq(("a3fac", "2017-08-01T02:26:59.000Z")).toDF("id", "eventTime")
df2.withColumn("eventTime1", unix_timestamp($"eventTime", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast(TimestampType))
Output:
+-----+---------------------+
|id   |eventTime1           |
+-----+---------------------+
|a3fac|2017-08-01 02:26:59.0|
+-----+---------------------+
Hope this helps!
Solution in Java
There are some Spark SQL functions which let you work with the date format.
Conversion example : 20181224091530 -> 2018-12-24 09:15:30
Solution (Spark SQL statement) :
SELECT
...
to_timestamp(cast(DECIMAL_DATE as string),'yyyyMMddHHmmss') as `TIME STAMP DATE`,
...
FROM some_table
You can run SQL statements through an instance of org.apache.spark.sql.SparkSession. For example, to execute an SQL statement, Spark provides the following:
...
// You have to create an instance of SparkSession
sparkSession.sql(sqlStatement);
...
Notes:
You have to convert the decimal to a string before you can parse it to a timestamp.
You can adjust the format pattern to produce whatever output format you want.
In Spark SQL you can use to_timestamp and then format the result with date_format as required:
select
date_format(to_timestamp(<column>, 'yyyy/MM/dd HH:mm:ss'), "yyyy-MM-dd HH:mm:ss") as <alias>
from <table>
Here 'timestamp' is a StringType column in the 'event' table with the value 2019/02/23 12:00:00. To convert it into TimestampType, apply to_timestamp(timestamp, 'yyyy/MM/dd HH:mm:ss'). Make sure the format pattern matches your column's values, then apply date_format to convert the result as required.
> select date_format(to_timestamp(timestamp,'yyyy/MM/dd HH:mm:ss'),"yyyy-MM-dd HH:mm:ss") as timeStamp from event
I'm trying to get today's date in a few different formats and I keep getting errors:
pd.to_datetime('Today',format='%m/%d/%Y') + MonthEnd(-1)
ValueError: time data 'Today' does not match format '%m/%d/%Y' (match)
What is the correct syntax to get todays date in yyyy-mm-dd and yyyymm formats?
For YYYY-MM-DD format, you can do this:
import datetime as dt
print(dt.datetime.today().date())
2017-05-23
For YYYY-MM format, you can do this:
print(dt.datetime.today().date().strftime('%Y-%m'))
2017-05
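The question also asked for a plain yyyymm form; the same strftime approach covers it. A small sketch with a fixed date so the output is deterministic (swap in dt.date.today() as needed):

```python
import datetime as dt

# Fixed date for a predictable result; dt.date.today() works the same way
d = dt.date(2017, 5, 23)
print(d.strftime('%Y-%m-%d'))  # 2017-05-23
print(d.strftime('%Y%m'))      # 201705
```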
If you need to do this on just a few columns you can use:
import pandas as pd
dataframe_name['Date_Column_name'].dt.normalize()
This method doesn't use any other module except pandas. If you need a "custom" date format you can always do:
dataframe_name['Date_Column_name'].dt.strftime('%d/%m/%Y')
Here is a list of strftime options.
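Putting both lines together, a minimal runnable sketch (the column name and sample timestamp are just illustrative):

```python
import pandas as pd

# Hypothetical datetime column; dt.normalize() truncates each value to midnight
df = pd.DataFrame({"Date_Column_name": pd.to_datetime(["2011-07-19 12:44:42.453"])})
df["normalized"] = df["Date_Column_name"].dt.normalize()
df["custom"] = df["Date_Column_name"].dt.strftime("%d/%m/%Y")
print(df)
```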