Scala dataframe get last 6 months latest data - dataframe

I have column in dataframe like below
+-------------------+
| timestampCol|
+-------------------+
|2020-11-27 00:00:00|
|2020-11-27 00:00:00|
+-------------------+
I need to filter the data based on this date and I want to get last 6 moths data only , could anyone please suggest how can I do that ?

import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
dataset.filter(dataset.col("timestampCol").cast("date")
.gt(add_months(current_date(),-6)));
This will filter all the timestampCol values that are older than 6 months.

Depending on the dataset schema you may need to cast the value as a date.
If it's a date just compare it directly with a java.sql.Timestamp instance.
val someMomentInTime =
java.sql.Timestamp.valueOf("yyyy-[m]m-[d]d hh:mm:ss")
val df: Dataframe =
???
df.filter(col("timestampCol") > someMomentInTime) //Dataframe is Dataset[Row]

Related

How to add Extra column with current date in Spark dataframe

I am trying to add one column in my existing Pyspark Dataframe using withColumn method.I want to insert current date in this column.From my Source I don't have any date column so i am adding this current date column in my dataframe and saving this dataframe in my table so later for tracking purpose i can use this current date column.
I am using below code
df2=df.withColumn("Curr_date",datetime.now().strftime('%Y-%m-%d'))
here df is my existing Dataframe and i want to save df2 as table with Curr_date column.
but here its expecting existing column or lit method instead of datetime.now().strftime('%Y-%m-%d').
someone please guide me how should i add this Date column in my dataframe.?
use either lit or current_date
from pyspark.sql import functions as F
df2 = df.withColumn("Curr_date", F.lit(datetime.now().strftime("%Y-%m-%d")))
# OR
df2 = df.withColumn("Curr_date", F.current_date())
current_timestamp() is good but it is evaluated during the serialization time.
If you prefer to use the timestamp of the processing time of a row, then you may use the below method,
withColumn('current', expr("reflect('java.time.LocalDateTime', 'now')"))
There is a spark function current_timestamp().
from pyspark.sql.functions import *
df.withColumn('current', date_format(current_timestamp(), 'yyyy-MM-dd')).show()
+----+----------+
|test| current|
+----+----------+
|test|2020-09-09|
+----+----------+

Spark SQL - How to find total number of transactions on an hourly basis

For example, if i have a table with transaction number and transaction date [as timestamp] columns, how do i find out the total number of transactions on an hourly basis?
Is there any Spark sql functions available for this kind of range calculation?
You can use from_unixtime function.
val sqlContext = new SQLContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = // your dataframe, assuming transaction_date is timestamp in seconds
df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
.groupBy('hour)
.agg(count('transaction_number) as 'transactions)
Result:
+----+------------+
|hour|transactions|
+----+------------+
| 10| 1000|
| 12| 2000|
| 13| 3000|
| 14| 4000|
| ..| ....|
+----+------------+
Here I'm trying to give some pointer to approach, rather complete code, please see this
Time Interval Literals :
Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like
“Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
[sql]INTERVAL value unit[/sql]
Below is the way in python. you can modify the below example to match your requirement i.e transaction date start time, end time accordingly. instead of id in your case its transaction number.
# Import functions.
from pyspark.sql.functions import *
# Create a simple DataFrame.
data = [
("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
df.start_time.cast("timestamp").alias("start_time"),
df.end_time.cast("timestamp").alias("end_time"),
df.id)
# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
(to_date(df.start_time) == to_date(df.end_time)) & \
(df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)
df.filter(condition).show()
+———————+———————+—+
|start_time | end_time |id |
+———————+———————+—+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|2 |
+———————+———————+—+
using this method, you can apply group function to find total number of transactions in your case.
Above is python code, what about scala ?
expr function used above also available in scala as well
Also have a look at spark-scala-datediff-of-two-columns-by-hour-or-minute
which describes below..
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

subset a data frame based on date range [duplicate]

I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.
What is the best way to achieve this?
If date column is the index, then use .loc for label based indexing or .iloc for positional indexing.
For example:
df.loc['2014-01-01':'2014-02-01']
See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
If the column is not the index you have two choices:
Make it the index (either temporarily or permanently if it's time-series data)
df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]
See here for the general explanation
Note: .ix is deprecated.
Previous answer is not correct in my experience, you can't pass it a simple string, needs to be a datetime object. So:
import datetime
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]
And if your dates are standardized by importing datetime package, you can simply use:
df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]
For standarding your date string using datetime package, you can use this function:
import datetime
datetime.datetime.strptime
If you have already converted the string to a date format using pd.to_datetime you can just use:
df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
The shortest way to filter your dataframe by date:
Lets suppose your date column is type of datetime64[ns]
# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']
# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']
# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
If your datetime column have the Pandas datetime type (e.g. datetime64[ns]), for proper filtering you need the pd.Timestamp object, for example:
from datetime import date
import pandas as pd
value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]
If the dates are in the index then simply:
df['20160101':'20160301']
You can use pd.Timestamp to perform a query and a local reference
import pandas as pd
import numpy as np
df = pd.DataFrame()
ts = pd.Timestamp
df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')
print(df)
print(df.query('date > #ts("20190515T071320")')
with the output
date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25
Have a look at the pandas documentation for DataFrame.query, specifically the mention about the local variabile referenced udsing # prefix. In this case we reference pd.Timestamp using the local alias ts to be able to supply a timestamp string
So when loading the csv data file, we'll need to set the date column as index now as below, in order to filter data based on a range of dates. This was not needed for the now deprecated method: pd.DataFrame.from_csv().
If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:
import pandas as pd
mydata = pd.read_csv('mydata.csv',index_col='date') # or its index number, e.g. index_col=[0]
mydata['2020-01-01':'2020-02-29'] # will pull all the columns
#if just need one column, e.g. Cost, can be done:
mydata['2020-01-01':'2020-02-29','Cost']
This has been tested working for Python 3.7. Hope you will find this useful.
I'm not allowed to write any comments yet, so I'll write an answer, if somebody will read all of them and reach this one.
If the index of the dataset is a datetime and you want to filter that just by (for example) months, you can do following:
df.loc[df.index.month == 3]
That will filter the dataset for you by March.
How about using pyjanitor
It has cool features.
After pip install pyjanitor
import janitor
df_filtered = df.filter_date(your_date_column_name, start_date, end_date)
You could just select the time range by doing: df.loc['start_date':'end_date']
In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case
df.loc['2021-08-01':'2021-08-31']
returned empty. Whereas
df.loc['2021-08-31':'2021-08-01']
returned the expected data.
Another solution if you would like to use the .query() method.
It allows you to use write readable code like .query(f"{start} < MyDate < {end}") on the trade off, that .query() parses strings and the columns values must be in pandas date format (so that it is also understandable for .query())
df = pd.DataFrame({
'MyValue': [1,2,3],
'MyDate': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-03'])
})
start = datetime.date(2021,1,1).strftime('%Y%m%d')
end = datetime.date(2021,1,3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")
(following the comment from #Phillip Cloud, answer from #Retozi)
import the pandas library
import pandas as pd
STEP 1: convert the date column into a string using the pd.to_datetime() method
df['date']=pd.to_datetime(df["date"],unit='s')
STEP 2: perform the filtering in any predetermined manner ( i.e 2 months)
df = df[(df["date"] >"2022-03-01" & df["date"] < "2022-05-03")]
STEP 3 : Check the output
print(df)
# 60 days from today
after_60d = pd.to_datetime('today').date() + datetime.timedelta(days=60)
# filter date col less than 60 days date
df[df['date_col'] < after_60d]

Fetching distinct values on a column using Spark DataFrame

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?
import sqlContext.implicits._
preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
val applicationId = x.getAs[String](ApplicationId)
val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
// DO SOME TASK PER applicationId
})
preProcessedData.unpersist()
Well to obtain all different values in a Dataframe you can use distinct. As you can see in the documentation that method returns another DataFrame. After that you can create a UDF in order to transform each record.
For example:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")
// I obtain all different values. If you show you must see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct
// Define your udf. In this case I defined a simple function, but they can get complicated.
val myTransformationUDF = udf(value => value / 10)
// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In Pyspark try this,
df.select('col_name').distinct().show()
This solution demonstrates how to transform data with Spark native functions which are better than UDFs. It also demonstrates how dropDuplicates which is more suitable than distinct for certain queries.
Suppose you have this DataFrame:
+-------+-------------+
|country| continent|
+-------+-------------+
| china| asia|
| brazil|south america|
| france| europe|
| china| asia|
+-------+-------------+
Here's how to take all the distinct countries and run a transformation:
df
.select("country")
.distinct
.withColumn("country", concat(col("country"), lit(" is fun!")))
.show()
+--------------+
| country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+
You can use dropDuplicates instead of distinct if you don't want to lose the continent information:
df
.dropDuplicates("country")
.withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
.show(false)
+-------+-------------+------------------------------------+
|country|continent |description |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe |france is a country in europe |
|china |asia |china is a country in asia |
+-------+-------------+------------------------------------+
See here for more information about filtering DataFrames and here for more information on dropping duplicates.
Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
df = df.select("column1", "column2",....,..,"column N").distinct.[].collect()
in the empty list, you can insert values like [ to_JSON()] if you want the df in a JSON format.

Find minimum for a timestamp through Spark groupBy dataframe

When I try to group my dataframe on a column then try to find the minimum for each grouping groupbyDatafram.min('timestampCol') it appears I cannot do it on non numerical columns. Then how can I properly filter the minimum (earliest) date on the groupby?
I am streaming the dataframe from a postgresql S3 instance, so that data is already configured.
Just perform aggregation directly instead of using min helper:
import org.apache.spark.sql.functions.min
val sqlContext: SQLContext = ???
import sqlContext.implicits._
val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
.toDF("id", "ts")
.withColumn("ts", $"ts".cast("timestamp"))
df.groupBy($"id").agg(min($"ts")).show
// +---+--------------------+
// | id| min(ts)|
// +---+--------------------+
// | 1|2014-01-01 15:10:...|
// +---+--------------------+
Unlike min it will work on any Orderable type.