My data frame looks like
+---+----------+--------------------+
|id |reg_date  |txn_date            |
+---+----------+--------------------+
|1  |2019-01-06| 2019-02-15 12:51:15|
|1  |2019-01-06| 2019-03-29 13:15:27|
|1  |2019-01-06| 2019-06-01 01:42:57|
|1  |2019-01-06| 2019-01-06 17:01:...|
|5  |2019-06-16| 2019-07-19 11:50:34|
|5  |2019-06-16| 2019-07-13 19:49:39|
|5  |2019-06-16| 2019-08-27 17:37:22|
|2  |2018-07-30| 2019-01-01 07:03:...|
|2  |2018-07-30| 2019-07-30 01:27:57|
|2  |2018-07-30| 2019-02-01 00:08:35|
+---+----------+--------------------+
I want to pick up the first txn_date after reg_date, i.e. the first txn_date where reg_date >= txn_date.
Expected output:
+---+----------+--------------------+
|id |reg_date  |txn_date            |
+---+----------+--------------------+
|1  |2019-01-06| 2019-01-06 17:01:...|
|5  |2019-06-16| 2019-07-13 19:49:39|
|2  |2018-07-30| 2019-07-30 01:27:57|
+---+----------+--------------------+
This is what I have done so far:
df = df.withColumn('txn_date', F.to_date(F.unix_timestamp(F.col('txn_date'), 'yyyy-MM-dd HH:mm:ss').cast("timestamp")))
df = df.withColumn('reg_date', F.to_date(F.unix_timestamp(F.col('reg_date'), 'yyyy-MM-dd').cast("timestamp")))
gg = df.groupBy('id', 'reg_date').agg(F.min(F.col('txn_date')))
But I am getting wrong results.
The condition reg_date >= txn_date can be ambiguous.
Does 2019-01-06>=2019-01-06 17:01:30 mean 2019-01-06 00:00:00>=2019-01-06 17:01:30 or 2019-01-06 23:59:59>=2019-01-06 17:01:30?
In your example, 2019-01-06>=2019-01-06 17:01:30 is evaluated to be true, so I assume it is the latter case, i.e. the case with 23:59:59.
Proceeding with the assumption above, here is how I coded it.
import pyspark.sql.functions as F
#create a sample data frame
data = [('2019-01-06','2019-02-15 12:51:15'),('2019-01-06','2019-03-29 13:15:27'),('2019-01-06','2019-01-06 17:01:30'),\
('2019-07-30','2019-07-30 07:03:01'),('2019-07-30','2019-07-30 01:27:57'),('2019-07-30','2019-07-30 00:08:35')]
cols = ('reg_date', 'txn_date')
df = spark.createDataFrame(data, cols)
#add 23:59:59 to reg_date as a dummy_date for a timestamp comparison later
df = df.withColumn('dummy_date', F.concat(F.col('reg_date'), F.lit(' 23:59:59')))
#convert columns to the appropriate time data types
df = df.select([F.to_date(F.col('reg_date'),'yyyy-MM-dd').alias('reg_date'),\
F.to_timestamp(F.col('txn_date'),'yyyy-MM-dd HH:mm:ss').alias('txn_date'),\
F.to_timestamp(F.col('dummy_date'),'yyyy-MM-dd HH:mm:ss').alias('dummy_date')])
#implementation part
(df.orderBy('reg_date')
.filter(F.col('dummy_date')>=F.col('txn_date'))
.groupBy('reg_date')
.agg(F.first('txn_date').alias('txn_date'))
.show())
#+----------+-------------------+
#|  reg_date|           txn_date|
#+----------+-------------------+
#|2019-01-06|2019-01-06 17:01:30|
#|2019-07-30|2019-07-30 07:03:01|
#+----------+-------------------+
You don't need to order. You can discard all larger values with a filter, then aggregate by id and take the smallest timestamp, because the first timestamp will be the minimum. Something like:
df.filter(df.reg_date >= df.txn_date) \
  .groupBy(df.id, df.reg_date) \
  .agg(F.min(df.txn_date)) \
  .show()
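One caveat with the direct comparison above: when Spark compares a date column with a timestamp column, the date is implicitly cast to a timestamp at midnight (00:00:00), so reg_date >= txn_date follows the 00:00:00 interpretation discussed earlier, not the 23:59:59 one. A quick sketch to verify (the literals are illustrative only):
#a date compared with a timestamp is promoted to a timestamp at midnight
spark.sql("SELECT to_date('2019-01-06') >= to_timestamp('2019-01-06 17:01:30') AS cmp").show()
#+-----+
#|  cmp|
#+-----+
#|false|
#+-----+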
Related
I have a DF as below:
Name city starttime endtime
user1 London 2019-08-02 03:34:45 2019-08-02 03:52:03
user2 Boston 2019-08-13 13:34:10 2019-08-13 15:02:10
I would like to check the endtime, and if it crosses into the next hour, update the current record with the last minute/second of the current hour and append another row or rows with similar data, as shown below (user2). Do I use flatMap, or convert the DF to an RDD and use map, or is there another way?
Name city starttime endtime
user1 London 2019-08-02 03:34:45 2019-08-02 03:52:03
user2 Boston 2019-08-13 13:34:10 2019-08-13 13:59:59
user2 Boston 2019-08-13 14:00:00 2019-08-13 14:59:59
user2 Boston 2019-08-13 15:00:00 2019-08-13 15:02:10
Thanks
>>> from pyspark.sql.functions import *
>>> df.show()
+-----+------+-------------------+-------------------+
| Name| city| starttime| endtime|
+-----+------+-------------------+-------------------+
|user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
|user2|Boston|2019-08-13 13:34:10|2019-08-13 15:02:10|
+-----+------+-------------------+-------------------+
>>> df1 = df.withColumn("diff", ((hour(col("endtime")) - hour(col("starttime")))).cast("Int"))
.withColumn("loop", expr("split(repeat(':', diff),':')"))
.select(col("*"), posexplode(col("loop")).alias("pos", "value"))
.drop("value", "loop")
>>> df1.withColumn("starttime", when(col("pos") == 0, col("starttime")).otherwise(from_unixtime(unix_timestamp(col("starttime")) + (col("pos") * 3600) - minute(col("starttime"))*60 - second(col("starttime")))))
.withColumn("endtime", when((col("diff") - col("pos")) == 0, col("endtime")).otherwise(from_unixtime(unix_timestamp(col("endtime")) - ((col("diff") - col("pos")) * 3600) - minute(col("endtime"))*60 - second(col("endtime")) + lit(59) * lit(60) + lit(59))))
.drop("diff", "pos")
.show()
+-----+------+-------------------+-------------------+
| Name| city| starttime| endtime|
+-----+------+-------------------+-------------------+
|user1|London|2019-08-02 03:34:45|2019-08-02 03:52:03|
|user2|Boston|2019-08-13 13:34:10|2019-08-13 13:59:59|
|user2|Boston|2019-08-13 14:00:00|2019-08-13 14:59:59|
|user2|Boston|2019-08-13 15:00:00|2019-08-13 15:02:10|
+-----+------+-------------------+-------------------+
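If you are on Spark 2.4+, here is an alternative sketch that avoids the posexplode arithmetic, assuming starttime and endtime are already timestamp columns (convert them with to_timestamp first otherwise): build the hourly grid with sequence/explode, then clamp each slice to the original interval.
import pyspark.sql.functions as F
#one row per clock hour touched by the interval (sequence needs Spark 2.4+)
df2 = (df
    .withColumn("hour_start",
        F.explode(F.sequence(F.date_trunc("hour", F.col("starttime")),
                             F.date_trunc("hour", F.col("endtime")),
                             F.expr("interval 1 hour"))))
    #clamp each hourly slice back to the original start and end
    .select("Name", "city",
        F.greatest(F.col("starttime"), F.col("hour_start")).alias("starttime"),
        F.least(F.col("endtime"),
                F.col("hour_start") + F.expr("interval 59 minutes 59 seconds")).alias("endtime")))
df2.show(truncate=False)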
I have a set of data where one column is date and time. I have been asked for all the data in the table between two dates, and within those dates, only a certain time window. For example, I want data between 01/02/2019 - 10/02/2019 and within the times 12:00 AM to 07:00 AM. (My real date ranges span a number of months; I am just using these dates as an example.)
I can cast the date and time into two different columns to separate them out as shown below:
select
name
,dateandtimetest
,cast(dateandtimetest as date) as JustDate
,cast(dateandtimetest as time) as JustTime
INTO #Test01
from [dbo].[TestTable]
I put this into a test table so that I could see if I could use a between function on the JustTime column, because I know I can do the between on the dates no problem. My idea was to get them done in two separate tables and perform an inner join to get the results I need
select *
from #Test01
where JustTime between '00:00' and '05:00'
The above code will not give me the data I need. I have been racking my brain for this so any help would be much appreciated!
The test table I am using to try and get the correct code is shown below:
+---------+---------------------+
|Name     |DateAndTimeTest      |
+---------+---------------------+
|Lauren   |2019-02-01 04:14:00  |
|Paul     |2019-02-02 08:20:00  |
|Bill     |2019-02-03 12:00:00  |
|Graham   |2019-02-05 16:15:00  |
|Amy      |2019-02-06 02:43:00  |
|Jordan   |2019-02-06 03:00:00  |
|Sid      |2019-02-07 15:45:00  |
|Wes      |2019-02-18 01:11:00  |
|Adam     |2019-02-11 11:11:00  |
|Rhodesy  |2019-02-11 15:16:00  |
+---------+---------------------+
I have now managed to get the data between the times on a single date using the code below, but I would need this to run for every date over a three-month period.
select *
from dbo.TestTable
where DateAndTimeTest between '2019-02-11 00:00:00' and '2019-02-11 08:30:00'
You can use SQL similar to the following:
select *
from dbo.TestTable
where (CAST(DateAndTimeTest as date) between '2019-02-11' AND '2019-02-11') AND
(CAST(DateAndTimeTest as time) between '00:00:00' and '08:30:00')
The above query returns all records where the DateAndTimeTest value falls on 2019-02-11 and the time is between 12:00 AM and 8:30 AM.
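For the three-month window mentioned in the question, the same pattern extends directly; the dates below only illustrate the range (adjust them to your real window):
select *
from dbo.TestTable
where (CAST(DateAndTimeTest as date) between '2019-02-01' and '2019-04-30') AND
      (CAST(DateAndTimeTest as time) between '00:00:00' and '07:00:00')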
I have a database of events and I want to make a daily schedule from it.
It looks something like the following:
+-----+-----+---+--------+
|Event|Start|End|Duration|
+-----+-----+---+--------+
|A |08 |10 |2 |
+-----+-----+---+--------+
|B |09 |10 |1 |
+-----+-----+---+--------+
|C |13 |15 |2 |
+-----+-----+---+--------+
I want to query for all events that are in progress at 9, and I can't figure out the math for calculating the time.
The query should return A and B for this example. I tried:
start + duration > 9 and start <= 9
but it isn't correct... Any help please?
What you want is this where clause:
where start <= 9 and end > 9
That is, something happens at 9 if it starts at or before 9 and ends after 9. (If you want things that end at 9 to be included, just change the > to >=.)
I notice that you have leading zeros. This suggests that the values are stored as strings. In that case, do string comparisons:
where start <= '09' and end > '09'
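Applied to the sample data in the question (numeric columns), a sketch of the full query; the table name Events is an assumption:
select Event
from Events
where Start <= 9 and "End" > 9
--returns A (starts 08, ends 10) and B (starts 09, ends 10); C (13-15) is excluded
--"End" is quoted because END is a reserved word in most SQL dialects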
I have a database that records clients who receive a rating score (between 0 and 50) upon entry to the service we provide. They are seen on average once a week, and after four sessions they are re-evaluated on the same scale to see a trend; so they may initially score 22, and after four weeks it may be 44.
What I am after is a SQL query to group this data:
+----+-------+--------+
|name|initial|followup|
+----+-------+--------+
|joe |22 | |
+----+-------+--------+
|joe | |44 |
+----+-------+--------+
I want this to show:
+----+-------+--------+
|name|initial|followup|
+----+-------+--------+
|joe |22 |44 |
+----+-------+--------+
I know this is a simple question and I have done this before, but 'tis the time of year and the pressure is on from management.
Many thanks in advance.
Assuming the blanks mean NULL, just use aggregation:
select name, max(initial) as initial, max(followup) as followup
from t
group by name;
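If the blanks turn out to be empty strings rather than NULL, the same aggregation works with a NULLIF wrapper (standard SQL):
select name,
       max(nullif(initial, '')) as initial,
       max(nullif(followup, '')) as followup
from t
group by name;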
I have a database with many tables, in particular two: one stores paths and the other stores the cities of a path:
Table Paths [ PathID, Name ]
Table Routes [ ID, PathID (Foreign Key), City, GoTime, BackTime, GoDistance, BackDistance ]
Table Paths :
---------------------------------------
|PathID |Name |
|-------+-----------------------------|
|1 |NewYork Casablanca Alpha 1 |
|7 |Paris Dubai 6007 10:00 |
---------------------------------------
Table Routes :
ID  PathID  City        GoTime  BackTime  GoDistance  BackDistance
1   1       NewYork     08:00   23:46     5810        NULL
2   1       Casablanca  15:43   16:03     NULL        5800
3   7       Paris       10:20   14:01     3215        NULL
4   7       Cairo       14:50   09:31     2425        3215
5   7       Dubai       18:21   06:00     NULL        2425
I want a query that gives me all the possible combinations inside the same path, something like:
PathID CityFrom CityTo Distance
I don't know if I made myself clear, but I hope you guys can help me. Thanks in advance.
This is the correct answer, worked out manually:
------------------------------------------------------
|PathID |Go_Back |CityA |CityB |Distance|
|-------+-----------+-----------+-----------+--------|
|1 |Go |NewYork |Casablanca |5810 |
|1 |Back |Casablanca |NewYork |5800 |
|7 |Go |Paris |Cairo |3215 |
|7 |Go |Paris |Dubai |5640 |
|7 |Go |Cairo |Dubai |2425 |
|7 |Back |Dubai |Cairo |2425 |
|7 |Back |Dubai |Paris |5640 |
|7 |Back |Cairo |Paris |3215 |
------------------------------------------------------
This comes down to two questions.
Q1:
How to split up column "Name" from table "Paths" so that it is in first normal form (see Wikipedia for a definition: the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain). You must do this yourself. It might be cumbersome to use the text-processing functions of your database to split up the nonatomic column values, so consider
writing a script (Perl/Python/...) that does this, and re-importing the results into a new table.
Q2:
How to calculate the "possible path combinations".
Maybe it is possible with a simple SQL query, by sorting the table. You haven't shown enough data.
Ultimately, this can be done with recursive SQL. Postgres can do this. It is an advanced topic.
You definitely must decide whether your paths can contain loops. (A traveller might decide to take a circular detour many times; it makes no sense practically, but mathematically it is possible.)
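That said, the loop-free sample data above does not need recursion once the rows are ordered within each path. Here is a sketch for the 'Go' direction, assuming GoDistance on each row is the distance from that city to the next stop (the 'Back' direction works the same way with BackTime and BackDistance):
with ordered as (
  select PathID, City, GoDistance,
         row_number() over (partition by PathID order by GoTime) as rn
  from Routes
)
select a.PathID,
       'Go' as Go_Back,
       a.City as CityA,
       b.City as CityB,
       sum(s.GoDistance) as Distance
from ordered a
join ordered b on b.PathID = a.PathID and b.rn > a.rn
join ordered s on s.PathID = a.PathID and s.rn >= a.rn and s.rn < b.rn
group by a.PathID, a.City, b.City, a.rn, b.rn;
For path 7 this yields Paris-Cairo 3215, Cairo-Dubai 2425 and Paris-Dubai 5640, matching the manually built table above.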