I have a dataframe:
import pandas as pd
data = [('s1', 's2'),
        ('s1', 's3'),
        ('s2', 's4'),
        ('s3', 's5'),
        ('s5', 's6')]
df = pd.DataFrame(data, columns=['start', 'end'])
+-----+---+
|start|end|
+-----+---+
| s1| s2|
| s1| s3|
| s2| s4|
| s3| s5|
| s5| s6|
+-----+---+
I want to check whether each value in the end column also appears in the start column and, if so, write that matching row's end value into a new end2 column:
new_df = df
df = df.join(new_df , (df.start== new_df.end))
The result is something like this:
+-----+---+----+
|start|end|end2|
+-----+---+----+
| s1| s2| s4|
| s1| s3| s5|
| s2| s4|null|
| s3| s5| s6|
| s5| s6|null|
+-----+---+----+
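For reference, a single step of this lookup can be written as a pandas self-merge; a minimal sketch (the names lookup and step1 are illustrative, not from the question):
# Rename the lookup copy so its start column lines up with this frame's end column,
# then a left merge fills end2 with the matching row's end value (NaN if no match).
lookup = df.rename(columns={'start': 'end', 'end': 'end2'})
step1 = df.merge(lookup, on='end', how='left')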
Then I want to join again to check whether end2 has values in start, and write the matching end values into a new end3 column, repeating the join until the newest column contains only None values.
In other words, this is an iterative join along the columns (my real dataframe has many more rows, so writing out every join by hand is not practical), but I don't understand how to do it. I imagine something like the loop below, with the desired result shown after it:
while df[-1].notnull():          # pseudocode: while the newest column still has non-null values
    df = df.join(...)            # join the end columns
    new_df[...] = ...            # add the new column
+-----+---+----+----+----+
|start|end|end2|end3|end4|
+-----+---+----+----+----+
| s1| s2| s4|None|None|
| s1| s3| s5| s6|None|
| s2| s4|None|None|None|
| s3| s5| s6|None|None|
| s5| s6|None|None|None|
+-----+---+----+----+----+
Try building a dictionary from the original dataframe, with start as the keys and end as the values, and mapping it onto the end column:
df.assign(end2 = df['end'].map(dict(df.to_records(index=False))))
Output:
start end end2
0 s1 s2 s4
1 s1 s3 s5
2 s2 s4 NaN
3 s3 s5 s6
4 s5 s6 NaN
To create all possible columns, we can use a while loop:
i = 2
m = dict(df.to_records(index=False))
while df.iloc[:, -1].count() != 0:
    df['end{}'.format(i)] = df.iloc[:, -1].map(m)
    i += 1
Output:
start end end2 end3 end4
0 s1 s2 s4 NaN NaN
1 s1 s3 s5 s6 NaN
2 s2 s4 NaN NaN NaN
3 s3 s5 s6 NaN NaN
4 s5 s6 NaN NaN NaN
I need help with this case: filling in the missing values by adding new rows.
This is just an example, but I have a lot of rows with different IDs.
Input dataframe:
ID   FLAG  DATE
123  1     01/01/2021
123  0     01/02/2021
123  1     01/03/2021
123  0     01/06/2021
123  0     01/08/2021
777  0     01/01/2021
777  1     01/03/2021
So I have a finite set of dates and, for each ID, I want every date up to that ID's last one (in the example, for ID = 123: 01/01/2021, 01/02/2021, 01/03/2021, ... up to 01/08/2021). Basically I could do a cross join with a calendar, but I don't know how to fill the missing values with a rule or a filter after the cross join.
Expected output (the rows generated for the missing dates are marked with *):
ID   FLAG  DATE
123  1     01/01/2021
123  0     01/02/2021
123  1     01/03/2021
123  1     01/04/2021  *
123  1     01/05/2021  *
123  0     01/06/2021
123  0     01/07/2021  *
123  0     01/08/2021
777  0     01/01/2021
777  0     01/02/2021  *
777  1     01/03/2021
You can first group by id to calculate the max and min date, then use the sequence function to generate all the dates from min_date to max_date. Finally, join with the original dataframe and fill the nulls with the last non-null value per group of id. Here's a complete working example:
Your input dataframe:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame([
    (123, 1, "01/01/2021"), (123, 0, "01/02/2021"),
    (123, 1, "01/03/2021"), (123, 0, "01/06/2021"),
    (123, 0, "01/08/2021"), (777, 0, "01/01/2021"),
    (777, 1, "01/03/2021")
], ["id", "flag", "date"])
Groupby id and generate all possible dates for each id:
all_dates_df = df.groupBy("id").agg(
    F.date_trunc("mm", F.max(F.to_date("date", "dd/MM/yyyy"))).alias("max_date"),
    F.date_trunc("mm", F.min(F.to_date("date", "dd/MM/yyyy"))).alias("min_date")
).select(
    "id",
    F.expr("sequence(min_date, max_date, interval 1 month)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date", F.date_format("date", "dd/MM/yyyy")
)
Now, left join with df and use the last function over a Window partitioned by id to fill the null values:
w = Window.partitionBy("id").orderBy("date")
result = all_dates_df.join(df, ["id", "date"], "left").select(
    "id",
    "date",
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
      for c in df.columns if c not in ("id", "date")]
)
result.show()
#+---+----------+----+
#| id| date|flag|
#+---+----------+----+
#|123|01/01/2021| 1|
#|123|01/02/2021| 0|
#|123|01/03/2021| 1|
#|123|01/04/2021| 1|
#|123|01/05/2021| 1|
#|123|01/06/2021| 0|
#|123|01/07/2021| 0|
#|123|01/08/2021| 0|
#|777|01/01/2021| 0|
#|777|01/02/2021| 0|
#|777|01/03/2021| 1|
#+---+----------+----+
You can find the range of dates between the DATE value in the current row and that of the following row, then use sequence to generate all the intermediate dates and explode this array to fill in values for the missing dates.
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [(123, 1, "01/01/2021"),
        (123, 0, "01/02/2021"),
        (123, 1, "01/03/2021"),
        (123, 0, "01/06/2021"),
        (123, 0, "01/08/2021"),
        (777, 0, "01/01/2021"),
        (777, 1, "01/03/2021")]
df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)).withColumn("DATE", F.to_date(F.col("DATE"), "dd/MM/yyyy"))
window_spec = Window.partitionBy("ID").orderBy("DATE")
next_date = F.coalesce(F.lead("DATE", 1).over(window_spec), F.col("DATE") + F.expr("interval 1 month"))
end_date_range = next_date - F.expr("interval 1 month")
df.withColumn("Ranges", F.sequence(F.col("DATE"), end_date_range, F.expr("interval 1 month")))\
.withColumn("DATE", F.explode("Ranges"))\
.withColumn("DATE", F.date_format("date", "dd/MM/yyyy"))\
.drop("Ranges").show(truncate=False)
Output
+---+----+----------+
|ID |FLAG|DATE |
+---+----+----------+
|123|1 |01/01/2021|
|123|0 |01/02/2021|
|123|1 |01/03/2021|
|123|1 |01/04/2021|
|123|1 |01/05/2021|
|123|0 |01/06/2021|
|123|0 |01/07/2021|
|123|0 |01/08/2021|
|777|0 |01/01/2021|
|777|0 |01/02/2021|
|777|1 |01/03/2021|
+---+----+----------+
After the answer by #Vaebhav, I realized the question was not set up correctly, so I am editing it with his code snippet.
I have the following table
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, TimestampType, DoubleType
input_str = """
4219,2018-01-01 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55,
1139,2018-01-21 11:05:00,1.0,400.0,
2170,2018-01-21 09:10:00,2.0,100.0,
4218,2018-02-06 09:36:00,5.0,307.55,
4218,2018-02-06 09:36:00,5.0,307.55
""".split(",")
input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id,timestamp,quantity,price".split(',')))
n = len(input_values)
n_cols = 4
input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]
sparkDF = sqlContext.createDataFrame(input_list,cols)
sparkDF = sparkDF.withColumn('customer_id', F.col('customer_id').cast(IntegerType()))\
                 .withColumn('timestamp', F.col('timestamp').cast(TimestampType()))\
                 .withColumn('quantity', F.col('quantity').cast(IntegerType()))\
                 .withColumn('price', F.col('price').cast(DoubleType()))
I want to calculate the aggregate as follows:
trxn_date   unique_cust_visits  next_7_day_visits  next_30_day_visits
2018-01-01  1                   7                  9
2018-01-02  2                   6                  8
2018-01-03  2                   4                  6
2018-01-06  2                   2                  4
2018-01-21  2                   2                  3
2018-02-06  1                   1                  1
where
trxn_date is the date from the timestamp column,
unique_cust_visits is the count of unique customers per date,
next_7_day_visits is the count of customer visits over the next 7 days (rolling window), and
next_30_day_visits is the count of customer visits over the next 30 days (rolling window).
I want to write the code as a single SQL query.
You can achieve this by using a ROWS rather than a RANGE frame type; a good explanation can be found here:
ROW - based on physical offsets from the position of the current input row
RANGE - based on logical offsets from the position of the current input row
Also, in your implementation a PARTITION BY clause would be redundant, as it won't create the required frames for a look-ahead.
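To illustrate the difference, here is a minimal hedged sketch (the toy dataframe and column names are illustrative, not from the original question): a ROWS frame spans a fixed number of physical rows, while a RANGE frame spans every row whose ordering value falls within the given interval.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# Toy data: one value per day, with a gap between day 3 and day 7.
toy = spark.createDataFrame([(1, 10), (2, 20), (3, 30), (7, 40)], ["day", "value"])

# ROWS frame: the current row plus the next 2 physical rows, however far apart the days are.
w_rows = Window.orderBy("day").rowsBetween(Window.currentRow, 2)

# RANGE frame: every row whose day is within 2 of the current day.
w_range = Window.orderBy("day").rangeBetween(0, 2)

toy.select(
    "day",
    F.sum("value").over(w_rows).alias("sum_next_2_rows"),
    F.sum("value").over(w_range).alias("sum_next_2_days"),
).show()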
Data Preparation
input_str = """
4219,2018-01-02 08:10:00,3.0,50.78,
4216,2018-01-02 08:01:00,5.0,100.84,
4217,2018-01-02 20:00:00,4.0,800.49,
4139,2018-01-03 11:05:00,1.0,400.0,
4170,2018-01-03 09:10:00,2.0,100.0,
4029,2018-01-06 09:06:00,6.0,300.55,
4029,2018-01-06 09:16:00,2.0,310.55,
4217,2018-01-06 09:36:00,5.0,307.55
""".split(",")
input_values = list(map(lambda x: x.strip() if x.strip() != '' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "customer_id,timestamp,quantity,price".split(',')))
n = len(input_values)
n_cols = 4
input_list = [tuple(input_values[i:i+n_cols]) for i in range(0,n,n_cols)]
sparkDF = sql.createDataFrame(input_list,cols)
sparkDF = sparkDF.withColumn('customer_id', F.col('customer_id').cast(IntegerType()))\
                 .withColumn('timestamp', F.col('timestamp').cast(TimestampType()))\
                 .withColumn('quantity', F.col('quantity').cast(IntegerType()))\
                 .withColumn('price', F.col('price').cast(DoubleType()))
sparkDF.show()
+-----------+-------------------+--------+------+
|customer_id| timestamp|quantity| price|
+-----------+-------------------+--------+------+
| 4219|2018-01-02 08:10:00| 3| 50.78|
| 4216|2018-01-02 08:01:00| 5|100.84|
| 4217|2018-01-02 20:00:00| 4|800.49|
| 4139|2018-01-03 11:05:00| 1| 400.0|
| 4170|2018-01-03 09:10:00| 2| 100.0|
| 4029|2018-01-06 09:06:00| 6|300.55|
| 4029|2018-01-06 09:16:00| 2|310.55|
| 4217|2018-01-06 09:36:00| 5|307.55|
+-----------+-------------------+--------+------+
Window Aggregates
sparkDF.createOrReplaceTempView("transactions")
sql.sql("""
SELECT
TO_DATE(timestamp) as trxn_date
,COUNT(DISTINCT customer_id) as unique_cust_visits
,SUM(COUNT(DISTINCT customer_id)) OVER (
ORDER BY 'timestamp'
ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING
) as next_7_day_visits
FROM transactions
GROUP BY 1
""").show()
+----------+------------------+-----------------+
| trxn_date|unique_cust_visits|next_7_day_visits|
+----------+------------------+-----------------+
|2018-01-02| 3| 7|
|2018-01-03| 2| 4|
|2018-01-06| 2| 2|
+----------+------------------+-----------------+
Building upon #Vaebhav's answer, the required query in this case is:
sqlContext.sql("""
SELECT
TO_DATE(timestamp) as trxn_date
,COUNT(DISTINCT customer_id) as unique_cust_visits
,SUM(COUNT(DISTINCT customer_id)) OVER (
ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) as next_7_day_visits
,SUM(COUNT(DISTINCT customer_id)) OVER (
ORDER BY CAST(TO_DATE(timestamp) AS TIMESTAMP) DESC
RANGE BETWEEN INTERVAL 30 DAYS PRECEDING AND CURRENT ROW
) as next_30_day_visits
FROM transactions
GROUP BY 1
ORDER by trxn_date
""").show()
trxn_date   unique_cust_visits  next_7_day_visits  next_30_day_visits
2018-01-01  1                   7                  9
2018-01-02  2                   6                  8
2018-01-03  2                   4                  6
2018-01-06  2                   2                  4
2018-01-21  2                   2                  3
2018-02-06  1                   1                  1
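For reference, a hedged DataFrame API sketch of the same look-ahead aggregation (the names daily_df, w7, w30 and result are illustrative, not from the original answers): it pre-aggregates the distinct customers per day and then sums those daily counts over forward-looking range windows expressed in seconds.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Distinct customers per transaction date.
daily_df = (sparkDF
            .withColumn("trxn_date", F.to_date("timestamp"))
            .groupBy("trxn_date")
            .agg(F.countDistinct("customer_id").alias("unique_cust_visits")))

# Order by the date as epoch seconds so the range bounds can be expressed in days * 86400.
day_ordering = F.col("trxn_date").cast("timestamp").cast("long")
w7 = Window.orderBy(day_ordering).rangeBetween(0, 7 * 86400)
w30 = Window.orderBy(day_ordering).rangeBetween(0, 30 * 86400)

result = (daily_df
          .withColumn("next_7_day_visits", F.sum("unique_cust_visits").over(w7))
          .withColumn("next_30_day_visits", F.sum("unique_cust_visits").over(w30))
          .orderBy("trxn_date"))
result.show()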
I have a DataFrame of connection logs with columns ID, targetIP, and Time. Every record in this DataFrame is a connection event to one system: ID identifies the connection, targetIP is the target IP address of that connection, and Time is the connection time. With values:
ID  Time  targetIP
1   1     192.163.0.1
2   2     192.163.0.2
3   3     192.163.0.1
4   5     192.163.0.1
5   6     192.163.0.2
6   7     192.163.0.2
7   8     192.163.0.2
I want to create a new column under the following condition: for each row, the count of connections to that row's target IP address within the past 2 time units. So the result DataFrame should be:
ID  Time  targetIP     count
1   1     192.163.0.1  0
2   2     192.163.0.2  0
3   3     192.163.0.1  1
4   5     192.163.0.1  1
5   6     192.163.0.2  0
6   7     192.163.0.2  1
7   8     192.163.0.2  2
For example, for ID=7 the targetIP is 192.163.0.2. In the past 2 time units (Time=6 and Time=7, i.e. ID=5 and ID=6) the targetIP was also 192.163.0.2, so the count for ID=7 is 2.
Looking forward to your help.
So, what you basically need is a window function.
Let's start with your initial data
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
case class Event(ID: Int, Time: Int, targetIP: String)
val events = Seq(
  Event(1, 1, "192.163.0.1"),
  Event(2, 2, "192.163.0.2"),
  Event(3, 3, "192.163.0.1"),
  Event(4, 5, "192.163.0.1"),
  Event(5, 6, "192.163.0.2"),
  Event(6, 7, "192.163.0.2"),
  Event(7, 8, "192.163.0.2")
).toDS()
Now we need to define the window specification itself:
val timeWindow = Window.orderBy($"Time").rowsBetween(-2, -1)
And now the most interesting part: how do we count something over the window? There is no simple way, so we'll do the following:
1. Aggregate all the targetIP values into a list.
2. Filter the list to keep only the IPs that match the current row's targetIP.
3. Take the size of the filtered list.
val df = events
  .withColumn("tmp", collect_list($"targetIP").over(timeWindow))
  .withColumn("count", size(expr("filter(tmp, x -> x == targetIP)")))
  .drop($"tmp")
And the result will contain a new column "count" which we need!
UPD:
There is a much shorter version without aggregation, written by #blackbishop:
val timeWindow = Window.partitionBy($"targetIP").orderBy($"Time").rangeBetween(-2, Window.currentRow)
val df = events
  .withColumn("count", count("*").over(timeWindow) - lit(1))
df.explain(true)
You can use count over a Window bounded by a range between -2 and the current row to get the count of connections to the same IP in the last 2 time units.
Using Spark SQL you can do something like this:
df.createOrReplaceTempView("connection_logs")
df1 = spark.sql("""
SELECT *,
COUNT(*) OVER(PARTITION BY targetIP ORDER BY Time
RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
) -1 AS count
FROM connection_logs
ORDER BY ID
""")
df1.show()
#+---+----+-----------+-----+
#| ID|Time| targetIP|count|
#+---+----+-----------+-----+
#| 1| 1|192.163.0.1| 0|
#| 2| 2|192.163.0.2| 0|
#| 3| 3|192.163.0.1| 1|
#| 4| 5|192.163.0.1| 1|
#| 5| 6|192.163.0.2| 0|
#| 6| 7|192.163.0.2| 1|
#| 7| 8|192.163.0.2| 2|
#+---+----+-----------+-----+
Or using DataFrame API:
from pyspark.sql import Window
from pyspark.sql import functions as F
time_unit = lambda x: x
w = Window.partitionBy("targetIP").orderBy(F.col("Time").cast("int")).rangeBetween(-time_unit(2), 0)
df1 = df.withColumn("count", F.count("*").over(w) - 1).orderBy("ID")
df1.show()
I have a list with users and the dates of their visits. For every visit, I want to know how many times the user visited over the previous 2 years.
# Create toy example
import pandas as pd
import numpy as np
date_range = pd.date_range(pd.to_datetime('2010-01-01'),
                           pd.to_datetime('2016-01-01'), freq='D')
date_range = np.random.choice(date_range, 8)
visits = {'user': list(np.repeat(1, 4)) + list(np.repeat(2, 4)),
          'time': list(date_range)}
df = pd.DataFrame(visits)
df = df.sort_values(by=['user', 'time'], axis=0)
df = spark.createDataFrame(df).repartition(1).cache()
df.show()
What I am looking for is something like this:
time user nr_visits_during_2_previous_years
0 2010-02-27 1 0
2 2012-02-21 1 1
3 2013-04-30 1 1
1 2013-06-20 1 2
6 2010-06-23 2 0
4 2011-10-19 2 1
5 2011-11-10 2 2
7 2014-02-06 2 0
Suppose you create a dataframe with these values and you need to check for visits after 2015-01-01.
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = spark.createDataFrame([("2014-02-01", "1"), ("2015-03-01", "2"), ("2017-12-01", "3"),
                            ("2014-05-01", "2"), ("2016-10-12", "1"), ("2016-08-21", "1"),
                            ("2017-07-01", "3"), ("2015-09-11", "1"), ("2016-08-24", "1"),
                            ("2016-04-05", "2"), ("2014-11-19", "3"), ("2016-03-11", "3")], ["date", "id"])
Now, you need to change your date column from StringType to DateType and then filter the rows where the user visited on or after 2015-01-01.
df2 = df.withColumn("date",f.to_date('date', 'yyyy-MM-dd'))
df3 = df2.where(df2.date >= f.lit('2015-01-01'))
For the last part, just group by the id column and use count to get the number of visits per user on or after 2015-01-01:
df3.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 3| 3|
| 1| 4|
| 2| 2|
+---+-----+
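Note that this counts visits after a fixed cutoff date. For the rolling "number of visits in the previous 2 years" asked for above, a hedged sketch reusing the rangeBetween pattern from the earlier answers might look like this (the 730-day approximation of 2 years and the variable names are assumptions, not from the original answer):
from pyspark.sql import Window
import pyspark.sql.functions as F

# For each visit, count the same user's earlier visits within the previous
# 2 years (approximated as 730 days), excluding the current visit itself.
days_since_epoch = F.datediff(F.col("time"), F.to_date(F.lit("1970-01-01")))

w = (Window.partitionBy("user")
     .orderBy(days_since_epoch)
     .rangeBetween(-730, -1))

result = df.withColumn("nr_visits_during_2_previous_years", F.count("*").over(w))
result.orderBy("user", "time").show()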