Spark DataFrame: counting consecutive rows where a column keeps the same value

I have a dataset of warehouse stock in which the current state is reported every two minutes.
I want to get the dates at which stock_amount drops to 0 and, for each of them, the number of consecutive rows (including the first) for which it stays at 0.
For instance, given the data below:
Row(date='24-06-2020 11:03:00', stock_amount = 1)
Row(date='24-06-2020 11:05:00', stock_amount = 0)
Row(date='24-06-2020 11:07:00', stock_amount = 2)
Row(date='24-06-2020 11:09:00', stock_amount = 3)
Row(date='24-06-2020 16:32:00', stock_amount = 0)
Row(date='24-06-2020 16:34:00', stock_amount = 0)
Row(date='24-06-2020 16:36:00', stock_amount = 0)
Row(date='24-06-2020 16:38:00', stock_amount = 0)
Row(date='24-06-2020 16:40:00', stock_amount = 2)
The result should be:
(date='24-06-2020 11:05:00', count=1)
(date='24-06-2020 16:32:00', count=4)
And for this data,
Row(date='26-07-2020 12:03:00', stock_amount = 3)
Row(date='26-07-2020 12:05:00', stock_amount = 0)
Row(date='26-07-2020 12:07:00', stock_amount = 4)
Row(date='26-07-2020 12:09:00', stock_amount = 4)
Row(date='26-07-2020 12:11:00', stock_amount = 0)
Row(date='26-07-2020 12:13:00', stock_amount = 2)
Row(date='26-07-2020 17:32:00', stock_amount = 0)
Row(date='26-07-2020 17:34:00', stock_amount = 0)
Row(date='26-07-2020 17:36:00', stock_amount = 0)
Row(date='26-07-2020 17:38:00', stock_amount = 0)
Row(date='26-07-2020 17:40:00', stock_amount = 1)
the result should be:
(date='26-07-2020 12:05:00', count=1)
(date='26-07-2020 12:11:00', count=1)
(date='26-07-2020 17:32:00', count=4)

This can be done with window functions and some grouping.
Note: the steps are split out for clarity, but they can be combined into a single expression.
#%%
import pyspark.sql.functions as F
from pyspark.sql import Window
data= sqlContext.createDataFrame([
('24-06-2020 11:03:00', 1),
('24-06-2020 11:05:00', 0),
('24-06-2020 11:07:00', 2),
('24-06-2020 11:09:00', 3),
('24-06-2020 16:32:00', 0),
('24-06-2020 16:34:00', 0),
('24-06-2020 16:36:00', 0),
('24-06-2020 16:38:00', 0),
('24-06-2020 16:40:00', 2)],schema=['date','stock'])
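# Flag non-zero rows. The running sum of this flag stays constant across a run of zeros,
# so every zero run shares its count_grp with the non-zero row just before it.
data_grp = data.withColumn("stk_grp", (F.col('stock') != 0).cast('int'))
# Caution: 'date' is a dd-MM-yyyy string, which does not sort chronologically across months or years;
# for real data, order by F.to_timestamp('date', 'dd-MM-yyyy HH:mm:ss') instead.
w = Window.orderBy('date')
data_sum = data_grp.withColumn("count_grp", F.sum('stk_grp').over(w))
# Running count of zero-stock rows inside each group.
data_stk = data_sum.withColumn("stk_zero", (F.col('stock') == 0).cast('int'))
w1 = Window.partitionBy('count_grp').orderBy('date')
data_res = data_stk.withColumn("fin_count", F.sum('stk_zero').over(w1))
#%%
# Drop the leading non-zero row of each group, then keep the first zero date and the run length.
data_filt = data_res.where("fin_count != 0")
data_res = data_filt.groupby('count_grp').agg(F.min('date').alias('date'), F.max('fin_count').alias('count'))
data_res.show()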
+---------+-------------------+-----+
|count_grp|               date|count|
+---------+-------------------+-----+
|        1|24-06-2020 11:05:00|    1|
|        3|24-06-2020 16:32:00|    4|
+---------+-------------------+-----+
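To check the expected result without a Spark session, the same idea (group consecutive rows by whether the stock is zero, then keep the first date and the length of each zero run) can be sketched in plain Python. This is only an illustration of the logic, not the Spark implementation: the function name zero_runs is made up for this example, and the dates are parsed so that sorting is chronological.

```python
from datetime import datetime
from itertools import groupby

DATE_FORMAT = "%d-%m-%Y %H:%M:%S"

def zero_runs(rows):
    """Return (first_date, run_length) for each run of consecutive zero-stock rows."""
    # Parse the dd-MM-yyyy strings so that sorting is chronological, not lexicographic.
    parsed = sorted((datetime.strptime(date, DATE_FORMAT), stock) for date, stock in rows)
    result = []
    # groupby splits the ordered rows into maximal runs with the same "is zero" flag.
    for is_zero, run in groupby(parsed, key=lambda row: row[1] == 0):
        if is_zero:
            run = list(run)
            result.append((run[0][0].strftime(DATE_FORMAT), len(run)))
    return result

rows = [
    ('24-06-2020 11:03:00', 1),
    ('24-06-2020 11:05:00', 0),
    ('24-06-2020 11:07:00', 2),
    ('24-06-2020 11:09:00', 3),
    ('24-06-2020 16:32:00', 0),
    ('24-06-2020 16:34:00', 0),
    ('24-06-2020 16:36:00', 0),
    ('24-06-2020 16:38:00', 0),
    ('24-06-2020 16:40:00', 2),
]
print(zero_runs(rows))
# [('24-06-2020 11:05:00', 1), ('24-06-2020 16:32:00', 4)]
```

This is handy as a reference result on a small sample before running the window-function version on the full dataset.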

How to find closed regions delimited by lines defined by 2 points each

Given a series of lines, each with an ID and the IDs of the two points that define it, I would like to find the regions delimited by those lines, such as Region 1 = Points (1, 2, 3, 4) in the following example:
Points 1, 2, 5, 6 obviously do not form a region.
I solved the problem by brute force: I first enumerate all possible combinations of 4 points and then check each combination against the lines.
Here is the working solution in VBA:
Option Explicit
''+------------------------------------------------------------------+
''| |
''| |
''+------------------------------------------------------------------+
Sub Determine_Regions(X() As Integer, L() As Integer)
Dim i As Integer, j As Integer, k As Integer
Dim nbRegions As Integer, count As Integer
Dim LP1 As Integer, LP2 As Integer
Dim P1 As Integer, P2 As Integer, P3 As Integer, P4 As Integer
Dim nL As Integer: nL = UBound(L) - LBound(L) + 1
Dim R() As Variant
' Cycle through all possible combinations
Dim nr As Integer: nr = UBound(X, 1) - LBound(X, 1) + 1
Dim nc As Integer: nc = UBound(X, 2) - LBound(X, 2) + 1
For i = 1 To nr
' the 4 points on that particular combination are:
P1 = X(i, 1): P2 = X(i, 2): P3 = X(i, 3): P4 = X(i, 4)
' do I have 4 distinct lines that use these 4 points?
count = 0
For k = 1 To nL
LP1 = L(k, 2): LP2 = L(k, 3)
If (LP1 = P1 And LP2 = P2) Or (LP1 = P2 And LP2 = P1) Or _
(LP1 = P1 And LP2 = P3) Or (LP1 = P3 And LP2 = P1) Or _
(LP1 = P1 And LP2 = P4) Or (LP1 = P4 And LP2 = P1) Or _
(LP1 = P2 And LP2 = P3) Or (LP1 = P3 And LP2 = P2) Or _
(LP1 = P2 And LP2 = P4) Or (LP1 = P4 And LP2 = P2) Or _
(LP1 = P3 And LP2 = P4) Or (LP1 = P4 And LP2 = P3) Then
count = count + 1
Debug.Print count
End If
Next k
If count = 4 Then
nbRegions = nbRegions + 1
' the Transpose operation wraps the Redim Preserve because VBA
' will not allow changing the first dimension on a 2D array
If nbRegions = 1 Then
ReDim Preserve R(1 To nbRegions, 1 To 4):
ElseIf nbRegions > 1 Then
R = Application.Transpose(R)
ReDim Preserve R(1 To 4, 1 To nbRegions):
R = Application.Transpose(R)
End If
R(nbRegions, 1) = P1: R(nbRegions, 2) = P2:
R(nbRegions, 3) = P3: R(nbRegions, 4) = P4:
End If
Next i
Debug.Print "nb of regions = " & nbRegions
End Sub
''+------------------------------------------------------------------+
''| |
''| |
''+------------------------------------------------------------------+
Sub Test_Determine_Regions()
' Input of all possible combinations of 4 points out of 6 points,
' without repetitions
' Note: In the final program, this would be done automatically
' by another function
Dim X() As Integer: ReDim X(1 To 15, 1 To 4)
X(1, 1) = 1: X(1, 2) = 2: X(1, 3) = 3: X(1, 4) = 4
X(2, 1) = 1: X(2, 2) = 2: X(2, 3) = 3: X(2, 4) = 5
X(3, 1) = 1: X(3, 2) = 2: X(3, 3) = 3: X(3, 4) = 6
X(4, 1) = 1: X(4, 2) = 2: X(4, 3) = 4: X(4, 4) = 5
X(5, 1) = 1: X(5, 2) = 2: X(5, 3) = 4: X(5, 4) = 6
X(6, 1) = 1: X(6, 2) = 2: X(6, 3) = 5: X(6, 4) = 6
X(7, 1) = 1: X(7, 2) = 3: X(7, 3) = 4: X(7, 4) = 5
X(8, 1) = 1: X(8, 2) = 3: X(8, 3) = 4: X(8, 4) = 6
X(9, 1) = 1: X(9, 2) = 3: X(9, 3) = 5: X(9, 4) = 6
X(10, 1) = 1: X(10, 2) = 4: X(10, 3) = 5: X(10, 4) = 6
X(11, 1) = 2: X(11, 2) = 3: X(11, 3) = 4: X(11, 4) = 5
X(12, 1) = 2: X(12, 2) = 3: X(12, 3) = 4: X(12, 4) = 6
X(13, 1) = 2: X(13, 2) = 3: X(13, 3) = 5: X(13, 4) = 6
X(14, 1) = 2: X(14, 2) = 4: X(14, 3) = 5: X(14, 4) = 6
X(15, 1) = 3: X(15, 2) = 4: X(15, 3) = 5: X(15, 4) = 6
' Input of the lines, each with the 2 connected points
Dim L() As Integer: ReDim L(1 To 7, 1 To 3)
' Line ID Point1 Point 2
L(1, 1) = 1: L(1, 2) = 1: L(1, 3) = 2:
L(2, 1) = 2: L(2, 2) = 2: L(2, 3) = 3:
L(3, 1) = 3: L(3, 2) = 3: L(3, 3) = 4:
L(4, 1) = 4: L(4, 2) = 4: L(4, 3) = 1:
L(5, 1) = 5: L(5, 2) = 3: L(5, 3) = 5:
L(6, 1) = 6: L(6, 2) = 5: L(6, 3) = 6:
L(7, 1) = 7: L(7, 2) = 4: L(7, 3) = 6:
Determine_Regions X, L
End Sub
That said, I am convinced there is a better way to handle this problem. Any ideas on how to improve my code, for instance so that it also handles triangles, and which algorithms are best suited here?
Your points are graph vertices and your lines are graph edges.
You have not defined the problem precisely, but it seems you want to select a cycle basis, perhaps the set of fundamental cycles.
Look here and choose an appropriate algorithm (the description involves a spanning tree algorithm).
If your graph is always planar and you know the coordinates of the vertices, you can enumerate the faces with a kind of traversal: choose the top vertex, walk to the "leftmost" next neighbor, and continue until you return to the first vertex. Then choose another vertex (not yet marked) and do the same.
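As a rough illustration of the spanning-tree idea, the sketch below builds a breadth-first spanning tree and closes one cycle for every edge that is not in the tree. It is written in Python rather than VBA, the function name fundamental_cycles is made up for this example, and it takes the lines as plain (point, point) pairs without the line ID. The cycles it returns form a basis that depends on the chosen tree, so they are not necessarily the individual regions.

```python
from collections import deque

def fundamental_cycles(edges):
    """Return one cycle (as a list of vertices) per non-tree edge of a BFS spanning forest."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    parent, depth, tree_edges = {}, {}, set()
    for root in sorted(adjacency):          # one BFS per connected component
        if root in parent:
            continue
        parent[root], depth[root] = None, 0
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in parent:
                    parent[neighbor] = node
                    depth[neighbor] = depth[node] + 1
                    tree_edges.add(frozenset((node, neighbor)))
                    queue.append(neighbor)

    cycles = []
    for u, v in edges:
        if frozenset((u, v)) in tree_edges:
            continue
        # Climb from both endpoints until the two paths meet at a common ancestor.
        left, right = [u], [v]
        a, b = u, v
        while a != b:
            if depth[a] >= depth[b]:
                a = parent[a]
                left.append(a)
            else:
                b = parent[b]
                right.append(b)
        # left ends at the common ancestor; append the other path reversed, without it.
        cycles.append(left + right[-2::-1])
    return cycles

# The seven lines from the question, as (point, point) pairs.
lines = [(1, 2), (2, 3), (3, 4), (4, 1), (3, 5), (5, 6), (4, 6)]
cycles = fundamental_cycles(lines)
print(cycles)  # [[3, 2, 1, 4], [5, 3, 2, 1, 4, 6]]
```

With the tree rooted at point 1, the first cycle is Region 1 and the second is the outer boundary through all six points. Taking the edges that belong to exactly one of the two cycles gives the remaining region 3-4-6-5, which is why the face traversal with coordinates is the more direct route when the regions themselves are needed.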

PostgreSQL: No function matches the given name and argument types. Weekly user login cohort analysis

function sum(boolean) does not exist
LINE 13: ISNULL(SUM(s.Offset = 0), 0) w1,
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
I'm trying to create a weekly cohort analysis that would show the weekly login stats.
As you can see this is what I want to achieve:
This is what I have found, and what I'm trying to re-create:
http://sqlfiddle.com/#!9/172dbe/1
These are the tables that I'm trying to take the data from:
And this is what I have refactored so far:
SELECT
STR_TO_DATE(CONCAT(tb.cohort, ' Monday'), '%X-%V %W') as date,
size,
w1,
w2,
w3,
w4,
w5,
w6,
w7
FROM (
SELECT u.cohort,
ISNULL(SUM(s.Offset = 0), 0) w1,
ISNULL(SUM(s.Offset = 1), 0) w2,
ISNULL(SUM(s.Offset = 2), 0) w3,
ISNULL(SUM(s.Offset = 3), 0) w4,
ISNULL(SUM(s.Offset = 4), 0) w5,
ISNULL(SUM(s.Offset = 5), 0) w6,
ISNULL(SUM(s.Offset = 6), 0) w7
FROM (
SELECT
id,
last_login AS cohort
FROM users_user
) as u
LEFT JOIN (
SELECT DISTINCT
login_log.user_id,
DATE_PART('day',(users_user.last_login - users_user.date_joined)/7) AS Offset
FROM users_userloginlog login_log
LEFT JOIN users_user ON (users_user.id = login_log.user_id)
) as s ON s.user_id = u.id
GROUP BY u.cohort
) as tb
LEFT JOIN (
SELECT DATE_FORMAT(AddedDate, "%Y-%u") dt, COUNT(*) size FROM users GROUP BY dt
) size ON tb.cohort = size.dt
As the error message tells you: you can't sum boolean values. And s.Offset = 0 returns true or false (or null). What would be the "sum" of true, false, true, true, false?
You can achieve what you want using filtered aggregation:
SELECT u.cohort,
count(*) filter (where s.Offset = 0) as w1,
count(*) filter (where s.Offset = 1) as w2,
count(*) filter (where s.Offset = 2) as w3,
count(*) filter (where s.Offset = 3) as w4,
count(*) filter (where s.Offset = 4) as w5,
count(*) filter (where s.Offset = 5) as w6,
count(*) filter (where s.Offset = 6) as w7,
....
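The same conditional count can also be written as SUM over a CASE expression, which sums integers instead of booleans. The sketch below only illustrates that pattern: it runs against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained, and the table logins and the column week_offset are hypothetical stand-ins for the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (cohort, week offset of a login).
conn.execute("CREATE TABLE logins (cohort TEXT, week_offset INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [("2020-01", 0), ("2020-01", 0), ("2020-01", 1), ("2020-02", 0), ("2020-02", 2)],
)

# Each CASE turns the condition into 1 or 0, so SUM counts the matching rows.
query = """
SELECT cohort,
       SUM(CASE WHEN week_offset = 0 THEN 1 ELSE 0 END) AS w1,
       SUM(CASE WHEN week_offset = 1 THEN 1 ELSE 0 END) AS w2,
       SUM(CASE WHEN week_offset = 2 THEN 1 ELSE 0 END) AS w3
FROM logins
GROUP BY cohort
ORDER BY cohort
"""
cohort_rows = conn.execute(query).fetchall()
print(cohort_rows)  # [('2020-01', 2, 1, 0), ('2020-02', 1, 0, 1)]
```

Because every row contributes either 1 or 0, the sum is never NULL for a non-empty group, so no extra null handling is needed around it.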

If with multiple and & or

I've been trying to run the below IF condition, however it does not work as intended:
If (LandscapingDataRange(MailCounter, 6) = 1 And (LandscapingDataRange(MailCounter, 4) = 0 Or LandscapingDataRange(MailCounter, 5) = 0)) _
Or (LandscapingDataRange(MailCounter, 6) = 1 And ((LandscapingDataRange(MailCounter, 4) - LandscapingDataRange(MailCounter, 7)) / LandscapingDataRange(MailCounter, 4)) > 0.1) _
Or (LandscapingDataRange(MailCounter, 6) = 0 And LandscapingDataRange(MailCounter, 5) > 0) _
Or (0 < LandscapingDataRange(MailCounter, 6) < 1 And (LandscapingDataRange(MailCounter, 4) = 0 Or LandscapingDataRange(MailCounter, 5) = 0)) _
Or (0 < LandscapingDataRange(MailCounter, 6) < 1 And ((LandscapingDataRange(MailCounter, 4) - LandscapingDataRange(MailCounter, 7)) / LandscapingDataRange(MailCounter, 4)) > 0.1) Then
Rows with 0 in LandscapingDataRange(MailCounter, 6) and 0 in LandscapingDataRange(MailCounter, 5) still pass the If condition.
I would take a guess that you are using Or where you should be using ElseIf and so your logic is wrong. In that case, use Select Case instead - it's quicker and cleaner. Here's an example with the first 3 conditions and an Else clause - just add the other conditions as required.
Select Case True
    Case (LandscapingDataRange(MailCounter, 6) = 1 And (LandscapingDataRange(MailCounter, 4) = 0 Or LandscapingDataRange(MailCounter, 5) = 0))
        '// Do Something
    Case (LandscapingDataRange(MailCounter, 6) = 1 And ((LandscapingDataRange(MailCounter, 4) - LandscapingDataRange(MailCounter, 7)) / LandscapingDataRange(MailCounter, 4)) > 0.1)
        '// Do Something
    Case (LandscapingDataRange(MailCounter, 6) = 0 And LandscapingDataRange(MailCounter, 5) > 0)
        '// Do Something
    Case Else
        '// Do Something if none of the criteria are met.
End Select
I don't quite understand what you want to achieve, but logical operators are resolved in a fixed order, just as multiplication and division come before addition and subtraction.
That may be the problem in your case: all And operators are resolved before any Or. So you have to use brackets if that is not the order you want.
The precedence order in Visual Basic is Not before And before Or before Xor, and the same rules should apply to VBA too.
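A short sketch of both effects, written in Python for illustration only. The first part shows that and binds tighter than or, as And does relative to Or in VBA. The second part mimics how VBA evaluates the chained test 0 < x < 1 from the question: the comparisons run left to right, so it becomes (0 < x) < 1, and since True is -1 and False is 0 in VBA, the result is always true. The helper names are made up for this example.

```python
def vba_chained_between(x):
    """Mimic VBA's evaluation of `0 < x < 1`, i.e. `(0 < x) < 1`."""
    first = -1 if 0 < x else 0   # VBA represents True as -1 and False as 0
    return first < 1             # both -1 < 1 and 0 < 1 are true

def between(x):
    """The intended test, written as two comparisons joined by a logical and."""
    return 0 < x and x < 1

# "and" is resolved before "or" unless brackets say otherwise.
a, b, c = True, False, False
without_brackets = a or b and c      # parsed as a or (b and c)
with_brackets = (a or b) and c

print(without_brackets, with_brackets)                 # True False
print([vba_chained_between(x) for x in (0, 0.5, 1)])   # [True, True, True]
print([between(x) for x in (0, 0.5, 1)])               # [False, True, False]
```

In the VBA condition this suggests writing the range test as two comparisons, for example LandscapingDataRange(MailCounter, 6) > 0 And LandscapingDataRange(MailCounter, 6) < 1, which would explain why rows with 0 in column 6 currently pass.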

How can I use a condition inside a For loop

I want to use a condition in my loop, but it does not work:
UR = 3
ReDim arng(UR, 3) As Variant
For X = 0 To UR
arng(X, 0) = ConvDate(Cells(X , 8))
arng(X, 1) = ConvDate(Cells(X , 12))
arng(X, 2) = Iif(Cells(X , 12) = "", MsgBox("empty"), MsgBox("Full"))
Next X
Even if Cells(X, 12) is actually empty, both messages are shown.
Why? Isn't it possible to use a condition here?
Thanks.
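I reworked it a little; try this. IIf is an ordinary function, so VBA evaluates both of its arguments (and therefore shows both message boxes) before choosing one, whereas an If...Else block runs only the matching branch.
Dim UR As Integer
Dim X As Integer
UR = 3
ReDim arng(UR, 3) As Variant
For X = 0 To UR
    ' Cells is 1-based, so array index X is mapped to row X + 1 here (an assumption; adjust to your sheet).
    arng(X, 0) = ConvDate(Cells(X + 1, 8))
    arng(X, 1) = ConvDate(Cells(X + 1, 12))
    If Cells(X + 1, 12) = "" Then
        MsgBox "empty"
    Else
        MsgBox "full"
    End If
Next X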
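The difference between a function-style conditional such as IIf and an If...Else block can be reproduced in a few lines. The sketch below is in Python and only mirrors the behavior: show stands in for MsgBox and iif for IIf, and both names are made up for this example.

```python
calls = []

def show(message):
    """Stand-in for MsgBox: record the message instead of opening a dialog."""
    calls.append(message)
    return message

def iif(condition, true_part, false_part):
    """Function-style conditional: both arguments are evaluated before the call."""
    return true_part if condition else false_part

cell = ""

# Both show(...) calls run while the arguments are being evaluated.
iif(cell == "", show("empty"), show("Full"))
eager_calls = list(calls)

calls.clear()
# An if/else statement runs only the branch that matches.
if cell == "":
    show("empty")
else:
    show("Full")
branch_calls = list(calls)

print(eager_calls)   # ['empty', 'Full']
print(branch_calls)  # ['empty']
```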

<> (not equal to) operator doesn't filter records as expected in SQL Server

I have a business requirement where I need to alter an SSRS report based on some additional filtering. I have a field named ProductShortName, and they don't want records where the product name is 'BLOC' or 'Small Business Visa', starts with 'WOW', or ends with 'Review'.
This is the original where condition:
WHERE ( A.AppDetailSavePointID = 0) AND (B.QueueID = 1)
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
OR (A.AppDetailSavePointID = 0)
AND ((B.QueueID = - 25) OR (B.QueueID = - 80))
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
OR (A.AppDetailSavePointID = 0)
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
AND (LEFT(C.QueueName, 2) = 'LC')
I added additional filtering to meet the criteria:
WHERE (A.AppDetailSavePointID = 0)
AND ((A.ProductShortName <> 'BLOC')
AND (A.ProductShortName <> 'Small Business Visa')
AND NOT (A.ProductShortName LIKE 'WOW%')
AND NOT (A.ProductShortName LIKE '%Review'))
AND (B.QueueID = 1)
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
OR (A.AppDetailSavePointID = 0)
AND ((B.QueueID = - 25) OR (B.QueueID = - 80))
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
AND ((A.ProductShortName <> 'BLOC')
AND (A.ProductShortName <> 'Small Business Visa')
AND NOT (A.ProductShortName LIKE 'WOW%')
AND NOT (A.ProductShortName LIKE '%Review'))
AND (A.AppDetailSavePointID = 0)
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
AND (LEFT(C.QueueName, 2) = 'LC')
AND ((A.ProductShortName <> 'BLOC')
AND (A.ProductShortName <> 'Small Business Visa')
AND NOT (A.ProductShortName LIKE 'WOW%')
AND NOT (A.ProductShortName LIKE '%Review'))
While this removes those products, it also removes a few others, and I don't understand why. Can anyone please suggest an appropriate WHERE condition?
You should avoid mixing AND and OR conditions without bracketing them properly.
If you mix ANDs and ORs, add brackets to remove the ambiguity. If you don't, the results can be unexpected.
For example, in your query, if AppDetailSavePointID = 0 then all other conditions become invalid/irrelevant. I'm sure this is not what you want.
WHERE (AppDetailSavePointID = 0) AND (QueueID = 1)
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
OR (AppDetailSavePointID = 0)
AND ((QueueID = - 25) OR (QueueID = - 80))
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
OR (AppDetailSavePointID = 0)
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
AND (LEFT(QueueName, 2) = 'LC')
EDIT
You should take either AND or OR as the top-level operator, not an unbracketed mixture of both. You can use additional brackets to nest the other.
e.g.
Assuming a,b,c,d,e,f... are conditions of type Field op value (e.g. AppDetailSavePointID = 0, DecisionStatusName <> N'Cancelled' etc.).
You should not do this:
-- don't do this.
WHERE a
AND b
OR c
AND d
OR e
AND f
OR g
You can do either of these two things:
-- this is ok.
WHERE a
AND b
AND c
AND (d OR e)
AND (f OR g)
Or,
-- this is ok.
WHERE a
OR b
OR c
OR (d AND e)
OR (f AND g)
It may be easier to read if you factor out the predicates that are common to every branch.
(X and Y) or (X and Z) == X and (Y or Z)
This yields:
WHERE (A.ProductShortName NOT LIKE 'WOW%')
AND (A.ProductShortName NOT LIKE '%Review')
AND (A.ProductShortName <> 'BLOC')
AND (A.ProductShortName <> 'Small Business Visa')
AND (A.DecisionStatusName <> N'Cancelled')
AND (A.DecisionStatusName <> N'Withdrawn')
AND (A.AppDetailSavePointID = 0)
AND ( B.QueueID = 1
OR B.QueueID = -25
OR B.QueueID = -80
OR LEFT(C.QueueName, 2) = 'LC'
)
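The identity used for this factoring can be checked exhaustively, since three boolean inputs give only eight cases. A minimal Python sketch, with X standing for the shared predicates and Y, Z for the queue alternatives:

```python
from itertools import product

def factoring_holds():
    """Check (X and Y) or (X and Z) == X and (Y or Z) for every truth assignment."""
    for x, y, z in product([False, True], repeat=3):
        expanded = (x and y) or (x and z)
        factored = x and (y or z)
        if expanded != factored:
            return False
    return True

print(factoring_holds())  # True
```

Note that this covers two-valued logic only. In SQL a comparison against NULL yields UNKNOWN, so rows with a NULL ProductShortName or DecisionStatusName are filtered out by the <> and NOT LIKE predicates in either form.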
If the original condition was working, I would just add to it. Wrap the original condition in parentheses so that the new filter applies to every OR branch and not only the last one. Try the following:
WHERE
(
(AppDetailSavePointID = 0) AND (QueueID = 1)
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
OR (AppDetailSavePointID = 0)
AND ((QueueID = - 25) OR (QueueID = - 80))
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
OR (AppDetailSavePointID = 0)
AND (DecisionStatusName <> N'Cancelled')
AND (DecisionStatusName <> N'Withdrawn')
AND (LEFT(QueueName, 2) = 'LC')
)
AND NOT (
ProductShortName IN ('BLOC', 'Small Business Visa')
OR ProductShortName LIKE 'WOW%'
OR ProductShortName LIKE '%Review'
)