I would like some guidance on the following problem.
I have the data below in a Spark DataFrame.
I would like to create a window of n days preceding and succeeding a reference record, and then divide each close value in that window by the reference record's close value.
However, I have not figured out how to do this kind of operation; everything I find covers only mean, count, or sum operations over a window.
The original data looks like this:
| symbol_id | date | close | is_reference |
|----------|------------|----------|--------------|
| XXXX | 2000-01-19 | 809.9644 | FALSE |
| XXXX | 2000-01-20 | 784.274 | FALSE |
| XXXX | 2000-01-21 | 774.2831 | FALSE |
| XXXX | 2000-01-24 | 760.0106 | FALSE |
| XXXX | 2000-01-25 | 750.7335 | FALSE |
| XXXX | 2000-01-26 | 750.7335 | TRUE |
| XXXX | 2000-01-27 | 742.17 | FALSE |
| XXXX | 2000-01-28 | 749.3063 | FALSE |
| XXXX | 2000-01-31 | 750.02 | FALSE |
| XXXX | 2000-02-01 | 762.8653 | FALSE |
| XXXX | 2000-02-02 | 749.3063 | FALSE |
Expected output looks like this:
| symbol_id | date | close | is_reference | reference_change |
|----------|------------|----------|--------------|-------------------|
| XXXX | 2000-01-19 | 809.9644 | FALSE | 1.07889737170381 |
| XXXX | 2000-01-20 | 784.274 | FALSE | 1.04467697258748 |
| XXXX | 2000-01-21 | 774.2831 | FALSE | 1.03136878799201 |
| XXXX | 2000-01-24 | 760.0106 | FALSE | 1.0123573811479 |
| XXXX | 2000-01-25 | 750.7335 | FALSE | 1 |
| XXXX | 2000-01-26 | 750.7335 | TRUE | 1 |
| XXXX | 2000-01-27 | 742.17 | FALSE | 0.988593155893536 |
| XXXX | 2000-01-28 | 749.3063 | FALSE | 0.99809892591712 |
| XXXX | 2000-01-31 | 750.02 | FALSE | 0.999049596161621 |
| XXXX | 2000-02-01 | 762.8653 | FALSE | 1.01615992892285 |
| XXXX | 2000-02-02 | 749.3063 | FALSE | 0.99809892591712 |
I'm currently partitioning by symbol_id using the following snippet:
val window = Window.partitionBy(SYMBOL_ID)
  .orderBy(col(DATE).desc)
  .rowsBetween(5,0) // rangeBetween looks better, but I'm just trying rowsBetween for now
And I'm trying to do something like this for the reference_change column:
df
  .withColumn("close_movement", $"close"/lit(col("close")
  .where(col("is_reference") === true)).over(window)) // This command is wrong, but it's the closest to what I have in mind.
So in the end I want to divide the close values in the window by the close WHERE is_reference = true, as in the reference_change column of the expected output.
Thank you for your help!
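I would just use a simple join:
import org.apache.spark.sql.functions.{abs, datediff}
// Alias both sides: ref is derived from df, so df.col(...) and ref.col(...) can resolve to the same column
val base = df.as("d")
val ref = df.filter($"is_reference").as("r")
// datediff counts calendar days, so weekends use up part of the 5-day window
base.join(ref, $"d.symbol_id" === $"r.symbol_id" &&
    abs(datediff($"d.date", $"r.date")) <= 5)
  .select($"d.symbol_id", $"d.date", $"d.close", $"d.is_reference",
    ($"d.close" / $"r.close").as("reference_change"))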
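To make the semantics of that join concrete, here is a minimal sketch in plain Python (not Spark) that applies the same condition to a few rows from the question's sample table. The function name reference_change and the n_days parameter are made up for illustration; the logic assumes the window is measured in calendar days, as the join condition does.

```python
from datetime import date

# A few rows copied from the sample table: (symbol_id, date, close, is_reference)
rows = [
    ("XXXX", date(2000, 1, 21), 774.2831, False),
    ("XXXX", date(2000, 1, 25), 750.7335, False),
    ("XXXX", date(2000, 1, 26), 750.7335, True),
    ("XXXX", date(2000, 1, 27), 742.17, False),
    ("XXXX", date(2000, 2, 2), 749.3063, False),
]


def reference_change(rows, n_days):
    """Pair each row with every reference row of the same symbol that lies
    within n_days calendar days, and divide its close by the reference close."""
    refs = [r for r in rows if r[3]]
    out = []
    for symbol, day, close, is_ref in rows:
        for ref_symbol, ref_day, ref_close, _ in refs:
            if symbol == ref_symbol and abs((day - ref_day).days) <= n_days:
                out.append((symbol, day, close, is_ref, close / ref_close))
    return out


result = reference_change(rows, n_days=5)
for row in result:
    print(row)
```

With n_days=5 the row for 2000-02-02 is dropped: it is 7 calendar days after the reference date, even though it is the fifth row after it in the sample. If the window is meant to be 5 rows rather than 5 days, the condition would have to be based on a row number instead of a date difference.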
Related
I am trying to come up with a SQL query that will NOT select records when the Error value is "True" and the same User_ID and Error_Dt appear on more than one such record. I thought perhaps of a main query using IN, where the argument would be a subquery that identifies the duplicates for User_ID and Error_Dt. For example:
Sample Data:
+----------+-------+---------+----------+
| Error_ID | Error | User_ID | Error_Dt |
+----------+-------+---------+----------+
| Err_A_01 | True | JP_123 | 20200307 |
| Err_A_02 | True | DF_455 | 20200605 |
| Err_A_03 | True | DF_455 | 20200605 |
| Err_A_04 | False | DF_455 | 20200703 |
| Err_B_01 | False | BH_135 | 20200219 |
| Err_B_02 | True | DP_246 | 20200310 |
| Err_B_03 | True | DP_246 | 20200310 |
| Err_B_04 | True | DP_246 | 20200509 |
| Err_B_05 | False | DP_246 | 20200601 |
| Err_B_06 | True | KS_159 | 20200120 |
| Err_B_07 | True | KS_159 | 20200120 |
| Err_B_08 | True | KS_159 | 20200310 |
| Err_C_01 | False | JH_123 | 20200702 |
+----------+-------+---------+----------+
Desired Results:
+----------+-------+---------+----------+
| Error_ID | Error | User_ID | Error_Dt |
+----------+-------+---------+----------+
| Err_A_01 | True | JP_123 | 20200307 |
| Err_A_04 | False | DF_455 | 20200703 |
| Err_B_01 | False | BH_135 | 20200219 |
| Err_B_04 | True | DP_246 | 20200509 |
| Err_B_05 | False | DP_246 | 20200601 |
| Err_B_08 | True | KS_159 | 20200310 |
| Err_C_01 | False | JH_123 | 20200702 |
+----------+-------+---------+----------+
Select only the rows whose Error + User_ID + Error_Dt combination is unique, or whose Error is not 'True':
select Error_ID, Error, User_ID, Error_Dt
from (
select *,
count(*) over(partition by Error, User_ID, Error_Dt) cnt
from tbl ) t
where Error <> 'True' OR cnt = 1
order by Error_ID;
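As a cross-check of that filter, the same logic can be sketched in plain Python against the question's sample data. A Counter keyed on (Error, User_ID, Error_Dt) plays the role of count(*) over(partition by ...). The function name is made up for illustration.

```python
from collections import Counter

# Sample data from the question: (Error_ID, Error, User_ID, Error_Dt)
tbl = [
    ("Err_A_01", "True", "JP_123", "20200307"),
    ("Err_A_02", "True", "DF_455", "20200605"),
    ("Err_A_03", "True", "DF_455", "20200605"),
    ("Err_A_04", "False", "DF_455", "20200703"),
    ("Err_B_01", "False", "BH_135", "20200219"),
    ("Err_B_02", "True", "DP_246", "20200310"),
    ("Err_B_03", "True", "DP_246", "20200310"),
    ("Err_B_04", "True", "DP_246", "20200509"),
    ("Err_B_05", "False", "DP_246", "20200601"),
    ("Err_B_06", "True", "KS_159", "20200120"),
    ("Err_B_07", "True", "KS_159", "20200120"),
    ("Err_B_08", "True", "KS_159", "20200310"),
    ("Err_C_01", "False", "JH_123", "20200702"),
]


def keep_unique_or_not_true(rows):
    # Size of each (Error, User_ID, Error_Dt) partition
    cnt = Counter((error, user_id, error_dt) for _, error, user_id, error_dt in rows)
    # Keep a row if it is not an error, or if its partition has exactly one row
    kept = [r for r in rows if r[1] != "True" or cnt[(r[1], r[2], r[3])] == 1]
    return sorted(kept, key=lambda r: r[0])


for row in keep_unique_or_not_true(tbl):
    print(row)
```

On the sample data this keeps the same seven Error_ID values as the desired results table.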
Right now I'm working on generating labels based on a quantity in Excel. I managed to copy and paste the label based on the value of a cell, but I don't know how to make a cell change with each iteration of the loop.
Below is an example:
Current result:
| A | B | C | D | E |
|------------------------------- |----- |-------------------- |----- |----- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | SFASDF234 | | |
| CUST ORDER NO | : | | | |
| ----------------------------- | --- | ------------------ | --- | --- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | | | |
| CUST ORDES NO | : | | | |
Expected result :
| A | B | C | D | E |
|------------------------------- |----- |-------------------- |----- |----- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | SFASDF234 | | |
| CUST ORDER NO | : | | | |
| ----------------------------- | --- | ------------------ | --- | --- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 2 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | | | |
| CUST ORDES NO | : | | | |
As you can see in the expected result, the C/NO value increases with each copy, up to the quantity, instead of simply being copied. Is there anything I can add?
Below is my current code:
Private Sub CommandButton1_Click()
Dim i As Long
For i = 2 To Worksheets("Sheet3").Range("E3").Value
Range("A1:A9", Range("E9")).Copy Sheet3.Range("A65536").End(xlUp)(2)
Next i
End Sub
Just set the value of the relevant cell to i:
Private Sub CommandButton1_Click()
    Dim i As Long
    Dim NewLoc As Range
    For i = 2 To Worksheets("Sheet3").Range("E3").Value
        'Decide where to copy the output to
        Set NewLoc = Sheet3.Cells(Sheet3.Rows.Count, "A").End(xlUp).Offset(1, 0)
        'Copy the range
        Range("A1:E9").Copy NewLoc
        'Change the value of the cell 2 rows down and 2 columns to the right
        NewLoc.Offset(2, 2).Value = i
    Next i
End Sub
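The idea behind the loop (copy the block, then overwrite the one cell that should differ) can be sketched outside Excel as well. The Python below is only an illustration: template is a hypothetical stand-in for the label block with just three of its rows, and build_labels is a made-up name.

```python
import copy

# Hypothetical stand-in for the label block in A1:E9; only three rows are modelled
template = [
    ["INVOICE NO", ":", "MM035639", "", ""],
    ["C/NO", ":", 1, "/", 2],
    ["QTY", ":", 100, "", ""],
]


def build_labels(template, total):
    """Return `total` copies of the label, each with its own C/NO counter."""
    labels = [copy.deepcopy(template)]  # the first label already exists
    for i in range(2, total + 1):
        block = copy.deepcopy(template)  # the "copy" step
        block[1][2] = i                  # the "set the cell to i" step
        labels.append(block)
    return labels


labels = build_labels(template, total=2)
for block in labels:
    print(block[1])
```

Each copy is independent, so changing the counter in one label leaves the template and the other labels untouched.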
I have made a crosstab query that should show the total working hours on each day for every vessel in our small harbor.
My query:
TRANSFORM Sum(Main.WorkingH) AS SumOfWorkingH
SELECT DateValue([DeptDate]) AS [Date]
FROM Vessels INNER JOIN Main ON Vessels.ID = Main.VesselID
GROUP BY DateValue([DeptDate])
ORDER BY DateValue([DeptDate])
PIVOT Vessels.Vessel;
The problem is that this query attributes each trip's total working hours to its departure date:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 26-May-17 |    |    | 32 | 29 |    |    |    |
| 27-May-17 | 3  | 13 |    |    |    |    |    |
| 28-May-17 |    |    |    |    |    |    | 73 |
| 29-May-17 |    |    |    | 12 | 6  | 27 |    |
| 01-Jun-17 |    |    | 10 |    | 7  | 41 |    |
| 02-Jun-17 |    | 2  | 15 | 5  |    |    |    |
| 03-Jun-17 |    | 4  |    |    |    |    |    |
+-----------+----+----+----+----+----+----+----+
The desired Result
When a vessel leaves on 6/1 at 9pm and arrives back on 6/3 at 10am, this should appear as follows:
6/1-->3Hours
6/2-->24Hours
6/3-->10Hours
**NOT** 6/1-->37Hours as in the previous table.
This is how it should look:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 26-May-17 |    |    | 5  | 7  |    |    |    |
| 27-May-17 | 3  | 13 | 24 | 21 |    |    |    |
| 28-May-17 |    |    | 2  |    |    |    | 9  |
| 29-May-17 |    |    |    | 12 | 6  | 8  | 24 |
| 30-May-17 |    |    |    |    |    | 18 | 24 |
| 31-May-17 |    |    |    |    |    |    | 15 |
| 01-Jun-17 |    |    | 10 |    | 7  | 0  |    |
| 02-Jun-17 |    | 2  | 15 | 5  | 24 |    |    |
| 03-Jun-17 |    | 4  |    |    |    | 16 |    |
+-----------+----+----+----+----+----+----+----+
These values are not accurate (I wrote them by hand), but I think you get the idea.
The Suggested Solution
While trying to fix this problem I wrote the following function, which takes the start and end time of a trip:
Public Function HoursByDate1(stTime, EndTime)
For dayloop = Int(EndTime) To Int(stTime) Step -1
If dayloop = Int(stTime) Then
WorkingHours = Hour(dayloop + 1 - stTime)
ElseIf dayloop = Int(EndTime) Then
WorkingHours = Hour(EndTime - dayloop)
Else
WorkingHours = 24
End If
HoursByDate1 = WorkingHours
Debug.Print "StartDate: " & stTime & ", EndDate:" & EndTime & ", The day:" & dayloop & " --> " & WorkingHours & " hours."
Next dayloop
End Function
It prints one line per day with the hours worked, which is exactly what I want.
But when I call this function from my query, it returns only the last value for each trip, as follows:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 5/26/2017 |    |    | 5  | 7  |    |    |    |
| 5/27/2017 | 15 | 19 |    |    |    |    |    |
| 5/28/2017 |    |    |    |    |    |    | 9  |
| 5/29/2017 |    |    |    | 8  | 7  | 8  |    |
| 6/1/2017  |    |    | 3  |    | 6  | 0  |    |
| 6/2/2017  |    | 8  | 8  | 19 |    |    |    |
| 6/3/2017  |    | 9  |    |    |    |    |    |
+-----------+----+----+----+----+----+----+----+
I am open to any solution, whether on the VBA side or the SQL query side.
Sorry for the very long question, but I wanted to show my effort on the subject, because every time I am told that I have not given enough information.
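A note on why the query sees only one number per trip: HoursByDate1 returns a single value, the one assigned on the final loop iteration, so a query that calls it once per trip row can never receive the intermediate days. The per-day split itself is simple to express once each trip is expanded into one entry per calendar day. The sketch below shows that expansion in Python rather than VBA or Access SQL; the function name hours_by_date is made up, and how the expanded rows would be fed back into the crosstab is left open.

```python
from datetime import datetime, timedelta


def hours_by_date(start, end):
    """Split the interval from start to end into hours per calendar day."""
    hours = {}
    cursor = start
    while cursor < end:
        # Midnight that closes the day the cursor is currently in
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), datetime.min.time()
        )
        segment_end = min(next_midnight, end)
        hours[cursor.date()] = (segment_end - cursor).total_seconds() / 3600
        cursor = segment_end
    return hours


# The example from the question: leave 6/1 at 9pm, return 6/3 at 10am
trip = hours_by_date(datetime(2017, 6, 1, 21, 0), datetime(2017, 6, 3, 10, 0))
for day, worked in trip.items():
    print(day, worked)
```

For the trip in the question this yields 3 hours on 6/1, 24 hours on 6/2 and 10 hours on 6/3. Summing those per-day entries by date and vessel, instead of summing whole trips by departure date, gives the shape of the desired table.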
I have a table test
+----+---------+-------+
| ID | Name1   | Name2 |
+----+---------+-------+
| 1  | Andy    | NULL  |
| 2  | Kevin   | NULL  |
| 3  | Phil    | NULL  |
| 4  | Maria   | NULL  |
| 5  | Jackson | NULL  |
+----+---------+-------+
I am expecting output like
+----+-------+---------+
| ID | Name1 | Name2   |
+----+-------+---------+
| 1  | NULL  | Andy    |
| 2  | NULL  | Kevin   |
| 3  | NULL  | Phil    |
| 4  | NULL  | Maria   |
| 5  | NULL  | Jackson |
+----+-------+---------+
Unfortunately I inserted the data into the wrong column, and now I want to shift it to the next column.
You can use an UPDATE statement with no WHERE condition, to cover the entire table.
UPDATE test
SET Name2 = Name1,
Name1 = NULL
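The question does not say which database is in use, so as an assumption the sketch below runs the statement against an in-memory SQLite database through Python's sqlite3 module, using the sample rows from the question. In SQLite the right-hand sides are evaluated against the row as it was before the update, so Name2 receives the old Name1 value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (ID INTEGER PRIMARY KEY, Name1 TEXT, Name2 TEXT)")
conn.executemany(
    "INSERT INTO test (ID, Name1, Name2) VALUES (?, ?, NULL)",
    [(1, "Andy"), (2, "Kevin"), (3, "Phil"), (4, "Maria"), (5, "Jackson")],
)

# No WHERE clause: every row is updated
conn.execute("UPDATE test SET Name2 = Name1, Name1 = NULL")

rows = conn.execute("SELECT ID, Name1, Name2 FROM test ORDER BY ID").fetchall()
for row in rows:
    print(row)
```

Keeping Name2 = Name1 as the first assignment, as the answer does, is also the safe order on engines such as MySQL that apply single-table assignments from left to right.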
I'm working on a report. How do I get the date columns that sit outside the matrix column group to appear next to the value column inside the group?
For example it is setup like this:
| HiredDt | TermDt | [Type] | LicDt | MedDt |
---------------------------------------------------------------------------------
ID | [HiredDt] | [TermDt] | SUM([Count_of_Type]) | [LicDt] | [MedDt] |
---------------------------------------------------------------------------------
And looks like this:
| HiredDt | TermDt | Lic | Med | App | LicDt | MedDt |
----------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 1 | 12 | 6/1/15 | 9/1/14 |
2 | 2/19/12 | 9/18/14 | 1 | 1 | 12 | 3/2/15 | 9/1/14 |
But when I use inside grouping to match up the date next to the associated document type I get:
| HiredDt | TermDt | Lic | | | Med | | | App | | |
----------------------------------------------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 6/1/15 | | 1 | | 9/1/2014 | 12 | | |
2 | 2/19/12 | 9/18/14 | 1 | 3/2/15 | | 1 | | 9/1/2014 | 12 | | |
What I'm trying to get is this:
| HiredDt | TermDt | Lic | LicDt | Med | MedDt | App |
--------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 6/1/15 | 1 | 9/1/14 | 12 |
2 | 2/19/12 | 9/18/14 | 1 | 3/2/15 | 1 | 9/1/14 | 12 |
Is this possible?
I would right-click on the cell you have labelled SUM([Count_of_Type]) and choose Insert Column - Inside Group - Right.
In that new cell I would set the expression to: = Max ( [LicDt] )