Complex row manipulation based on column value in SQL or Power Query

I have a call dataset that looks like this:
If a call about a certain member comes in within 30 days of an "original call", that call is considered a callback. I need some logic or Power Query magic to handle this dataset using that rule, so the end result should look like this:
Right now I have the table left joined to itself, which gives me every possible combination. I thought I could do something with that, but it has proven difficult, and with over 2 million unique case keys the duplicates kill run time and overload memory. Any suggestions? I'd prefer to do the manipulation in the Power Query editor but can do it in SQL. Please and thank you.

I think you can do this in Power Query, but I have no idea how it will run with two million records.
It may be possible to speed it up with judicious use of the Table.Buffer function (a possible place to apply it is sketched at the end of this answer), but give it a try as written first.
The code should be reasonably self-documenting:
Group by Member ID.
For each Member ID, create a table from a list of records generated using the stated logic.
Expand the tables.
Mark the rows to be deleted by shifting the Datediff column up by one and applying the appropriate logic to the Datediff and shifted columns.
The code assumes that the dates for each Member ID are in ascending order. If not, an extra sorting step would need to be added; a possible version is sketched below.
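If that sort is needed, a step along these lines (untested against your data; it simply reuses the column names from the code below) could go right after the #"Changed Type" step, with the #"Grouped Rows" step then referencing #"Sorted Rows" instead of #"Changed Type":

    //Optional: put the calls in date order within each Member ID before grouping
    #"Sorted Rows" = Table.Sort(#"Changed Type", {{"Member ID", Order.Ascending}, {"Call Date", Order.Ascending}}),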
Try this M code. (Change the Source line to be congruent with your own data source).
Edit: the code has been revised to allow for multiple callbacks from an initial call.
let
    //Change next line to be congruent with your actual data source
    Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{
        {"Case key", type text}, {"Member ID", Int64.Type}, {"Call Date", type date}}),

    //Group by Member ID
    // then create tables with call back date using the stated logic
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Member ID"}, {
        {"Call Backs", (t)=>Table.FromRecords(
            List.Generate(
                ()=>[ck = t[Case key]{0}, cd = t[Call Date]{0}, cb = null, df = null, idx = 0],
                each [idx] < Table.RowCount(t),
                each [ck = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
                        then [ck] else t[Case key]{[idx]+1},
                      cd = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
                        then [cd] else t[Call Date]{[idx]+1},
                      cb = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
                        then t[Call Date]{[idx]+1} else null,
                      df = if Duration.Days(t[Call Date]{[idx]+1} - [cd]) < 30
                        then Duration.Days(t[Call Date]{[idx]+1} - [cd]) else null,
                      idx = [idx]+1],
                each Record.FromList({[ck],[cd],[cb],[df]},{"Case key","Call Date","Call Back Date", "Datediff"}))
            )}
        }),

    #"Expanded Call Backs" = Table.ExpandTableColumn(#"Grouped Rows", "Call Backs",
        {"Case key", "Call Date", "Call Back Date", "Datediff"},
        {"Case key", "Call Date", "Call Back Date", "Datediff"}),

    //Shift the Datediff column up by one row so each row can "see"
    // whether the row after it is one of its callbacks
    #"Shifted Datediff" = Table.FromColumns(
        Table.ToColumns(#"Expanded Call Backs") & {
            List.RemoveFirstN(#"Expanded Call Backs"[Datediff], 1) & {null}},
        type table[Member ID=Int64.Type, Case key=text, Call Date=date, Call Back Date=date, Datediff=Int64.Type, shifted=Int64.Type ]),

    //Keep the callback rows and the original calls that have no callback
    #"Filter" = Table.SelectRows(#"Shifted Datediff", each [shifted]=null or [Datediff]<>null),
    #"Removed Columns" = Table.RemoveColumns(#"Filter",{"shifted"})
in
    #"Removed Columns"
Example with multiple callbacks
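On the performance point above: if two million rows turn out to be too slow, one experiment (only a sketch, I have not timed it at that scale) is to load the typed table into memory once with Table.Buffer and point the grouping step at the buffered table:

let
    //Same source and typing as above
    Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{
        {"Case key", type text}, {"Member ID", Int64.Type}, {"Call Date", type date}}),
    //Hold the typed rows in memory so the Table.Group / List.Generate logic
    //is not re-evaluated against the original source repeatedly
    Buffered = Table.Buffer(#"Changed Type")
in
    Buffered

The #"Grouped Rows" step would then take Buffered as its first argument instead of #"Changed Type".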

I think you can do this with the LEAD function.
Here is the fiddle: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=f7cabdbe4d1193e5f0da6bd6a4571b96
select
    a.*,
    LEAD(CallDate, 1) OVER (
        PARTITION BY memberId
        ORDER BY CallDate
    ) AS "CallbackDate",
    LEAD(CallDate, 1) OVER (
        PARTITION BY memberId
        ORDER BY CallDate
    ) - a.calldate AS DateDiff
from
    mytable a

Related

Calculate "Working hours " based on IN and OUT time Employee Attendance Power bi

I am using Power BI and I would like to calculate the working hours of my employees based on the time they entered and left the company.
Here is a sample of the data.
The expected total is 7 hours and 8 minutes.
That is, I need to exclude the time when the employee was out and not working; in my case, that is the period from OUT 2:04:20 to IN 3:09:46, about 1 hour.
You can calculate the next out for each in and take the difference:
select t.employee_id,
       sum(datediff(second, local_time, next_local_time)) as diff_seconds
from (select t.*,
             lead(local_time) over (partition by employee_id order by local_time) as next_local_time
      from t
     ) t
where action = 'IN'
group by t.employee_id;
Note: This assumes that INs and OUTs are interleaved, so there are no rows with the same action in a row.
This also gives the result in seconds -- which can be converted to decimal hours or any other particular format you want.
In Power BI/Power Query, assuming your table has the same structure as in your post, you can create a new table with the following script. In this script, yourSource is a reference to your table.
let
    #"Sorted Rows" = Table.Sort(yourSource,{{"EmployeeID", Order.Ascending}, {"LOCAL_TIME", Order.Ascending}}),
    #"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 1, 1, Int64.Type),
    #"Added Custom1" = Table.AddColumn(#"Added Index", "Employee_In_Out_Index", each if [Action] = "IN" then [Index] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom1",{"Employee_In_Out_Index"}),
    #"Removed Columns" = Table.RemoveColumns(#"Filled Down",{"Index"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Employee_In_Out_Index", "EmployeeID", "Action", "LOCAL_TIME"}),
    #"Pivoted Column" = Table.Pivot(#"Reordered Columns", List.Distinct(#"Reordered Columns"[Action]), "Action", "LOCAL_TIME"),
    #"Added Custom" = Table.AddColumn(#"Pivoted Column", "Time Difference", each [OUT] - [IN])
in
    #"Added Custom"
In the new table, each record represents one IN/OUT pair. There is a Time Difference column here which you can use in your calculations.
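If you would rather see that difference as decimal hours (to compare against the expected 7 hours and 8 minutes), one more step could be appended; the step name below is my own, and it assumes LOCAL_TIME is typed as datetime so that [OUT] - [IN] is a duration:

    #"Added Hours" = Table.AddColumn(#"Added Custom", "Hours", each Duration.TotalHours([Time Difference]), type number)

The query would then end with in #"Added Hours" instead of in #"Added Custom".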

Get the item with the highest count

Can you please help me to get the item with the highest count using DAX?
Measure = FIRSTNONBLANK('Table1'[ItemName],CALCULATE(COUNT('Table2'[Instance])))
This shows the first ItemName in the table but doesn't get the ItemName with the highest count.
Thanks
Well, it's more complicated than I would have wanted, but here's what I came up with.
There are things you are hoping to do that are not so straightforward in DAX. First, you want an aggregated aggregation ;) -- in this case, the max of a count. Second, you want to use a value from one column that you identify by what's in another column. That's row-based thinking, and DAX prefers column-based thinking.
So, to do the aggregate of aggregates, we just have to slog through it. SUMMARIZE gives us counts of items. MAX and ranking functions could help us find the biggest count, but wouldn't be so useful for getting the ItemName. TOPN gives us the whole row where our count is the biggest.
But now we need to get our ItemName out of that row, so SELECTCOLUMNS lets us pick the field to work with. Finally, we really want a value, not a 1-column, 1-row table, so FIRSTNONBLANK finishes the job.
Hope it helps.
Here's my DAX
MostFrequentItem =
VAR SummaryTable = SUMMARIZE ( 'Table', 'Table'[ItemName], "CountsByItem", COUNT ( 'Table'[ItemName] ) )
VAR TopSummaryItemRow = TOPN(1, SummaryTable, [CountsByItem], DESC)
VAR TopItem = SELECTCOLUMNS (TopSummaryItemRow, "TopItemName", [ItemName])
RETURN FIRSTNONBLANK (TopItem, [TopItemName])
Here's the DAX without using variables (not tested, sorry. Should be close):
MostFrequentItem_2 =
FIRSTNONBLANK (
    SELECTCOLUMNS (
        TOPN (
            1,
            SUMMARIZE ( 'Table', 'Table'[ItemName], "Count", COUNT ( 'Table'[ItemName] ) ),
            [Count], DESC
        ),
        "ItemName", [ItemName]
    ),
    [ItemName]
)
Here's the mock data:
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WcipNSspJTS/NVYrVIZ/nnFmUnJOKznRJzSlJxMlyzi9PSs3JAbODElMyizNQmLEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Stuff = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Stuff", type text}}),
    #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Stuff", "ItemName"}})
in
    #"Renamed Columns"

How to sum consecutive rows in Power Query

In Power Query I have a column "% sum of all". I need to create a custom column "Sum Consecutive" where each row's value is the "% sum of all" of the current row plus the "Sum Consecutive" value of the previous row.
Current row situation
New Custom Column Expectation
The two images show the current situation and the result I need in Power Query.
Can you please help me find code to create this new column?
Although there are similar solved questions for DAX, I still need to keep editing the file afterwards, so this should be done in the M language in Power Query.
Thank you!
Not sure how performant my approaches are. I would think both should be reasonably efficient as they only loop over each row in the table once (and "remember" the work done in the previous rows). However, maybe the conversion to records/list and then back to table is slow for large tables (I don't know).
Approach 1: Isolate the input column as a list, transform the list by cumulatively adding, put the transformed list back in the table as a new column.
let
    someTable = Table.FromColumns({List.Repeat({0.0093}, 7) & List.Repeat({0.0086}, 7) & {0.0068, 0.0068}}, {"% of sum of all"}),
    listToLoopOver = someTable[#"% of sum of all"],
    cumulativeSum = List.Accumulate(List.Positions(listToLoopOver), {}, (listState, currentIndex) =>
        let
            numberToAdd = listToLoopOver{currentIndex},
            sum = try listState{currentIndex - 1} + numberToAdd otherwise numberToAdd,
            append = listState & {sum}
        in
            append
    ),
    backToTable = Table.FromColumns(Table.ToColumns(someTable) & {cumulativeSum}, Table.ColumnNames(someTable) & {"Cumulative sum"})
in
    backToTable
Approach 2: Convert the table to a list of records, loop over each record and add a new field (representing the new column) to each record, then convert the transformed list of records back into a table.
let
    someTable = Table.FromColumns({List.Repeat({0.0093}, 7) & List.Repeat({0.0086}, 7) & {0.0068, 0.0068}}, {"% of sum of all"}),
    listToLoopOver = Table.ToRecords(someTable),
    cumulativeSum = List.Accumulate(List.Positions(listToLoopOver), {}, (listState, currentIndex) =>
        let
            numberToAdd = Record.Field(listToLoopOver{currentIndex}, "% of sum of all"),
            sum = try listState{currentIndex - 1}[Cumulative sum] + numberToAdd otherwise numberToAdd, // 'try' should only be necessary for the first item
            recordToAdd = listToLoopOver{currentIndex} & [Cumulative sum = sum],
            append = listState & {recordToAdd}
        in
            append
    ),
    backToTable = Table.FromRecords(cumulativeSum)
in
    backToTable
I couldn't find a function in the reference for M/Power Query that sums a list cumulatively.
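For what it's worth, a more compact (but quadratic, since it re-sums the list at every position) way to express a cumulative sum over a list is sketched below with a small made-up list; for a large table, the List.Accumulate approaches above are probably the better starting point:

let
    someList = {0.0093, 0.0086, 0.0068},
    cumulativeSum = List.Transform(
        List.Positions(someList),
        //Sum of all items up to and including the current position
        each List.Sum(List.FirstN(someList, _ + 1)))
in
    cumulativeSum

For this example it yields the running totals 0.0093, 0.0179, 0.0247 (up to floating-point rounding).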

Google Data Studio incorrect calculated metrics

I am creating calculated metrics in Data Studio and I am having trouble with the results.
Metric 1 uses this formula:
COUNT_DISTINCT(CASE WHEN ( Event Category = "ABC" AND Event Action = "XXX" AND Event Label = "123" ) THEN ga clientId (user) ELSE " " END )
[[To count the events with distinct clientIds]]
Metric 2 uses this formula:
COUNT_DISTINCT(CASE WHEN ( Event Category = "ABC" AND Event Action = "YYY" AND Event Label = "456" ) THEN ga clientId (user) ELSE " " END )
[[To count the events with distinct clientIds]]
Metric 3 uses this formula:
COUNT_DISTINCT(CASE WHEN ( Event Category = "ABC" AND Event Action = "ZZZ" AND Event Label = "789" ) THEN userId(user) ELSE " " END )
[[To count the events with distinct userIds]]
The formulas work fine, and when I do Metric 2 / Metric 1 the number is correct for a one-day time span. When I do Metric 3 / Metric 2 the number is wrong. Why is this? It doesn't make sense to me, since they are both numerical values.
Also, when I increase the date range, Metric 2 / Metric 1 is incorrect too! Any ideas why these are not working?
If you are aggregating over a certain amount of data, then these calculations will not be exact; they will be approximations.
I have noticed that Google Data Studio is more accurate with data properly loaded into BigQuery rather than data loaded through something else like a PostgreSQL connector. Otherwise, APPROX_COUNT_DISTINCT may be used.

BigQuery Deadline Exceeded and RuntimeException

When I run a query for which I am confident the result set isn't that large, I keep getting this error. Can someone explain what causes this error and how I can change my query to avoid it (other than selecting less data, because that is not something I am able to change for this query)?
Error preparing subsidiary query: com.google.storage.megastore.exception.DeadlineExceededRuntimeException: Deadline exceeded: Deadline
My query basically selects data for a month and then applies a CASE clause and a GROUP BY. There are no joins.
Here is a cleaned-up version of my query. Most of the columns are just strings.
select
    counted,
    CONCAT(user_id,"_",string(index)) as user_id,
    name,
    -- We want to give each event an alias here, so the first event in the funnel would be called step1
    case when name="16" and param7 = "b" then 'step1'
         when name="71" then 'step2'
         when name="73" then 'step3'
         when name="10" and param7= "b" and param1="a" then 'step4'
         when name="18" then 'step5'
         when name="31" then 'step6'
         else 'na'
    end as step
from (TABLE_DATE_RANGE([tablename_],TIMESTAMP('2016-04-01'),TIMESTAMP('2016-05-01')))
-- selects all of the 6 steps in the funnel.
WHERE (name = "16" AND param7 = "b") OR (name = "71") OR (name = "73") OR (name="10" AND param7 = "b" AND param1 = "a") OR (name = "18") OR (name = "31")
Based on your comment, you should try:
SELECT * FROM (
    SELECT
        counted,
        CONCAT(user_id,"_",STRING(index)) AS user_id,
        name,
        -- We want to give each event an alias here, so the first event in the funnel would be called step1
        CASE
            WHEN name="16" AND param7 = "b" THEN 'step1'
            WHEN name="71" THEN 'step2'
            WHEN name="73" THEN 'step3'
            WHEN name="10" AND param7= "b" AND param1="a" THEN 'step4'
            WHEN name="18" THEN 'step5'
            WHEN name="31" THEN 'step6'
            ELSE 'na'
        END AS step
    FROM (TABLE_DATE_RANGE([tablename_],TIMESTAMP('2016-04-01'),TIMESTAMP('2016-05-01')))
)
-- leave only those 6 steps in the funnel.
WHERE step != 'na'
Another possibility raised by a coworker was that I was using the TABLE_DATE_RANGE function, which messes up something about the query and causes it to fail. He suggested typing out each individual table, in this case 30 tables, and that removed the error as well; I also noticed the queries were faster. I'm not sure why, and it is a pain to have a giant query, because you need one line for each date you want to query in the FROM clause rather than the single line:
FROM (TABLE_DATE_RANGE([tablename_],TIMESTAMP('2016-04-01'),TIMESTAMP('2016-05-01')))