SQL new variable using multiple conditions (count of occurrences in 6 month look-back period using timestamp for each unique ID) - sql

I am trying to achieve the following:
Attached is what my data looks like.
I want to create 2 new variables which counts the number of times 'Target' (variable 1) and 'Competitor' appears (variable 2), within the last 6 months of a given date_of_prescription. This would be done for every unique D_PRESCRIBER_ID.
So for example:
For ID: 1003000902 prescribing on 2020-03-18 date, the COMPETITOR drug. When you look at the rows before that, you can see that within 6 months prior to the 2020-03-18 date, there are 2 Target drugs prescribed and 0 competitor drugs prescribed. So my variable values will be: 2 (variable 1) and 0 (variable 2)
My data is much larger than what the screenshot looks like. It has more variables and 1000's of unique D_PRESCRIBER_IDs. Each row is not a unique ID, there are duplicates in the data for various date_of_prescription timestamps. These variables need to be created in my select statements in order to keep the rest of the data the same.
Any help here would be awesome. Thanks!

Related

How to combine a row of cells in VBA if certain column values are the same

I have a database where all of the input from the user (through a userform) gets stored. In the database, each column is a different category for the type of data (ex. date, shift, quantity, etc) and the data from the userform input gets put into its corresponding category. For some of the data, all the data is the same except for the quantity. I was wondering how I could combine these rows into one and add the quantities to each other for the whole database (ex. combining the first and third data entries). I have tried playing around with a couple different loops but can't seem to figure anything out.
Period Date Line Shift Type Quantity
4 x 2 4/3/18 A 3 14 18
4 x 2 4/3/18 A 3 13 12
4 x 2 4/3/18 A 3 14 15
Thank you!
If you're looking to modify the underlying database, you might be able to query the data into the format you want by including all the other columns in a GROUP BY statement, save the result to another table, then replace the original table with the properly formatted one.
If you have the data in Excel and you just want to view it with the duplicate rows summed, a Pivot Table would be a good choice. You can select all the other columns as rows for the Pivot Table and sum of Quantity as the values.

Counting latest instance of multiple only based on filter context

I've got a large table of events that have occurred in an inventory of vehicles, which affect whether they are in service or out of service. I would like to create a measure that would be able to count the number of vehicles in the various inventories at any point in time, based on the events in this table.
This table is pulled from a SQL database into an Excel 2016 sheet, and I'm using PowerPivot to try to come up with the DAX measure.
Here is some example data event_list:
vehicle_id event_date event event_sequence inventory
100 2018-01-01 purchase 1 in-service
101 2018-01-01 purchase 1 in-service
102 2018-02-04 purchase 1 in-service
100 2018-02-07 maintenance 2 out-of-service
101 2018-02-14 damage 2 out-of-service
101 2018-02-18 repaired 3 in-service
100 2018-03-15 repaired 3 in-service
102 2018-05-01 damage 2 out-of-service
103 2018-06-03 purchase 1 in-service
I'd like to be able to create a pivot table in Excel (or use CUBE functions, etc) to get an output table like this:
date in-service out-of-service
2018-02-04 3 0
2018-02-14 1 2
2018-03-15 3 0
2018-06-03 3 1
Essentially, I want to be able to calculate the inventory based on any date in time. The example only has a few dates, but hopefully provides enough of a picture.
I've basically come up with this so far, but it counts more vehicles than desired - I can't figure out how to only take the latest event_sequence or event_date and use that to count the inventory.
cumulative_vehicles_at_date:=CALCULATE(
COUNTA([vehicle_id]),
IF(IF(HASONEVALUE (event_list[event_date]), VALUES (event_list[event_date]))>=event_list[event_date],event_list[event_date])
)
I tried using MAX() and EARLIER() functions, but they don't seem to work.
Edit: Added the PowerBI tag as I'm now using that software to attempt to solve this as well. See comments on Alexis Olson's answer.
I think I've found a much cleaner method than I gave previously.
Let's add two columns onto the event_list table. One which counts vehicles "in-service" on that date and one which counts vehicles "out-of-service" on that date.
InService =
VAR Summary = SUMMARIZE(
FILTER(event_list,
event_list[event_date] <= EARLIER(event_list[event_date])),
event_list[vehicle_id],
"MaxSeq", MAX(event_list[event_sequence]))
VAR Filtered = FILTER(event_list,
event_list[event_sequence] =
MAXX(
FILTER(Summary,
event_list[vehicle_id] = EARLIER(event_list[vehicle_id])),
[MaxSeq]))
RETURN SUMX(Filtered, 1 * (event_list[inventory] = "in-service"))
You can create an analogous calculated column for OutOfService or you can just take the total minus the InService count.
OutOfService =
CALCULATE(
DISTINCTCOUNT(event_list[vehicle_id]),
FILTER(event_list,
event_list[event_date] <= EARLIER(event_list[event_date])))
- event_list[InService]
Now all you have to do is put event_date on the matrix visual rows section and add the InService and OutOfService columns to the values section (use Maximum or Minimum for the aggregation option rather than Sum).
Here's the logic behind the calculated column InService:
We first create a Summary table which calculates the maximal event_sequence value for each vehicle. (We filter the event_date to only consider dates up to the current one we are working with.)
Now that we know what the last event_sequence value is for each vehicle, we use that to filter the entire table down to just the rows that correspond to those vehicles and sequence values. The filter goes through the table row by row and checks to see if the sequence value matches the one we calculated in the Summary table. Note that when we filter the Summary table to just the vehicle we are currently working with, we only get a single row. I'm just using MAXX to extract the [MaxSeq] value. (It's kind of like using LOOKUPVALUE, but you can't use that on a variable.)
Now that we've filtered the table just to the most recent events for each vehicle, all we need to do is count how many of them are "in-service". I used a SUMX here where the 1*(True/False) coerces the boolean value to return 1 or 0.
This is pretty difficult. I don't have a great answer, but here's something that kind of works.
You'll create a new calculated table where you'll calculate the status for each vehicle on each date. Start with the base cross join for each vehicle and each date:
= CROSSJOIN(VALUES(event_list[vehicle_id]), VALUES(event_list[event_date]))
Then add a calculated column to find the max sequence number for each vehicle on that date.
Sequence = MAXX(
FILTER(event_list,
event_list[event_date] <= Cross[event_date] &&
event_list[vehicle_id] = Cross[vehicle_id]),
event_list[event_sequence])
Now you can lookup the inventory value for each vehicle/sequence pair with another calculated column:
Inventory = LOOKUPVALUE(
event_list[inventory],
event_list[vehicle_id], Cross[vehicle_id],
event_list[event_sequence], Cross[Sequence])
The result should look something like this:
Once you have this, you can create a matrix using this calculated table. Put the event_date on the rows and Inventory on the columns. Filter out blank inventory values in the visual level filter and put the vehicle_id in the values field, using a count or distinct count as the aggregation method (instead of the default sum).
It should look like this:

Row blocks in Hive (how to group rows by certain criteria and count these groups)

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example.
I want to get the number of consecutive day groups (i.e. blocks of activity). For example, this value for 'User a' would be 2 because they were active on 20140401 and 20140402 which is the first group of consecutive days. After 20140402, they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406, their second block of activity started and continued without any break up until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive day periods: 1)20140404 2) 20140408, 20140409 3) 20140414, 20140415
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to a RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers
When you use the distribute by clause, ie: .......distribute by user_id sort by user_id,date_key desc...... all the records for a particular user would go to a particular reducer, where the records are then sorted by date_key descending. Here why don't we write a UDF to iterate through the records, and when ever there is break in the continuity it would increment the counter for continuity by 1 and return the result along with the user_id.

SSRS - Is there a way to have a table split by a page break to appear on the same page?

I am creating a report using MS Report Builder 3.0. For this report, I have a stored procedure built that filters down to the specific rows needed, and then I use a row group to group on a particular field (pass_no). The table that is displayed is 2 columns and 3 rows within the row group. The basic description of what I want to accomplish is instead of the rows running onto the next page, I want the rows to continue on the same page in a new set of 2 columns. Think of it like a newspaper where the text continues in a new column rather than running down onto the next page.
For the example I'm going to use here, there are 12 rows of data returned by the SP, and 8 unique values in the pass_no column which is what my row group is grouped on. So in the report I end up with 8 groups of 3 rows. I'm aiming to have the table display 6 pass_no values (so 6 groups of 3 rows) before, for lack of better terminology, starting a new table.
My first approach at this has been to create a column group and set the grouping expression to the following:
=Floor((RowNumber(Nothing) - 1) / 6)
While this works in creating a new set of 2 columns, the split for the new columns is based on the row number from the raw data returned by the SP rather than the number of rows sets created by the row group. So because there are 12 rows returned, and the 6th and 7th rows have the same pass_no value, the second set of columns duplicates that 1 set of data. Also, the top 6 rows of the second column set are blank with the second set of values appearing below the first set.
If I add an additional column group where it is also grouping by pass_no, then I don't get the duplicate values, but I do get a pair of columns for each pass_no as well (as would be expected). I've tried modifying the expression above a bit and changed Nothing to the row group name and have tried the table name, but neither of them have yielded the desired result.
I can't alter the SP to do the grouping there because there are other column values that are not identical and I pull that data into a cell value expression within the table using Join(LookupSet()).
I have also considered creating 2 tables and applying a filter to the table so the first table only displays the first 6 results and the second table displays the remaining results, but that also looks at the raw data rather than the groupings and TOP N can't be used on pass_no as it's a text value, not an integer. This would also cause problems if I need to go to 3 tables.
So long story short, is there a way to do a table break rather than a page break or to overflow columns onto the same page rather than onto a new page?
Here's the pertinent portions of the Dataset:
http://sqlfiddle.com/#!2/5082b/1
PASS_NO MASTERTRAN TRANS_NO DESCRIPTION IS_MOD
7913019000 4931019000 4931019000 General Admission Adult 0
7914019000 4932019000 4932019000 Sea Turtle Hosp Adult 0
7914019000 4932019000 4933019000 2:00 PM SEA TURTLE HOSP 1
7916019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7916019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7917019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7917019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7918019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7918019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7922019000 4936019000 4936019000 General Admission Child 0
7923019000 4936019000 4936019000 General Admission Child 0
7924019000 4936019000 4936019000 General Admission Child 0
I think your data presents a bit of a problem.
As you've already figured out, typically for this sort of setup you'd set up a row group with an expression like:
=(RowNumber(Nothing) - 1) Mod 6
And a column group expression like:
=Ceiling(RowNumber(Nothing) / 6)
This would create a six row tablix that would grow horizontally as required.
See this SO question for a similar example.
However, you currently have the requirement of also grouping by another column - pass_no in your case. Normally you can approximate a group-level row number with an expression like:
=RunningValue(Fields!pass_no.Value, CountDistinct, "DataSet1")
Unfortunately, when you try to add this into one of the grouping expressions like:
=Ceiling(RunningValue(Fields!pass_no.Value, CountDistinct, "DataSet1") / 6)
You get the following error:
A group expression for the tablix 'Tablix1' includes the aggregate
function RunningValue. RunningValue cannot be used in group
expressions.
Based on all this, my recommendation is to try and get a Dataset that has one row per pass_no value and base the tablix on this, with the above row/column grouping expressions, i.e. no need to group on multiple pass_no rows. So in your example it would have eight rows. You could then have a separate Dataset with all the individual rows and use a lookupset function to concatenate the description, etc.
Your other option is to try and get everything on one Dataset only, including the aggregates as required. This might not be possible, but for description at least you can leverage any of the various techniques here to get a delimited list. Once you have this list you can replace the delimiter with vbCrLf to split it back over multiple rows.
All this is a very long-winded way of saying that I don't know if your requirement is possible with your data, but if you look at having at least one Dataset with one row per pass_no you should be able to make it work.

How to add two columns with the same names from different Tables

How to add two cloumns with the same name from different tables. It needs to be dymanic, because i have columns based on months. Below is the sample data. which i would use to calculate the moving 12 month average for attrition.
Transferout Oct'11 Nov'11
3310ED
3310FL
3310HD 1
3310PZ
3310RC
3310SH
3310SM 1
Terinations Oct'11 Nov'11
3310ED
3310FL
3310HD 1
3310PZ 1
3310RC
So according to the column name and dept id in the row filed the above two tables needs to be added and later divided by the head count of the respective dept ids using the same column headings in a diff table
Looks like you can use the LOOKUPVALUE function or the RELATED function (if the tables are related).