VBA: Count unique values that meet two criteria - vba

I have three columns: ID, events, and month. I need to get the count of events by month unique by ID.
So far I have the count of events by month (e.g. 1806 logins in May) using CountIfs(Range("B2:B76602"), EventName, Range("C2:C76602"), m).
How do I filter the above count so that only the unique IDs within it are counted? Note that I have to loop this through a bunch of event types and months.
To make this clearer, let me provide some sample data:
ID  Event  Month
1   Login  May 16
2   click  July 16
1   Save   June 16
1   Login  May 16
3   Save   June 16
From this I need to get the following info:
1 unique login in May 16
2 unique saves in June 16
1 unique click in July 16

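You can use Excel's built-in Remove Duplicates feature to drop repeated ID/event/month rows before counting. Compare all three columns, and start the range at the header row (assumed here to be row 1), because Header:=xlYes treats the first row of the range as a header. Note that this deletes rows in place, so run it on a copy of the data.
ActiveSheet.Range("A1:C76602").RemoveDuplicates Columns:=Array(1, 2, 3), Header:=xlYes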

Based on the link in Ralph's comment, you get:
To know how many unique items you have you can use this regular formula:
=SUM(IF(FREQUENCY(COUNTIF(Colors,"<"&Colors),COUNTIF(Colors,"<"&Colors)),1))
I then extended this to multiple columns: just change the COUNTIF formulas to COUNTIFS (with different ranges, obviously).
=SUM(IF(FREQUENCY(COUNTIFS($A$1:$A$10,"<"&$A$1:$A$10,$B$1:$B$10,"<"&$B$1:$B$10),COUNTIFS($A$1:$A$10,"<"&$A$1:$A$10,$B$1:$B$10,"<"&$B$1:$B$10)),1))
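For comparison, here is a minimal Python sketch of the logic the question is after: collect the distinct IDs for each (event, month) pair and count them. It uses the sample rows from the question; the function name is made up for illustration, and this is not a translation of the formulas above. In VBA the same idea is often done with a dictionary keyed on the event and month.

```python
from collections import defaultdict

# Sample rows from the question: (ID, event, month).
rows = [
    (1, "Login", "May 16"),
    (2, "click", "July 16"),
    (1, "Save", "June 16"),
    (1, "Login", "May 16"),
    (3, "Save", "June 16"),
]


def unique_ids_per_event_month(rows):
    """Map each (event, month) pair to the number of distinct IDs seen for it."""
    ids_seen = defaultdict(set)
    for id_, event, month in rows:
        # A set ignores repeated IDs, so duplicates are counted once.
        ids_seen[(event, month)].add(id_)
    return {key: len(ids) for key, ids in ids_seen.items()}


counts = unique_ids_per_event_month(rows)
for (event, month), n in counts.items():
    print(f"{n} unique {event} in {month}")
```

With the sample data this gives 1 for Login in May 16, 2 for Save in June 16, and 1 for click in July 16, matching the expected output in the question.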

Related

SQL new variable using multiple conditions (count of occurrences in 6 month look-back period using timestamp for each unique ID)

I am trying to achieve the following:
Attached is what my data looks like.
I want to create 2 new variables that count the number of times 'Target' (variable 1) and 'Competitor' (variable 2) appear within the 6 months before a given date_of_prescription. This should be done for every unique D_PRESCRIBER_ID.
So for example:
ID 1003000902 prescribed the COMPETITOR drug on 2020-03-18. Looking at the rows before that, within the 6 months prior to 2020-03-18 there are 2 Target drugs prescribed and 0 competitor drugs prescribed. So my variable values will be: 2 (variable 1) and 0 (variable 2).
My data is much larger than what the screenshot shows. It has more variables and thousands of unique D_PRESCRIBER_IDs. Each row is not a unique ID; the same ID appears on multiple rows with different date_of_prescription timestamps. These variables need to be created in my select statement so that the rest of the data stays the same.
Any help here would be awesome. Thanks!
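To pin down the intended logic before writing the SQL, here is a rough Python sketch of the look-back count. Everything beyond D_PRESCRIBER_ID and date_of_prescription is an assumption: the sample rows are invented to mirror the example (the real data is only in the screenshot), the drug labels are assumed to be "Target" and "Competitor", and the window is assumed to include the start date and exclude the current row.

```python
import calendar
from datetime import date


def months_before(d, months):
    """Return the date `months` calendar months before `d`, clamping the day."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def lookback_counts(rows, months=6):
    """For each row, count earlier Target/Competitor rows of the same
    prescriber dated within `months` months before that row's date."""
    result = []
    for prescriber, when, drug in rows:
        start = months_before(when, months)
        target = competitor = 0
        for other_id, other_when, other_drug in rows:
            # Assumed window: start <= date < current date.
            if other_id == prescriber and start <= other_when < when:
                if other_drug == "Target":
                    target += 1
                elif other_drug == "Competitor":
                    competitor += 1
        result.append((prescriber, when, drug, target, competitor))
    return result


# Hypothetical rows: (D_PRESCRIBER_ID, date_of_prescription, drug).
rows = [
    (1003000902, date(2019, 8, 1), "Target"),    # older than 6 months
    (1003000902, date(2019, 11, 5), "Target"),
    (1003000902, date(2020, 1, 20), "Target"),
    (1003000902, date(2020, 3, 18), "Competitor"),
]

result = lookback_counts(rows)
print(result[-1])
```

For the 2020-03-18 row this yields 2 Target and 0 Competitor prescriptions. The nested loop is quadratic and only meant to make the rule explicit; it is not a substitute for the SQL itself.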

How to combine a row of cells in VBA if certain column values are the same

I have a database where all of the input from the user (through a userform) gets stored. In the database, each column is a different category for the type of data (ex. date, shift, quantity, etc) and the data from the userform input gets put into its corresponding category. For some of the data, all the data is the same except for the quantity. I was wondering how I could combine these rows into one and add the quantities to each other for the whole database (ex. combining the first and third data entries). I have tried playing around with a couple different loops but can't seem to figure anything out.
Period  Date    Line  Shift  Type  Quantity
4 x 2   4/3/18  A     3      14    18
4 x 2   4/3/18  A     3      13    12
4 x 2   4/3/18  A     3      14    15
Thank you!
If you're looking to modify the underlying database, you might be able to query the data into the format you want by including all the other columns in a GROUP BY statement, save the result to another table, then replace the original table with the properly formatted one.
If you have the data in Excel and you just want to view it with the duplicate rows summed, a Pivot Table would be a good choice. You can select all the other columns as rows for the Pivot Table and sum of Quantity as the values.
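Both suggestions boil down to grouping on every column except Quantity and summing Quantity within each group. A minimal Python sketch of that grouping, using the three sample rows (the column split, with "4 x 2" as the Period, is my reading of the sample, and the function name is made up):

```python
# Sample rows: (Period, Date, Line, Shift, Type, Quantity).
rows = [
    ("4 x 2", "4/3/18", "A", 3, 14, 18),
    ("4 x 2", "4/3/18", "A", 3, 13, 12),
    ("4 x 2", "4/3/18", "A", 3, 14, 15),
]


def combine_rows(rows):
    """Merge rows that match on every column except the last (Quantity),
    summing their quantities. First-seen order is preserved."""
    totals = {}
    for *key, quantity in rows:
        key = tuple(key)
        totals[key] = totals.get(key, 0) + quantity
    return [key + (quantity,) for key, quantity in totals.items()]


combined = combine_rows(rows)
for row in combined:
    print(row)
```

The first and third entries collapse into one row with Quantity 33, and the second row is left unchanged.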

Counting latest instance of multiple only based on filter context

I've got a large table of events that have occurred in an inventory of vehicles, which affect whether they are in service or out of service. I would like to create a measure that would be able to count the number of vehicles in the various inventories at any point in time, based on the events in this table.
This table is pulled from a SQL database into an Excel 2016 sheet, and I'm using PowerPivot to try to come up with the DAX measure.
Here is some example data event_list:
vehicle_id event_date event event_sequence inventory
100 2018-01-01 purchase 1 in-service
101 2018-01-01 purchase 1 in-service
102 2018-02-04 purchase 1 in-service
100 2018-02-07 maintenance 2 out-of-service
101 2018-02-14 damage 2 out-of-service
101 2018-02-18 repaired 3 in-service
100 2018-03-15 repaired 3 in-service
102 2018-05-01 damage 2 out-of-service
103 2018-06-03 purchase 1 in-service
I'd like to be able to create a pivot table in Excel (or use CUBE functions, etc) to get an output table like this:
date in-service out-of-service
2018-02-04 3 0
2018-02-14 1 2
2018-03-15 3 0
2018-06-03 3 1
Essentially, I want to be able to calculate the inventory based on any date in time. The example only has a few dates, but hopefully provides enough of a picture.
I've basically come up with this so far, but it counts more vehicles than desired - I can't figure out how to only take the latest event_sequence or event_date and use that to count the inventory.
cumulative_vehicles_at_date:=CALCULATE(
    COUNTA([vehicle_id]),
    IF(IF(HASONEVALUE (event_list[event_date]), VALUES (event_list[event_date]))>=event_list[event_date],event_list[event_date])
)
I tried using MAX() and EARLIER() functions, but they don't seem to work.
Edit: Added the PowerBI tag as I'm now using that software to attempt to solve this as well. See comments on Alexis Olson's answer.
I think I've found a much cleaner method than I gave previously.
Let's add two columns onto the event_list table. One which counts vehicles "in-service" on that date and one which counts vehicles "out-of-service" on that date.
InService =
VAR Summary =
    SUMMARIZE(
        FILTER(event_list,
            event_list[event_date] <= EARLIER(event_list[event_date])),
        event_list[vehicle_id],
        "MaxSeq", MAX(event_list[event_sequence]))
VAR Filtered =
    FILTER(event_list,
        event_list[event_sequence] =
            MAXX(
                FILTER(Summary,
                    event_list[vehicle_id] = EARLIER(event_list[vehicle_id])),
                [MaxSeq]))
RETURN SUMX(Filtered, 1 * (event_list[inventory] = "in-service"))
You can create an analogous calculated column for OutOfService or you can just take the total minus the InService count.
OutOfService =
CALCULATE(
    DISTINCTCOUNT(event_list[vehicle_id]),
    FILTER(event_list,
        event_list[event_date] <= EARLIER(event_list[event_date])))
    - event_list[InService]
Now all you have to do is put event_date on the matrix visual rows section and add the InService and OutOfService columns to the values section (use Maximum or Minimum for the aggregation option rather than Sum).
Here's the logic behind the calculated column InService:
We first create a Summary table which calculates the maximal event_sequence value for each vehicle. (We filter the event_date to only consider dates up to the current one we are working with.)
Now that we know what the last event_sequence value is for each vehicle, we use that to filter the entire table down to just the rows that correspond to those vehicles and sequence values. The filter goes through the table row by row and checks to see if the sequence value matches the one we calculated in the Summary table. Note that when we filter the Summary table to just the vehicle we are currently working with, we only get a single row. I'm just using MAXX to extract the [MaxSeq] value. (It's kind of like using LOOKUPVALUE, but you can't use that on a variable.)
Now that we've filtered the table just to the most recent events for each vehicle, all we need to do is count how many of them are "in-service". I used a SUMX here where the 1*(True/False) coerces the boolean value to return 1 or 0.
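If it helps to see the same three steps outside DAX, here is a small Python sketch of the logic (not of the DAX itself): keep events up to the date, take the highest event_sequence per vehicle, then count by inventory. It uses the sample event_list from the question; the function name is made up.

```python
# Sample event_list from the question:
# (vehicle_id, event_date, event, event_sequence, inventory)
events = [
    (100, "2018-01-01", "purchase", 1, "in-service"),
    (101, "2018-01-01", "purchase", 1, "in-service"),
    (102, "2018-02-04", "purchase", 1, "in-service"),
    (100, "2018-02-07", "maintenance", 2, "out-of-service"),
    (101, "2018-02-14", "damage", 2, "out-of-service"),
    (101, "2018-02-18", "repaired", 3, "in-service"),
    (100, "2018-03-15", "repaired", 3, "in-service"),
    (102, "2018-05-01", "damage", 2, "out-of-service"),
    (103, "2018-06-03", "purchase", 1, "in-service"),
]


def inventory_at(events, as_of):
    """Count vehicles per inventory status as of `as_of` (ISO date string)."""
    latest = {}  # vehicle_id -> (event_sequence, inventory)
    for vehicle_id, event_date, _event, seq, inventory in events:
        # ISO-formatted dates compare correctly as plain strings.
        if event_date <= as_of:
            if vehicle_id not in latest or seq > latest[vehicle_id][0]:
                latest[vehicle_id] = (seq, inventory)
    counts = {"in-service": 0, "out-of-service": 0}
    for _seq, inventory in latest.values():
        counts[inventory] += 1
    return counts


for day in ("2018-02-04", "2018-02-14", "2018-03-15", "2018-06-03"):
    print(day, inventory_at(events, day))
```

For the four dates in the question's desired output this reproduces 3/0, 1/2, 3/0 and 3/1 (in-service/out-of-service).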
This is pretty difficult. I don't have a great answer, but here's something that kind of works.
You'll create a new calculated table where you'll calculate the status for each vehicle on each date. Start with the base cross join for each vehicle and each date (the table is named Cross here, which is the name the columns below refer to):
Cross = CROSSJOIN(VALUES(event_list[vehicle_id]), VALUES(event_list[event_date]))
Then add a calculated column to find the max sequence number for each vehicle on that date.
Sequence = MAXX(
    FILTER(event_list,
        event_list[event_date] <= Cross[event_date] &&
        event_list[vehicle_id] = Cross[vehicle_id]),
    event_list[event_sequence])
Now you can lookup the inventory value for each vehicle/sequence pair with another calculated column:
Inventory = LOOKUPVALUE(
    event_list[inventory],
    event_list[vehicle_id], Cross[vehicle_id],
    event_list[event_sequence], Cross[Sequence])
The result should look something like this:
Once you have this, you can create a matrix using this calculated table. Put the event_date on the rows and Inventory on the columns. Filter out blank inventory values in the visual level filter and put the vehicle_id in the values field, using a count or distinct count as the aggregation method (instead of the default sum).
It should look like this:

Row blocks in Hive (how to group rows by certain criteria and count these groups)

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example.
I want to get the number of consecutive-day groups (i.e. blocks of activity). For example, this value for 'User a' would be 2. They were active on 20140401 and 20140402, which is the first group of consecutive days. After 20140402 they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406 their second block of activity started and continued without any break until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive-day periods: 1) 20140404; 2) 20140408, 20140409; 3) 20140414, 20140415.
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to a RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers
When you use the distribute by clause, i.e. ... distribute by user_id sort by user_id, date_key desc ..., all the records for a particular user go to the same reducer, where they are then sorted by date_key descending. From there, you could write a UDF that iterates through the records, increments a continuity counter by 1 whenever there is a break in the continuity, and returns the result along with the user_id.
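The per-user iteration such a UDF would perform can be sketched in Python as follows. This is only an illustration of the counting logic, not Hive code: it sorts ascending rather than descending, the function names are made up, and it starts a new block on the first day and after every gap of more than one day.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (Date_key, UserID).
rows = [
    ("20140401", "a"), ("20140402", "a"), ("20140406", "a"),
    ("20140407", "a"), ("20140408", "a"), ("20140409", "a"),
    ("20140404", "b"), ("20140408", "b"), ("20140409", "b"),
    ("20140414", "b"), ("20140415", "b"),
]


def parse_date_key(date_key):
    """Turn a yyyymmdd string into a date."""
    return date(int(date_key[:4]), int(date_key[4:6]), int(date_key[6:]))


def count_activity_blocks(rows):
    """Return {user_id: number of blocks of consecutive active days}."""
    days_by_user = defaultdict(set)
    for date_key, user_id in rows:
        days_by_user[user_id].add(parse_date_key(date_key))

    blocks = {}
    for user_id, days in days_by_user.items():
        count = 0
        previous = None
        for day in sorted(days):
            # A new block starts at the first day or after a gap.
            if previous is None or day - previous > timedelta(days=1):
                count += 1
            previous = day
        blocks[user_id] = count
    return blocks


print(count_activity_blocks(rows))
```

On the sample data this gives 2 blocks for user a and 3 for user b, as described in the question.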

Find value -> sum for 90 cells -> drop -> repeat

I have been working on this spreadsheet for a while. It's basically an attendance tracker: the end user keeps track of employees, whether they showed up or not, etc.
I have tried looking up loops, but I couldn't figure out how to do what I am trying to do.
What I have in this Excel file:
A-D : Emp info
E-∞ : 1-3 Days/Dates; 4-∞ emp data (if they missed a day, values for that)
To get a better understanding, see this
The data entered from E5 to xx is where I am trying to get this VBA working.
When the script detects the first value of either '1' or '2', it should sum the next 90 days (cells) starting from there. After 90, it resets to 0; starting from cell 91, it searches for the next '1' or '2' and does the same again.
See the Excel file for a better understanding. If it doesn't make sense, I'll be happy to simplify.
Thank You
The most efficient and clean way to handle this is to use a form of a relational data model because it can be done easily without using VBA code. You will have two simple tables in your spreadsheet, EmployeeInfo and AttendanceRecords. Your Employee info will look something like this
Emp#  Name   Craft            # In 90 Days  NumOf2s  NumOf1s
1     EMP 1  SM Site Manager  0             0        0
2     EMP 2  SM Site Manager  1             0        1
3     EMP 3  SM Site Manager  0             0        0
4     EMP 4  SM Site Manager  0             0        0
5     EMP 5  SM Site Manager  1             0        1
The last three columns are calculated from the AttendanceRecords table. This table is going to be variable size but this way you only need to store the important data (When employees actually got marks). It will look like this.
Emp#  Date       Days  Count
1     12/1/2013  122   1
3     1/1/2014   91    2
2     2/1/2014   60    1
5     2/15/2014  46    1
You can have multiple entries for the same day and the same employee. The important thing is that we only need one entry per infraction (NOTE: In order to do this in a proper database type model, each attendance record should also have some kind of incrementing totally unique ID (like employees), but we can forgo that for this application).
You enter the employee number, the date, and the count. The "Days" column then automatically calculates the age of the record with the following formula:
=TODAY()-[@Date]
NOTE: If the [@Date] notation does not look familiar, it is a structured reference to the current row of an Excel Table. I recommend you read up on those if you are not already familiar with them.
So now we have the age of each record. So back on the EmployeeInfo table, we use the following formula to get all AttendanceRecords that apply to Employee x for the last 90 days
=SUMIFS(AttendanceRecords[Count],AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90")
You can now also use some simple formulas to get the other columns I pointed out, including the number of 2-count infractions or the number of 1-count infractions:
=COUNTIFS(AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90",AttendanceRecords[Count],2)
=COUNTIFS(AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90",AttendanceRecords[Count],1)
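To make the three summary columns concrete, here is a short Python sketch of what the SUMIFS and COUNTIFS formulas compute for one employee. It uses the sample AttendanceRecords rows; the "today" value of 2014-04-02 is inferred from the Days column of the sample (12/1/2013 plus 122 days), and the function and key names are made up for illustration.

```python
from datetime import date

# Sample AttendanceRecords rows: (Emp#, Date, Count).
records = [
    (1, date(2013, 12, 1), 1),
    (3, date(2014, 1, 1), 2),
    (2, date(2014, 2, 1), 1),
    (5, date(2014, 2, 15), 1),
]


def summarize(records, emp, today, window=90):
    """Summarize one employee's infractions from the last `window` days."""
    recent = [
        count
        for emp_no, when, count in records
        # Same test as the "<=90" criterion on the Days column.
        if emp_no == emp and (today - when).days <= window
    ]
    return {
        "in_90_days": sum(recent),       # SUMIFS over Count
        "num_of_2s": recent.count(2),    # COUNTIFS with Count = 2
        "num_of_1s": recent.count(1),    # COUNTIFS with Count = 1
    }


today = date(2014, 4, 2)  # assumed, derived from the sample Days values
for emp in (1, 2, 3, 4, 5):
    print(emp, summarize(records, emp, today))
```

Employees 2 and 5 each show one 1-count infraction in the window, while employees 1 and 3 show none because their records are 122 and 91 days old, which matches the sample EmployeeInfo table.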
There is a lot more data that could be gathered, including the date of the last infraction, total number of infractions for all time, etc. If any of the formulas or terms I used don't make sense or need more explaining, feel free to ask.
EDIT: If you want them automatically removed after 90 days, it would be relatively easy to write a VBA script to do this. It would also be easy to just sort the AttendanceRecords table on Days and delete all records that are older than a certain number of days. However, unless you see yourself adding hundreds of records a week, this really shouldn't be necessary. Also, If you want to write a Visual Basic form to enter in new infractions, that is definitely very possible, but another discussion.
EDIT: To respond to concerns about viewing when these issues happened, here is an example of a way to view the data in your tables. One of the advantages of Excel tables is that the order of the records isn't as fixed as in a normal range, so you can sort, rearrange, and filter them to see what you need. If you need to see all of the issues with employee 3, go to the Emp# column in the AttendanceRecords table, click the little down arrow next to Emp#, uncheck 'Select All', and then check '3'. The only rows you will then see in the table are the ones for employee 3. You can then sort the 'Date' column by clicking its little arrow and selecting 'Sort Newest to Oldest'.
What it comes down to is that you can view ANY data you need to, and if you think through what you really need to see, you can set up your summary table (EmployeeInfo) to display enough data that you hardly ever have to look at the AttendanceRecords table. But if you need to, you can go into that table and do a manual sort (as I described above) very easily.
EDIT: To help add some of the functionality I've shown above to the asker's current spreadsheet, here is a formula that works with the current layout.
=SUMIFS($E5:$BR5,$E$3:$BR$3,">"&(TODAY()-90))
For EMP 1, this formula uses the employee's row as the sum range. It then looks at the dates in the corresponding columns of row 3. If the date in row 3 is later than TODAY()-90, the value in that column is added to the total. This way, only the infractions from the previous 90 days are counted.