Row blocks in Hive (how to group rows by certain criteria and count these groups) - sql

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row is a (date, user ID) pair indicating that the user was active on that day. A user can appear on multiple dates, and a date can have multiple users, just like in the example.
I want to get the number of consecutive-day groups (i.e. blocks of activity) for each user. For example, this value for 'User a' would be 2 because they were active on 20140401 and 20140402, which is the first group of consecutive days. After 20140402, they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406, their second block of activity started and continued without any break up until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive-day periods: 1) 20140404; 2) 20140408, 20140409; 3) 20140414, 20140415.
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to an RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers

When you use the distribute by clause, e.g. .......distribute by user_id sort by user_id,date_key desc......, all the records for a particular user go to the same reducer, where they are then sorted by date_key descending. From there, you could write a UDF that iterates through the records, increments a counter by 1 whenever there is a break in the continuity (starting at 1 for the user's first block), and returns the result along with the user_id.
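The counting step such a UDF would perform can be sketched outside Hive. The following is a minimal Python sketch, not a Hive UDF; the function name count_activity_blocks is hypothetical, and it sorts each user's dates ascending rather than descending, which does not change the number of blocks.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_activity_blocks(rows):
    """Return {user_id: number of blocks of consecutive active days}."""
    days_by_user = defaultdict(set)
    for date_key, user_id in rows:
        days_by_user[user_id].add(datetime.strptime(date_key, "%Y%m%d").date())

    blocks = {}
    for user_id, days in days_by_user.items():
        count = 0
        previous = None
        for day in sorted(days):
            # A new block starts on the first day, or after any gap
            # that is not exactly one day.
            if previous is None or day - previous != timedelta(days=1):
                count += 1
            previous = day
        blocks[user_id] = count
    return blocks


# Sample rows from the question: (Date_key, UserID).
rows = [
    ("20140401", "a"), ("20140402", "a"), ("20140406", "a"),
    ("20140407", "a"), ("20140408", "a"), ("20140409", "a"),
    ("20140404", "b"), ("20140408", "b"), ("20140409", "b"),
    ("20140414", "b"), ("20140415", "b"),
]

result = count_activity_blocks(rows)
print(result)
```

On the sample data this gives 2 blocks for user a and 3 for user b, matching the values described in the question.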

Related

SQL new variable using multiple conditions (count of occurrences in 6 month look-back period using timestamp for each unique ID)

I am trying to achieve the following:
Attached is what my data looks like.
I want to create 2 new variables that count the number of times 'Target' (variable 1) and 'Competitor' (variable 2) appear within the 6 months before a given date_of_prescription. This would be done for every unique D_PRESCRIBER_ID.
So for example:
ID 1003000902 prescribes the COMPETITOR drug on 2020-03-18. When you look at the rows before that, you can see that within the 6 months prior to 2020-03-18 there are 2 Target drugs prescribed and 0 Competitor drugs prescribed. So my variable values will be 2 (variable 1) and 0 (variable 2).
My data is much larger than what the screenshot shows. It has more variables and thousands of unique D_PRESCRIBER_IDs. Each row is not a unique ID; there are duplicate IDs in the data for various date_of_prescription timestamps. These variables need to be created in my select statement in order to keep the rest of the data the same.
Any help here would be awesome. Thanks!
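The look-back logic described above can be sketched in plain Python to pin down the intended semantics before writing the SQL. This is only a sketch under assumptions: the rows below are hypothetical (the real data is in the screenshot), the drug labels are taken to be exactly 'Target' and 'Competitor', the window is taken to include the day exactly 6 months earlier and exclude the prescription date itself, and the helper names are made up for illustration.

```python
import calendar
from datetime import date


def months_before(d, months):
    """Return the date `months` calendar months before d, clamping the day."""
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def add_lookback_counts(rows, months=6):
    """Append (target_count, competitor_count) for the prior window to each row."""
    out = []
    for prescriber, day, drug in rows:
        start = months_before(day, months)
        target = competitor = 0
        for other_prescriber, other_day, other_drug in rows:
            # Only earlier rows of the same prescriber inside the window count;
            # the current row itself is excluded by the strict "<".
            if other_prescriber == prescriber and start <= other_day < day:
                if other_drug == "Target":
                    target += 1
                elif other_drug == "Competitor":
                    competitor += 1
        out.append((prescriber, day, drug, target, competitor))
    return out


# Hypothetical rows: (D_PRESCRIBER_ID, date_of_prescription, drug).
rows = [
    ("1003000902", date(2019, 8, 1), "Target"),
    ("1003000902", date(2019, 10, 2), "Target"),
    ("1003000902", date(2020, 1, 15), "Target"),
    ("1003000902", date(2020, 3, 18), "Competitor"),
]

for row in add_lookback_counts(rows):
    print(row)
```

For the 2020-03-18 row, the window starts on 2019-09-18, so two Target rows and no Competitor rows fall inside it. The nested loop is quadratic per prescriber and is meant only to make the rule explicit, not to scale.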

code to Look at rows and return the value for the highest number for a user

I have a table (a sample of its contents is below) that shows the user ID plus various milestones and admission stages. What I need to do is look at the highest number in milestone_stage_number for each user and return the values of milestone, admission_stage and milestone_latest_stage from that row. So in the example below, the query should return only one row for user 1, the one with milestone_stage_number = 4 (which is the maximum for that person), giving admission_stage = accepted, milestone_latest_stage = emailed and milestone = emailed. In my actual table I have over 12000 users, but I need the query to return just one row per user, with the information for that user's maximum stage number. So if user 2 appears five times, the query should return only the row with the highest milestone_stage_number, and after running the query I get one row for user 1 and one row for user 2.
My table is called applicants.
Person_id  Milestone     admission_stage  milestone_latest_stage  milestone_stage_number
1          Under Review  Accepted         Accepted                2
1          emailed       accepted         emailed                 4
1          offered       accepted         accepted                3
1          submitted     reviewed         offered                 1
If your database supports QUALIFY, you could combine it with a window function:
SELECT * FROM applicants
QUALIFY MAX(milestone_stage_number) OVER (PARTITION BY Person_id) = milestone_stage_number
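To make the filter concrete, here is a minimal Python sketch of the same rule: keep each row whose milestone_stage_number equals the maximum for its Person_id. The function name rows_at_max_stage and the row for person 2 are hypothetical; the rows for person 1 come from the sample table.

```python
def rows_at_max_stage(rows):
    """Keep rows whose stage number equals the maximum for that person."""
    max_stage = {}
    for row in rows:
        person_id, stage = row[0], row[4]
        if person_id not in max_stage or stage > max_stage[person_id]:
            max_stage[person_id] = stage
    return [row for row in rows if row[4] == max_stage[row[0]]]


applicants = [
    # (Person_id, Milestone, admission_stage, milestone_latest_stage,
    #  milestone_stage_number)
    (1, "Under Review", "Accepted", "Accepted", 2),
    (1, "emailed", "accepted", "emailed", 4),
    (1, "offered", "accepted", "accepted", 3),
    (1, "submitted", "reviewed", "offered", 1),
    (2, "submitted", "reviewed", "submitted", 1),  # hypothetical second user
]

latest = rows_at_max_stage(applicants)
for row in latest:
    print(row)
```

As with the window-function comparison, a person whose maximum stage number appears on more than one row would get all of those rows back, so a tie-breaker is needed if exactly one row per user is required.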

Should I combine the columns of a fact table to make it more narrow, or should I keep it more user friendly with a lot of columns?

I have a Fact table that shows the results of KPIs. There are several KPIs, and some of these have a similar output.
My current columns are something like this:
KPI_ID, DOCUMENT_ID, TRUE_FALSE_FLAG1, TRUE_FALSE_FLAG2, DURATION_3, DURATION_4
So, for KPI number 1 (true/false output), the last three columns will be NULL. Should I combine TRUE_FALSE_FLAG1 and TRUE_FALSE_FLAG2? What is the best practice?
In total, there are 18 columns, 12 of which are either true/false flags or durations expressed as a number of days (integer).
picture of the two alternatives
EDIT:
KPI 3 could be "duration of problem", and you'd have a bunch of problems, each with a document ID, represented as a row. DURATION_3 would be something like 5 days, 3 days, 10 days, etc. KPI 4 would be "delay of fix after repair was ordered", and the answer would still be an integer number of days, but completely unrelated to KPI 3.
Reporting could be "average delay of fix". So roughly a select AVG() from table where KPI_ID = 3 group by KPI_ID.
Based on your latest comment, you are best off with Alternative 2. Specifically, as long as every KPI has only one True/False value and only one duration to store, Alternative 2 is the better choice.
EDIT: with Alternative 2, each KPI can store one True/False value AND one duration value

Access SQL - Add Row Number to Query Result for a Multi-table Join

What I am trying to do is fairly simple: I just want to add a row number to a query. Since this is in Access, it is a bit more difficult than in other SQL dialects, but under normal circumstances it is still doable using solutions such as DCount or Select Count(*). Examples here: How to show row number in Access query like ROW_NUMBER in SQL, or Access SQL how to make an increment in SELECT query.
My Issue
My issue is I'm trying to add this counter to a multi-join query that orders by fields from numerous tables.
Troubleshooting
My code is a bit ridiculous (19 fields, seven of which are long expressions, from 9 different joined tables, and ordered by fields from 5 of those tables). To keep things simple, here is a simplified example query:
Example Query
SELECT DCount("*","Requests_T","[Requests_T].[RequestID]<=" & [Requests_T].[RequestID]) AS counter,
       Requests_T.RequestHardDeadline AS Deadline,
       Requests_T.RequestOverridePriority AS Priority,
       Requests_T.RequestUserGroup AS [User Group],
       Requests_T.RequestNbrUsers AS [Nbr of Users],
       Requests_T.RequestSubmissionDate AS [Submitted on],
       Requests_T.RequestID
FROM (((Requests_T
INNER JOIN ENUM_UserGroups_T ON ENUM_UserGroups_T.UserGroups = Requests_T.RequestUserGroup)
INNER JOIN ENUM_RequestNbrUsers_T ON ENUM_RequestNbrUsers_T.NbrUsers = Requests_T.RequestNbrUsers)
INNER JOIN ENUM_RequestPriority_T ON ENUM_RequestPriority_T.Priority = Requests_T.RequestOverridePriority)
ORDER BY Requests_T.RequestHardDeadline,
         ENUM_RequestPriority_T.DisplayOrder DESC,
         ENUM_UserGroups_T.DisplayOrder,
         ENUM_RequestNbrUsers_T.DisplayOrder DESC,
         Requests_T.RequestSubmissionDate;
If the code above selects a field from a table that isn't included, I apologize; just trust that the field comes from one of the other joins I excluded to simplify the query. A good example of this is the .DisplayOrder fields used in the ORDER BY clause. These are fields from tables that simply determine the "priority" of an enum value. Example: Requests_T.RequestOverridePriority is displayed to the user as a combobox with the options "Low", "Med" and "High". In a table, I assign these options the numerical priorities 1, 2 and 3, respectively. Thus when ENUM_RequestPriority_T.DisplayOrder DESC is used in the ORDER BY, all "High" priority requests are displayed above "Med" and "Low". The same holds true for ENUM_UserGroups_T.DisplayOrder and ENUM_RequestNbrUsers_T.DisplayOrder.
I'd also prefer NOT to use DCount, for efficiency reasons, and would rather do something like:
(select count(*) from Requests_T as R2 where R2.RequestID <= Requests_T.RequestID) as counter
Because of the ORDER BY clause, however, my counter doesn't number the resulting rows sequentially, since both of my examples are tied to the RequestID.
Example Results
Based on my actual query results, I've made an example result of the query above.
Counter  Deadline    Priority  User_Group  Nbr_of_Users  Submitted_on  RequestID
5        12/01/2016  High      IT          2-4           01/01/2016    5
7        01/01/2017  Low       IT          2-4           05/06/2016    8
10                   Med       IT          2-4           07/13/2016    11
15                   Low       IT          10+           01/01/2016    16
8                    Low       IT          2-4           01/01/2016    9
2                    Low       IT          2-4           05/05/2016    2
The query displays my results in the proper order (those with the nearest deadline at the top, then those with the highest priority, then user group, then number of users, and finally, if all else is equal, submission date). However, my "Counter" values are completely wrong! The counter field should simply increment by 1 for each new row. Then, when displaying a single request on a form for a user, I could say:
"You are number: Counter [associated to RequestID] in the development queue."
Meanwhile my results:
Aren't sequential (notice the first four display sequentially, but then the final two rows don't)! Even though the final two rows are lower in priority than the records above them, they ended up with a lower Counter value simply because they had the lower RequestID.
They don't start at "1" and increment +1 for each new record.
Ideal Results
Thus my ideal result from above would be:
Counter  Deadline    Priority  User_Group  Nbr_of_Users  Submitted_on  RequestID
1        12/01/2016  High      IT          2-4           01/01/2016    5
2        01/01/2017  Low       IT          2-4           05/06/2016    8
3                    Med       IT          2-4           07/13/2016    11
4                    Low       IT          10+           01/01/2016    16
5                    Low       IT          2-4           01/01/2016    9
6                    Low       IT          2-4           05/05/2016    2
I'm spoiled by PLSQL and other software where this would be automatic lol. This is driving me crazy! Any help would be greatly appreciated.
FYI - I'd prefer an SQL option over VBA if possible. VBA is very much welcomed and will definitely get an up vote and my huge thanks if it works, but I'd like to mark an SQL option as the answer.
Unfortunately, MS Access doesn't have the very useful ROW_NUMBER() function that other database systems do, so we are left to improvise.
Because your query is so complicated and MS Access does not support common table expressions, I recommend you follow a two-step process. First, save the query you already wrote under the name IntermediateQuery. Then, write a second query called FinalQuery that does the following:
SELECT i1.field_primarykey, i1.field2, ... , i1.field_x,
(SELECT COUNT(*) FROM IntermediateQuery i2
WHERE i2.field_primarykey <= i1.field_primarykey) AS Counter
FROM IntermediateQuery i1
ORDER BY i1.field_primarykey
The unfortunate side effect of this is the more data your table returns, the longer it will take for the inline subquery to calculate. However, this is the only way you'll get your row numbers. It does depend on having a primary key in the table. In this particular case, it doesn't have to be an explicitly defined primary key, it just needs to be a field or combination of fields that is completely unique for each record.
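The counting idea behind such an inline subquery is that a row's number equals the count of rows that sort at or before it. The Python sketch below illustrates that idea on the question's example rows, with one adaptation that is not part of the answer above: it compares the full sort order (deadline, priority, user group, number of users, submission date) instead of the primary key alone. The display-order values for user group and number of users are hypothetical, blank deadlines are assumed to sort last as in the example results, and RequestID is appended as a tie-breaker so every key is unique.

```python
from datetime import date

PRIORITY_ORDER = {"Low": 1, "Med": 2, "High": 3}  # as described in the question
NBR_USERS_ORDER = {"2-4": 1, "10+": 2}            # hypothetical display order
USER_GROUP_ORDER = {"IT": 1}                      # hypothetical display order

requests = [
    # (RequestID, deadline, priority, user_group, nbr_users, submitted_on)
    (2, None, "Low", "IT", "2-4", date(2016, 5, 5)),
    (5, date(2016, 12, 1), "High", "IT", "2-4", date(2016, 1, 1)),
    (8, date(2017, 1, 1), "Low", "IT", "2-4", date(2016, 5, 6)),
    (9, None, "Low", "IT", "2-4", date(2016, 1, 1)),
    (11, None, "Med", "IT", "2-4", date(2016, 7, 13)),
    (16, None, "Low", "IT", "10+", date(2016, 1, 1)),
]


def sort_key(row):
    request_id, deadline, priority, group, nbr_users, submitted = row
    return (
        deadline or date.max,         # assumption: blank deadlines sort last
        -PRIORITY_ORDER[priority],    # DESC
        USER_GROUP_ORDER[group],      # ASC
        -NBR_USERS_ORDER[nbr_users],  # DESC
        submitted,                    # ASC
        request_id,                   # tie-breaker, keeps keys unique
    )


def counter_for(row, rows):
    """Row number = how many rows sort at or before this one."""
    key = sort_key(row)
    return sum(1 for other in rows if sort_key(other) <= key)


counters = {row[0]: counter_for(row, requests) for row in requests}
for request_id, counter in sorted(counters.items(), key=lambda item: item[1]):
    print(counter, request_id)
```

With these assumed display orders, the counters run from 1 to 6 in the order shown in the ideal results. As with the subquery, the work grows quadratically with the number of rows.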

Find value -> sum for 90 cells -> drop -> repeat

I have been working on this spreadsheet for a while. It's basically an attendance tracker: the end user keeps track of employees, whether they showed up or not, etc.
I have tried looking up loops, but I couldn't figure out how to do what I am trying to do.
What I have in this Excel file:
A-D : Emp info
E-∞ : 1-3 Days/Dates; 4-∞ emp data (if they missed a day, values for that)
To get better understanding, see this
The data is entered from E5 to xx; that's where I am trying to get this VBA working.
When the script detects the first value of either '1' or '2', it should start a 90-day (90-cell) window from there and sum the values in it. After 90 cells, it resets to 0. Starting from cell 91, it searches for the next '1' or '2' and does the same again.
See the Excel file for a better understanding. If it doesn't make sense, I'll be happy to simplify.
Thank You
The most efficient and clean way to handle this is to use a form of a relational data model because it can be done easily without using VBA code. You will have two simple tables in your spreadsheet, EmployeeInfo and AttendanceRecords. Your Employee info will look something like this
Emp# Name Craft # In 90 Days NumOf2s NumOf1s
1 EMP 1 SM Site Manager 0 0 0
2 EMP 2 SM Site Manager 1 0 1
3 EMP 3 SM Site Manager 0 0 0
4 EMP 4 SM Site Manager 0 0 0
5 EMP 5 SM Site Manager 1 0 1
The last three columns are calculated from the AttendanceRecords table. This table is going to be variable size but this way you only need to store the important data (When employees actually got marks). It will look like this.
Emp# Date Days Count
1 12/1/2013 122 1
3 1/1/2014 91 2
2 2/1/2014 60 1
5 2/15/2014 46 1
You can have multiple entries for the same day and the same employee. The important thing is that we only need one entry per infraction (NOTE: In order to do this in a proper database type model, each attendance record should also have some kind of incrementing totally unique ID (like employees), but we can forgo that for this application).
You enter the employee number, the date, and the count. The "Days" column then automatically calculates the age of the record with the following formula:
=TODAY()-[@Date]
NOTE: If the [@Date] notation does not look familiar, this is because it is a structured reference, which is used with Excel Tables. I recommend you read up on those if you are not already familiar with them.
So now we have the age of each record. Back on the EmployeeInfo table, we use the following formula to sum the counts of all AttendanceRecords that apply to employee x for the last 90 days:
=SUMIFS(AttendanceRecords[Count],AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90")
You can now also use some simple formulas to get the other columns I pointed out, namely the number of 2-count infractions and the number of 1-count infractions:
=COUNTIFS(AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90",AttendanceRecords[Count],2)
=COUNTIFS(AttendanceRecords[Emp'#],[@[Emp'#]],AttendanceRecords[Days],"<=90",AttendanceRecords[Count],1)
There is a lot more data that could be gathered, including the date of the last infraction, total number of infractions for all time, etc. If any of the formulas or terms I used don't make sense or need more explaining, feel free to ask.
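For readers who want to check the logic outside Excel, the three summary columns can be reproduced with a short Python sketch. It is an illustration of the same conditions, not a replacement for the formulas; the function name summarize is hypothetical, and today is fixed at 2014-04-02, the date that reproduces the Days values in the sample AttendanceRecords table.

```python
from datetime import date

# Sample AttendanceRecords rows: (Emp#, Date, Count).
attendance_records = [
    (1, date(2013, 12, 1), 1),
    (3, date(2014, 1, 1), 2),
    (2, date(2014, 2, 1), 1),
    (5, date(2014, 2, 15), 1),
]


def summarize(emp, records, today, window_days=90):
    """Mirror the SUMIFS/COUNTIFS conditions for one employee."""
    recent = [
        count
        for record_emp, day, count in records
        # Same test as Days <= 90, where Days = TODAY() - Date.
        if record_emp == emp and (today - day).days <= window_days
    ]
    return {
        "in_90_days": sum(recent),
        "num_of_2s": recent.count(2),
        "num_of_1s": recent.count(1),
    }


today = date(2014, 4, 2)  # assumed "today", derived from the sample Days column
summary = {emp: summarize(emp, attendance_records, today) for emp in range(1, 6)}
for emp, values in summary.items():
    print(emp, values)
```

With that date, employees 2 and 5 each show one infraction in the last 90 days and the others show none, which agrees with the sample EmployeeInfo table.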
EDIT: If you want them automatically removed after 90 days, it would be relatively easy to write a VBA script to do this. It would also be easy to just sort the AttendanceRecords table on Days and delete all records that are older than a certain number of days. However, unless you see yourself adding hundreds of records a week, this really shouldn't be necessary. Also, If you want to write a Visual Basic form to enter in new infractions, that is definitely very possible, but another discussion.
EDIT: To respond to concerns about viewing when these issues happened, here is an example of a way to view the data in your tables. One of the advantages of Excel tables is that the order of the records isn't as fixed as in a normal range, so you can sort, rearrange, and filter them to see what you need. So if you need to see all of the issues for employee 3, go to the Emp# column in the AttendanceRecords table, click the little down-arrow button next to Emp#, uncheck 'Select All', and then check '3'. The only rows you will then see in the table are the ones for employee 3. You can then sort the 'Date' column by clicking its little arrow and selecting 'Sort Newest to Oldest'.
What it comes down to is that you can view ANY data you need to, and if you think through what you really need to see, you can set up your summary table (EmployeeInfo) to display enough data that you hardly ever have to look at the AttendanceRecords table. But if you need to, you can go into that table and do a manual sort (as I described above) very easily.
EDIT: To help add some of the functionality I've shown above to the asker's current spreadsheet, here is a formula that works with its current layout:
=SUMIFS($E5:$BR5,$E$3:$BR$3,">"&(TODAY()-90))
For EMP 1, this formula uses the employee's row as the sum range. It then looks at the dates in the corresponding columns of row 3. If the date in row 3 is later than TODAY()-90, the value in that column is added to the sum. This way, the formula only looks at the infractions from the previous 90 days.