How to insert uneven data rows into matrix in SAS? - sql

I have an originations data set with loan ids. I then have a corresponding dataset with performance data for each of these loans ids, which can be anywhere from 10-40 rows in the performance data set.
The start date of each of the performance loans is not the same either, although some do overlap. What I want to do is take every loan id group in the performance data set, and then create a row of a certain column value across all occurrences in the data set. It doesn't matter if they start on different dates, I just want to align the values as this is the first value for loan id x and y.
For example:
ID Date Val
3 201601 100
3 201602 102
3 201603 103
--> Result:
ID Val1 Val2 Val3
3 100 102 103
I'm having two issues. One is the differing size of performance data for each id. I can't construct a matrix with differing lengths of rows. I'm assuming I'll need to append 0's to the end of each row to meet a predefined width.
My second issue is that I'm not sure how to read through a the performance data set to group loans, extract the value column, construct the column into a row for that id, and then insert into a matrix. I know how I would do this in Python but I need to use SAS. I can construct tables in SAS, but I'm not sure how to append rows, only columns.
If someone could provide some guidance on this it'd be a great help.

Anyone who runs into a similar issue it ended up being only a few lines of code.
proc transpose data = new_data
out = new_data1;
var trans_state;
by id;
run;
The output will be

Related

Populate NULL Values based on Array Formula

New user, so apologies in advance for bad formatting.
Essentially what I'm trying to do is be able to populate the staff_hours column where it equals NULL with the one value that IS NOT NULL. As you can see from the screenshot, there will only be one person who staffs an open cl_hole_staffing_no and as a result will have a start_dt (with time) and end_dt (with time) along with staff_hours. 16 people were offered a shift, and the person in row 15 accepted it is what is going on here.
The ideal output would be the staff_hours column is populated with the amount of time of the one person who ended up taking the open job, so 24.00 in this example. How can I write a formula to do this? I was thinking something like an array function in Excel, but am not sure how to do that in SQL.
Your explanation is a bit confusing about what you are really trying to achieve. However I think that what you really want is just to populate the staff_hours column, which can be achieved with the following:
UPDATE
your_table_name
SET
staff_hours = 24
WHERE
staff_hours is NULL;
EDIT
I get it now. You want to operate with the two dates and extract the amount of hours between them. Since you are in sql-server you can actually define a Computed Column in which you can use the values from other columns to compute the value you want.
You will need to create your table again. (The example below contains only the necessary attributes for it to work)
CREATE TABLE your_table_name
( id INT IDENTITY (1,1) NOT NULL
, staff_start_dt DATETIME
, staff_end_dt DATETIME
, staff_hours AS DATEDIFF(hh, staff_start_dt , staff_end_dt)
);
Now every time you insert a record on the table with both staff_start_dt and staff_end_dt, the column staff_hours will automatically compute the number of hours between the two dates.
[pre]
Code (vb):
A B C
1 10 X X
2 11 A Y
3 12 Y Z
4 13 B
5 14 B
6 15 Z
[/pre]
Assuming that the rows in Col A is Named "datarange"
And your criteria is in C1:C3
The following formula will return an array {10,12,15}
=SMALL(COUNTIF(C1:C3,B1:B6)*datarange, ROW(INDEX(A:A,SUMPRODUCT(--(COUNTIF(C1:C3,B1:B6)=0))+1):INDEX(A:A,ROWS(datarange))))
COUNTIF(C1:C3,B1:B6)*datarange returns {10;0;12;0;0;15}
The segment ROW(INDEX(....):INDEX(...)) returns {4;5;6}, indicating the number of non-zero values.
The SMALL() function then returns the 4th smallest, 5th smallest and 6th smallest values.
One disadvantage with this approach is that you get a sorted sub-list. Perhaps that would work for you.

How to combine a row of cells in VBA if certain column values are the same

I have a database where all of the input from the user (through a userform) gets stored. In the database, each column is a different category for the type of data (ex. date, shift, quantity, etc) and the data from the userform input gets put into its corresponding category. For some of the data, all the data is the same except for the quantity. I was wondering how I could combine these rows into one and add the quantities to each other for the whole database (ex. combining the first and third data entries). I have tried playing around with a couple different loops but can't seem to figure anything out.
Period Date Line Shift Type Quantity
4 x 2 4/3/18 A 3 14 18
4 x 2 4/3/18 A 3 13 12
4 x 2 4/3/18 A 3 14 15
Thank you!
If you're looking to modify the underlying database, you might be able to query the data into the format you want by including all the other columns in a GROUP BY statement, save the result to another table, then replace the original table with the properly formatted one.
If you have the data in Excel and you just want to view it with the duplicate rows summed, a Pivot Table would be a good choice. You can select all the other columns as rows for the Pivot Table and sum of Quantity as the values.

Counting latest instance of multiple only based on filter context

I've got a large table of events that have occurred in an inventory of vehicles, which affect whether they are in service or out of service. I would like to create a measure that would be able to count the number of vehicles in the various inventories at any point in time, based on the events in this table.
This table is pulled from a SQL database into an Excel 2016 sheet, and I'm using PowerPivot to try to come up with the DAX measure.
Here is some example data event_list:
vehicle_id event_date event event_sequence inventory
100 2018-01-01 purchase 1 in-service
101 2018-01-01 purchase 1 in-service
102 2018-02-04 purchase 1 in-service
100 2018-02-07 maintenance 2 out-of-service
101 2018-02-14 damage 2 out-of-service
101 2018-02-18 repaired 3 in-service
100 2018-03-15 repaired 3 in-service
102 2018-05-01 damage 2 out-of-service
103 2018-06-03 purchase 1 in-service
I'd like to be able to create a pivot table in Excel (or use CUBE functions, etc) to get an output table like this:
date in-service out-of-service
2018-02-04 3 0
2018-02-14 1 2
2018-03-15 3 0
2018-06-03 3 1
Essentially, I want to be able to calculate the inventory based on any date in time. The example only has a few dates, but hopefully provides enough of a picture.
I've basically come up with this so far, but it counts more vehicles than desired - I can't figure out how to only take the latest event_sequence or event_date and use that to count the inventory.
cumulative_vehicles_at_date:=CALCULATE(
COUNTA([vehicle_id]),
IF(IF(HASONEVALUE (event_list[event_date]), VALUES (event_list[event_date]))>=event_list[event_date],event_list[event_date])
)
I tried using MAX() and EARLIER() functions, but they don't seem to work.
Edit: Added the PowerBI tag as I'm now using that software to attempt to solve this as well. See comments on Alexis Olson's answer.
I think I've found a much cleaner method than I gave previously.
Let's add two columns onto the event_list table. One which counts vehicles "in-service" on that date and one which counts vehicles "out-of-service" on that date.
InService =
VAR Summary = SUMMARIZE(
FILTER(event_list,
event_list[event_date] <= EARLIER(event_list[event_date])),
event_list[vehicle_id],
"MaxSeq", MAX(event_list[event_sequence]))
VAR Filtered = FILTER(event_list,
event_list[event_sequence] =
MAXX(
FILTER(Summary,
event_list[vehicle_id] = EARLIER(event_list[vehicle_id])),
[MaxSeq]))
RETURN SUMX(Filtered, 1 * (event_list[inventory] = "in-service"))
You can create an analogous calculated column for OutOfService or you can just take the total minus the InService count.
OutOfService =
CALCULATE(
DISTINCTCOUNT(event_list[vehicle_id]),
FILTER(event_list,
event_list[event_date] <= EARLIER(event_list[event_date])))
- event_list[InService]
Now all you have to do is put event_date on the matrix visual rows section and add the InService and OutOfService columns to the values section (use Maximum or Minimum for the aggregation option rather than Sum).
Here's the logic behind the calculated column InService:
We first create a Summary table which calculates the maximal event_sequence value for each vehicle. (We filter the event_date to only consider dates up to the current one we are working with.)
Now that we know what the last event_sequence value is for each vehicle, we use that to filter the entire table down to just the rows that correspond to those vehicles and sequence values. The filter goes through the table row by row and checks to see if the sequence value matches the one we calculated in the Summary table. Note that when we filter the Summary table to just the vehicle we are currently working with, we only get a single row. I'm just using MAXX to extract the [MaxSeq] value. (It's kind of like using LOOKUPVALUE, but you can't use that on a variable.)
Now that we've filtered the table just to the most recent events for each vehicle, all we need to do is count how many of them are "in-service". I used a SUMX here where the 1*(True/False) coerces the boolean value to return 1 or 0.
This is pretty difficult. I don't have a great answer, but here's something that kind of works.
You'll create a new calculated table where you'll calculate the status for each vehicle on each date. Start with the base cross join for each vehicle and each date:
= CROSSJOIN(VALUES(event_list[vehicle_id]), VALUES(event_list[event_date]))
Then add a calculated column to find the max sequence number for each vehicle on that date.
Sequence = MAXX(
FILTER(event_list,
event_list[event_date] <= Cross[event_date] &&
event_list[vehicle_id] = Cross[vehicle_id]),
event_list[event_sequence])
Now you can lookup the inventory value for each vehicle/sequence pair with another calculated column:
Inventory = LOOKUPVALUE(
event_list[inventory],
event_list[vehicle_id], Cross[vehicle_id],
event_list[event_sequence], Cross[Sequence])
The result should look something like this:
Once you have this, you can create a matrix using this calculated table. Put the event_date on the rows and Inventory on the columns. Filter out blank inventory values in the visual level filter and put the vehicle_id in the values field, using a count or distinct count as the aggregation method (instead of the default sum).
It should look like this:

How to Pivot a single column source data in SQL?

Below are the input and output details.Any database Oracle, SQL Server and MySQL should do for the answers.I am not able to derive the logic to rank data which will help me to pivot.
My source is a flat file which contains data like below.I have loaded that file into one of the tables in Oracle.
Source Input:
**Flatfile1**
**Coulmn1**
Kamesh
65
5000
123456789
Nanu
45
3000
321654789
Expected Output:
Name Age Salary Mobilenumber
Kamesh 65 5000 123456789
Nanu 45 3000 321654789
After loading into one of the tables I am applying the logic to number this data which will eventually look like below:
Column1 Datavalue
Kamesh 1
65 1
5000 1
123456789 1
Nanu 2
45 2
3000 2
321654789 2
However, I am not able to derive logic (I tried with Rank) which will give me sequence number like this without having any key field.Hope this explains situation.
Thanks!!
Oracle doesn't store the rows in order, if you do select * from table1 multiple times you could get rows in different orders according to db operations and caching
Therefore if you have a table like that with no other column it's impossible to "pivot" the data.
I strongly suggest to save data in a normalized form, if you can't consider adding a column with a row ID populated automatically (identity column in oracle 12, trigger+ sequence in previous version)
Once you have your rows in order it will be easy to organize your data

Average Distinct Values in a single column in Power Pivot

I have a column in PowerPivot that basically goes:
1
1
2
3
4
3
5
4
If I =AVERAGE([Column]), it's going to average all 8 values in the sample column. I just need the average of the distinct values (i.e., in the example above I want the average of (1,2,3,4,5).
Any thoughts on how to go about doing this? I tried a combination of =(DISTINCT(AVERAGE)) but it gives a formula error.
Thanks!!
Kevin
There must be a cleaner way of doing this but here is one method which uses a measure to get the sum of the values divided by the number of times it appears (to basically give the original value) then uses an iterative function to do it for each unique value.
Apologies for the uninspired measure names:
[m1] = SUM(table1[theValue]) / COUNTROWS(Table1)
[m2] = AVERAGEX(VALUES(Tables1[theValue]), [m1])
Assuming your table is caled table1 and the column is called theValue