I have a table with a customer column followed by multiple shop columns, each holding a flag that indicates whether the customer has visited that shop; if they haven't, the cell is null. The shops are listed in order of importance, with shop1 being the highest, then shop2, shop3, and so on. The flag is the shop's own number. So, for example, if a customer hasn't visited shop1 that cell will be blank, but if they have visited shop2 that cell will be '2'.
I need to merge the columns to create a table that holds, for each customer, the top 4 shops they have visited. For example, the entry for a customer could read '2' in the first column, '5' in the second, '7' in the third and '8' in the fourth, because they haven't visited shops 1, 3, 4 or 6. Could someone please help? Thank you.
I think this is what you are looking for. Take the input data (what I assume you mean), perform a transpose, filter the missing (null) values, and transpose again.
data input;
input customer $ shop01 shop02 shop03 shop04;
datalines;
Bill 1 2 . .
Ted . 2 3 .
;
proc sort data=input;
by customer;
run;
proc transpose data=input out=temp(where=(col1 ^= .) drop=_name_);
by customer;
run;
proc transpose data=temp out=output(drop=_name_);
by customer;
run;
This gives me:
Bill 1 2
Ted 2 3
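Given how your data is laid out, you could use the SMALLEST function with a variable list to return the smallest non-missing value in your shop variables, then the second smallest, and so on. SMALLEST ignores missing values. (The range shop01-shop08 below assumes eight shop columns; adjust it to match your data.)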
data test(drop=shop01-shop08);
set input;
first = smallest(1,of shop01-shop08);
second = smallest(2,of shop01-shop08);
third = smallest(3,of shop01-shop08);
fourth = smallest(4,of shop01-shop08);
run;
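If the flags were ever something other than the shop number, an array loop that walks the columns left to right would give the same "first four visited" result by position, not by value. The following is only a sketch under that assumption; the data set VISITS, its sample rows, and the output name TOP4 are hypothetical.

```sas
/* Hypothetical sample data: eight shop flags per customer */
data visits;
  input customer $ shop01-shop08;
  datalines;
Ann . 2 . . 5 . 7 8
Bob 1 . 3 . . . . .
;
run;

data top4(drop=i n shop01-shop08);
  set visits;
  array shop{*} shop01-shop08;             /* shops in order of importance */
  array top{4} first second third fourth;  /* output columns */
  n = 0;
  /* Copy non-missing flags in column order until four are found */
  do i = 1 to dim(shop) while (n < 4);
    if not missing(shop{i}) then do;
      n + 1;
      top{n} = shop{i};
    end;
  end;
run;
```

With these rows, Ann ends up with 2, 5, 7, 8, and Bob with 1, 3 and two missing values, since he visited only two shops.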
I am trying to stack 3 columns into one, but I would like to keep a filter column so that I can tell the variables apart. I have tried COALESCE and UNION ALL, but I can't work out how to do it, given that I do not have an ID column.
Here are the tables:
You can use a data step approach.
I'm not typing out your data so here's a fully worked example that's similar to yours but not exactly the same.
Use VNAME() to get the variable name.
Use an array to get the values.
DATA wide;
input famid faminc96 faminc97 faminc98 ;
CARDS;
1 40000 40500 41000
2 45000 45400 45800
3 75000 76000 77000
;
RUN;
DATA long1a;
SET wide;
*declare an array with the list of variables to transpose;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
DO year = 96 to 98 ;
faminc = afaminc(year);
variable_name = vname(afaminc(year));
OUTPUT;
END;
DROP faminc96 - faminc98 ;
RUN;
Wide to long using data step
https://stats.idre.ucla.edu/sas/modules/reshaping-data-wide-to-long-using-a-data-step/
Arrays:
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
I am looking at transactional data such as my credit card statement. I want to ensure that my card is not being swiped twice. The fields that I have are card number (I have multiple), amount of transaction, transaction date, merchant code, merchant name, and transaction code.
To know if it is a true duplicate transaction, I want to know if the merchant code, merchant name, and transaction amount appear more than once. I also want to make sure that the transactions were within 5 days of each other if all else matches.
I am doing the work in SAS code, but I can also use PROC SQL. So far in SAS I've sorted the data and then pulled a table that only holds duplicates, but since I've sorted the data, it will only call something a duplicate if the dates are exactly the same, instead of applying the 5-day rule mentioned above.
I did a simple PROC SORT.
PROC SORT DATA=WORK.TRANSACTIONS
OUT=WORK.TRANSACTIONS1
DUPOUT=WORK.SORTSORTEDDUPS
NODUPKEY;
BY CARD_NUMBER TRANSACTION_AMOUNT TRANSACTION_DATE MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE;
RUN;
What do I need to incorporate to add my rule of transaction within 5 days?
You can do it with an additional pass, retaining (and comparing to) the last transaction date as per the below. Note the change in the sort BY statement (you'll need to update the proc sort also).
data duplicates;
set work.transactions1;
by CARD_NUMBER TRANSACTION_AMOUNT MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE TRANSACTION_DATE;
retain datecheck 0;
if first.TRANSACTION_CODE then datecheck=0;
else if TRANSACTION_DATE-datecheck le 5 then output;
datecheck=TRANSACTION_DATE;
run;
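The updated sort mentioned above might look like the sketch below. It assumes the card number is stored in a single variable, here called CARD_NUMBER, and it deliberately omits NODUPKEY so that repeated transactions are still present when the DATA step compares dates.

```sas
/* Sort so that TRANSACTION_DATE is the last key, matching the DATA step's BY statement. */
/* CARD_NUMBER is an assumed variable name for the card number field.                    */
proc sort data=work.transactions out=work.transactions1;
  by card_number transaction_amount merchant_code merchant_name
     transaction_code transaction_date;
run;
```

Because the date is the final key, rows that match on every other key sit next to each other in date order, which is what the retained DATECHECK comparison relies on.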
Let's create our practice data source:
DATA MY_CREDIT_CARDS;
INPUT
C_NUMBER
TRANC_AMOUNT
TRANSC_DATE :DATE10.
TRANSC_CODE
MERCH_CODE
MERCH_NAME $10.;
FORMAT TRANSC_DATE DDMMYY10.;
CARDS;
1 100 17JAN1990 1 1 AMAZON
2 200 01JAN1990 2 8 WALLMART
4 100 04JAN1990 3 5 CRUSTYKRAB
2 200 07JAN1990 4 7 NETFLIX
1 300 01JAN1990 5 2 GOOGLEPLAY
3 200 17JAN1990 6 8 WALLMART
5 100 18JAN1990 7 2 GOOG.PLAY
5 300 19JAN1990 8 2 GOOGLEPLAY
2 200 22JAN1990 9 8 WALLMART
4 200 20JAN1990 10 2 GOOGLEPLAY
1 100 03JAN1990 11 2 GOOG.PLAY
1 100 17JAN1990 12 1 AMZN
;
RUN;
Result:
First of all, I recommend not using descriptive fields such as names (merchant name in this case) as keys, because descriptive fields can vary a lot: someone can register AMAZON as AMZN or AMAZN, or any other spelling you can imagine, as the merchant name. Use ID fields instead. So, assuming merchant code is a unique ID, I think that is enough to identify the merchant.
Considering the above, with PROC SQL you could do something like this to find duplicates based on the rule you gave (without needing any extra step):
PROC SQL;
/*The following assuming each record are unique
(identified by 'transaction code' in this case),
otherwise you must handle duplicate records properly.*/
SELECT
DISTINCT A.*,
CASE WHEN
B.TRANSC_CODE IS NOT NULL
THEN 1 ELSE 0 END AS DUPLICATED
FROM MY_CREDIT_CARDS AS A
LEFT JOIN MY_CREDIT_CARDS AS B
ON
A.MERCH_CODE = B.MERCH_CODE AND
A.TRANC_AMOUNT = B.TRANC_AMOUNT AND
A.TRANSC_CODE ^= B.TRANSC_CODE AND
A.TRANSC_DATE >= INTNX('day',B.TRANSC_DATE,-5) AND
A.TRANSC_DATE <= INTNX('day',B.TRANSC_DATE,5)
;
/*You could use an ORDER BY clause to sort the
results as you want.*/
QUIT;
The result would be:
Now you have a new column named "DUPLICATED" that shows 1 if the transaction was found to be a duplicate and 0 if not.
Hope it helps.
I currently have data like so:
Product_ID  IND  1_Revenue  2_Revenue  Revenue_Code  Channel
1           S    $50        $75        1             E
1           S    $50        $75        2             SE
2           P    $100       $0         1             E
3           S    $400       $60        1             SE
3           S    $400       $60        2             S
I am trying to pick, for rows where IND = S, the row with the highest revenue if the channel = SE. The Revenue_Code refers to the fields 1_Revenue and 2_Revenue.
So in this case I’d expect the output to have 2nd row and the 4th row.
I’ve tried multiple things and nothing has worked. What is the best solution?
As per our understanding a simple where clause is sufficient to get your result like:
select Product_ID, IND, 1_Revenue, 2_Revenue, Revenue_Code, Channel
from yourtable
where IND = 'S' and Channel = 'SE'
If anything else is required then kindly mention it.
I don't quite understand what is meant by the highest revenue. Based on your description, if you just apply a filter to pick rows where IND = S and channel = SE then won't you get rows 2 and 4 out? (as follows)
data want;
set have;
if IND = 'S' and channel = 'SE';
run;
or if you want to use SQL
PROC SQL;
create table want as
select * from have where IND = 'S' and channel = 'SE';
quit;
I have an originations data set with loan ids. I then have a corresponding dataset with performance data for each of these loans ids, which can be anywhere from 10-40 rows in the performance data set.
The start date of each of the performance loans is not the same either, although some do overlap. What I want to do is take every loan id group in the performance data set, and then create a row of a certain column value across all occurrences in the data set. It doesn't matter if they start on different dates, I just want to align the values as this is the first value for loan id x and y.
For example:
ID Date Val
3 201601 100
3 201602 102
3 201603 103
--> Result:
ID Val1 Val2 Val3
3 100 102 103
I'm having two issues. One is the differing size of performance data for each id. I can't construct a matrix with differing lengths of rows. I'm assuming I'll need to append 0's to the end of each row to meet a predefined width.
My second issue is that I'm not sure how to read through the performance data set to group loans, extract the value column, turn that column into a row for that id, and then insert it into a matrix. I know how I would do this in Python but I need to use SAS. I can construct tables in SAS, but I'm not sure how to append rows, only columns.
If someone could provide some guidance on this it'd be a great help.
For anyone who runs into a similar issue: it ended up being only a few lines of code.
proc transpose data = new_data
out = new_data1;
var trans_state;
by id;
run;
The output will be
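Applied to the small example from the question, the same idea could be sketched as follows. The data set names PERF and WIDE and the second loan (id 4) are hypothetical, added only to show how groups of different lengths behave; PREFIX=Val is used so the new columns are named Val1, Val2, and so on.

```sas
/* Hypothetical performance data: loan 3 has three rows, loan 4 has two */
data perf;
  input id date val;
  datalines;
3 201601 100
3 201602 102
3 201603 103
4 201605 200
4 201606 210
;
run;

/* BY-group processing needs the data sorted by id (and date, to keep order) */
proc sort data=perf;
  by id date;
run;

/* One output row per id; the Val values become columns Val1, Val2, ... */
proc transpose data=perf out=wide(drop=_name_) prefix=Val;
  by id;
  var val;
run;
```

Shorter groups do not need manual padding: PROC TRANSPOSE leaves the extra columns missing, so loan 4 gets a missing Val3 here.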
I am trying to create a table in SAS, which is a subset of a larger table. I am using the following chart as an example. As you can see, columnA has 501 and 502 repeated twice. What I want is to select the row with the max number in ColumnB. The second chart is the result that I would like to have.
Chart 1
A B C
501 1 O
502 1 K
503 1 V
501 2 Y
502 2 U
504 1 I
Chart 2
A B C
503 1 V
501 2 Y
502 2 U
504 1 I
What I am thinking right now is:
PROC SQL;
CREATE TABLE CHART2 AS
SELECT
C.COLUMNA,
C.COLUMNC
FROM CHART1 C;
QUIT;
I am not sure how to say that when there are duplicate rows in columnA, only the rows where columnB has the max number should be selected. The formatting of the table is a little bit odd; I hope you get my point.
One option is to use the having clause in proc sql. Think of it as a filter that gets applied after any groupings have been done.
proc sql noprint;
create table want as
select *
from sashelp.class
group by sex
having age = max(age)
;
quit;
In the above code, we are keeping the rows where the age value on the row is equal to the maximum age (max(age)) for that sex (as we are grouping by sex).
You will notice in the results that for Females we get two rows returned because there were two records that had an age equal to the max female age, but only one row for Males.
Without more details about your data I can't be certain that this will exactly fit your needs but it may.
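As a rough sketch of how the same HAVING pattern might map onto the question's data, assuming the columns are simply named A, B and C as in Chart 1:

```sas
/* Chart 1 from the question, typed in as sample data */
data chart1;
  input A B C $;
  datalines;
501 1 O
502 1 K
503 1 V
501 2 Y
502 2 U
504 1 I
;
run;

proc sql;
  create table chart2 as
  select *
  from chart1
  group by A            /* one group per value of A */
  having B = max(B);    /* keep the row(s) holding the group's largest B */
quit;
```

This keeps the rows for 501 and 502 where B is 2, plus the single rows for 503 and 504. If two rows in a group tied on the maximum B, both would be kept.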
You can try this:
PROC SORT data = Chart1;
by A descending B;
RUN;
DATA Chart2;
set Chart1;
by A;
if first.A then output;
RUN;
The first step sorts your data by ascending order of A and then by descending order of B. The second step keeps only the first row for each value of A.