SAS - Update table variable with multiple where criteria

Please excuse my lack of knowledge; I'm very new to SAS.
I have two tables, as shown below:
T1

ID  Ill_No
1   1
1   1
1   2
1   2
1   3
1   3
2   1
2   1
2   2
2   2
2   3
2   3

T2

ID  Ill_No
1   1
2   3
I want to update the original table with a new variable (MATCH) that is set where both ID and Ill_No match a row in the second table. Example below:
T1

ID  Ill_No  MATCH
1   1       Y
1   1       Y
1   2
1   2
1   3
1   3
2   1
2   1
2   2
2   2
2   3       Y
2   3       Y
What is the most efficient way to do this?

Perhaps use a simple merge statement:
data want;
    merge t1(in=one) t2(in=two);
    by id Ill_No;    /* both data sets must be sorted by ID and Ill_No first */
    if one and two then match = 'Y';
run;
ID Ill_No match
1 1 Y
1 1 Y
1 2
1 2
1 3
1 3
2 1
2 1
2 2
2 2
2 3 Y
2 3 Y
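
For comparison with the pandas questions below, the same flag can be produced with an indicator merge. This is a minimal sketch, with hypothetical DataFrames t1 and t2 standing in for the SAS tables:

import pandas as pd

t1 = pd.DataFrame({'ID':     [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'Ill_No': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3]})
t2 = pd.DataFrame({'ID': [1, 2], 'Ill_No': [1, 3]})

# Left join with an indicator column, then turn 'both' into the MATCH flag.
out = t1.merge(t2, on=['ID', 'Ill_No'], how='left', indicator=True)
out['MATCH'] = (out.pop('_merge') == 'both').map({True: 'Y', False: ''})
print(out)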

Related

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with index (see example here), is there a way I can assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
import numpy as np
import pandas as pd

N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
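
An alternative sketch that skips numpy: the keys argument of concat labels each repeated copy, and the label can then be pulled out of the index (variable names here are mine, not from the answer above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})
N = 5

# keys= labels each copy; the label becomes level 0 of a MultiIndex.
new_df = pd.concat([df] * N, keys=list(range(1, N + 1)))
new_df['c'] = new_df.index.get_level_values(0)  # copy label -> column c
new_df = new_df.droplevel(0)                    # restore the original index
print(new_df)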

Sum distinct group values only

I would like to sum the distinct values per group. Pardon the wordy post...
Context. Suppose I have a table of the form:
ID Foo Value
A 1 2
B 0 2
C 0 3
A 1 2
A 1 2
C 0 3
B 0 2
Each ID/Foo combo has a distinct value. I'd like to join this table onto another CTE that has a cumulative field; e.g., suppose that after joining using ROWS UNBOUNDED PRECEDING I have a new field called Cumulative. Same data, just duplicated 3 times with the value of Cumulative:
ID Foo Value Cumulative
A 1 2 1
B 0 2 1
C 0 3 1
A 1 2 1
A 1 2 1
C 0 3 1
B 0 2 1
A 1 2 2
B 0 2 2
C 0 3 2
A 1 2 2
A 1 2 2
C 0 3 2
B 0 2 2
A 1 2 3
B 0 2 3
C 0 3 3
A 1 2 3
A 1 2 3
C 0 3 3
B 0 2 3
I want to add a new field 'segment_value' that, for each Foo, gets the sum of the distinct ID values. E.g. the distinct ID/Foo combinations are:
ID Foo Value
A 1 2
B 0 2
C 0 3
I would therefore like a new field, 'segment_value', that returns 2 for Foo=1 and 5 for Foo=0. Desired result:
ID Foo Value Cumulative segment_value
A 1 2 1 2
B 0 2 1 5
C 0 3 1 5
A 1 2 1 2
A 1 2 1 2
C 0 3 1 5
B 0 2 1 5
A 1 2 2 2
B 0 2 2 5
C 0 3 2 5
A 1 2 2 2
A 1 2 2 2
C 0 3 2 5
B 0 2 2 5
A 1 2 3 2
B 0 2 3 5
C 0 3 3 5
A 1 2 3 2
A 1 2 3 2
C 0 3 3 5
B 0 2 3 5
How can I achieve this?
I don't think you explained your problem very well and I might have misunderstood something, but can't you extract the segment_value using a query such as this one:
select
    foo,
    sum(value) as segment_value
from (
    select distinct foo, value from your_table
) tab
group by foo
This would return the following result:
foo segment_value
1 2
0 5
You could then join this to the rest of your query and use it as per your needs.
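
The same two-step idea (take the distinct combinations first, sum per group, then attach the totals back to every row) can also be sketched in pandas, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'ID':    list('ABCAACB'),
                   'Foo':   [1, 0, 0, 1, 1, 0, 0],
                   'Value': [2, 2, 3, 2, 2, 3, 2]})

# Distinct ID/Foo/Value combinations, then the sum per Foo.
segment = (df.drop_duplicates(['ID', 'Foo', 'Value'])
             .groupby('Foo')['Value'].sum())

# Attach the per-Foo total back onto every row.
df['segment_value'] = df['Foo'].map(segment)
print(df)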

How to remove one specific duplicate-named column from the columns of a dataframe?

I have a sample dataframe df with columns:
a b c a a b b c c
0 2 2 1 2 2 1 1 2 2
1 2 2 2 2 2 1 2 1 2
. . .
. . .
I want to remove only the duplicate columns named 'a' and keep the others as they are.
The expected output is:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
Here is a general solution to drop any duplicates of a column, no matter where these columns are in the dataframe and what the content of these columns is.
First we get all column indexes for the given column name and drop the first occurrence. Then we "subtract" these indexes from all indexes and return the remaining columns:
to_drop = 'a'
dup = [i for i, v in enumerate(df.columns) if v == to_drop][1:]
# sorted() keeps the remaining columns in their original order
df = df.iloc[:, sorted(set(range(len(df.columns))) - set(dup))]
Result:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
df = df.T.reset_index().drop_duplicates().set_index('index').T
df.columns.name = None
Explanation:
Since the duplicate a columns contain identical values, we can simply transpose with reset_index:
df.T.reset_index()
index 0 1
0 a 2 2
1 b 2 2
2 c 1 2
3 b 1 1
4 b 1 2
5 c 2 1
6 c 2 2
Apply drop_duplicates on the above df and only the dupes will get removed. This serves the purpose in those instances too where more than one column has duplicate values.
Output
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
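
A shorter variant of the first approach is a boolean mask over the column names, so that only the later copies of 'a' are dropped and the column order is preserved; the frame construction below is just the sample data from the question:

import pandas as pd

df = pd.DataFrame([[2, 2, 1, 2, 2, 1, 1, 2, 2],
                   [2, 2, 2, 2, 2, 1, 2, 1, 2]],
                  columns=list('abcaabbcc'))

cols = pd.Series(df.columns)
drop = (cols == 'a') & cols.duplicated()   # True only for the 2nd, 3rd, ... 'a'
df = df.loc[:, ~drop.values]
print(df)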

Which rows are duplicates of each other

I have a dataframe with a lot of columns. Some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False for a mask of all dupes, then filter by boolean indexing, sort by DataFrame.sort_values, and join together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print(df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
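
For reference, a self-contained run on the sample data; the group column at the end is an addition (not part of the answer above) that labels which rows duplicate each other:

import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
                   'A':  [1, 2, 1, 1, 2, 5],
                   'B':  [2, 3, 4, 2, 3, 6],
                   'C':  [0, 4, 8, 3, 5, 2]})

L = ['A', 'B']
m = df.duplicated(L, keep=False)   # True for every member of a duplicate group
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)

# Label rows that duplicate each other with a shared group number.
df['group'] = df.groupby(L, sort=False).ngroup()
print(df)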

Multiple rows in self join

I have a table in this format:
Id QId ResourceId ModuleId SubProjId Comments
1 1 1 1 2 ffdg 1 1
2 2 1 1 2 dfgfdg 1 1
3 3 1 1 2 hgjhg 1 1
4 1 2 1 2 tryty 1 0
5 5 1 1 2 sdf 1 1
6 5 2 1 2 ghgfh 1 0
7 7 2 1 2 tytry 1 0
8 3 2 1 2 rytr 1 0
and I want the result in this way:
qid ResourceId Comments ResourceId Comments
1 1 ffdg 2 tryty
3 1 hgjhg 2 rytr
I tried:
select distinct A.QId, A.ResourceId, A.Comments, B.ResourceId, B.Comments
from dbo.#temp A
inner join #temp B
    on A.QId = B.QId
    and A.[ModuleId] = B.[ModuleId]
    and A.[SubProjId] = B.[SubProjId]
but did not have any luck. Please help.
You want to convert vertical data to horizontal, so you need to create a pivot table. You can find more details here:
How to transform vertical data into horizontal data with SQL?
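
For what it's worth, the reshape can also be sketched in pandas on the relevant columns of the sample data. Note that in the sample, QId 5 also has comments from both resources, so it appears in the pivot alongside QIds 1 and 3:

import pandas as pd

df = pd.DataFrame({'QId':        [1, 2, 3, 1, 5, 5, 7, 3],
                   'ResourceId': [1, 1, 1, 2, 1, 2, 2, 2],
                   'Comments':   ['ffdg', 'dfgfdg', 'hgjhg', 'tryty',
                                  'sdf', 'ghgfh', 'tytry', 'rytr']})

# One Comments column per ResourceId; dropna keeps only QIds with both resources.
wide = df.pivot(index='QId', columns='ResourceId', values='Comments').dropna()
print(wide)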