Sum distinct group values only - sql

I would like to sum the distinct values per group. Pardon the wordy post...
Context. Suppose I have a table of the form:
ID Foo Value
A 1 2
B 0 2
C 0 3
A 1 2
A 1 2
C 0 3
B 0 2
Each ID/Foo combo has a distinct value. I'd like to join this table onto another CTE that has a cumulative field; e.g. suppose that after joining using ROWS UNBOUNDED PRECEDING I have a new field called Cumulative. Same data, just duplicated 3 times with the Cumulative value:
ID Foo Value Cumulative
A 1 2 1
B 0 2 1
C 0 3 1
A 1 2 1
A 1 2 1
C 0 3 1
B 0 2 1
A 1 2 2
B 0 2 2
C 0 3 2
A 1 2 2
A 1 2 2
C 0 3 2
B 0 2 2
A 1 2 3
B 0 2 3
C 0 3 3
A 1 2 3
A 1 2 3
C 0 3 3
B 0 2 3
I want to add a new field, 'segment_value', that for each Foo gives the sum of the values over the distinct IDs. E.g. the distinct ID/Foo combinations are:
ID Foo Value
A 1 2
B 0 2
C 0 3
I would therefore like a new field, 'segment_value', that returns 2 for Foo=1 and 5 for Foo=0. Desired result:
ID Foo Value Cumulative segment_value
A 1 2 1 2
B 0 2 1 5
C 0 3 1 5
A 1 2 1 2
A 1 2 1 2
C 0 3 1 5
B 0 2 1 5
A 1 2 2 2
B 0 2 2 5
C 0 3 2 5
A 1 2 2 2
A 1 2 2 2
C 0 3 2 5
B 0 2 2 5
A 1 2 3 2
B 0 2 3 5
C 0 3 3 5
A 1 2 3 2
A 1 2 3 2
C 0 3 3 5
B 0 2 3 5
How can I achieve this?

I don't think you explained your problem very well and I might have misunderstood something, but can't you extract the segment_value using a query such as this one:
select
    foo,
    sum(value) as segment_value
from (
    select distinct foo, value from your_table
) tab
group by foo
this would return the following result:
foo segment_value
1 2
0 5
then you could join this to the rest of your query and use it as per your needs.
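For example, here is a minimal sketch of that join, assuming the cumulative result from the question is available as a CTE or subquery called combined and the base table is called your_table (both placeholder names):
with segment_totals as (
    select foo, sum(value) as segment_value
    from (select distinct foo, value from your_table) tab
    group by foo
)
-- combined is assumed to already hold ID, Foo, Value and Cumulative
select c.id, c.foo, c.value, c.cumulative, s.segment_value
from combined c
join segment_totals s
    on s.foo = c.foo;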

Related

SAS - Update table variable with multiple where criteria

Please excuse my lack of knowledge; I'm very new to SAS.
I have two tables as exampled below:
T1
ID  Ill_No
1   1
1   1
1   2
1   2
1   3
1   3
2   1
2   1
2   2
2   2
2   3
2   3
T2
ID  Ill_No
1   1
2   3
I want to update the original table with a new variable (MATCH) where both ID and Ill_No match the second table. Example below:
T1
ID  Ill_No  MATCH
1   1       Y
1   1       Y
1   2
1   2
1   3
1   3
2   1
2   1
2   2
2   2
2   3       Y
2   3       Y
What is the most efficient way to do this?
Perhaps use a simple MERGE statement (note that both datasets must be sorted by ID and Ill_No first):
data want;
    merge t1(in=one) t2(in=two);
    by id ill_no;
    if one and two then match = 'Y';
run;
ID Ill_No match
1 1 Y
1 1 Y
1 2
1 2
1 3
1 3
2 1
2 1
2 2
2 2
2 3 Y
2 3 Y
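If the datasets are not already sorted by the BY variables, a PROC SQL left join avoids the sort; this is just a sketch assuming the dataset and column names shown above:
proc sql;
    create table want as
    select a.id, a.ill_no,
           case when b.id is not null then 'Y' else '' end as match
    from t1 as a
    left join t2 as b
        on a.id = b.id and a.ill_no = b.ill_no;
quit;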

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with an index (see example here), is there a way I can assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5

which rows are duplicates of each other

I have a data frame with a lot of columns. Some of the rows are duplicates (on a certain subset of the columns).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask of all duplicates, then filter by boolean indexing, sort with DataFrame.sort_values, and join the pieces back together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2

Replace consecutive identical row occurrences with a single row based on same Id - SQL Server

I am trying to reduce consecutive identical rows within the same Id to a single row. I tried removing duplicates, but that also collapses non-consecutive identical occurrences within the same Id into one row. Also, the order of the messages is important. The input and the desired output are shown below. Is there any way to achieve this desired result?
Thanks
Input data
Id Result Message
----------------------
1 0 a
1 0 p
1 0 p
1 0 p
1 0 d
1 0 p
1 0 p
1 0 f
1 0 p
2 1 a
2 1 a
2 1 a
2 1 f
2 1 h
2 1 b
2 1 b
3 0 d
3 0 d
3 0 d
3 0 c
3 0 c
Desired output
Id Result Message
----------------------
1 0 a
1 0 p
1 0 d
1 0 p
1 0 f
1 0 p
2 1 a
2 1 f
2 1 h
2 1 b
3 0 d
3 0 c
Taking #GordonLinoff's comment into consideration, if you were to include a column which specified the order in which you wanted the rows looked at, for example,
Id Result Message Order
1 0 a 1
1 0 p 2
1 0 p 2
1 0 p 2
1 0 d 3
1 0 p 4
1 0 p 4
1 0 f 5
1 0 p 6
2 1 a 7
2 1 a 7
2 1 a 7
2 1 f 8
2 1 h 9
2 1 b 10
2 1 b 10
3 0 d 11
3 0 d 11
3 0 d 11
3 0 c 12
3 0 c 12
Then you could easily obtain the desired result with the following query:
SELECT DISTINCT Id, Result, Message, [Order]
FROM Table_A
OUTPUT:
Id Result Message Order
1 0 a 1
1 0 p 2
1 0 d 3
1 0 p 4
1 0 f 5
1 0 p 6
2 1 a 7
2 1 f 8
2 1 h 9
2 1 b 10
3 0 d 11
3 0 c 12
I guess you're looking for GROUP BY?
SELECT col1, col2, col3 FROM Table GROUP BY col1, col2, col3;
The columns appear in the result in the order you list them; note, though, that GROUP BY does not guarantee the order of the rows, so add an ORDER BY if the row order matters.
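For example, a minimal sketch using the Order column from the first answer (Order needs square brackets because ORDER is a reserved word in SQL Server):
SELECT Id, Result, Message, [Order]
FROM Table_A
GROUP BY Id, Result, Message, [Order]
ORDER BY [Order];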

compare two column of two dataframe pandas

I have 2 data frames like:
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get a result like:
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
In other words, I have two different data frames that have one column (a) in common. Can I compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe that has the selected rows from df_fin with the columns from df_out added?
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5