Related IDs per row in SAS SQL

Given the following table with two columns:
ID ACC
A1 ACC1
A2 ACC1
A3 ACC1
B1 ACC2
B2 ACC2
All rows are related based on the ACC column. So my goal is to have the following table:
ID ID2 ACC
A1 A2 ACC1
A1 A3 ACC1
A2 A1 ACC1
A2 A3 ACC1
A3 A1 ACC1
A3 A2 ACC1
B1 B2 ACC2
B2 B1 ACC2

A self-join on ACC produces every pairing. Alias the second ID column so the two ID columns don't collide, and prefer aliases other than left/right, which clash with join keywords:

proc sql;
create table want as
select l.ID, r.ID as ID2, l.ACC
from have as l, have as r
where l.ACC eq r.ACC
and l.ID ne r.ID;
quit;
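
For comparison, a minimal pandas sketch of the same self-join (hypothetical DataFrame names, not part of the SAS answer):

import pandas as pd

have = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'B1', 'B2'],
                     'ACC': ['ACC1', 'ACC1', 'ACC1', 'ACC2', 'ACC2']})

# Join the table to itself on ACC, then drop rows that pair an ID with itself.
want = have.merge(have, on='ACC', suffixes=('', '2'))
want = want[want['ID'] != want['ID2']].reset_index(drop=True)
print(want[['ID', 'ID2', 'ACC']])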

Splunk: Use output of search A row by row as input for search B, then produce common result table

In Splunk, I have a search producing a result table like this:
_time                A   B   C
2022-10-19 09:00:00  A1  B1  C1
2022-10-19 09:00:00  A2  B2  C2
2022-10-19 09:10:20  A3  B3  C3
Now, for each row, I want to run a second search, using the _time value as an input parameter.
For rows 1 and 2 above (same _time value), the result of the second search would be:
_time                D   E
2022-10-19 09:00:00  D1  E1
For row 3 above, the result of the second search would be:
_time                D   E
2022-10-19 09:10:20  D3  E3
And now I want to output the results in a common table, like this:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  D1  E1
2022-10-19 09:10:20  A3  B3  C3  D3  E3
I experimented with join, append, map, appendcols and subsearch, but I am struggling both with the row-by-row character of the second search and with pulling the data together into one common table.
For example, appendcols simply tacks one result table onto another, even if they are completely unrelated and differently shaped. Like so:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  -   -
2022-10-19 09:10:20  A3  B3  C3  -   -
Can anybody please point me in the right direction?

Find Active customers in past X days

I have two tables.
First table, daily_customer_snapshot: the daily snapshot of each customer, which looks as shown below.
c_id  date        state   location
b1    2020-12-01  Active  OOW
b1    2020-12-02  Active  OOW
b1    2020-12-03  Active  OOW
b1    2020-12-04  Active  OOW
b1    2020-12-05  Active  OOW
b3    2020-12-06  Active  OOW
b3    2020-12-07  Active  OOW
b3    2020-12-08  Active  OOW
b1    2020-12-09  Decay   IW
b2    2020-12-15  Active  OOW
Second table, customer_date_series: contains the date series from the day each user became a customer. For example, user b1 became a customer on '2020-12-01', b3 on '2020-12-06', and b2 on '2020-12-15'. I generated the date series per customer_id so that I can count how many customers we had on any given date.
c_id  date
b1    2020-12-01
b1    2020-12-02
b1    2020-12-03
b1    2020-12-04
b1    2020-12-05
b1    2020-12-06
b1    2020-12-07
b1    2020-12-08
b1    2020-12-09
b1    2020-12-10
b1    2020-12-11
b1    2020-12-12
b1    2020-12-13
b1    2020-12-14
b1    2020-12-15
b1    2020-12-16
b3    2020-12-06
b3    2020-12-07
b3    2020-12-08
b3    2020-12-09
b3    2020-12-10
b3    2020-12-11
b3    2020-12-12
b3    2020-12-13
b3    2020-12-14
b3    2020-12-15
b3    2020-12-16
b2    2020-12-15
b2    2020-12-16
I left joined customer_date_series with daily_customer_snapshot to get an overview of customer behavior on any given date. I got the results shown below.
Query to join:
select
    bds.date,
    bds.c_id,
    b.state,
    b.location
from customer_date_series bds
left join daily_customer_snapshot b
    on bds.c_id = b.c_id and bds.date = b.date
order by 1, 2;
date        c_id  state   location
2020-12-01  b1    Active  OOW
2020-12-02  b1    Active  OOW
2020-12-03  b1    Active  OOW
2020-12-04  b1    Active  OOW
2020-12-05  b1    Active  OOW
2020-12-06  b1
2020-12-06  b3    Active  OOW
2020-12-07  b1
2020-12-07  b3    Active  OOW
2020-12-08  b1
2020-12-08  b3    Active  OOW
2020-12-09  b1    Decay   IW
2020-12-09  b3
2020-12-10  b1
2020-12-10  b3
2020-12-11  b1
2020-12-11  b3
2020-12-12  b1
2020-12-12  b3
2020-12-13  b1
2020-12-13  b3
2020-12-14  b1
2020-12-14  b3
2020-12-15  b1
2020-12-15  b2    Active  OOW
2020-12-15  b3
2020-12-16  b1
2020-12-16  b2
2020-12-16  b3
This is where I am struggling. I want to create a new column called 'status': if the customer's data in daily_customer_snapshot was updated within the past 5 days of the current date, the status should be 'Active', else 'Inactive'.
If I follow you correctly, you can use boolean window aggregation:
select
    bds.date,
    bds.c_id,
    b.state,
    b.location,
    bool_or(b.state = 'Active') over(
        partition by bds.c_id
        order by bds.date
        range between interval '5 days' preceding and current row
    ) as is_active
from customer_date_series bds
left join daily_customer_snapshot b on bds.c_id = b.c_id and bds.date = b.date
order by 1, 2;
This sets a boolean flag on rows where the same customer was active at least once within the last 5 days (or on the current day).
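For cross-checking outside the database, here is a rough pandas analogue of the same 5-day window (a sketch under assumptions: df is the joined result above, with columns c_id, date, state, where state may be missing):

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['c_id', 'date']).reset_index(drop=True)
df['active_flag'] = (df['state'] == 'Active').astype(float)

# rolling('6D') over a date index covers the current day plus the five
# preceding days, like RANGE BETWEEN '5 days' PRECEDING AND CURRENT ROW.
rolled = (df.set_index('date')
            .groupby('c_id')['active_flag']
            .rolling('6D')
            .max())
df['is_active'] = rolled.values > 0  # positions align because df is sorted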
If you do want to see 'Active'/'InActive' labels instead (which I find less useful than a boolean), you can use:
min(b.state) over(
    partition by bds.c_id
    order by bds.date
    range between interval '5 days' preceding and current row
) as status
... which works because, string-wise, 'Active' < 'InActive' ('A' sorts before 'I').
If you want to use both tables, then a lateral join does what you want:
select bds.date, bds.c_id, b.state, b.location
FROM customer_date_series bds LEFT JOIN LATERAL
    (SELECT b.*
     FROM daily_customer_snapshot b
     WHERE bds.c_id = b.c_id and b.date <= bds.date
     ORDER BY b.date DESC
     LIMIT 1
    ) b
    ON 1=1
ORDER BY 1, 2;
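
For what it's worth, the same "latest snapshot at or before each date" lookup can be sketched in pandas with merge_asof (hypothetical DataFrame names, dates assumed parsed as datetimes):

import pandas as pd

# series = customer_date_series (c_id, date); snap = daily_customer_snapshot.
# merge_asof requires both frames sorted by the 'on' key.
series = series.sort_values('date')
snap = snap.sort_values('date')

# For each series row, take the most recent snapshot at or before its date,
# matched per customer -- the same idea as the LIMIT 1 lateral join.
out = pd.merge_asof(series, snap, on='date', by='c_id', direction='backward')
out = out.sort_values(['date', 'c_id'])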

Create a new table from 2 other tables

I want to merge table a with 2 other tables, b and c, where table a contains the columns (Parent, Style, Ending_Date, WeekNum, Net_Requirment), and calculate how much is required to make product A on a certain date. The table should work like a BOM (Bill of Materials). Can this be done with pandas?
Table b represents the demand for product A per date:
Style Date WeekNum Quantity
A 24/11/2019 0 600
A 01/12/2019 1 500
Table c represents the details and quantities used to make product A:
Parent Child Q
A A1 2
A1 A11 3
A1 A12 2
so table a should be filled like this:
Parent Child Date WeekNum Net_Quantity
A A1 24/11/2019 0 1200
A1 A11 24/11/2019 0 3600
A1 A12 24/11/2019 0 2400
A A1 01/12/2019 1 1000
A1 A11 01/12/2019 1 3000
A1 A12 01/12/2019 1 2000
Welcome! In order to merge these tables properly, you need a common key to merge on. You can add such a key to each table like this:
import pandas as pd

data2 = {'Parent': ['A', 'A1', 'A1'], 'Child': ['A1', 'A11', 'A12'],
         'Q': [2, 3, 2], 'Style': ['A', 'A', 'A']}
df2 = pd.DataFrame(data2)
After this you can do a left join on the first table, which can then have multiple rows for the same date. (Notice that with a left join, the left table gets as many duplicate rows as needed to satisfy the matching keys in the right table.) So essentially this:
data = {'Style': ['A', 'A'], 'Date': ['24/11/2019', '01/12/2019'],
        'WeekNum': [0, 1], 'Quantity': [600, 500]}
df = pd.DataFrame(data)
mergeDf = df.merge(df2, how='left', on='Style')
mergeDf
Then to calculate:
mergeDf['Net_Quantity'] = mergeDf.Quantity * mergeDf.Q
mergeDf.drop(['Q'], axis=1, inplace=True)
result:
Style Date WeekNum Quantity Parent Child Net_Quantity
0 A 24/11/2019 0 600 A A1 1200
1 A 24/11/2019 0 600 A1 A11 1800
2 A 24/11/2019 0 600 A1 A12 1200
3 A 01/12/2019 1 500 A A1 1000
4 A 01/12/2019 1 500 A1 A11 1500
5 A 01/12/2019 1 500 A1 A12 1000
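
Note that the expected output in the question multiplies quantities down the BOM tree (A1 → A11 is 600 × 2 × 3 = 3600), while the one-level merge above gives 600 × 3 = 1800. If that multi-level explosion is what's needed, here is a minimal sketch of my own (assuming the BOM is acyclic):

import pandas as pd

demand = pd.DataFrame({'Style': ['A', 'A'],
                       'Date': ['24/11/2019', '01/12/2019'],
                       'WeekNum': [0, 1],
                       'Quantity': [600, 500]})
bom = pd.DataFrame({'Parent': ['A', 'A1', 'A1'],
                    'Child': ['A1', 'A11', 'A12'],
                    'Q': [2, 3, 2]})

# Walk the BOM level by level, multiplying each child's Q by the
# parent's already-computed net quantity.
rows = []
for _, d in demand.iterrows():
    net = {d['Style']: d['Quantity']}   # net quantity per part so far
    frontier = [d['Style']]
    while frontier:
        children = bom[bom['Parent'].isin(frontier)]
        for _, r in children.iterrows():
            net[r['Child']] = net[r['Parent']] * r['Q']
            rows.append({'Parent': r['Parent'], 'Child': r['Child'],
                         'Date': d['Date'], 'WeekNum': d['WeekNum'],
                         'Net_Quantity': net[r['Child']]})
        frontier = children['Child'].tolist()

result = pd.DataFrame(rows)
print(result)   # A1 -> A11 comes out as 3600, matching the question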

DISTINCT columns from 2 different tables

I have 2 tables with similar information. Let's call them DAILYROWDATA and SUMMARYDATA.
Table DAILYROWDATA
NIP NAME DEPARTMENT
A1 ARIA BB
A2 CHLOE BB
A3 RYAN BB
A4 STEVE BB
Table SUMMARYDATA
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
I need to combine these two tables, showing all the data in DAILYROWDATA, and where STATUSIN and STATUSOUT are NULL write 'NA' instead. This is the output that I mean:
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A3 RYAN BB NA NA
A4 STEVE BB NA NA
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
A4 STEVE BB NA NA
I need to add one more condition: STATUSIN should be treated as NULL only when there is no NIP, NAME, DEPARTMENT, STATUSIN, STATUSOUT row for a given date, so the same NIP can appear multiple times (once per date).
You want a left join to bring the two tables together. The trickier part is that you need strings in order to represent the 'NA':
select drd.*,
       coalesce(cast(sd.statusin as varchar(255)), 'NA') as statusin,
       coalesce(cast(sd.statusout as varchar(255)), 'NA') as statusout
from DAILYROWDATA drd left join
     SUMMARYDATA sd
     on drd.nip = sd.nip;
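
A hypothetical pandas version of the same left join, assuming the two tables are DataFrames whose timestamp columns hold plain strings:

import pandas as pd

# Left-join the summary onto the daily data, then replace missing
# timestamps with the literal string 'NA'.
out = dailyrowdata.merge(summarydata[['NIP', 'STATUSIN', 'STATUSOUT']],
                         on='NIP', how='left')
out[['STATUSIN', 'STATUSOUT']] = out[['STATUSIN', 'STATUSOUT']].fillna('NA')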

Pandas pivot table selecting rows with maximum values

I have a pandas DataFrame df:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.01 27.71
82 A1 case1.06 29.54
1558 A3 case1.06 29.54
82 A1 case1.11 12.09
1558 A3 case1.11 32.09
82 A1 case1.16 33.35
1558 A3 case1.16 33.35
For each (Id, Name) pair I need to select the row with the maximum Value, i.e. I am seeking the following output:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
I tried the following:
import numpy as np
import pandas as pd

pd.pivot_table(df, index=['Id', 'Name'], columns=['CaseId'],
               values=['Value'], aggfunc=[np.max])['amax']
But all it does is give the maximum value per CaseId as a column, not the result I am seeking above.
sort_values + drop_duplicates
df.sort_values('Value').drop_duplicates(['Id'], keep='last')
Out[93]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
Since we posted at the same time, adding one more method:
df.sort_values('Value').groupby('Id').tail(1)
Out[98]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
This should work:
df = df.sort_values('Value', ascending=False).drop_duplicates('Id').sort_index()
Output:
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35
With nlargest and groupby
pd.concat(d.nlargest(1, ['Value']) for _, d in df.groupby('Name'))
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35
Another idea is to create a joint column, take its max, then split it back to two columns:
df['ValueCase'] = list(zip(df['Value'], df['CaseId']))
p = pd.pivot_table(df, index=['Id', 'Name'], values=['ValueCase'], aggfunc='max')
p['Value'], p['CaseId'] = list(zip(*p['ValueCase']))
del p['ValueCase']
Results in:
CaseId Value
Id Name
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
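Another compact option, assuming df keeps its default unique index: idxmax returns, per (Id, Name) group, the index label of the row with the largest Value, and loc then picks those rows.

# Select the whole row holding each group's maximum Value.
out = df.loc[df.groupby(['Id', 'Name'])['Value'].idxmax()]
print(out)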