DISTINCT columns from 2 different tables

I have 2 tables with similar information. Let's call them DAILYROWDATA and SUMMARYDATA.
Table DAILYROWDATA
NIP NAME DEPARTMENT
A1 ARIA BB
A2 CHLOE BB
A3 RYAN BB
A4 STEVE BB
Table SUMMARYDATA
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
I need to combine these two tables so that every row of DAILYROWDATA is shown, and write 'NA' wherever STATUSIN and STATUSOUT are NULL. This is the output I am after:
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A3 RYAN BB NA NA
A4 STEVE BB NA NA
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
A4 STEVE BB NA NA
One more condition: STATUSIN and STATUSOUT should be 'NA' only when there is no row for that NIP, NAME and DEPARTMENT on a given date, so the same person can appear once for each date.

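You want a left join to bring the two tables together. The trickier part is that the timestamps have to be cast to strings so that the same column can also hold 'NA':
select drd.*,
       coalesce(cast(sd.statusin as varchar(255)), 'NA') as statusin,
       coalesce(cast(sd.statusout as varchar(255)), 'NA') as statusout
from DAILYROWDATA drd
left join SUMMARYDATA sd
       on drd.nip = sd.nip;
Note that joining on nip alone returns a single 'NA' row for an employee who never appears in SUMMARYDATA; it does not add an 'NA' row for each date on which an employee is missing.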
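The expected output in the question has one row per employee per date. A minimal sketch of one way to get that, assuming SQLite (run here through Python's sqlite3 module) and assuming the timestamps are stored as ISO strings so that date() can parse them: cross join the employees with the distinct dates found in SUMMARYDATA, then left join the summary rows on both nip and date. The name PER_DAY_QUERY and the alias days are illustrative only.

```python
import sqlite3

# Sample rows copied from the question. Timestamps are rewritten in ISO
# format (an assumption) so that SQLite's date() function can parse them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DAILYROWDATA (NIP TEXT, NAME TEXT, DEPARTMENT TEXT);
    CREATE TABLE SUMMARYDATA (NIP TEXT, NAME TEXT, DEPARTMENT TEXT,
                              STATUSIN TEXT, STATUSOUT TEXT);
    INSERT INTO DAILYROWDATA VALUES
        ('A1', 'ARIA', 'BB'), ('A2', 'CHLOE', 'BB'),
        ('A3', 'RYAN', 'BB'), ('A4', 'STEVE', 'BB');
    INSERT INTO SUMMARYDATA VALUES
        ('A1', 'ARIA',  'BB', '2020-01-21 08:06:23', '2020-01-21 08:07:53'),
        ('A2', 'CHLOE', 'BB', '2020-01-21 08:16:07', '2020-01-21 09:51:21'),
        ('A1', 'ARIA',  'BB', '2020-01-22 09:06:23', '2020-01-22 10:07:53'),
        ('A2', 'CHLOE', 'BB', '2020-01-22 09:16:07', '2020-01-22 10:51:21'),
        ('A3', 'RYAN',  'BB', '2020-01-22 08:15:03', '2020-01-22 09:12:03');
""")

# Every employee is paired with every date that occurs in SUMMARYDATA,
# then matched against the summary rows for that employee and date.
PER_DAY_QUERY = """
    SELECT drd.NIP, drd.NAME, drd.DEPARTMENT,
           COALESCE(sd.STATUSIN, 'NA')  AS STATUSIN,
           COALESCE(sd.STATUSOUT, 'NA') AS STATUSOUT
    FROM DAILYROWDATA drd
    CROSS JOIN (SELECT DISTINCT date(STATUSIN) AS d FROM SUMMARYDATA) days
    LEFT JOIN SUMMARYDATA sd
           ON sd.NIP = drd.NIP AND date(sd.STATUSIN) = days.d
    ORDER BY days.d, drd.NIP
"""

rows = conn.execute(PER_DAY_QUERY).fetchall()
for row in rows:
    print(row)
```

On other databases the date extraction and the cast to a string type would need that engine's own functions.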

Related

Splunk: Use output of search A row by row as input for search B, then produce common result table

In Splunk, I have a search producing a result table like this:
_time                A   B   C
2022-10-19 09:00:00  A1  B1  C1
2022-10-19 09:00:00  A2  B2  C2
2022-10-19 09:10:20  A3  B3  C3
Now, for each row, I want to run a second search, using the _time value as input parameter.
For rows 1 and 2 above (same _time value), the result of the second search would be:
_time                D   E
2022-10-19 09:00:00  D1  E1
For row 3 above, the result of the second search would be:
_time                D   E
2022-10-19 09:10:20  D3  E3
And now I want to output the results in a common table, like this:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  D1  E1
2022-10-19 09:10:20  A3  B3  C3  D3  E3
I experimented with join, append, map, appendcols and subsearch, but I am struggling both with the row-by-row character of the second search and with pulling the data together into one common table.
For example, appendcols simply tacks one result table onto another, even if they are completely unrelated and differently shaped. Like so:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  -   -
2022-10-19 09:10:20  A3  B3  C3  -   -
Can anybody please point me in the right direction?
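To pin down the semantics being asked for, here is a small Python sketch (not SPL, and not a Splunk solution) of the desired combination: each row of the first result is enriched with the second search's row that shares its _time value. The variable names first, second, by_time and combined are illustrative, and the data is the sample from the question.

```python
# Rows of the first search, as in the question.
first = [
    {"_time": "2022-10-19 09:00:00", "A": "A1", "B": "B1", "C": "C1"},
    {"_time": "2022-10-19 09:00:00", "A": "A2", "B": "B2", "C": "C2"},
    {"_time": "2022-10-19 09:10:20", "A": "A3", "B": "B3", "C": "C3"},
]

# One result row of the second search per distinct _time value.
second = [
    {"_time": "2022-10-19 09:00:00", "D": "D1", "E": "E1"},
    {"_time": "2022-10-19 09:10:20", "D": "D3", "E": "E3"},
]

# Index the second result by _time, then look each first-search row up in it.
by_time = {row["_time"]: row for row in second}
combined = [
    {**row, **by_time.get(row["_time"], {"D": None, "E": None})}
    for row in first
]

for row in combined:
    print(row)
```

This is a keyed lookup on _time rather than the positional pasting that appendcols performs, which is why rows 1 and 2 both receive D1 and E1.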

Related ids per row in SAS

Given the following table with two columns:
ID ACC
A1 ACC1
A2 ACC1
A3 ACC1
B1 ACC2
B2 ACC2
All rows are related based on the ACC column. So my goal is to have the following table:
ID ID2 ACC
A1 A2 ACC1
A1 A3 ACC1
A2 A1 ACC1
A2 A3 ACC1
A3 A1 ACC1
A3 A2 ACC1
B1 B2 ACC2
B2 B1 ACC2

proc sql;
  create table want as
  select a.ID, b.ID as ID2, a.ACC
  from have as a, have as b
  where a.ACC eq b.ACC
    and a.ID ne b.ID;
quit;
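For readers who want to check the logic outside SAS, the same self-join can be sketched in Python. This assumes IDs are unique within each ACC group; the names have, ids_by_acc and want simply mirror the tables above.

```python
from collections import defaultdict
from itertools import permutations

# Input table from the question as (ID, ACC) pairs.
have = [
    ("A1", "ACC1"), ("A2", "ACC1"), ("A3", "ACC1"),
    ("B1", "ACC2"), ("B2", "ACC2"),
]

# Group the IDs by account, keeping their original order.
ids_by_acc = defaultdict(list)
for id_, acc in have:
    ids_by_acc[acc].append(id_)

# Every ordered pair of distinct IDs that share an account.
want = [
    (id1, id2, acc)
    for acc, ids in ids_by_acc.items()
    for id1, id2 in permutations(ids, 2)
]

for row in want:
    print(row)
```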

Find Active customers in past X days

I am stuck on the following problem and would appreciate some help.
Thanks a lot in advance :)
I have 2 tables.
1st table: daily_customer_snapshot: the daily snapshot of the customer which looks something as shown below.
c_id  date        state   location
b1    2020-12-01  Active  OOW
b1    2020-12-02  Active  OOW
b1    2020-12-03  Active  OOW
b1    2020-12-04  Active  OOW
b1    2020-12-05  Active  OOW
b3    2020-12-06  Active  OOW
b3    2020-12-07  Active  OOW
b3    2020-12-08  Active  OOW
b1    2020-12-09  Decay   IW
b2    2020-12-15  Active  OOW
2nd table: customer_date_series: contains the date series starting from the day each user became our customer.
Ex: refer image 2: user b1 became our customer on '2020-12-01', user b3 on '2020-12-06' and user b2 on '2020-12-15'.
I generated the date series per customer id so that I can count how many customers we had on any given date.
c_id  date
b1    2020-12-01
b1    2020-12-02
b1    2020-12-03
b1    2020-12-04
b1    2020-12-05
b1    2020-12-06
b1    2020-12-07
b1    2020-12-08
b1    2020-12-09
b1    2020-12-10
b1    2020-12-11
b1    2020-12-12
b1    2020-12-13
b1    2020-12-14
b1    2020-12-15
b1    2020-12-16
b3    2020-12-06
b3    2020-12-07
b3    2020-12-08
b3    2020-12-09
b3    2020-12-10
b3    2020-12-11
b3    2020-12-12
b3    2020-12-13
b3    2020-12-14
b3    2020-12-15
b3    2020-12-16
b2    2020-12-15
b2    2020-12-16
I left joined customer_date_series with daily_customer_snapshot to get an overview of the customer behavior on any given date.
I got the results as displayed in image 3.
Query to Join:
select
    bds.date,
    bds.c_id,
    b.state,
    b.location
from customer_date_series bds
left join daily_customer_snapshot b
    on bds.c_id = b.c_id and bds.date = b.date
order by 1, 2;
date        c_id  state   location
2020-12-01  b1    Active  OOW
2020-12-02  b1    Active  OOW
2020-12-03  b1    Active  OOW
2020-12-04  b1    Active  OOW
2020-12-05  b1    Active  OOW
2020-12-06  b1
2020-12-06  b3    Active  OOW
2020-12-07  b1
2020-12-07  b3    Active  OOW
2020-12-08  b1
2020-12-08  b3    Active  OOW
2020-12-09  b1    Decay   IW
2020-12-09  b3
2020-12-10  b1
2020-12-10  b3
2020-12-11  b1
2020-12-11  b3
2020-12-12  b1
2020-12-12  b3
2020-12-13  b1
2020-12-13  b3
2020-12-14  b1
2020-12-14  b3
2020-12-15  b1
2020-12-15  b2    Active  OOW
2020-12-15  b3
2020-12-16  b1
2020-12-16  b2
2020-12-16  b3
This is where I am struggling.
I want to create a new column called 'status'. If the customer's data in daily_customer_snapshot was updated within the past 5 days of the current date,
I want to set the status to 'Active', else 'Inactive'.
Ex:

If I follow you correctly, you can use boolean window aggregation:
select
    bds.date,
    bds.c_id,
    b.state,
    b.location,
    bool_or(b.state = 'Active') over(
        partition by bds.c_id
        order by bds.date
        range between interval '5 days' preceding and current row
    ) as is_active
from customer_date_series bds
left join daily_customer_snapshot b on bds.c_id = b.c_id and bds.date = b.date
order by 1, 2;
This sets a boolean flag on rows where the same customer was active at least once within the previous 5 days or on the current day.
If you would rather see a state string than a boolean (which I find less useful), you can use this expression instead:
min(b.state) over(
    partition by bds.c_id
    order by bds.date
    range between interval '5 days' preceding and current row
) as status
This works because, string-wise, 'Active' < 'InActive'. With the sample data the other state is 'Decay', which also sorts after 'Active'; the result is NULL when the window holds no snapshot rows at all.
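The window logic can be checked by hand with a small Python sketch. It is a simplification, not a translation of the SQL: it returns 'Active' when the customer has an 'Active' snapshot within the inclusive range [day - 5 days, day] and 'Inactive' otherwise, with no NULL case. The names active_days and status_on are hypothetical, and the dates are the 'Active' rows of the sample data.

```python
from datetime import date, timedelta

# Dates on which each customer has an 'Active' row in daily_customer_snapshot
# (taken from the sample data; b1's 'Decay' row on 2020-12-09 is excluded).
active_days = {
    "b1": [date(2020, 12, d) for d in (1, 2, 3, 4, 5)],
    "b3": [date(2020, 12, d) for d in (6, 7, 8)],
    "b2": [date(2020, 12, 15)],
}

def status_on(c_id, day, lookback_days=5):
    """Return 'Active' if c_id was active in [day - lookback_days, day]."""
    start = day - timedelta(days=lookback_days)
    seen = any(start <= d <= day for d in active_days.get(c_id, []))
    return "Active" if seen else "Inactive"

print(status_on("b1", date(2020, 12, 10)))  # window still includes 2020-12-05
print(status_on("b1", date(2020, 12, 11)))  # window starts on 2020-12-06
```

Both ends of the window are inclusive, which matches a RANGE frame running from 5 days preceding to the current row.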

If you want to use both tables, then a lateral join does what you want. For each row of the date series, the subquery picks the most recent snapshot on or before that date:
select bds.date, bds.c_id, b.state, b.location
--CASE WHEN b.state = '%ActiveDecay%' between current_date- 10 and current_date THEN 'ActIve' ELSE 'DECAY' END as STATUS
from customer_date_series bds
left join lateral (
    select b.*
    from daily_customer_snapshot b
    where bds.c_id = b.c_id and b.date <= bds.date
    order by b.date desc
    limit 1
) b on 1=1
order by 1, 2;
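The "latest row on or before this date" lookup that the lateral subquery performs can be sketched in Python with a binary search. This assumes each customer's snapshots are kept sorted by date; the names snapshots and latest_snapshot are hypothetical, and only customers b1 and b3 from the sample data are included.

```python
from bisect import bisect_right
from datetime import date

# Snapshots per customer as (date, state, location), sorted by date.
snapshots = {
    "b1": [(date(2020, 12, d), "Active", "OOW") for d in (1, 2, 3, 4, 5)]
          + [(date(2020, 12, 9), "Decay", "IW")],
    "b3": [(date(2020, 12, d), "Active", "OOW") for d in (6, 7, 8)],
}

def latest_snapshot(c_id, day):
    """Return the most recent snapshot of c_id dated on or before day, or None."""
    rows = snapshots.get(c_id, [])
    dates = [row[0] for row in rows]
    i = bisect_right(dates, day)  # number of snapshots dated <= day
    return rows[i - 1] if i else None

print(latest_snapshot("b1", date(2020, 12, 7)))   # last snapshot before the gap
print(latest_snapshot("b1", date(2020, 12, 16)))  # the 'Decay' row
```

Comparing the returned snapshot date with the series date would then give the number of days since the last update, which is what a 5-day status rule needs.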

Compare two tables and display selected value from 2nd table

I'm trying to match the 3rd column of the first table against the 2nd column of the second table. In the example below, I need to get the PROGRAM from Table1 and append it to each row of Table2 using AWK. The column common to both tables is TESTER.
Below is my code, which is not working. Please help me fix it.
awk -F, 'NR==FNR{a[$1]=$8;next;}{print $0,a[$3]?a[$2]:"N/A"}' OFS=, table1 table2
Table1:
Date Time TESTER Niche SMS_NO TEST_AREA SCREEN_TYPE PROGRAM
4/23/2019 8:40:42 A1 Nxx S11 TA1 ST1 PGM1
4/23/2019 7:34:08 B1 Nx1 S21 TA2 ST2 PGM2
4/23/2019 3:16:24 C1 Nx2 S31 TA3 ST3 PGM3
4/23/2019 6:22:04 D1 Nx3 S41 TA4 ST4 PGM4
4/23/2019 8:55:19 E1 Nx4 S51 TA5 ST5 PGM5
7/22/2018 17:30:37 F1 Nx5 S61 TA6 ST6 PGM6
Table2:
FEATURE TESTER LICENSE_USED
FEA1 A1 4
FEA2 B1 16
FEA3 C1 16
FEA4 D1 16
FEA5 E1 16
FEA6 F1 16
FEA7 G1 16
FEA8 G2 16
Expected output:
FEATURE TESTER LICENSE_USED PROGRAM
FEA1 A1 4 PGM1
FEA2 B1 16 PGM2
FEA3 C1 16 PGM3
FEA4 D1 16 PGM4
FEA5 E1 16 PGM5
FEA6 F1 16 PGM6
FEA7 G1 16 N/A
FEA8 G2 16 N/A

Please check this:
awk 'NR==FNR {a[$3]=$8; next} {print $0 FS (a[$2]?a[$2]:"N/A")}' file1.txt file2.txt
File1.txt
Date Time TESTER Niche SMS_NO TEST_AREA SCREEN_TYPE PROGRAM
4/23/2019 8:40:42 A1 Nxx S11 TA1 ST1 PGM1
4/23/2019 7:34:08 B1 Nx1 S21 TA2 ST2 PGM2
4/23/2019 3:16:24 C1 Nx2 S31 TA3 ST3 PGM3
4/23/2019 6:22:04 D1 Nx3 S41 TA4 ST4 PGM4
4/23/2019 8:55:19 E1 Nx4 S51 TA5 ST5 PGM5
7/22/2018 17:30:37 F1 Nx5 S61 TA6 ST6 PGM6
File2.txt
FEATURE TESTER LICENSE_USED
FEA1 A1 4
FEA2 B1 16
FEA3 C1 16
FEA4 D1 16
FEA5 E1 16
FEA6 F1 16
FEA7 G1 16
FEA8 G2 16
Output:
FEATURE TESTER LICENSE_USED PROGRAM
FEA1 A1 4 PGM1
FEA2 B1 16 PGM2
FEA3 C1 16 PGM3
FEA4 D1 16 PGM4
FEA5 E1 16 PGM5
FEA6 F1 16 PGM6
FEA7 G1 16 N/A
FEA8 G2 16 N/A

Tried on GNU awk:
awk 'NR==FNR{a[$3]=$8;next} {$4=a[$2];if($4=="") $4="N/A";print}' Table1 Table2
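The awk one-liners above follow a two-pass lookup pattern: read the first file into a map keyed by TESTER, then append the mapped PROGRAM to each line of the second file. A rough Python equivalent, assuming whitespace-separated fields and using only a subset of the sample rows, looks like this (the variable names are illustrative):

```python
# A subset of the sample rows from the question, whitespace-separated.
table1 = """\
Date Time TESTER Niche SMS_NO TEST_AREA SCREEN_TYPE PROGRAM
4/23/2019 8:40:42 A1 Nxx S11 TA1 ST1 PGM1
4/23/2019 7:34:08 B1 Nx1 S21 TA2 ST2 PGM2
"""

table2 = """\
FEATURE TESTER LICENSE_USED
FEA1 A1 4
FEA2 B1 16
FEA7 G1 16
"""

# First pass: map column 3 (TESTER) to column 8 (PROGRAM).
# The header line maps "TESTER" to "PROGRAM", which labels the new column.
program_by_tester = {}
for line in table1.splitlines():
    fields = line.split()
    program_by_tester[fields[2]] = fields[7]

# Second pass: append the looked-up PROGRAM, or "N/A" when TESTER is unknown.
output = []
for line in table2.splitlines():
    fields = line.split()
    fields.append(program_by_tester.get(fields[1], "N/A"))
    output.append(" ".join(fields))

print("\n".join(output))
```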

Pandas pivot table selecting rows with maximum values

I have pandas dataframe as:
df
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.01 27.71
82 A1 case1.06 29.54
1558 A3 case1.06 29.54
82 A1 case1.11 12.09
1558 A3 case1.11 32.09
82 A1 case1.16 33.35
1558 A3 case1.16 33.35
For each Id, Name pair I need to select the CaseId with maximum value.
i.e. I am seeking the following output:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
I tried the following:
import pandas as pd
import numpy as np
pd.pivot_table(df, index=['Id', 'Name'], columns=['CaseId'], values=['Value'], aggfunc=[np.max])['amax']
But all it does is give the maximum value with each CaseId as a column, which is not the result I am seeking above.

sort_values + drop_duplicates
df.sort_values('Value').drop_duplicates(['Id'],keep='last')
Out[93]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
Since we posted at the same time, here is one more method:
df.sort_values('Value').groupby('Id').tail(1)
Out[98]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71

This should work:
df = df.sort_values('Value', ascending=False).drop_duplicates('Id').sort_index()
Output:
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35

With nlargest and groupby
pd.concat(d.nlargest(1, ['Value']) for _, d in df.groupby('Name'))
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35

Another idea is to create a joint column, take its max, then split it back to two columns:
df['ValueCase'] = list(zip(df['Value'], df['CaseId']))
p = pd.pivot_table(df, index=['Id', 'Name'], values=['ValueCase'], aggfunc='max')
p['Value'], p['CaseId'] = list(zip(*p['ValueCase']))
del p['ValueCase']
Results in:
CaseId Value
Id Name
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
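One more option, offered as a sketch rather than as any of the answerers' methods: group by the Id and Name pair, take the index label of each group's maximum Value with idxmax, and select those rows with loc. The DataFrame below rebuilds the sample data from the question, and the name best is illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id": [82, 1558, 82, 1558, 82, 1558, 82, 1558],
    "Name": ["A1", "A3"] * 4,
    "CaseId": ["case1.01", "case1.01", "case1.06", "case1.06",
               "case1.11", "case1.11", "case1.16", "case1.16"],
    "Value": [37.71, 27.71, 29.54, 29.54, 12.09, 32.09, 33.35, 33.35],
})

# idxmax returns, per (Id, Name) group, the index label of the largest Value;
# loc then pulls the full rows, keeping CaseId alongside the maximum.
best = df.loc[df.groupby(["Id", "Name"])["Value"].idxmax()]
print(best)
```

Unlike the sort-based answers, this keeps exactly one row per group even when the maximum is tied, because idxmax returns the first occurrence.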
