Compare two tables and display selected value from 2nd table - awk

I'm trying to match the 3rd column and 2nd column on two table. In below example, I need to get the PROGRAM from the second table and output it using `AWK. Common between the two table is the TESTER.
below is my code, not working . pls help fix
awk -F, 'NR==FNR{a[$1]=$8;next;}{print $0,a[$3]?a[$2]:"N/A"}' OFS=, table1 table2
Table1:
Date Time TESTER Niche SMS_NO TEST_AREA SCREEN_TYPE PROGRAM
4/23/2019 8:40:42 A1 Nxx S11 TA1 ST1 PGM1
4/23/2019 7:34:08 B1 Nx1 S21 TA2 ST2 PGM2
4/23/2019 3:16:24 C1 Nx2 S31 TA3 ST3 PGM3
4/23/2019 6:22:04 D1 Nx3 S41 TA4 ST4 PGM4
4/23/2019 8:55:19 E1 Nx4 S51 TA5 ST5 PGM5
7/22/2018 17:30:37 F1 Nx5 S61 TA6 ST6 PGM6
Table2:
FEATURE TESTER LICENSE_USED
FEA1 A1 4
FEA2 B1 16
FEA3 C1 16
FEA4 D1 16
FEA5 E1 16
FEA6 F1 16
FEA7 G1 16
FEA8 G2 16
Expected output:
FEATURE TESTER LICENSE_USED PROGRAM
FEA1 A1 4 PGM1
FEA2 B1 16 PGM2
FEA3 C1 16 PGM3
FEA4 D1 16 PGM4
FEA5 E1 16 PGM5
FEA6 F1 16 PGM6
FEA7 G1 16 N/A
FEA8 G2 16 N/A

Please check this:
awk 'NR==FNR {a[$3]=$8; next} {print $0 FS (a[$2]?a[$2]:"N/A")}' file1.txt file2.txt
File1.txt
Date Time TESTER Niche SMS_NO TEST_AREA SCREEN_TYPE PROGRAM
4/23/2019 8:40:42 A1 Nxx S11 TA1 ST1 PGM1
4/23/2019 7:34:08 B1 Nx1 S21 TA2 ST2 PGM2
4/23/2019 3:16:24 C1 Nx2 S31 TA3 ST3 PGM3
4/23/2019 6:22:04 D1 Nx3 S41 TA4 ST4 PGM4
4/23/2019 8:55:19 E1 Nx4 S51 TA5 ST5 PGM5
7/22/2018 17:30:37 F1 Nx5 S61 TA6 ST6 PGM6
File2.txt
FEATURE TESTER LICENSE_USED
FEA1 A1 4
FEA2 B1 16
FEA3 C1 16
FEA4 D1 16
FEA5 E1 16
FEA6 F1 16
FEA7 G1 16
FEA8 G2 16
Output:
FEATURE TESTER LICENSE_USED PROGRAM
FEA1 A1 4 PGM1
FEA2 B1 16 PGM2
FEA3 C1 16 PGM3
FEA4 D1 16 PGM4
FEA5 E1 16 PGM5
FEA6 F1 16 PGM6
FEA7 G1 16 N/A
FEA8 G2 16 N/A

tried on gnu awk
awk 'NR==FNR{a[$3]=$8;next} {$4=a[$2];if($4=="") $4="N/A";print}' Table1 Table2

Related

Splunk: Use output of search A row by row as input for search B, then produce common result table

In Splunk, I have a search producing a result table like this:
_time
A
B
C
2022-10-19 09:00:00
A1
B1
C1
2022-10-19 09:00:00
A2
B2
C2
2022-10-19 09:10:20
A3
B3
C3
Now, for each row, I want to run a second search, using the _time value as input parameter.
For above row 1 and 2 (same _time value), the result of the second search would be:
_time
D
E
2022-10-19 09:00:00
D1
E1
For above row 3, the result of the second search would be:
_time
D
E
2022-10-19 09:10:20
D3
E3
And now I want to output the results in a common table, like this:
_time
A
B
C
D
E
2022-10-19 09:00:00
A1
B1
C1
D1
E1
2022-10-19 09:00:00
A2
B2
C2
D1
E1
2022-10-19 09:10:20
A3
B3
C3
D3
E3
I experimented with join, append, map, appendcols and subsearch, but I am struggling both with the row-by-row character of the second search and with pulling to data together into one common table.
For example, appendcols simply tacks one result table onto another, even if they are completely unrelated and differently shaped. Like so:
_time
A
B
C
D
E
2022-10-19 09:00:00
A1
B1
C1
D1
E1
2022-10-19 09:00:00
A2
B2
C2
-
-
2022-10-19 09:10:20
A3
B3
C3
-
-
Can anybody please point me into the right direction?

Compare values from one column in table A and another column in table B

I need to create a NeedDate column in the expected output. I will compare the QtyShort from Table B with QtyReceive from table A.
In the expected output, if QtyShort = 0, NeedDate = MaltDueDate.
For the first row of table A, if 0 < QtyShort (in Table B) <= QtyReceive (=6), NeedDate = 10/08/2021 (DueDate from Table A).
If 6 < QtyShort <= 10 (QtyReceive), move to the second row, NeedDate = 10/22/2021 (DueDate from Table A).
If 10 < QtyShort <= 20 (QtyReceive), move to the third row, NeedDate = 02/01/2022 (DueDate from Table A).
If QtyShort > QtyReceive (=20), NeedDate = 09/09/9999.
This should continue in a loop until the last row on table B has been compared
How could we do this? Any help will be appreciated. Thank you in advance!
Table A
Item DueDate QtyReceive
A1 10/08/2021 6
A1 10/22/2021 10
A1 02/01/2022 20
Table B
Item MatlDueDate QtyShort
A1 06/01/2022 0
A1 06/02/2022 0
A1 06/03/2022 1
A1 06/04/2022 2
A1 06/05/2022 5
A1 06/06/2022 7
A1 06/07/2022 10
A1 06/08/2022 15
A1 06/09/2022 25
Expected Output:
Item MatlDueDate QtyShort NeedDate
A1 06/01/2022 0 06/01/2022
A1 06/02/2022 0 06/02/2022
A1 06/03/2022 1 10/08/2021
A1 06/04/2022 2 10/08/2021
A1 06/05/2022 5 10/08/2021
A1 06/06/2022 7 10/22/2021
A1 06/07/2022 10 10/22/2021
A1 06/08/2022 15 02/01/2022
A1 06/09/2022 25 09/09/9999
Use OUTER APPLY() operator to find the minimum DueDate from TableA that is able to fulfill the QtyShort
select b.Item, b.MatlDueDate, b.QtyShort,
NeedDate = case when b.QtyShort = 0
then b.MatlDueDate
else isnull(a.DueDate, '9999-09-09')
end
from TableB b
outer apply
(
select DueDate = min(a.DueDate)
from TableA a
where a.Item = b.Item
and a.QtyReceive >= b.QtyShort
) a
Result:
Item
MatlDueDate
QtyShort
NeedDate
A1
2022-06-01
0
2022-06-01
A1
2022-06-02
0
2022-06-02
A1
2022-06-03
1
2021-10-08
A1
2022-06-04
2
2021-10-08
A1
2022-06-05
5
2021-10-08
A1
2022-06-06
7
2021-10-22
A1
2022-06-07
10
2021-10-22
A1
2022-06-08
15
2022-02-01
A1
2022-06-09
25
9999-09-09
db<>fiddle demo

Display values that are out of scope in addition to in scope

So I have a complex situation.
I have a 3 tables:
Product
Resource Name
Resource Type
C1
Bold
E2
Crema
C2
Bold
C3
Bold
Purchase_History
Resource Name
Qty
Cust ID
Date
Batch
C1
7
123
Jun 1
324
C1
7
222
Jun 10
324
C1
7
333
Jun 11
4BZ
C1
7
124
Jun 11
4BZ
C1
7
125
Jun 11
324
C1
7
111
Jun 21
324
C2
7
55
Jun 22
A22
C2
7
1
Jun 24
A22
Inventory
Resource Name
Available
Qty
Batch
C1
1
40
324
C2
1
50
3GC
C1
2
0
4BZ
C2
1
99
A22
E2
1
99
B22
E2
2
0
C22
So I've created a query as below:
Select
p.resourcename
, ph.cust_id
, ph.batch
, case when i.available=1 then 'Yes' when i.available=2 then 'no' else ''end 'In Stock'
from product p
join purchase_history ph on ph.resource_name=p.resource_name
join inventory i on i.batch=ph.batch
where
ph.date >='Jun 1'
ph.date <='Jun 20'
I am getting the following:
Resource Name
Cust Id
Batch
In Stock
C1
123
324
Yes
C1
222
324
Yes
C1
333
4BZ
No
C1
124
4BZ
No
C1
123
324
Yes
What I would like to achieve is the below, where even though the last 2 batch and products are out of the range of transactions, we can still see them as below. I know this is a weird as but essentially the team wants to see what has been sold so far - within the date range - and all product availability status. Is this something achievable?
Resource Name
Cust Id
Batch
In Stock
C1
123
324
Yes
C1
222
324
Yes
C1
333
4BZ
No
C1
124
4BZ
No
C1
123
324
Yes
C2
n/a
3GC
Yes
C2
n/a
A22
Yes
E2
n/a
B22
Yes
E2
n/a
C22
No
you need to use left join with purchase_history table:
Select
p.resourcename
, ph.cust_id
, i.batch
, case when i.available=1 then 'Yes' when i.available=2 then 'no' else ''end 'In Stock'
from product p
join inventory i on i.resource_name=p.resource_name
left join purchase_history ph
on ph.resource_name=p.resource_name
and i.batch=ph.batch
where
ph.date >='Jun 1' and ph.date <='Jun 20'
notice I changes the order of tables for better readability.

Selecting rows where values in one column are different, other column(s) values are similar and values of Date columns are correct

Suppose I have the following columns:
ID,Code,DST,Short_text,Long_text,Date_from,Date_until
Here is the dataset:
ID Code DST Short_text Long_text Date_From Date_Until
1 B 01 B 1 Bez1 Bezirk1 29.10.1999 13.01.2020
1 B 01 B 1 Bez1 Bezirk1 14.01.2020 31.12.9999
2 B 02 B 2 Bez2 Bezirk2 29.10.1999 13.01.2020
3 B 03 B 3 Bez3 Bezirk3 14.01.2020 31.12.9999
4 B 04 B 4 Bez4 Bezirk4 29.10.1999 13.01.2020
4 B 04 B 4 Bez4 Bezirk4 14.01.2020 31.12.9999
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
I want to select rows, such that they have different ID AND (they have similar Code OR SImilar Short_text OR simmlar long_text) AND Correct Date_from - Date_Until.
Definition of correct Date_from - Date_Until:
1.Date_ from < Date_Until
2.Both fields are not Null
3. WHEN PREV_DATE_UNTIL = DATE_FROM - 1 OR PREV_DATE_UNTIL is null THEN 'OK'(PREV_DATE_UNTIL using lag operator)
4. WHEN NEXT_DATE_FROM = DATE_UNTIL + 1 OR NEXT_DATE_FROM is null THEN 'OK'(NEXT_DATE_FROM using lead operator)
Not correct:
WHEN WHEN NEXT_DATE_FROM > DATE_UNTIL + 1 THEN 'Gaps in Dates'
WHEN WHEN NEXT_DATE_FROM < DATE_UNTIL + 1 THEN 'Overlapping dates'
Basically what I mean, that historization of the data must be correct(no overlapping)
At the end I want to select the following rows:
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
Because they have different ID and similar Code or short_text or long_text and dates are correct according to the definition
And
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
Because they have different ID and similar Code and dates are correct according to the definition
Rows:
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
Should NOT be selected, because they have different ID and similar Code BUT they have incorrect Dates(they are overlapping).
This will be something like that:
with t as (
select row_number() over (order by date_from, date_until) rn,
id, code, dst, short_text, long_text, date_from,
nullif(date_until, date '9999-12-31') date_until
from data)
select rna, rnb, description, t.*
from t
join (
select a.rn rna, b.rn rnb,
case when b.date_from = a.date_until + 1 then 'OK'
when b.date_from > a.date_until + 1 then 'gaps'
when b.date_from < a.date_until + 1 then 'overlapping'
end description
from t a
join t b on a.id <> b.id and a.rn < b.rn
and (a.code = b.code or a.short_text = b.short_text
or a.long_text = b.long_text)) pairs
on rn in (rna, rnb)
result is:
RNA RNB DESCRIPTION RN ID CODE DST SHORT_TEXT LONG_TEXT DATE_FROM DATE_UNTIL
------ ------ ----------- ----- ---------- ---- ---- ---------- --------- ----------- -----------
1 7 overlapping 1 100 M 56 M 59 MA 57 Football 1999-10-29 2020-01-13
1 7 overlapping 7 101 M 56 M 56 MA 56 Tennis 1999-10-29
3 8 OK 3 98 M 55 M 53 MA 53 Dance 1999-10-29 2020-01-13
3 8 OK 8 99 M 55 M 54 MA 54 Skating 2020-01-14
6 9 OK 6 97 M 51 M 52 MA 51 Sport 1999-10-29 2020-01-13
6 9 OK 9 96 M 51 M 51 MA 51 Sport 2020-01-14
dbfiddle
I numbered rows, self joined such data and dressed your logic in case when syntax. I tested on your examples, in case of any mistakes please provide dbfiddle if possible.

SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery

In an SQLite database I have one table where I need to count the duplicates across certain columns (i.e. rows where 3 particular columns are the same) and then also number each of these cases (i.e. if there are 2 occurrences of a particular duplicate, they need to be numbered as 1 and 2). I'm finding it a bit difficult to explain in words so I'll use a simplified example below.
The data I have is similar to the following (first line is header row, table is referenced in following as "idcountdata"):
id match1 match2 match3 data
1 AbCde BC 0 data01
2 AbCde BC 0 data02
3 AbCde BC 1 data03
4 AbCde AB 0 data04
5 FGhiJ BC 0 data05
6 FGhiJ AB 0 data06
7 FGhiJ BC 1 data07
8 FGhiJ BC 1 data08
9 FGhiJ BC 2 data09
10 HkLMop BC 1 data10
11 HkLMop BC 1 data11
12 HkLMop BC 1 data12
13 HkLMop DE 1 data13
14 HkLMop DE 2 data14
15 HkLMop DE 2 data15
16 HkLMop DE 2 data16
17 HkLMop DE 2 data17
And the output I need to generate for the above would be:
id match1 match2 match3 data matchid matchcount
1 AbCde BC 0 data01 1 2
2 AbCde BC 0 data02 2 2
3 AbCde BC 1 data03 1 1
4 AbCde AB 0 data04 1 1
5 FGhiJ BC 0 data05 1 1
6 FGhiJ AB 0 data06 1 1
7 FGhiJ BC 1 data07 1 2
8 FGhiJ BC 1 data08 2 2
9 FGhiJ BC 2 data09 1 1
10 HkLMop BC 1 data10 1 3
11 HkLMop BC 1 data11 2 3
12 HkLMop BC 1 data12 3 3
13 HkLMop DE 1 data13 1 1
14 HkLMop DE 2 data14 1 4
15 HkLMop DE 2 data15 2 4
16 HkLMop DE 2 data16 3 4
17 HkLMop DE 2 data17 4 4
Previously I was using a couple of correlated subqueries to achieve this as follows:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
AS matchid,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3)
AS matchcount
FROM idcountdata d1;
But the table has over 200,000 rows (and the data can be variable in length/content) and hence this takes hours to run. (Strangely, when I first used the same query on the same data back in mid-to-late 2013 it took minutes rather than hours, but that is beside the point - even back then I thought it was inelegant and inefficient.)
I've already converted the correlated subquery for "matchcount" in the above to an uncorrelated subquery with a JOIN as follows:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data,
matchcount
FROM idcountdata d1
JOIN
(SELECT id,match1,match2,match3,count(*) matchcount
FROM idcountdata
GROUP BY match1,match2,match3) d2
ON (d1.match1=d2.match1 and d1.match2=d2.match2 and d1.match3=d2.match3);
So it's just the subquery for "matchid" that I would like some help to optimise.
In short, the following query runs too slowly for larger datasets:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
matchid
FROM idcountdata d1;
How can I improve the performance of the above query?
It doesn't have to run in seconds, but it needs to be minutes rather than hours (for around 200,000 rows).
A self join may be faster than a correlated subquery
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, count(*) matchid
FROM idcountdata d1
JOIN idcountdata d2 on d1.match1 = d2.match1
and d1.match2 = d2.match2
and d1.match3 = d2.match3
and d1.id >= d2.id
GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data
This query can take advantage of a composite index on (match1,match2,match3,id)