How to get unique rows in one file compared with multiple files in awk - awk

I have tab-delimited files as shown below and would like to get the output described below. I got part of the way with the commands shown, but could not complete the final step. The description is slightly lengthy to make the question clear.
file1.txt
col1 col2 col3 col4 col5
ID1 str1 234 cond1 0
ID1 str2 567 cond1 0
ID1 str3 789 cond1 1
ID1 str4 123 cond1 1
file2.txt
col1 col2 col3 col4 col5
ID2 str1 235 cond1 0
ID2 str2 567 cond2 1
ID2 str3 789 cond1 1
ID2 str4 123 cond2 0
file3.txt
col1 col2 col3 col4 col5
ID3 str1 235 cond1 0
ID3 str2 567 cond2 1
ID3 str3 789 cond1 1
I would like to find the rows in file1.txt that are unique when compared with the rest of the files, file2.txt and file3.txt. Columns col2 and col3 are used as the lookup key. There is an additional condition: a row is deleted only if the matching row in file2.txt or file3.txt also has col4="cond1". Below is the code and output:
awk -F "\t" 'NR == 1 { OFS="\t"; print $0; next }
NR == FNR { a[$2,$3] = $0; next }
{ if ($4=="cond1") delete a[$2, $3] }
END { for (i in a) print a[i] }' file1.txt file2.txt file3.txt
Output:
col1 col2 col3 col4 col5
ID1 str1 234 cond1 0
ID1 str2 567 cond1 0
ID1 str4 123 cond1 1
Now I would like to add two columns: a comma-separated list of the col1 values, and a count of those values, taken from the rows in file2.txt and file3.txt that share the key but do not meet the condition $4=="cond1".
DESIRED OUTPUT
col1 col2 col3 col4 col5 col6 col7
ID1 str1 234 cond1 0 NA NA
ID1 str2 567 cond1 0 ID2,ID3 2
ID1 str4 123 cond1 1 ID2 1
Though str2 and 567 are present in file2.txt and file3.txt, the row from file1.txt is retained since col4=="cond2" in both of those files. Now the issue is how to get the additional columns col6 and col7. Any idea?
NOTE: This is a test case where file1.txt is compared with file2.txt and file3.txt. In the real scenario there will be more files to compare against file1.txt.

awk -vOFS="\t" '!c{c=$0"\tcol6\tcol7";next}NR==FNR{a[$2$3]=$0;next}{if($4=="cond1"){delete a[$2$3]}else{b[$2$3]=b[$2$3]?b[$2$3]","$1:$1}}END{print c;for(i in a){s=split(b[i],t,",");if(!s){b[i]=s="NA"}print a[i],b[i],s}}' file1.txt file2.txt file3.txt
col1 col2 col3 col4 col5 col6 col7
ID1 str2 567 cond1 0 ID2,ID3 2
ID1 str1 234 cond1 0 NA NA
ID1 str4 123 cond1 1 ID2 1
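For readability, the same logic can be spelled out over multiple lines. This is a sketch rather than the answerer's exact script: it adds -F '\t' for the tab-delimited input and keys the arrays with a[$2,$3] (SUBSEP) instead of a[$2$3], since plain concatenation can collide (e.g. "str2" "567" and "str25" "67" both give "str2567"):
awk -F '\t' -v OFS='\t' '
FNR == 1  { if (!hdr) hdr = $0 OFS "col6" OFS "col7"; next } # keep the first header, skip the rest
NR == FNR { a[$2,$3] = $0; next }                            # pass 1: store file1.txt rows by key
$4 == "cond1" { delete a[$2,$3]; next }                      # cond1 match: drop the file1.txt row
{ b[$2,$3] = b[$2,$3] ? b[$2,$3] "," $1 : $1 }               # otherwise collect the col1 values
END {
    print hdr
    for (i in a) {
        n = split(b[i], t, ",")                              # count the collected IDs
        if (!n) { b[i] = n = "NA" }                          # no non-cond1 matches at all
        print a[i], b[i], n
    }
}' file1.txt file2.txt file3.txt
As with the one-liner, rows come out in awk's arbitrary for-in order.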

Related

using OFS with awk on file with multiple spaces as delimiter

I have a file with multiple spaces as the delimiter.
> cat file1.csv
col1  col2  col3  col4
col1  col2  col3
col1  col2
col1
col1  col2  col3  col4  col5
This is the output I want, but mine comes out with empty lines in between (which makes me wonder if my -F' {2,}' is working):
> awk -F' {2,}' 'NR==1{print $0}' file1.csv | tr " " "\n"
col1
col2
col3
col4
>
But I was hoping to do it in awk using OFS, and I'm not sure I'm doing it right:
> awk -F' {2,}' 'BEGIN{OFS=","} NR==1{print $0}' file1.csv
col1  col2  col3  col4
Other work in this space, for my reference. I want to do something like this, but with multiple spaces as the delimiter instead of a comma:
> cat file.csv
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
> awk -F, 'NR==1{print $0}' file.csv
col1,col2,col3,col4
> awk -F, 'NR==1{print $0}' file.csv | tr "," "\n"
col1
col2
col3
col4
> awk -F, 'NR==1{print $0}' file.csv | tr "," "\n" | cat -n
1 col1
2 col2
3 col3
4 col4
I can just use sed to remove blank lines but I want to use awk OFS as above:
> awk -F' {2,}' 'NR==1{print $0}' file1.csv | tr " " "\n" | sed '/^[[:space:]]*$/d'
col1
col2
col3
col4
>
Assumptions:
we're only interested in the 1st line of the file
each field/column of the (1st) line is to be printed on a separate line
Sample inputs:
$ head file?.csv
==> file1.csv <==
col1  col2  col3  col4
col1  col2  col3
col1  col2
col1
col1  col2  col3  col4  col5
==> file2.csv <==
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
For the 1st file (file1.csv) we can use the default input field delimiter (ie, white space):
$ awk '{for (i=1;i<=NF;i++) print $i; exit}' file1.csv
col1
col2
col3
col4
For the 2nd file (file2.csv) we use an input field delimiter of a comma (,):
$ awk -F',' '{for (i=1;i<=NF;i++) print $i; exit}' file2.csv
col1
col2
col3
col4
NOTE: in neither of these cases do we need to worry about setting the output field separator (OFS)
If we absolutely, positively need to set, and use, a non-default OFS (the $1=$1 assignment forces awk to rebuild $0 using the new OFS):
$ awk 'BEGIN{OFS="\n"} {$1=$1; print; exit}' file1.csv
col1
col2
col3
col4
$ awk -F',' 'BEGIN{OFS="\n"} {$1=$1; print; exit}' file2.csv
col1
col2
col3
col4
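This is also why the attempt in the question printed the line unchanged: print on an untouched record reproduces the input verbatim, and OFS is only consulted when the record is rebuilt. A minimal demonstration (the echoed text is made up for illustration):
$ echo 'a  b  c' | awk -F' {2,}' 'BEGIN{OFS=","} {print}'
a  b  c
$ echo 'a  b  c' | awk -F' {2,}' 'BEGIN{OFS=","} {$1=$1; print}'
a,b,c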
Given:
$ head file?.csv
==> file1.csv <==
col1  col2  col3  col4
col1  col2  col3
col1  col2
col1
col1  col2  col3  col4  col5
==> file2.csv <==
col1,col2,col3,col4
col1,col2,col3
col1,col2
col1
col1,col2,col3,col4,col5
You can use head and sed.
For a space / tab separated file:
$ head -1 file1.csv | sed -E 's/[[:blank:]]{1,}/\n/g'
col1
col2
col3
col4
Comma separated:
$ head -1 file2.csv | sed -E 's/,/\n/g'
col1
col2
col3
col4
You can also skip head altogether and use sed to 1) find line n, 2) do the replacement, and 3) quit. Here for the fifth line:
sed -nE '5{s/([[:blank:]]{1,})/\n/g; p; q; }' file1.csv
col1
col2
col3
col4
col5
Just use 1 for the first line.
Or similarly with awk:
$ awk -v ln=5 -F"[[:blank:]]{1,}" 'FNR==ln{for(i=1;i<=NF;i++) print $i; exit}' file1.csv
If you just want to de-dupe it all down to col1 through col5, try:
1 col1  col2  col3  col4
2 col1  col2  col3
3 col1  col2
4 col1
5 col1  col2  col3  col4  col5
6 col1,col2,col3,col4
7 col1,col2,col3
8 col1,col2
9 col1
10 col1,col2,col3,col4,col5
This treats every comma- or whitespace-separated token as a record of its own and prints each record only the first time it is seen (the !seen[$0]++ idiom):
{m,g,n}awk '!__[$0]++' RS='[,[:space:]]+' file1.csv file2.csv
col1
col2
col3
col4
col5

Scatter multiple rows having duplicate columns to single unique row in postgresql

How do I scatter multiple duplicate rows into one row in SQL/PostgreSQL?
For example, let's say I am getting 3 rows:
col1 col2 col3
-------------------
11 test rat
11 test cat
11 test test
I want something like this
col1 col2 col3 col4
------------------------
11 test rat cat
It's the same thing as groupBy in lodash. But how do I achieve the same in a PostgreSQL query?
You're looking for crosstab, from the tablefunc extension (run CREATE EXTENSION tablefunc once per database):
postgres=# create table ab (col1 text, col2 text, col3 text);
CREATE TABLE
postgres=# insert into ab values ('t1','test','cat'),('t1','test','rat'),('t1','test','test');
INSERT 0 3
postgres=# select * from crosstab('select col1,col2,col3 from ab') as (col1 text, col2 text, col3 text, col4 text);
 col1 | col2 | col3 | col4
------+------+------+------
 t1   | cat  | rat  | test
(1 row)
Disclosure: I work for EnterpriseDB (EDB)
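If the number of col3 values per group varies, an alternative that avoids the tablefunc extension is to aggregate into an array and index into it. A sketch against the same ab table (the ORDER BY inside array_agg is an assumption; choose whatever ordering should decide which value lands in which column):
SELECT col1,
       col2,
       (array_agg(col3 ORDER BY col3))[1] AS col3,
       (array_agg(col3 ORDER BY col3))[2] AS col4,
       (array_agg(col3 ORDER BY col3))[3] AS col5
FROM ab
GROUP BY col1, col2;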

SQL MINUS/EXCEPT command analog for columns only while INSERTion

Does a MINUS/EXCEPT analog, or a code workaround, exist for columns only? MINUS/EXCEPT works fine for rows; how about for columns?
Mask-table (physically exists):
id               col1 col2 col3 col4 ... colN   comment
(doesn't matter) A    B    C    D    ... Z      -- alphabet, correct sequence
The data type of col[i] is the same for all columns.
Incoming select-stream (not a physical table, but represented as a table; it is the result of a complex joined selection over other tables):
col1 col2 col3 col4 ... colN   comment
A    B    C    D    ... Z      -- alphabet, correct sequence
A    C    B    D    ... Z      -- incorrect
E    B    C    M    ... Z      -- incorrect
...
Z    Y    X    W    ... A      -- fully inverted, incorrect
Expected output to the physical table (the insert result after applying the mask-table to the select-stream):
id               col1   col2   col3   col4   ... colN
(auto-generated) (null) (null) (null) (null) ... (null)
                 (null) C      B      (null) ... (null)
                 E      (null) (null) M      ... (null)
                 ...
                 Z      Y      X      W      ... A
Please note: the alphabet is given just as an example; it is not the actual case here. I need the SQL logic/command: an analog of MINUS/EXCEPT, but for columns (DISTINCT? How, if the incoming select-stream is the result of a complex joined select?).
What will the SQL code for this task be? Please advise.
Another way to do it, without CASE expressions: NULLIF(x, y) returns NULL when x = y and x otherwise, which is exactly the masking behaviour wanted, and since the mask table holds a single row, a CROSS JOIN pairs it with every stream row.
Setup
CREATE TABLE mask (
    col1 TEXT,
    col2 TEXT,
    col3 TEXT,
    col4 TEXT,
    col5 TEXT
);
INSERT INTO mask SELECT 'A', 'B', 'C', 'D', 'E';

CREATE TABLE your_stream (
    col1 TEXT,
    col2 TEXT,
    col3 TEXT,
    col4 TEXT,
    col5 TEXT
);
INSERT INTO your_stream
VALUES
    ('A', 'B', 'C', 'D', 'E'),
    ('A', 'C', 'B', 'D', 'E'),
    ('E', 'B', 'C', 'M', 'E');
Query
SELECT
    NULLIF(s.col1, m.col1) AS col1,
    NULLIF(s.col2, m.col2) AS col2,
    NULLIF(s.col3, m.col3) AS col3,
    NULLIF(s.col4, m.col4) AS col4,
    NULLIF(s.col5, m.col5) AS col5
FROM your_stream s
CROSS JOIN mask m;
Result
| col1 | col2 | col3 | col4 | col5 |
| ---- | ---- | ---- | ---- | ---- |
| null | null | null | null | null |
| null | C | B | null | null |
| E | null | null | M | null |
I don't see what the connection to a set operation like EXCEPT would be.
Anyway, this is how you could proceed:
INSERT INTO destination (col1, col2, col3, ...)
SELECT CASE WHEN incoming_col1 <> mask.col1
            THEN incoming_col1
       END,
       CASE WHEN incoming_col2 <> mask.col2
            THEN incoming_col2
       END,
       ...
FROM mask;
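Putting the NULLIF variant to work for the actual INSERT, against the five-column demo schema above (a sketch; destination is a hypothetical target table with matching columns):
INSERT INTO destination (col1, col2, col3, col4, col5)
SELECT NULLIF(s.col1, m.col1),
       NULLIF(s.col2, m.col2),
       NULLIF(s.col3, m.col3),
       NULLIF(s.col4, m.col4),
       NULLIF(s.col5, m.col5)
FROM your_stream s
CROSS JOIN mask m;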

Combining data from different rows into one

I have a database that looks something like this:
col1 | col2 | col3 | col4
User1 | Coms | Start | 19 June 2019
User1 | Coms | Ended | 20 June 2019
I would like to transpose the data into a single row, like this:
col1 | col2 | col3 | col4 | col5 (end status)
User1 | Coms | Start | 19 June 2019 | Ended
You see, this user Ended the session, so the status from the second row of that transaction is pulled into the first row. If they did not end the session, the new end-status column will simply be NULL.
I know there is the STUFF ... FOR XML PATH construct that can put several rows together into one comma-delimited field, but that is not what I am looking for.
Any good ideas?
Use conditional aggregation:
select col1, col2,
       max(case when col3 = 'Start' then col3 end),
       max(case when col3 = 'Ended' then col3 end),
       min(col4)
from tablename
group by col1, col2
Use conditional aggregation:
select col1, col2,
       max(case when col3 = 'Start' then col3 end) as col3,
       max(case when col3 = 'Start' then col4 end) as col4,
       max(case when col3 = 'Ended' then 'Ended' end) as col5
from tablename
group by col1, col2
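A quick way to sanity-check the query (a sketch in plain SQL; the table name sessions and the sample rows are assumptions taken from the question):
CREATE TABLE sessions (col1 VARCHAR(20), col2 VARCHAR(20), col3 VARCHAR(20), col4 VARCHAR(20));
INSERT INTO sessions VALUES
    ('User1', 'Coms', 'Start', '19 June 2019'),
    ('User1', 'Coms', 'Ended', '20 June 2019');

SELECT col1, col2,
       MAX(CASE WHEN col3 = 'Start' THEN col3 END) AS col3,
       MAX(CASE WHEN col3 = 'Start' THEN col4 END) AS col4,
       MAX(CASE WHEN col3 = 'Ended' THEN 'Ended' END) AS col5
FROM sessions
GROUP BY col1, col2;
-- returns: User1 | Coms | Start | 19 June 2019 | Ended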