script to remove redundant lines from two different files - awk

I will explain my problem with an example.
I have the following files on Solaris:
file1:
1 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A0 I have some text here
1 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A1 I have some text here
2 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A0 I have some text here
2 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A1 I have some text here
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
6 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A0 I have some text here
6 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A1 I have some text here
7 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U98.A I have some text here
8 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A0 I have some text here
8 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A1 I have some text here
9 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A0 I have some text here
10 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A1 I have some text here
11 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U184.A I have some text here
12 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U97.B I have some text here
file2:
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1
INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1
Now I want to use file2 as a reference and print every line of file1 that matches a line in file2.
Expected output is:
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
I have tried grep:
grep -F -x -f file1 -v file2 > file3
and fgrep:
fgrep -x -f file1 -v file2 > file3
based on several posts on Stack Overflow, but neither gave me what I need. Since I am a beginner, I am not sure how to proceed. Your help is much appreciated.

This should work for you (-F treats each pattern as a fixed string, -f reads the patterns from file2):
grep -Ff file2 file1 >file3
Test with your files:
kent$ grep -Ff f2 f1
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
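Since the title asks for awk, here is a minimal sketch of an exact-match variant. grep -F matches substrings, so a file2 entry ending in U384.A0 would also select a file1 line containing U384.A01; comparing fields 2-5 of file1 against whole lines of file2 avoids that. The shortened instance names and the scratch directory are made up for the demo, and the sketch assumes file2 uses single spaces with no trailing whitespace. On Solaris this may need nawk or /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk.

```shell
# Work in a scratch directory so the demo files don't clobber anything.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > file1 <<'EOF'
2 INST C 0 top.gbp.U246.A1 I have some text here
3 INST N 0 top.gbp.U385.A0 I have some text here
3 INST C 1 top.gbp.U64.A1 I have some text here
4 INST N 0 top.gbp.U384.A01 I have some text here
EOF
cat > file2 <<'EOF'
INST N 0 top.gbp.U385.A0
INST C 1 top.gbp.U64.A1
INST N 0 top.gbp.U384.A0
EOF
# Pass 1 (NR==FNR): remember every line of file2 as an array key.
# Pass 2: rebuild fields 2-5 of file1 and print the line on an exact match.
matched=$(awk 'NR==FNR { ref[$0]; next }
               ($2 " " $3 " " $4 " " $5) in ref' file2 file1)
printf '%s\n' "$matched"
```

Only the U385.A0 and U64.A1 lines are printed here; the U384.A01 line is skipped, whereas grep -Ff would have kept it.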

Related

Print lines in both file when 2 different columns match

I have 2 tab-delimited files:
file 1
B T 4 tab -
1 C 5 - cab
5 A 2 - ttt
D T 18 1111 -
file 2
K A 3 0.1
T B 4 0.3
P 1 5 0.5
P 5 2 0.11
I need to merge the two based on file 1 col1 and 3 and file2 col2 and 3, and print lines in both files. I'm expecting the following output:
B T 4 tab - T B 4 0.3
1 C 5 - cab P 1 5 0.5
5 A 2 - ttt P 5 2 0.11
I tried adapting from a similar question I had in the past:
awk 'NR==FNR {a[$1,$3] = $2"\t"$4"\t"$5; next} $2,$3 in a {print a[$1,$3],$0}' file1 file2
but no success, the output I get looks like this, which is similar to file2:
K A 3 0.1
T B 4 0.3
P 1 5 0.5
P 5 2 0.11
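There are two small problems in your code. First, a multi-dimensional subscript in an "in" test needs parentheses: ($2,$3) in a. Without them, awk parses $2,$3 in a as a range pattern, which is why nearly all of file2 gets printed. Second, the lookup in the print must use the file2 key, a[$2,$3], not a[$1,$3]. Storing the whole line ($0) as the value also keeps file1's columns in their original order:
awk 'NR==FNR{a[$1,$3]=$0; next} ($2,$3) in a {print a[$2,$3], $0}' file1 file2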
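The answer's one-liner joins the two records with the default output separator, a single space. Below is a sketch of a tab-preserving variant, assuming the inputs really are tab-delimited as the question says. The data is copied from the question; the scratch directory is only there for the demo.

```shell
# Recreate the question's two tab-delimited files in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'B\tT\t4\ttab\t-\n1\tC\t5\t-\tcab\n5\tA\t2\t-\tttt\nD\tT\t18\t1111\t-\n' > file1
printf 'K\tA\t3\t0.1\nT\tB\t4\t0.3\nP\t1\t5\t0.5\nP\t5\t2\t0.11\n' > file2
# FS and OFS are both a tab, so the merged line stays tab-delimited.
merged=$(awk 'BEGIN { FS = OFS = "\t" }
    NR==FNR { a[$1,$3] = $0; next }        # index file1 by columns 1 and 3
    ($2,$3) in a { print a[$2,$3], $0 }    # look up file2 columns 2 and 3
' file1 file2)
printf '%s\n' "$merged"
```

Setting FS to a tab also keeps empty fields distinct, which the default whitespace splitting would collapse.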

Merge files print 0 in empty field

I have 5 tab-delimited files.
file0 is basically a key:
A
C
F
AA
BC
CC
D
KKK
S
file1
A 2
C 3
F 5
AA 5
BC 4
D 7
file2
A 2
C 3
F 7
D 10
file3
A 2
C 2
F 5
CC 4
D 7
file4
A 1
C 3
F 5
CC 4
D 7
KKK 10
I would like to merge all files based on the 1st column and print 0 in missing fields.
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
Columns must keep the order of input file0, file1, file2, file3, file4
I was going to wait til you included your own attempt in your question but since you have 2 answers already anyway....
$ cat tst.awk
NR==FNR {                           # first file (file0): remember key order
    key2rowNr[$1] = ++numRows
    rowNr2key[numRows] = $1
    next
}
FNR==1 { ++numCols }                # each further file starts a new column
{
    rowNr = key2rowNr[$1]
    vals[rowNr,numCols] = $2
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", rowNr2key[rowNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%d", OFS, vals[rowNr,colNr]   # %d prints unset cells as 0
        }
        print ""
    }
}
$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
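The zero-filling in tst.awk comes from the %d conversion: an array element that was never assigned is an empty string, and %d coerces it to 0. A tiny sketch of that behaviour follows; the array name vals matches the script, while the "missing" subscript and the sample values are made up for the demo.

```shell
# vals["missing"] was never set, so %d prints it as 0.
# Side effect of %d: a non-integer value such as 2.5 is truncated to 2.
demo=$(awk 'BEGIN { printf "%d %d %d\n", vals["missing"], "7", "2.5" }')
printf '%s\n' "$demo"
```

If the value columns can hold non-integers, printing with %s and substituting 0 explicitly for missing cells would avoid the truncation.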
awk solution
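awk '
FNR==1 { f++ }                      # f = number of the file being read
{
    b[$1]++                         # every key seen in any file
    if (NF > 1) { a[f,$1] = $2; hasval[f] = 1 }   # value for (file, key)
}
END {
    for (i in b) {                  # "for in" order is unspecified
        printf "%s", i              # never use data as the printf format
        for (j = 1; j <= f; j++) {
            if (!(j in hasval)) continue          # skip key-only files such as file0
            printf " %s", ((j,i) in a ? a[j,i] : 0)
        }
        print ""
    }
}
' file*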
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
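First, every value is stored in the array a, keyed by file number and key.
Every unique key is also recorded in the array b.
In the END block, each key is checked against each file: if a value exists it is printed, otherwise 0 is printed.
file0 is optional: without it, awk only shows the keys that exist in file1 through file4. Note that for (i in b) visits the keys in an unspecified order, so the rows are not guaranteed to follow the order of file0.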
Not awk, but this sort of joining of files on a common field is exactly what join is meant for. Complicated a bit by it only working with two files at a time; you have to pipe the results of each one into the next as the first file.
$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
| join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
| join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
| join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
| tr ' ' '\t'
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
Caveats: This requires a shell like bash or zsh that understands <(command) process substitution. Sorting all the files in advance is an alternative. Also, even though join normally requires its input files to be sorted on the join field, it happens to work without the sorts for this particular input.
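To see what the three join options do in isolation, here is a two-file sketch that avoids process substitution by using files that are already sorted. The file names keys.sorted and vals.sorted and their contents are made up for the demo.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'A\nC\nS\n' > keys.sorted      # key-only file, already sorted
printf 'A 2\nC 3\n' > vals.sorted     # values for some of the keys
# -a1      also print lines of the first file that have no partner
# -e0      fill fields that are missing with 0
# -o 0,2.2 output the join field, then field 2 of the second file
joined=$(join -a1 -e0 -o 0,2.2 keys.sorted vals.sorted)
printf '%s\n' "$joined"
```

The key S has no partner in vals.sorted, so -a1 keeps it and -e0 supplies the 0. Each further file in the pipeline adds one more entry to the -o list.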
With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:
parse.awk
BEGIN { OFS = "\t" }
# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }
# At the end of each file do the necessary padding
ENDFILE {
    for(k in h) {
        elems = split(h[k], a, OFS)
        if (elems != ARGIND)
            h[k] = h[k] OFS 0
    }
}
# Print the content of `h`
END {
    for(k in h)
        print h[k]
}
Run it like this:
awk -f parse.awk file[0-4]
Output:
AA 5 0 0 0
A 2 2 2 1
C 3 3 2 3
D 7 10 7 7
BC 4 0 0 0
CC 0 0 4 4
S 0 0 0 0
KKK 0 0 0 10
F 5 7 5 5
NB: This solution assumes you only have two columns per file (except the first one).
You could use coreutils join to determine missing fields and add them to each file:
sort file0 > file0.sorted
for file in file[1-4]; do
    {
        cat "$file"
        join -j 1 -v 1 file0.sorted <(sort "$file") | sed 's/$/ 0/'
    } | sort > "$file.sorted"
done
Now you just need to paste them together:
paste file0.sorted \
<(cut -d' ' -f2 file1.sorted) \
<(cut -d' ' -f2 file2.sorted) \
<(cut -d' ' -f2 file3.sorted) \
<(cut -d' ' -f2 file4.sorted)
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0

Count number of occurrences of a number larger than x in every row

I have a file with multiple rows and 26 columns. For each row, I want to count how many values are greater than 0 (counting values different from 0 would also be fine), excluding the first two columns. The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.
awk 'NR==1{printf "X\tY\tResult%s",ORS}   # print the header
NR>1{
    count=0                               # reset the count for each row
    for(i=3;i<=NF;i++){                   # fields 3..NF; NF is the number of fields
        if($i>0){                         # $i is the value of field i
            count++
        }
    }
    printf "%s\t%s\t%s%s",$1,$2,count,ORS # print X, Y and the count
}' file
This should do it.
another awk
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
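The title asks about values larger than some x, while the answers above hard-code 0. Here is a sketch that takes the threshold as a parameter. The function name count_gt is made up, and the sample input uses only the four sample columns shown in the question.

```shell
# count_gt X: read the table on stdin and, for each data row, count the
# fields from column 3 onwards whose numeric value is greater than X.
count_gt() {
  awk -v x="$1" '
    NR == 1 { print $1, $2, "Result"; next }   # replace the header
    {
      n = 0
      for (i = 3; i <= NF; i++)
        if ($i + 0 > x) n++                    # +0 forces a numeric comparison
      print $1, $2, n
    }'
}

sample='X Y Sample1 Sample2 Sample3 Sample4
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0'
gt0=$(printf '%s\n' "$sample" | count_gt 0)
gt2=$(printf '%s\n' "$sample" | count_gt 2)
printf '%s\n' "$gt0"
```

Unlike the gsub variant, this compares numbers rather than text, so it does not depend on the separator being a space or on values starting with a digit from 1 to 9.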

How to replace other column-entries when searching for a specific column in a file?

Suppose my file looks like this:
A 1 0
B 1 0
C 1 0
How can I search for the line that has B in the first column, and if so, switch the entries in the second and third column? So my final result would look like:
A 1 0
B 0 1
C 1 0
Try this:
vipin#kali:~$ awk '{if($1 == "B") {print $1,$3,$2} else print $1,$2,$3}' kk
A 1 0
B 0 1
C 1 0
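The one-liner above prints exactly three fields, so any further columns would be dropped. Here is a sketch that swaps the two fields in place and leaves the rest of the line alone. The fourth column in the sample input is made up to show that extra columns survive.

```shell
# Swap fields 2 and 3 only on lines whose first field is B;
# the trailing 1 is an always-true pattern that prints every line.
swapped=$(printf 'A 1 0 x\nB 1 0 y\nC 1 0 z\n' |
  awk '$1 == "B" { t = $2; $2 = $3; $3 = t } 1')
printf '%s\n' "$swapped"
```

Assigning to a field makes awk rebuild the line with OFS, so a tab-delimited file would additionally need -F'\t' -v OFS='\t' to stay tab-delimited.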

pasting files/multiple columns with different number of rows

Hi, I was trying to paste multiple files (each with a single column but a different number of rows) together, but the result was not what I expected. How can I solve that?
paste file1.txt file2.txt paste3.txt ... paste100 > out.txt
input file 1:
A
B
C
input file 2:
D
E
input file 3:
F
G
H
I
J
.......
......
Desired output:
A D F
B E G
C H
I
J
Would this be the same if the files have multiple columns with different numbers of rows?
For example:
file1
A 1
B 2
C 3
file2
D 4
E 5
file3
F 6 %
G 7 &
H 8 #
I 9 #
J 10 ?
output:
A 1 D 4 F 6 %
B 2 E 5 G 7 &
C 3 H 8 #
I 9 #
J 10 ?
Isn't the default behaviour of paste exactly what you ask?
% paste <(echo "a
b
c
d") <(echo "1
2
3") <(echo "10
20
30
40
50
60")
a	1	10
b	2	20
c	3	30
d		40
		50
		60
%
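For the multi-column part of the question, paste treats each input line as one unit, so the same approach carries over. Here is a sketch using the question's second example; the scratch directory is only for the demo.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'A 1\nB 2\nC 3\n' > file1
printf 'D 4\nE 5\n' > file2
printf 'F 6 %%\nG 7 &\nH 8 #\nI 9 #\nJ 10 ?\n' > file3
# paste joins line N of every file with a tab; a file that has run out
# of lines contributes an empty string, so the tab is still emitted.
pasted=$(paste file1 file2 file3)
printf '%s\n' "$pasted"
```

An exhausted file leaves a single empty slot, not one per original column, so line 3 comes out as "C 3", two tabs, then "H 8 #". For many inputs a glob such as paste file*.txt saves listing them, though the glob expands in lexical order (file10.txt sorts before file2.txt).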