Map tab-separated file with a list using awk/sed/join - awk

I have multiple large files with two columns (tab-delimited).
The content of these files are like this:
working_file-1
K00001 0.188
K00005 15.97
K00008 188.09
mapping_file
K00001
K00002
K00003
K00004
K00005
K00006
K00007
K00008
The mapping_file range is K00001 - K25804.
I want to map my working_file-1 to mapping file so that the output looks like this:
K00001 0.188
K00002
K00003
K00004
K00005 15.97
K00006
K00007
K00008 188.09
The blanks (K numbers not present in working_file-1) can be filled with zeros (if possible) or left blank.
So far, I have tried the following commands, based on other similar posts, but without success:
awk 'NR==FNR {a[$1]++; next} $1 in a' mapping_file working_file-1 > output.file
grep -Fw -f mapping_file working_file-1 > output.file
Edit: output of od -c work1; od -c map:
0000000 K 0 0 0 0 1 \r \n K 0 0 0 0 2 \r \n
0000020 K 0 0 0 0 3 \r \n K 0 0 0 0 4 \r \n
0000040 K 0 0 0 0 5 \r \n K 0 0 0 0 6 \r \n
0000060 K 0 0 0 0 7 \r \n K 0 0 0 0 8 \r \n
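Note the \r \n at the end of every line in the od dump: the files have DOS/Windows (CRLF) line endings, so the stray carriage returns become part of the keys, which is the usual reason exact matching with awk, grep -Fw, or join silently fails on files like these. A minimal cleanup sketch (tr -d and this sed substitution are standard; the .clean names are just placeholders):
tr -d '\r' < working_file-1 > work1.clean
tr -d '\r' < mapping_file > map.clean
# or, with GNU sed, strip the carriage returns in place:
# sed -i 's/\r$//' working_file-1 mapping_file
The answers below assume CR-free input.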

Given a "map" file like:
a
b
c
d
e
f
g
h
and a "work1" file like:
a A
c C
g G
and desired "merged1" output like:
a A
b
c C
d
e
f
g G
h
then join can do the combining:
join -1 1 -2 1 -a 1 -o 0,2.2 map work1 > merged1
-1 1 -2 1 joins on first (space-delimited) field of each file
-a 1 prints lines from first file even if unpaired
-o 0,2.2 formats output to be join field (first column), separator (space), then second field of second file
To produce instead "merged2" like:
a A
b 0
c C
d 0
e 0
f 0
g G
h 0
add the -e option:
join -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work1 > merged2
If "work2" has fields separated by a character other than single space, use the -t option.
For example, with "work2" using single tab delimiter like:
a A
c C
g G
(note: Stack Overflow displays this with spaces instead of a single tab) then with a POSIX shell, use -t '	' (i.e. QUOTE TAB QUOTE - may need to be typed as CTRL-V TAB if the shell performs tab completion):
join -t ' ' -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work2 >merged3
or with bash, using -t $'\t' is possible instead:
join -t $'\t' -1 1 -2 1 -a 1 -e 0 -o 0,2.2 map work2 >merged3
to produce "merged3" like:
a A
b 0
c C
d 0
e 0
f 0
g G
h 0
(note: again Stack Overflow displays the tab as spaces)
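Applied to the files in the question, that would be (a sketch, assuming tab-delimited, CR-free input and that both files are sorted on the key, as join requires):
join -t $'\t' -a 1 -e 0 -o 0,2.2 mapping_file working_file-1 > output.file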

Using GNU sort for -s (stable sort), this might be what you want; with -u only the first line seen for each key is kept, and since working_file-1 is listed first, its lines win over the bare keys from mapping_file:
$ sort -k1,1 -u -s working_file-1 mapping_file
K00001 0.188
K00002
K00003
K00004
K00005 15.97
K00006
K00007
K00008 188.09
or if you want to add 0s ($2+0 coerces a missing second field to the number 0):
$ sort -k1,1 -u -s working_file-1 mapping_file |
awk -v OFS='\t' '{print $1, $2+0}'
K00001 0.188
K00002 0
K00003 0
K00004 0
K00005 15.97
K00006 0
K00007 0
K00008 188.09
If you don't have GNU sort then you could do the following; the reverse numeric sort on the second column puts each key's data line (if any) ahead of the bare key line from mapping_file, and the awk keeps only the first line per key:
$ sort -k1,1 -k2,2rn working_file-1 mapping_file |
awk -v OFS='\t' '$1 != p{print $1, $2+0; p=$1}'
K00001 0.188
K00002 0
K00003 0
K00004 0
K00005 15.97
K00006 0
K00007 0
K00008 188.09

Extract substring from grep/awk results?

I have a grep command that finds rows in a file, passes those to awk and prints out the 1st and 15th columns.
grep String1 /path/to/file.txt | grep string2 | awk -F ' ' '{print $1, $15}'
So far, so good. This results in a list like this:
2023-01-20 [text1]>
2023-01-22 [text2]>
2023-01-23 [text3]>
2023-01-25 [text4]>
Ideally, I'd like to add some regex to the awk command so that I get this:
2023-01-20 text1
2023-01-22 text2
2023-01-23 text3
2023-01-25 text4
My searches have only returned how to use regex with awk to identify fields but not to extract a substring from the results. Is this possible with awk or some other command?
One awk idea that combines the current code with the new requirement:
awk -v s1="String1" -v s2="string2" ' # feed both search strings in as awk variables "s1" and "s2"
$0~s1 && $0~s2 { print $1,substr($15,2,index($15,"]")-2) } # if s1 and s2 are both present in the current line then print 1st field and 15th field (sans the "[" "]" wrappers)
' /path/to/file.txt
A non-sensical demo file:
$ cat file.txt
a b c d e f g h i j k l m n o p q r s t u v w x y z
a string2 c d e f g h i j k l m n [old]> p q r s t u v String1 x y z
a b c d e f g h i j k l m n o p q r s t u v w x y z
a String1 c d e f g h i j k l m n [older]> p q r s t u v string2 x y z
Running the awk script against this file generates:
a old
a older
If you basically just want to delete the characters [, ] and >, you can simply use tr -d for this, something like:
... | tr -d "[]>"
$ echo "2023-01-20 [text1]>" | tr -d "[]>"
2023-01-20 text1
Note that tr -d deletes those characters anywhere on the line, not just around the 15th field.
Another option, removing the leading [ and trailing ]> with gsub and an alternation:
awk '/String1/ && /string2/ {
    gsub(/^\[|\]>$/, "", $15)
    print $1, $15
}' file.txt
In gnu-awk you could use gensub:
awk '/String1/ && /string2/ {
    print $1, gensub(/^\[|\]>$/, "", "g", $15)
}' file
Or find the occurrences of the strings using index, which does a plain substring search and so does not treat String1 and string2 as regular expressions:
awk 'index($0, "String1") && index($0, "string2") {
    gsub(/^\[|\]>$/, "", $15)
    print $1, $15
}' file

Merge files print 0 in empty field

I have 5 tab delim files
file0 is basically a key
A
C
F
AA
BC
CC
D
KKK
S
file1
A 2
C 3
F 5
AA 5
BC 4
D 7
file2
A 2
C 3
F 7
D 10
file3
A 2
C 2
F 5
CC 4
D 7
file4
A 1
C 3
F 5
CC 4
D 7
KKK 10
I would like to merge all files based on the 1st column and print 0 in missing fields.
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
Columns must keep the order of input file0, file1, file2, file3, file4
I was going to wait until you included your own attempt in your question, but since you have 2 answers already anyway....
$ cat tst.awk
NR==FNR {                        # first file (file0): remember the key order
    key2rowNr[$1] = ++numRows
    rowNr2key[numRows] = $1
    next
}
FNR==1 { ++numCols }             # each new data file starts a new output column
{
    rowNr = key2rowNr[$1]
    vals[rowNr,numCols] = $2
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", rowNr2key[rowNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%d", OFS, vals[rowNr,colNr]   # %d prints missing values as 0
        }
        print ""
    }
}
$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
awk solution
awk '
FNR==1 { f++ }                   # f = current file number
{
    a[f,$1] = $2                 # value for this (file, key) pair
    b[$1]++                      # remember every key seen
}
END {
    for (i in b) {
        printf "%s ", i
        for (j=1; j<=f; j++) {
            if ((j,i) in a) {
                printf "%s ", a[j,i]
            } else {
                printf "0 "
            }
        }
        print ""
    }
}
' file*
output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
First I store every value under its file number and key in array a, then I store all unique keys in array b. In the END block I check, for each file number, whether the (file, key) pair exists in a: if it exists I print the value, otherwise I print 0.
If we delete file0 from the argument list, awk will only show the keys that exist in file1, file2, file3, file4, ...
Not awk, but this sort of joining of files on a common field is exactly what join is meant for. It's complicated a bit by join only working with two files at a time; you have to pipe the result of each join into the next one as its first file.
$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
| join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
| join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
| join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
| tr ' ' '\t'
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
Caveats: This requires a shell like bash or zsh that understands <(command) redirection. Sorting all the files in advance is an alternative. Or as pointed out, even though join normally requires its input files to be sorted on the column that's being joined on, it works anyways without the sorts for this particular input.
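With GNU join you can avoid hand-writing the growing -o lists (a sketch, assuming GNU join's -o auto, which sizes the output from the first line of each file, and bash for the <() redirection; "merged" and "tmp" are placeholder names):
sort file0 > merged
for f in file1 file2 file3 file4; do
    join -a1 -e0 -o auto merged <(sort "$f") > tmp && mv tmp merged
done
tr ' ' '\t' < merged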
With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:
parse.awk
BEGIN { OFS = "\t" }
# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }
# At the end of each file do the necessary padding
ENDFILE {
    for (k in h) {
        elems = split(h[k], a, OFS)
        if (elems != ARGIND)
            h[k] = h[k] OFS 0
    }
}
# Print the content of `h`
END {
    for (k in h)
        print h[k]
}
Run it like this:
awk -f parse.awk file[0-4]
Output:
AA 5 0 0 0
A 2 2 2 1
C 3 3 2 3
D 7 10 7 7
BC 4 0 0 0
CC 0 0 4 4
S 0 0 0 0
KKK 0 0 0 10
F 5 7 5 5
NB: This solution assumes you only have two columns per file (except the first one).
You could use coreutils join to determine the missing keys and append them to each file (join -v 1 prints the lines of the first file that have no match in the second; the sed appends the 0):
sort file0 > file0.sorted
for file in file[1-4]; do
    {
        cat "$file"
        join -j 1 -v 1 file0.sorted <(sort "$file") | sed 's/$/ 0/'
    } | sort > "$file.sorted"
done
Now you just need to paste them together:
paste file0.sorted \
<(cut -d' ' -f2 file1.sorted) \
<(cut -d' ' -f2 file2.sorted) \
<(cut -d' ' -f2 file3.sorted) \
<(cut -d' ' -f2 file4.sorted)
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0

Move last character of each line to new column

Which sed/awk command can I use to cut the last number (and its _ delimiter) off each string in column 1 and move it to a new column (column 3)?
For example,
$ head -3 test1.bed
HLA:HLA00001_A*01:01:01:01_3503 1
HLA:HLA02169_A*01:01:01:02N_3291 1
HLA:HLA14798_A*01:01:01:03_2903 1
Should become:
$ head -3 test1.bed
HLA:HLA00001_A*01:01:01:01 1 3503
HLA:HLA02169_A*01:01:01:02N 1 3291
HLA:HLA14798_A*01:01:01:03 1 2903
$ sed -E 's/(.*)_([0-9]+)(.*)/\1\3 \2/' file
HLA:HLA00001_A*01:01:01:01 1 3503
HLA:HLA02169_A*01:01:01:02N 1 3291
HLA:HLA14798_A*01:01:01:03 1 2903
The above will work with OSX sed and newer GNU seds where -E = Extended Regexps. With any sed:
$ sed 's/\(.*\)_\([0-9]*\)\(.*\)/\1\3 \2/' file
HLA:HLA00001_A*01:01:01:01 1 3503
HLA:HLA02169_A*01:01:01:02N 1 3291
HLA:HLA14798_A*01:01:01:03 1 2903
awk -F'[_ ]' '{print $1"_"$2,$4,$3}' file
HLA:HLA00001_A*01:01:01:01 1 3503
HLA:HLA02169_A*01:01:01:02N 1 3291
HLA:HLA14798_A*01:01:01:03 1 2903
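For comparison, a POSIX awk sketch using match() and RSTART, which anchors on the trailing _digits instead of counting separators:
awk '{
    if (match($1, /_[0-9]+$/)) {
        n = substr($1, RSTART+1)        # the digits after the last underscore
        $1 = substr($1, 1, RSTART-1)    # column 1 without the suffix
        print $1, $2, n
    }
}' file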

script to remove redundant lines from two different files

I will explain my problem with an example
I have the following files in Solaris
file1:
1 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A0 I have some text here
1 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U386.A1 I have some text here
2 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A0 I have some text here
2 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U246.A1 I have some text here
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
6 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A0 I have some text here
6 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U375.A1 I have some text here
7 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U98.A I have some text here
8 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A0 I have some text here
8 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U392.A1 I have some text here
9 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A0 I have some text here
10 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U372.A1 I have some text here
11 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U184.A I have some text here
12 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U97.B I have some text here
file2:
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1
INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1
INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0
INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1
Now I want to use file2 as a reference and print all the lines in file1 that match.
Expected output is:
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
I have tried grep:
grep -F -x -f file1 -v file2 > file3
and fgrep:
fgrep -x -f file1 -v file2 > file3
based on several posts from Stack Overflow, but didn't find what I need. Since I am a beginner, I am really confused about how to solve this. Your help is most appreciated.
This works for you:
grep -Ff file2 file1 >file3
test with your files:
kent$ grep -Ff f2 f1
3 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A0 I have some text here
3 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U385.A1 I have some text here
3 INST C 1 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U64.A1 I have some text here
4 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A0 I have some text here
4 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U384.A1 I have some text here
5 INST N 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A0 I have some text here
5 INST C 0 top.gbp.stg1.stg2.stg3.stg4.stg5.stg6.stg6.U390.A1 I have some text here
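One caveat: grep -F matches each file2 line as a fixed string anywhere in a file1 line. If the match needs to be anchored to whole fields, an awk sketch (assuming file2's four columns always correspond to fields 2-5 of file1):
awk 'NR==FNR { seen[$1,$2,$3,$4]; next }    # load file2 keys
     ($2,$3,$4,$5) in seen' file2 file1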

awk performance and CPU run time

I have an input file with millions of rows and thousands of columns/fields. Can anybody explain to me, why the two awk methods below, which yield the same output, differ so much in terms of CPU run time?
175.0 seconds:
awk 'BEGIN{FS=":| "}NR>1{field1=$1;field2=$2;$1="";$2="";print field1":"field2,field1":"field2,field2,$0}' file_in > file_out
19.7 seconds:
cat file_in | awk 'BEGIN{FS=":"}NR>1{print $1,$2}' | awk '{print $1":"$2,$1":"$2,$0}' | cut -d " " -f 3 --complement > file_out
Here are the 2nd and 3rd lines of one file_in with just hundreds of columns/fields (there is no line break within the lines):
1:1000071 C T 1 0 0 1 0 0
1:1000759 C T 1 0 0 0 1 0
Here are the corresponding lines of the file_out:
1:1000071 1:1000071 1000071 C T 1 0 0 1 0 0
1:1000759 1:1000759 1000759 C T 1 0 0 0 1 0
These 2 statements:
$1="";$2=""
are causing awk to rebuild each record ($0) twice. Given millions of rows and thousands of fields in each, I expect that would have an impact.
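A quick way to see that rebuild (a sketch): any field assignment makes awk reconstruct $0 from its fields, joined by OFS:
$ echo 'a   b   c' | awk '{ $1 = $1; print }'    # the runs of spaces collapse: $0 was rebuilt
a b c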
If you show us a couple of lines of representative sample input and expected output, we can show you how to do it concisely and efficiently.
It LOOKS like all you're doing is converting lines like this:
1:1000071 C T 1 0 ...
1:1000759 C T 1 0 ...
to lines like this:
1:1000071 1:1000071 1000071 C T 1 0 ...
1:1000759 1:1000759 1000759 C T 1 0 ...
if so, all you need to do is:
awk '{x=$1; sub(/[^:]+:/,x" "x" ")}1' file
or since this is a simple substitution on a single line, even sed could handle it:
sed 's/\([^:]*:\)\([^ ]*\)/\1\2 \1\2 \2/' file
Look:
$ cat file
1:1000071 C T 1 0 ...
1:1000759 C T 1 0 ...
$ awk '{x=$1; sub(/[^:]+:/,x" "x" ")}1' file
1:1000071 1:1000071 1000071 C T 1 0 ...
1:1000759 1:1000759 1000759 C T 1 0 ...
$ sed 's/\([^:]*:\)\([^ ]*\)/\1\2 \1\2 \2/' file
1:1000071 1:1000071 1000071 C T 1 0 ...
1:1000759 1:1000759 1000759 C T 1 0 ...
Ah, but I see you mentioned your sample input was from line 2 on so I guess you have a header line or something to skip over. That'd be:
awk 'NR>1{x=$1; sub(/[^:]+:/,x" "x" ");print}' file
sed -n '2,$s/\([^:]*:\)\([^ ]*\)/\1\2 \1\2 \2/p' file
Finally - here's an alternative awk solution that may be more efficient if your lines all start with "1:" as shown in your sample input:
awk 'NR>1{print $1, $1, substr($0,3)}' file
This remains the fastest solution; a likely reason is that the pipeline's stages run as separate processes and so can execute in parallel on a multi-core machine:
cat file_in | awk 'BEGIN{FS=":"}NR>1{print $1,$2}' | awk '{print $1":"$2,$1":"$2,$0}' | cut -d " " -f 3 --complement > file_out