Extracting information from lines having columns occurring more than once - awk

I have a file:
chr1 1234 2345 EG1234:E1
chr1 2350 2673 EG1234:E2
chr1 2673 2700 EG1234:E2
chr1 2700 2780 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6010 EG2345:E3
chr2 6010 6020 EG2345:E3
As you can see, each row has a specific ID before the ':' and a second ID after it, and the full ID in the 4th column may be shared by more than one row. I want output that looks something like this:
chr1 1234 2345 EG1234:E1 (output as it is since it doesn't have duplicate id in the next row)
chr1 2350 2780 EG1234:E2 (duplicated ID: 1st and 2nd columns from the first occurrence,
3rd and 4th columns from the last occurrence)
Similarly:
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6020 EG2345:E3
I was trying to use a key to move to the next column, but I am not quite sure how to extract the column-wise values:
awk '{key=$4; if (!(key in data)) c[++n]=key; data[key]=$0} END{for (i=1; i<=n; i++) print data[c[i]]}' file1
In short, I want to extract the first two columns from the first occurrence and the last two columns from the last occurrence of any rows with a duplicate 4th column.

This one does the merge, but it does not preserve the record order (for (i in a) visits the keys in an unspecified order):
($1 FS $4 in a) {                          # combination of $1 and $4 is the key
    split(a[$1 FS $4],b)                   # split to get the old $2
    a[$1 FS $4]=b[1] FS b[2] FS $3 FS b[4] # update $3
    next
}
{
    a[$1 FS $4]=$0                         # new key found
}
END {
    for(i in a)                            # print them all
        print a[i]
}
Test it:
$ awk -f foo.awk foo.txt
chr1 2350 2780 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6020 EG2345:E3
chr1 1234 2345 EG1234:E1
One-liner:
$ awk '($1 FS $4 in a) {split(a[$1 FS $4],b); a[$1 FS $4]=b[1] FS b[2] FS $3 FS b[4]; next} {a[$1 FS $4]=$0} END {for(i in a) print a[i]}' foo.txt
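If the input order matters, one possible approach is to record each ID the first time it appears and print in that order at the end. The following is only a sketch: the function name merge_by_id and the array names are illustrative, and it assumes column 4 alone identifies a group (the chromosome is taken from the first row of the group).

```shell
# merge_by_id: hypothetical helper; collapses rows that share column 4,
# keeping $1 and $2 from the first row and $3 from the last row of each ID.
merge_by_id() {
  awk '
    !($4 in start) {          # first time this ID is seen
      order[++n] = $4         # remember first-seen order
      chrom[$4]  = $1
      start[$4]  = $2
    }
    { stop[$4] = $3 }         # overwritten on every row, so the last row wins
    END {
      for (i = 1; i <= n; i++) {
        id = order[i]
        print chrom[id], start[id], stop[id], id
      }
    }
  ' "$@"
}

merge_by_id <<'EOF'
chr1 1234 2345 EG1234:E1
chr1 2350 2673 EG1234:E2
chr1 2673 2700 EG1234:E2
chr1 2700 2780 EG1234:E2
chr2 5672 5700 EG2345:E1
EOF
```

Like the answer above, this holds one entry per distinct ID in memory, which should be fine unless the number of IDs is very large.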

Using awk, you can treat key1:key2 as a unique key and use it to filter out duplicates. Here $4 holds the key1:key2 value from your file.
awk '!seen[$4]++' file
chr1 1234 2345 EG1234:E1
chr1 2350 2673 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6010 EG2345:E3
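The logic is straightforward: the line identified by key1:key2 is printed only if that key has not been seen already. Note that this keeps the first row of each key unchanged, so the end coordinate is not taken from the last occurrence.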
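For readers unfamiliar with the idiom, !seen[$4]++ can be spelled out long-hand. This is a sketch with an illustrative function name (first_per_key); it behaves the same way and prints only the first row for each value of column 4.

```shell
# first_per_key: hypothetical helper; long-hand form of awk '!seen[$4]++'
first_per_key() {
  awk '{
    if (seen[$4] == 0)   # an unset array element compares equal to 0
      print              # so only the first row with this key is printed
    seen[$4]++           # count every occurrence of the key
  }' "$@"
}

printf 'chr1 2350 2673 EG1234:E2\nchr1 2673 2700 EG1234:E2\n' | first_per_key
```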

Related

Subtracting from values ending with specific digits?

I have a .bed (.tsv) file that looks like this:
chr1 0 100000
chr1 100000 200000
chr1 200000 300000
chr1 300000 425234
I want to perform the operation -1 from only values in column 3 that end in "000", using sed or awk so that the output looks like:
chr1 0 99999
chr1 100000 199999
chr1 200000 299999
chr1 300000 425234
Embarrassingly enough, the best I've come up with is:
awk '{sub(/000$/,"999",$3); print $1,$2,$3}' oldfile > newfile
which simply substitutes 999 for the last 3 digits, rather than actually subtracting.
Any help is appreciated, as always!
Awk can easily perform arithmetic, too.
awk 'BEGIN{FS=OFS="\t"} $3 ~ /000$/ {$3 -= 1}1' oldfile > newfile
This is assuming all the lines in your file always have three fields and that you want to print all the lines.
sed has no idea about even the simplest arithmetic so it's not particularly suitable for this.
I would use GNU AWK for this as follows, let file.txt content be
chr1 0 100000
chr1 100000 200000
chr1 200000 300000
chr1 300000 425234
then
awk 'BEGIN{OFS="\t"}($3%1000==0){$3-=1}{print}' file.txt
output
chr1 0 99999
chr1 100000 199999
chr1 200000 299999
chr1 300000 425234
Explanation: Use tab (\t) as the output field separator (OFS). If the remainder from dividing $3 by 1000 is zero (i.e. $3 is a multiple of 1000), subtract 1 from $3. Print every line.
(tested in gawk 4.2.1)
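The regex test and the modulo test are not quite equivalent: $3 % 1000 == 0 is also true when $3 is 0, whereas /000$/ only matches values whose text ends in three zeros. A small sketch of the regex variant follows; the function name dec_round_ends is illustrative and the sample rows are made up to show the edge case.

```shell
# dec_round_ends: hypothetical helper; subtracts 1 from column 3
# only when its text ends in "000".
dec_round_ends() {
  awk 'BEGIN { FS = OFS = "\t" }
       $3 ~ /000$/ { $3 -= 1 }   # textual test: a lone 0 does not end in 000
       { print }' "$@"
}

printf 'chr1\t0\t0\nchr1\t0\t100000\nchr1\t300000\t425234\n' | dec_round_ends
```

With the modulo condition instead, the first row of this made-up input would come out with -1 in column 3, which may or may not be what you want.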

Cut Columns and Append to Same File

I'm working with a tab-separated file on macOS. The file contains 15 columns and thousands of rows. I want to cut columns 1, 2, and 3 and then append columns 11, 12, and 13 below them. I was hoping to do this in a pipe so that no extra files need to be created. The only post I found used the sponge command, but I evidently don't have that on macOS, or it isn't in my Bash.
The input tsv file is actually being generated within the same line of code,
arbitrary command to generate input.tsv | cut -f1-3,11-13 | <Step to cut -f4-6 and append -f1-3> | sort > out.file
Input tsv
chr1 21018 21101 A B C D E F G chr1 20752 21209
chr10 74645 74836 A B C D E F G chr10 74638 74898
chr10 75267 75545 A B C D E F G chr10 75280 75917
chr4 212478 212556 A B C D E F G chr4 212491 213285
Desired Output tsv
chr1 21018 21101
chr1 20752 21209
chr10 74638 74898
chr10 74645 74836
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Using perl and awk :
code
perl -pe 's/chr[0-9]+/\n$&/g' file | awk '/./{print $1, $2, $3}'
 Output
chr1 21018 21101
chr1 20752 21209
chr10 74645 74836
chr10 74638 74898
chr10 75267 75545
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Here is a short awk solution:
awk '{print $1, $2, $3; print $11, $12, $13}' input.tsv
Output:
chr1 21018 21101
chr1 20752 21209
chr10 74645 74836
chr10 74638 74898
chr10 75267 75545
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Explanation
{ # for each input line
print $1, $2, $3; # print the 1st, 2nd and 3rd fields, terminated with a newline
print $11, $12, $13; # print the 11th, 12th and 13th fields, terminated with a newline
}
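Since the input is tab-separated and arrives on a pipe, a variant that reads and writes tabs explicitly may be closer to what the question asks for. This is a sketch only: the function name split_pairs is illustrative, and it assumes the second interval really sits in columns 11 to 13 of the unmodified input, so it would replace both the cut step and the append step.

```shell
# split_pairs: hypothetical helper; emits columns 1-3 and columns 11-13
# of each tab-separated row as two separate tab-separated rows.
split_pairs() {
  awk 'BEGIN { FS = OFS = "\t" }
       { print $1, $2, $3          # first interval
         print $11, $12, $13 }     # second interval, on its own line
      ' "$@"
}

printf 'chr1\t21018\t21101\tA\tB\tC\tD\tE\tF\tG\tchr1\t20752\t21209\n' | split_pairs | sort
```

Because awk reads from standard input here, no intermediate file (and no sponge) is needed.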

manipulating columns in a text file in awk

I have a tab-separated text file and want to do a math operation on one column and write the result to a new tab-separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column: divide every element in the 4th column by the sum of all elements in the 4th column, then multiply by 1000000, as in the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
You need two passes: one to compute the sum, and a second one to scale the field (file{,} is Bash brace expansion for "file file", so awk reads the input twice).
Something like this:
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile
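When the data comes from a pipe and cannot be read twice, a single pass that buffers the rows is one alternative. The sketch below assumes the whole input fits in memory; the function name scale_col4 is illustrative.

```shell
# scale_col4: hypothetical helper; replaces column 4 with
# value / (sum of column 4) * 1000000, reading the input only once.
scale_col4() {
  awk 'BEGIN { FS = OFS = "\t" }
       { line[NR] = $0; sum += $4 }     # buffer every row, accumulate the total
       END {
         if (sum == 0) exit 1           # nothing to scale by; avoid dividing by zero
         for (i = 1; i <= NR; i++) {
           $0 = line[i]                 # re-split the stored row into fields
           $4 *= 1000000 / sum
           print
         }
       }' "$@"
}

# e.g.  some_command | scale_col4 > new_file.txt
```

The two-pass version uses almost no memory, so it is probably the better choice whenever the input is a regular file.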

Awk OR conditional not working

Input: A tab-separated input file with 15 columns where column 15 is an integer.
Output: The number of lines that satisfy the conditional.
My code:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000 || $15 >= -1000) print $0}' | wc -l
1188
The || conditional is failing to work here (the file has 1188 lines in total, and I know for certain that at least some lines do not satisfy the condition), because if I remove the OR and keep only the first comparison, it suddenly works:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000) print $0}' | wc -l
926
Not sure what I'm doing wrong. Any advice?
Example Input to Awk command:
chr1 378268 378486 chr1-798_Enhancer 17.2 + chr1 375923 376219 AluY|SINE|Alu-HOMER529 0 + E:375923 0.044 -2050
chr1 1079471 1079689 chr1-929_Enhancer 14.6 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -7909
chr1 1080259 1080477 chr1-830_Enhancer 16.7 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -8697
chr1 6611744 6611962 chr1-241_Enhancer 46.6 + chr1 6611431 6611723 AluSc|SINE|Alu-HOMER10257 0 + E:6611431 0.089 -22
chr1 6959639 6959857 chr1-58_Enhancer 100.1 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 6756
chr1 6960593 6960811 chr1-202_Enhancer 51.6 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 5802
chr1 7447888 7448106 chr1-2_Enhancer 181.9 - chr1 7449489 7449799 AluSz|SINE|Alu-HOMER11879 0 + E:7449489 0.119 1384
chr1 10752461 10752679 chr1-131_Enhancer 65.4 - chr1 10752754 10753065 AluSq2|SINE|Alu-HOMER19455 0 + E:10752754 0.106 76
chr1 12485694 12485912 chr1-353_Enhancer 36.7 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 1417
chr1 12486469 12486687 chr1-141_Enhancer 63.6 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 642
Use && instead: the value has to be greater than or equal to -1000 and less than or equal to 1000. With ||, every number satisfies at least one of the two comparisons, so every line is printed.
Your_command | awk '$15<=1000 && $15>=-1000{count++} END{print count+0}'
Add -F"\t" to the awk command above if its input is tab-delimited. There is also no need for wc -l after awk: the script counts the lines that satisfy the condition in a variable named count and prints it in the END block, after all input has been read (count+0 makes it print 0 when nothing matches).
For the sample you provided, the output is 3, which I believe is correct.
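The range check can also be written with an absolute value, which makes the intent ("within 1000 of zero") a little more explicit. This is a sketch: the function name count_within and its LIMIT argument are illustrative, and abs is defined in the script because awk has no built-in for it.

```shell
# count_within: hypothetical helper; usage: count_within LIMIT [file...]
# Counts rows whose 15th tab-separated column lies in [-LIMIT, LIMIT].
count_within() {
  limit=$1; shift
  awk -F '\t' -v limit="$limit" '
    function abs(x) { return x < 0 ? -x : x }
    abs($15) <= limit { count++ }
    END { print count + 0 }      # + 0 prints 0 rather than an empty line
  ' "$@"
}

# e.g.  Your_command | count_within 1000
```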

LINUX AWK command for a big file

I have encountered a problem that exceeds my basic unix knowledge and would really appreciate some help. I have a large file in the following format:
chr1 10495 10499 211
chr1 10496 10500 1
chr1 10587 10591 93
chr1 10588 10592 1
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10936 6
chr3 10933 10937 2
chr5 11056 11060 4
chr6 11155 11159 9
If the values in column one match and the values in column two differ by one, I want to sum the values in column 4 of both lines and replace the value of column 3 in line 1 with the value of column 3 in line 2; otherwise, just print the unique line without modifying any column.
So the output I am hoping for would look like this:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
$ cat tst.awk
BEGIN { OFS="\t" }
NR>1 {
    if ( ($1==p[1]) && ($2==(p[2]+1)) ) {
        print p[1], p[2], $3, p[4]+$4
        delete p[0]
        next
    }
    else if (0 in p) {
        print p[0]
    }
}
{ split($0,p); p[0]=$0 }
END { if (0 in p) print p[0] }
$
$ awk -f tst.awk file
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
Haven't checked closely, but I think you want:
awk '{split(p,a)}
$1==a[1] && a[2]==$2-1{print a[1], a[2], $3, $4 + a[4]; p=""; next}
p {print p} {p=$0}
END {if (p != "") print p}' OFS=\\t input
At any given step (except the first), p holds the previous line, or is empty if that line has just been merged. The 2nd line of the script checks whether the first field of the current line matches the first field of the previous line and whether the 2nd field is one greater than the 2nd field of the previous line. In that case, it prints the first two fields from the previous line, the third from the current line, and the sum of the 4th fields, and moves on to the next line. If they don't match, it prints the previous line. At the end, it prints the pending line, if there is one, so a final line that was already merged is not printed twice.
I use this script to merge intervals in transcriptome data:
awk '
NR==1{
    n= split($0, first);
    c=1;
    for(i=1; i<=n; i++) d[c, i] = first[i];
}
NR>1{
    n= split($0, actual);
    #if(actual[1] != d[c, 1] || actual[2]>d[c, 3]){ #for interval fusion
    if(actual[1] != d[c, 1] || actual[2]>d[c,2]+1){ #OP requirement
        c++;
        for(i=1; i<=n; i++) d[c, i] = actual[i];
    }else{
        if(actual[3] > d[c,3]) d[c,3] = actual[3];
        d[c,4] = d[c,4] + actual[4];
    }
}
END{
    for(i=1; i<=c; i++){
        print d[i, 1], d[i, 2], d[i, 3], d[i, 4]
    }
}' file
You get:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
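For a very large file, it may help to keep only the pending record in memory instead of storing every row until END. The following is a sketch of that idea under the question's rule (same column 1, column 2 exactly one greater than the start of the pending record); the function name merge_adjacent and the variable names are illustrative.

```shell
# merge_adjacent: hypothetical helper; streams through the input and
# keeps a single pending record, so memory use does not grow with file size.
merge_adjacent() {
  awk 'BEGIN { OFS = "\t" }
       NR > 1 && $1 == chr && $2 == start + 1 {   # same chromosome, start one greater
         stop = $3; total += $4                   # extend the pending record
         next
       }
       NR > 1 { print chr, start, stop, total }   # flush the pending record
       { chr = $1; start = $2; stop = $3; total = $4 }
       END { if (NR) print chr, start, stop, total }' "$@"
}

printf 'chr1 10495 10499 211\nchr1 10496 10500 1\nchr1 10587 10591 93\n' | merge_adjacent
```

Unlike the first answer, this rewrites every line with tab separators, including lines that were not merged.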