awk to find and output differences in files - awk

I am trying to find the differences between file1.txt and file2.txt and output them. I tried diff and sed, but the output did not return any differences. I also tried awk, matching on $2, but I think the syntax is wrong since a file gets created but it is 0 kB. The actual data I am using is quite large, but I know there should be 18 differences. Thank you :).
awk 'NR==FNR{a[$2]++;next} !($2 in a){print $2}' file1.txt file2.txt > diff.txt
file1.txt
chr1 955542 955763
chr1 957570 957852
chr1 976034 976270
file2.txt
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852 + AGRN:exon.2
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
chr1 976542 976787 + AGRN:exon.3;AGRN:exon.5
chr1 976847 977092 + AGRN:exon.6
Desired output
chr1 976542 976787 + AGRN:exon.3;AGRN:exon.5
chr1 976847 977092 + AGRN:exon.6
Diff result (since these are the two records that are not in both files)
1,52058c1,52040
< chr1 955542 955763
< chr1 957570 957852
< chr1 976034 976270

I'm curious why diff isn't working the way you want, but your awk logic isn't correct:
You're checking only the value of the second field (space-delimited). In the sample below the second fields all match between the two files, so nothing is printed out. Using the whole line instead works as expected.
Using your example text, where every line differs:
$ cat file1.txt
chr1 955542 955763
chr1 957570 957852
chr1 976034 976270
$ cat file2.txt
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852 + AGRN:exon.2
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
$ awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file1.txt file2.txt > diff.txt
$ cat diff.txt
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852 + AGRN:exon.2
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
Here's a run with the second line left identical, just to show it working in a more obvious way.
$ cat file1.txt
chr1 955542 955763
chr1 957570 957852
chr1 976034 976270
$ cat file2.txt
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
$ awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file1.txt file2.txt > diff.txt
$ cat diff.txt
chr1 955542 955763 + AGRN:exon.1
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
EDIT
Based on a comment stating:
"There should be 18 differences out of 52,000 lines. File1.txt is 52,058 entries and file2.txt has 52,040 entries in it. I am trying to find out what the 18 are"
Given that file1 has more lines, you need to process file2 first. The first file read populates the array, and the second is then checked against it, so the smaller file has to go first; otherwise the extra lines you're interested in would already be in the array. It's the same logic as above, just with the file order switched, e.g.:
$ cat file1.txt
chr1 955542 955763
chr1 957570 957852
chr1 976034 976270
New Line!
Not in file2!
$ cat file2.txt
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
$ awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file2.txt file1.txt > diff.txt
$ cat diff.txt
chr1 955542 955763
chr1 976034 976270
New Line!
Not in file2!
$ awk 'NR==FNR{a[$0]++;next} !($0 in a){print $0}' file1.txt file2.txt > diff.txt
$ cat diff.txt
chr1 955542 955763 + AGRN:exon.1
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
Note that reading file1 first doesn't emit the additional lines.
If you don't care about the additional text on the lines, just the text in the second field, then you could use $2 as you originally did.

$ awk 'NR==FNR{a[$2];next} !($2 in a)' file1 file2
chr1 976542 976787 + AGRN:exon.3;AGRN:exon.5
chr1 976847 977092 + AGRN:exon.6
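For reference, here is the whole thing as a self-contained run you can paste into a shell (the heredocs just recreate the question's sample files in a scratch directory):

```shell
cd "$(mktemp -d)"          # work in a scratch directory

cat > file1.txt <<'EOF'
chr1 955542 955763
chr1 957570 957852
chr1 976034 976270
EOF

cat > file2.txt <<'EOF'
chr1 955542 955763 + AGRN:exon.1
chr1 957570 957852 + AGRN:exon.2
chr1 976034 976270 + AGRN:exon.2;AGRN:exon.3;AGRN:exon.4
chr1 976542 976787 + AGRN:exon.3;AGRN:exon.5
chr1 976847 977092 + AGRN:exon.6
EOF

# Key on $2 only: print file2 lines whose second field never occurs in file1.
awk 'NR==FNR{a[$2];next} !($2 in a)' file1.txt file2.txt
# chr1 976542 976787 + AGRN:exon.3;AGRN:exon.5
# chr1 976847 977092 + AGRN:exon.6
```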

Related

awk to calculate difference between two files and output specific text based on value

I am trying to use awk to check whether each $2 in file1 falls between $2 and $3 of the matching-$4 line of file2. If it does, $5 of the output is exon; if it does not, intron. I think the awk below will do that, but I am struggling to add a calculation: if the difference is less than or equal to 10, then $5 should be splicing. I have added an example of line 1 as well.
The 6th line is an example of splicing, because the $2 value in file1 is within 10 of the $2 value in file2. My actual data is very large, with file2 always being several hundred thousand lines. File1 will be variable but usually ~100 lines. The files are hardcoded in this example but will come from a bash for loop that provides the input. Thank you :).
file1 tab-delimited with whitespace after $3 and $4
chr1 17345304 17345315 SDHB
chr1 17345516 17345524 SDHB
chr1 93306242 93306261 RPL5
chr1 93307262 93307291 RPL5
chrX 153295819 153296875 MECP2
chrX 153295810 153296830 MECP2
file2 tab-delimited
chr1 17345375 17345453 SDHB_cds_0_0_chr1_17345376_r 0 -
chr1 17349102 17349225 SDHB_cds_1_0_chr1_17349103_r 0 -
chr1 17350467 17350569 SDHB_cds_2_0_chr1_17350468_r 0 -
chr1 17354243 17354360 SDHB_cds_3_0_chr1_17354244_r 0 -
chr1 17355094 17355231 SDHB_cds_4_0_chr1_17355095_r 0 -
chr1 17359554 17359640 SDHB_cds_5_0_chr1_17359555_r 0 -
chr1 17371255 17371383 SDHB_cds_6_0_chr1_17371256_r 0 -
chr1 17380442 17380514 SDHB_cds_7_0_chr1_17380443_r 0 -
chr1 93297671 93297674 RPL5_cds_0_0_chr1_93297672_f 0 +
chr1 93298945 93299015 RPL5_cds_1_0_chr1_93298946_f 0 +
chr1 93299101 93299217 RPL5_cds_2_0_chr1_93299102_f 0 +
chr1 93300335 93300470 RPL5_cds_3_0_chr1_93300336_f 0 +
chr1 93301746 93301949 RPL5_cds_4_0_chr1_93301747_f 0 +
chr1 93303012 93303190 RPL5_cds_5_0_chr1_93303013_f 0 +
chr1 93306107 93306196 RPL5_cds_6_0_chr1_93306108_f 0 +
chr1 93307322 93307422 RPL5_cds_7_0_chr1_93307323_f 0 +
chrX 153295817 153296901 MECP2_cds_0_0_chrX_153295818_r 0 -
chrX 153297657 153298008 MECP2_cds_1_0_chrX_153297658_r 0 -
chrX 153357641 153357667 MECP2_cds_2_0_chrX_153357642_r 0 -
desired output tab-delimited
chr1 17345304 17345315 SDHB intron
chr1 17345516 17345524 SDHB intron
chr1 93306242 93306261 RPL5 intron
chr1 93307262 93307291 RPL5 intron
chrX 153295819 153296875 MECP2 exon
chrX 153295810 153296800 MECP2 splicing
awk
awk '
FNR==NR{
    a[$4]
    min[$4]=$2
    max[$4]=$3
    next
}
{
    split($4,array,"_")
    print $0, ((array[1] in a) && $2>=min[array[1]] && $2<=max[array[1]]) ? "exon" : "intron"
}' file1 OFS="\t" file2 > output
example of line 1
a[$4] = SDHB
min[$4] = 17345304
max[$4] = 17345315
array[1] = SDHB, 17345304 >= 17345375 && array[1] = SDHB, 17345315 <= 17345453 ---- intron
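No answer was posted for this one in the thread, but here is a sketch of one approach (my own, not from the thread): read file2 first and store every interval per gene, since a single min[$4]/max[$4] as above overwrites all but the last interval per gene. The gene name is taken as the part of file2's $4 before the first underscore, and "splicing" is read as being within 10 of the nearest exon boundary. Shown on a trimmed-down copy of the sample data:

```shell
cd "$(mktemp -d)"          # scratch directory; trimmed copies of the sample files

cat > file1 <<'EOF'
chr1 17345304 17345315 SDHB
chrX 153295819 153296875 MECP2
chrX 153295810 153296830 MECP2
EOF

cat > file2 <<'EOF'
chr1 17345375 17345453 SDHB_cds_0_0_chr1_17345376_r 0 -
chrX 153295817 153296901 MECP2_cds_0_0_chrX_153295818_r 0 -
EOF

awk -v OFS='\t' '
FNR==NR {                               # file2: store every exon interval per gene
    split($4, f, "_"); g = f[1]         # gene = $4 up to the first "_"
    n[g]++
    lo[g, n[g]] = $2
    hi[g, n[g]] = $3
    next
}
{                                       # file1: classify each position by its $2
    label = "intron"
    for (i = 1; i <= n[$4]; i++) {
        if ($2 >= lo[$4, i] && $2 <= hi[$4, i]) { label = "exon"; break }
        if ((lo[$4, i] - $2 > 0 && lo[$4, i] - $2 <= 10) ||
            ($2 - hi[$4, i] > 0 && $2 - hi[$4, i] <= 10))
            label = "splicing"          # within 10 of an exon boundary
    }
    print $0, label
}' file2 file1
```

With file2 in the hundreds of thousands of lines, the per-gene interval lists stay small, so the inner loop remains cheap.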

manipulating columns in a text file in awk

I have a tab-separated text file and want to do a math operation on one column, producing a new tab-separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column: divide every element of the 4th column by the sum of all elements in that column, then multiply by 1000000, as in the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
You need two passes: one to compute the sum, then one to scale the field. Something like this:
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile
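A note and a self-contained check: file{,} is bash brace expansion for "file file", i.e. the same file is read twice, once per pass. With the question's numbers the column-4 sum is 12+12+12+14+14 = 64, and 12/64 * 1000000 = 187500 while 14/64 * 1000000 = 218750, matching the expected output (spaces stand in for tabs in the heredoc below):

```shell
cd "$(mktemp -d)"          # scratch directory

cat > file <<'EOF'
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
EOF

# Pass 1 (NR==FNR) accumulates the column-4 sum; pass 2 rescales the field.
# Assigning to $4 makes awk rebuild the line joined with OFS (tabs).
awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
                 {$4*=(1000000/sum)}1' file file > newfile
cut -f4 newfile
# -> 187500, 187500, 187500, 218750, 218750 (one per line)
```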

awk split carrying-over whitespace

The awk split below appears to leave the whitespace after $4 in the output, and I cannot seem to prevent it. What is the correct syntax? Thank you :).
input
chr1 955543 955763 + AGRN-6|pr=2|gc=75
chr1 957571 957852 + AGRN-7|pr=3|gc=61.2
chr1 970621 970740 + AGRN-8|pr=1|gc=57.1
Current output
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
gawk '{print gensub(/(^[^|]+)\|[^|]+([|][^+]+).*/,"\\1\\2","g",$0)}' input
edit
chr1^I955543^I955763^I+ AGRN-6|gc=75$
chr1^I957571^I957852^I+ AGRN-7|gc=61.2$
chr1^I970621^I970740^I+ AGRN-8|gc=57.1$
desired
chr1^I955542^I955662^I+^IAGRN_70$
chr1^I955643^I955763^I+^IAGRN_71$
chr1^I957570^I957690^I+^IAGRN_72$
Another curious awk alternative:
awk '{print $1""$2}' FS='pr=[0-9]\\|' file
Results
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
Explanation
The value of FS can be any regex, so we can use pr=[0-9]\| as the separator and print the fields before and after it.
awk will rewrite the line with the specified OFS. If you want to preserve the input spacing you can choose a simpler solution with sed
sed -r 's/\|.*\|/\|/' file
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
awk '{n=split($5, a, "|"); print $1,$2,$3,$4" "a[1]"|"a[3]}' OFS="\t" input
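That last one deserves a word of explanation: split() breaks $5 on "|" into a[1] ("AGRN-6"), a[2] ("pr=2"), and a[3] ("gc=75"), and the print simply skips a[2]; joining $4 and a[1] with a literal space reproduces the question's current output shape. As a self-contained check:

```shell
cd "$(mktemp -d)"          # scratch directory

cat > input <<'EOF'
chr1 955543 955763 + AGRN-6|pr=2|gc=75
chr1 957571 957852 + AGRN-7|pr=3|gc=61.2
chr1 970621 970740 + AGRN-8|pr=1|gc=57.1
EOF

# split $5 on "|" and rejoin only the first and third pieces
awk '{n=split($5, a, "|"); print $1,$2,$3,$4" "a[1]"|"a[3]}' OFS="\t" input
```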

How to pull out all lines of a file matching each line from another file and output into separate rows?

This is a similar question to one previously asked (see link below), but this time I would like to output the common strings into rows instead of columns, as shown below:
I have two files, each with one column that look like this:
File 1
chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560
File 2
chr1 1066234
chr1 106994
chr1 1106115
I want to search file 1 and pull out all lines that are an exact match with line 1 of file 2 and output all matches on their own line. Then I want to do the same for line 2 of file 2, and so on, until all matches of file 2 have been found in file 1 and output to their own rows. Also, I am working with very large files, so I need something that won't require file 2 to be completely stored in memory; otherwise it will not run to completion. Hopefully the output will look something like this:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
Similar question at:
How to move all strings in one file that match the lines of another to columns in an output file?
As long as your patterns don't completely overlap, this should work:
$ while read p; do grep "$p" file1 | tr '\n' '\t'; echo ""; done < file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
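A self-contained version of that loop (the ^ anchor is my addition, so a pattern only matches at the start of a line rather than anywhere in it):

```shell
cd "$(mktemp -d)"          # scratch directory; shortened copies of the sample files

printf '%s\n' 'chr1 106623434' 'chr1 106623436' 'chr1 10699400' > file1
printf '%s\n' 'chr1 1066234' 'chr1 106994' > file2

# One grep per pattern; tr folds each pattern's matches onto one tab-joined row.
while read -r p; do grep "^$p" file1 | tr '\n' '\t'; echo ""; done < file2
```

Note that each row ends with a trailing tab, which may or may not matter downstream.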
You could do this as it uses close to zero memory but it'll be very slow since it reads the whole of "file1" once for every line of "file2":
$ cat tst.awk
{
ofs = ors = ""
while ( (getline line < "file1") > 0) {
if (line ~ "^"$0) {
printf "%s%s", ofs, line
ofs = "\t"
ors = "\n"
}
}
printf ors
close("file1")
}
$ awk -f tst.awk file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
You can try:
awk -vOFS="\t" '
NR==FNR{ #only file2
keys[++i]=$0; #'keys' store pattern to search ('i' contains number of keys)
next; #stop processing the current record and
#go on to the next record
}
{
for(j=1; j<=i; ++j)
#if line start with key then add
if($0 ~ "^"keys[j])
a[keys[j]] = a[keys[j]] (a[keys[j]]!=""?OFS:"") $0;
}
END{
for(j=1; j<=i; ++j) print a[keys[j]]; #print formating lines
}' file2 file1
you get:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560

awk to count lines in column of file

I have a large file and want to use awk to count the unique entries in a specific column, $5, before the :, but I seem to be having trouble getting the syntax correct. Thank you :).
Sample Input
chr1 955542 955763 + AGRN:exon.1 1 0
chr1 955542 955763 + AGRN:exon.1 2 0
chr1 955542 955763 + AGRN:exon.1 3 0
chr1 955542 955763 + AGRN:exon.1 4 1
chr1 955542 955763 + AGRN:exon.1 5 1
awk -F: ' NR > 1 { count += $5 } -uniq' Input
Desired output
1
$ awk -F'[ \t:]+' '{a[$5]=1;} END{for (k in a)n++; print n;}' Input
1
-F'[ \t:]+'
This tells awk to use spaces, tabs, or colons as the field separator.
a[$5]=1
As we loop through each line, this adds an entry into associative array a for each value of $5 encountered.
END{for (k in a)n++; print n;}
After we have finished reading the file, this counts the number of keys in associative array a and prints the total.
The idiomatic, portable awk approach:
$ awk '{sub(/:.*/,"",$5)} !seen[$5]++{unq++} END{print unq}' file
1
The briefer but gawk-only (courtesy of length(array)) approach, again stripping the :... suffix first so only the gene name is counted:
$ awk '{sub(/:.*/,"",$5); seen[$5]} END{print length(seen)}' file
1
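As a sanity check with a hypothetical second gene (NRXN1) added, so the expected count is 2:

```shell
cd "$(mktemp -d)"          # scratch directory

cat > file <<'EOF'
chr1 955542 955763 + AGRN:exon.1 1 0
chr1 955542 955763 + AGRN:exon.1 2 0
chr1 957570 957852 + NRXN1:exon.1 3 0
EOF

# Strip everything from the ":" in $5, then count the distinct remainders.
awk '{sub(/:.*/,"",$5)} !seen[$5]++{unq++} END{print unq}' file
# -> 2
```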