How to use conditional expression to select data? - awk

I have a table like this:
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
And I want to extract data under the following conditions:
1) keep only the lines whose value in the "refseq" column is not missing
2) for the "start" and "stop" columns, keep only one value per line: if the value in the "strand" column is "+", take the value in "start"; if it is "-", take the value in "stop".
And this is what I expect:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -

I would be very tempted to leave the input delimiter unmodified so blanks and tabs separate fields, rather than insisting on tabs only. That means you want records after the first (to skip the headings line) that have six fields:
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
If you want to control the output format more, you can dink with OFS, or use printf:
awk 'BEGIN { OFS = "\t" }
NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5;
printf "%-8s %-12s %s %9s\n", $1, $2, $3, x; }'
There are other ways to handle it, I'm sure...
The first script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
The content is correct, I believe; the formatting can be improved in many ways. The last script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
You can tweak field widths as necessary.
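If the strand column should also appear, as in the expected output in the question, a minimal sketch along the same lines could look like this. The sample rows are inlined through a here-document so it runs on its own, and the names `result` and `pos` are only for illustration.

```shell
# Sample input inlined so the sketch is self-contained; fields are separated by blanks.
result=$(awk 'NR > 1 && NF == 6 {
    # Pick start for "+" strands and stop for "-" strands.
    pos = ($6 == "+") ? $4 : $5
    print $1, $2, $3, pos, $6
}' <<'EOF'
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Gbx2 chr1 91824537 91827751 -
EOF
)
printf '%s\n' "$result"
```

The Gbx2 row has only five blank-separated fields, so the NF == 6 test drops it.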

This might work for you (GNU sed):
sed -r '1d;/(\S+\s+){5}\S+/!d;/\+$/s/\S+\s+//5;/-$/s/\S+\s+//4' file
EDIT:
1d delete the header line
/(\S+\s+){5}\S+/!d; if the line does not have 6 fields delete it
/\+$/s/\S+\s+//5 if the line ends in + delete the 5th field
/-$/s/\S+\s+//4 if the line ends in - delete the 4th field

quick and dirty, pls check if it works:
awk -F'\t' 'NR>1&&$2{print $NF=="+"?$4:$5}' file
output:
184695027
191552147
196765080
31055753
77511663
87495392
108610879
if you want other values in output too:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{print $1,$2,$3,$NF=="+"?$4:$5}' file
output:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
EDIT, adjust format to OP's output example:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{$4=$NF=="+"?$4:" ";$5=$NF=="+"?" ":$5;print}' file
output:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
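For a strictly tab-delimited file, here is a hedged sketch that tests the refseq field directly instead of counting fields. It assumes a missing refseq shows up as an empty field between two tabs; the temporary file and the variable names are made up for the example. The conditional expression is parenthesized inside print, which avoids relying on how a particular awk parses an unparenthesized `?:` in a print list.

```shell
# Build a small tab-delimited sample; the Gbx2 row has an empty refseq field.
sample=$(mktemp)
printf 'symbol\trefseq\tseqname\tstart\tstop\tstrand\n' > "$sample"
printf 'Cd34\tNM_001111059\tchr1\t196765080\t196787475\t+\n' >> "$sample"
printf 'Gbx2\t\tchr1\t91824537\t91827751\t-\n' >> "$sample"
printf 'Bcl2\tNM_009741\tchr1\t108434770\t108610879\t-\n' >> "$sample"

# Skip the header, keep rows with a non-empty refseq, and choose start or stop by strand.
result=$(awk 'BEGIN { FS = OFS = "\t" }
NR > 1 && $2 != "" { print $1, $2, $3, ($6 == "+" ? $4 : $5), $6 }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```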

When you deal with a text file with fields, awk is usually better than sed because awk was designed to help parse text files with fields.
How are the columns in your table set up? Are they tab-delimited, or do you use spaces to help line up the columns?
If this is a tab-delimited table, you could use awk to check if the second field is empty:
awk -F'\t' '
{
if ($2 == "") {
print "Missing refseq in symbol " $1
}
}
' "$myfile"
If your file uses spaces to align the various fields, you can still use awk by using its built-in substr function.
awk '
{
if (substr($0, 9, 12) ~ /^ *$/) {
print "Missing refseq in symbol " substr($0, 1, 7)
}
}
' "$myfile"
By the way, I'm being rather wordy here to show you the syntax to make it understandable. I could have used a few shortcuts to put these on one line:
awk -F'\t' '$2 == "" {print "Missing refseq in symbol " $1}' "$myfile"
awk 'substr($0, 9, 12) ~ /^ *$/ {print "Missing refseq in symbol " substr($0, 1, 7) }' "$myfile"
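To make the fixed-width case concrete, here is a small sketch. The column layout (symbol in columns 1-8, refseq starting at column 9) is an assumption for the example, not something taken from the real file, so the offsets passed to substr would need adjusting.

```shell
# Fixed-width sample: symbol in columns 1-8, refseq in columns 9-21 (assumed layout).
sample=$(mktemp)
printf '%-8s%-13s%s\n' Susd4 NM_144796 'chr1 184695027 184826500 +' > "$sample"
printf '%-8s%-13s%s\n' Gbx2 '' 'chr1 91824537 91827751 -' >> "$sample"

# Report rows whose refseq columns contain only blanks.
result=$(awk 'substr($0, 9, 12) ~ /^ *$/ {
    symbol = substr($0, 1, 8)
    sub(/ +$/, "", symbol)      # trim the padding after the symbol
    print "Missing refseq in symbol " symbol
}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The anchored pattern /^ *$/ matters here: without the trailing $ the test would match every line.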

Related

awk - Duplicate, Rearrange and Add columns to file

I would like to :
duplicate 2 columns
switch position between columns
add a variable (as: 0,128,128) in a column at the end of the file
My file :
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
....continue
My code :
cat FILE.txt | awk 'BEGIN { FS=" "; OFS="\t" } { print $1, $2=$2 "\t" $2, $3=$3 "\t" $3, $4, $5, $6 }' | awk 'BEGIN { FS="\t"; OFS="\t" } { print $1, $2, $4, $6, $7, $8, $3, $5 "\t" "0,128,128" }' > FILE.bed
My out-put :
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
...continue
Error: the duplicated columns and the added one end up on a row below.
What I would like to obtain:
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
.....continue
What am I doing wrong?
Am I missing NR or paste0?
If the code in your question really puts the duplicated values on a new row, then your input has DOS line endings (or something similar), because the code itself will not do that.
See Why does my tool output overwrite itself and how do I fix it? for how to handle DOS line endings and then this is all you really need instead of the script in your question:
$ awk -v OFS='\t' '{print $0, $2, $3, "0,128,128"}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
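If converting the file first is not convenient, one possible sketch is to strip the carriage return inside awk itself. This assumes the only problem is a trailing CR on each line; the temporary file is just a stand-in for the real input.

```shell
# Create a sample with DOS (CRLF) line endings to mimic the suspected input.
sample=$(mktemp)
printf 'chr1 3006607 3006623 Class 0 +\r\n' > "$sample"

# Strip a trailing carriage return before the fields are used.
result=$(awk -v OFS='\t' '{ sub(/\r$/, ""); print $0, $2, $3, "0,128,128" }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Modifying $0 with sub makes awk re-split the record, so the last field no longer carries the CR that pushed the appended columns onto what looked like a new row.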
You may use:
awk -v OFS='\t' -v s='0,128,128' '{$1=$1; print $0, $2, $3, s}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
$1=$1 is to force $0 to be reformatted with tab as field separator.
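A quick way to see the effect of that assignment, as a sketch with one made-up input line: without it, $0 keeps its original blanks and only the appended value is joined with a tab; with it, the whole record is rebuilt using OFS.

```shell
line='chr1 3006607 3006623 Class 0 +'

# Without $1 = $1: the original record is printed untouched, then a tab, then $2.
without=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ print $0, $2 }')

# With $1 = $1: assigning to a field rebuilds $0 with OFS between all fields.
with=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print $0, $2 }')

printf '%s\n%s\n' "$without" "$with"
```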

modifying the text file in awk

I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 53 955 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
the expected output for the small example is:
chr1 HAVANA transcript 11998 12060 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 41 103 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
In the input file there are different lines; each line starts with chr. Every line has several columns, and the separators are either tab or ";".
I want to make a new file from this one in which only columns 4 and 5 change: column 4 in the new file should be (column 4 in the original file) - 12, and column 5 in the new file should be (column 4 in the original file) + 50. The numbers in the 4th and 5th columns are the only difference between the input file and the output file.
I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
When I run the code, it returns this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
Do you know how to fix it? I want to get an output file with exactly the same format as the input file, meaning the same delimiters.
There is no need to output every single column individually; it's enough to modify the existing data and then print the modified line.
awk -F '\t' '{ col4 = $4; $4 = col4 - 12; $5 = col4 + 50; print }' OFS='\t' file
This modifies the fourth and fifth tab-delimited column before printing the whole line.
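Here is the same idea run against one sample line, as a sketch. The attribute column is shortened for readability, and the temporary file and the name `start` are illustrative only. Saving the original fourth column first matters, because after $4 is reassigned, $4 + 50 would be computed from the new value.

```shell
# One tab-delimited sample line with a shortened attribute column.
sample=$(mktemp)
printf 'chr1\tHAVANA\ttranscript\t12010\t13670\t.\t+\t.\tgene_id "ENSG00000223972.4"; gene_name "DDX11L1";\n' > "$sample"

# Keep the original start, then derive both new columns from it.
result=$(awk 'BEGIN { FS = OFS = "\t" } { start = $4; $4 = start - 12; $5 = start + 50; print }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because FS and OFS are both a tab, the blanks and semicolons inside the ninth column are left exactly as they were.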

error when editing big text file using awk [duplicate]

This question already has an answer here:
modifying the text file in awk
(1 answer)
Closed 4 years ago.
I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr3 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
In the file there are different lines; each line starts with chr. Every line has several columns, and the separators are either tab or ";".
I want to make a new file from this one in which only columns 4 and 5 change: column 4 in the new file should be (column 4 in the original file) - 12, and column 5 in the new file should be (column 4 in the original file) + 50. I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
When I run the code, it returns this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
do you know how to fix it?
try this
awk ' {print $1 "\t" $2"\t"$3"\t"($4 - 12)"\t" ($4 + 50)"\t" $6 "\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18""$19" "$20""$21" "$22";"$23" "$24""$25" "$26""$27" "$28""$29" "$30""$31" "$32""$33" "$34""$35" "$36""$37" "$38"" }' input.txt > output.txt
What you basically do is:
awk '{print $1=$1+2 ";" $2=$1+2}' < input
What you need to do is:
awk '{print ($1=$1+2) ";" ($2=$1+2)}' < input
^^^ Note that you need to parenthesize your inline assignments.
Or just do the assignments before printing:
awk '{$1=$1+2;$2=$1+2;print ... }' < input
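A runnable sketch of the last variant, with a made-up two-field input line. It also shows that the second assignment sees the already updated $1, which is why the question's approach needs to save the original column 4 before changing it.

```shell
# Assign first, then print; $2 is computed from the updated $1 (10 + 2 = 12, then 12 + 2 = 14).
result=$(printf '10 0\n' | awk '{ $1 = $1 + 2; $2 = $1 + 2; print $1 ";" $2 }')
printf '%s\n' "$result"
```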

awk to match field between two files and use conditions on match

I am trying to look for $2 of file1 (skipping the header) in $2 of file2 and, if they match and the value in $10 is > 30 and $11 is > 49, print the line to an output file. The awk below has syntax errors in it, though shellcheck didn't report any. Both the input and output are tab-delimited. I think the below is close, but I am not sure what is wrong. Thank you :).
awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A
{if($10 >.5 OFS $11 > 49)
print ; next
' file1 file2
awk: cmd. line:2: {if($10 >.5 OFS $11 > 49)
awk: cmd. line:2: ^ syntax error
awk: cmd. line:3: print ; next
awk: cmd. line:3: ^ unexpected newline or end of string
file1
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
file2
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
desired output
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
edit with new awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A {
if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
print }
' file1 file2 > out
awk: cmd. line:2: if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
awk: cmd. line:2: ^ syntax error
here you go...
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$2]; next}
($2 in a) && $10>30 && $11>49 ' file1 file2
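To try this on the sample data, here is a self-contained sketch. The file1 row is truncated to its first six columns, the temporary files and the array name `seen` are illustrative, and the FNR > 1 test is an added assumption to skip the header line of file1 explicitly. Like the command above, it prints the matching rows of file2.

```shell
# Recreate shortened, tab-delimited versions of the two sample files.
file1=$(mktemp); file2=$(mktemp)
printf 'Missing in IDP but found in Reference:\n' > "$file1"
printf '2\t166848646\tG\tA\texonic\tSCN1A\n' >> "$file1"
printf 'chr2\t166245425\tSCN2A\tAMPL5155065355\tSNP\tHet\tC/T\tC\tT\t54\t100\t50\t23\t27\n' > "$file2"
printf 'chr2\t166848646\tSCN1A\tAMPL1543060606\tSNP\tHet\tG/A\tG\tA\t52.9411764706\t100\t68\t32\t36\n' >> "$file2"

result=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { if (FNR > 1) seen[$2]; next }   # first file: remember positions, skipping the header
($2 in seen) && $10 > 30 && $11 > 49        # second file: print rows passing all three tests
' "$file1" "$file2")
printf '%s\n' "$result"
rm -f "$file1" "$file2"
```

A pattern with no action prints the whole record, which is why no explicit print is needed on the last rule.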

remove lines that do not match specific digits in list file using awk

I am trying to use awk to remove the lines in file that do not match the digits after the NM_ but before the . in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293
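If the versioned ID from list should also replace $2, as in the desired output, one possible sketch combines the lookup with the substitution and prints only inside the matching rule. The array names `part` and `full` and the temporary files are made up for the example, and the output is tab-separated because OFS is set to a tab.

```shell
# Recreate the two sample inputs.
list=$(mktemp); file=$(mktemp)
printf 'TRIM24 NM_015905.2\n' > "$list"
printf '204 NM_003852 chr7 + 138145078 138270332 138145293\n' > "$file"
printf '204 NM_015905 chr7 + 138145078 138270332 138145293\n' >> "$file"

result=$(awk -v OFS='\t' '
NR == FNR { split($2, part, "."); full[part[1]] = $2; next }  # list: map NM_015905 -> NM_015905.2
$2 in full { $2 = full[$2]; print }                            # file: keep matches only, with the versioned ID
' "$list" "$file")
printf '%s\n' "$result"
rm -f "$list" "$file"
```

The script in the question printed every line because of the trailing `1`; moving the print into the matching rule is what drops the non-matching row.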