I would like to:
duplicate 2 columns
switch the position of some columns
add a constant value (e.g. 0,128,128) as a column at the end of each line
My file:
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
....continue
My code:
cat FILE.txt | awk 'BEGIN { FS=" "; OFS="\t" } { print $1, $2=$2 "\t" $2, $3=$3 "\t" $3, $4, $5, $6 }' | awk 'BEGIN { FS="\t"; OFS="\t" } { print $1, $2, $4, $6, $7, $8, $3, $5 "\t" "0,128,128" }' > FILE.bed
My output:
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
...continue
Error: the duplicated columns and the added one end up on a separate row below.
What I would like to obtain:
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
.....continue
What am I doing wrong?
Am I missing NR or paste0?
If the code in your question really puts the duplicated values on a new row, then your input has DOS line endings (or something similar), because the code as written cannot produce that output on its own.
See "Why does my tool output overwrite itself and how do I fix it?" for how to handle DOS line endings. Once that is fixed, this is all you really need instead of the script in your question:
$ awk -v OFS='\t' '{print $0, $2, $3, "0,128,128"}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
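As a minimal sketch of the DOS line ending point above: if the input cannot be converted beforehand, one option is to strip a trailing carriage return inside awk before printing. The sample data and the temporary file are assumptions made up for illustration, not the asker's real file.

```shell
# Hypothetical sample: two rows with DOS (CRLF) line endings.
tmp=$(mktemp)
printf 'chr1 3006607 3006623 Class 0 +\r\nchr1 3006607 3006623 Class 0 +\r\n' > "$tmp"

# Remove a trailing carriage return from each record, then append
# columns 2 and 3 plus the constant, separated by tabs.
out=$(awk -v OFS='\t' '{ sub(/\r$/, "") } { print $0, $2, $3, "0,128,128" }' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Without the `sub()` call, the carriage return stays at the end of `$0`, and a terminal would show the appended columns as if they were misplaced.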
You may use:
awk -v OFS='\t' -v s='0,128,128' '{$1=$1; print $0, $2, $3, s}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
Assigning $1=$1 forces awk to rebuild $0, joining the fields with OFS (a tab here) instead of the original separator.
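A small sketch of the effect of `$1=$1`, using one line from the question as input. The variable names `plain` and `rebuilt` are only illustrative.

```shell
line='chr1 3006607 3006623 Class 0 +'

# Without $1=$1: $0 is printed as it was read, so the original spaces remain
# and only the appended field is separated by a tab.
plain=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ print $0, $2 }')

# With $1=$1: awk rebuilds $0, so every field is joined with OFS (tab).
rebuilt=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print $0, $2 }')

printf '%s\n%s\n' "$plain" "$rebuilt"
```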
I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 53 955 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
The expected output for the small example is:
chr1 HAVANA transcript 11998 12060 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 41 103 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
The input file has many lines, each starting with chr. Every line has several columns, and the separators are either a tab or ";".
I want to make a new file from this one in which only columns 4 and 5 change: column 4 in the new file should be (column 4 of the original file) - 12, and column 5 in the new file should be (column 4 of the original file) + 50. The only difference between the input and output files is the numbers in the 4th and 5th columns.
I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
When I run the code, it returns this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
Do you know how to fix it? I want to get an output file with exactly the same format as the input file, meaning the same delimiters.
There is no need to output every single column individually; it's enough to modify the existing fields and then print the modified line.
awk -F '\t' '{ col4 = $4; $4 = col4 - 12; $5 = col4 + 50; print }' OFS='\t' file
This modifies the fourth and fifth tab-delimited column before printing the whole line.
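As a quick check of the approach above, here is a sketch run on a shortened, hypothetical record (tab-delimited, with a ninth column that contains spaces and semicolons). The variable name `start` is an arbitrary choice.

```shell
# Hypothetical one-line GTF-like record; only the first eight separators are tabs.
rec=$(printf 'chr1\tHAVANA\ttranscript\t12010\t13670\t.\t+\t.\tgene_id "ENSG00000223972.4"; level 2;')

# Save the original column 4 first, because $4 is overwritten before $5 is computed.
out=$(printf '%s\n' "$rec" |
  awk -F '\t' -v OFS='\t' '{ start = $4; $4 = start - 12; $5 = start + 50; print }')

printf '%s\n' "$out"
```

Because both FS and OFS are a tab, the ninth column is never split on its internal spaces, so the attribute text comes out unchanged.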
I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr2 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr3 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
The file has many lines, each starting with chr. Every line has several columns, and the separators are either a tab or ";".
I want to make a new file from this one in which only columns 4 and 5 change: column 4 in the new file should be (column 4 of the original file) - 12, and column 5 in the new file should be (column 4 of the original file) + 50. I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
When I run the code, it returns this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
Do you know how to fix it?
Try this:
awk ' {print $1 "\t" $2"\t"$3"\t"($4 - 12)"\t" ($4 + 50)"\t" $6 "\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18""$19" "$20""$21" "$22";"$23" "$24""$25" "$26""$27" "$28""$29" "$30""$31" "$32""$33" "$34""$35" "$36""$37" "$38"" }' input.txt > output.txt
What you basically do is:
awk '{print $1=$1+2 ";" $2=$1+2}' < input
What you need to do is:
awk '{print ($1=$1+2) ";" ($2=$1+2)}' < input
^^^ Note that you need to parenthesize your inline assignments.
Or just do the assignments before printing:
awk '{$1=$1+2;$2=$1+2;print ... }' < input
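A small sketch of both variants on a made-up input line `1 10`. In the inline version the two operands are deliberately independent of each other, since awk does not guarantee the order in which the parts of a concatenation are evaluated; that is one reason to prefer assigning before printing when one value depends on the other.

```shell
# Assign first, then print: the order of the two updates is explicit,
# so $2 is computed from the already updated $1.
first=$(echo '1 10' | awk '{ $1 = $1 + 2; $2 = $1 + 2; print $1 ";" $2 }')

# Parenthesized inline assignments; each operand only touches its own field.
inline=$(echo '1 10' | awk '{ print ($1 = $1 + 2) ";" ($2 = $2 + 2) }')

printf '%s\n%s\n' "$first" "$inline"
```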
I am trying to look up $2 of file1 (skipping the header) in $2 of file2, and if they match and the value in $10 is > 30 and $11 is > 49, print the line to an output file. The awk below has syntax errors in it, though shellcheck didn't report any. Both the input and output are tab-delimited. I think the below is close, but I am not sure what is wrong. Thank you :).
awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A
{if($10 >.5 OFS $11 > 49)
print ; next
' file1 file2
awk: cmd. line:2: {if($10 >.5 OFS $11 > 49)
awk: cmd. line:2: ^ syntax error
awk: cmd. line:3: print ; next
awk: cmd. line:3: ^ unexpected newline or end of string
file1
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
file2
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
desired output
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
edit with new awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A {
if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
print }
' file1 file2 > out
awk: cmd. line:2: if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
awk: cmd. line:2: ^ syntax error
here you go...
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$2]; next}
($2 in a) && $10>30 && $11>49 ' file1 file2
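To illustrate the `NR==FNR` idiom used above, here is a self-contained sketch with small made-up files (`keys.txt` and `data.txt` are hypothetical names). While awk reads the first file, `NR` equals `FNR`, so the keys are stored; for the second file the condition is false and the filter applies. The `FNR > 1` check is an addition that skips the header line of the first file, as the question asks.

```shell
dir=$(mktemp -d)

# Hypothetical first file: a header line, then one tab-delimited row whose $2 is the key.
printf 'header line\n2\t166848646\tG\tA\n' > "$dir/keys.txt"

# Hypothetical second file: 11 tab-delimited columns per row.
{
  printf 'chr2\t166245425\tSCN2A\tid1\tSNP\tHet\tC/T\tC\tT\t54\t100\n'
  printf 'chr2\t166848646\tSCN1A\tid2\tSNP\tHet\tG/A\tG\tA\t52.9\t100\n'
  printf 'chr2\t166848646\tSCN1A\tid3\tSNP\tHet\tG/A\tG\tA\t12\t100\n'
} > "$dir/data.txt"

out=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { if (FNR > 1) seen[$2]; next }   # first file: remember keys, skip header
  ($2 in seen) && $10 > 30 && $11 > 49        # second file: print matching rows
' "$dir/keys.txt" "$dir/data.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Only the second row of the data file is printed: the first row has a key that is not in the first file, and the third row fails the `$10 > 30` test.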
I am trying to use awk to remove the lines in file that do not match the digits after the NM_ but before the . in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293
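Both commands above print the matching line with `$2` as it appears in file, without the `.2` suffix. If the versioned name from list is wanted in the output, as in the desired output of the question, a possible sketch is to store the full name under the unversioned key and substitute it before printing. The inputs below are shortened, tab-delimited stand-ins for the real files, and the array names `parts` and `full` are arbitrary.

```shell
dir=$(mktemp -d)
printf 'TRIM24\tNM_015905.2\n' > "$dir/list"
printf '204\tNM_003852\tchr7\n204\tNM_015905\tchr7\n' > "$dir/file"

out=$(awk 'BEGIN { FS = OFS = "\t" }
  # list: key is the part of $2 before the first ".", value is the full name
  NR == FNR { split($2, parts, "."); full[parts[1]] = $2; next }
  # file: keep only lines whose $2 is a known key, replacing it with the full name
  $2 in full { $2 = full[$2]; print }
' "$dir/list" "$dir/file")

printf '%s\n' "$out"
rm -rf "$dir"
```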