awk - Duplicate, Rearrange and Add columns to file

I would like to :
duplicate 2 columns
switch position between columns
add a variable (as: 0,128,128) in a column at the end of the file
My file :
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
chr1 3006607 3006623 Class 0 +
....continue
My code :
cat FILE.txt | awk 'BEGIN { FS=" "; OFS="\t" } { print $1, $2=$2 "\t" $2, $3=$3 "\t" $3, $4, $5, $6 }' | awk 'BEGIN { FS="\t"; OFS="\t" } { print $1, $2, $4, $6, $7, $8, $3, $5 "\t" "0,128,128" }' > FILE.bed
My out-put :
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 +
3006607 3006623 0,128,128
...continue
ERROR = DUPLICATED COLUMNS AND THE ADDED ONE ARE IN A ROW BELOW
What I would like to obtain !
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
.....continue
What am I doing wrong?
Am I missing NR or paste0?

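The code in your question cannot put the duplicated values on a new row by itself. If that is what you see, your input almost certainly has DOS line endings (CRLF): the carriage return at the end of each line becomes part of the last field, so whatever is printed after it looks like it starts on a new line (or overwrites the start of the current one, depending on the viewer).
See "Why does my tool output overwrite itself and how do I fix it?" for how to handle DOS line endings. Once that is fixed, this is all you really need instead of the script in your question: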
$ awk -v OFS='\t' '{print $0, $2, $3, "0,128,128"}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
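As a minimal sketch of dealing with the line endings inside awk itself (one of several possible approaches, and not necessarily the one the linked question recommends), the carriage return can be stripped before the extra columns are appended. The one-line sample file below is hypothetical:

```shell
# Hypothetical sample input with DOS (CRLF) line endings
tmp=$(mktemp)
printf 'chr1 3006607 3006623 Class 0 +\r\n' > "$tmp"

# Remove a trailing carriage return from the record, then append
# the two duplicated columns and the colour string, tab-separated
result=$(awk -v OFS='\t' '{ sub(/\r$/, ""); print $0, $2, $3, "0,128,128" }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Without the sub() call the carriage return would stay at the end of the original line, in front of the appended columns.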

You may use:
awk -v OFS='\t' -v s='0,128,128' '{$1=$1; print $0, $2, $3, s}' file
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
chr1 3006607 3006623 Class 0 + 3006607 3006623 0,128,128
$1=$1 is to force $0 to be reformatted with tab as field separator.
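A small sketch of what that assignment changes, using one line from the question as sample input (the variable names are only for illustration):

```shell
line='chr1 3006607 3006623 Class 0 +'

# Without $1=$1 the original record is printed as-is, so only the
# appended column is separated by a tab
without=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ print $0, $2 }')

# Assigning to any field makes awk rebuild $0 using OFS,
# so every field ends up tab-separated
with=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print $0, $2 }')

printf '%s\n%s\n' "$without" "$with"
```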

Related

Use awk with two different delimiters to split and select columns

How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file?
> cat broad_snps.tab
chrsnpID rsID freq_bin snp_maf gene_count dist_nearest_gene_snpsnap dist_nearest_gene_snpsnap_protein_coding dist_nearest_gene dist_nearest_gene_located_within loci_upstream loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding ID_genes_in_matched_locus friends_ld01 friends_ld02 friends_ld03 friends_ld04 friends_ld05 friends_ld06 friends_ld07 friends_ld08 friends_ld09 -1
10:10001753 10:10001753 7 0.07455 0 98932.0 1045506.0 98932.0 inf 9986766 10039928 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 253.0 103.0 55.0 40.0 35.0 33.031.0 20.0 0.0 -1
10:10001794 10:10001794 41 0.4105 0 98891.0 1045465.0 98891.0 inf 9964948 10071879 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 365.0 299.0 294.0 266.0 168.0 138.58.0 45.0 0.0 -1
10:100023489 10:100023489 10 0.1054 1 4518.0 4518.0 4518.0 4518.0 100023489 100023489 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1
10:100025128 10:100025128 45 0.4543 1 2879.0 2879.0 2879.0 2879.0 100025128 100025128 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 112.0 70.0 3.0 0.0 0.0
The output I want:
chr10 10001752 10001753 CELF2
chr10 10001793 10001794 CELF2
chr10 100023488 100023489 LOXL4
chr10 100025127 100025128 LOXL4
chr10 10002974 10002975 LOXL4
The command I am currently using:
cat broad_snps.tab | tail -n+2 | gawk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2}' | gawk -vOFS="\t" '{print $1, $2, $3}' > broad_SNPs.bed
Returns this:
chr10 10001752 10001753 10
chr10 10001793 10001794 10
chr10 100023488 100023489 10
chr10 100025127 100025128 10
chr10 10002974 10002975 10
chr10 10003391 10003392 10
chr10 100038815 100038816 10
chr10 10008001 10008002 10
chr10 100093012 100093013 10
I'd like to be able to use the ":" delimiter to split up the first column, but I need to use "\t" to pick out the gene ID.
Thanks!
awk -F'[\t:]' '{print $1, $2, $4, $17}'
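To see how the bracket expression splits on both delimiters, here is a sketch on a hypothetical four-column row (SNP id, rs id, frequency bin, gene symbol). In the real file the gene symbol sits in a different field, so its number has to be counted there, remembering that every ":" adds one field:

```shell
# Hypothetical tab-separated row with far fewer columns than the real file
row=$(printf '10:10001753\t10:10001753\t7\tCELF2')

# Fields after splitting on tab or colon:
#   $1=10  $2=10001753  $3=10  $4=10001753  $5=7  $6=CELF2
bed=$(printf '%s\n' "$row" |
  awk -F'[\t:]' -v OFS='\t' '{ print "chr" $1, $2 - 1, $2, $6 }')
printf '%s\n' "$bed"
```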

modifying the text file in awk

I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 53 955 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
the expected output for the small example is:
chr1 HAVANA transcript 11998 12060 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 41 103 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
Each line of the input file starts with chr and has several columns; the separators are either tab or ";".
I want to make a new file in which only columns 4 and 5 change: column 4 in the new file should be (column 4 in the original file) - 12, and column 5 should be (column 4 in the original file) + 50. The numbers in the 4th and 5th columns are the only difference between the input and output files.
I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
when I run the code, it would return this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
Do you know how to fix it? I want an output file with exactly the same format as the input file, meaning the same delimiters.
There is no need to output every single column individually, it's enough to modify the existing data and then print the modified line.
awk -F '\t' '{ col4 = $4; $4 = col4 - 12; $5 = col4 + 50; print }' OFS='\t' file
This modifies the fourth and fifth tab-delimited column before printing the whole line.
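A quick check of that approach on a shortened, hypothetical row (the attribute column is cut down to two entries) suggests the ninth column, with its spaces and semicolons, comes through untouched because only tabs are treated as separators:

```shell
# Hypothetical shortened row; real rows carry many more attributes
row=$(printf 'chr1\tHAVANA\ttranscript\t12010\t13670\t.\t+\t.\tgene_id "X"; level 2;')

# Save the original column 4 first so that column 5 is computed
# from it and not from the already-modified value
result=$(printf '%s\n' "$row" |
  awk -F '\t' -v OFS='\t' '{ start = $4; $4 = start - 12; $5 = start + 50; print }')
printf '%s\n' "$result"
```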

error when editing big text file using awk [duplicate]

I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr3 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
Each line of the file starts with chr and has several columns; the separators are either tab or ";".
I want to make a new file in which only columns 4 and 5 change: column 4 in the new file should be (column 4 in the original file) - 12, and column 5 should be (column 4 in the original file) + 50. I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
when I run the code, it would return this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
do you know how to fix it?
Try this (column 5 is computed from the original column 4, as the question asks):
awk ' {print $1 "\t" $2"\t"$3"\t"($4 - 12)"\t" ($4 + 50)"\t" $6 "\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18""$19" "$20""$21" "$22";"$23" "$24""$25" "$26""$27" "$28""$29" "$30""$31" "$32""$33" "$34""$35" "$36""$37" "$38"" }' input.txt > output.txt
What you basically do is:
awk '{print $1=$1+2 ";" $2=$1+2}' < input
What you need to do is:
awk '{print ($1=$1+2) ";" ($2=$1+2)}' < input
^^^ Note that you need to parenthesize your inline assignments.
Or just do the assignments before printing:
awk '{$1=$1+2;$2=$1+2;print ... }' < input
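A small sketch of the "assign first, then print" variant on a hypothetical two-number input. Note that the second assignment sees the already-updated $1. Doing the assignments as separate statements also avoids relying on the order in which awk evaluates the pieces of a single print expression:

```shell
# Input "10 0": $1 becomes 12, then $2 becomes 12 + 2 = 14
result=$(printf '10 0\n' | awk '{ $1 = $1 + 2; $2 = $1 + 2; print $1 ";" $2 }')
printf '%s\n' "$result"
```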

remove lines that do not match specific digits in list file using awk

I am trying to use awk to remove the lines in file that do not match the digits after the NM_ but before the . in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293
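Both commands above print the matching line unchanged, so the version suffix from list does not appear. If the versioned ID is wanted, as in the desired output, one possible sketch is to remember the full ID and substitute it before printing. This assumes space-separated fields; the array names full and id are made up for the example:

```shell
dir=$(mktemp -d)
printf '%s\n' \
  '204 NM_003852 chr7 + 138145078 138270332 138145293' \
  '204 NM_015905 chr7 + 138145078 138270332 138145293' > "$dir/file"
printf 'TRIM24 NM_015905.2\n' > "$dir/list"

# First file (list): map the ID without its version to the full ID.
# Second file: keep only lines whose $2 is in the map and swap in the full ID.
kept=$(awk 'NR == FNR { split($2, id, "."); full[id[1]] = $2; next }
            $2 in full { $2 = full[$2]; print }' "$dir/list" "$dir/file")
printf '%s\n' "$kept"
rm -rf "$dir"
```

Assigning to $2 rebuilds the line with the default single-space OFS; for tab-separated data, FS and OFS would need to be set to a tab.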

How to use conditional expression to select data?

I have a table like this:
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
And I want to extract data with the following conditions:
1) lines that the values in "refseq" column are not missing
2) for the values in the columns "start" and "stop", only keep one value for each line: if the value in the column "strand" is "+", take the value in "start"; if the value in the column "strand" is "-", take the value in "stop".
And this is what expected:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
I would be very tempted to leave the input delimiter unmodified so blanks and tabs separate fields, rather than insisting on tabs only. That means you want records after the first (to skip the headings line) that have six fields:
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
If you want to control the output format more, you can dink with OFS, or use printf:
awk 'BEGIN { OFS = "\t" }
NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5;
printf "%-8s %-12s %s %9s\n", $1, $2, $3, x; }'
There are other ways to handle it, I'm sure...
The first script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
The content is correct, I believe; the formatting can be improved in many ways. The last script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
You can tweak field widths as necessary.
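The scripts above leave out the strand column that the expected output in the question includes. A minimal sketch that keeps it, assuming whitespace-separated input and using a hypothetical three-row excerpt of the table:

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Gbx2 chr1 91824537 91827751 -
EOF

# Skip the header, drop rows with a missing refseq (fewer than 6 fields),
# pick start for "+" and stop for "-", and print the strand as well
result=$(awk 'NR > 1 && NF == 6 { print $1, $2, $3, ($6 == "+" ? $4 : $5), $6 }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```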
This might work for you (GNU sed):
sed -r '1d;/(\S+\s+){5}\S+/!d;/\+$/s/\S+\s+//5;/-$/s/\S+\s+//4' file
EDIT:
1d delete the header line
/(\S+\s+){5}\S+/!d; if the line does not have 6 fields delete it
/\+$/s/\S+\s+//5 if the line ends in + delete the 5th field
/-$/s/\S+\s+//4 if the line ends in - delete the 4th field
quick and dirty, pls check if it works:
awk -F'\t' 'NR>1&&$2{print $NF=="+"?$4:$5}' file
output:
184695027
191552147
196765080
31055753
77511663
87495392
108610879
if you want other values in output too:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{print $1,$2,$3,$NF=="+"?$4:$5}' file
output:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
EDIT, adjust format to OP's output example:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{$4=$NF=="+"?$4:" ";$5=$NF=="+"?" ":$5;print}' file
output:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
When you deal with a text file with fields, awk is usually better than sed because awk was designed to help parse text files with fields.
How are the columns in your table setup? Are they tab delimited, or do you use spaces to help line up the columns?
If this is a tab-delimited table, you could tell awk to split on tabs and check whether the second field is empty:
awk -F'\t' '
{
# an empty second field means the refseq value is missing
if ($2 == "") {
print "Missing refseq in symbol " $1
}
}
' "$myfile"
If your file uses spaces to align the various fields, you can still use awk by using its built-in substr function.
awk '
{
if (substr($0, 9, 12) ~ /^ *$/) {
print "Missing refseq in symbol " substr($0, 1, 7)
}
}
' "$myfile"
By the way, I'm being rather wordy here to show you the syntax to make it understandable. I could have used a few shortcuts to put these on one line:
awk -F'\t' '$2 == "" {print "Missing refseq in symbol " $1}' "$myfile"
awk 'substr($0, 9, 12) ~ /^ *$/ {print "Missing refseq in symbol " substr($0, 1, 7)}' "$myfile"
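For the tab-delimited case, the field separator matters: with awk's default whitespace splitting, consecutive tabs collapse and an empty field simply disappears. A short sketch on a hypothetical row with an empty refseq column shows the difference:

```shell
# Hypothetical tab-separated row whose second column is empty
row=$(printf 'Gbx2\t\tchr1\t91824537\t91827751\t-')

# Default splitting sees 5 fields, so $2 would be "chr1"
default_nf=$(printf '%s\n' "$row" | awk '{ print NF }')

# Splitting on single tabs keeps the empty field, giving 6 fields
tab_nf=$(printf '%s\n' "$row" | awk -F '\t' '{ print NF }')

msg=$(printf '%s\n' "$row" |
  awk -F '\t' '$2 == "" { print "Missing refseq in symbol " $1 }')
printf '%s %s\n%s\n' "$default_nf" "$tab_nf" "$msg"
```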