How to convert row to column on specific condition using awk? - awk

I have a file which I want to convert from rows to columns based on a specific condition.
Input file:
cat f
"0/35","0eij8401c
"0/35","59ij41015
"0/35","21ij3e01c
"0/35","dbije401b
"1/35","dbij8a015
"1/35","67ijb9011
"1/35","b5ije001b
"1/35","bdij3701d
"2/35","abij3b011
"2/35","7fij70018
"2/35","77ijf9010
"2/35","e5ij64015
"3/35","59ij41015
"3/35","f6ijae01e
"3/35","c4ij5801c
"3/35","dbij98012
"4/35","edij6801e
"4/35","pdij6801e
"4/35","kdij6801e
"4/35","8cij57018
NOTE: the second column of input rows 1, 5, 9, 13 and 17 forms the first line of the output below; likewise the second column of rows 2, 6, 10, 14 and 18 forms the second line, and so on for the remaining rows.
There are two expected output:
Expected output 1: (To see it in a report format)
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018
Expected output2:
And then convert the expected output1 into a single column to perform some operation:
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018
I tried combination of awk and paste, trying to achieve both with awk command.
This is what I tried -
cat f | awk -v batchNo=1 -v Num=4 '{print $1 > "batch_" batchNo ".txt";if(NR%Num==0) {batchNo++}}'
to generate 5 files like below -
ls batch_*
batch_1.txt batch_2.txt batch_3.txt batch_4.txt batch_5.txt
and then combined with paste like below -
paste batch_1.txt batch_2.txt batch_3.txt batch_4.txt batch_5.txt
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018
I also tried something like this to get the desired result but didn't get it.
awk '{a[$1]++; b[$2]++;c[$3]++;d[$4]++;e[$5]++} END {for (k in a) print k > "out.txt"; for (j in b) print j > "out.txt";for (k in c) print j > "out.txt";for(l in d) print l> "out.txt"; for (m in e) print m> "out.txt";}' batch_*
Any suggestion please.

In addition to the other two good answers, there is yet another simplified way to approach each of your separate output problems. In the first case, you can simply save the values from the second column in an indexed array and then output in rows by groups of 5, e.g.
awk -F, '
{ a[++n] = $2 }
END {
for (i=1; i<=(n/5); i++)
printf "%s %s %s %s %s\n", a[i], a[i+4], a[i+8], a[i+12], a[i+16]
}
' f
Output
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018
If you need the column output in the specific order shown, you can use the approach to save to an indexed array and then output with '\n' separators instead along with trimming the first char with substr(), e.g.
awk -F, '
{ a[++n]=$2 }
END {
for (i=1; i<=(n/5); i++)
printf "%s\n%s\n%s\n%s\n%s\n", substr(a[i],2), substr(a[i+4],2),
substr(a[i+8],2), substr(a[i+12],2), substr(a[i+16],2)
}
' f
Output
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018
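The two loops above hard-code five columns and a stride of four, which fits the sample exactly. As a rough sketch only, the same idea can take the number of values per key as a parameter. The variable name rows and the tiny sample input below are assumptions for illustration, and the sketch assumes every key has the same number of values:

```shell
# Made-up input in the question's format: 3 keys, 2 values each.
input='"0/3","a1
"0/3","a2
"1/3","b1
"1/3","b2
"2/3","c1
"2/3","c2'

# rows = number of values per key (hypothetical, user-supplied parameter).
report=$(printf '%s\n' "$input" | awk -F, -v rows=2 '
    { a[++n] = $2 }                        # keep the second column in input order
    END {
        cols = n / rows                    # number of keys, assuming equal group sizes
        for (i = 1; i <= rows; i++)
            for (j = 0; j < cols; j++)     # walk the array with a stride of "rows"
                printf "%s%s", a[i + j * rows], (j < cols - 1 ? OFS : ORS)
    }')
printf '%s\n' "$report"
```

With the real file f, passing rows=4 should give the same report as the hard-coded version.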
If you just need a column of output of the 2nd field, regardless of order, you can simply use substring to output all but the first character, e.g.
awk -F, '{ print substr($2,2) }' f
Output
0eij8401c
59ij41015
21ij3e01c
dbije401b
dbij8a015
67ijb9011
b5ije001b
bdij3701d
abij3b011
7fij70018
77ijf9010
e5ij64015
59ij41015
f6ijae01e
c4ij5801c
dbij98012
edij6801e
pdij6801e
kdij6801e
8cij57018

About the solutions: all three print a continuous view (one value per line) as well as a report view (values laid out horizontally). The 1st solution assumes that Input_file is already sorted on the "digits/digits" first field. The 2nd solution sorts Input_file first and then does the same job. The 3rd solution prints both views and also creates the batch output files.
1st solution: (assumes that Input_file is sorted on the "digits/digits" field) With your shown samples, please try the following awk code. It prints the output in the order of the 1st field, e.g. "0/35", "1/35" and so on.
awk -v s1="\"" -F'^"|","' '
prev!=$2{                     # a new key starts a new group
  countFile++
  count=0
}
{
  arr[countFile,++count]=$3   # arr[group, position in group]
  max=(max>count?max:count)   # largest group size seen so far
  prev=$2
}
END{
  print "Printing continuous view from here..."
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      print(arr[j,i])
    }
  }
  print "Printing REPORT view from here......"
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      printf("%s%s",s1 arr[j,i],j==countFile?ORS:OFS)
    }
  }
}
' Input_file
2nd solution: In case your Input_file is NOT sorted on the "digits/digits" field, sort it first and then run the same code. The stable sort (-s, available in GNU and BSD sort) on the numeric key alone keeps the original order of the lines within each key.
awk -F'^"|","' '{print $2,$0}' Input_file | sort -s -t/ -k1,1n | cut -d' ' -f2- |
awk -v s1="\"" -F'^"|","' '
prev!=$2{                     # a new key starts a new group
  countFile++
  count=0
}
{
  arr[countFile,++count]=$3   # arr[group, position in group]
  max=(max>count?max:count)   # largest group size seen so far
  prev=$2
}
END{
  print "Printing continuous view from here..."
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      print(arr[j,i])
    }
  }
  print "Printing REPORT view from here......"
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      printf("%s%s",s1 arr[j,i],j==countFile?ORS:OFS)
    }
  }
}
'
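A small sketch of the pre-sorting step on its own, under the assumption that the order of lines within each key should survive the sort. It uses sort -s (stable; available in GNU and BSD sort, not required by POSIX) and restricts the key to the number before the slash. The sample lines are made up:

```shell
unsorted='"1/35","x1
"0/35","z9
"1/35","a2
"0/35","b3'

# Prefix each line with its key, sort stably on the number before "/",
# then strip the prefix again.
sorted=$(printf '%s\n' "$unsorted" |
    awk -F'"' '{ print $2, $0 }' |
    sort -s -t/ -k1,1n |
    cut -d' ' -f2-)
printf '%s\n' "$sorted"
```

Without -s and an explicit key end, sort falls back to comparing whole lines for equal keys, which would reorder the values inside each group.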
OR 3rd solution: In case you want to print the data on screen and also create the output files within the same awk program, try the following:
awk -v s1="\"" -F'^"|","' '
prev!=$2{                     # a new key starts a new group and a new file
  close(outputFile)
  countFile++
  outputFile="batch_"countFile".txt"
  count=0
}
{
  arr[countFile,++count]=$3   # arr[group, position in group]
  max=(max>count?max:count)   # largest group size seen so far
  prev=$2
  print (s1 $3) > (outputFile)
}
END{
  print "Printing continuous view from here..."
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      print(arr[j,i])
    }
  }
  print "Printing REPORT view from here......"
  for(i=1;i<=max;i++){
    for(j=1;j<=countFile;j++){
      printf("%s%s",s1 arr[j,i],j==countFile?ORS:OFS)
    }
  }
}
' Input_file
The REPORT view will be printed as follows:
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018

As your shown input is already sorted on first field, you may use this solution:
awk -F, '{gsub(/^"|\/[0-9]+"/, ""); print $2 > ("batch_" ($1+1) ".txt")}' f
paste batch_1.txt batch_2.txt batch_3.txt batch_4.txt batch_5.txt
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018
For output2 as per edited question use:
awk '{
a[FNR] = a[FNR] substr($0,2) "\n"
}
END {
for (i=1; i<=FNR; ++i) printf "%s", a[i]
}' batch_1.txt batch_2.txt batch_3.txt batch_4.txt batch_5.txt
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018

Using any awk:
$ cat tst.awk
BEGIN { FS="\"" }
{ vals[++numVals] = $NF }
END {
numValsPerBatch = int(numVals / numBatches) + ( numVals % numBatches ? 1 : 0 )
for ( batchNr=1; batchNr<=numBatches; batchNr++ ) {
for ( valNr=1; valNr<=numValsPerBatch; valNr++ ) {
valIdx = batchNr + (valNr - 1) * numBatches
printf "%s%s", vals[valIdx], (valNr<numValsPerBatch ? OFS : ORS) > "out1.txt"
print vals[valIdx] > "out2.txt"
}
}
}
$ awk -v numBatches=4 -f tst.awk f
$ head -100 out?.txt
==> out1.txt <==
0eij8401c dbij8a015 abij3b011 59ij41015 edij6801e
59ij41015 67ijb9011 7fij70018 f6ijae01e pdij6801e
21ij3e01c b5ije001b 77ijf9010 c4ij5801c kdij6801e
dbije401b bdij3701d e5ij64015 dbij98012 8cij57018
==> out2.txt <==
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018
or if you want the number of batches to be calculated from the key values (YMMV if there's different numbers of values per key in your input):
$ cat tst.awk
BEGIN { FS="\"" }
!seen[$2]++ { numKeys++ }
{ vals[++numVals] = $NF }
END {
numBatches = int(numVals / numKeys) + (numVals % numKeys ? 1 : 0)
numValsPerBatch = int(numVals / numBatches) + (numVals % numBatches ? 1 : 0)
for ( batchNr=1; batchNr<=numBatches; batchNr++ ) {
for ( valNr=1; valNr<=numValsPerBatch; valNr++ ) {
valIdx = batchNr + (valNr - 1) * numBatches
printf "%s%s", vals[valIdx], (valNr<numValsPerBatch ? OFS : ORS) > "out1.txt"
print vals[valIdx] > "out2.txt"
}
}
}
$ awk -f tst.awk f
$ head -100 out?.txt
==> out1.txt <==
0eij8401c dbij8a015 abij3b011 59ij41015 edij6801e
59ij41015 67ijb9011 7fij70018 f6ijae01e pdij6801e
21ij3e01c b5ije001b 77ijf9010 c4ij5801c kdij6801e
dbije401b bdij3701d e5ij64015 dbij98012 8cij57018
==> out2.txt <==
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018

TXR solution:
@(collect)
@ (all)
"@id/@nil
@ (and)
@ (collect :gap 0)
"@id/@nil","@data
@ (bind qdata `"@data`)
@ (end)
@ (end)
@(end)
@(bind tdata @(transpose qdata))
@(bind fdata @(flatten (transpose data)))
@(output)
@ (repeat)
@{tdata " "}
@ (end)
@ (repeat)
@fdata
@ (end)
@(end)
$ txr soln.txr data
"0eij8401c "dbij8a015 "abij3b011 "59ij41015 "edij6801e
"59ij41015 "67ijb9011 "7fij70018 "f6ijae01e "pdij6801e
"21ij3e01c "b5ije001b "77ijf9010 "c4ij5801c "kdij6801e
"dbije401b "bdij3701d "e5ij64015 "dbij98012 "8cij57018
0eij8401c
dbij8a015
abij3b011
59ij41015
edij6801e
59ij41015
67ijb9011
7fij70018
f6ijae01e
pdij6801e
21ij3e01c
b5ije001b
77ijf9010
c4ij5801c
kdij6801e
dbije401b
bdij3701d
e5ij64015
dbij98012
8cij57018

Related

For each different occurrence in field, print lines with max value associated

I have
ID=exon-XM_030285750.2 LOC100221041 7895
ID=exon-XM_030285760.2 LOC100221041 8757
ID=exon-XM_030285720.2 LOC100221041 8656
ID=exon-XM_030285738.2 LOC100221041 8183
ID=exon-XM_030285728.2 LOC100221041 8402
ID=exon-XM_030285733.2 LOC100221041 7398
ID=exon-XM_030285715.2 LOC100221041 8780
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_030285774.2 CMSS1 1440
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_012570104.3 FILIP1L 6371
ID=exon-XM_030285654.2 FILIP1L 6456
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_032751000.1 FILIP1L 5886
ID=exon-XM_030285671.2 FILIP1L 5622
ID=exon-XM_030285682.2 FILIP1L 5395
ID=exon-XR_004369230.1 LOC116808959 2289
I want to print, for each distinct value in $2, the line with the highest value in $3:
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
I tried this
awk -f avg.sh test | awk 'BEGIN {OFS = "\t"} arr[$2]==0 {arr[$2]=$3} ($3 > arr[$2]) {arr[$2]=$3} END{for (i in arr) {print i, arr[i]}}'
from here
how to conditionally filter rows in awk
but I would like to also keep $1 in the output and keep the same ordering as in the input.
The answer to this
Computing averages of chunks of a column
shows how to build an array that keeps the original ordering, but I'm failing to put the two together.
Could you please try following, written and tested with shown samples in GNU awk.
awk '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file
Output will be as follows.
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
To get in TAB separated form try following.
awk -v OFS="\t" '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file |
column -t -s $'\t'
You may use this awk:
awk '!($2 in max) || $3 > max[$2] {
if(!($2 in max))
ord[++n] = $2
max[$2] = $3
rec[$2] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file | column -t
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
You can do this with sort and awk.
If preserving the input order is optional:
$ sort -k2,2 -k3,3nr madza.txt | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }'
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XR_004369230.1 LOC116808959 2289
$
To keep the input ordering, you can append sequence numbers first and remove them at the end.
$ awk ' { $(NF+1)=NR}1 ' madza.txt | sort -k2,2 -k3,3nr | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }' | sort -k4 -n | awk ' {NF=NF-1}1 '
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
$

modifying the text file in awk

I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 53 955 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
the expected output for the small example is:
chr1 HAVANA transcript 11998 12060 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 41 103 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
in the input file, there are different lines. each line starts with chr. every line has some columns and separators are either tab or ";".
I want to make a new file from this one in which there would be a change only in columns 4 and 5. in fact column 4 in the new file would be ((column 4 in original file)-12) and 5th column in the new file would be ((column 4 in original file)+50). the only difference between input file and output file in the numbers in 4th and 5th column.
I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
when I run the code, it would return this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
do you know how to fix it? I want to get the an output file with exactly the same format as input file. meaning the same delimiters.
There is no need to output every single column individually, it's enough to modify the existing data and then print the modified line.
awk -F '\t' '{ col4 = $4; $4 = col4 - 12; $5 = col4 + 50; print }' OFS='\t' file
This modifies the fourth and fifth tab-delimited column before printing the whole line.
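One more detail in the question's attempt: a multi-character FS is treated as a regular expression, so FS="\t;" means "a tab followed by a semicolon", not "tab or semicolon". A small sketch on a made-up, shortened line (the sample data and shell variable names are assumptions for illustration) shows the difference, and why splitting on tabs alone is enough here:

```shell
line=$(printf 'chr2\tHAVANA\texon\t53\t955\t.\t+\t.\tgene_id "G1"; level 2')

# FS="\t;" only matches the two-character sequence tab+semicolon,
# which never occurs in this line, so the whole line stays one field.
nf_seq=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\t;" } { print NF }')

# A bracket expression splits on either character.
nf_either=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[\t;]" } { print NF }')

# Splitting on tabs only, with OFS set to tab, leaves the attribute column
# (and its semicolons) untouched while columns 4 and 5 are recomputed.
fixed=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = OFS = "\t" } { start = $4; $4 = start - 12; $5 = start + 50 } 1')

printf '%s %s\n%s\n' "$nf_seq" "$nf_either" "$fixed"
```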

error when editing big text file using awk [duplicate]

I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr3 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
in the file there are different lines. each line starts with chr. every line has some columns and separators are either tab or ";".
I want to make a new file from this one in which there would be a change only in columns 4 and 5. in fact column 4 in the new file would be ((column 4 in original file)-12) and 5th column in the new file would be ((column 4 in original file)+50). I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
when I run the code, it would return this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
do you know how to fix it?
try this
awk ' {print $1 "\t" $2"\t"$3"\t"($4 - 12)"\t" ($4 + 50)"\t" $6 "\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18""$19" "$20""$21" "$22";"$23" "$24""$25" "$26""$27" "$28""$29" "$30""$31" "$32""$33" "$34""$35" "$36""$37" "$38"" }' input.txt > output.txt
What you basically do is:
awk '{print $1=$1+2 ";" $2=$1+2}' < input
What you need to do is:
awk '{print ($1=$1+2) ";" ($2=$1+2)}' < input
^^^ Note that you need to parenthesize your inline assignments.
Or just do the assignments before printing:
awk '{$1=$1+2;$2=$1+2;print ... }' < input
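A quick sketch of the "assign first, then print" variant on a made-up three-field line. Note that in the chained form the second assignment sees the already updated $1; if both new values should come from the original field, as in the question, save it in a variable first (the name orig is just for illustration). Doing the assignments as separate statements also avoids relying on the order in which awk evaluates the parts of a single concatenation, which is not guaranteed:

```shell
# $2 is computed from the already modified $1.
chained=$(echo '10 20 30' | awk '{ $1 = $1 + 2; $2 = $1 + 2; print }')

# Both fields are computed from the original $1.
from_orig=$(echo '10 20 30' | awk '{ orig = $1; $1 = orig + 2; $2 = orig + 2; print }')

printf '%s\n%s\n' "$chained" "$from_orig"
```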

Filter columns in Pig script

I am loading data in Pig from a CSV.
After having loaded data, I need to filter out columns .
exportAllProductsCleaned = FOREACH exportAllProducts
generate $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $26, $27, $28, $29, $30, $31, $32, $33
Is there a way wherein I can specify only
The columns I need to remove
OR
The range of columns I need for ex. $1-15 and then $18 - $30
Is it possible?
Yes, you can do so using the '..' range convention. Refer to
"Support project range expression".
exportAllProductsCleaned = FOREACH exportAllProducts GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $26, $27, $28, $29, $30, $31, $32, $33;
exportAllProductsFiltered = FOREACH exportAllProductsCleaned GENERATE $1 .. $15,$18 .. $30;

How to use conditional expression to select data?

I have a table like this:
symbol refseq seqname start stop strand
Susd4 NM_144796 chr1 184695027 184826500 +
Ptpn14 NM_008976 chr1 191552147 191700574 +
Cd34 NM_001111059 chr1 196765080 196787475 +
Gm5698 NM_001166637 chr1 31034088 31055753 -
Epha4 NM_007936 chr1 77363760 77511663 -
Sp110 NM_175397 chr1 87473474 87495392 -
Gbx2 chr1 91824537 91827751 -
Kif1a chr1 94914855 94998430 -
Bcl2 NM_009741 chr1 108434770 108610879 -
And I want to extract data with the following conditions:
1) lines that the values in "refseq" column are not missing
2) for the values in the columns "start" and "stop", only keep one value for each line: if the value in the column "strand" is "+", take the value in "start"; if the value in the column "strand" is "-", take the value in "stop".
And this is what expected:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
I would be very tempted to leave the input delimiter unmodified so blanks and tabs separate fields, rather than insisting on tabs only. That means you want records after the first (to skip the headings line) that have six fields:
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
If you want to control the output format more, you can dink with OFS, or use printf:
awk 'BEGIN { OFS = "\t" }
NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5; print $1, $2, $3, x; }'
awk 'NR > 1 && NF == 6 { if ($6 == "+") x = $4; else x = $5;
printf "%-8s %-12s %s %9s\n", $1, $2, $3, x; }'
There are other ways to handle it, I'm sure...
The first script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
The content is correct, I believe; the formatting can be improved in many ways. The last script produces:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
You can tweak field widths as necessary.
This might work for you (GNU sed):
sed -r '1d;/(\S+\s+){5}\S+/!d;/\+$/s/\S+\s+//5;/-$/s/\S+\s+//4' file
EDIT:
1d delete the header line
/(\S+\s+){5}\S+/!d; if the line does not have 6 fields delete it
/\+$/s/\S+\s+//5 if the line ends in + delete the 5th field
/-$/s/\S+\s+//4 if the line ends in - delete the 4th field
quick and dirty, please check if it works:
awk -F'\t' 'NR>1&&$2{print $NF=="+"?$4:$5}' file
output:
184695027
191552147
196765080
31055753
77511663
87495392
108610879
if you want other values in output too:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{print $1,$2,$3,$NF=="+"?$4:$5}' file
output:
Susd4 NM_144796 chr1 184695027
Ptpn14 NM_008976 chr1 191552147
Cd34 NM_001111059 chr1 196765080
Gm5698 NM_001166637 chr1 31055753
Epha4 NM_007936 chr1 77511663
Sp110 NM_175397 chr1 87495392
Bcl2 NM_009741 chr1 108610879
EDIT, adjust format to OP's output example:
awk 'BEGIN{FS=OFS="\t"}NR>1&&NF==6{$4=$NF=="+"?$4:" ";$5=$NF=="+"?" ":$5;print}' file
output:
Susd4 NM_144796 chr1 184695027 +
Ptpn14 NM_008976 chr1 191552147 +
Cd34 NM_001111059 chr1 196765080 +
Gm5698 NM_001166637 chr1 31055753 -
Epha4 NM_007936 chr1 77511663 -
Sp110 NM_175397 chr1 87495392 -
Bcl2 NM_009741 chr1 108610879 -
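If the goal is exactly five columns as in the question's expected output (symbol, refseq, seqname, the chosen coordinate, strand) rather than a blank placeholder column, a minimal sketch could look like the following. It assumes the table is tab-delimited and uses a shortened, made-up copy of the data:

```shell
table=$(printf '%s\n' \
    "$(printf 'symbol\trefseq\tseqname\tstart\tstop\tstrand')" \
    "$(printf 'Susd4\tNM_144796\tchr1\t184695027\t184826500\t+')" \
    "$(printf 'Gbx2\t\tchr1\t91824537\t91827751\t-')" \
    "$(printf 'Bcl2\tNM_009741\tchr1\t108434770\t108610879\t-')")

# Skip the header, drop rows with an empty refseq, and keep "start" for
# the + strand or "stop" for the - strand, followed by the strand itself.
picked=$(printf '%s\n' "$table" | awk 'BEGIN { FS = OFS = "\t" }
    NR > 1 && $2 != "" { print $1, $2, $3, ($6 == "+" ? $4 : $5), $6 }')
printf '%s\n' "$picked"
```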
When you deal with a text file with fields, awk is usually better than sed because awk was designed to help parse text files with fields.
How are the columns in your table setup? Are they tab delimited, or do you use spaces to help line up the columns?
If this is a tab delimited table, you could use awk to check if the second field is null:
awk -F'\t' '
{
    if ($2 == "") {
        print "Missing refseq in symbol " $1
    }
}
' "$myfile"
If your file uses spaces to align the various fields, you can still use awk by using its built-in substr function.
awk '
{
    if (substr($0, 9, 12) ~ /^ *$/) {
        print "Missing refseq in symbol " substr($0, 1, 7)
    }
}
' "$myfile"
By the way, I'm being rather wordy here to show you the syntax to make it understandable. I could have used a few shortcuts to put these on one line:
awk -F'\t' '$2 == "" {print "Missing refseq in symbol " $1}' "$myfile"
awk 'substr($0, 9, 12) ~ /^ *$/ {print "Missing refseq in symbol " substr($0, 1, 7)}' "$myfile"
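To try the fixed-width check without a real file, here is a small sketch that builds two aligned lines with printf. The column layout (symbol in columns 1-8, refseq in the next 12) is an assumption chosen to match the substr() offsets above, not something taken from the actual data:

```shell
# Two made-up rows: the second one has an empty refseq column.
rows=$(printf '%-8s%-12s %s\n' Susd4 NM_144796 chr1 Gbx2 '' chr1)

missing=$(printf '%s\n' "$rows" | awk '
    substr($0, 9, 12) ~ /^ *$/ {        # refseq column contains only blanks
        sym = substr($0, 1, 8)
        sub(/ +$/, "", sym)             # trim the padding after the symbol
        print "Missing refseq in symbol " sym
    }')
printf '%s\n' "$missing"
```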