Filter columns in Pig script - apache-pig

I am loading data in Pig from a CSV.
After having loaded data, I need to filter out columns .
exportAllProductsCleaned = FOREACH exportAllProducts
generate $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $26, $27, $28, $29, $30, $31, $32, $33
Is there a way wherein I can specify only
The columns I need to remove
OR
The range of columns I need for ex. $1-15 and then $18 - $30
Is it possible?

Yes, you can do so using '..' convention.Refer
Support project range expression
exportAllProductsCleaned = FOREACH exportAllProducts GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $26, $27, $28, $29, $30, $31, $32, $33;
exportAllProductsFiltered = FOREACH exportAllProductsCleaned GENERATE $1 .. $15,$18 .. $30;

Related

For each different occurrence in field, print lines with max value associated

I have
ID=exon-XM_030285750.2 LOC100221041 7895
ID=exon-XM_030285760.2 LOC100221041 8757
ID=exon-XM_030285720.2 LOC100221041 8656
ID=exon-XM_030285738.2 LOC100221041 8183
ID=exon-XM_030285728.2 LOC100221041 8402
ID=exon-XM_030285733.2 LOC100221041 7398
ID=exon-XM_030285715.2 LOC100221041 8780
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_030285774.2 CMSS1 1440
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_012570104.3 FILIP1L 6371
ID=exon-XM_030285654.2 FILIP1L 6456
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_032751000.1 FILIP1L 5886
ID=exon-XM_030285671.2 FILIP1L 5622
ID=exon-XM_030285682.2 FILIP1L 5395
ID=exon-XR_004369230.1 LOC116808959 2289
I want to print the line for which each element in $2 is associates with highest value in $3
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
I tried this
awk -f avg.sh test | awk 'BEGIN {OFS = "\t"} arr[$2]==0 {arr[$2]=$3} ($3 > arr[$2]) {arr[$2]=$3} END{for (i in arr) {print i, arr[i]}}'
from here
how to conditionally filter rows in awk
but I would like to also keep $1 in the output and keep the same ordering as in the input.
The answer to this
Computing averages of chunks of a column
shows how to build an array that keeps the original ordering, but I'm falling putting the two together
Could you please try following, written and tested with shown samples in GNU awk.
awk '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file
Output will be as follows.
ID=exon-XM_030285707.2 1 8963
ID=exon-XM_030285694.2 2 5838
ID=exon-XM_012570107.3 3 1502
ID=exon-XM_030285647.2 4 6488
ID=exon-XR_004369230.1 5 2289
To get in TAB separated form try following.
awk -v OFS="\t" '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file |
column -t -s $'\t'
You may use this awk:
awk '!($2 in max) || $3 > max[$2] {
if(!($2 in max))
ord[++n] = $2
max[$2] = $3
rec[$2] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file | column -t
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
You can do with sort and awk.
If ordering is optional.
$ sort -k2,2 -k3,3nr madza.txt | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }'
ID=exon-XR_004369230.1 LOC116808959 2289
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
$
To keep the ordering, you can introduce seq numbers and remove them at the last.
$ awk ' { $(NF+1)=NR}1 ' madza.txt | sort -k2,2 -k3,3nr | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }' | sort -k4 -n | awk ' {NF=NF-1}1 '
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
$

awk if else statement syntax error. What is causing it?

There's a syntax error in awk if else statement which I got the code from another question and unable to fix it.
Bash one-liner code to output unique values.
Can someone correct the statement.
awk 'BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3} {print 0} else {if(output==0} {print 3} else {print output}}'
debug output
awk 'BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}'
awk: cmd. line:1: BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN {output=0} /Slave_IO_Running.*No/ {output+=1} /Slave_SQL_Running.*No/ {output +=2} END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
if(output==0} and if(output==3} should end with close paren ), not close brace }.
You should use else if for nested if statements, and those braces are only necessary for multiple operations.
END {if(output==3) print 0; else if(output==0) print 3; else print output}
Just for fun:
END {print output==3? 0: output==0? 3: output}
There are clues in the error messages, e.g.:
awk: cmd. line:1: ... END {if(output==3}{print 0} else {if(output==0} {print 3} else {print output}}
awk: cmd. line:1: ^ syntax error
It really couldn't be much clearer.

error when edditing big text file using awk [duplicate]

This question already has an answer here:
modifying the text file in awk
(1 answer)
Closed 4 years ago.
I have a text file like the following small example:
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; tr
anscript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene "OTTHUMG00000000961.2"; havana_tran
script "OTTHUMT00000002844.2";
chr2 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr3 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000450305.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript
_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; havana_gene
"OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
in the file there are different lines. each line starts with chr. every line has some columns and separators are either tab or ";".
I want to make a new file from this one in which there would be a change only in columns 4 and 5. in fact column 4 in the new file would be ((column 4 in original file)-12) and 5th column in the new file would be ((column 4 in original file)+50). I tried to do that in awk using the following command:
awk 'BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32";"$33" "$34";"$35" "$36";"$37" "$38";" }' input.txt > test2.txt
when I run the code, it would return this error:
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: BEGIN { FS="\t;" } {print $1"\t"$2"\t"$3"\t"$4=$4-12"\t"$5=$4+50"\t"$6"\t"$7"\t"$8"\t"$9" "$10";"$11" "$12";"$13" "$14";"$15" "$16";"$17" "$18";"$19" "$20";"$21" "$22";"$23" "$24";"$25" "$26";"$27" "$28";"$29" "$30";"$31" "$32 ";" $33" "$34";"$35" "$36";"$37" "$38";" }
awk: cmd. line:1: ^ syntax error
do you know how to fix it?
try this
awk ' {print $1 "\t" $2"\t"$3"\t"($4 - 12)"\t" ($5 + 50)"\t" $6 "\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18""$19" "$20""$21" "$22";"$23" "$24""$25" "$26""$27" "$28""$29" "$30""$31" "$32""$33" "$34""$35" "$36""$37" "$38"" }' input.txt > output.txt
What you basically do is:
awk '{print $1=$1+2 ";" $2=$1+2}' < input
What you need to do is:
awk '{print ($1=$1+2) ";" ($2=$1+2)}' < input
^^^ Note that you need to parenthesize your inline assignments.
Or just do the assignments before printing:
awk '{$1=$1+2;$2=$1+2;print ... }' < input

awk to match field between two files and use conditions on match

I am trying to look for $2 of file1 (skipping the header) in $2 of file2 and if they match and the value in $10 is > 30 and $11 is > 49, then print the line to a output file. The below awk has syntax errors in it though shellcheck didn't return any. Both the input and output are tab-delimited. I think the below is close, but not sure what is wrong. Thank you :).
awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A
{if($10 >.5 OFS $11 > 49)
print ; next
' file1 file2
awk: cmd. line:2: {if($10 >.5 OFS $11 > 49)
awk: cmd. line:2: ^ syntax error
awk: cmd. line:3: print ; next
awk: cmd. line:3: ^ unexpected newline or end of string
file1
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
file2
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
desired output
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
edit with new awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A {
if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
print }
' file1 file2 > out
awk: cmd. line:2: if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
awk: cmd. line:2: ^ syntax error
here you go...
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$2]; next}
($2 in a) && $10>30 && $11>49 ' file1 file2

remove lines that do not match specific digits in list file using awk

I am trying to use awk to remove the lines in file that do not match the digits after the NM_ but before the . in $2 of list. Thank you :).
file
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905 chr7 + 138145078 138270332 138145293
list
TRIM24 NM_015905.2
awk
awk -v OFS="\t" '{ sub(/\r/, "") } ; NR==FNR { N=$2 ; sub(/\..*/, "", $2); A[$2]=N; next } ; $2 in A { $2=A[$2] } 1' list file > out
current output
204 NM_003852 chr7 + 138145078 138270332 138145293
204 NM_015905.2 chr7 + 138145078 138270332 138145293
desired output (line 1 removed as that is the line that does not match)
204 NM_015905.2 chr7 + 138145078 138270332 138145293
awk 'NR==FNR{split($2,f2,".");a[f2[1]];next} $2 in a' list file
$ awk -F'[ .]' 'NR==FNR{a[$2];next}$2 in a' list file
204 NM_015905 chr7 + 138145078 138270332 138145293