How to remove partially redundant patterns in each row? - awk

I have a file like this:
reference 25038 A G 39134 1 TPPH54 TPPH49 TPPH50 TPPHL51 TPPH52 TPPH53 TPPH55 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 TPPH49 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 TPPH30 TPPH32 p.Gly48Gly
and I would like to get:
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly
How can I remove, with awk/sed/grep, all the patterns after the first one (always $7) that share the same beginning?
I was thinking something like:
only print the first 7 columns and the last one
paste <(awk '{print $1, $2, $3, $4, $5, $6, $7}' file) <(awk '{print ????}' file-tmp) > file-final
but I don't know how to get the last one because the number of columns can differ in each row
or 'scan' each row until reaching the first expression beginning with 'TPPH', keep it and remove the following ones. I'm not sure how to do that.
Thank you very much in advance for your help!

You can just do:
awk '{print $1, $2, $3, $4, $5, $6, $7, $NF}' file | column -t
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly
Here column -t has been used for tabular display only.

Using sed with a substitution loop: the :a label and the ta conditional branch repeat the substitution as long as it keeps succeeding, and each pass deletes the next TPPH field that follows the first TPPH field after the sixth column.
$ sed -E ':a;s/(([^ \t]*[ \t]+){6}TPPH[0-9]+)[ \t]+TPPH[^ \t]*[ \t]+/\1\t/;ta' input_file
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly

With your shown samples, please try the following GNU awk code. It uses awk's match() function with a regex containing 2 capturing groups; GNU awk saves the captured text into the array named arr (indexed 1, 2 and so on), whose values are then printed as per the required output.
awk '
match($0,/^(\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+).*\s+(\S+)/,arr){
print arr[1],arr[2]
}
' Input_file

This might work for you (GNU sed):
sed -E 's/\S+/\n&/8g;s/\n.*\n//;s/\n//' file
Insert newlines before the 8th and subsequent fields.
Remove everything between the first and last newlines.
If there were no extra fields, remove the unwanted newline.
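To see what the first substitution does on its own, you can run it in isolation (a quick sketch with GNU sed; head just limits the display to the first transformed line's first pieces):
$ sed -E 's/\S+/\n&/8g' file | head -n4
reference 25038 A G 39134 1 TPPH54
TPPH49
TPPH50
TPPHL51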
Alternative:
sed -E 's/^((\S+\s+){7})((\S+)\s*)*/\1\4/' file
Keep the first seven fields and their space delimiters and retain the last field.
N.B. The last opportunity for a back reference is kept when the * quantifier is used in conjunction with grouping.
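As a quick illustration of that behaviour (GNU sed): with a repeated group, the back reference ends up holding whatever the group captured on its final iteration:
$ echo 'a b c d' | sed -E 's/((\S+)\s*)*/[\2]/'
[d]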

To print the last field in awk, you can use $NF.
As your example data does not contain TPPH before the 7th field, you can split on $7 and concatenate the parts:
awk '$7~/^TPPH/{split($0,a,$7);print a[1], $7, $NF}' file
Output
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly
Note that if you have exactly 7 columns and you print $7 AND $NF you will print the same value twice.
In that case you can print the last field only if there are more than 7 fields:
awk '$7~/^TPPH/{split($0,a,$7);print a[1], $7 (NF==7?"" : OFS $NF)}' file

Related

awk/sed - replace column with pattern using variables from other columns

I have a tab delimited text file:
#CHROM POS ID REF ALT
1 188277 rs434 C T
20 54183975 rs5321 CTAAA C
and I try to replace the "ID" column with the specific pattern $CHROM_$POS_$REF_$ALT using sed or awk:
#CHROM POS ID REF ALT
1 188277 1_188277_C_T C T
20 54183975 20_54183975_CTAAA_C CTAAA C
Unfortunately, I only managed to delete this ID column with:
sed -i -r 's/\S+//3'
and all the patterns I try do not work in all cases. To be honest, I am lost in the documentation and I am looking for examples that could help me solve this problem.
Using awk, you can set the value of the 3rd field by concatenating fields 1, 2, 4 and 5 with underscores, except on the first line. column -t is used to present the output as a table:
awk '
BEGIN{FS=OFS="\t"}
NR>1 {
$3 = $1"_"$2"_"$4"_"$5
}1' file | column -t
Output
#CHROM POS ID REF ALT
1 188277 1_188277_C_T C T
20 54183975 20_54183975_CTAAA_C CTAAA C
Or writing all fields, with a custom value for the 3rd field:
awk '
BEGIN{FS=OFS="\t"}
NR==1{print;next}
{print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}
' file | column -t
GNU sed solution
sed '2,$s/\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)/\1\t\2\t\1_\2_\4_\5\t\4\t\5/' file.txt
Explanation: from line 2 to the last line, do the following replacement: capture the 5 \t-separated columns (each holding zero or more non-whitespace characters) into groups, then replace them with the same columns joined by \t, except the third one, which is replaced by the _-join of the 1st, 2nd, 4th and 5th columns.
(tested in sed (GNU sed) 4.2.2)
awk -v OFS='\t' 'NR==1 {print $0}; NR>1 {print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}' inputfile.txt

print specific value from 7th column using pattern matching along with first 6 columns

file1
1 123 ab456 A G PASS AC=0.15;FB=1.5;BV=45; 0|0 0|0 0|1 0|0
4 789 ab123 C T PASS FB=90;AC=2.15;BV=12; 0|1 0|1 0|0 0|0
desired output
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
I used
awk '{print $1,$2,$3,$4,$5,$6,$7}' file1 > out1.txt
sed -i 's/;/\t/g' out1.txt
awk '{print $1,$2,$3,$4,$5,$6,$7,$8}' out1.txt
output generated
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS FB=90
I want to print the first 6 columns along with the value of AC=(*) from the 7th column.
With your shown samples, please try the following awk code.
awk '
{
val=""
while(match($7,/AC=[^;]*/)){
val=(val?val:"")substr($7,RSTART,RLENGTH)
$7=substr($7,RSTART+RLENGTH)
}
print $1,$2,$3,$4,$5,$6,val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
val="" ##Emptying val here.
while(match($7,/AC=[^;]*/)){ ##Looping while the match function finds an occurrence of AC= up to the next semicolon in the 7th field.
val=(val?val:"")substr($7,RSTART,RLENGTH) ##Appending the matched text from the 7th field to val.
$7=substr($7,RSTART+RLENGTH) ##Assigning the rest of the 7th field back to $7 so the next iteration continues after the match.
}
print $1,$2,$3,$4,$5,$6,val ##Printing the columns required by OP along with val here.
}
' Input_file ##Mentioning Input_file name here.
$ awk '{
n=split($7,a,/;/) # split $7 on ;s
for(i=1;i<=n&&a[i]!~/^AC=/;i++); # just loop looking for AC
print $1,$2,$3,$4,$5,$6,a[i] # output
}' file
Output:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
If AC= is not found, an empty field is output instead.
Any time you have tag=value pairs in your data I find it best to first populate an array (f[] below) to hold those tag-value mappings so you can print/test/rearrange those values by their tags (names).
Using any awk in any shell on every Unix box:
$ cat tst.awk
{
n = split($7,tmp,/[=;]/)
for (i=1; i<n; i+=2) {
f[tmp[i]] = tmp[i] "=" tmp[i+1]
}
sub(/[[:space:]]*[^[:space:]]+;.*/,"")
print $0, f["AC"]
}
$ awk -f tst.awk file
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
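As a sketch of why that tag-to-value array is convenient, looking up a different tag only means changing the key; for example, this hypothetical one-liner reuses the same idea to pull the FB values instead:
$ awk '{n=split($7,tmp,/[=;]/); for (i=1;i<n;i+=2) f[tmp[i]]=tmp[i]"="tmp[i+1]; print $1, f["FB"]}' file
1 FB=1.5
4 FB=90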
This might work for you (GNU sed):
sed -nE 's/^((\S+\s){6})\S*;?(AC=[^;]*);.*/\1\3/p' file
Turn off implicit printing -n and add easier regexp -E.
Match the first six fields and their delimiters and append the AC tag and its value from the next.
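Run against the sample file this should print:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15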
With only GNU sed:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/' file1
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
Lines without an AC=... part in the 7th field will be printed without modification. If you prefer to remove the 7th field and everything after it in that case, use:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/;t;s/\S+;.*//' file1

Bash/AWK conditionals using two files

First of all, thank you for your help. I have a problem trying to use bash conditionals with two files. I have the file letters.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
And I want to use conditionals so that if a letter in letters.txt is also in number.txt, it generates the files a.txt, b.txt and c.txt, which will look like this:
a.txt
A 12
A 15
b.txt
B 10
B 14
c.txt
C 18
I know I can do it using this code:
cat number.txt | awk '{if($1=="A")print $0}' > a.txt
But I want to do it using two files.
The efficient way to approach this type of problem is to sort the input on the key field(s) first so you don't need to have multiple output files open simultaneously (which has limits and/or can slow processing down managing them) or be opening/closing output files with every line read (which is always slow).
Using GNU sort for -s (stable sort) to retain input order of the non-key fields and only having 1 output file open at a time and keeping it open for the whole time it's being populated:
$ sort -k1,1 -s number.txt |
awk '
NR==FNR { lets[$1]; next }
!($1 in lets) { next }
$1 != prev { close(out); out=tolower($1) ".txt"; prev=$1 }
{ print > out }
' letters.txt -
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18
If you don't have GNU sort for -s to retain input order of the lines for each key field, you can replace it with awk | sort | cut, e.g.:
$ sort -k1,1 -s number.txt
A 15
A 12
B 10
B 14
C 18
D 20
E 23
$ awk '{print NR, $0}' number.txt | sort -k2,2 -k1,1n | cut -d' ' -f2-
A 15
A 12
B 10
B 14
C 18
D 20
E 23
Note the change in the order of the 2nd fields for A compared to the input order when the above is not done, because by default sort doesn't guarantee to retain the relative line order for each key it sorts on:
$ sort -k1,1 number.txt
A 12
A 15
B 10
B 14
C 18
D 20
E 23
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$0]
next
}
($1 in arr){
outputFile=(tolower($1)".txt")
print >> (outputFile)
close(outputFile)
}
' letters.txt number.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when letters.txt is being read.
arr[$0] ##Creating array arr with index of current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if 1st field is present in arr.
outputFile=(tolower($1)".txt") ##Creating outputFile to print output.
print >> (outputFile) ##Printing current line into output file.
close(outputFile) ##Closing the output file after writing so too many files are not left open.
}
' letters.txt number.txt ##Mentioning Input_file names here.

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values that belong to the first column, all in one command with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique values from the second column grouped by the value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
With a complete awk solution, could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition: if the 1st and 2nd column combination is NOT already present in the found array, then do the following.
val[$1]++ ##Creating val with the 1st column as index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
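For the sample input this should print (the key order from for(i in val) is not guaranteed):
A 3
B 2
C 1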
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
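If you want the letter before the count, as in the desired output, a small awk at the end can swap the columns (a sketch):
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1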
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
Requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
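For example, if the real file is not already grouped on the first column (a sketch; -s makes datamash sort its input first):
$ datamash -s -g1 countunique 2 < input.txt
A 3
B 2
C 1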
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)

AWK Retrieve text after a certain pattern where the 1st and 2nd columns match the values in the 1st and 2nd columns in an input file

My input file (file1) looks like this:
part position col3 col4 info
part1 34 1 1 NAME=Mark;AGE=23;HEIGHT=189
part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167
part2 19 1 1 NAME=Emily;AGE=16;HEIGHT=164
part3 23 1 1 NAME=Owen;AGE=55;HEIGHT=181
part3 99 1 1 NAME=Rachel;AGE=76;HEIGHT=162
I need to retrieve the text after "NAME=" in the info column, but only if the values in the first two columns match another file (file2).
part position
part2 55
part3 23
Then only the 2nd and 4th rows will be considered, and the text after "NAME=" in those rows is put into the output file:
Alice
Owen
I don't need to preserve the order of the original rows, so the following output is equally valid:
Owen
Alice
My (not very good) attempt:
awk -F, 'FNR==NR {a[$1]=$5; next}; $1 in a {print a[$1]}' file1 file2
Something like,
awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}'
Example
$ awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}' file1 file2
Alice
Owen
What does it do?
-F"[ =;]" -F sets the field separators. Here we set it to space or = or ;. This makes it easier to get the name from the first file without using a split function.
found[$1" "$2]=$6 This block is run only for file1, here we save the names $6 in the associative array found indexed by part position
$1" "$2 in found{print found[$1" "$2]} This is executed for the second file. Checks if the part position is found in the array, if yes print the name from the array
Using GNU awk, the below would do the same:
awk 'NR>1 && NR==FNR{found[$1","$2];next}\
$1","$2 in found{print gensub(/^NAME=([^;]*).*/,"\\1","1",$NF);}' file2 file1
Output
Alice
Owen