awk to remove lines that finish with a character - awk

I have a string:
COL1 COL2
PRE test1/
PRE test1/
PRE test1/
2023-01-27 12:37:16
2023-01-27 12:37:16
2023-01-27 12:37:16
2023-01-27 12:37:16
2023-01-27 12:37:16
I want awk to leave a blank line in place of each complete line that ends with the character "/", but I can't figure it out.
I tried this, for example, but it doesn't work:
awk '{gsub("*/",""); print $1 $2}'
Thanks!

Use pattern expressions to run different code depending on whether or not the line ends with /.
awk '!/\/$/ {print $1, $2}
/\/$/ {print ""}' filename
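As a quick check, the two-pattern approach can be run on a small sample (the file name sample.txt and the shortened lines are illustrative):

```shell
# Sample input: one line ending in "/" and one data line (illustrative).
printf 'PRE test1/\n2023-01-27 12:37:16\n' > sample.txt

# Lines ending in "/" become blank; others print their first two fields.
awk '!/\/$/ {print $1, $2}
     /\/$/  {print ""}' sample.txt
```

The first line is replaced by a blank line; the second prints its first two fields.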

Related

Using awk gsub with \1 to replace chars with a section of the original characters

This is what I'm doing (I just want to get rid of the leading numbers in the fourth column)
cat text.txt | awk 'BEGIN {OFS="\t"} {gsub(/[0-9XY][0-9]?([pq])/,"\1",$4); print}'
This is my input
AADDC 4902 3 21q11.3-p11.1 4784 4793
DEEDA 4023 6 9p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 8q21-q22 5811 5812
This is my Output
AADDC 4902 3 11.3-p11.1 4784 4793
DEEDA 4023 6 21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 21-q22 5811 5812
But I want this to be my output
AADDC 4902 3 q11.3-p11.1 4784 4793
DEEDA 4023 6 p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 q21-q22 5811 5812
If you use GNU awk, you can use gensub which, unlike gsub, supports backreferences:
awk 'BEGIN {OFS="\t"} {$4=gensub(/[0-9XY][0-9]?([pq])/,"\\1",1,$4); print}' text.txt
Some explanations:
What is the extra "\" before the 1 for?
Because otherwise, that would be the character with ASCII code 1.
Why does the 1 need to be placed between the "\\1" and the $4?
To tell gensub to replace only the first occurrence of the pattern.
Is there a reason why you must put $4= as well as $4
Yes, unlike gsub, gensub doesn't modify the field but returns the updated one.
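A minimal check of the gensub call on the first sample row (this assumes GNU awk is installed as gawk):

```shell
# gensub returns the modified string; the "1" argument replaces only
# the first match, and "\\1" is the backreference to ([pq]).
printf 'AADDC 4902 3 21q11.3-p11.1 4784 4793\n' |
  gawk 'BEGIN {OFS="\t"} {$4=gensub(/[0-9XY][0-9]?([pq])/,"\\1",1,$4); print}'
```

Field 4 becomes q11.3-p11.1: the leading "21" is dropped and the "q" is kept.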

awk to use header field to count fields

I am trying to use awk to count the headers and use those as field numbers. My problem is twofold:
the awk as-is ignores the field headers and defines the fields using the text (sometimes field 5 starts with NM_, other times it is LRG_), as the RefSeqGene.txt example illustrates. I think that is because not all the fields have text, but the headers are consistent.
I only want to pull the rows where $10 = "reference standard".
The awk is close, but I need some expert help to make it better. Thank you :).
awk
awk 'FNR==NR {E[$1]; next }$3 in E {print $3, $5}' panel_genes.txt RefSeqGene.txt > update.txt
example of panel_genes.txt (used to search RefSeqGene.txt)
ACTA2
BRAF
BHLHB9
example of RefSeqGene.txt
#tax_id GeneID Symbol RSG LRG RNA t Protein p Category
9606 59 ACTA2 NG_011541.1 NM_001613.2 NP_001604.1 reference standard
9606 59 ACTA2 NG_011541.1 NM_001141945.1 NP_001135417.1 reference standard
9606 673 BRAF NG_007873.3 LRG_299 NM_004333.4 t1 NP_004324.2 p1 reference standard
9606 80823 BHLHB9 NG_021340.1 NM_001142524.1 NP_001135996.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142525.1 NP_001135997.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142526.1 NP_001135998.1 aligned
desired output
ACTA2 NM_001613.2
ACTA2 NM_001141945.1
BRAF NM_004333.4
this one-liner gives you the desired output:
awk 'FNR==NR{a[$0];next}
$(NF-1)$NF=="referencestandard" && $3 in a{print $3, ($5~/^NM_/?$5:$6)}' file1 file2
$(NF-1)$NF=="referencestandard" checks your $10 ("reference standard" spans the last two whitespace-separated fields)
if $5 begins with NM_ we take it; otherwise we take $6
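Putting the pieces together, the one-liner can be checked against the posted data (the file names panel_genes.txt and RefSeqGene.txt are the ones from the question; only the rows shown above are used):

```shell
# Build the two inputs from the question (gene list and annotated rows).
printf 'ACTA2\nBRAF\nBHLHB9\n' > panel_genes.txt
cat > RefSeqGene.txt <<'EOF'
#tax_id GeneID Symbol RSG LRG RNA t Protein p Category
9606 59 ACTA2 NG_011541.1 NM_001613.2 NP_001604.1 reference standard
9606 59 ACTA2 NG_011541.1 NM_001141945.1 NP_001135417.1 reference standard
9606 673 BRAF NG_007873.3 LRG_299 NM_004333.4 t1 NP_004324.2 p1 reference standard
9606 80823 BHLHB9 NG_021340.1 NM_001142524.1 NP_001135996.1 aligned
EOF

# First pass stores the gene list; second pass keeps "reference standard"
# rows and picks the NM_ accession from $5 or $6.
awk 'FNR==NR{a[$0];next}
     $(NF-1)$NF=="referencestandard" && $3 in a{print $3, ($5~/^NM_/?$5:$6)}' \
  panel_genes.txt RefSeqGene.txt
```

This prints the three desired lines (two ACTA2 rows and one BRAF row); the aligned BHLHB9 rows are filtered out.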

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For each file I would like to extract column 1 and the column that starts with XF:Z:. This will give me output 1.
The file names are htseqoutput*.sam.sam, where * varies. I am not sure what to use for the awk part, or whether the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part) but be careful with your quotes: straight double quotes (") should be used, not curly ones (“ ”).
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
You can try this, if the column with "XF:Z:" is always at the end:
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
you get,
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if this column is at a variable position in each file:
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
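A quick sanity check of the split-based variant on a shortened line (the read name SEQ_11 and the trimmed fields are illustrative stand-ins for the long SAM line above):

```shell
# The last field "XF:Z:SNORD43" is split on ":"; a[n] is the final piece.
printf 'SEQ_11\t16\tchr22\tYT:Z:UU\tXF:Z:SNORD43\n' |
  awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}'
```

This prints the read name and SNORD43, separated by a tab.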

I have a script which works, but how do I make it more "elegant"?

Some background. I have two files (A and B) which contain data I need to extract.
For file A, I only need the last two lines which look like this:
RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03
1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000
I need to extract the following numbers:
-1st Line, 2nd field (17)
-1st Line 4th field (0.59892E-04)
-2nd Line, 1st field (1)
-2nd Line, 3rd field (-.22167440E+01)
-2nd Line, 5th field (-.22167440E+01)
-2nd Line, 8th field (-.398708E-10)
-2nd Line, 10th field (2.0000)
For file B, I only need the last 11 lines which look like this:
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
I need to extract the following numbers:
-1st line, 6th field (0.364)
-2nd line, 4th field (0.355)
-3rd line, 4th field (0.009)
-4th line, 4th field (1.423)
-6th line, 5th field (9896.)
-7th line, 5th field (0.)
My output should be like this:
mainfolder1[tab/space]subfolder1[tab/space][all the extracted info separated by tab]
mainfolder2[tab/space]subfolder2[tab/space][all the extracted info separated by tab]
mainfolder3[tab/space]subfolder3[tab/space][all the extracted info separated by tab]
...
mainfoldern[tab/space]subfoldern[tab/space][all the extracted info separated by tab]
Now this is my script:
for m in ./*/; do
main=$(basename "$m")
for s in "$m"*/; do
sub=$(basename "$s")
vdata=$(tail -n2 ./$main/$sub/A | awk -F'[ =]+' NR==1'{a=$2;b=$4;next}{print s,a,$2,$4,$6,$9, $11}')
ctime=$(tail -n11 ./$main/$sub/B |head -n1|awk '{print $6}')
utime=$(tail -n10 ./$main/$sub/B |head -n1|awk '{print $4}')
stime=$(tail -n9 ./$main/$sub/B |head -n1|awk '{print $4}')
etime=$(tail -n8 ./$main/$sub/B |head -n1|awk '{print $4}')
maxmem=$(tail -n6 ./$main/$sub/B |head -n1|awk '{print $5}')
avemem=$(tail -n5 ./$main/$sub/B |head -n1|awk '{print $5}')
c=$(echo $sub| cut -c 2-)
echo "$m $c $vdata $ctime $utime $stime $etime $maxmem $avemem"
done
done > output
Now, the fourth line, the vdata part, was actually "recycled" from a previous forum question. I do not fully understand it. I want my file B code to be as elegant as that awk code for file A. How do I do it? Thank you! :)
awk 'NR==1{print $6} NR==2{print $4} NR==3{print $4} ...'
You could simplify a bit with:
NR==2 || NR==3 || NR==4
but that seems hard to maintain. Or you could use an array:
awk 'BEGIN{a[1]=6;a[2]=4...} NR in a{ print $a[NR]}'
But I think you really just want:
awk '{print $NF}' ORS=\\t
(You don't want the 6th field from row 1. You want the last field.)
Rather than trying to collect the output into variables just to be echoed, add ORS=\\t to get tab separated output, and just let it print to stdout of the script.
For file B try something like:
tail -n11 B | awk -F':' '{ print $2 }'
if you need to retain the values and then echo, you could do something like:
array=($(tail -n11 B | awk -F':' '{ print $2 }'))
for value in "${array[@]}"
do
echo $value
done
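For instance, on a two-line stand-in for the tail of file B (the file name B and the shortened contents are illustrative), splitting on ":" leaves the value, with its leading space, in $2:

```shell
printf 'Total CPU time used (sec): 0.364\nUser time (sec): 0.355\n' > B
# Everything after the first ":" on each line (including the leading
# space) is field 2.
awk -F':' '{ print $2 }' B
```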
You should look into find and xargs, since a shell loop written just to manipulate text is usually the wrong approach. But to keep it simple and retain your original structure, it sounds like you could use something like:
for m in ./*/; do
main=$(basename "$m")
for s in "$m"*/; do
sub=$(basename "$s")
fileA="${main}/${sub}/A"
fileB="${main}/${sub}/B"
awk -v sizeA=$(wc -l < "$fileA") -v sizeB=$(wc -l < "$fileB") '
NR==FNR {
if ( FNR == (sizeA-1) ) { split($0,p) }
if ( FNR == sizeA ) { split($0,a) }
next
}
{ b[sizeB + 1 - FNR] = $NF }
END {
split(FILENAME,f,"/")
print f[1], f[2], p[2], p[4], a[1], a[3], a[5], a[8], a[10], b[11], b[10], b[9], b[8], b[6], b[5]
}
' "$fileA" "$fileB"
done
done > output
Note that the above only opens each "B" file 1 time instead of 6.

awk and log2 divisions

I have a tab delimited file that looks something like this:
foo 0 4
boo 3 2
blah 4 0
flah 1 1
I am trying to calculate the log2 of the ratio between the two columns for each row. My problem is the division by zero.
What I have tried is this:
cat file.txt | awk -v OFS='\t' '{print $1, log($3/$2)/log(2)}'
when there is a zero as the denominator, awk crashes. What I want is some sort of conditional statement that prints "inf" as the result when the denominator is equal to 0.
I am really not sure how to go about this?
Any help would be appreciated
Thanks
You can implement that as follows (with a few additional tweaks):
awk 'BEGIN{OFS="\t"} {if ($2==0) {print $1, "inf"} else {print $1, log($3/$2)/log(2)}}' file.txt
Explanation:
if ($2==0) {print $1, "inf"} else {...} - First check to see if the 2nd field ($2) is zero. If so, print $1 and inf and move on to the next line; otherwise proceed as usual.
BEGIN{OFS="\t"} - Set OFS inside the awk script; mostly a preference thing.
... file.txt - awk can read from files when you specify them as arguments; this saves a cat process. (See UUOC, the "useless use of cat".)
awk -F'\t' '{print $1,($2 ? log($3/$2)/log(2) : "inf")}' file.txt
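Both answers can be checked against the sample rows (tab-separated, as in the question); the ternary version, for example:

```shell
# foo has denominator 0 -> "inf"; flah gives log(1)/log(2) = 0;
# boo gives log(2/3)/log(2), roughly -0.585.
printf 'foo\t0\t4\nboo\t3\t2\nflah\t1\t1\n' |
  awk -F'\t' '{print $1,($2 ? log($3/$2)/log(2) : "inf")}'
```

Note that the blah row (denominator nonzero but numerator 0) would still hit log(0), which awk reports as -inf rather than crashing.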