Awk pattern matching on rows that have a value at specific column. No delimiter - awk

I would like to search a file, using awk, to output rows that have a value commencing at a specific column number. e.g.
I looking for 979719 starting at column number 10:
moobaaraa**979719**
moobaaraa123456
moo**979719**123456
moobaaraa**979719**
moobaaraa123456
As you can see, there are no delimiters. It is a raw data text file. I would like to output rows 1 and 4. Not row 3 which does contain the pattern but not at the desired column number.

awk '/979719$/' file
moobaaraa979719
moobaaraa979719

An simple sed approach.
$ cat file
moobaaraa979719
moobaaraa123456
moo979719123456
moobaaraa979719
moobaaraa123456
Just search for a pattern, that end's up with 979719 and print the line:
$ sed -n '/^.*979719$/p' file
moobaaraa979719
moobaaraa979719

This code works:
awk 'length($1) == 9' FS="979719" raw-text-file
This code sets 979719 as the field separator, and checks whether the first field has a length of 9 characters. Then prints the line (as default action).

awk 'substr($0,10,6) == 979719' file
You can drop the ,6 if you want to search from the 10th char to the end of each line.

Related

Adding a decimal point to an integer with awk or sed

So, I have csv files to use with hledger, and last field of every row is the amount for that line transaction.
Lines are in the following format:
date1, date2, description, amount
With the amount format any length between 4 and 6 digits; now for some reason all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How to add a '.' before the last two digits of all lines, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F , '{printf "%s,%s,%s,%-6.2f\n", $1, $2, $3, $4/100.0}'
You should always add a sample of your input file and of the output you want in your question.
In this input you provide, you will have to define what has to happen when the description field contains a ,, or if it is possible to have amount of less than 100 as input.
In function of your answer, I will need to adapt the code or not.
sed 's/..$/.&/'
......................
You can also use cut utility to get the desired output. In your case, you always want to add '.' before the last two digits. So essentially it can be thought as something like this:
Step 1: Get all the characters from the beginning till the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character that you want ('.' in this case).
The corresponding command for each of the step is the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=`echo $a | rev | cut -c3- | rev`
$ c=`echo $a | rev | cut -c1-2 | rev`
$ echo $b"."$c
This would produce the output
17.12.2005,18.12.2005,utility,125.23
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
awk -F, '{sub(/..$/,".& ")}1' file

awk: identify column by condition, change value, and finally print all columns

I want to extract the value in each row of a file that comes after AA. I can do this like so:
awk -F'[;=|]' '{for(i=1;i<=NF;i++)if($i=="AA"){print toupper($(i+1));next}}'
This gives me the exact information I need and converts to uppercase, which is exactly what I want to do. How can I do this and then print the entire row with this altered value in its previous position? I am essentially trying to do a find and replace where the value is changed to uppercase.
EDIT:
Here is a sample input line:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP
and here is a how I would like the output to look:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=G|||;VT=SNP
All that has changed is the g after AA= is changed to uppercase.
Following awk may help you on same.
awk '
{
match($0,/AA=[^|]*/);
print substr($0,1,RSTART+2) toupper(substr($0,RSTART+3,RLENGTH-3)) substr($0,RSTART+RLENGTH)
}
' Input_file
With GNU sed and perl, using word boundaries
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | sed 's/\bAA=[^;=|]*\b/\U&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | perl -pe 's/\bAA=[^;=|]*\b/\U$&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
\U will uppercase string following it until end or \E or another case-modifier
use g modifier if there can be more than one match per line

Comparing corresponding values of two lines in a file using awk [duplicate]

This question already has answers here:
Finding max value of a specific date awk
(3 answers)
Closed 6 years ago.
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
I have the line position of two lines in a file say line1 and line2. These lines may be anywhere in the file but I can access the line position using a search keyword based on name(the first word) in each line
20160801 means yyyymmdd and has an associated value separated by |
I need to compare the values associated with each of the date for the given two lines.
I am a newbie in awk. I am not understanding how to compare these two lines at the same time.
Your question is not at all clear. Perhaps the first step is to clearly articulate 1) What is the problem I am trying to solve; 2) what tools or data do I have to solve it?
The only hints specific to your question I can offer (since your problem statement is not clearly articulated) are these:
In awk, you can compare two different files by using the test FNR==NR which is only true on the first file.
You can find the key words by using a regular expression of the form /^name1/ which means lines that start with that pattern
You can split on a delimiter in awk by setting the field separator to that delimiter -- in this case (I think) it sounds like that is | but you are also comparing white space delimited fields inside of those fields?
You can compare by saving the data from the first line and comparing with the data from the second line in the other file once you can articulate what 'compare' means to you.
Wrapping that up, given:
$ cat /tmp/f1.txt
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
$ cat /tmp/f2.txt
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
You can find the lines in question like so:
$ awk -F"|" '/^name/ && FNR==NR {print $1}' f1.txt f2.txt
name1 20160801
$ awk -F"|" '/^name/ && FNR<NR {print $1}' f1.txt f2.txt
name2 20160801
(I have only printed the first field for clarity)
Then use that to compare. Save the first in an associative array and then compare the second when found.

gawk to create first column based on part of second column

I have a 2 column tsv that I need to insert a new first column using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n)

Add a string to the end of column 1 using awk

I have a file whose head looks like this:
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591 nReads=4 cov=3.26022
>PZ7180000024036 nReads=14 cov=5.86079
>PZ15501_A nReads=1 cov=1
I want to add the string _nogroup onto the first column of each line that does not have _XX already designated (i.e. the 1st column on the 1st line is fine but the 1st column on the 2nd line should read >PZ7180000031590_nogroup).
Can I do this using awk like to use the command line.
You can use this awk command:
awk '!($1 ~ /_[a-zA-Z]{2}$/) {$1=$1 "_nogroup"} 1' file
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590_nogroup nReads=3 cov=2.59465
>PZ7180000027934_nogroup nReads=5 cov=2.32231
>PZ456916_nogroup nReads=1 cov=1
>PZ7180000037718_nogroup nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591_nogroup nReads=4 cov=3.26022
>PZ7180000024036_nogroup nReads=14 cov=5.86079
>PZ15501_A_nogroup nReads=1 cov=1