Get the data after selected two strings in one line (filename.txt) - awk

Output requirements:
I want to extract the value that follows 33A::ISN and the price that follows 16E:AVVV//FMMM/.
A 16E:AVVV//FMMM/ line always comes after its 33A::ISN line.
For each such pair, the value after 33A::ISN and the price after 16E:AVVV//FMMM/ should be printed together on one line.
The next pair should be printed on line 2, and so on for the remaining pairs.
Input data:
11A::Aqty//PRCE/666,5000
11B::SUB//VEND
11c::ASD//FSE/890,
33A::ISN USDFG238
//instrument charaterted
34A::PRIC//VEND,
16E:AVVV//FMMM/7890,
19A:HGLP//USD
33A::ISN SECFG238
//inst EWQW
16E:AVVV//FMMM/9890,
19A:HGLP//HUT
33A::ISN ERWQWW8
//iCHAR HAT
16E:AVVV//FMMM/134000,
19A:HGLP//POT
Output:
Print the data with a comma as the delimiter.
USDFG238,7890
SECFG238,9890
ERWQWW8,134000
What I have tried:
I have tried awk and sed commands. I need to get the output on Linux.

I would use GNU AWK for this task in the following way. Let the content of file.txt be
11A::Aqty//PRCE/666,5000
11B::SUB//VEND
11c::ASD//FSE/890,
33A::ISN USDFG238
//instrument charaterted
34A::PRIC//VEND,
16E:AVVV//FMMM/7890,
19A:HGLP//USD
33A::ISN SECFG238
//inst EWQW
16E:AVVV//FMMM/9890,
19A:HGLP//HUT
33A::ISN ERWQWW8
//iCHAR HAT
16E:AVVV//FMMM/134000,
19A:HGLP//POT
then
awk 'index($0,"33A::ISN "){printf "%s,",substr($0,10)}index($0,"16E:AVVV//FMMM/"){print substr($0,16,length-16)}' file.txt
gives output
USDFG238,7890
SECFG238,9890
ERWQWW8,134000
Explanation: I use the string function index to detect lines containing 33A::ISN and lines containing 16E:AVVV//FMMM/. I do that rather than using a regular expression to avoid problems with /, which would have to be escaped inside a regex literal.
For lines that contain 33A::ISN (followed by a space) I use substr to get everything from the 10th character on, that is, everything after the 9-character marker, and output it with printf followed by a comma. Unlike print, printf does not append ORS (newline by default).
For lines containing 16E:AVVV//FMMM/ I print (so there will be a newline) everything after the 15-character marker except the last character, the trailing comma. length is the length of the whole line (number of characters), so length-16 is the number of characters left once the marker and the final comma are removed.
Both substr calls assume that the marker starts at the beginning of the line.
(tested in GNU Awk 5.0.1)
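A variant of the same idea, as a minimal sketch: rather than relying on printf without a newline, it remembers the identifier and prints the pair only when the price line arrives. The function name extract_pairs is hypothetical, and the sketch assumes that both markers start at the beginning of their lines and that the price may or may not end with a comma.

```shell
# extract_pairs [FILE...]: print "identifier,price" for each pair of
# 33A::ISN and 16E:AVVV//FMMM/ lines. Hypothetical helper name.
extract_pairs() {
  awk '
    index($0, "33A::ISN ") == 1 {        # marker at start of line
      isn = substr($0, 10)               # text after the 9-character marker
      next
    }
    index($0, "16E:AVVV//FMMM/") == 1 {  # marker at start of line
      price = substr($0, 16)             # text after the 15-character marker
      sub(/,$/, "", price)               # drop a trailing comma, if present
      print isn "," price
    }
  ' "$@"
}

# Usage on a shortened version of the sample input
printf '%s\n' '11B::SUB//VEND' '33A::ISN USDFG238' '//instrument' \
  '16E:AVVV//FMMM/7890,' '19A:HGLP//USD' \
  '33A::ISN SECFG238' '16E:AVVV//FMMM/9890,' | extract_pairs
# prints:
# USDFG238,7890
# SECFG238,9890
```

Because the price is cleaned with sub instead of a computed length, a price line without the trailing comma is handled as well.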

awk: calculating sum from values in single field with multiple delimiters

This is related to another post of mine, "parsing a sql string for integer values with multiple delimiters", in which I said I could easily accomplish the same with UNIX tools (ahem). I found it a bit messier than expected, so I'm looking for an awk solution. Any suggestions on the following?
Here is my original post, paraphrased:
#
I want to use awk to parse data sourced from a flat file that is pipe delimited. One of the fields is sub-formatted as follows. My end state is to sum the integers within the field, but my question here is about ways to use awk to get at the numeric values in the field. The pattern of the sub-formatting will always be that each desired number is preceded by a tilde (~) and followed by an asterisk (*), except for the last one in the field. The number of subfields may vary too (my example has 5, but there could be more or fewer). The 4-character TAG name is of no importance.
So here is a sample:
|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|
From this example, all I would want for processing is the final number of 219. Again, I can work on the sum part as a further step; just interested in getting the numbers.
#
My solution currently entails two awk statements. First using gsub to replace the '~' with a '*' delimiter in my target field, 77:
awk -F'|' 'BEGIN {OFS="|"} { gsub("~", "*", $77) ; print }' file_1 > file_2
My second awk statement is to calculate the numeric sums on the target field, 77, which is the last field, and replace it with the calculated value. It is built on the assumption that there will be no other asterisks (*) anywhere else in the file. I'm okay with that. It is working for most examples, but not others, and my gut tells me this isn't that robust of an answer. Any ideas? The suggestions on my other post for SQL were great, but I couldn't implement them for unrelated silly reasons.
awk -F'*' '{if (NF>=2) {s=0; for (i=1; i<=NF; i++) s=s+$i; print substr($1, 1, length($1)-4) s;} else print}' file_2 > file_3
To get the sum (219) from your example, you can use this:
awk -F'[^0-9.]+' '{for(i=1;i<=NF;i++)s+=$i;print s}' file
or the following for 219.00:
awk -F'[^0-9.]+' '{for(i=1;i<=NF;i++)s+=$i;printf "%.2f\n", s}' file
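Note that the one-liners above never reset s, so with several input lines they print a running total. If one sum per line is wanted, and only for the sub-formatted field, something along these lines may work. It is a sketch, not a drop-in solution: the function name sum_subfields is hypothetical, the field number is passed as an argument (77 in the question, 2 for the sample line because of its leading pipe), and every number is assumed to follow a tilde.

```shell
# sum_subfields FIELD_NO: for each pipe-delimited line on stdin, print the
# sum of the numbers that follow a tilde in field FIELD_NO.
# Hypothetical helper name.
sum_subfields() {
  awk -F'|' -v col="$1" '
    {
      n = split($col, part, "*")         # pieces of the form TAG~number
      sum = 0                            # reset for every input line
      for (i = 1; i <= n; i++) {
        pos = index(part[i], "~")
        if (pos) sum += substr(part[i], pos + 1)
      }
      print sum
    }
  '
}

echo '|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|' | sum_subfields 2
# prints: 219
```

Splitting only the target field means asterisks or digits elsewhere on the line do not affect the result.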

Delete lines which contain a number smaller/larger than a user specified value

I need to delete lines in a large file which contain a value larger than a user-specified number (see picture). For example, I'd like to get rid of lines with values larger than 5e-48 (x>5e-48), i.e. lines with 7e-46, 7e-40, 1e-36, ... should be deleted.
Can sed, grep, awk or any other command do that?
Thank you
Markus
With awk:
awk '$3 <= 5e-48' filename
This selects only those lines whose third field is less than or equal to 5e-48, which is the same as deleting the lines where it is larger.
If fields can contain spaces (the data appears to be tab-separated), use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than at any whitespace. It does not appear to be necessary with the shown input data, but it is good practice to be defensive about these things (thanks to @tripleee for pointing this out).
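Since the question asks for a user-specified limit, the threshold can be passed in with -v instead of being written into the program. The following is a minimal sketch; the function name keep_at_most is hypothetical, and it assumes tab-separated input with the value in the third field, as above.

```shell
# keep_at_most MAX [FILE...]: keep only lines whose third tab-separated
# field is numerically <= MAX. Hypothetical helper name.
keep_at_most() {
  max=$1
  shift
  # Adding 0 forces a numeric comparison on both sides.
  awk -F '\t' -v max="$max" '($3 + 0) <= (max + 0)' "$@"
}

printf 'a\tb\t3e-50\nc\td\t7e-46\ne\tf\t5e-48\n' | keep_at_most 5e-48
# keeps the first and third lines (3e-50 and 5e-48)
```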
In Perl, for example, the solution can be
perl -ane 'print unless $F[2] > 5e-48' filename

Getting numerical sub-string of fields using awk

I was wondering how I can get the numerical substring of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the lengths of the fields are not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from text part with a single dot (.).
You may try like this:
$ echo "bcdujcd.2" | awk -F'[^0-9]+' '{print $2}'
If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (for the record, there's a leading space on each line, and with the quoted sample a trailing one as well).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub and the extra spaces will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
If you just need the second (period-delimited) field of each line of that sort, then awk -F. '{print $2}' will do that. With the quoted sample above, that field still carries the closing quote.
$ awk -F'[".]' '{print $3}' file
1
2
3333
777
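Another way to pull out the digits without depending on the field layout is match(), which sets RSTART and RLENGTH for the matched text. This is only a sketch: the function name first_number is hypothetical, and it assumes the text part itself contains no digits, so the first run of digits on a line is the wanted number.

```shell
# first_number [FILE...]: print the first run of digits on each line;
# lines without digits are skipped. Hypothetical helper name.
first_number() {
  awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }' "$@"
}

printf '%s\n' '"Asd.1"' '"bcdujcd.2"' '"mshde.3333"' '"deuhdue.777"' | first_number
# prints:
# 1
# 2
# 3333
# 777
```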

Why is the last newline character not replaced?

The file to be processed by awk:
hello world
hello Jack
hello Jim

Hello Marry
Hello Bob
Hello Everyone
And my command is awk 'BEGIN{RS=""; FS="\n";} {gsub("\n","#"); print}'. The awk manual said that when the RS is set to the null (empty?) string, then records are separated by blank lines. So the result is expected to be
hello world#hello Jack#hello Jim#
hello Marry#hello Bob#hello Everyone#
But actually, the result is
hello world#hello Jack#hello Jim
hello Marry#hello Bob#hello Everyone
The last newline character is not replaced by #. Is it because the last newline character of a record is omitted when awk reads the input and cuts it into fields? Are there any manuals that describe in detail how awk reads, splits and processes fields with patterns and actions? Thanks.
The reason you don't have a trailing # in the output is:
Setting RS="" is similar to RS="\n\n+" (with one difference, explained below), so awk uses the longest run of two or more consecutive newlines as the record separator.
In your data there are two \n characters after Jim before the next text block. awk takes both of them as the RS, so the Jim record has no trailing \n and gsub has nothing to replace there. The line break you see in the output comes from print.
The 2nd line of your output has no trailing # either, because you used RS="" instead of RS="\n\n+". The important difference is that with RS="", leading newlines in the input file are ignored, and if the file ends without extra blank lines after the last record, the final newline is removed from the record. That's why there is no trailing # on output line 2.
If you change it to RS="\n\n+", you should see the trailing # on the 2nd line of your output.
I guess you want to find out why the output was not what you expected, rather than how to produce your expected output, right? If your question is how to get that output, I will edit my answer.
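One way to see what each record actually contains in paragraph mode is to print it between brackets. The sketch below uses a hypothetical helper name, show_records, and a shortened version of the sample input.

```shell
# show_records [FILE...]: print every paragraph-mode record wrapped in
# brackets, making a trailing newline visible if there were one.
# Hypothetical helper name.
show_records() {
  awk 'BEGIN { RS = "" } { printf "[%s]\n", $0 }' "$@"
}

printf 'hello world\nhello Jack\n\nHello Bob\nHello Everyone\n' | show_records
# prints:
# [hello world
# hello Jack]
# [Hello Bob
# Hello Everyone]
```

The closing bracket follows the last word of each record directly, so neither record ends in a newline that gsub could turn into #.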
You can have a look at this page: http://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line
It says:
"When RS is set to the empty string, and FS is set to a single character, the newline character always acts as a field separator."
So you do not have to specify FS="\n"; it happens automatically if you set RS="".
In order to produce your expected output you can do the following:
BEGIN {
    RS = ""
}
{
    $0 = $0 ORS
    gsub("\n", "#")
    print
}

awk or sed to delete a pattern with string and number

I have a file with the following content:
string1_204
string2_408
string35_592
I need to get rid of string1_, string2_, string35_ and so on, and add up 204, 408, 592 to get a value.
So the output should be 1204.
I can take out string1_ and string2_, but for string35_592 I am left with 5_592.
I can't seem to get the command right to do what I want. Any help is appreciated :)
With awk:
awk -F_ '{s+=$2}END{print s}' your.txt
Output:
1204
Explanation:
-F_ sets the field separator to _, which makes it easy to access the numbers later on
{
# runs on every line of the input file
# adds the value of the second field - the number - to s.
# awk auto-initializes s with 0 on its first usage
s+=$2
}
END {
# runs after all input has been processed
# prints the sum
print s
}
In case you are interested in a coreutils/bc alternative:
<infile cut -d_ -f2 | paste -sd+ - | bc
Output:
1204
Explanation:
cut splits each line at underscore characters (-d_) and outputs only the second field (-f2). The column of numbers is passed on to paste which joins them on a line (-s) delimited by plus characters (-d+). This is passed on to bc which calculates and outputs the sum.
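If the goal is only to delete the prefix, the 5_592 leftover described in the question suggests a pattern that matched a fixed number of characters. A pattern that removes everything up to and including the first underscore avoids that. This is a minimal sketch, with strip_prefix as a hypothetical helper name.

```shell
# strip_prefix [FILE...]: delete everything up to and including the first
# underscore on each line. Hypothetical helper name.
strip_prefix() {
  sed 's/^[^_]*_//' "$@"
}

printf '%s\n' string1_204 string2_408 string35_592 | strip_prefix
# prints:
# 204
# 408
# 592

# The stripped numbers can then be summed as before
printf '%s\n' string1_204 string2_408 string35_592 | strip_prefix |
  awk '{ sum += $1 } END { print sum }'
# prints: 1204
```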