I want to match a number of any length in awk with a pattern match - awk

I have a list of numbers on which I want to do a partial match, and I want the matching full number printed next to each one with awk.
Table 1
mobile
9594047891
9895943283
9967545384
9594028790
Table 2
display
4047891
95943283
545384
28790
Output needed
display Output
4047891 9594047891
95943283 9895943283
545384 9967545384
28790 9594028790
It would be a great help if any awk specialist could solve this.
I am trying to match each 10-digit number with the number that has fewer than 10 digits.

It seems like the short number matches the end of the long one, so you can do something like this:
awk '
FNR == 1 { next } # too lazy to handle the headers
FNR == NR {
longPhoneNumber[$1]
next
}
{
for (lpn in longPhoneNumber)
if (lpn ~ $1"$") {
print lpn, $1
break
}
}
' Table1 Table2
9594047891 4047891
9895943283 95943283
9967545384 545384
9594028790 28790
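The command above prints the long number first, while the question asks for the short number first. A minimal sketch of a variant follows. It compares the tail of each long number with the short one as plain strings using substr, instead of building a regular expression, and prints the columns in the requested order. The array name full, the counter n, and the scratch directory are assumptions made for this illustration.

```shell
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' mobile 9594047891 9895943283 9967545384 9594028790 > "$tmp/Table1"
printf '%s\n' display 4047891 95943283 545384 28790 > "$tmp/Table2"

result=$(awk '
FNR == 1 { next }                     # skip the header line of each file
FNR == NR { full[++n] = $1; next }    # first file: keep long numbers in input order
{
    for (i = 1; i <= n; i++) {
        # Compare the tail of the long number with the short one as strings.
        if (length(full[i]) >= length($1) &&
            substr(full[i], length(full[i]) - length($1) + 1) == $1) {
            print $1, full[i]         # short number first, as requested
            break
        }
    }
}
' "$tmp/Table1" "$tmp/Table2")
printf '%s\n' "$result"
rm -r "$tmp"
```

Storing the long numbers under a numeric index keeps the lookup in input order, which a for (key in array) loop does not guarantee.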

Related

How to find the maximum value for the field by ignoring the lines with characters using awk?

Since I am new to awk, please help me with your suggestions. I tried the commands below to filter the maximum value and to ignore the first and last lines of the sample text file. They work when I try them separately.
My query:
I need to ignore the last line and the first few lines of the file, and then take the maximum value of field 7 using awk.
I also need to ignore the lines that contain alphabetic characters. Can anyone suggest how to combine both commands and get the required output?
Sample file:
Linux 3.10.0-957.5.1.el7.x86_64 (j051s784) 11/24/2020 _x86_64_ (8 CPU)
12:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 AM 4430568 61359128 93.27 1271144 27094976 66771548 33.04 39005492 16343196 1348
12:20:01 AM 4423380 61366316 93.28 1271416 27102292 66769396 33.04 39012312 16344668 1152
12:30:04 AM 4406324 61383372 93.30 1271700 27108332 66821724 33.06 39028320 16343668 2084
12:40:01 AM 4404100 61385596 93.31 1271940 27107724 66799412 33.05 39031244 16344532 1044
06:30:04 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
07:20:01 PM 3754904 62034792 94.29 1306112 27555948 66658632 32.98 39532204 16476848 2156
Average: 4013043 61776653 93.90 1293268 27368986 66755606 33.03 39329729 16427160 2005
Commands used:
cat testfile | awk '{print $7}' | head -n -1 | tail -n+7
awk 'BEGIN{a= 0}{if ($7>0+a) a=$7} END{print a}' testfile
Expected output:
The maximum value of column 7, excluding every line that contains alphabetic characters.
1st solution (generic): Here is a generic solution. Pass the name of the field whose maximum you want in an awk variable; the script finds that field's number from the very first line and works from there. This assumes that your first line contains the field name you are looking for.
awk -v var="kbcached" '
FNR==1{
for(i=1;i<=NF;i++){
if($i==var){ field=i }
}
next
}
/kbmemused/{
next
}
{
if($2!~/^[AP]M$/){
val=$(field-1)
}
else{
val=$field
}
}
{
max=(max>val?max:val)
val=""
}
END{
print "Maximum value is:" max
}
' Input_file
2nd solution (for the shown samples only): Could you please try the following, which is written for your shown samples only. I am assuming you want the value of the kbcached column.
awk '
/kbmemfree/{
next
}
{
if($2!~/^[AP]M$/){
val=$6
}
else{
val=$7
}
}
{
max=(max>val?max:val)
val=""
}
END{
print "Maximum value is:" max
}
' Input_file
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" {
max[$7]=""
}
END {
PROCINFO["sorted_in"]="@ind_num_asc";
for (i in max) {
maxtot=i
}
print maxtot
}' file
One liner:
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" { max[$7]="" } END { PROCINFO["sorted_in"]="@ind_num_asc";for (i in max) { maxtot=i } print maxtot }' file
Using GNU awk, select the lines where field 7 consists only of digits and field 1 is not "Average:". For each such line, create an array entry with field 7 as the index. In the END block, traverse the array in ascending numeric index order, setting the maxtot variable on each iteration. The last index visited is the highest kbcached value, so print maxtot.
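For awk implementations without PROCINFO["sorted_in"], the same filter can be combined with a running maximum in a single pass. This is only a sketch under the same reading of the question (field 7 must be all digits and the Average: line is skipped); the variable names seen and max and the scratch directory are assumptions for the illustration.

```shell
# Sample file copied from the question.
tmp=$(mktemp -d)
cat > "$tmp/testfile" <<'EOF'
Linux 3.10.0-957.5.1.el7.x86_64 (j051s784) 11/24/2020 _x86_64_ (8 CPU)
12:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 AM 4430568 61359128 93.27 1271144 27094976 66771548 33.04 39005492 16343196 1348
12:20:01 AM 4423380 61366316 93.28 1271416 27102292 66769396 33.04 39012312 16344668 1152
12:30:04 AM 4406324 61383372 93.30 1271700 27108332 66821724 33.06 39028320 16343668 2084
12:40:01 AM 4404100 61385596 93.31 1271940 27107724 66799412 33.05 39031244 16344532 1044
06:30:04 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
07:20:01 PM 3754904 62034792 94.29 1306112 27555948 66658632 32.98 39532204 16476848 2156
Average: 4013043 61776653 93.90 1293268 27368986 66755606 33.03 39329729 16427160 2005
EOF

max=$(awk '
$1 == "Average:" { next }             # ignore the summary line
$7 ~ /^[0-9]+$/ {                     # only rows whose field 7 is all digits
    if (!seen || $7 + 0 > max) { max = $7 + 0; seen = 1 }
}
END { if (seen) print max }
' "$tmp/testfile")
printf '%s\n' "$max"
rm -r "$tmp"
```

Keeping one running value avoids holding every distinct field-7 value in an array, which matters little here but helps on long sar logs.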

How to find records with minimum value in a specific column in multiple files?

I have 4000 two-column .dat files. From each file, I need to identify the first minimum value of column 2 and print the corresponding row. This should then run on multiple files in the folder and append these values to a new file. I have tried the code below.
File names include common string:
fig_3-28333.dat
      ^^^^^ file number
awk'BEGIN{min=0}{if(($2)>min) min=($2)}END {print line}' cat >> new.dat
output file expected to be
file number Column 1 column2
28333 x value first minimum value
28334 x value first minimum value
NOTE: This only works with gawk (which understands the ENDFILE pattern), and not the regular awk
Here is my script, min.awk:
BEGIN {
print "file number Column 1 column2"
}
FNR == 1 {
min = $2;
first = $1
second = $2
}
$2 < min {
min = $2
first = $1
second = $2
}
ENDFILE {
# Extract the file number to a[1]
match(FILENAME, /.*-([0-9]+)\.dat/, a)
print a[1], first, second
}
Notes
The BEGIN pattern prints the heading
At the first line of each file (pattern: FNR == 1), establish the minimum value
For those lines whose second value is less than the minimum (pattern: $2 < min), establish the new minimum value
At the end of each file, print out the minimum value for that file
Invoke the script
gawk -f min.awk *.dat
Update
On reviewing my script, I noticed duplicated code, which I can eliminate by combining the two blocks:
BEGIN {
print "file number Column 1 column2"
}
FNR == 1 || $2 < min{
min = $2;
first = $1
second = $2
}
ENDFILE {
# Extract the file number to a[1]
match(FILENAME, /.*-([0-9]+)\.dat/, a)
print a[1], first, second
}
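For an awk without ENDFILE, a similar per-file report can be approximated by noticing when a new file starts (FNR == 1) and reporting the previous file at that point, plus once more in END. This is a rough sketch, not the answer's script: the function report, the variable fname, and the two small data files are hypothetical and exist only for the illustration. Empty input files produce no row with this approach.

```shell
# Hypothetical two-column sample files, named like the ones in the question.
tmp=$(mktemp -d)
printf '%s\n' '0.1 5.0' '0.2 3.0' '0.3 3.0' '0.4 4.0' > "$tmp/fig_3-28333.dat"
printf '%s\n' '0.1 2.5' '0.2 7.0' '0.3 1.5' > "$tmp/fig_3-28334.dat"

result=$(awk '
function report(   num) {
    num = fname
    sub(/^.*-/, "", num)        # drop everything up to the last "-"
    sub(/\.dat$/, "", num)      # drop the extension
    print num, first, second
}
BEGIN { print "file number Column 1 column2" }
FNR == 1 {
    if (NR > 1) report()        # a new file started: report the previous one
    fname = FILENAME
}
# Strict "<" keeps the first row when the minimum occurs more than once.
FNR == 1 || $2 + 0 < min { min = $2 + 0; first = $1; second = $2 }
END { if (NR > 0) report() }    # report the final file
' "$tmp/fig_3-28333.dat" "$tmp/fig_3-28334.dat")
printf '%s\n' "$result"
rm -r "$tmp"
```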

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables and 'if' statements and loops and whatnot, but I've been told it's easier than that given some of awk features. I'd like to know how that might look like, along with an explanation of why it works.
BEGIN runs before reading the first line from file. Set sum and count to 0.
awk 'BEGIN{ sum=0; count=0} {if ( /[a-zA-Z]/ ) { if (count > 0) {avg = sum/count; print avg;} count=0; sum=0} else { count++; sum += $1} } END{if (count > 0) {avg = sum/count; print avg}} ' NumbersAndExtras.txt
When a line contains a letter, calculate and print the average so far, then reset the sum and the count.
And do the same in the END block that runs after processing the whole file.
Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c&&a{print "Average: "a/c;a=c=0}
END{if(c&&a){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3
Another one:
$ awk '
function avg(s, c) { print "Average:", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file
Note: The value of this answer is in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define output function that prints an average.
function printAvg() { print "Average:", sum/count }
# Skip blank lines
NF == 0 { next}
# Is the line non-numeric?
/[[:alpha:]]/ {
# If this line ends a numeric block, print its
# average now and reset the variables to start the next group.
if (count) {
printAvg()
sum = count = 0
}
# Skip to next line.
next
}
# Numeric line: add to the sum and increment the counter.
{ sum += $1; count++ }
# Finally:
END {
# If there is a group whose average has not been printed yet,
# do it now.
if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average:", sum/count }
NF == 0 { next}
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt

How to detect the last line in awk before END?

I'm trying to concatenate String values and print them, but if the last types are Strings and there is no change of type then the concatenation won't print:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
t = $1; val=$2;
#if string, concatenate its value
if (t == "String") {
tot+=val;
nx=1;
} else {
nx=0;
}
#if type change, add tot to res
if (t != "String" && ant_t == "String") {
res=res tot;
tot=0;
}
ant_t=t;
#if string, go next
if (nx == 1) {
next;
}
res=res"\n"val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect if awk is reading last line, so if there won't be change of type it will check if it is the last line?
awk reads its input line by line, so it cannot tell whether the line it is currently reading is the last one. The END block can be used to perform actions once the end of the file has been reached.
To get the output you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}' input.txt
This produces the following output:
3
5
2
6
What does it do?
/String/ selects the lines that match String; /Number/ does the same for Number.
sum+=$2 accumulates the values of the String lines. When a Number line occurs, the sum is printed and reset to zero.
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
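If counting the lines up front is not convenient (for example when reading from a pipe), another common idiom is to process each line one step late: hold the current line, and handle the held one only when the next line arrives. Whatever is still held in END must be the last line. The sketch below only marks that line with LAST; the function handle and the variables prev and prefix are assumptions for the illustration.

```shell
# Sample input copied from the question.
tmp=$(mktemp -d)
printf '%s\n' 'String 1' 'String 2' 'Number 5' 'Number 2' 'String 3' 'String 3' > "$tmp/input.txt"

result=$(awk '
function handle(line, is_last,   prefix) {
    prefix = is_last ? "LAST: " : ""
    print prefix line
}
NR > 1 { handle(prev, 0) }        # the held line is known not to be the last
{ prev = $0 }                     # hold the current line until the next one arrives
END { if (NR) handle(prev, 1) }   # whatever is still held is the last line
' "$tmp/input.txt")
printf '%s\n' "$result"
rm -r "$tmp"
```

This reads the input only once, at the cost of remembering one extra line.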
Just change last line to:
END { print res; print tot;}'
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean, and I check at the END if the last pattern was a string and print the sum
You can actually use x as the boolean like nu11p01n73R does which is smarter
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

awk | Add new row or update existing row in a file

I want to update file1 on the basis of file2. If any row is new in file2 then it should be added in file1. If any row from file2 is already in file1, then update that row with the row from file2 if the time is greater in file2.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
2nd field should be unique in the output.
Addition of a new value: for "1111111104", the latest row in file2 is taken, which is newer (201312051016) than the older row (201312051014) on the basis of the date column (3rd field).
Update an existing value: updated "1111111102" with newer value on the basis of date in 3rd column
file1 is very LARGE whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated because its entry in file1 already has the latest date, "201312051014", compared to the new date "201312041013" in file2.
I haven't tried much on this because the condition is really complex for me as a beginner.
BEGIN { FS = OFS = "," }
FNR == NR {
m=$2;
a[m] = $0;
next
}
{
if($2 in a)
{
split(a[$2],datetime,",")
if($3>datetime[3])
print $0;
else
print a[$2]"Old time"
}
else print $0"NOMATCH";
delete a[$2];
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
FS = OFS = ","
}
FILENAME == "input2.csv" {
date[$2] = $3
data[$2] = $0
used[$2] = 0
}
FILENAME == "input1.csv" {
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1, because it uses an array only for the few entries in file2.
The new entries are simply appended to the output. There is no sorting. If this is required there will have to be an extra effort.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR == FNR. However, it only works for sufficiently recent versions of awk that can return the size of an array as length(array):
BEGIN {
FS = ","
}
{
# The following effectively creates an array entry for each filename found (for "known" filenames existing entries are overwritten).
files[FILENAME] = 1
# check the number of files we have so far
if (length(files) == 1) {
# we are still in the first file
date[$2] = $3
data[$2] = $0
used[$2] = 0
} else {
# we are in the second file (or any other following file)
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Also, if you require your output to be sorted according to the second column, you can replace the call to awk with this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you need to read all of file2 into memory first, dealing with the duplicate values. As a result, you'll have an array indexed by the record number (field 2) with the new data; you should also have a record of the date for each record in a separate array. Then, as you read the main file, you look up the record number and the date in the arrays, and if you need to, substitute the saved new record for the incoming old record.
Your outline script is most of the way there. It is more complex because you didn't save the dates coming in. This more or less works:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0; } next; }
{ if ($2 in date)
{
if (date[$2] > $3)
print line[$2]
else
print
delete line[$2]
delete date[$2]
}
else
print
}
END { for (l in line) print line[l]; }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.
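The keep-everything-in-memory design mentioned in the last paragraph might look roughly like the sketch below. It treats both files the same way: for each value of field 2 it keeps only the row with the newest date, remembers the order in which keys were first seen, and prints nothing until END. The array names date, line, and order and the scratch directory are assumptions for the illustration, and memory use grows with the number of distinct keys in file1.

```shell
# Sample data copied from the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
EOF
cat > "$tmp/file2" <<'EOF'
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
EOF

result=$(awk -F, '
{
    key = $2
    if (!(key in date)) order[++n] = key      # remember first-seen order of keys
    if (!(key in date) || $3 > date[key]) {   # keep only the newest row per key
        date[key] = $3
        line[key] = $0
    }
}
END { for (i = 1; i <= n; i++) print line[order[i]] }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -r "$tmp"
```

Because nothing is printed before END, a duplicate key inside file1 itself is also collapsed to its newest row, which the streaming version cannot do.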