How do I compare alphanumeric characters in non-sequential order? - awk

I am currently using an awk script that compares numbers in non-sequential order and prints the difference. It works well for plain numbers, but not for alphanumeric values.
Apart from simply comparing the numbers, the current script does two additional things:
It ignores leading zeros before a number or character and compares only the remaining value.
If the same number or character occurs multiple times in both files, it outputs the additional occurrences.
I want the script to handle alphanumeric values as well, since at the moment it only seems to work with plain numbers. Can someone please edit the script to produce the desired output while keeping the two conditions above?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1: "1", "01" and "001" all evaluate to 1, so 1 occurs 3 times
In file2: "1" is present only once
Hence "1" is present twice in the result (3 - 1 times)
"2" and "9" are present only in file2, so both simply form part of the output
In file1: "027B" and "0027B" both evaluate to 27B, so 27B occurs 2 times
In file2: "27B" is present only once
Hence "27B" is present once in the result (2 - 1 times)
Explanation of matched items ( ones not forming part of expected output )
"8" from file1 ( line 4 )is matched with "08" from file2 ( line 3)
"12" from file1 ( line 6) is matched with "12" from file2 ( line 4)
"13C" from file1 (line 7 ) is matched with "13C" from file2 ( line 5 )
"2B" from file1 ( line 5 ) is matched with "02B" from file2 ( line 6 )
Lastly, the items in the expected output should be in ascending order, as in my example above. For instance, if the example had 3 in its expected output, it should read vertically as 1 1 2 3 9 27B

It should be enough to remove leading zeros when forming the key (with a special case for zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
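To see what the two key-forming rules do in isolation, here is a minimal sketch that prints each input value next to the key it is reduced to. The sample values are picked for illustration and the shell variable name keys is an assumption, not part of the answer above.

```shell
# Show the key each value is reduced to (illustrative sample values).
keys=$(printf '%s\n' 1 001 8 08 2B 02B 0027B 0000 | awk '
  /^0+$/ { k = 0 }                        # only zeros: the key is a single 0
  /[^0]/ { k = $0; sub(/^0*/, "", k) }    # otherwise strip the leading zeros
  { print $0 " -> " k }
')
printf '%s\n' "$keys"
```

With these rules 0027B and 27B share the key 27B, while 0000 keeps the key 0 instead of collapsing to an empty string.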
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
3
4
5
9
27B
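As a self-contained check, the sketch below runs the same logic on the sample file1 and file2 from the question and sorts the result. It is a sketch under a few assumptions: the temporary directory and the variable names tmp and result are made up here, and unmatched file2 lines are printed as the normalized key k rather than the raw line.

```shell
tmp=$(mktemp -d)
printf '%s\n' 1 01 001 8 2B 12 13C 027B 0027B > "$tmp/file1"
printf '%s\n' 1 2 08 12 13C 02B 9 27B > "$tmp/file2"

result=$(awk '
  /^0+$/ { k = 0 }
  /[^0]/ { k = $0; sub(/^0*/, "", k) }
  NR == FNR { a[k]++; next }                       # file1: count each key
  !(k in a && a[k]-- > 0) { print k }              # file2: print keys with no match left
  END { for (k in a) while (a[k]-- > 0) print k }  # leftovers from file1
' "$tmp/file1" "$tmp/file2" | sort -n)

printf '%s\n' "$result"
rm -rf "$tmp"
```

sort -n compares the leading numeric part of each line, which is why 27B ends up after 9.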
To output in the order as found in file2, you can remember the order in another array and then do all the printing in the END block. This version will output the values in the order of file2, with any values only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
  for (i=1; i<=FNR; ++i) {
    k = b[i]
    while(a[k]-->0) print k
  }
  for (k in a) {
    while(a[k]-->0) print k
  }
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
3
4
5

Related

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed I can have more than 7 rows, and more than 4 columns)
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column when the third column is A, while keeping any newline or whitespace character.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
gawk splits each record using the width values provided in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means each of the first 6 fields is a single character wide and the remaining text goes into the 7th field (the trailing * needs gawk 4.2 or later). Since there is a space after each value, $2, $4 and $6 each hold a single space, and $1, $3 and $5 hold the values from the input.
$5 == "A" {s = s $3}: Here we check whether $5 is A, and if so we append the value of $3 to the variable s. In the END block we just print s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Alternatively, if we let the spaces be part of the field values, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
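FIELDWIDTHS is specific to gawk. For other awk implementations, a rough equivalent is to read fixed character positions with substr. This is a sketch under the same assumption as above (single-character columns separated by single spaces); the variable name out is made up for the example.

```shell
# Column 2 is character 3 and column 3 is character 5 of each line.
out=$(printf '%s\n' 'R H A 23' 'S E A 45' 'T E A 34' 'U   A 35' \
                    'Y T A 35' 'O E A 353' 'J G B 23' |
  awk 'substr($0, 5, 1) == "A" { s = s substr($0, 3, 1) } END { print s }')
printf '%s\n' "$out"
```

On the sample input this prints HEE TE, with the blank second column of the fourth row preserved as a space.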

Replace column between patterns in one file from another file

I am trying to write a shell script that reads file1, extracts a specific column from it, and uses that column to replace a column in file2 between two patterns.
File1
Line1
Line2
.
LineN
ATOM C1 C2 C3
ATOM P23 HI IKJ
ATOM S23 JSK SN
BOND
Many lines
END
File2
Few Lines
Pattern1
1 C -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 C -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 N -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Patttern2
Lines
I want to take column $2 from file1 and use it to replace column $2 in file2 between Pattern1 and Patttern2.
Output
Few Lines
Pattern1
1 C1 -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 P23 -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 S23 -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Patttern2
Lines
I have tried a few things so far.
awk '($1=="ATOM") {print $2}' file1
awk '/pattern1/{flag=1; next} /pattern2/{flag=0} flag' file2
I can store column 2 of file1, and also the lines between the two patterns from file2.
I am sure that with something like FNR==NR I should be able to handle both files at the same time. Any help to proceed further would be great.
Following your question, and since you mention an FNR==NR solution, you can use this awk script:
awk '
  ( FNR==NR && /^ATOM / )     { atoms[++atomn]=$2; }
  ( FNR!=NR && /^Patttern2$/ ) { doreplace=0; }
  doreplace                    { $2=atoms[++atomn]; }
  ( FNR!=NR && /^Pattern1$/ )  { doreplace=1; atomn=0; }
  FNR!=NR
' file1 file2
I prefer this approach when dealing with a metafile and data files:
awk '
  BEGIN {
    if ( ARGC != 3 ) exit(1);
    # read the metafile up front; "> 0" stops the loop on EOF and on read errors
    while ( (getline < ARGV[2]) > 0 ) if ( $0 ~ /^ATOM / ) atoms[++atomn]=$2;
    ARGC=2;   # only the data file is processed by the main rules
  }
  /^Patttern2$/ { doreplace=0; }
  doreplace     { $2=atoms[++atomn]; }
  /^Pattern1$/  { doreplace=1; atomn=0; }
  1
' file2 file1
Both work on the inputs you provided. Here is the output:
Few Lines
Pattern1
1 C1 -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 P23 -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 S23 -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Patttern2
Lines
Both assume that the metafile (file1) and the data file (file2) contain exactly the same number of ATOM lines and lines between the patterns. If you cannot guarantee this, I would add logic to avoid reading past the end of the array. Nothing breaks in awk if you do, though: the replacement is simply made with an empty string.
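One possible form of that extra logic, as a sketch: replace $2 only while collected names remain, and leave the original column untouched otherwise. The shortened sample data, the counter i and the variable names tmp and out are assumptions made for this example.

```shell
tmp=$(mktemp -d)
# Shortened sample: two ATOM lines in file1, three lines between the patterns in file2.
printf '%s\n' 'ATOM C1 C2 C3' 'ATOM P23 HI IKJ' 'BOND' > "$tmp/file1"
printf '%s\n' 'Few Lines' 'Pattern1' '1 C -9.2429 C.3' '2 C -10.5865 C.3' \
              '3 N -11.3072 N.am' 'Patttern2' 'Lines' > "$tmp/file2"

out=$(awk '
  FNR == NR { if (/^ATOM /) atoms[++atomn] = $2; next }  # first file: collect names
  /^Patttern2$/ { doreplace = 0 }
  doreplace {
    i++
    if (i in atoms) $2 = atoms[i]    # replace only while names remain
  }
  /^Pattern1$/ { doreplace = 1; i = 0 }
  { print }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Here the third line between the patterns keeps its original N because file1 only supplies two names.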

Use awk to find all columns which contain values above and below specified numbers?

I would like an Awk command where I can search a large file for columns which contain numbers both below 3 and above 5. It also needs to skip the first column.
e.g. for the following file
1 2 6 2
2 1 7 3
3 2 5 4
4 2 8 7
5 2 6 8
6 1 9 9
In this case, only column 4 is a match, as it is the only column with values above 5 and below 3 (except for column 1, which we skip).
Currently, I have this code:
awk '{for (i=2; i<=NF; i++) {if ($i < 3 && $i > 5) {print i}}}'
But this only reads one row at a time (so never makes a match). I want to search all of the rows, but I am unable to work out how this is done.
Ideally the output would simply be the column number. So for this example, simply '4'.
Many thanks.
Could you please try the following and let me know if it helps.
awk '{for(i=1;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' Input_file
If you want to start searching from the second column, change i=1 to i=2 in the above code.
EDIT: Adding a multi-line form of the solution too.
awk '
{
  for(i=1;i<=NF;i++){
    if($i<3) { col[i]++ }
    if($i>5) { col1[i]++ }
  }
}
END{
  for(j in col){
    if(col[j]>=1 && col1[j]>=1){ print j }
  }
}' Input_file
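For reference, here is a sketch of the same idea run on the sample data from the question, starting at column 2 as requested. The array names low and high, the maxnf counter and the shell variable out are assumptions for this example. The END block walks the column numbers with an index loop, so matches come out in ascending order instead of the unspecified order of for (j in col).

```shell
out=$(printf '%s\n' '1 2 6 2' '2 1 7 3' '3 2 5 4' '4 2 8 7' '5 2 6 8' '6 1 9 9' |
  awk '
  {
    if (NF > maxnf) maxnf = NF          # remember the widest row
    for (i = 2; i <= NF; i++) {         # skip the first column
      if ($i < 3) low[i] = 1            # column i has a value below 3
      if ($i > 5) high[i] = 1           # column i has a value above 5
    }
  }
  END {
    for (i = 2; i <= maxnf; i++)
      if ((i in low) && (i in high)) print i
  }')
printf '%s\n' "$out"
```

On this input the only column reported is 4.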

move certain columns to end using awk

I have a large tab-delimited file with 1000 columns. I want to rearrange it so that certain columns are moved to the end.
Could anyone help using awk?
Example input:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Move columns 5,6,7,8 to the end.
Output:
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
This prints columns 1 to a, then b to the last, and then columns a+1 to b-1:
$ awk -v a=4 -v b=9 '{for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i};for (i=a+1;i<b;i++) {printf "%s\t",$i};print""}' file
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
The columns are moved in this way for every line in the input file, however many lines there are.
How it works
-v a=4 -v b=9
This defines the variables a and b which determine the limits on which columns will be moved.
for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i}
This prints all columns except the ones from a+1 to b-1.
In this loop, i is incremented by one except when i==a, in which case it is incremented by b-a so as to skip over the columns to be moved. This is done with awk's ternary (conditional) operator:
i += i==a ? b-a : 1
+= simply means "add to." i==a ? b-a : 1 is the ternary expression. The value it returns depends on whether i==a is true or false. If it is true, the value before the colon is returned. If it is false, the value after the colon is returned.
for (i=a+1;i<b;i++) {printf "%s\t",$i}
This prints columns a+1 to b-1.
print""
This prints a newline character to end the line.
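To make the skipping increment concrete, this small sketch lists the values of i that the first loop visits for a=4 and b=9. The upper bound of 12 stands in for NF and, like the variable names visited and s, is an assumption for the example.

```shell
visited=$(awk -v a=4 -v b=9 'BEGIN {
  # same increment as in the loop above, with explicit parentheses
  for (i = 1; i <= 12; i += (i == a ? b - a : 1))
    s = s (s == "" ? "" : " ") i
  print s
}')
printf '%s\n' "$visited"
```

The loop goes from 4 straight to 9, so columns 5 to 8 are never printed by the first loop.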
Alternative solution that avoids printf
This approach assembles the output into the variable out and then prints with a plain print command, avoiding printf and the need for percent signs:
awk -v a=4 -v b=9 '{out="";for (i=1;i<=NF;i+=i==a?b-a:1) out=out $i"\t";for (i=a+1;i<b;i++) out=out $i "\t";print out}' file
One way to swap 2 columns ($5 becomes $20 and $20 becomes $5) while the rest stay unchanged:
$ awk '{x=$5; $5=$20; $20=x; print}' file.txt
For 4 columns (swap $5 with $20 and $9 with $10):
$ awk '{
x=$5; $5=$20; $20=x;
y=$9; $9=$10; $10=y;
print
}' file.txt
My approach:
awk 'BEGIN{ f[5];f[6];f[7];f[8] } \
{ for(i=1;i<=NF;i++) if(!(i in f)) printf "%s\t", $i; \
for(c in f) printf "%s\t", $c; printf "\n"} ' file
It is split into 3 parts:
The BEGIN{} part determines which fields should be moved to the end. The indexes of the array f are the fields that get moved. In the example they are 5, 6, 7 and 8.
Cycle through every field (it doesn't matter if there are 1000 fields or more) and check whether it is in the array. If not, print it.
Now we need the skipped fields. Cycle through the f array and print those values. Note that awk does not guarantee the order in which for(c in f) visits the indexes, so the moved fields may not come out as 5, 6, 7, 8.
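If the order of the moved fields matters, one option is to keep the field numbers in an ordered list and walk it by index. The following is a sketch, not the approach above: the cols variable, the array names order, move and keep, and the ten-field sample line are all assumptions for this example.

```shell
out=$(printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\n' |
  awk -F '\t' -v cols='5,6,7,8' '
  BEGIN {
    OFS = "\t"
    n = split(cols, order, ",")                  # order[1..n] keeps the requested order
    for (j = 1; j <= n; j++) move[order[j]] = 1  # lookup table of fields to move
  }
  {
    m = 0
    for (i = 1; i <= NF; i++) if (!(i in move)) keep[++m] = $i  # fields that stay
    for (j = 1; j <= n; j++) keep[++m] = $(order[j])            # moved fields, in order
    line = keep[1]
    for (i = 2; i <= m; i++) line = line OFS keep[i]
    print line
  }')
printf '%s\n' "$out"
```

Joining the fields with OFS also avoids the trailing tab that the printf "%s\t" loops leave at the end of each line.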
Another way in awk
Swap fields A to B with the last N fields
awk -vA=4 -vB=8 '{x=B-A;for(i=A;i<=B;i++){y=$i;$i=$(t=(NF-x--));$t=y}}1' file
Put N fields from the end into position A
awk -vA=3 -vB=8 '{split($0,a," ");x=A++;while(x++<B)$x=a[NF-(B-x)];while(B++<NF)$B=a[A++]}1' file

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file let's say File1 is tabular and is like this
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines, each starting with A or B. I want to randomly select lines from File2, but with exactly the same number of A and B lines as File1, which means just 3 random lines with A and two with B.
Does anyone have an idea how to do this with awk?
Thanks
#!/bin/bash
# Split file2 at the first B line: xx00 gets the A lines, xx01 the B lines.
csplit -s file2 '/^B /'
# csplit reports sizes in bytes, so count the lines separately.
acount=$(wc -l < xx00)
bcount=$(wc -l < xx01)
awk -v "acount=$acount" -v "bcount=$bcount" '
BEGIN { srand() }
NR == FNR {
    arr[$1]++
    next
}
! setup {
    setup = 1
    # pick as many lines as file1 has, but never more than file2 offers
    awant = (arr["A"] < acount) ? arr["A"] : acount
    bwant = (arr["B"] < bcount) ? arr["B"] : bcount
    while (arandcount < awant) {
        line = int(rand() * acount) + 1
        if (! (line in alines)) {
            alines[line] = 1
            arandcount++
        }
    }
    while (brandcount < bwant) {
        line = int(rand() * bcount) + 1
        if (! (line in blines)) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && (FNR in alines) {
    print
}
FILENAME == "xx01" && (FNR in blines) {
    print
}' file1 xx00 xx01
Here "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits file2 at the first line matching the regex. It reports the size of each output file in bytes rather than lines, so the line counts are taken with wc -l and passed into the AWK program as variables.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks the line numbers to select by choosing random numbers between 1 and the count of lines of that "type", until it has as many distinct line numbers as file1 has lines of that type. This block only gets executed once because of the flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and print the line if it is.
This might work for you:
grep '^A' file2 | sort -R | head -n "$(grep -c '^A' file1)" >file3
grep '^B' file2 | sort -R | head -n "$(grep -c '^B' file1)" >>file3
N.B. This assumes file1 is sorted.
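For comparison, the selection can also be done in a single awk invocation without splitting file2 first. The sketch below assumes that file2 fits in memory (3000 lines is fine). The generated file2, the array names want, have and lines, and the variable names tmp and picked are all made up for the example. It counts the keys in file1, stores the file2 lines per key, then draws the required number of lines without replacement.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'A 67 98 56' 'A 22 29 62' 'A 11 99 28' 'B 37 88 14' 'B 33 99 65' > "$tmp/file1"
# Stand-in for the large file2: ten A lines and ten B lines.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "A", i; for (i = 1; i <= 10; i++) print "B", i }' > "$tmp/file2"

picked=$(awk '
  BEGIN { srand() }                                # seed from the current time
  NR == FNR { want[$1]++; next }                   # file1: lines needed per key
  ($1 in want) { lines[$1, ++have[$1]] = $0 }      # file2: candidate lines per key
  END {
    for (key in want) {
      n = have[key]
      for (c = 0; c < want[key] && n > 0; c++) {
        j = int(rand() * n) + 1                    # random index among remaining lines
        print lines[key, j]
        lines[key, j] = lines[key, n]              # fill the hole with the last remaining line
        n--
      }
    }
  }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$picked"
rm -rf "$tmp"
```

Because srand() seeds from the time of day, two runs within the same second pick the same lines.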