Replace column between patterns in one file from another file - awk

I am trying to write a shell script which reads file1 and extracts a specific column from it, then replaces a column in file2 with the extracted column, between two patterns.
File1
Line1
Line2
.
LineN
ATOM C1 C2 C3
ATOM P23 HI IKJ
ATOM S23 JSK SN
BOND
Many lines
END
File2
Few Lines
Pattern1
1 C -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 C -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 N -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Pattern2
Lines
I want to get column $2 from file1 and use it to replace column $2 in file2 between Pattern1 and Pattern2.
Output
Few Lines
Pattern1
1 C1 -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 P23 -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 S23 -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Pattern2
Lines
I have tried a few things so far.
awk '($1=="ATOM") {print $2}' file1
awk '/pattern1/{flag=1; next} /pattern2/{flag=0} flag' file2
I can store column 2 from file1, and also extract the lines between the two patterns from file2.
I am sure that with something like FNR==NR I should be able to handle both files at the same time. Any help to proceed further would be great.

Per your question, and since you mention an FNR==NR solution, you can use this awk:
awk '
( FNR==NR && /^ATOM / ) { atoms[++atomn]=$2; }
( FNR!=NR && /^Pattern2$/ ) { doreplace=0; }
doreplace { $2=atoms[++atomn]; }
( FNR!=NR && /^Pattern1$/ ) { doreplace=1; atomn=0; }
FNR!=NR
' file1 file2
I rather prefer this approach when dealing with a metafile and a datafile:
awk '
BEGIN {
    if ( ARGC != 3 ) exit(1);
    # getline returns -1 on error, which is truthy; compare against 0 so a read error cannot loop forever
    while ( (getline < ARGV[2]) > 0 ) if ( $0 ~ /^ATOM / ) atoms[++atomn]=$2;
    ARGC=2;
}
/^Pattern2$/ { doreplace=0; }
doreplace { $2=atoms[++atomn]; }
/^Pattern1$/ { doreplace=1; atomn=0; }
1
' file2 file1
Both will work on your provided inputs. Here is the output:
Few Lines
Pattern1
1 C1 -9.2429 -1.3783 -9.5091 C.3 1 LIG1 0.0555
2 P23 -10.5865 -0.8658 -8.9679 C.3 1 LIG1 0.0529
3 S23 -11.3072 -0.5779 -10.1774 N.am 1 LIG1 -0.2940
Pattern2
Lines
Both assume that the metafile (file1) and the datafile (file2) contain exactly the same number of ATOM lines and lines between the patterns. If you cannot guarantee this, I would add logic to check it rather than iterate through unallocated array elements. In awk nothing breaks, however; the replacement is simply done with an empty string instead.
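For example, a minimal sketch of such a guard (assuming the same file layout as above) that aborts instead of substituting empty strings when file2 has more data lines than file1 has ATOM lines:
awk '
( FNR==NR && /^ATOM / ) { atoms[++natoms]=$2; }
( FNR!=NR && /^Pattern2$/ ) { doreplace=0; }
doreplace {
    # refuse to read past the last stored ATOM entry
    if (++i > natoms) { print "ERROR: more data lines than ATOM lines" > "/dev/stderr"; exit 1; }
    $2=atoms[i];
}
( FNR!=NR && /^Pattern1$/ ) { doreplace=1; i=0; }
FNR!=NR
' file1 file2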


Count rows and columns for multiple CSV files and make new file

I have multiple large comma-separated CSV files in a directory. But, as a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try the following program.
awk '
FNR==1{
  if(cols && rows){
    print file,rows,cols
  }
  rows=cols=file=""
  file=FILENAME
  sub(/\..*/,"",file)
  cols=NF
  next
}
{
  rows=(FNR-1)
}
END{
  if(cols && rows){
    print file,rows,cols
  }
}
' one.csv two.csv
Explanation: adding a detailed explanation of the above solution.
awk '                    ##Starting awk program from here.
FNR==1{                  ##Checking condition: if this is the first line of a file then do following.
  if(cols && rows){      ##Checking if cols AND rows are NOT NULL then do following.
    print file,rows,cols ##Printing file, rows and cols variables here.
  }
  rows=cols=file=""      ##Nullifying rows, cols and file here.
  file=FILENAME          ##Setting FILENAME value to file here.
  sub(/\..*/,"",file)    ##Removing everything from the dot to the end of the value in file.
  cols=NF                ##Setting NF value to cols here.
  next                   ##next will skip all further statements from here.
}
{
  rows=(FNR-1)           ##Setting FNR-1 value to rows here.
}
END{                     ##Starting END block of this program from here.
  if(cols && rows){      ##Checking if cols AND rows are NOT NULL then do following.
    print file,rows,cols ##Printing file, rows and cols variables here.
  }
}
' one.csv two.csv        ##Mentioning Input_file names here.
Using GNU awk you can do this in a single awk invocation:
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5
You will need to iterate over all CSVs and print the name and dimensions for each file:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also use wc -l for the line count and grep -o "CSV-separator" | wc -l for the field count (separator count plus one); see the sketch below.
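A minimal sketch of that approach, assuming comma-separated files with no quoted fields:
for f in *.csv; do
  rows=$(wc -l < "$f")
  # fields = separators on the first line, plus one
  cols=$(( $(head -n1 "$f" | grep -o ',' | wc -l) + 1 ))
  printf '%s %s %s\n' "${f%.csv}" "$rows" "$cols"
done > dims.txt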
I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after processing every file. I set FS to , assuming that fields are comma-separated and there is no , inside a field. FILENAME, FNR and NF are built-in GNU AWK variables: FNR is the number of the current row in the file, i.e. in ENDFILE the number of the last row, and NF is the number of fields (again, of the last row). If you have files with headers, use FNR-1; if you have rows prepended with a row number, use NF-1.
edit: changed NR to FNR
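For example, if your files had both a header row and a leading row-number column (as in the question's toy display), a hedged GNU awk variant would be:
# subtract the header row from FNR and the index column from NF
awk 'BEGIN{FS=","} ENDFILE{f=FILENAME; sub(/\.csv$/,"",f); print f, FNR-1, NF-1}' one.csv two.csv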
Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
cols=$(awk '{print NF; exit}' "$fn")
rows=$(awk 'END{print NR-1}' "$fn")
printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5

awk: filter out lines with less than 3 unique characters in a column

I have a dataset in a file:
1 AAA
1 AAAB
1 AAABC
2 ABCD
2 ABCDE
3 AABB
3 AABCD
3 ABCDE
3 ABBCCDE
Desired output
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
I have tried:
awk 'BEGIN{FS="\t"; OFS="\t"} {
unq=0; delete seen; for (i=1; i<=NF; i++) if (!seen[$i]++) unq++} unq > 3' file
But I can't get any results.
Your code does something quite different from what you state in your requirements. You loop over the tokens on the line (on the first line, there are two tokens, 1 and AAA) and check how many times you have seen those. Your script then prints if you have more than three unique tokens on the line, which you never do.
The concept you seem to be looking for is the substr() function, which allows you to loop over the individual characters in a string (for example).
awk 'BEGIN{FS="\t"; OFS="\t"}
{ unq=0; delete seen; for (i=1; i<=length($2); i++) if(!seen[substr($2, i, 1)]++) unq++ }
unq >= 3' file
To start debugging your own code, try adding a print statement at various strategic places to verify that you see what you expect; when you don't, you have found a bug. For example, it should be easy to see that for (i=1; i<=NF; ++i) print $i loops over the fields of the input line.
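As a minimal, runnable probe (assuming the tab-separated input from the question), this prints every field with its line and position so you can see exactly what awk splits on:
awk 'BEGIN{FS="\t"} { for (i = 1; i <= NF; ++i) print "line " NR ", field " i ": " $i }' file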
With an awk that splits into chars given a null field separator and supports length(array) (e.g. GNU awk but also several others):
$ cat tst.awk
{
    delete chars
    split($2,tmp,"")
    for (i in tmp) {
        chars[tmp[i]]
    }
}
length(chars) >= 3
$ awk -f tst.awk file
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
With perl:
$ perl -MList::Util=uniq -lane 'print if scalar(uniq split(//,$F[1])) > 2' ip.txt
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
$F[1] has the second column, which is then split into characters using an empty regexp. From this list, duplicates are removed using uniq, and then the size of the list is obtained using scalar.
Similar solution with ruby:
ruby -lane 'print if $F[1].chars.uniq.size > 2'
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
{
  count=0
  delete value
  num=split($2,arr,"")
  for(i=1;i<=num;i++){
    if(!value[arr[i]]++){ count++ }
  }
}
count>=3
' Input_file
Explanation: adding a detailed explanation of the above.
awk '                    ##Starting awk program from here.
{
  count=0                ##Setting count to 0 here.
  delete value           ##Deleting value array here.
  num=split($2,arr,"")   ##Splitting 2nd field into arr array with a null delimiter (one character per element).
  for(i=1;i<=num;i++){   ##Traversing through all values of arr array here.
    if(!value[arr[i]]++){ count++ }  ##If an element has NOT been seen before in value, increase count by 1 here.
  }
}
count>=3                 ##Checking if count>=3, then print that line.
' Input_file             ##Mentioning Input_file name here.
another approach: repeatedly delete every occurrence of the first remaining character and count the rounds; if the string survives into a third round (c>2), it has at least three distinct characters
$ awk '{c=0; for(f=$2; f&&c<3; c++) gsub(substr(f,1,1),"",f)} c>2' file
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
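The same logic spelled out with comments (a readability sketch of the one-liner above; it assumes the column is alphanumeric, since each character is used as a regex):
awk '{
    c = 0
    # strip every occurrence of the first remaining character, one round at a time;
    # stop when the string is empty or we have already done 3 rounds
    for (f = $2; f != "" && c < 3; c++)
        gsub(substr(f, 1, 1), "", f)
}
c > 2' file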

How do I compare alphanumeric values in non-sequential order?

Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but if I have alphanumeric values it doesn't seem to work well.
In its current state, apart from simply comparing the numbers, it does 2 things additionally:
It accounts for the zeros before a number or character and compares the absolute values only, ignoring leading zeros
If the same number or character occurs multiple times in both files, it outputs the additional occurrence
I just want the script to work well for alphanumeric values as well, as currently it only seems to work well with plain numbers. Can someone please edit the script to produce the desired output while also honoring the above 2 conditions?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1 : "1" , "01" , "001" evaluates to 1 * 3 times
In file 2 : "1" is present only once
Hence "1" is present twice in result ( 3-1 times )
"2" and "9" are exclusively present in file2 , So obviously both simply form part of output
In file1 : '027B" , "0027B" evaluates to 27B * 2 times
In file 2 - "27B" is present only once
Hence '27B" is present once in result ( 2 -1 times )
Explanation of matched items ( ones not forming part of expected output )
"8" from file1 ( line 4 )is matched with "08" from file2 ( line 3)
"12" from file1 ( line 6) is matched with "12" from file2 ( line 4)
"13C" from file1 (line 7 ) is matched with "13C" from file2 ( line 5 )
"2B" from file1 ( line 5 ) is matched with "02B" from file2 ( line 6 )
Lastly the order of items in expected output should be in ascending order like shown in my above example, lets say if the eg above had 3 in expected output it should read vertically as 1 1 2 3 9 27B
It should be enough to remove leading zeros when forming the key (with a special case for zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
RE-EDIT
If you just want the values sorted numerically, pipe into sort (output shown for an extended file1 that also contains 3, 4 and 5):
$ awk -f a.awk file1 file2 | sort -n
1
1
2
3
4
5
9
27B
To output in the order found in file2, you can remember that order in another array and then do all the printing in the END block. This version prints the values in the order of file2, with any values found only in file1 (here the extra 3, 4 and 5) printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
for (i=1; i<=FNR; ++i) {
k = b[i]
while(a[k]-->0) print k
}
for (k in a) {
while(a[k]-->0) print k
}
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
3
4
5

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like this: condition{action}):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet (note that if statements must live inside action blocks to be valid awk):
awk '
{ if (NR == FNR) { arr[NR] = $1 } }
{ if (NR != FNR && arr[FNR]) { print $0 } }
' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it implicitly prints the current line to STDOUT if the expression is TRUE (non-zero); see the small illustration after the explanations below.
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR is different from FNR on the second file)
arr[NR]=$1 : storing the first column in array arr, indexed by the current NR
if NR!=FNR we are in the next file, and if the value of the array is 1 (truthy), then we print
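A minimal illustration of that implicit-print rule, using a hypothetical seq input:
# a condition with no action prints the line whenever the expression is non-zero
seq 1 6 | awk 'NR % 2'    # prints 1, 3, 5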
Not as clean as an awk solution (and a bit fragile: sed deletes any line containing a 0 anywhere):
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
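A slightly sturdier take on the same paste idea (a sketch, assuming neither file contains tabs): test only file1's first field and print file2's whole line.
# fields: $1 = line from file1, $2 = line from file2 (tab-separated by paste)
paste file1 file2 | awk -F'\t' '$1 ~ /^1[[:space:]]/ { print $2 }'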
You mentioned something about millions of lines; in order to do just a single pass through the files, I'd resort to Python. Something like this perhaps (Python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
    getline value < "file2"
    if ($1)
        print value
}' file1
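If the files might not have the same length, it may be worth checking getline's return value (a sketch):
awk '{
    # getline returns 1 on success, 0 at EOF, -1 on error
    if ((getline value < "file2") > 0 && $1)
        print value
}' file1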

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B records. I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Does anyone have an idea on how to do this with awk?
Thanks
#!/bin/bash
# csplit prints the byte counts of the pieces, not line counts,
# so split quietly and count lines with wc -l instead
csplit -s file2 '/^B /'
acount=$(wc -l < xx00)
bcount=$(wc -l < xx01)
awk -v "acount=$acount" -v "bcount=$bcount" '
BEGIN { srand() }
NR == FNR {
    arr[$1]++;          # how many lines of each type file1 has
    next
}
! setup {
    setup = 1
    # pick arr["A"] distinct random line numbers out of the acount A lines
    while (arandcount < arr["A"]) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    # and arr["B"] distinct random line numbers out of the bcount B lines
    while (brandcount < arr["B"]) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file on the regex into xx00 and xx01 (this assumes file2 has all A lines before the B lines, since csplit splits at the first match); wc -l then counts the lines in each piece, and those counts are passed into the AWK program.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks, for each type, as many distinct line numbers as file1 requires, choosing random numbers between 1 and the count of lines of that "type". This block only gets executed once because of the setup flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and prints it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.
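A variant of the same idea: with GNU coreutils, shuf -n gives a uniform sample directly (sort -R sorts by a hash of the key, so duplicate lines end up adjacent rather than independently shuffled):
grep '^A' file2 | shuf -n "$(grep -c '^A' file1)"  > file3
grep '^B' file2 | shuf -n "$(grep -c '^B' file1)" >> file3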