Replace each successive instance of a pattern in one file with each successive item in another file - awk

Using awk or sed how can I loop through each instance of a pattern in one file and replace it with the contents of another file. For example:
File1.txt:
123 zone
123 patch
XXX family
456 zone
456 patch
XXX family
789 zone
789 patch
XXX family
File2.txt:
123
456
789
I know how to replace all instances of a pattern with the same number, i.e. -
{a = 0;}
($2 == "family") {printf("%f %s\n",$1=123,$2); a=1;}
{a = 0;}
but I do not know how to change each successive instance of a pattern to a different number - supplied in a list of numbers.
I want to replace the first instance of XXX family in File 1 with the first number in File 2, replace the second instance of XXX family in File 1 with the second number in File 2, etc... so I end up with:
New_file.txt
123 zone
123 patch
123 family
456 zone
456 patch
456 family
789 zone
789 patch
789 family

This might work for you (GNU sed):
sed -E '1{x;s/.*/cat file2.txt/e;s/ //g;x}
/XXX/{G;s/XXX([^\n]*)\n([^\n]*)/\2\1/;P;s/[^\n]*\n//;h;d}' file1.txt
Prime the hold space with the contents of file2.txt (removing any stray spaces on each line).
Match XXX and if so append the hold space, use pattern matching to replace XXX with the contents of the first line of the hold space. Print the result and then remove the first line and restore the remainder of the hold space, ready for the next replacement.

----------- Edit -----------------
From Glenn Jackman's very helpful and correct comment
awk 'FNR==NR {chk[FNR]=$1; next}
$1 == "XXX" {
$1 = chk[++j]
}
{print}' f2.txt f1.txt
Here is a verbal description of what is happening
(awk can read any number of files from its command-line arguments, limited by the OS command-line size)
We pass in the fix list file (f2.txt) first, so it can be captured into an array. Fortunately your data is in order so we use chk[1]-chk[3] as the keys to those values. next skips any further code in the script and reads the next record from f2, until all its data is stored in the chk array. No records have been printed yet.
The rest of the code processes records from the f1.txt file only (which is the 2nd file in the list). It checks that the record's first field is XXX, using $1 == "XXX". If that test is true, we cycle thru the list stored in the chk[] array, and replace the XXX field with the next value from chk[++j]. We are using the j variable as a counter to index the elements of the chk[] array.
As all records need to be printed, including the now "fixed" record, we use the print command.
(If you need to learn about ++ for variables, you'll do best to consult a programming book, as it is often the topic for a complete chapter.)
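An alternative sketch of the same idea: pull the next replacement straight from f2.txt with getline each time an XXX record appears (using the f1.txt/f2.txt names from above; a robust version would also check getline's return value, which is 0 at end of file and -1 on error):

```shell
# Read the next line of f2.txt into $1 whenever the first field is XXX;
# assigning $1 rebuilds $0 with the default OFS. The trailing 1 prints
# every record, modified or not.
awk '$1 == "XXX" { getline $1 < "f2.txt" } 1' f1.txt
```

Because the file stays open between calls, each getline consumes the next line of f2.txt, which gives the successive-replacement behaviour asked for.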
------------- Original answer ------------------
awk 'FNR==NR{
chk[++i]=$1;next
}
{
if (FNR!=NR && $1 ~ /^XXX/){
$1=chk[++j]
print $0
}
else {
print $0
}
}' f2.txt f1.txt
output
123 zone
123 patch
123 family
456 zone
456 patch
456 family
789 zone
789 patch
789 family
awk can read any number of files from its command line arguments (limited by the OS cmd-line size)
We pass in the fix list file (f2.txt) first, so it can be captured into an array. Fortunately your data is in order so we use chk[1]-chk[3] as the keys to those values. next skips any further processing to get the next record from f2 until all data is stored in the chk array.
The second block only processes records from f1.txt (which is the 2nd file in the list). This is done with the check FNR!=NR, which you can read all about in various awk books (this is a common pattern for multi-file processing where 2 "types" of data are being processed).
And we check that the record begins with XXX, using the regexp /^XXX/. ^ means "anchored to the beginning of the line".
We cycle thru the list stored in the chk[] array, and replace the XXX field with the next value from chk[++j].
With final else, we print any records that haven't been modified.

Related

Printing out a particular row based on condition in another row

Apologies if this is really basic stuff, but I just started with awk.
So I have an input file I'm piping into awk like below. The format never changes:
name: Jim
gender: male
age: 40
name: Joe
gender: female
age: 36
name: frank
gender: Male
age: 40
I'm trying to list all names where age is 40
I can find them like so
awk '$2 == "40" {print $2 }'
but cant figure out how to print the name
Could you please try the following (I am driving right now so couldn't test it).
awk '/^age/{if($NF==40){print val};val="";next} /^name/{val=$0}' Input_file
Explanation: one condition checks for ^name; if a line starts with it, store that line's value in variable val. The other condition checks if a line starts with age; if so, and that line's last field equals 40, print the value of variable val and nullify it too.
Using gnu awk and setting the Record Separator (RS) to nothing makes it work with blocks.
awk -v RS="" '/age: 40/ {print $2}' file
Jim
frank
Some shorter awk versions of suspectus's and RavinderSingh13's posts
awk '/^name/{n=$2} /^age/ && $NF==40 {print n}' file
awk '/^name/{n=$2} /^age: 40/ {print n}' file
Jim
frank
If line starts with name, store the name in n
IF line starts with age and age is 40 print n
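As a quick sanity check, the first one-liner can be run against the sample data (here fed via a heredoc instead of a file):

```shell
awk '/^name/{n=$2} /^age/ && $NF==40 {print n}' <<'EOF'
name: Jim
gender: male
age: 40
name: Joe
gender: female
age: 36
name: frank
gender: Male
age: 40
EOF
# Jim
# frank
```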
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
So based on this, we can immediately list all records that have the line age: 40
$ awk 'BEGIN{RS="";ORS="\n\n"}/age: 40/' file
There are a couple of problems with the above line:
What if we have a person that is 400 yr old? He will be listed because the line age: 400 contains the requested string age: 40.
What if we have a record with a typo stating age:40 or age : 40
What if our record has a line stating wage: 40 USD/min
To solve most of these problems, it is easier to work with well-defined fields in the record and build the key-value-pairs per record:
key value
---------------
name => Jim
gender => male
age => 40
and then, we can use this to select the requested information:
$ awk 'BEGIN{RS="";FS="\n"}
# build the record
{ delete rec;
for(i=1;i<=NF;++i) {
# find the first ":" and select key and value as substrings
j=index($i,":"); key=substr($i,1,j-1); value=substr($i,j+1)
# remove potential spaces from front and back
gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",key)
gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",value)
# store key-value pair
rec[key] = value
}
}
# select requested information and print
(rec["age"] == 40) { print rec["name"] }' file
This is not a one-liner, but it is robust. Furthermore, this method is fairly flexible and adaptable to make selections based on a more complex logic.
If you are not averse to using grep and the format is always the same:
cat filename | grep -B2 "age: 40" | grep -oP "(?<=name: ).*"
Jim
frank
awk -F':' '/^name/{name=$2} \
/^age/{if ($NF==40)print name}' input_file

AWK to look for common data in two column in same file

I am using the below code/command to look for common data in two files and send the output to a file if found.
awk 'FNR==NR{l[$0]=NR; next}; $0 in l{print $0, l[$0], FNR}' MF*.txt OF*.txt > F22.txt
But I need your help here: using awk, I need to look for common data in two columns in the same file only.
For example, below are the columns of data. I need to search: if column A is equal to the word CAN, and column B contains some value found in column C, I would then like to print these matches into a separate file with the line number and any error code.
A B C
CAN 9876 45678
CAN 1234 93939
CAN 45678 9090
ABC 4567 8080
BCD 97654 9876
CAN 9090 8181
Many thanks in advance.
You can use the following awk command:
awk 'NR>1{if($1=="CAN"){lines[$2]=NR" "$0;}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' input_file.txt > output_file.out
to identify the lines that meet the condition described in your question:
need to search if column A is equal to the word CAN then if column B
contains some value found in column C and would then like to print
these matches into separate file with line number and any error code.
output with your input data:
7 CAN 9090 8181
2 CAN 9876 45678
4 CAN 45678 9090
Explanations:
NR>1 to start the processing from the second line
if($1=="CAN"){lines[$2]=NR" "$0;} for each line that starts with CAN you store the line with its line number in an array for which the key/index is the value of the 2nd column.
C[$3]=$3 stores all the values of the 3rd column
END{for(i in lines){if(C[i])print lines[i]}} checks, for all lines stored, whether there exists a value in the 3rd column that is equal to the value of the 2nd column; if that is the case, it outputs the line number and the line in question.
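Note that for (i in lines) visits the array in an implementation-defined order, which is why the output above is not sorted by line number. A quick reproduction (writing the sample to input_file.txt, then sorting for a stable view):

```shell
# Recreate the sample data and run the command; sort -n orders the
# result by the leading line number.
printf 'A B C\nCAN 9876 45678\nCAN 1234 93939\nCAN 45678 9090\nABC 4567 8080\nBCD 97654 9876\nCAN 9090 8181\n' > input_file.txt
awk 'NR>1{if($1=="CAN"){lines[$2]=NR" "$0;}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' input_file.txt | sort -n
# 2 CAN 9876 45678
# 4 CAN 45678 9090
# 7 CAN 9090 8181
```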
If you have several input files to process you can use the following awk
command:
awk 'FNR>1{if($1=="CAN"){lines[$2]=FILENAME" "FNR" "$0;}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' MF*.txt OF*.txt > output_file.out
where
FNR is the relative position in each file
FILENAME is the name of the current file being processed
Hypothesis: each file's first line is the header A B C; if that is not the case, you can remove FNR>1.
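A minimal illustration of the FNR / NR / FILENAME distinction used above (the file names here are just examples):

```shell
# Two small files; FNR restarts at 1 for each file, NR never resets.
printf 'a\nb\n' > f1.txt
printf 'c\n'    > f2.txt
awk '{ print FILENAME, FNR, NR }' f1.txt f2.txt
# f1.txt 1 1
# f1.txt 2 2
# f2.txt 1 3   (FNR reset to 1; NR kept counting)
```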

Awk: Append output to new field in existing file

Is there a way to print the output of an awk script to an existing file as a new field every time?
Hi!
I'm very new at awk (so my terminology might not be correct, sorry about that!) and I'm trying to print the output of a script that will operate on several hundred files to the same file, in different fields.
For example, my data files have this structure:
#File1
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
#File2
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
I have several of these files, which are divided by numbered month, with the same names, but different values. I want files that are named by the name of the person, and the values in fields by month, like so:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56
In my script, I search for the word "values", and determine which records to print (based on the number after "value"). This works fine. Then I want to print these values. It works fine for one file, with the command:
print $0 > name  # the variable name I have saved to be $3 of the correct row
This creates three files correctly named "Hanna", "Josh" and "Anna", with their values. However, I would like to run the script for all my datafiles, and append them to only one "Hanna"-file etc, in a new field.
So what I'm looking for is something like print $0 > $month name, reading out like "print the record to the field corresponding to the month"
I have tried to find a solution, but most solutions either just paste temporary files together or append the values after the existing ones (so that they all are in field 1). I want to avoid the temporary files and have them in different fields (so that I get a kind of matrix-structure).
Thank you in advance!
Try the following, though I have not checked all permutations and combinations and have only considered your post. Also, your output's Josh column is not consistent (or please do let us know if there are more conditions for it). Let me know how it goes then.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
EDIT: Adding a non-one liner form of solution too.
awk 'FNR==NR{
if($0 ~ /^Values/){
Q=$NF;
B[$NF]=$NF;
i="";
next
};
A[Q,++i]=$0;
next
}
/^Values/{
V=$NF;
print "#"B[V];
i="";
next
}
B[V]{
print A[V,++i],$0
}
' file1 file2
EDIT2: Adding explanation too now for same.
awk 'FNR==NR{ ###Checking condition FNR==NR, which will be TRUE only when the first file named file1 is being read. FNR and NR both indicate the number of lines in an Input_file; the only difference between them is that FNR's value is RESET whenever the next Input_file is being read, while NR's value keeps increasing till all the Input_files are read.
if($0 ~ /^Values/){ ###Checking here if any line starts from string Values if yes then perform following operations.
Q=$NF; ###Creating a variable named Q whose value is the last field of the line.
B[$NF]=$NF;###Creating an array named B whose index is $NF(last field of the line) and value is same too.
i=""; ###Making variable i value to NULL now.
next ###using next here, it is built-in keyword for awk and it will skip all further statements now.
};
A[Q,++i]=$0; ###Creating an array named A whose index is Q and variable i with increasing value with 1 to it, each time it comes on this statement.
next ###Using next will skip all further statements now.
}
/^Values/{ ###All statements from here will be executed when second file named file2 is being read. So I am checking here if a line starts from string Values then do following.
V=$NF; ###create variable V whose value is $NF of current line.
print "#"B[V]; ###printing the string # then value of array B whose index is variable V.
i=""; ###Nullifying the variable i value here.
next ###next will skip all the further statements now.
}
B[V]{ ###Checking here if array B with index V is having a value in it, then perform following on it too.
print A[V,++i],$0 ###printing the value of array A whose index is variable V and variable i increasing value with 1 and current line.
}
' file1 file2 ###Mentioning the Input_files here named file1 and file2.

How to compare two strings of a file match the strings of another file using AWK?

I possess 2 huge files and I need to count how many entries of file 1 exist on file 2.
The file 1 contains two ids, source and destination, like below:
11111111111111|22222222222222
33333333333333|44444444444444
55555555555555|66666666666666
11111111111111|44444444444444
77777777777777|22222222222222
44444444444444|00000000000000
12121212121212|77777777777777
01010101010101|01230123012301
77777777777777|97697697697697
66666666666666|12121212121212
The file 2 contains the valid id list, which will be used to filter file 1:
11111111111111
22222222222222
44444444444444
77777777777777
00000000000000
88888888888888
66666666666666
99999999999999
12121212121212
01010101010101
What I am struggling to achieve is find a way to count how many entries in file one possess the entry in file 2. Only when both numbers in the same line
exist in file 2 will the line be counted.
For example:
11111111111111|22222222222222 — This will be counted because both entries exist on file 2, as well as 77777777777777|22222222222222 because both entries exist on file 2.
33333333333333|44444444444444 — This will not be counted because 33333333333333 does not exist on file 2 and the same goes to 55555555555555|66666666666666, the first does not exist on file 2.
So with the examples I mentioned in the beginning it should count 6, and printing this count should be enough; better than editing either file.
awk -F'|' 'FNR == NR { seen[$0] = 1; next }
seen[$1] && seen[$2] { ++count }
END { print count }' file2 file1
Explanation:
1) FNR == NR (the record number within the current file equals the overall record number) is only true for the first input file, which is file2 (the order is important!). Thus for every line of file2, we record the number in seen.
2) For other lines (which is file1, given second on the command line) if the |-separated fields (-F'|') number 1 and 2 were both seen (in file2), we increment count by one.
3) In the END block, output the count.
Caveat: Every unique number in file2 is loaded into memory. But this also makes it fast instead of having to read through file2 over and over again.
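As a quick check against the sample data (writing the two lists to file1 and file2), the command prints 6:

```shell
cat > file1 <<'EOF'
11111111111111|22222222222222
33333333333333|44444444444444
55555555555555|66666666666666
11111111111111|44444444444444
77777777777777|22222222222222
44444444444444|00000000000000
12121212121212|77777777777777
01010101010101|01230123012301
77777777777777|97697697697697
66666666666666|12121212121212
EOF
cat > file2 <<'EOF'
11111111111111
22222222222222
44444444444444
77777777777777
00000000000000
88888888888888
66666666666666
99999999999999
12121212121212
01010101010101
EOF
# file2 first, so its ids land in seen[] before file1 is scanned
awk -F'|' 'FNR == NR { seen[$0] = 1; next }
           seen[$1] && seen[$2] { ++count }
           END { print count }' file2 file1
# 6
```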
Don't know how to do it in awk but if you are open to a quick-and-dirty bash script that someone can help make efficient, you could try this:
searcher.sh
-------------
#!/bin/bash
file1="$1"
file2="$2"
# split by pipe
while IFS='|' read -ra line; do
# find 1st item in file2. If found, find 2nd item in file2
grep -q "${line[0]}" "$file2"
if [ $? -eq 0 ]; then
grep -q "${line[1]}" "$file2"
if [ $? -eq 0 ]; then
# print line since both items were found in file2
echo "${line[0]}|${line[1]}"
fi
fi
done < "$file1"
Usage
------
bash searcher.sh file1 file2
Result using your example
--------------------------
$ time bash searcher.sh file1 file2
11111111111111|22222222222222
11111111111111|44444444444444
77777777777777|22222222222222
44444444444444|00000000000000
12121212121212|77777777777777
66666666666666|12121212121212
real 0m1.453s
user 0m0.423s
sys 0m0.627s
That's really slow on my old PC.

Awk - use particular line again to match with patterns

Suppose I have file:
1Alorem
2ipsuml
3oremip
4sumZAl
5oremip
6sumlor
7emZips
I want to split the text from lines containing A to lines containing Z, matching with a range:
/A/,/Z/ {
print > "rangeX.txt"
}
I want this particular input to give me 2 files:
1Alorem
2ipsuml
3oremip
4sumZAl
and
4sumZAl
5oremip
6sumlor
7emZips
The problem is that line 4 is taken only once and is matched as the end of the first range; the second range never starts because there is no A in the other lines.
Is there a way to try to match line 4 again against all the patterns, or to tell awk that it has to start a new range?
Thanks
As Arne pointed out, the second section will not be caught by the current pattern. Here is an alternative without the range.
awk 'p==0 {p=($0~/A/); filenr++} p==1 {print > ("range" filenr ".txt"); p=($0~/Z/)==0; if(!p && $0~/A/){filenr++; p=1; print > ("range" filenr ".txt")}}' test.txt
It also handles more than two sections
All you need to do is save the last line of the first range to a variable and then reprint that variable, along with the following range, for the second file.
In other words, since you're just looping through each line, define an empty variable in your BEGIN and then update it each time through. You'll have the variable saved as the last line when your range ends. Write out that line to the next file before you begin again.
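That approach can be sketched as a small state machine rather than a range (a sketch only; the rangeN.txt output names are an assumption): open a range on a line containing A, print every line while inside one, and when the closing line contains both Z and A, write it to the next range as well.

```shell
awk '
!p && /A/ { p = 1; n++ }                  # a line with A opens range n
p         { print > ("range" n ".txt") }  # every line inside a range is written
p && /Z/  { p = 0                         # a line with Z closes the range...
            if (/A/) {                    # ...and if it also contains A, it
                p = 1; n++                #    opens the next range, so the line
                print > ("range" n ".txt")#    lands in both files
            } }
' file
```

With the sample input this produces range1.txt holding lines 1-4 and range2.txt holding lines 4-7, with line 4 shared between them.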
There is no way to rematch a record, but writing a variant of the pattern is an option. Here the second range pattern matches from a line containing A and Z to a line containing Z but not A:
awk '/A/,/Z/ {print 1, $0} (/A/ && /Z/),(/Z/ && !/A/) {print 2, $0}' file
prints:
1 1Alorem
1 2ipsuml
1 3oremip
1 4sumZAl
2 4sumZAl
2 5oremip
2 6sumlor
2 7emZips
As your sample is a bit synthetic I don't know if that solution fits your real problem.