AWK to look for common data in two columns in the same file - awk

I am using the code/command below to look for common data in two files, sending the output to another file if found.
awk 'FNR==NR{l[$0]=NR; next}; $0 in l{print $0, l[$0], FNR}' MF*.txt OF*.txt > F22.txt
But now I need your help: using awk, I need to look for common data in two columns in the same file only.
For example, below are two columns of data. If column A equals the word CAN, and column B contains some value found in column C, I would like to print these matches into a separate file with the line number and any error code.
A B C
CAN 9876 45678
CAN 1234 93939
CAN 45678 9090
ABC 4567 8080
BCD 97654 9876
CAN 9090 8181
Many thanks in advance.

You can use the following awk command:
awk 'NR>1{if($1=="CAN"){lines[$2]=NR" "$0;}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' input_file.txt > output_file.out
to identify the lines that meet the condition described in your question:
need to search if column A is equal to the word CAN then if column B
contains some value found in column C and would then like to print
these matches into separate file with line number and any error code.
output with your input data:
7 CAN 9090 8181
2 CAN 9876 45678
4 CAN 45678 9090
Explanations:
NR>1 to start the processing from the second line
if($1=="CAN"){lines[$2]=NR" "$0;} for each line that starts with CAN, stores the line with its line number in an array whose key/index is the value of the 2nd column.
C[$3]=$3 stores all the values of the 3rd column
END{for(i in lines){if(C[i])print lines[i]}} checks, for each stored line, whether some value in the 3rd column equals the value of its 2nd column; if so, it outputs the line number and the line in question.
If you have several input files to process, you can use the following awk command:
awk 'FNR>1{if($1=="CAN"){lines[$2]=FILENAME" "FNR" "$0;}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' MF*.txt OF*.txt > output_file.out
where
FNR is the relative position in each file
FILENAME is the name of the current file being processed
Assumption: the first line of each file is the header A B C; if that is not the case, you can remove FNR>1.
TEST:
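For instance, running the single-file command on the sample data from the question should reproduce the output above. Note that the result is piped through sort -n here, since for (i in lines) iterates in an unspecified order in awk:

```shell
# Build the sample input from the question.
cat > input_file.txt <<'EOF'
A B C
CAN 9876 45678
CAN 1234 93939
CAN 45678 9090
ABC 4567 8080
BCD 97654 9876
CAN 9090 8181
EOF

# Run the answer's command; sort numerically by line number for readability.
awk 'NR>1{if($1=="CAN"){lines[$2]=NR" "$0}C[$3]=$3}END{for(i in lines){if(C[i])print lines[i]}}' input_file.txt | sort -n
# 2 CAN 9876 45678
# 4 CAN 45678 9090
# 7 CAN 9090 8181
```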

Related

Replace each successive instance of a pattern in one file with each successive item in another file

Using awk or sed how can I loop through each instance of a pattern in one file and replace it with the contents of another file. For example:
File1.txt:
123 zone
123 patch
XXX family
456 zone
456 patch
XXX family
789 zone
789 patch
XXX family
File2.txt:
123
456
789
I know how to replace all instances of a pattern to the same number, i.e. -
{a = 0;}
($2 == "family") {printf("%f %s\n",$1=123,$2); a=1;}
{a = 0;}
but I do not know how to change each successive instance of a pattern to a different number, supplied in a list of numbers.
I want to replace the first instance of XXX family in File 1 with the first number in File 2, replace the second instance of XXX family in File 1 with the second number in File 2, etc... so I end up with:
New_file.txt
123 zone
123 patch
123 family
456 zone
456 patch
456 family
789 zone
789 patch
789 family
This might work for you (GNU sed):
sed -E '1{x;s/.*/cat file2.txt/e;s/ //g;x}
/XXX/{G;s/XXX([^\n]*)\n([^\n]*)/\2\1/;P;s/[^\n]*\n//;h;d}' file1.txt
Prime the hold space with the contents of file2.txt (remove any trailing spaces on each line).
Match XXX and if so append the hold space, use pattern matching to replace XXX with the contents of the first line of the hold space. Print the result and then remove the first line and restore the remainder of the hold space, ready for the next replacement.
----------- Edit -----------------
From Glenn Jackman's very helpful and correct comment
awk 'FNR==NR {chk[FNR]=$1; next}
$1 == "XXX" {
$1 = chk[++j]
}
{print}' f2.txt f1.txt
Here is a verbal description of what is happening
(awk can read any number of files from its command-line arguments (limited by the OS cmd-line size))
We pass in the fix list file (f2.txt) first, so it can be captured into an array. Fortunately your data is in order, so we use chk[1]-chk[3] as the keys to those values. next skips any further code and reads the next record from f2, until all its data is stored in the chk array. No records have been printed yet.
The rest of the code processes records from the f1.txt file only (which is the 2nd file in the list). It checks whether the record's first field is XXX, using $1 == "XXX". If that test is true, we cycle through the list stored in the chk[] array and replace the XXX field with the next value from chk[++j]. We use the j variable as a counter to index the elements of the chk[] array.
As all records need to be printed, including the now "fixed" record, we use the print command.
(If you need to learn about ++ for variables, you'll do best to consult a programming book, as it is often the topic for a complete chapter.)
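Putting the comment's version together with the sample files from the question as a quick check:

```shell
# Recreate the question's two input files.
cat > f2.txt <<'EOF'
123
456
789
EOF
cat > f1.txt <<'EOF'
123 zone
123 patch
XXX family
456 zone
456 patch
XXX family
789 zone
789 patch
XXX family
EOF

# Read the replacement list first, then substitute each XXX in turn.
awk 'FNR==NR {chk[FNR]=$1; next}
     $1 == "XXX" { $1 = chk[++j] }
     {print}' f2.txt f1.txt > New_file.txt
cat New_file.txt
# 123 zone
# 123 patch
# 123 family
# 456 zone
# 456 patch
# 456 family
# 789 zone
# 789 patch
# 789 family
```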
------------- Original answer ------------------
awk 'FNR==NR{
chk[++i]=$1;next
}
{
if (FNR!=NR && $1 ~ /^XXX/){
$1=chk[++j]
print $0
}
else {
print $0
}
}' f2.txt f1.txt
output
123 zone
123 patch
123 family
456 zone
456 patch
456 family
789 zone
789 patch
789 family
awk can read any number of files from its command line arguments (limited by the OS cmd-line size)
We pass in the fix list file (f2.txt) first, so it can be captured into an array. Fortunately your data is in order so we use chk[1]-chk[3] as the keys to those values. next skips any further processing to get the next record from f2 until all data is stored in the chk array.
The second block only processes records from the f1.txt (which is the 2nd file in the list) . This is done with the check FNR!=NR, which you can read all about in various awk books (this is a common pattern for multi-file processing where 2 "types" of data are being processed).
And we check that the record begins with XXX, using a regexp /^XXX/. ^ means "anchored to the beginning of the line".
We cycle thru the list stored in the chk[] array, and replace the XXX field with the next value from chk[++j].
With final else, we print any records that haven't been modified.

awk merging 2 columns and adding an extra column to txt file [duplicate]

This question already has answers here:
Why does my tool output overwrite itself and how do I fix it?
(3 answers)
Closed 3 years ago.
I did this in the past without problems, but I can't get it to work this time and I don't understand why.
My original file is
1002 10214
1002 10220
1002 10222
1002 10248
1002 10256
I need to make a new file where the 2 columns above are merged and add a second column with value 1
Desired output should look like this
100210214 1
100210220 1
100210222 1
100210248 1
100210256 1
I tried the awk commands below, first printing the 2 columns as 1 into a tmp file, then adding the extra column with "1":
cat input.txt | awk '{ print ($1$2)}' > tmp1.txt
cat tmp1.txt | awk ' {print $0, (1) }' > output.txt
While the first command seems to work ok, the second does not.
tmp1.txt (OK)
100210214
100210220
100210222
100210248
100210256
output.txt (not OK)
10210214
10210220
10210222
10210248
10210256
The "1" comes at the front of the first column, and I'm not sure why; it even replaces the first 2 characters. Is it because the original input file is different (maybe a "space" was used instead of a tab)?
Could you please try the following.
awk 'BEGIN{OFS="\t"} {sub(/\r$/,"");print $1 $2,"1"}' Input_file
This happens when input file has Windows line endings (i.e. \r\n). You can fix it using this command:
dos2unix file
and then get the desired output with this one:
awk '{$1=$1$2;$2=1}1' file
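If you'd rather not run dos2unix first, the two answers above can be merged into a single awk call, stripping the trailing \r before rebuilding the record:

```shell
# Simulate the problem input: Windows line endings (\r\n).
printf '1002 10214\r\n1002 10220\r\n1002 10222\r\n' > file

# Strip the \r, merge the two fields, and set the 2nd field to the constant 1;
# the trailing 1 pattern prints the rebuilt record.
awk '{sub(/\r$/,""); $1=$1$2; $2=1} 1' file
# 100210214 1
# 100210220 1
# 100210222 1
```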

Awk: Append output to new field in existing file

Is there a way to print the output of an awk script to an existing file as a new field every time?
Hi!
I'm very new at awk (so my terminology might not be correct, sorry about that!) and I'm trying to print the output of a script that will operate on several hundred files to the same file, in different fields.
For example, my data files have this structure:
#File1
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
#File2
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
I have several of these files, which are divided by numbered month, with the same names, but different values. I want files that are named by the name of the person, and the values in fields by month, like so:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56
In my script, I search for the word "values", and determine which records to print (based on the number after "value"). This works fine. Then I want to print these values. It works fine for one file, with the command:
print $0 > name   # the variable name I have saved to be = $3 of the correct row
This creates three files correctly named "Hanna", "Josh" and "Anna", with their values. However, I would like to run the script for all my datafiles, and append them to only one "Hanna"-file etc, in a new field.
So what I'm looking for is something like print $0 > $month name, reading out like "print the record to the field corresponding to the month"
I have tried to find a solution, but most solutions either just paste temporary files together or append the values after the existing ones (so that they all are in field 1). I want to avoid the temporary files and have them in different fields (so that I get a kind of matrix-structure).
Thank you in advance!
Try the following, though I have not checked all permutations and combinations and have only considered your post. Also, your output's Josh column is not consistent (or please do let us know if there are more conditions for it too). Let me know how it goes.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
EDIT: Adding a non-one liner form of solution too.
awk 'FNR==NR{
if($0 ~ /^Values/){
Q=$NF;
B[$NF]=$NF;
i="";
next
};
A[Q,++i]=$0;
next
}
/^Values/{
V=$NF;
print "#"B[V];
i="";
next
}
B[V]{
print A[V,++i],$0
}
' file1 file2
EDIT2: Adding an explanation for the same now too.
awk 'FNR==NR{ ###Checking condition FNR==NR, which will be TRUE only when the first file named file1 is being read. FNR and NR both indicate the number of lines read from an Input_file; the only difference between them is that FNR is RESET whenever the next Input_file starts being read, while NR keeps increasing till all the Input_files are read.
if($0 ~ /^Values/){ ###Checking here if any line starts from string Values if yes then perform following operations.
Q=$NF; ###Creating a variable named Q whose value is the last field of the line.
B[$NF]=$NF;###Creating an array named B whose index is $NF(last field of the line) and value is same too.
i=""; ###Making variable i value to NULL now.
next ###Using next here; it is a built-in keyword for awk and it will skip all further statements now.
};
A[Q,++i]=$0; ###Creating an array named A whose index is Q together with variable i, which increases by 1 each time this statement runs; the value is the current line.
next ###Using next will skip all further statements now.
}
/^Values/{ ###All statements from here will be executed when second file named file2 is being read. So I am checking here if a line starts from string Values then do following.
V=$NF; ###create variable V whose value is $NF of current line.
print "#"B[V]; ###printing the string # then value of array B whose index is variable V.
i=""; ###Nullifying the variable i value here.
next ###next will skip all the further statements now.
}
B[V]{ ###Checking here if array B with index V is having a value in it, then perform following on it too.
print A[V,++i],$0 ###Printing the value of array A whose index is variable V together with variable i (increased by 1), along with the current line.
}
' file1 file2 ###Mentioning the Input_files here named file1 and file2.
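As a quick sanity check, feeding the two sample files from the question through the one-liner should reproduce the desired per-person layout:

```shell
# Recreate the question's two monthly files.
cat > file1 <<'EOF'
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
EOF
cat > file2 <<'EOF'
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
EOF

# Pair each person's file1 values with their file2 values, row by row.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
# #Hanna
# 20 45
# 15 60
# #Josh
# 30 98
# 56 63
# #Anna
# 50 10
# 70 56
```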

Comparing corresponding values of two lines in a file using awk [duplicate]

This question already has answers here:
Finding max value of a specific date awk
(3 answers)
Closed 6 years ago.
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
I have the line position of two lines in a file, say line1 and line2. These lines may be anywhere in the file, but I can access the line position using a search keyword based on name (the first word) in each line.
20160801 means yyyymmdd and has an associated value separated by |
I need to compare the values associated with each of the date for the given two lines.
I am a newbie in awk. I am not understanding how to compare these two lines at the same time.
Your question is not at all clear. Perhaps the first step is to clearly articulate 1) What is the problem I am trying to solve; 2) what tools or data do I have to solve it?
The only hints specific to your question I can offer (since your problem statement is not clearly articulated) are these:
In awk, you can compare two different files by using the test FNR==NR which is only true on the first file.
You can find the key words by using a regular expression of the form /^name1/ which means lines that start with that pattern
You can split on a delimiter in awk by setting the field separator to that delimiter; in this case (I think) it sounds like that is |, but you are also comparing whitespace-delimited fields inside of those fields?
You can compare by saving the data from the first line and comparing with the data from the second line in the other file once you can articulate what 'compare' means to you.
Wrapping that up, given:
$ cat /tmp/f1.txt
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
$ cat /tmp/f2.txt
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
You can find the lines in question like so:
$ awk -F"|" '/^name/ && FNR==NR {print $1}' f1.txt f2.txt
name1 20160801
$ awk -F"|" '/^name/ && FNR<NR {print $1}' f1.txt f2.txt
name2 20160801
(I have only printed the first field for clarity)
Then use that to compare. Save the first in an associative array and then compare the second when found.
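A minimal sketch of that idea, using a shortened version of the sample lines, assuming both lines live in the same file and that "compare" means printing the two values side by side per date (the array names v1 and v2 are mine; adjust the END block for whatever comparison you actually need):

```shell
# Shortened sample: one line per name, date|value tokens after the name.
cat > data.txt <<'EOF'
name1 20160801|76 20160802|67 20160803|49
name2 20160801|32413 20160802|37707 20160803|32230
EOF

# Store name1's value per date, then pair it with name2's value for the
# same date; sort keeps dates in order since for-in order is unspecified.
awk '
/^name1/ { for (f = 2; f <= NF; f++) { split($f, a, "|"); v1[a[1]] = a[2] } }
/^name2/ { for (f = 2; f <= NF; f++) { split($f, a, "|"); v2[a[1]] = a[2] } }
END      { for (d in v1) if (d in v2) print d, v1[d], v2[d] }
' data.txt | sort
# 20160801 76 32413
# 20160802 67 37707
# 20160803 49 32230
```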

Manipulating the awk output depending on the number of occurrences

I don't know how to word it well. I have an input file where the first column of each row is an index. I need to convert this input file into a multi-column output file so that the starting indexes of the columns match.
I have an input file in the following format:
1 11.32 12.55
1 13.32 17.55
1 56.77 33.22
2 34.22 1.112
3 12.13 13.14
3 12.55 34.55
3 22.44 12.33
3 44.32 77.44
The expected output should be:
1 11.32 12.55 2 34.22 1.112 3 12.13 13.14
1 13.32 17.55 3 12.55 34.55
1 56.77 33.22 3 22.44 12.33
3 44.32 77.44
Is there an easy way I can do this in awk?
Something like this, in bash:
paste <(grep '^1 ' input.txt) <(grep '^2 ' input.txt) <(grep '^3 ' input.txt)
paste has an option to set the delimiter if you don't want the default tab characters used, or you could post-process the tabs with expand...
EDIT: For an input file with many more tags, you could take this sort of approach:
awk '{print > ("/tmp/output" $1 ".txt")}' input.txt
paste /tmp/output*.txt > final-output.txt
The awk line outputs each line to a file named after the first field of the line, then paste recombines them.
EDIT: as pointed out in a comment below, you might have issues if you end up with more than 9 intermediate files. One way around that would be something like this:
paste /tmp/output[0-9].txt /tmp/output[0-9][0-9].txt > final-output.txt
Add additional arguments as needed if you have more than 99 files... or more than 999... If that's the case, though, a python or perl solution might be a better route...
If all you need is independently running columns (without trying to line up matching items between the columns or anything like that) then the simplest solution might be something like:
awk '{print > ($1".OUT")}' FILE; paste 1.OUT 2.OUT 3.OUT
The only issue with that is it won't fill in missing columns so you will need to fill those in yourself to line up your columns.
If the column width is known in advance (and the same for every column) then using:
paste 1.OUT 2.OUT 3.OUT | sed -e 's/^\t/ \t/;s/\t\t/\t \t/'
where those spaces are the width of the column should get you what you want. I feel like there should be a way to do this in a more automated fashion but can't think of one offhand.
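If you do want to avoid the intermediate files entirely, here is one possible pure-awk sketch. It assumes the input is grouped and sorted by the index column (as in the question) and joins columns with a single space to match the expected output; it skips missing entries rather than padding them, which is exactly what the expected output shows:

```shell
# Recreate the question's input.
cat > input.txt <<'EOF'
1 11.32 12.55
1 13.32 17.55
1 56.77 33.22
2 34.22 1.112
3 12.13 13.14
3 12.55 34.55
3 22.44 12.33
3 44.32 77.44
EOF

# Collect rows per index, remember each index's order of first appearance,
# then emit row r of every index that still has one, joined by a space.
awk '
!($1 in cnt) { keys[++k] = $1 }
{ rows[$1, ++cnt[$1]] = $0; if (cnt[$1] > max) max = cnt[$1] }
END {
    for (r = 1; r <= max; r++) {
        line = ""
        for (i = 1; i <= k; i++)
            if (r <= cnt[keys[i]])
                line = line (line == "" ? "" : " ") rows[keys[i], r]
        print line
    }
}' input.txt
# 1 11.32 12.55 2 34.22 1.112 3 12.13 13.14
# 1 13.32 17.55 3 12.55 34.55
# 1 56.77 33.22 3 22.44 12.33
# 3 44.32 77.44
```

If you need aligned (padded) columns instead of this compact form, you would have to emit a fixed-width placeholder when r > cnt[keys[i]].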