I am comparing two files using awk. Here is a representation of the first file (file1.txt):
1
15
MRUKLM
GHLKGM
BNUIOK
Here is a representation of the second file (file2.txt):
AGHLKMT
MFBGSJY
GSBDGLM
I want to compare the two files based on certain patterns. Moreover, the first line of the output file should contain the total number of lines in the second file, followed by the second and third lines of the first file. Hence, the header of the output file should be as follows:
3 (total lines of the second file)
15 (second line of the first file)
MRUKLM (third line of the first file)
certain pattern.....
certain pattern....
certain pattern....
I wrote the following code:
vari=$(wc -l file2.txt)|awk -v lin="" 'NR==FNR{if(NR>1 && NR<4)lin=$lin$0;else a[NR]=$0;next}BEGIN{print vari,lin}match($0,/([0-9]*)_(.*)/,c){print a[2*c[1]+2];print a[2*c[1]+3]}' file1.txt file2.txt> output_file.txt
The part of the code that extracts the pattern works perfectly; however, I could not get any header in the output file. The output I get is as follows:
certain pattern....
certain pattern....
It turns out that I made some mistakes in assigning variables. Here is the updated code:
awk -v vari="$(cat file2.txt|wc -l)" 'NR==FNR{if(NR>1 && NR<4)print $0;else a[NR]=$0;next}BEGIN{print vari}match($0,/([0-9]*)_(.*)/,c){print a[2*c[1]+2];print a[2*c[1]+3]}' file1.txt file2.txt > output.txt
It gives the desired output.
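The header was missing because vari was a shell variable that never reached awk: piping the assignment into awk leaves it unset inside the BEGIN block, which runs before any input rules. A minimal sketch of the fix (sample file contents assumed; note that `wc -l < file` avoids the file name that `wc -l file` appends to the count):

```shell
# Pass the shell value into awk with -v; it is then visible in BEGIN.
printf 'AGHLKMT\nMFBGSJY\nGSBDGLM\n' > file2.txt
awk -v vari="$(wc -l < file2.txt)" 'BEGIN{print vari}' /dev/null
# prints: 3
```

The second attempt's `$(cat file2.txt|wc -l)` is equivalent; the redirection form just saves a process.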
I have a log file that includes lines with the pattern as below. I want to extract the two strings within the quotations and write them to another file, each one in a separate column. (Not all lines have this pattern, but these specific lines come sequentially.)
Input
(multiple lines of header)
Of these, 0 are new, while 1723332 are present in the base dataset.
Warning: Variants 'Variant47911' and 'Variant47910' have the same position.
Warning: Variants 'exm2254099' and 'exm12471' have the same position.
Warning: Variants 'newrs140234726' and 'exm15862' have the same position.
Desired output:
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
This retrieves the lines, but I do not know how to specify the strings that need to be printed:
awk "/Warning: Variants '.*'/" Input
Using the single quote as a field delimiter should get you most of the way there; then you need a way to uniquely identify the lines you want to match. The following works for the sample you gave, but it might have to be tweaked depending on the lines from the file that we're not seeing.
$ awk -v q="'" 'BEGIN {FS=q; OFS="\t"} /Warning: Variants/ && NF==5 {print $2, $4}' file
Variant47911 Variant47910
exm2254099 exm12471
newrs140234726 exm15862
This might work for you (GNU sed):
sed -En "/Variant/{s/[^']*'([^']*)'[^']*/\1\t/g;T;s/.$//p}" file
For all lines that contain Variant, remove everything except the text between single quotes and tab-separate the results.
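A quick way to check the sed command is to feed it one of the sample lines (GNU sed is assumed, as stated: the T command and \t in the replacement are GNU extensions):

```shell
# Each quoted name is captured and replaced by "name<TAB>"; T skips
# lines where no substitution happened; the final s/.$// drops the
# trailing tab before printing.
printf "Warning: Variants 'exm2254099' and 'exm12471' have the same position.\n" |
sed -En "/Variant/{s/[^']*'([^']*)'[^']*/\1\t/g;T;s/.$//p}"
# prints: exm2254099<TAB>exm12471
```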
I am trying to match a string from a file and only print the first line that matches the string. I am able to get the result using grep, but is there a way I can achieve the same output using awk?
# cat file
/dev/sdac
/dev/cciss/c0d0
/dev/cciss/c0d0p1
/dev/cciss/c0d0p2
/dev/cciss/c0d0p1
# grep -wm1 c0d0p1 file
/dev/cciss/c0d0p1
Could you please try the following?
awk '/c0d0p1/{print;exit}' Input_file
Explanation: search for the string on each line; when a match is found, print the line and exit immediately, since there is no need to read the rest of the file. Exiting early also makes the program faster.
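One caveat: grep -w matches whole words only, while a plain /c0d0p1/ regex in awk would also match longer names such as c0d0p12. If that matters, the -w behavior can be reproduced by anchoring the string with word boundaries (a sketch with hypothetical input; POSIX character classes keep it portable):

```shell
# The first line contains c0d0p1 as a substring of c0d0p12, so it must
# not match; only the exact word c0d0p1 is printed.
printf '/dev/cciss/c0d0p12\n/dev/cciss/c0d0p1\n' |
awk '/(^|[^[:alnum:]_])c0d0p1([^[:alnum:]_]|$)/{print; exit}'
# prints: /dev/cciss/c0d0p1
```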
Given a folder with multiple .csv files, I want to return the Nth line from each file and write to a new file.
For a single file, I use
awk 'NR==5' file.csv
For multiple files, I figured
ls *.csv | xargs awk 'NR==5'
...however that only returns the 5th line from the first file in the list.
Thanks!
Could you please try the following and let me know if it helps (GNU awk is needed for nextfile, I believe):
awk 'FNR==5{print;nextfile}' *.csv
If you need the output in a single file, append > output_file to the end of the above command.
Explanation:
FNR==5: checks whether the line number is 5 in the current input file; if so, the action after it runs.
{print}: print is awk's built-in statement for printing the current line, so here it prints only the 5th line.
nextfile: as the name suggests, nextfile skips the remaining lines of the current input file and moves on to the next one (*.csv at the end passes all csv files to awk one by one). This saves time, since we do NOT want to read the entire file; we needed only the 5th line and we got it.
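If your awk lacks nextfile (it is a long-standing GNU awk extension that not every awk supports), this portable variant produces the same output, at the cost of reading each file to the end. Sample files are created here purely for illustration:

```shell
# FNR resets to 1 at the start of each input file, so FNR==5 fires
# once per file; without nextfile, awk simply keeps reading.
printf '1\n2\n3\n4\nfive-a\n6\n' > a.csv
printf '1\n2\n3\n4\nfive-b\n' > b.csv
awk 'FNR==5' a.csv b.csv
# prints:
# five-a
# five-b
```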
I am still having trouble figuring out how to preserve the contents of a given file with the following code, which attempts to rename the file based on a specific regex match within it (within a given file there will always be one SMILE followed by 12 digits, e.g. SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code is naming the file correctly but is simply printing out everything after the match instead of the entire contents of the file.
The files to be processed don't currently have an extension (they need one for the next step) because I used csplit to parse the content out of a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        /SMILE[0-9]/ && !out {          # first match only: open the output file
            out = f ".txt"
            for (i = 1; i < NR; ++i)    # print the file so far
                print lines[i] > out
        }
        out   { print > out }           # match has been seen: print
        ! out { lines[NR] = $0 }        # no match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps a different approach: just do each step separately, i.e. in pseudocode:
for all files with some given text
extract text
rename file
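That pseudocode can be sketched with standard tools. The FILENAMEX_* pattern and the SMILE id format are taken from the question; renaming to id.txt is an assumption (the question is ambiguous about the target name). mv renames the file in place, so its contents are untouched:

```shell
# Sample input file, created only for illustration.
printf 'some content\nSMILE000123456789\nmore content\n' > FILENAMEX_01

for f in FILENAMEX_*; do
  id=$(grep -oE 'SMILE[0-9]{12}' "$f" | head -n 1)  # extract text
  [ -n "$id" ] && mv -- "$f" "$id.txt"              # rename file
done
# FILENAMEX_01 is now SMILE000123456789.txt, contents intact
```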
I have a file which contains the output below. I want only the lines which contain the actual vm_id number.
I want to match the pattern 'vm_id' and print the 2nd line after it plus all following lines until 'rows' is reached.
FILE BEGIN:
vm_id
--------------------------------------
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
(6 rows)
datacenter=
FILE END:
So the resulting output would be:
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
Also, the number of VM IDs will vary; this example has 6, while others could have 3 or 300.
I have tried the following, but they only output the single specified line:
awk 'c&&!--c;/vm_id/{c=2}'
and
awk 'c&&!--c;/vm_id/{c=2+1}'
$ awk '/rows/{f=0} f&&(++c>2); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
If you wanted that first line of hex printed too, then just change the number that c is compared to from 2 to 1 (or 3 or 127 or however many lines you want to skip after hitting the vm_id line):
$ awk '/rows/{f=0} f&&(++c>1); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
What about this:
awk '/vm_id/{p=1;getline;next}/\([0-9]+ rows/{p=0}p'
I'm setting the p flag on vm_id (getline then consumes the dashed separator line) and resetting it on the ([0-9]+ rows) line. Note that this skips only one line after vm_id, so the duplicated first ID line is still printed.
Also, sed comes to mind; the command follows basically the same logic as the awk command above:
sed -n '/vm_id/{n;:a;n;/([0-9]* rows)/!{p;ba}}'
Another thing: if it is safe to say that the only GUIDs in your input file are the vm ids, grep might be the tool of choice:
grep -Eo '([0-9a-f]+-){4}([0-9a-f]+)'
It's not 100% bulletproof in this form, but it should be good enough for most use cases.
Bulletproof would be:
grep -Eoi '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'
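A quick sanity check of the stricter pattern against a few of the sample lines: the 8-4-4-4-12 hex layout matches the vm ids, while lines like "(6 rows)" or "datacenter=" produce nothing:

```shell
# -o prints only the matched GUIDs, one per line; -i tolerates
# upper-case hex digits.
printf 'bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4\n(6 rows)\ndatacenter=\n' |
grep -Eoi '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'
# prints: bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
```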