Multiple input files to `awk`? - awk

I have a bunch of files named dataX.dat, where X is a number that ranges over an interval. I was trying to get the seventh line of each of them and write those lines to a new file using awk, but it only picks up the latest one (the one with the highest X):
awk 'NR==7' data*.dat
only displays the seventh line of data128.dat, 128 being the last of them. Why does it do so, and how should I write it so that it takes all the files?
Thank you in advance.
EDIT: The system orders files alphabetically and 128 happens to be the first one (no other X number starts with 1), so it's actually taking only the first file.

NR counts records across all input files; FNR restarts from 1 in each file:
awk 'FNR==7' data*.dat
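If it helps to see the difference, something like this prints all three per input line (the file names just follow your dataX.dat pattern):
awk '{ print FILENAME, NR, FNR }' data1.dat data2.dat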

Try something like:
for file in data*.dat
do
    awk ... "$file"
done
If you need to run awk on multiple files recursively, use the find command instead, like find . -type f -exec awk ... {} \;
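For example, applied to the seventh-line case above, a recursive version might look like this (the -name pattern is just a guess at your layout):
find . -type f -name 'data*.dat' -exec awk 'FNR==7' {} +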

Related

Returning Nth line from multiple files

Given a folder with multiple .csv files, I want to return the Nth line from each file and write to a new file.
For a single file, I use
awk 'NR==5' file.csv
For multiple files, I figured
ls *.csv | xargs awk 'NR==5'
...however that only returns the 5th line from the first file in the list.
Thanks!
Could you please try the following and let me know if this helps you (GNU awk should work, I believe):
awk 'FNR==5{print;nextfile}' *.csv
If you need to collect the output into a single file, append > output_file to the end of the above command.
Explanation:
FNR==5: checks whether the current line is the 5th line of the current input file; if so, the actions that follow are run.
{print;}: print is awk's built-in statement for printing the current line, so only the 5th line gets printed.
nextfile: skips all remaining lines of the current input file and moves on to the next one (*.csv at the end passes all the csv files to awk one by one). This saves time, since we do not want to read each entire input file; we only needed the 5th line and we already have it.
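For instance, with the output collected into one file, the whole thing might look like this (output_file is just a placeholder name):
awk 'FNR==5{print;nextfile}' *.csv > output_file
# if your awk lacks nextfile, plain 'FNR==5' works too; it just reads each file in full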

Renaming files based on internal text match - keep all content of file

Still having trouble figuring out how to preserve the contents of a given file using the following code that is attempting to rename the file based on a specific regex match within said file (i.e. within a given file there will always be one SMILE followed by 12 digits, e.g., SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code is naming the file correctly but is simply printing out everything after the match instead of the entire contents of the file.
The files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse the content out of a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        /SMILE[0-9]/ {
            out = f ".txt"
            for (i = 1; i < NR; ++i)   # Print the lines stored so far
                print lines[i] > out
        }
        out   { print > out }          # Match has been seen: print
        ! out { lines[NR] = $0 }       # No match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps take a different approach and just do each step separately; i.e., in pseudocode (a concrete sketch follows below):
for all files with some given text
extract text
rename file
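A rough shell sketch of that pseudocode, assuming (as stated above) that each file contains exactly one SMILE followed by 12 digits:
for f in FILENAMEX_*; do
    id=$(grep -oE 'SMILE[0-9]{12}' "$f" | head -n 1)   # extract text
    [ -n "$id" ] && cp "$f" "${id}.txt"                # rename (here: copy) the file
done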

How to print the 'nth + x' lines after a match is found?

I have a file which contains the output below. I want only the lines which contain the actual vm_id number.
I want to match pattern 'vm_id' and print 2nd line + all other lines until 'rows' is reached.
FILE BEGIN:
vm_id
--------------------------------------
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
(6 rows)
datacenter=
FILE END:
So the resulting output would be:
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
Also, the number of VM Id's will vary, this example has 6 while others could have 3 or 300.
I have tried the following, but they only output the single specified line:
awk 'c&&!--c;/vm_id/{c=2}'
and
awk 'c&&!--c;/vm_id/{c=2+1}'
$ awk '/rows/{f=0} f&&(++c>2); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
If you wanted that first line of hex(?) printed too, then just change the starting number that c is compared against from 2 to 1 (or 3 or 127 or however many lines you want to skip after hitting the vm_id line):
$ awk '/rows/{f=0} f&&(++c>1); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
What about this:
awk '/vm_id/{p=1;getline;next}/\([0-9]+ rows/{p=0}p'
I'm setting the p flag on vm_id and resetting it on ([0-9]+ rows).
sed also comes to mind; the command follows basically the same logic as the awk command above:
sed -n '/vm_id/{n;:a;n;/([0-9]* rows)/!{p;ba}}'
Another thing: if it is safe to assume that the only GUIDs in your input file are the vm ids, grep might be the tool of choice:
grep -Eo '([0-9a-f]+-){4}([0-9a-f]+)'
It's not 100% bulletproof in this form, but it should be good enough for most use cases.
Bulletproof would be:
grep -Eoi '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'

Can I speed up AWK program using NR function

I am using awk to pull data out of a file that is 30M+ records. I know within a few thousand records where the records I want are. I am curious whether I can cut down on the time it takes awk to find the records by giving it a starting point, i.e. by setting NR. For example, my records are more than 25 million lines in, so I could use the following:
awk 'BEGIN{NR=25000000}{rest of my script}' in
Would this make awk skip straight to the 25 millionth record and save the time of scanning every record before it?
For a better example: I am using this awk in a loop in sh. I need the normal output of the awk script, but I would also like it to pass along NR when it finishes, so the next iteration can pick up from there when the loop comes back to this script.
awk -v n=$line -v r=$record 'BEGIN{a=1}$4==n{print $10;a=2}($4!=n&&a==2){(pass NR out to $record);exit}' in
Nope. Let's try it:
$ cat -n file
1 one
2 two
3 three
4 four
$ awk 'BEGIN {NR=2} {print NR, $0}' file
3 one
4 two
5 three
6 four
Are your records fixed length, or do you know the average line length? If yes, then you can use a language that allows you to open a file and seek to a position. Otherwise you have to read all those lines:
awk -v start=25000000 'NR < start {next} {your program here}' file
To maintain your position between runs of the script, I'd use a language like perl: at the end of the run use tell() to output the current position, say to a file; then at the start of the next run, use seek() to pick up where you left off. Add a check that the starting position is less than the current file size, in case the file was truncated.
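A very rough shell approximation of that idea (pos.txt and in are made-up names, and it assumes each run reads the remainder of the file rather than exiting early):
pos=$(cat pos.txt 2>/dev/null || echo 0)
size=$(( $(wc -c < in) ))                  # current file size in bytes
[ "$pos" -gt "$size" ] && pos=0            # file was truncated: start over
tail -c +"$(( pos + 1 ))" in | awk '{ your program here }'
echo "$size" > pos.txt                     # the next run resumes from here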
One way (using sed), if you know the line numbers:
for n in 3 5 8 9 ....
do
    sed -n "${n}p" file | awk command
done
or
sed -n "25000,30000p" file |awk command
Records generally have no fixed size, so awk has no choice but to scan the first part of the file even just to skip it.
Should you want to skip the first part of the input file and you (roughly) know the size to ignore, you can use dd to trim the start of the input; e.g. here, assuming a record is 80 bytes wide, skipping 80 blocks of 25 MB drops the first 25 million records:
dd if=inputfile bs=25MB skip=80 | awk ...
Finally, you can keep awk from scanning the last records by exiting from the awk script once you have passed the end of the interesting zone.
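Putting those pieces together, a sketch for the question above might look like this (the 25010000 upper bound is made up, and the $4/$10 logic is simplified from the question):
awk -v n="$line" '
    NR < 25000000 { next }     # skip (but still read) the uninteresting start
    NR > 25010000 { exit }     # stop once past the interesting zone
    $4 == n { print $10 }      # the actual work, as in the question
' in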

very long line causing problems

I have a problem with a file containing ~80,000 lines. It is a large file of 23Gb. I have managed to chunk up similar files of that size with the following command:
awk '{fn = NR % 24; print > ("file1_" fn)}' file1
However, this command stalls on this one problem file. The problem file does have a very large line of 3 billion characters (longest lines in other files are less than 1 billion) and I am guessing this is the problem.
I would like to get rid of this long line and proceed, but this is proving difficult. I thought simply using the following would work:
awk 'length < 1000000000' file1 > file2
However, this is also still running after 3.5 hours. Is there a fast way of going through the file so that, the moment the character count for a line exceeds e.g. 1 billion, it stops counting and moves on to the next line?
Maybe you could try combining the two awk lines into one command; it could be faster because it processes your monster file only once. But you have to test.
awk '{fn = NR % 24; if(length< 1000000000) print > ("file1_" fn)}' file1
Try using sed to delete lines longer than a certain number of characters
# delete lines longer than 65 characters
sed '/^.\{65\}/d' file
You can also use a 2-step approach:
# use sed to output the line numbers containing lines
# longer than a certain number of characters
sed -n '/^.\{65\}/=' file
and then use that list to build a skip-list in awk, i.e. if NR equals any of those numbers, skip that line.
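A sketch of that two-step approach, reusing the 65-character threshold from the example above (skip.txt is just an illustrative name):
sed -n '/^.\{65\}/=' file > skip.txt                 # line numbers of the long lines
awk 'NR==FNR { skip[$1]; next }                      # first file: build the skip-list
     !(FNR in skip)                                  # second file: print the other lines
' skip.txt file
# (the NR==FNR trick assumes skip.txt is not empty)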