Run awk continuously on a file being decoded, and print matching patterns from the decoded output

I have a command which decodes binary logs to ASCII format.
From the ASCII output file, I need to match some patterns using awk and print them.
How can this be done?
What I have tried in a shell script is below, and it does not work:
command > file.txt | awk /pattern/ | sed/pattern/
I also need the command to continuously decode the file and keep printing patterns as the file is updated.
Thanks in advance

command to continuously decode file and keep printing patterns
The first question is exactly how "continuously" manifests itself. Most log files grow by being appended to (for our purposes here, by some unknown external process) and are periodically rotated. If you're going to continuously decode them, you're going to have to keep track of log rotation.
Can command continuously decode, or do you intend to re-run command periodically, picking up where you left off? If the latter, you might instead try some variation of:
cat log | command | awk
If that can't be done, you'll have to record where each iteration terminates, something like:
touch pos
while [ -f pos ]
do
    command | awk -v posfile=pos -f script.awk >> output || rm pos
done
where script.awk skips input until NR reaches the line number recorded in the pos file. It then processes lines until EOF, and overwrites pos with its final NR. On error, it calls exit 1; the pos file is then removed and the loop terminates.
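A minimal sketch of what that script.awk might look like (the variable name posfile and the placeholder pattern are mine, and error handling beyond the exit convention is omitted):

BEGIN { getline start < posfile }   # last recorded position; an empty pos file yields 0
NR <= start { next }                # skip lines already processed by a previous run
/pattern/ { print }                 # your matching logic goes here; call exit 1 on error
END { print NR > posfile }          # record where this run stopped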
I recommend you ignore sed, and put all the pattern matching logic in one awk script. It will be easier to understand and cheaper to execute.

Related

How to flush output when using inplace editing with awk?

I want to use awk to edit a column of a large file in place. If, for any reason, the process breaks or stops, I don't want to lose the work already done. I've tried adding fflush, but it seems it does not work with inplace.
In order to simulate the desired result, here is a test file with 3 columns. The last column is all zeros.
paste -d '\t' <(seq 1 10) <(seq 11 20) |
awk 'BEGIN {FS="\t"; OFS=FS} {$(NF+1)=0; print}' > testfile
Then I want to replace the values in the last column. In this simple example, I'm just replacing them with the sum of the first and second columns. I'm adding a sleep so it's possible to abort the script in the middle and see the result.
awk -i inplace 'BEGIN {FS="\t"; OFS=FS} $3==0{$3=$1+$2; print; fflush(); system("sleep 1")}' testfile
If you run the script and abort it (Ctrl+C, or suspend it with Ctrl+Z) before it ends, the test file is unchanged.
Is it possible to achieve the desired result (get the partial result when the script breaks or stops)? How should I do it?
"In-place" editing is not really. A temporary file holds the output, and replaces the input at the end of the script.
Actual in-place editing would be slow: unless the output is the same length as the input, the file size needs to change, and awk would have to re-write the entire file (everything after the current line, at least) on every buffer flush. Note this caveat from the documentation:
If the program dies prematurely … a temporary file may be left behind.
You could script up some recovery code to merge that temporary file with your input after an abort.
Or you could adjust your script to modify only one line per run (simply printing every subsequent line unmodified), and re-run it until there are no changes left to make. This forces awk to rewrite the file on every change. It will be slow, but there just isn't any fast way to remove data from the middle of a file. A sketch follows.
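A rough sketch of that one-change-per-run idea, assuming the same three-column testfile as above (the done flag and the exit-status convention are mine, not from the question):

while true; do
    gawk -i inplace 'BEGIN { FS=OFS="\t" }
        !done && $3==0 { $3 = $1 + $2; done = 1 }  # change only the first pending line
        { print }                                  # pass everything else through unmodified
        END { exit !done }                         # exit 1 once nothing was left to change
    ' testfile || break                            # stop looping after an unchanged pass
done

Each pass rewrites the file completely, so an abort costs you at most the single change in flight.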

Renaming files based on internal text match - keep all content of file

I'm still having trouble figuring out how to preserve the contents of a given file with the following code, which attempts to rename the file based on a specific regex match within it (within a given file there will always be one SMILE followed by 12 digits, e.g. SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code names the file correctly but simply prints out everything after the match instead of the entire contents of the file.
The files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse the content out of a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        /SMILE[0-9]/ && !out {        # First match: name the output file
            out = f ".txt"
            for (i = 1; i < NR; ++i)  # Print file so far
                print lines[i] > out
        }
        out   { print > out }         # Match has been seen: print
        ! out { lines[NR] = $0 }      # No match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps take a different approach and just do each step separately. In pseudocode:
for all files containing the given text
    extract the text
    rename the file
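A sketch of that approach, assuming (my reading, not stated above) that the matched SMILE id itself should become the new file name:

for f in FILENAMEX_*; do
    id=$(grep -o 'SMILE[0-9]\{12\}' "$f" | head -n 1)  # extract the matched text
    [ -n "$id" ] && cp "$f" "$id.txt"                  # copy to a name with an extension
done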

Using a literal string for gawk

I think I'm already too close to the problem to solve it on my own, although I'm sure it's easy to solve.
I'm working on a shell script for my Raspberry Pi, which acts as a NAS and automatically collects data and distributes it over my other devices. I decided to include a delete option, since otherwise it would be a pain to delete a file: the Raspberry Pi would always copy it right back from the other devices. While the script runs, it creates a file del_tmp_$ip.txt, which lists the directories and files to be deleted from del_$ip.txt (not from del_tmp_$ip.txt itself).
It looks like this:
test/delete_me.txt
test/hello/hello.txt
pi.txt
I tried to delete the lines via awk, and this is how far I've got:
while read r; do
gawk -i inplace '!/^'$r'$/' del_$ip.txt
done <del_tmp_$ip.txt
If the line from del_tmp_$ip.txt tells gawk to delete pi.txt, it works without problems, but if the string includes a slash, like test/delete_me.txt, it doesn't work:
"unexpected newline or end of string"
and it points to the last slash. I can't escape the forward slash with a backslash manually, since I don't know whether, and how many, slashes there will be; that depends on the line of the file which contains the information to be deleted.
I hope you can help me!
Never allow a shell variable to expand to become part of the awk script text before awk evaluates it (which is what you're doing with '!/^'$r'$/'), and always quote your shell variables (so the correct shell syntax would have been '!/^'"$r"'$/', IF it hadn't been the wrong approach anyway). The correct syntax to write that command would have been:
awk -v r="$r" '$0 !~ "^"r"$"' file
but you said you want a string comparison, not a regexp match, so then it's simply:
awk -v r="$r" '$0 != r' file
(The difference matters: with r=pi.txt, the regexp version would also delete a line like piXtxt, since an unescaped . matches any character; the string comparison would not.)
and of course you don't need a shell loop at all. Instead of:
while read r; do
gawk -i inplace '!/^'$r'$/' del_$ip.txt
done <del_tmp_$ip.txt
you just need one awk command:
gawk -i inplace 'NR==FNR{skip[$0];print;next} !($0 in skip)' "del_tmp_$ip.txt" "del_$ip.txt"
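Expanded with comments, that one-liner reads:

gawk -i inplace '
    NR==FNR { skip[$0]; print; next }  # first file: remember each line to delete,
                                       # reprinting it since inplace rewrites this file too
    !($0 in skip)                      # second file: keep only lines not marked for deletion
' "del_tmp_$ip.txt" "del_$ip.txt"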

Processing in awk with multiple variables from previous processing?

I have a question about awk processing. I have the file below:
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to make a full path by joining each file name to the directory line above it, e.g. /home/shhh/abc.c.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I don't understand well how to interpret it step by step.
Could you teach me how to interpret it?
Result:
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0}: on lines starting with a /, store the whole line in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}: on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; at the end of a block {} if you are executing just one command. It is implicit.
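As a side note (my own variation, not part of the original answer): the doubled slashes in the output appear because some directory lines already end in /. Stripping any trailing slash from the stored path avoids them:

awk '/^\//{path=$0; sub(/\/$/, "", path); next} {printf "%s/%s\n", path, $0}' file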

Can I speed up an AWK program using NR

I am using awk to pull data out of a file that is 30M+ records. I know to within a few thousand records where the records I want are. I am curious whether I can cut down on the time it takes awk to find the records by giving it a starting point, setting NR. For example, my record is more than 25 million lines in, so I could use the following:
awk 'BEGIN{NR=25000000}{rest of my script}' in
Would this make awk skip straight to the 25-millionth record and save the time of scanning each record before that?
For a better example: I am using this awk script in a loop in sh. I need the normal output of the awk script, but I would also like it to pass along the NR when it finishes, to the next iteration when the loop comes back to this script again.
awk -v n=$line -v r=$record 'BEGIN{a=1}$4==n{print $10;a=2}($4!=n&&a==2){(pass NR out to $record);exit}' in
Nope. Let's try it:
$ cat -n file
1 one
2 two
3 three
4 four
$ awk 'BEGIN {NR=2} {print NR, $0}' file
3 one
4 two
5 three
6 four
Are your records fixed length, or do you know the average line length? If yes, then you can use a language that allows you to open a file and seek to a position. Otherwise you have to read all those lines:
awk -v start=25000000 'NR < start {next} {your program here}' file
To maintain your position between runs of the script, I'd use a language like perl: at the end of the run use tell() to output the current position, say to a file; then at the start of the next run, use seek() to pick up where you left off. Add a check that the starting position is less than the current file size, in case the file was truncated.
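The same idea can be sketched in plain shell using byte offsets (tail -c +N starts reading at byte N; the pos file name is mine, and the sketch assumes the log is not being appended to while a run is in progress):

pos=$(cat pos 2>/dev/null)                                # offset saved by the previous run
[ -n "$pos" ] || pos=0                                    # first run: start from the beginning
size=$(wc -c < file)
[ "$pos" -gt "$size" ] && pos=0                           # file shrank: truncated, start over
tail -c +"$((pos + 1))" file | awk '{your program here}'  # resume from the saved offset
echo "$size" > pos                                        # remember where this run stopped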
One way (using sed), if you know the line numbers:
for n in 3 5 8 9 ....
do
    sed -n "${n}p" file | awk command
done
or
sed -n "25000,30000p" file | awk command
Records generally have no fixed size, so there is no way for awk to avoid scanning the first part of the file, even just to skip it.
Should you want to skip the first part of the input file, and you (roughly) know the size to ignore, you can use dd to truncate the input. Here, assuming a record is 80 bytes wide, skipping 80 blocks of 25 MB drops the same 2 GB as 25 million 80-byte records:
dd if=inputfile bs=25MB skip=80 | awk ...
Finally, you can save awk from scanning the last records by exiting from the awk script once you are past the interesting zone.
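Putting the start and stop ideas together, a sketch (the start/stop values and the $10 are illustrative, borrowed from the examples above):

awk -v start=25000000 -v stop=25005000 '
    NR < start { next }   # cheap per line, but every line is still read
    NR > stop  { exit }   # stop reading once past the interesting zone
    { print $10 }         # your processing here
' file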