Log file looks something like below
Process Beginning - 2016-04-02-00.36.13
Putting Files To daADadD for
File will move to /sadafJJHFASJFFASJ/
Extract Files :-/ASFDSHAF_ABC_2016-04-02.csv
/ASFDSHAF_ABC.2016-04-02.csv /
ASFDSHAF_ABC.2016-04-02.csv /ASFDSHAF_ABC.2016-04-02.csv /
Process Ending - 2016-04-02-00.36.36
Process Beginning - 2016-04-02-10.01.20
Putting Files To daADadD for
File will move to /sadafJJHFASJFFASJ/
Extract Files :-/sdshsdhsh_cvb.2016-04-02.csv
/sdshsdhsh_cvb.2016-04-02.csv /sdshsdhsh_cvb.2016-04-02.csv
Process Ending - 2016-04-02-10.01.21
There is multiple entries of pattern / Process Beginning - 2016-04-02/ /Process Ending - 2016-04-02/
how do I find entry or block which has pattern /ABC_2016-04-02.csv/ in between
You can do it with sed (at least with GNU sed 4, sorry no AIX sed to test with):
# read complete block into hold space
/Process Beginning/,/Process Ending/ {
H
}
# "test" for csvfile in the block
/Process Ending/ {
# get hold space (i.e. complete block)
x
# if s did a "substition" (i.e. csvfile occured in the block): print
s/ASFDSHAF_ABC.2016-04-02.csv/&/p
# clear hold space for next block
s/.*//
h
}
Say your log file is grplog.log and the script is grplog.sed then run the script like this: sed -n -f grplog.sed grplog.log.
Explanation:
The script reads all lines in your processing block into the hold space in the first part.
The second part triggered on the Process Ending line uses the p flag of the s command: if a substitution was made, print the new pattern space. Due to the use of & old and new pattern space are the same.
Related
I need help with a simple dash script solution. A script reading the values of file: "Install.txt", (sample content):
TRUE 203
TRUE 301
TRUE 602
TRUE 603
The numbers will correspond with the same number (at the end of line) in file: "ExtraExt.sys", (sample content):
# Read $[EXTRA_DIR]/diaryThumbPlace #202
# Read $[CORE_DIR]/myBorderStyle #203
# Read $[EXTRA_DIR]/mMenu #301
# Read $[EXTRA_DIR]/dDecor #501
# Read $[EXTRA_DIR]/controlPg #601
# Read $[EXTRA_DIR]/DashToDock #602
# Read $[EXTRA_DIR]/deskSwitch #603
All lines are tagged (#). The script will untag the corresponding line that has the same number as in the file "install.txt". For example, TRUE 203 will untag the line ending with #203
# Read $[CORE_DIR]/myBorderStyle #203
to (without "#" before Read)
Read $[CORE_DIR]/myBorderStyle #203
I've searched for awk/sed solution, but this requires a loop to go through the numbers in Install.txt.
Any help is appreciated. Thank you.
you can try,
# store "203|301|602|603" in search variable
search=$(awk 'BEGIN{OFS=ORS=""}{if(NR>1){print "|"}print $2;}' Install.txt)
sed -r "s/^# (.*#($search))$/\1/g" ExtraExt.sys
you get,
# Read $[EXTRA_DIR]/diaryThumbPlace #202
Read $[CORE_DIR]/myBorderStyle #203
Read $[EXTRA_DIR]/mMenu #301
# Read $[EXTRA_DIR]/dDecor #501
# Read $[EXTRA_DIR]/controlPg #601
Read $[EXTRA_DIR]/DashToDock #602
Read $[EXTRA_DIR]/deskSwitch #603
or, -i option for edit in file
sed -i -r "s/^# (.*#($search))$/\1/g" ExtraExt.sys
One option using awk could be reading the first file Install.txt and store the values of the second field in arr
Reading the second file ExtraExt.sys you could get numbers of the last column using $NF and match that against the pattern ^#[0-9]+$
If there is a match, remove the # part from the match leaving only the number and check if the number is in arr.
If it is, print the current line without the leading #
awk '
FNR==NR{
arr[$2]
next
}
match ($NF, /^#[0-9]+$/) {
if (substr($NF, RSTART+1, RLENGTH) in arr) {
sub(/^#[[:space:]]+/, "", $0); print $0
}
}
' Install.txt ExtraExt.sys
Output
Read $[CORE_DIR]/myBorderStyle #203
Read $[EXTRA_DIR]/mMenu #301
Read $[EXTRA_DIR]/DashToDock #602
Read $[EXTRA_DIR]/deskSwitch #603
We are currently using sed to filter output of regression runs. Sometimes we have a filter that looks like this:
/copyright/,/end copyright/d
If that end copyright is ever missing, the rest of the file is deleted. I'm wondering if there's some way to generate an error for this? awk would also be okay to use. I don't really want to add code that reads the file line by line and issues an error if it hits EOF.
here's a string
copyright
2016 jan 15
end copyright
date 2016 jan 5 time 15:36
last one
I'd like to get an error if end copyright is missing. The real filter also would replace the date line with DATE, so it's more that just ripping out the copyright.
You can persuade sed to generate an error if you reach end of input (i.e. see address $) between your start and end, but it won't be a very helpful message:
/copyright/,/end copyright/{
$s//\1/ # here
d
}
This will error if end copyright is missing or on the last line, with an exit status of 1 and the helpful message:
sed: -e expression #1, char 0: invalid reference \1 on `s' command's RHS
If you're using this in a makefile, you might want to echo a helpful message first, or (better) to wrap this in something that catches the error and produces a more useful one.
I tested this with GNU sed; though if you are using GNU sed, you could more easily use its useful extension:
q [EXIT-CODE]
This command only accepts a single address.
Exit 'sed' without processing any more commands or input. Note
that the current pattern space is printed if auto-print is not
disabled with the -n options. The ability to return an exit code
from the 'sed' script is a GNU 'sed' extension.
Q [EXIT-CODE]
This command only accepts a single address.
This command is the same as 'q', but will not print the contents of
pattern space. Like 'q', it provides the ability to return an exit
code to the caller.
So you could simply write
/copyright/,/end copyright/{
$Q 42
d
}
Never use range expressions /start/,/end/ as they make trivial code very slightly briefer but require a complete rewrite or duplicate conditions when you have the tiniest requirements change. Always use a flag instead. Note that since sed doesn't support variables, it doesn't support flag variables, and so you shouldn't be using sed you should be using awk instead.
In this case your original code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0}' file
And your modified code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0} END{if (f) print "Missing end copyright"}' file
The above is obviously untested since you didn't provide any sample input/output we could test a potential solution against.
With sed you can build a loop:
sed -e '/copyright/{:a;/end copyright/d;N;ba;};' file
:a defines the label "a"
/copyright end/d deletes the pattern space, only when "end copyright" matches
N appends the next line to the pattern space
ba jumps to the label "a"
Note that d ends the loop.
In this way you can avoid to delete the text until the end.
If you don't want the text to be displayed at all and prefer an error message when a "copyright" block stays unclosed, you obviously need to wait the end of the file. You can do it with sed too storing all the lines in the buffer space until the end:
sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file
H appends the current line to the buffer space
g put the content of the buffer space to the pattern space
The file content is only displayed once the last line reached with ${g;p} otherwise when the closing "end copyright" is missing, the current line is changed in the error message with ${c\ERROR MESSAGE\n;} inside the loop.
This way you can test what returns sed before redirecting it to whatever you want.
Ok, so after spending 2 days, I am not able solve it and I am almost out of time now. It might be a very silly question, so please bear with me. My awk script does something like this:
BEGIN{ n=50; i=n; }
FNR==NR {
# Read file-1, which has just 1 column
ids[$1]=int(i++/n);
next
}
{
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
It works fine. But now I want to extend it to read 3 files. Let's say, instead of hard-coding the value of "n", I need to read a properties file and set value of "n" from that. I found this question and have tried something like this:
BEGIN{ n=0; i=0; }
FNR==NR {
# Block A
# Try to read file-0
next
}
{
# Block B
# Read file-1, which has just 1 column
next
}
{
# Block C
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
But it is not working. Block A is executed for file-0, I am able to read the property from properties files. But Block B is executed for both files file-1 and file-2. And Block C is never executed.
Can someone please help me solve this? I have never used awk before and the syntax is very confusing. Also, if someone can explain how awk reads input from different files, that will be very helpful.
Please let me know if I need to add more details to the question.
If you have gawk, just test ARGIND:
awk '
ARGIND == 1 { do file 1 stuff; next }
ARGIND == 2 { do file 2 stuff; next }
' file1 file2
If you don't have gawk, get it.
In other awks though you can just test for the file name:
awk '
FILENAME == ARGV[1] { do file 1 stuff; next }
FILENAME == ARGV[2] { do file 2 stuff; next }
' file1 file2
That only fails if you want to parse the same file twice, if that's the case you need to add a count of the number of times that file's been opened.
Update: The solution below works, as long as all input files are nonempty, but see #Ed Morton's answer for a simpler and more robust way of adding file-specific handling.
However, this answer still provides a hopefully helpful explanation of some awk basics and why the OP's approach didn't work.
Try the following (note that I've made the indices 1-based, as that's how awk does it):
awk '
# Increment the current-file index, if a new file is being processed.
FNR == 1 { ++fIndex }
# Process current line if from 1st file.
fIndex == 1 {
print "file 1: " FILENAME
next
}
# Process current line if from 2nd file.
fIndex == 2 {
print "file 2: " FILENAME
next
}
# Process current line (from all remaining files).
{
print "file " fIndex ": " FILENAME
}
' file-1 file-2 file-3
Pattern FNR==1 is true whenever a new input file is starting to get processed (FNR contains the input file-relative line number).
Every time a new file starts processing, fIndexis incremented and thus reflects the 1-based index of the current input file. Tip of the hat to #twalberg's helpful answer.
Note that an uninitialized awk variable used in a numeric context defaults to 0, so there's no need to initialize fIndex (unless you want a different start value).
Patterns such as fIndex == 1 can then be used to execute blocks for lines from a specific input file only (assuming the block ends in next).
The last block is then executed for all input files that don't have file-specific blocks (above).
As for why your approach didn't work:
Your 2nd and 3rd blocks are potentially executed unconditionally, for lines from all input files, because they are not preceded by a pattern (condition).
So your 2nd block is entered for lines from all subsequent input files, and its next statement then prevents the 3rd block from ever getting reached.
Potential misconceptions:
Perhaps you think that each block functions as a loop processing a single input file. This is NOT how awk works. Instead, the entire awk program is processed in a loop, with each iteration processing a single input line, starting with all lines from file 1, then from file 2, ...
An awk program can have any number of blocks (typically preceded by patterns), and whether they're executed for the current input line is solely governed by whether the pattern evaluates to true; if there is no pattern, the block is executed unconditionally (across input files). However, as you've already discovered, next inside a block can be used to skip subsequent blocks (pattern-block pairs).
Perhaps you need to consider adding some additional structure like this:
BEGIN { file_number=1 }
FNR==1 { ++file_number }
file_number==3 && /something_else/ { ...}
I am trying to understand how the following command works (from here):
<!-- language: lang-bash -->
pfiles /proc/* 2>&- |
nawk 'END {
if (f) print p
}
/^[0-9]/ {
if (f) print p, RS
p = $0
f = 0
}
/INET / {
sub(/.*INET/,"")
p = p ? p RS $0 : $0
f = 1
}'
This command works well (in SOLARIS 5.10) and shows all the ports opened by processes.
I understand that, pfiles /proc/* displays a bunch of output related to all processes by querying the /proc/ filesystem. From the man-page:
pfiles Report fstat(2) and fcntl(2) information
for all open files in each process. In
addition, a path to the file is reported
if the information is available from
/proc/pid/path. This is not necessarily
the same name used to open the file. See
proc(4) for more information.
The output from pfiles is then processed by nawk ('New Awk').
Questions
Could you please explain how NAWK is processing the output of pfiles in the following command? It would be most helpful to know how the parameters f, p and $0 mean.
In the first line, what does redirection of standard error to &- mean? Does it mean the standard error stream is being closed ?
I had to read that script once or twice to make sure I got it straight in
my head. It's a little confusing because we see the END at the beginning.
$0 is the entire line.
The line /^[0-9]/ matches the process id (specifically) and that block
then sets the sentinel variable f to 0.
The block starting with /INET / matches (and then strips, via the sub(..))
the open port number. The sentinel value f is set to 1 so that we know to
print differently when we hit the END. Each time we finish an output
collection (ie, the entire output from pfiles for a process), we hit the END
block and print the output.
BTW, the RS is the Record Separator.
Running the script on just one process might make it a little easier to get
the head around it.
Sorry, forgot to answer your other question re the redirection.
2>&-
in this context means "redirect stderr from the process to standard input",
so that nawk takes input from there rather than a file.
I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect it to write, and I wonder if I can process it much faster somehow.
Here is the awk script:
#!/bin/bash
awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1
EDIT:
For non-awkers, the script reads each line, gets the date information, modifies it to a format the utility date recognizes and calls it to represent the date as the number of seconds since 1970, finally returning it as a line of a .csv file, along with the IP.
Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
Returned output: 189.5.56.113,124237889
#OP, your script is slow mainly due to the excessive call of system date command for every line in the file, and its a big file as well (in the GB). If you have gawk, use its internal mktime() command to do the date to epoch seconds conversion
awk 'BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
date[d[o]]=sprintf("%02d",o)
}
}
{
gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
n=split($4, DATE,"/")
day=DATE[1]
mth=DATE[2]
year=DATE[3]
hr=DATE[4]
min=DATE[5]
sec=DATE[6]
MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
print $1,MKTIME
}' file
output
$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh
189.5.56.113 1264110895
If you really really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.
You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.
If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
This little Python script handles a ~400MB worth of copies of your example line in about 3 minutes on my machine producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):
import time
src = open('x.log', 'r')
dest = open('x.csv', 'w')
for line in src:
ip = line[:line.index(' ')]
date = line[line.index('[') + 1:line.index(']') - 6]
t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
dest.write(ip)
dest.write(',')
dest.write(str(int(t)))
dest.write('\n')
src.close()
dest.close()
A minor problem is that it doesn't handle timezones (strptime() problem), but you could either hardcode that or add a little extra to take care of it.
But to be honest, something as simple as that should be just as easy to rewrite in C.
gawk '{
dt=substr($4,2,11);
gsub(/\//," ",dt);
"date -d \""dt"\" +%s"|getline ts;
print $1, ts
}' yourfile