Can awk find a field containing a string from a list? - awk

I have a file containing different fields.
I have another file containing a list of different words.
I need to use the awk command to extract from my 1st file all records where a specific field contains one or more words from my 2nd file.
For example 1st file:
Feb 15 12:05:10 lcif adm.slm: root [23416]: cd /tmp
Feb 15 12:05:24 lcif adm.slm: root [23416]: cat tst.sh
Feb 15 12:05:44 lcif adm.slm: root [23416]: date
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:05:58 lcif adm.pse: root [23419]: who
Feb 15 12:06:02 lcif adm.pse: root [23419]: uptime
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
Feb 15 12:06:58 lcif adm.pse: root [23419]: ls -lrt
For example 2nd file:
rm
reboot
shutdown
Then the awk command should return:
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
Tried desperately with arrays/maps.
Also tried this:
awk -F ": " '{if ($3 ~ "^rm" || $3 ~ "^reboot" || $3 ~ "^shutdown") print}'
But the list of words I'm looking for is getting bigger and bigger.
I'd rather use a file list.
Appreciate any help.
Thank you !
Serge

You might do it like this:
awk -F ': ' '
FNR == NR { commands[$0]; next }                      # 1st file: store each word as an array key
split($3, cmdline, " ") && (cmdline[1] in commands)   # 2nd file: print if the command word is a key
' file2 file1
output:
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
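FNR == NR is only true while awk reads the first file argument (file2), so the word list is loaded into array keys before file1 is scanned. If you ever need to match a listed word anywhere in the command field, not just as its first word, here is a variant sketch (same file names assumed):
awk -F ': ' '
FNR == NR { words[$0]; next }            # load the word list first
{
    n = split($3, w, " ")                # check every word of the command field
    for (i = 1; i <= n; i++)
        if (w[i] in words) { print; next }
}
' file2 file1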

Don't waste time with arrays; just dynamically generate a hard-coded regex on the fly:
gawk 'BEGIN { FS = "[]]: " } '"$(
    awk '{ list = list (NR > 1 ? "|" : "") $0 }
         END { printf "$NF ~ \"^(%s)( |$)\"", list }' file2
)" file1
output:
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
# this part being dynamically generated
awk 'BEGIN { FS = "[]]: " } $NF ~ "^(rm|reboot|shutdown)( |$)" '
Then, instead of looping through an array, it's a high-speed single pass through file1, without having to store any rows in between.
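One caveat (my note, not from the original answer): if a word in file2 contains ERE metacharacters, the generated pattern can misfire, whereas the array-lookup answer above compares literally. A sketch that escapes the words while joining them:
gawk 'BEGIN { FS = "[]]: " } '"$(
    awk '{ gsub(/[][\\.^$(){}|*+?]/, "\\\\&")        # escape ERE metacharacters
           list = list (NR > 1 ? "|" : "") $0 }
         END { printf "$NF ~ \"^(%s)( |$)\"", list }' file2
)" file1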

Related

Text processing to create a list of unique ID's

I have a file with IDs and Names applicable to them as below:
1234|abc|cde|fgh
5678|ijk|abc|lmn
9101|cde|fgh|klm
1213|klm|abc|cde
I need a file with only unique Names as a list.
Output File:
abc|sysdate
cde|sysdate
fgh|sysdate
ijk|sysdate
lmn|sysdate
klm|sysdate
Where sysdate is the current timestamp of processing.
Requesting you to help on this. Also requesting an explanation for the code suggested.
What this code does:
awk -F\| '{ for(i=2; i <= NF; i++) a[$i] = a[$i] FS $1 }' input.csv
-F sets the delimiter to |. awk processes your file line by line, creates a map named 'a', reads from column 2 to the end of the line, and fills the map using the current cell as the key, appending the field separator plus the value of the first column to the value.
When awk ends processing the first line, 'a' is:
a['abc'] = '|1234'
a['cde'] = '|1234'
a['fgh'] = '|1234'
This script does not print anything.
What you want is something like this:
awk -F'|' '{for(i=2;i<=NF;i++){if(seen[$i] != 1){print $i, strftime(); seen[$i]=1}}}' OFS='|' input.csv
-F sets the input delimiter to |, OFS does the same for the output delimiter.
For each value from column 2 to the end of the line, we check if it has already been seen before. If not, we print the value and the time of processing. Then we register the value in a map so we avoid processing it again.
Output:
abc|Thu Oct 18 10:40:13 CEST 2018
cde|Thu Oct 18 10:40:13 CEST 2018
fgh|Thu Oct 18 10:40:13 CEST 2018
ijk|Thu Oct 18 10:40:13 CEST 2018
lmn|Thu Oct 18 10:40:13 CEST 2018
klm|Thu Oct 18 10:40:13 CEST 2018
You can change the format of sysdate; see the documentation of gawk's strftime.
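Note that strftime() is a gawk extension, so it is not available in every awk. A sketch that passes the timestamp in from the shell instead:
awk -F'|' -v ts="$(date)" '{for(i=2;i<=NF;i++) if (!seen[$i]++) print $i, ts}' OFS='|' input.csv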

awk, how to pass in a list of files based on a condition?

I was wondering if there is any way to pass in a file list to awk. The file list has thousands of files, and I am using grep -l to find the subset of files I am interested in passing to awk.
E.g.,
grep -l id file-*.csv
file-1.csv
file-2.csv
$ cat file-1.csv
id,col_1,col_2
1,abc,100
2,def,200
$ cat file-2.csv
id,col_1,col_2
3,xyz,1000
4,hij,2000
If I do
$ awk -F, '{print $2,$3}' file-1.csv file-2.csv | grep -v col
abc 100
def 200
xyz 1000
hij 2000
it works how I would want, but there are too many files to list manually like this:
file-1.csv file-2.csv
I was wondering if there is a way to pass in the result of the...
grep -l id file-*.csv
Edit:
grep -l id
is the condition. Each file has a header but only some have 'id' in the header so I can't use the file-*.csv wildcard in the awk statement.
If I did an ls on file-*.csv I would end up with more than just file-1.csv and file-2.csv.
e.g.,
$ cat file-3.csv
name,col,num
a1,hij,3000
b2,lmn,50000
$ ls -l file-*.csv
-rw-r--r-- 1 tp staff 35 20 Sep 18:50 file-1.csv
-rw-r--r-- 1 tp staff 37 20 Sep 18:51 file-2.csv
-rw-r--r-- 1 tp staff 38 20 Sep 18:52 file-3.csv
$ grep -l id file-*.csv
file-1.csv
file-2.csv
Based on the output you show under "If I do", it sounds like this might be what you're trying to do:
awk -F, 'FNR>1{print $2,$3}' file-*.csv
but your question isn't clear so it's a guess.
Given your updated question, all you need with GNU awk for nextfile is:
awk -F, 'FNR==1{if ($1 != "id") nextfile; next} {print $2,$3}' file-*.csv
and with any awk (but less efficiently than with GNU awk):
awk -F, 'FNR==1{f=($1=="id"?1:0); next} f{print $2,$3}' file-*.csv
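Against the three sample files above, either version skips file-3.csv and each header line, and should print:
abc 100
def 200
xyz 1000
hij 2000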
awk -F, 'FNR > 1{print $2,$3}' $(grep -l id file-*.csv)
(This will not work if any of your filenames contain whitespace.)
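If your grep and xargs are GNU versions, a whitespace-safe variant of the same idea (my sketch; --null and -0 are GNU extensions) passes the file list NUL-delimited:
grep -l --null id file-*.csv | xargs -0 awk -F, 'FNR > 1 {print $2, $3}'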
To find the files with an id field and merge/output their contents, excluding the header lines with the id field:
grep trick:
grep --no-group-separator -hA 1000000 'id' file-*.csv | grep -v 'id'
-h - suppress prefixing each output line with its file name
-A num - print num lines of trailing context after the matching line(s); 1000000 is taken as a maximum number of lines which, presumably, will not be exceeded (you may adjust it if you really have files with more than 1000000 lines). Note that the final grep -v 'id' also drops any data line that happens to contain 'id', not only the header lines.
The output (for 2 sample files from the question):
1,abc,100
2,def,200
3,xyz,1000
4,hij,2000

How to match "field 5 through the end of the line" (for example, in awk)

I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
- 102 /My Directory/some file.txt
In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, the value of $8 ('b') matches at index 15 instead of at index 49, where field 8 actually begins (as it does for most of the other filenames).
My issue, then, is how to specify "everything from field X to the end of the string".
I have re-written this question in order to make this clear.
Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; its option syntax varies between systems (GNU and BSD stat, for example, differ).
The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character in field 8 but this is not the
case in the example above.
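A common alternative (my addition, not part of the original answer) is to strip the first N-1 fields with a regex rather than using index(); a sketch for "from field 5 through the end", assuming an awk with ERE interval support (POSIX awks and gawk 4+):
awk '{ rest = $0                                   # copy of the whole line
       sub(/^[ \t]*([^ \t]+[ \t]+){4}/, "", rest)  # strip fields 1-4
       printf "%s %15s %s\n", $1, $4, rest }'
This keeps the spacing inside the pathname intact, since rest is taken from the unsplit line.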
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find "$@" -ls |
sed -e 's/^ *//' -e 's/ */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f            4639 /Users/foobar/uu/a
f            3024 /Users/foobar/uu/calendar
f            2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
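For comparison, a sketch of the same extraction in awk alone (my addition; it assumes the usual find -ls column layout, with permissions in field 3, size in field 7, and the pathname starting at field 11, which can also vary between implementations):
find "$@" -ls | awk '{
    type = substr($3, 1, 1); if (type == "-") type = "f"    # file-type letter
    rest = $0; sub(/^[ \t]*([^ \t]+[ \t]+){10}/, "", rest)  # pathname = fields 11 on
    printf "%s %15s %s\n", type, $7, rest
}'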
[1] http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
[2] How to print third column to last column?
[3] Print Field 'N' to End of Line
[4] http://mywiki.wooledge.org/ParsingLs
Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]+/,sprintf("%s %10d",$1,$2))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
It won't work with file names that contain newlines.

Split a field and then remove duplicates

Sample file:
# cat test1
-rw-r--r-- 1 root root 19460 Feb 10 03:56 catalina.2015-02-10.log
-rw-r--r-- 1 root root 206868 May 4 15:05 catalina.2015-05-04.log
-rw-r--r-- 1 root root 922121 Jun 24 09:26 catalina.out
-rw-r--r-- 1 root root 0 Feb 10 02:27 host-manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 host-manager.2015-05-04.log
-rw-r--r-- 1 root root 2025 Feb 10 03:56 localhost.2015-02-10.log
-rw-r--r-- 1 root root 8323 May 4 15:05 localhost.2015-05-04.log
-rw-r--r-- 1 root root 873 Feb 10 03:56 localhost_access_log.2015-02-10.txt
-rw-r--r-- 1 root root 458600 May 4 23:59 localhost_access_log.2015-05-04.txt
-rw-r--r-- 1 root root 0 Feb 10 02:27 manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 manager.2015-05-04.log
Expected Output:
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 1 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | awk '!z[$i]++'
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 2 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | uniq
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 3 (Fails):
# awk '{split($9,a,"."); a[1]++} {for (i in a){print a[i]}}' test1
1
2015-02-10
log
1
2015-05-04
log
1
out
.
.
.
Question:
I wanted to split the 9th field and then display only the unique entries. However, I wanted to do this in a single awk one-liner. Seeking help on my 3rd attempt.
Another, more idiomatic awk one-liner:
awk '!a[ $0 = substr($NF,1,index($NF,".")-1) ]++' file
or, expressed more explicitly:
awk '{$0=substr($NF,1,index($NF,".")-1)} !a[$0]++' file
We use the well-known !a[$0]++ line de-duplication trick.
but first we change $0 to substr($NF,1,index($NF,".")-1):
the whole line becomes the substring of the last field $NF up to the first dot (.), via substr() and some help from index()
A benefit of this solution is that you don't need to wait until the whole file has been parsed. The split fields are de-duplicated and printed on-the-fly.
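For instance, run against the sample test1 above, it should print the names in order of first appearance:
# awk '!a[ $0 = substr($NF,1,index($NF,".")-1) ]++' test1
catalina
host-manager
localhost
localhost_access_log
manager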
You have to use the END block to print the results:
awk '{split($NF,a,"."); b[a[1]]} END{for (i in b){print i}}' file
Notes:
I am using $NF to catch the last field. This way, if you happen to have more or fewer fields than 9, it will still work (as long as there are no filenames with spaces, because parsing ls is evil).
We cannot loop directly through the a[] array, because it is the one containing the split data. For this we need to create another array, for example b[]. That's why we say b[a[1]]. There is no need for b[a[1]]++ unless you want to keep track of how many times each item appears.
The END block is executed after processing the whole file. Otherwise you would go through the results once per record (that is, once per line), and duplicates would appear.

convert month from Aaa to xx in little script with awk

I am trying to report on the number of files created on each date. I can do that with this little one liner:
ls -la foo*.bar|awk '{print $7, $6}'|sort|uniq -c
and I get a list of how many fooxxx.bar files were created by date, but the month is in the form Aaa (i.e. Apr) and I want xx (i.e. 04).
I have a feeling the answer is in here:
awk '
BEGIN {
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
    for (o = 1; o <= m; o++) {
        months[d[o]] = sprintf("%02d", o)
    }
    format = "%m/%d/%Y %H:%M"
}
{
    split($4, time, ":")
    date = (strftime("%Y") " " months[$2] " " $3 " " time[1] " " time[2] " 0")
    print strftime(format, mktime(date))
}'
But I have little to no idea what I need to strip out, and no idea how to pass $7 to whatever I carve out of this to convert Apr to 04.
Thanks!
Here's the idiomatic way to convert an abbreviated month name to a number in awk:
$ echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
02
$ echo "May" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'
05
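Why this works: the month names are three characters each, so index() returns 1, 4, 7, ..., 34, and (index(...)+2)/3 maps those positions to 1 through 12. A quick sanity check (my own sketch):
awk 'BEGIN {
    s = "JanFebMarAprMayJunJulAugSepOctNovDec"
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i = 1; i <= n; i++)
        printf "%s -> %02d\n", m[i], (index(s, m[i]) + 2) / 3
}'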
Let us know if you need more info to solve your problem.
Assuming the name of the months only appear in the month column, then you could do this:
ls -la foo*.bar|awk '{sub(/Jan/,"01");sub(/Feb/,"02");print $7, $6}'|sort|uniq -c
Just use the field number of your month as an index into the months array.
print months[$6]
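For example, folded into the original one-liner (a sketch; the $6/$7 field positions are an assumption carried over from the question):
ls -la foo*.bar | awk '
BEGIN {
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", d, "|")
    for (o = 1; o <= m; o++) months[d[o]] = sprintf("%02d", o)
}
{ print $7, months[$6] }' | sort | uniq -c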
Since ls output differs from system to system and sometimes on the same system depending on file age and you didn't give any examples, I have no way of knowing how to guide you further.
Oh, and don't parse ls.
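If GNU find is available, a sketch that avoids parsing ls entirely by printing the numeric month (and day) directly (-printf and -maxdepth are GNU extensions):
find . -maxdepth 1 -name 'foo*.bar' -printf '%Td %Tm\n' | sort | uniq -c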
To parse AIX istat, I use:
istat .profile | grep "^Last modified" | read dummy dummy dummy mon day time dummy yr dummy
echo "M: $mon D: $day T: $time Y: $yr"
-> M: Mar D: 12 T: 12:05:36 Y: 2012
To parse AIX istat month, I use this two-liner AIX 6.1 ksh 88:
monstr="???JanFebMarAprMayJunJulAugSepOctNovDec???"
mon="Oct" ; hugo=${monstr%${mon}*} ; hugolen=${#hugo} ; let hugol=hugolen/3 ; echo "Month: $hugol"
-> Month: 10
If the result is 1..12, the month name is OK.
If it is less than 1 or greater than 12, the month name is not OK.
Instead of "hugo", use descriptive names ;-))
Adding a version for AIX that shows how to retrieve all the date elements (in whatever timezone you need them in) and display ISO 8601 output:
tempTZ="UTC" ; TZ="$tempTZ" istat /path/to/somefile \
| grep modified \
| awk -v tmpTZ="$tempTZ" '
BEGIN {Mmms="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec";
n=split(Mmms,Mmm," ") ;
for(i=1;i<=n;i++){ mm[Mmm[i]]=sprintf("%02d",i) }
}
{ printf("%s-%s-%sT%s %s",$NF, mm[$4], $5, $6, tmpTZ ) }
' ## this will output an iso8601 date of the modification date of that file,
## for ex: 2019-04-18T14:16:05 UTC
## you can tempTZ=anything, for ex: tempTZ="UTC+2" to see that date in UTC+2 timezone... or tempTZ="EST" , etc
I show the iso8601 version to make it more known & used, but of course you may only need the "mm" portion, which is easly done : mm[$4]