Split a field and then remove duplicates - awk

Sample file:
# cat test1
-rw-r--r-- 1 root root 19460 Feb 10 03:56 catalina.2015-02-10.log
-rw-r--r-- 1 root root 206868 May 4 15:05 catalina.2015-05-04.log
-rw-r--r-- 1 root root 922121 Jun 24 09:26 catalina.out
-rw-r--r-- 1 root root 0 Feb 10 02:27 host-manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 host-manager.2015-05-04.log
-rw-r--r-- 1 root root 2025 Feb 10 03:56 localhost.2015-02-10.log
-rw-r--r-- 1 root root 8323 May 4 15:05 localhost.2015-05-04.log
-rw-r--r-- 1 root root 873 Feb 10 03:56 localhost_access_log.2015-02-10.txt
-rw-r--r-- 1 root root 458600 May 4 23:59 localhost_access_log.2015-05-04.txt
-rw-r--r-- 1 root root 0 Feb 10 02:27 manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 manager.2015-05-04.log
Expected Output:
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 1 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | awk '!z[$i]++'
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 2 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | uniq
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 3 (Fails):
# awk '{split($9,a,"."); a[1]++} {for (i in a){print a[i]}}' test1
1
2015-02-10
log
1
2015-05-04
log
1
out
.
.
.
Question:
I want to split the 9th field and then display only the unique entries, but I want to do this in a single awk one-liner. Seeking help with my 3rd attempt.

Another, more idiomatic awk one-liner:
awk '!a[ $0 = substr($NF,1,index($NF,".")-1) ]++' file
or, expressed more explicitly:
awk '{$0=substr($NF,1,index($NF,".")-1)} !a[$0]++' file
We use the well-known !a[$0]++ line de-duplication trick, but first we change $0 to substr($NF,1,index($NF,".")-1): the whole line becomes the substring of the last field $NF up to the first dot (.), via substr() with some help from index().
A benefit of this solution is that you don't need to wait until the whole file has been parsed: the split fields are de-duplicated and printed on the fly.
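The !a[$0]++ idiom works on any stream of lines; a tiny standalone demonstration:
# printf 'a\nb\na\n' | awk '!seen[$0]++'
a
b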

You have to use the END block to print the results:
awk '{split($NF,a,"."); b[a[1]]} END{for (i in b){print i}}' file
Notes:
I am using $NF to catch the last field. This way, if you happen to have more or fewer fields than 9, it will still work (as long as there are no filenames with spaces, because parsing ls is evil).
We cannot loop directly over the a[] array, because it holds the pieces of the split field. Instead we collect the keys in a second array, for example b[]; that's why we say b[a[1]]. Referencing the element is enough to create the key, so there is no need for b[a[1]]++ unless you want to keep track of how many times each item appears.
The END block is executed after the whole file has been processed. Without it, the for loop would run once per record (that is, once per line), and duplicates would appear.
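If you do want those counts, increment each entry and print both in the END block; a minimal variation of the same one-liner:
awk '{split($NF,a,"."); b[a[1]]++} END{for (i in b) print i, b[i]}' file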

Related

Can awk find a field containing a string from a list?

I have a file containing different fields.
I have another file containing a list of different words.
I need to use the awk command to extract from my 1st file all records where a specific field contains one or more words from my 2nd file.
For example 1st file:
Feb 15 12:05:10 lcif adm.slm: root [23416]: cd /tmp
Feb 15 12:05:24 lcif adm.slm: root [23416]: cat tst.sh
Feb 15 12:05:44 lcif adm.slm: root [23416]: date
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:05:58 lcif adm.pse: root [23419]: who
Feb 15 12:06:02 lcif adm.pse: root [23419]: uptime
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
Feb 15 12:06:58 lcif adm.pse: root [23419]: ls -lrt
For example 2nd file:
rm
reboot
shutdown
Then the awk command should return:
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
Tried desperately with arrays/maps.
Tried this too:
awk -F ": " '{if ($3 ~ "^rm" || $3 ~ "^reboot" || $3 ~ "^shutdown") print}'
But the list of words I'm looking for is getting bigger and bigger.
I'd rather use a file list.
Appreciate any help.
Thank you!
Serge
You might do it like this:
awk -F ': ' '
    FNR == NR { commands[$0]; next }                     # 1st file: store each word as an array key
    split($3, cmdline, " ") && (cmdline[1] in commands)  # 2nd file: print if the command is in the list
' file2 file1
output:
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
Don't waste time with arrays; just dynamically generate a hard-coded regex on the fly:
printf '%s' "${file_a}" |
gawk -p- -b 'BEGIN { FS = "[]]: " } '"$(
awk -v __="${file_b}" 'BEGIN {
FS = RS ; OFS = "|"
RS = "^$"; _= ORS = ""
$_ = __
print "$NF ~ \"^(" $(_*(NF-=_==$NF)) ")( |$)\"" }' )"
Feb 15 12:05:52 lcif adm.pse: root [23419]: rm -f file
Feb 15 12:06:56 lcif adm.pse: root [23419]: reboot
# this part being dynamically generated
awk 'BEGIN { FS = "[]]: " } $NF ~ "^(rm|reboot|shutdown)( |$)" '
Then, instead of looping through an array, it makes a high-speed single pass through file A without having to store any rows in between.
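For readability, roughly the same regex-building idea fits in one plain awk program; a sketch, assuming the word list contains no blank lines or regex metacharacters:
awk -F': ' '
    FNR == NR { re = (re == "" ? "" : re "|") $0; next }  # build rm|reboot|shutdown
    $NF ~ ("^(" re ")( |$)")                              # keep lines whose command matches
' file2 file1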

Text processing to create a list of unique IDs

I have a file with IDs and the names applicable to them, as below:
1234|abc|cde|fgh
5678|ijk|abc|lmn
9101|cde|fgh|klm
1213|klm|abc|cde
I need a file with only unique Names as a list.
Output File:
abc|sysdate
cde|sysdate
fgh|sysdate
ijk|sysdate
lmn|sysdate
klm|sysdate
Where sysdate is the current timestamp of processing.
Requesting help on this, and also an explanation of the suggested code.
What this code does:
awk -F\| '{ for(i=2; i <= NF; i++) a[$i] = a[$i] FS $1 }' input.csv
-F sets the delimiter to |. awk processes your file line by line, creates an array named a, walks from column 2 to the end of each line, and for each cell appends the field separator plus the value of column 1 to the array entry keyed by that cell.
When awk finishes processing the first line, a is:
a["abc"] = "|1234"
a["cde"] = "|1234"
a["fgh"] = "|1234"
This script does not print anything.
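To actually see the mapping it builds, you could add an END block; a sketch that prints each name followed by the IDs it appears with:
awk -F'|' '{ for (i = 2; i <= NF; i++) a[$i] = a[$i] FS $1 }
    END { for (k in a) print k a[k] }' input.csv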
What you want is something like this:
awk -F'|' '{for(i=2;i<=NF;i++){if(seen[$i] != 1){print $i, strftime(); seen[$i]=1}}}' OFS='|' input.csv
-F sets the input delimiter to |; OFS sets the output delimiter.
For each value from column 2 to the end of the line, we check whether it has been seen before. If not, we print the value and the time of processing, then record it in an array so we don't process it again.
Output:
abc|Thu Oct 18 10:40:13 CEST 2018
cde|Thu Oct 18 10:40:13 CEST 2018
fgh|Thu Oct 18 10:40:13 CEST 2018
ijk|Thu Oct 18 10:40:13 CEST 2018
lmn|Thu Oct 18 10:40:13 CEST 2018
klm|Thu Oct 18 10:40:13 CEST 2018
You can change the format of sysdate; see the gawk strftime documentation.
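Note that strftime() is a gawk extension. The seen check can also be compressed with the !seen[key]++ idiom from the first question; a minimal variation:
awk -F'|' '{for (i=2; i<=NF; i++) if (!seen[$i]++) print $i, strftime()}' OFS='|' input.csv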

Cut column from multiple files with the same name in different directories and paste into one

I have multiple files with the same name (3pGtoA_freq.txt), but all located in different directories.
Each file looks like this:
pos 5pG>A
1 0.162421557770395
2 0.0989643268124281
3 0.0804131316857248
4 0.0616563298066399
5 0.0577551761714493
6 0.0582450832072617
7 0.0393129770992366
8 0.037037037037037
9 0.0301016419077404
10 0.0327510917030568
11 0.0301598837209302
12 0.0309050772626932
13 0.0262089331856774
14 0.0254612546125461
15 0.0226130653266332
16 0.0206971677559913
17 0.0181280059193489
18 0.0243993993993994
19 0.0181347150259067
20 0.0224429727740986
21 0.0175690211545357
22 0.0183916336098089
23 0.0196078431372549
24 0.0187983781791375
25 0.0173192771084337
I want to cut column 2 from each file and paste the columns side by side in one file.
I tried running:
for s in results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt; do awk '{print $2}' $s >> /home/users/istolarek/aDNA/3pGtoA_all; done
but it's not pasting the columns next to each other.
Also, I wanted to name each column by the '*' part, which is the only string that changes in the path.
Any help with that?
for i in $(find your_file_dir -name 3pGtoA_freq.txt); do awk '{print $2 >> "NewFile"}' "$i"; done
I would do this by processing all files in parallel in awk:
awk 'BEGIN{
    # header row: "pos" plus one column label per file,
    # with the sample name extracted from each path
    printf "pos ";
    for (i = 1; i < ARGC; ++i)
        printf "%-19s", gensub("^results_Sample_", "", 1, gensub("_hg19.*", "", 1, ARGV[i]));
    printf "\n";
    # read the files in parallel: one line from each per output row
    while (getline < ARGV[1]) {
        printf "%-4s%-19s", $1, $2;
        for (i = 2; i < ARGC; ++i) {
            getline < ARGV[i];
            printf "%-19s", $2
        }
        printf "\n"
    }
}
{exit}' \
results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt
If your awk doesn't have gensub() (it's a gawk function; I'm using cygwin), you can remove the header block (the four printf lines at the top); headers just won't be printed in that case.
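Alternatively, since all the files have the same number of lines, paste can do the side-by-side work and awk can keep just the value columns; a sketch (column headers omitted):
paste results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt |
awk '{ printf "%s", $1; for (i = 2; i <= NF; i += 2) printf "\t%s", $i; print "" }'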

How to match "field 5 through the end of the line" (for example, in awk)

I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
- 102 /My Directory/some file.txt
In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, the value of $8 ('b') matches at index 15 instead of 49, where the eighth field actually begins (as it does for the other filenames).
My issue, then, is how to specify "everything from field X to the end of the string".
I have re-written this question to make this clear.
Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; it varies between implementations (GNU and BSD stat take different options).
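For instance, the command above is the GNU coreutils form; the BSD/macOS equivalent would be something like the following (an assumption to verify on your system):
stat -c "f%15s %n" *    # GNU coreutils
stat -f "f%15z %N" *    # BSD / macOS: %z is the size, %N the name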
The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character of field 8, but this is not the
case in the example above.
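A robust way to take "from field N to the end" is to peel off the first N-1 fields with sub(), which cannot mis-anchor the way index() does; a sketch for the find-style input above (N=5):
awk '{
    rest = $0
    for (i = 1; i < 5; i++)              # strip fields 1-4 and the spaces after them
        sub(/^[^ ]+ +/, "", rest)
    printf "%s %15s %s\n", $1, $4, rest  # type, size, pathname (inner spaces preserved)
}' input.txt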
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find "$@" -ls |
sed -e 's/^ *//' -e 's/  */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f 4639 /Users/foobar/uu/a
f 3024 /Users/foobar/uu/calendar
f 2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
[1] http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
[2] How to print third column to last column?
[3] Print Field 'N' to End of Line
[4] http://mywiki.wooledge.org/ParsingLs
Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]+/,sprintf("%s %10d",$1,$2))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
It won't work with file names that contain newlines.

Modification of date format within a text file

I have some text files containing lines as follows:
07JAN01, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07
07JAN02, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07
07JAN03, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07
. . . .
. . . .
I need to produce a script which will amend the format of the date to:
01/01/07, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07
02/01/07, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07
03/01/07, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07
. . . .
. . . .
I was looking for an appropriate sed or grep command to extract only some characters of each line and assign them to variables in my script. As I would like to "reorganize" the date, I was thinking about defining three variables, where, for the first line, it would be:
a=07
b=JAN (need to implement a "case" in the script to deal with this, I think?)
c=01
I looked at some grep examples, and tons of docs, but nothing really clear appeared.
I found something about the cut command, but I'm not too sure it's appropriate here.
The other question I have is about the output: since sed doesn't modify the input data, how can I modify the files directly? Is there a way?
Any help would really be appreciated :)
I don't think grep is the right tool for the job myself. You need something a little more expressive like Perl or awk:
echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, '
{
    yy = substr($1,1,2);
    mm = substr($1,3,3);
    mm = (index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", mm) + 2) / 4;
    dd = substr($1,6,2);
    printf "%02d/%02d/%02d,%s,%s,%s\n", dd, mm, yy, $2, $3, $4
}'
which generates:
01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
Obviously, that's just pumping some test data through a command line awk script. You'd be better off putting that into an actual awk script file and running your input through it.
If datechg.awk contains:
{
    yy = substr($1,1,2);
    mm = substr($1,3,3);
    mm = (index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", mm) + 2) / 4;
    dd = substr($1,6,2);
    printf "%02d/%02d/%02d,%s,%s,%s\n", dd, mm, yy, $2, $3, $4
}
then:
echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, -fdatechg.awk
also produces:
01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
The way this works is as follows. Each line is split into fields (-F, sets the field separator to a comma) and we extract and process the relevant parts of field 1 (the date). By this I mean the year and day are reversed and the textual month is turned into a numeric month by searching a string for it and manipulating the index where it was found, so that it falls in the range 1 through 12.
This is the only (relatively) tricky bit and is done with some basic mathematics: the index function simply finds the position within the string of your month (where the first char is 1). So JAN is at position 2, FEB at 6, MAR at 10, ..., DEC at 46 (the set {2, 6, 10, ..., 46}). They're 4 apart so we're going to need to divide by 4 eventually to get consecutive month numbers but first we add 2 so the division will work well. Adding that 2 gives you the set {4, 8, 12, ..., 48}. Then you divide by 4 to get {1, 2, 3, ... 12} and there's your month number:
Text  Pos  +2  /4
----  ---  --  --
JAN     2   4   1
FEB     6   8   2
MAR    10  12   3
APR    14  16   4
MAY    18  20   5
JUN    22  24   6
JUL    26  28   7
AUG    30  32   8
SEP    34  36   9
OCT    38  40  10
NOV    42  44  11
DEC    46  48  12
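You can sanity-check that arithmetic for any month with a one-off command:
awk 'BEGIN { m = "OCT"; print (index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", m) + 2) / 4 }'
10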
Then we just output the new information. Obviously, this is likely to barf if you provide bad data but I'm assuming either:
the data is good; or
you'll add your own error checks.
Regarding modifying the files directly, the time-honored UNIX tradition is to use a shell script to save the current file elsewhere, process it to create a new file, then overwrite the old file with the new file (but not touching the saved file, in case something goes horribly wrong).
I won't make my answer any longer by detailing that; you've probably fallen asleep already :-)
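In outline, though, it looks like this (a sketch assuming the datechg.awk script above and data files named *.txt):
for f in *.txt; do
    cp "$f" "$f.bak" &&                           # keep the original safe
    awk -F, -f datechg.awk "$f.bak" > "$f.tmp" &&
    mv "$f.tmp" "$f"                              # replace only if awk succeeded
done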
A bit clunky, but you could do:
sed -e 's/^\(..\)JAN\(..\)/\2\/01\/\1/'
sed -e 's/^\(..\)FEB\(..\)/\2\/02\/\1/'
...
In order to run sed in-place, see the -i commandline option:
sed -i -e ...
Edit
Just to point out that this answers a previous version of the question where AWK was not specified.
awk 'BEGIN{
    OFS = FS = ","
    # create a table mapping month names to numbers
    s = split("JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", d, ":")
    for (o = 1; o <= s; o++) {
        m = sprintf("%02d", o)   # pad single digits with a leading 0
        date[d[o]] = m
    }
}
{
    yr  = substr($1,1,2)
    mth = substr($1,3,3)
    day = substr($1,6,2)
    $1  = day "/" date[mth] "/" yr
}1' file