AWK: go through the file twice, doing different tasks

I am processing a fairly big collection of Tweets and I'd like to obtain, for each tweet, its mentions (other users' names, prefixed with an @), if the mentioned user is also in the file:
users = new Dictionary()
for each line in file:
username = get_username(line)
userid = get_userid(line)
users.add(key = userid, value = username)
for each line in file:
mentioned_names = get_mentioned_names(line)
mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
print "$line | $mentioned_ids"
I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try and add this to my AWK script. However, I can't find a way to make two passes over the same file, executing different code for each one. Most solutions involve calling AWK several times, but then I'd lose the associative array I made in the first pass.
I could do it in very hacky ways (like cat'ing the file twice, passing it through sed to add a different prefix to all the lines in each cat), but I'd like to be able to understand this code in a couple of months without hating myself.
What would be the AWK way to do this?
PS:
The least terrible way I've found:
function rewind( i)
{
# from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
# shift remaining arguments up
for (i = ARGC; i > ARGIND; i--)
ARGV[i] = ARGV[i-1]
# make sure gawk knows to keep going
ARGC++
# make current file next to get done
ARGV[ARGIND+1] = FILENAME
# do it
nextfile
}
BEGIN {
count = 1;
}
count == 1 {
# first pass, fills an associative array
}
count == 2 {
# second pass, uses the array
}
FNR == 30 {
# handcoded length, horrible
# could also be automated calling wc -l, passing as parameter
if (count == 1) {
count = 2;
rewind(1)
}
}

The idiomatic way to process two separate files, or the same file twice in awk is like this:
awk 'NR==FNR{
# fill associative array
next
}
{
# use the array
}' file1 file2
The total record number NR equals the per-file record number FNR only while the first file is being read. next skips the second block for the first file, so the second block is processed only for the second file. If file1 and file2 are the same file, this passes through the file twice. Note that the test breaks down if the first file is empty, because NR==FNR is then true for the second file instead.
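As a rough illustration of the two-pass idiom applied to the tweet problem, here is a minimal sketch. The input layout (userid, username and text separated by tabs) and the sample data are assumptions made up for the example, not the asker's real format.

```shell
# Hypothetical input: one tweet per line, "userid<TAB>username<TAB>text".
tweets=$(mktemp)
printf '1\talice\thi @bob and @carol\n2\tbob\thello @alice\n' > "$tweets"

result=$(awk -F'\t' '
NR == FNR {                 # first pass: map username -> userid
    id[$2] = $1
    next
}
{                           # second pass: resolve @mentions in the text
    out = ""
    n = split($3, word, /[^@A-Za-z0-9_]+/)
    for (i = 1; i <= n; i++) {
        name = substr(word[i], 2)
        if (word[i] ~ /^@/ && (name in id))
            out = out (out == "" ? "" : ",") id[name]
    }
    print $0 " | " out
}' "$tweets" "$tweets")     # same file named twice = two passes

printf '%s\n' "$result"
rm -f "$tweets"
```

The associative array survives between the passes because both happen inside a single awk process. Mentions of users who never tweet in the file (@carol here) are simply dropped.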

Related

Awk array, replace with full length matches of keys

I want to replace strings in a target file (target.txt) by strings in a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Try this please, see if it meets your need:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next } { for(le=maxL;le>0;le--) if(length(a[le])) for (i in a[le]) gsub(i, a[le][i]) }1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing in the arbitrary order in which the array is traversed, it replaces the longer keys first. (It needs gawk, since it uses arrays of arrays.)
Based on your examples, I think that is enough to avoid the wrong substitutions.
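For awks without arrays of arrays, the same longest-key-first idea can be sketched with ordinary arrays indexed by key length. The file contents below are invented sample data, and the array names key, cnt and val are just illustrative.

```shell
lookup=$(mktemp); target=$(mktemp)
printf 'Seq_1 Name_one\nSeq_10 Name_ten\nSeq_11 Name_eleven\n' > "$lookup"
printf '(Seq_1,Seq_10,(Seq_11,Seq_1));\n' > "$target"

result=$(awk '
FNR == NR {                          # lookup table: bucket keys by length
    len = length($1)
    key[len, ++cnt[len]] = $1
    val[$1] = $2
    if (len > maxlen) maxlen = len
    next
}
{
    for (len = maxlen; len > 0; len--)       # longest keys first
        for (j = 1; j <= cnt[len]; j++)
            gsub(key[len, j], val[key[len, j]])
    print
}' "$lookup" "$target")

printf '%s\n' "$result"
rm -f "$lookup" "$target"
```

As in the original command, each key is used as a regular expression and an & in a replacement value is special to gsub, so this only behaves for plain names like the ones shown. It also still assumes that no replacement value contains a shorter key.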

AWK: finding common elements across arbitrary number of columns (either single column files or column matrix)

Problem
I have several files, each one column, and I want to compare each of them to one another to find what elements are contained across all files. Alternatively - if it is easier - I could make a column matrix.
Question
How can I find the common elements across multiple columns?
Request
I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.
Other
@joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b
to find what elements are contained across all files
awk is your friend as you guessed. Use the procedure below
#Store the files in an array. Assuming all files in one place
filelist=( $(find . -maxdepth 1 -type f) ) #array of files
awk -v count="${#filelist[@]}" '{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"
Note
We used -v count="${#filelist[@]}" to pass the total file count to awk. Note that # at the beginning of an array expansion gives the element count.
value[$1]++ increments the count of a value each time it is seen. It also creates value[$1], with an initial value of zero, if it does not already exist.
This method fails if a value appears in a file more than once.
The END block is executed only once, after every record from all the files has been processed.
If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.
A couple of variations with GNU awk (which is needed for ARGIND to be available. It could be emulated by checking FILENAME but that's even uglier.)
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }' file1 file2 file3
The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts with one).
At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
gawk 'ARGIND > LASTIND {
LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] }
}
{ CURR[$0] = 1 }
END { for (x in CURR) ALL[x] += 1;
for (x in ALL) if (ALL[x] == ARGIND) print x
}' file1 file2 file3
Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all files.
The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).
In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
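For an awk without ARGIND, one possible portable sketch is to remember each (file, line) pair so that a line is counted at most once per file. It assumes that no input file is empty and that no file is named twice on the command line. The sample files are made up, with a deliberate duplicate C in file1.

```shell
dir=$(mktemp -d)
printf 'A\nC\nF\nC\n' > "$dir/file1"
printf 'B\nC\nE\n'    > "$dir/file2"
printf 'C\nK\nL\n'    > "$dir/file3"

result=$(awk '
FNR == 1 { nfiles++ }                     # a new (non-empty) file has started
!seen[FILENAME, $0]++ { count[$0]++ }     # count each line once per file
END {
    for (x in count)
        if (count[x] == nfiles) print x
}' "$dir/file1" "$dir/file2" "$dir/file3")

printf '%s\n' "$result"
rm -rf "$dir"
```

The seen array grows with the number of distinct lines per file, so this trades some memory for not depending on gawk.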
I'm going to assume that it's the problem that matters, not the implementation language so here's an alternative using perl:
#! /usr/bin/perl
use strict;
my %elements=();
my $filecount=@ARGV;
while(<>) {
$elements{$_}->{$ARGV}++;
};
print grep {!/^$/} map {
"$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);
The while loop builds a hash-of-hashes (aka "HoH". See man perldsc and man perllol for details. Also see below for an example), with the top level key being each line from each input file, and the second-level key being the names of the file(s) that value appeared in.
The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files
Here's what the data structure looks like, using the example you gave to ilkkachu:
{
'A' => { 'file1' => 1 },
'B' => { 'file2' => 1 },
'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
'E' => { 'file2' => 1 },
'F' => { 'file1' => 1 },
'K' => { 'file3' => 1 },
'L' => { 'file3' => 1 }
}
Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.
The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately.
With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements. e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end because I didn't use chomp in the while loop as I knew i'd be printing them directly. In most programs, i'd use chomp to strip newlines and/or carriage-returns.

How to handle 3 files with awk?

Ok, so after spending 2 days, I am not able to solve it and I am almost out of time now. It might be a very silly question, so please bear with me. My awk script does something like this:
BEGIN{ n=50; i=n; }
FNR==NR {
# Read file-1, which has just 1 column
ids[$1]=int(i++/n);
next
}
{
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
It works fine. But now I want to extend it to read 3 files. Let's say, instead of hard-coding the value of "n", I need to read a properties file and set value of "n" from that. I found this question and have tried something like this:
BEGIN{ n=0; i=0; }
FNR==NR {
# Block A
# Try to read file-0
next
}
{
# Block B
# Read file-1, which has just 1 column
next
}
{
# Block C
# Read file-2 which has 4 columns
# Do something
next
}
END {...}
But it is not working. Block A is executed for file-0, I am able to read the property from properties files. But Block B is executed for both files file-1 and file-2. And Block C is never executed.
Can someone please help me solve this? I have never used awk before and the syntax is very confusing. Also, if someone can explain how awk reads input from different files, that will be very helpful.
Please let me know if I need to add more details to the question.
If you have gawk, just test ARGIND:
awk '
ARGIND == 1 { do file 1 stuff; next }
ARGIND == 2 { do file 2 stuff; next }
' file1 file2
If you don't have gawk, get it.
In other awks though you can just test for the file name:
awk '
FILENAME == ARGV[1] { do file 1 stuff; next }
FILENAME == ARGV[2] { do file 2 stuff; next }
' file1 file2
That only fails if you want to parse the same file twice, if that's the case you need to add a count of the number of times that file's been opened.
Update: The solution below works, as long as all input files are nonempty, but see @Ed Morton's answer for a simpler and more robust way of adding file-specific handling.
However, this answer still provides a hopefully helpful explanation of some awk basics and why the OP's approach didn't work.
Try the following (note that I've made the indices 1-based, as that's how awk does it):
awk '
# Increment the current-file index, if a new file is being processed.
FNR == 1 { ++fIndex }
# Process current line if from 1st file.
fIndex == 1 {
print "file 1: " FILENAME
next
}
# Process current line if from 2nd file.
fIndex == 2 {
print "file 2: " FILENAME
next
}
# Process current line (from all remaining files).
{
print "file " fIndex ": " FILENAME
}
' file-1 file-2 file-3
Pattern FNR==1 is true whenever a new input file is starting to get processed (FNR contains the input file-relative line number).
Every time a new file starts processing, fIndex is incremented and thus reflects the 1-based index of the current input file. Tip of the hat to @twalberg's helpful answer.
Note that an uninitialized awk variable used in a numeric context defaults to 0, so there's no need to initialize fIndex (unless you want a different start value).
Patterns such as fIndex == 1 can then be used to execute blocks for lines from a specific input file only (assuming the block ends in next).
The last block is then executed for all input files that don't have file-specific blocks (above).
As for why your approach didn't work:
Your 2nd and 3rd blocks are potentially executed unconditionally, for lines from all input files, because they are not preceded by a pattern (condition).
So your 2nd block is entered for lines from all subsequent input files, and its next statement then prevents the 3rd block from ever getting reached.
Potential misconceptions:
Perhaps you think that each block functions as a loop processing a single input file. This is NOT how awk works. Instead, the entire awk program is processed in a loop, with each iteration processing a single input line, starting with all lines from file 1, then from file 2, ...
An awk program can have any number of blocks (typically preceded by patterns), and whether they're executed for the current input line is solely governed by whether the pattern evaluates to true; if there is no pattern, the block is executed unconditionally (across input files). However, as you've already discovered, next inside a block can be used to skip subsequent blocks (pattern-block pairs).
Perhaps you need to consider adding some additional structure like this:
BEGIN { file_number=0 }
FNR==1 { ++file_number }
file_number==3 && /something_else/ { ...}

Create postfix aliases file from LDIF using awk

I want to create a Postfix aliases file from the LDIF output of ldapsearch.
The LDIF file contains records for approximately 10,000 users. Each user has at least one entry for the proxyAddresses attribute. I need to create an alias corresponding with each proxyAddress that meets the conditions below. The created aliases must point to sAMAccountName@other.domain.
Type is SMTP or smtp (case-insensitive)
Domain is exactly contoso.com
I'm not sure if the attribute ordering in the LDIF file is consistent. I don't think I can assume that sAMAccountName will always appear last.
Example input file
dn: CN=John Smith,OU=Users,DC=contoso,DC=com
proxyAddresses: SMTP:smith@contoso.com
proxyAddresses: smtp:John.Smith@contoso.com
proxyAddresses: smtp:jsmith@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/JOHNSMITH
sAMAccountName: smith
dn: CN=Tom Frank,OU=Users,DC=contoso,DC=com
sAMAccountName: frank
proxyAddresses: SMTP:frank@contoso.com
proxyAddresses: smtp:Tom.Frank@contoso.com
proxyAddresses: smtp:frank@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/TOMFRANK
Example output file
smith: smith@other.domain
John.Smith: smith@other.domain
frank: frank@other.domain
Tom.Frank: frank@other.domain
Ideal solution
I'd like to see a solution using awk, but other methods are acceptable too. Here are the qualities that are most important to me, in order:
Simple and readable. Self-documenting is better than one-liners.
Efficient. This will be used thousands of times.
Idiomatic. Doing it "the awk way" would be nice if it doesn't compromise the first two goals.
What I've tried
I've managed to make a start on this, but I'm struggling to understand the finer points of awk.
I tried using csplit to create separate files for each record in the LDIF output, but that seems wasteful since I only want a single file in the end.
I tried setting RS="" in awk to get complete records instead of individual lines, but then I wasn't sure where to go from there.
I tried using awk to split the big LDIF file into separate files for each record and then processing those with another shell script, but that seemed wasteful.
Here is a gawk script, which you could run like this: gawk -f ldif.awk yourfile.ldif
Please note: a multi-character value of RS is a gawk extension.
$ cat ldif.awk
BEGIN {
RS = "\n\n" # Record separator: empty line
FS = "\n" # Field separator: newline
}
# For each record: loop twice through fields
{
# Loop #1 identifies the sAMAccountName
for (i = 1; i <= NF; i++) {
if ($i ~ /^sAMAccountName: /) {
sAN = substr($i, 17)
break
}
}
# Loop #2 prints output lines
for (i = 1; i <= NF; i++) {
if (tolower($i) ~ /smtp:.*@contoso.com$/) {
split($i, n, ":|@")
print n[3] ": " sAN "@other.domain"
}
}
}
Here is a way to do it using standard awk.
# Display the postfix alias(es) for the previous user (if any)
function dump() {
for(i in id) printf("%s: %s@other.domain\n",id[i],an);
delete id;
}
# store all email names for that user in the id array
/^proxyAddresses:.[Ss][Mm][Tt][Pp]:.*@contoso.com/ {gsub(/^.*:/,"");gsub(/@.*$/,"");id[i++]=$0}
# store the account name
/^sAMAccountName:/ {an=$2};
# When a new record is found, process the previous one
/^dn:/ {dump()}
# Process the last record
END {dump()}
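Since the question mentions getting stuck after setting RS="", here is a sketch of that paragraph-mode route in plain awk. It assumes entries are separated by blank lines, as LDIF output normally is, and the sample records are a shortened copy of the example input. The variable names account and alias are just illustrative.

```shell
ldif=$(mktemp)
cat > "$ldif" <<'EOF'
dn: CN=John Smith,OU=Users,DC=contoso,DC=com
proxyAddresses: SMTP:smith@contoso.com
proxyAddresses: smtp:John.Smith@contoso.com
proxyAddresses: smtp:jsmith@elsewhere.com
sAMAccountName: smith

dn: CN=Tom Frank,OU=Users,DC=contoso,DC=com
sAMAccountName: frank
proxyAddresses: SMTP:frank@contoso.com
proxyAddresses: MS:ORG/ORGEXCH/TOMFRANK
EOF

result=$(awk '
BEGIN { RS = ""; FS = "\n" }     # paragraph mode: one entry per record, one line per field
{
    account = ""
    for (i = 1; i <= NF; i++)            # pass 1: find the account name
        if ($i ~ /^sAMAccountName: /)
            account = substr($i, 17)
    for (i = 1; i <= NF; i++)            # pass 2: emit one alias per matching address
        if (tolower($i) ~ /^proxyaddresses: smtp:[^@]*@contoso\.com$/) {
            alias = $i
            sub(/^[^:]*: [^:]*:/, "", alias)   # drop "proxyAddresses: SMTP:"
            sub(/@.*$/, "", alias)             # drop the domain
            print alias ": " account "@other.domain"
        }
}' "$ldif")

printf '%s\n' "$result"
rm -f "$ldif"
```

Looping over the fields twice keeps it independent of where sAMAccountName appears within an entry.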

pre-populate associative array keys in awk?

I've written a munin plugin that uses slurm's sacct to monitor job states on a HPC cluster. I've written it in sh + awk (rather than my usual tool of choice, perl).
The script works, but it took me ages to figure out how to pre-populate the associative array of possible states (some/most may not be present in sacct output, and i want them to default to zero). Google wasn't much help, and the best I could come up with was to use split on a string to produce a temporary array, which I then iterated over.
I came up with this:
BEGIN {
num = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
for (i=1;i<=num;i++) {
states[statenames[i]] = 0
}
}
This works, but seems clumsy compared to how i'd do it in perl, like this:
foreach (qw(cancelled completed completing failed nodefail pending running suspended timeout)) {
$states{$_} = 0;
}
or this
%states = map {$_ => 0} qw(cancelled completed completing failed nodefail pending running suspended timeout);
my question is: is there a way of doing this in awk that is similar to either of the perl versions?
[ edited ]
to clarify, here's a sample of the sacct output i'm piping into awk. Note that the only states in this output are RUNNING, COMPLETED, and CANCELLED - the others don't exist (because they haven't occurred today), but i want them in my script's output anyway (in a form usable by munin as "statename.value 0").
# sacct -X -P -o 'state' -n
RUNNING
RUNNING
RUNNING
RUNNING
COMPLETED
RUNNING
COMPLETED
RUNNING
COMPLETED
COMPLETED
CANCELLED by 1000
COMPLETED
[ edited again ]
and here's sample output from my munin plugin:
# ./slurm-sacct
suspended.value 0
pending.value 0
nodefail.value 0
failed.value 0
running.value 6
completing.value 0
completed.value 5
timeout.value 0
cancelled.value 1
The script runs and does what I want, I just wanted to know if there was a better way to initialise the associative array.
You probably don't need to do it at all. Variables in awk are dynamic, which means they're automatically initialized when they are first used (either assigned to or accessed), and this applies to array elements as well.
A variable will be initialized to 0 if it's accessed in a numeric context, or to the empty string otherwise. (At least gawk does this, though I'm not sure if it's implementation-dependent) So if you're doing something like counting the number of jobs that are in each state, the entire program is as simple as something like
{ states[$1]++ }
END {
for (state in states) print state, states[state]
}
Each time the expression states[$1]++ is executed, it will check for the existence of states[$1] and initialize it to 0 if it doesn't already exist.
EDIT: From your comment I'm guessing you want to print out a line for each possible state, regardless of whether there are any jobs in that state or not. In that case, you need to include all the possible state names, and there is no shortcut notation for doing so as there is in Perl. As far as I know, what you've already found is about as clean as it gets. (Awk is not really designed with that usage in mind)
I'd suggest the following:
{ states[$1]++ }
END {
split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
for (i in statenames) print statenames[i], states[statenames[i]]+0
}
Perhaps Craig can use, instead of:
print "Timeout states ",states["timeout"],".";
this:
print "Timeout states ",int(states["timeout"]),".";
In my case if there is no timeout state in awk input, the first print will give:
Timeout states .
While the second will give:
Timeout states 0.
I think a more natural approach in awk would be to have a separate file of keys. Consider a file keys.txt with one key per line. You could then do something like this:
printf "key1\nkey2\nkey2\nkey5" |
awk '
FILENAME == "keys.txt" {
counts[$0] = 0
next
}
{
counts[$0]++
}
END {
for (key in counts) {
print key, counts[key]
}
}' keys.txt -
With five keys in keys.txt, this produces:
key1 1
key2 2
key3 0
key4 0
key5 1
Although the keys are shown in order here, that's just incidental and shouldn't be relied upon.
For the specific example, you could also skip the associative array altogether. Instead, you could minimally process the lines with awk and use sort | uniq -c to tabulate the counts. The presence of all keys could be ensured using join against a file of keys.
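The sort | uniq -c plus join route could look roughly like this. The key list and the sample input are shortened, made-up stand-ins for the real sacct output, and LC_ALL=C is set only so that sort and join agree on ordering.

```shell
LC_ALL=C; export LC_ALL
keys=$(mktemp); counts=$(mktemp)
printf 'cancelled\ncompleted\nfailed\nrunning\n' | sort > "$keys"

# Normalise each state to lower case, count, then flip to "state count".
printf 'RUNNING\nCOMPLETED\nRUNNING\nCANCELLED by 1000\n' |
    awk '{ print tolower($1) }' | sort | uniq -c |
    awk '{ print $2, $1 }' > "$counts"

# -a 1 keeps keys with no match, -e 0 fills in their missing count.
result=$(join -a 1 -e 0 -o 0,2.2 "$keys" "$counts")

printf '%s\n' "$result"
rm -f "$keys" "$counts"
```

Both inputs to join have to be sorted on the join field, which is why the key file is passed through sort as well.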
awk is somewhat clumsier (I would say "less terse") than Perl.
You could write this (similar to @Michael's answer):
pipeline of data |
awk '
NR == FNR {statenames[$1]=0; next}
{ usual processing }
END { usual output }
' <(printf "%s\n" cancelled completed completing failed nodefail pending running suspended timeout) -
One tweak to @DavidZaslavsky's answer might be to print the states in the order you specified them on the split() line. That would be:
{ states[tolower($1)]++ }
END {
n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
for (i=1; i<=n; i++) {
state = statenames[i]
print state, states[state]+0
}
}
I also converted the input to lower case so it matches your hard-coded values, got rid of the unnecessary 3rd arg to split() and the subsequent null statement (trailing semi-colon).
In case you want to account for finding state names in your input that weren't in your hard-coded set, you could tweak it to:
{ states[tolower($1)]++ }
END {
n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
for (i=1; i<=n; i++) {
state = statenames[i]
print state, states[state]+0
delete states[state]
}
for (state in states) {
printf "WARNING: found new state name %s\n",state | "cat>&2"
print state, states[state]+0
}
}