Create postfix aliases file from LDIF using awk

I want to create a Postfix aliases file from the LDIF output of ldapsearch.
The LDIF file contains records for approximately 10,000 users. Each user has at least one entry for the proxyAddresses attribute. I need to create an alias corresponding to each proxyAddress that meets the conditions below. The created aliases must point to sAMAccountName@other.domain.
Type is SMTP or smtp (case-insensitive)
Domain is exactly contoso.com
I'm not sure if the attribute ordering in the LDIF file is consistent. I don't think I can assume that sAMAccountName will always appear last.
Example input file
dn: CN=John Smith,OU=Users,DC=contoso,DC=com
proxyAddresses: SMTP:smith@contoso.com
proxyAddresses: smtp:John.Smith@contoso.com
proxyAddresses: smtp:jsmith@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/JOHNSMITH
sAMAccountName: smith
dn: CN=Tom Frank,OU=Users,DC=contoso,DC=com
sAMAccountName: frank
proxyAddresses: SMTP:frank@contoso.com
proxyAddresses: smtp:Tom.Frank@contoso.com
proxyAddresses: smtp:frank@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/TOMFRANK
Example output file
smith: smith@other.domain
John.Smith: smith@other.domain
frank: frank@other.domain
Tom.Frank: frank@other.domain
Ideal solution
I'd like to see a solution using awk, but other methods are acceptable too. Here are the qualities that are most important to me, in order:
Simple and readable. Self-documenting is better than one-liners.
Efficient. This will be used thousands of times.
Idiomatic. Doing it "the awk way" would be nice if it doesn't compromise the first two goals.
What I've tried
I've managed to make a start on this, but I'm struggling to understand the finer points of awk.
I tried using csplit to create separate files for each record in the LDIF output, but that seems wasteful since I only want a single file in the end.
I tried setting RS="" in awk to get complete records instead of individual lines, but then I wasn't sure where to go from there.
I tried using awk to split the big LDIF file into separate files for each record and then processing those with another shell script, but that seemed wasteful.

Here is a gawk script, which you could run like this: gawk -f ldif.awk yourfile.ldif
Please note: the multicharacter value of RS is a gawk extension.
$ cat ldif.awk
BEGIN {
    RS = "\n\n"    # Record separator: empty line
    FS = "\n"      # Field separator: newline
}
# For each record: loop twice through the fields
{
    # Loop #1 identifies the sAMAccountName
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^sAMAccountName: /) {
            sAN = substr($i, 17)    # everything after "sAMAccountName: "
            break
        }
    }
    # Loop #2 prints an output line for each qualifying proxyAddress
    for (i = 1; i <= NF; i++) {
        if (tolower($i) ~ /smtp:.*@contoso.com$/) {
            split($i, n, ":|@")     # n[3] is the local part of the address
            print n[3] ": " sAN "@other.domain"
        }
    }
}
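If portability beyond gawk matters, standard awk's paragraph mode (setting RS to the empty string, which the question already experimented with) also splits records on blank lines; a minimal sketch of the changed BEGIN block:
BEGIN {
    RS = ""      # paragraph mode: records separated by blank lines (POSIX awk)
    FS = "\n"    # fields are the individual lines of each record
}
The rest of the script is unchanged. Paragraph mode treats runs of blank lines as a single separator, which is fine for LDIF output.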

Here is a way to do it using standard awk.
# Display the postfix alias(es) for the previous user (if any)
function dump() {
    for (i in id) printf("%s: %s@other.domain\n", id[i], an)
    delete id
}
# Store all qualifying email names for this user in the id array
/^proxyAddresses:.[Ss][Mm][Tt][Pp]:.*@contoso.com/ {
    gsub(/^.*:/, "")    # strip everything up to the last colon
    gsub(/@.*$/, "")    # strip the domain
    id[i++] = $0
}
# Store the account name
/^sAMAccountName:/ { an = $2 }
# When a new record is found, process the previous one
/^dn:/ { dump() }
# Process the last record
END { dump() }
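Assuming the script above is saved as aliases.awk (a name chosen here just for illustration), it can be run the same way as the gawk version:
awk -f aliases.awk yourfile.ldif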

Related

Awk array, replace with full length matches of keys

I want to replace strings in a target file (target.txt) by strings in a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Try this, please, and see if it meets your need:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next }
     {
         for (le=maxL; le>0; le--)
             if (length(a[le]))
                 for (i in a[le])
                     gsub(i, a[le][i])
     } 1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing in whatever order the keys happen to come out of the array, it replaces using the longer keys first. Note that a[le][$1] is a true multidimensional array, which requires gawk 4 or later.
This way, based on your examples, it should be enough to avoid wrong substitutions.
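If you have gawk, an alternative sketch that sidesteps the ordering problem entirely is to anchor every key with gawk's word-boundary operators \< and \>, so that Seq_1 can never match inside Seq_10 (this assumes the keys contain no regex metacharacters):
awk 'FNR==NR { a[$1]=$2; next }
     { for (i in a) gsub("\\<" i "\\>", a[i]) } 1' "lookup.tab" "target.txt"
The backslashes are doubled because the regex is built from a string constant.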

print from match & process several input files

If you look through my questions from the past weeks, you will find I have asked questions similar to this one. I had problems asking them in the expected format since I did not really know where my problems came from. Ed Morton tells me not to use range expressions. Well, I do not know exactly what they are. I found in this forum many questions like mine with working answers.
Like: "How to print following line from a match" (e.g.)
But all solutions I found stop working when I process more than one input file. I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
while 1.awk contains:
BEGIN {
    OFS = FS = ";"
    pattern = "row4"
}
go { print }
$0 ~ pattern { go = 1 }
Input file 1, print1.csv, contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same, just for illustration purposes.
The 1.awk script (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from match' seem to be ignored.
As said, I was told not to use range expressions. I do not know how, and maybe the problem is linked to the way I input several files?
Just reset your match indicator at the beginning of each file. (The bare p is a condition with the default action, print; because /row4/{p=1} comes after it, the matching line itself is not printed.)
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: "If $1=="row5" then write "row5" into $6 and delete the value "row5" in $5? In other words, to move content "row5" in column 1, if found there, to new column 6? I could do this with another awk but a combination into one would be nicer"
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5, replace $5 with the corresponding field number.
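Putting the update together with the per-file reset, a combined sketch (assuming the ;-separated fields of 1.awk, so OFS must be set for the rebuilt line):
awk 'BEGIN { OFS = FS = ";" }
     FNR == 1 { p = 0 }                      # reset the indicator for each new file
     p && $1 == "row5" { $6 = $5; $5 = "" }  # move field 5 into field 6 on those lines
     p                                       # print lines after the match
     /row4/ { p = 1 }' print1.csv print2.csv
Note that awk only rebuilds $0 with OFS for the lines whose fields were modified; all other lines print verbatim.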

How to systematically replace certain parts of a file?

I am using orgmode on Emacs and want to automatically update parts of an orgmode file using cron scheduling.
I know how to get the cron job to run at the times I choose but now I am faced with the issue of selecting certain parts of the file to change.
I would like to increment numbers at certain locations in a file every day (like every day at 3am or something).
So say I have the file fruit.org:
* Apple
age: 2
* Bananas
age: 1
A really bad fruit
* Cranberry
* Death
* Easter
A cool day
I want to select all the numerical values after age and then increment them every day. How would I do this selection and replacement? I believe it would involve a regexp and some tool (maybe awk), but I am relatively clueless from there on.
In awk, you could say:
awk '/age:/ { $2++ } { print }' foo.org
If you have a recent version of GNU awk, you can edit the file in-place using the option -i inplace. Otherwise, just do the usual, i.e. redirect to a temporary file and then replace the original:
awk '/age:/ { $2++ } { print }' foo.org > foo.org.tmp && mv foo.org{.tmp,}
That's basically what the inplace option of awk or sed does behind the scenes anyway.
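Given the sample fruit.org above, either command produces the following (awk rebuilds modified lines with single-space separators, which happens to match the input here):
* Apple
age: 3
* Bananas
age: 2
A really bad fruit
* Cranberry
* Death
* Easter
A cool day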

AWK: go through the file twice, doing different tasks

I am processing a fairly big collection of Tweets and I'd like to obtain, for each tweet, its mentions (other users' names, prefixed with an @), if the mentioned user is also in the file:
users = new Dictionary()
for each line in file:
    username = get_username(line)
    userid = get_userid(line)
    users.add(key = userid, value = username)
for each line in file:
    mentioned_names = get_mentioned_names(line)
    mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
    print "$line | $mentioned_ids"
I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try and add this to my AWK script. However, I can't find a way to make two passes over the same file, executing different code for each one. Most solutions imply calling AWK several times, but then I'd lose the associative array I made in the first pass.
I could do it in very hacky ways (like cat'ing the file twice, passing it through sed to add a different prefix to all the lines in each cat), but I'd like to be able to understand this code in a couple of months without hating myself.
What would be the AWK way to do this?
PS:
The least terrible way I've found:
function rewind(    i)
{
    # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]
    # make sure gawk knows to keep going
    ARGC++
    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME
    # do it
    nextfile
}
BEGIN {
    count = 1
}
count == 1 {
    # first pass, fills an associative array
}
count == 2 {
    # second pass, uses the array
}
FNR == 30 {
    # handcoded length, horrible
    # could also be automated by calling wc -l and passing the result as a parameter
    if (count == 1) {
        count = 2
        rewind(1)
    }
}
The idiomatic way to process two separate files, or the same file twice, in awk is like this:
awk 'NR==FNR{
# fill associative array
next
}
{
# use the array
}' file1 file2
The total record number NR is only equal to the record number for the current file FNR on the first file. next skips the second block for the first file. The second block is then processed for the second file. If file1 and file2 are the same file, then this passes through the file twice.
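Applied to the tweet example, a sketch of the same idiom. The field layout ($1 as userid, $2 as username, mentions appearing as @-prefixed words on the line) is an assumption here; adapt it to your real format:
awk 'NR == FNR {                 # pass 1: remember every username in the file
         users[$2] = $1          # assumed layout: $1 = userid, $2 = username
         next
     }
     {                           # pass 2: collect ids of mentioned users we know
         ids = ""
         for (i = 1; i <= NF; i++)
             if ($i ~ /^@/ && substr($i, 2) in users)
                 ids = ids " " users[substr($i, 2)]
         print $0 " |" ids
     }' tweets.txt tweets.txt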

How to handle 3 files with awk?

Ok, so after spending 2 days, I am not able to solve it and I am almost out of time now. It might be a very silly question, so please bear with me. My awk script does something like this:
BEGIN { n = 50; i = n }
FNR == NR {
    # Read file-1, which has just 1 column
    ids[$1] = int(i++ / n)
    next
}
{
    # Read file-2, which has 4 columns
    # Do something
    next
}
END { ... }
It works fine. But now I want to extend it to read 3 files. Let's say, instead of hard-coding the value of "n", I need to read a properties file and set value of "n" from that. I found this question and have tried something like this:
BEGIN { n = 0; i = 0 }
FNR == NR {
    # Block A
    # Try to read file-0
    next
}
{
    # Block B
    # Read file-1, which has just 1 column
    next
}
{
    # Block C
    # Read file-2, which has 4 columns
    # Do something
    next
}
END { ... }
But it is not working. Block A is executed for file-0, and I am able to read the property from the properties file. But Block B is executed for both file-1 and file-2, and Block C is never executed.
Can someone please help me solve this? I have never used awk before and the syntax is very confusing. Also, if someone can explain how awk reads input from different files, that will be very helpful.
Please let me know if I need to add more details to the question.
If you have gawk, just test ARGIND:
awk '
ARGIND == 1 { do file 1 stuff; next }
ARGIND == 2 { do file 2 stuff; next }
' file1 file2
If you don't have gawk, get it.
In other awks, though, you can just test the file name:
awk '
FILENAME == ARGV[1] { do file 1 stuff; next }
FILENAME == ARGV[2] { do file 2 stuff; next }
' file1 file2
That only fails if you want to parse the same file twice; if that's the case, you need to add a count of the number of times that file's been opened.
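A sketch of that counting approach, for running the same file twice (essentially the fIndex technique shown in the next answer, and in the same pseudocode style as above):
awk '
FNR == 1 { ++opens }    # increments each time a file is (re)opened
opens == 1 { do first pass stuff; next }
opens == 2 { do second pass stuff; next }
' file file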
Update: The solution below works, as long as all input files are nonempty, but see Ed Morton's answer for a simpler and more robust way of adding file-specific handling.
However, this answer still provides a hopefully helpful explanation of some awk basics and why the OP's approach didn't work.
Try the following (note that I've made the indices 1-based, as that's how awk does it):
awk '
# Increment the current-file index, if a new file is being processed.
FNR == 1 { ++fIndex }
# Process the current line if it is from the 1st file.
fIndex == 1 {
    print "file 1: " FILENAME
    next
}
# Process the current line if it is from the 2nd file.
fIndex == 2 {
    print "file 2: " FILENAME
    next
}
# Process the current line (from all remaining files).
{
    print "file " fIndex ": " FILENAME
}
' file-1 file-2 file-3
Pattern FNR==1 is true whenever a new input file is starting to get processed (FNR contains the input file-relative line number).
Every time a new file starts processing, fIndex is incremented and thus reflects the 1-based index of the current input file. Tip of the hat to @twalberg's helpful answer.
Note that an uninitialized awk variable used in a numeric context defaults to 0, so there's no need to initialize fIndex (unless you want a different start value).
Patterns such as fIndex == 1 can then be used to execute blocks for lines from a specific input file only (assuming the block ends in next).
The last block is then executed for all input files that don't have file-specific blocks (above).
As for why your approach didn't work:
Your 2nd and 3rd blocks are potentially executed unconditionally, for lines from all input files, because they are not preceded by a pattern (condition).
So your 2nd block is entered for lines from all subsequent input files, and its next statement then prevents the 3rd block from ever getting reached.
Potential misconceptions:
Perhaps you think that each block functions as a loop processing a single input file. This is NOT how awk works. Instead, the entire awk program is processed in a loop, with each iteration processing a single input line, starting with all lines from file 1, then from file 2, ...
An awk program can have any number of blocks (typically preceded by patterns), and whether they're executed for the current input line is solely governed by whether the pattern evaluates to true; if there is no pattern, the block is executed unconditionally (across input files). However, as you've already discovered, next inside a block can be used to skip subsequent blocks (pattern-block pairs).
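For instance, this tiny hypothetical program makes the flow visible: for a line containing "a" only the first block runs; for any other line, each remaining pattern is tested in turn, and the unconditional block runs last.
awk '
/a/ { print "first";  next }   # next skips the remaining blocks for this line
/b/ { print "second" }         # tested only if the line had no "a"
    { print "always" }         # no pattern: runs for every line that gets here
' file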
Perhaps you need to consider adding some additional structure like this:
BEGIN { file_number = 0 }    # start at 0: the FNR==1 rule below makes it 1 for the first file
FNR == 1 { ++file_number }
file_number == 3 && /something_else/ { ... }