LDAP search attribute with multiple occurrences - ldap

I am looking for an LDAP filter to find users who have a multi-valued attribute with duplicate values.
For example, in this case I want to find the users with a duplicated {PersNumber} attribute:
MVAttribute {PersNumber}111111
{PersNumber}111112
I have tried many filters, but I could not find the right one yet.

It is not an LDAP search solution, but depending on the number of users you could export the users to EXPORT.ldif and then apply an LDIF/awk solution as in https://stackoverflow.com/questions/74649357/remove-duplicate-attributes-for-a-core-single-value-no-user-modification-attri/75235694#75235694. Here is the modified awk script for the duplicate attribute cn:
grep '^dn:\|^cn:' EXPORT.ldif | awk '
BEGIN { L1=""; L2=""; DN_PREF="dn:"; DN="" }
{
    # remember the dn: of the record we are currently in
    if ($1 == DN_PREF) {
        DN = $2
    }
    # two consecutive lines with the same attribute type and value -> duplicate
    if (L1 == $1 && L2 == $2) {
        printf("\n%s %s\n%s %s\n%s %s\n--------",
               DN_PREF, DN,
               $1, $2,
               L1, L2)
    }
    L1 = $1
    L2 = $2
}'
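For completeness, the export itself could be produced with something like the following (only a sketch: the host, bind DN, base DN and filter are placeholders for your directory, and -o ldif-wrap=no assumes an OpenLDAP client so that long values are not folded across lines, which would break the line-based grep):
ldapsearch -x -H ldap://ldap.example.com -D "cn=admin,dc=example,dc=com" -W \
    -b "ou=users,dc=example,dc=com" -o ldif-wrap=no \
    "(objectClass=person)" cn > EXPORT.ldif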

Related

Awk array, replace with full length matches of keys

I want to replace strings in a target file (target.txt) with strings from a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Try this, please, and see if it meets your needs:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next } { for(le=maxL;le>0;le--) if(length(a[le])) for (i in a[le]) gsub(i, a[le][i]) }1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing in the arbitrary hash order of the array, it replaces using the longer keys first (note that a[le][$1] is an array of arrays, so it needs GNU awk 4.0 or later).
This way, and based on your examples, I think it's enough to avoid wrong substitutions.
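If GNU awk is available anyway, another way to avoid the partial-match problem is to anchor every key on word boundaries, so that Seq_1 can never match inside Seq_10. A minimal sketch, assuming the keys contain no regex metacharacters (untested against a real Nexus file):
gawk 'FNR==NR { map[$1] = $2; next }                    # read lookup.tab into map
      { for (k in map) gsub("\\<" k "\\>", map[k]) }    # \< \> are gawk word boundaries
      1' "lookup.tab" "target.txt"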

print from match & process several input files

If you scrutinize my questions from the past weeks, you will find I have asked questions similar to this one. I had problems asking in the demanded format since I did not really know where my problems came from. E. Morton tells me not to use range expressions. Well, I do not know exactly what they are. I found many questions in this forum like mine, with working answers.
Like: "How to print following line from a match" (e.g.)
But all solutions I found stop working when I process more than one input file. I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
while 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
Input file 1, print1.csv, contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same, just for illustration purposes.
The 1.awk (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from a match' seem to be ignored.
As said, I was told not to use range expressions. I do not know how, and maybe the problem is linked to the way I input several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: "If $1="row5" then write in $6="row5" and delete the value "row5" in $5"? In other words, to move the content "row5" in column 1, if found there, to a new column 6? I could do this with another awk, but a combination into one would be nicer
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5, replace $5 with the corresponding field number.
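Putting the two parts together, a combined one-liner could look like this (just a sketch based on the snippets above; it assumes the fields stay semicolon-separated, so FS and OFS are set accordingly):
awk -F';' -v OFS=';' 'FNR==1{p=0} $1=="row5"{$6=$5; $5=""} p; /row4/{p=1}' file1 file2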

How to parse the XML output from postgres as input for basex in Linux

How can I parse the XML output from Postgres as an input for Basex in Linux?
Oh, I see my answer is somewhat outdated; yet I'll leave it here, as in my opinion the approach you describe in your answer might be overkill for the task at hand.
I am not sure if you even have a question, yet I'd like to propose a fundamentally leaner approach ;-)
I hope it helps a little! Have fun!
For the current use case you may throw away awk, sed, Postgres and wget; you can do all that you need in 25 lines of XQuery:
1) Some basics, fetch a file from a remote server:
fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
2) Skip the first line.
I decided to use the header that came with the original file, but you could just as well drop it and supply your own column names:
fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
So in total, your requirements condense to:
RQ1:
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date },
attribute population { sum($record/POP_TOTAL) => format-number("0000000")}
}
RQ2:
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date },
attribute population { sum($record/POP_TOTAL) => format-number("0000000")},
for $sub_item in $record
group by $per-district := $sub_item/DISTRICT_CODE
return element district {
attribute name { $per-district },
attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")}
}
}
Including the file write and the date formatted in a more readable way:
(: wrap elements in single root element :)
let $result := element result {
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date => replace("^(\d{4})(\d{2})(\d{2})","$3.$2.$1")},
attribute population { sum($record/POP_TOTAL) => format-number("0000000")},
for $sub_item in $record
group by $per-district := $sub_item/DISTRICT_CODE
return element district {
attribute name { $per-district },
attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")},
$sub_item
}
}
}
return file:write("result.xml", $result)
Setup
Data source : http://www.wien.gv.at/statistik/ogd/vie_101.csv
Research questions (RQ):
RQ1: How many people lived in Vienna in total per census?
RQ2: How many people lived in each Viennese district per census?
Preparation
In order to answer the RQs, the Postgres DB was chosen. Adhering to the proverbial saying "Where there's a shell, there's a way", this code shows a neat solution for Bash (CLI, Debian/Ubuntu flavored). Also, it is much easier to interact with Postgres from Bash when creating the files needed for further processing. Regarding the installation process, please consult:
https://tecadmin.net/install-postgresql-server-on-ubuntu/
First download the file with wget:
cd /path/to/directory/ ;
wget -O ./vie_101.csv http://www.wien.gv.at/statistik/ogd/vie_101.csv ;
Then look at the file with your favorite spreadsheet program (LibreOffice Calc). vie_101 should be in UTF-8 encoding and probably uses a semicolon (;) delimiter. Open, check, change, save.
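A quick check from the shell works as well (a sketch; file and head are assumed to be available, which they normally are on Debian/Ubuntu):
file -i ./vie_101.csv      # reports MIME type and character encoding
head -n 3 ./vie_101.csv    # shows the header lines and the delimiter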
Some reformatting is needed for ease of processing down the line. First, a header file is created with the appropriate column names. Second, the downloaded file is "beheaded" (the first 2 rows are removed) and "cut" (down to the columns of interest). Finally, the result is appended to the header file.
echo 'DISTRICT,POPULATION,MALE,FEMALE,DATE' > ./vie.csv ;
declare=$(sed -e 's/,/ INT,/g' ./vie.csv)' INT' ;
sed 's/\;/\,/g' ./vie_101.csv | sed 's/\.//g' | tail -n+3 | cut -d ',' -f4,6-9 >> ./vie.csv ;
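For reference, $declare should now expand to the column declaration used in the next step; echoing it is a quick sanity check:
echo "$declare"
# DISTRICT INT,POPULATION INT,MALE INT,FEMALE INT,DATE INT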
Postgres
In order to load data into Postgres, a table schema needs to be created first:
echo "create table vie ( $declare );" | sudo -u postgres psql ;
In order to actually load data into Postgres, the previously created and formatted file (vie.csv) needs to be copied into a folder accessible to the postgres superuser. Only then can the copy command be executed to load the data into Postgres. Note that root privileges are required for this operation (sudo).
sudo cp ./vie.csv /var/lib/postgresql/ ;
echo "\copy vie from '/var/lib/postgresql/vie.csv' delimiter ',' csv
header ;" | sudo -u postgres psql ;
XML Schema
Before we create our XML document, we have to design the structure of our file. We decided to create an XML schema (schema.xsd) instead of a DTD.
Our schema defines a root element, table, and its child, row, which are complex elements. The row element can occur any number of times. The children of row are district, population, male, female and date. These 5 elements (siblings) are simple elements, and the defined value type is always an integer.
Create XML with Postgres
Since the ultimate goal is to answer the RQs via an XQuery, an XML file is needed. This file (xml.xml) needs to be correctly formatted and well formed. As the next step, the query_to_xml command is piped to psql; the -Aqt switches are used for:
-A [unaligned output mode: removes padding and the + continuation markers at line ends]
-q [quiet output]
-t [tuples only: removes header and footer]
echo "select query_to_xml( 'select * from vie order by date asc', true,
false, 'vie' ) ;" | sudo -u postgres psql -Aqt > ./vie_data.xml ;
Now, it is important to export the schema of the table with table_to_xmlschema().
echo "select table_to_xmlschema( 'vie', true, false, '') ;" | sudo -u
postgres psql -Aqt > ./vie_schema.xsd ;
This concludes all tasks within Postgres and Bash. As the last command, the BaseX GUI can be launched:
basexgui
XQuery
Using BaseX, the XML file can easily be validated against the schema via:
validate:xsd('vie_data.xml', 'vie_schema.xsd')
The XML file can be imported by clicking:
Database -> New
General -> Browse -> select the XML file.
Parsing -> turn on "Enable Namespaces" if it is not enabled.
OK
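If you prefer to stay on the command line instead of the GUI, the validation and the queries below can also be run with the standalone basex client (a sketch only; rq1.xq is a hypothetical file containing the query, and the exact options may differ between BaseX versions):
basex -q "validate:xsd('vie_data.xml', 'vie_schema.xsd')"
basex -i vie_data.xml -o population_year_total.xml rq1.xq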
RQ1 can only be answered by grouping the data by 'DATE' via a for loop. Results are saved via:
file:write( 'path/to/directory/file_name' ).
file:write( '/path/to/directory/population_year_total.xml',
for $row in //table/row
group by $date := $row/date
order by $date ascending
return <year_total date="{$date}"
population="{sum($row/population)}">
</year_total>)
RQ2 is answered by nesting two for loops. The outer loop groups by DATE and returns the POPULATION total for each given DATE. The inner loop groups by DISTRICT and hence returns a sub-sum of the POPULATION.
file:write( '/path/to/directory/district_year_subtotal.xml',
for $row in //table/row
group by $date := $row/date
order by $date ascending
return <sub_sum date="{$date}"
population="{sum($row/population)}">{
for $sub_item in $row
group by $district := $sub_item/district
order by $district ascending
return <sub_item district="{$district}"
population="{sum($sub_item/population)}"/>
}</sub_sum>)
Done

AWK: finding common elements across an arbitrary number of columns (either single-column files or a column matrix)

Problem
I have several files, each with a single column, and I want to compare them to one another to find which elements are contained in all of the files. Alternatively - if it is easier - I could make a column matrix.
Question
How can I find the common elements across multiple columns?
Request
I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.
Other
@joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b
to find what elements are contained across all files
awk is your friend, as you guessed. Use the procedure below:
#Store the files in an array. Assuming all files are in one place
filelist=( $(find . -maxdepth 1 -type f) ) #array of files
awk -v count="${#filelist[@]}" '{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"
Note
We used -v count="${#filelist[@]}" to pass the total file count to awk. Note that # at the beginning of an array expansion gives the element count.
value[$1]++ increments the count of a value as it is seen in a file. It also creates value[$1], with the initial value zero, if it does not already exist.
This method fails if a value appears in a file more than once.
The END block in awk is executed only at the very end, i.e. after every record from all the files has been processed.
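A variation of the same idea that guards against that failure mode by counting each value at most once per file (only a sketch, reusing the filelist array from above):
awk -v count="${#filelist[@]}" '
    FNR == 1    { split("", seen) }   # new input file: reset the per-file dedup array
    !seen[$1]++ { value[$1]++ }       # count a value only the first time it appears in this file
    END { for (i in value) if (value[i] == count) print i }
' "${filelist[@]}"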
If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.
A couple of variations with GNU awk (which is needed for ARGIND to be available. It could be emulated by checking FILENAME but that's even uglier.)
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }'
file1 file2 file3
The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts with one).
At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
gawk 'ARGIND > LASTIND {
LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] }
}
{ CURR[$0] = 1 }
END { for (x in CURR) ALL[x] += 1;
for (x in ALL) if (ALL[x] == ARGIND) print x
}' file1 file2 file3
Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all files.
The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).
In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
I'm going to assume that it's the problem that matters, not the implementation language, so here's an alternative using perl:
#! /usr/bin/perl
use strict;
my %elements=();
my $filecount=@ARGV;
while(<>) {
$elements{$_}->{$ARGV}++;
};
print grep {!/^$/} map {
"$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);
The while loop builds a hash-of-hashes (aka "HoH". See man perldsc and man perllol for details. Also see below for an example), with the top level key being each line from each input file, and the second-level key being the names of the file(s) that value appeared in.
The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files
Here's what the data structure looks like, using the example you gave to ilkkachu:
{
'A' => { 'file1' => 1 },
'B' => { 'file2' => 1 },
'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
'E' => { 'file2' => 1 },
'F' => { 'file1' => 1 },
'K' => { 'file3' => 1 },
'L' => { 'file3' => 1 }
}
Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.
The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately.
With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements, e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end because I didn't use chomp in the while loop, as I knew I'd be printing them directly. In most programs, I'd use chomp to strip newlines and/or carriage returns.

Create postfix aliases file from LDIF using awk

I want to create a Postfix aliases file from the LDIF output of ldapsearch.
The LDIF file contains records for approximately 10,000 users. Each user has at least one entry for the proxyAddresses attribute. I need to create an alias corresponding to each proxyAddress that meets the conditions below. The created aliases must point to sAMAccountName@other.domain.
Type is SMTP or smtp (case-insensitive)
Domain is exactly contoso.com
I'm not sure if the attribute ordering in the LDIF file is consistent. I don't think I can assume that sAMAccountName will always appear last.
Example input file
dn: CN=John Smith,OU=Users,DC=contoso,DC=com
proxyAddresses: SMTP:smith@contoso.com
proxyAddresses: smtp:John.Smith@contoso.com
proxyAddresses: smtp:jsmith@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/JOHNSMITH
sAMAccountName: smith
dn: CN=Tom Frank,OU=Users,DC=contoso,DC=com
sAMAccountName: frank
proxyAddresses: SMTP:frank@contoso.com
proxyAddresses: smtp:Tom.Frank@contoso.com
proxyAddresses: smtp:frank@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/TOMFRANK
Example output file
smith: smith@other.domain
John.Smith: smith@other.domain
frank: frank@other.domain
Tom.Frank: frank@other.domain
Ideal solution
I'd like to see a solution using awk, but other methods are acceptable too. Here are the qualities that are most important to me, in order:
Simple and readable. Self-documenting is better than one-liners.
Efficient. This will be used thousands of times.
Idiomatic. Doing it "the awk way" would be nice if it doesn't compromise the first two goals.
What I've tried
I've managed to make a start on this, but I'm struggling to understand the finer points of awk.
I tried using csplit to create separate files for each record in the LDIF output, but that seems wasteful since I only want a single file in the end.
I tried setting RS="" in awk to get complete records instead of individual lines, but then I wasn't sure where to go from there.
I tried using awk to split the big LDIF file into separate files for each record and then processing those with another shell script, but that seemed wasteful.
Here is a gawk script which you could run like this: gawk -f ldif.awk yourfile.ldif
Please note: the multicharacter value of `RS' is a gawk extension.
$ cat ldif.awk
BEGIN {
RS = "\n\n" # Record separator: empty line
FS = "\n" # Field separator: newline
}
# For each record: loop twice through fields
{
# Loop #1 identifies the sAMAccountName
for (i = 1; i <= NF; i++) {
if ($i ~ /^sAMAccountName: /) {
sAN = substr($i, 17)
break
}
}
# Loop #2 prints output lines
for (i = 1; i <= NF; i++) {
if (tolower($i) ~ /smtp:.*@contoso.com$/) {
split($i, n, ":|@")
print n[3] ": " sAN "@other.domain"
}
}
}
Here is a way to do it using standard awk.
# Display the postfix alias(es) for the previous user (if any)
function dump() {
for(i in id) printf("%s: %s@other.domain\n",id[i],an);
delete id;
}
# store all email names for that user in the id array
/^proxyAddresses:.[Ss][Mm][Tt][Pp]:.*@contoso.com/ {gsub(/^.*:/,"");gsub(/@.*$/,"");id[i++]=$0}
# store the account name
/^sAMAccountName:/ {an=$2};
# When a new record is found, process the previous one
/^dn:/ {dump()}
# Process the last record
END {dump()}
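Saved as, say, aliases.awk (a hypothetical file name), it can be run the same way as the gawk version above; with the example input it should print the four alias lines from the expected output, although the order within each record may differ because for (i in id) does not guarantee ordering:
awk -f aliases.awk yourfile.ldif > postfix_aliases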