LDAP search for multiple complete DNs? - ldap

Assume I have an array of N DNs (distinguished names), e.g.:
cn=foo,dc=capmon,dc=lan
cn=bar,dc=capmon,dc=lan
cn=Fred Flintstone,ou=CapMon,dc=capmon,dc=lan
cn=Clark Kent,ou=yada,ou=whatnot,dc=capmon,dc=lan
They are not related and I cannot reduce/simplify the search. I have N complete DNs and want N records.
Can I write a single LDAP search that will return exactly N records, one for each DN? The assumption being that performance of both client and server will be better if I do it all in one search. Had it been SQL, it would be:
SELECT *
FROM dc=capmon,dc=lan
WHERE dn IN (
"cn=foo,dc=capmon,dc=lan",
"cn=bar,dc=capmon,dc=lan",
"cn=Fred Flintstone,ou=CapMon,dc=capmon,dc=lan",
"cn=Clark Kent,ou=yada,ou=whatnot,dc=capmon,dc=lan"
)
rather than doing individual LDAP searches in a for loop (which I do know how to do).
I tried against an MS Active Directory. There, all entries (seem to) have a distinguishedName attribute, and a search filter like this works (I added some newlines for readability):
(|
(distinguishedName=cn=ppolicy,dc=capmon,dc=lan)
(distinguishedName=cn=Users,dc=capmon,dc=lan)
<more ORed terms>
)
But this doesn't work:
(|
(dn=cn=ppolicy,dc=capmon,dc=lan)
(dn=cn=Users,dc=capmon,dc=lan)
<more ORed terms>
)
even though the returned records look like they contain dn attributes. :-(
An OpenLDAP server's records don't have distinguishedName attributes, and neither of the filters above works against it.
Can I do something that will work against most major LDAP servers?

It's not possible to "Read" several entries in a single operation.
You can do a single search operation that will match and return several entries, but you cannot search on the "DN" itself.
I've seen several applications that try to get several entries by using complex filters such as "(|(cn=foo)(cn=bar)(cn=Fred Flintstone))", but this may return more entries than expected unless all CN values are unique. It's not really a good practice either, as there are limits on the number of elements you can have in a filter, and such requests are usually not optimized in terms of I/O.
It will be faster to read each individual entry, as LDAP servers are optimized for such operations. If you want to reduce the latency, you can issue multiple asynchronous search operations on the same connection.
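As a rough illustration of "reading each entry", the sketch below plans one base-scope search per DN, using the DN itself as the search base and the match-everything filter (objectClass=*). The names EntryRead and plan_entry_reads are made up for this example, and no particular LDAP client library is assumed; the resulting parameters would be handed to whichever client you use, ideally as asynchronous requests on one connection.

```python
from typing import Iterable, List, NamedTuple


class EntryRead(NamedTuple):
    """Parameters for one 'read this exact entry' search (hypothetical type)."""
    base: str           # the DN to read, used as the search base
    scope: str          # "base" means: look only at the base entry itself
    search_filter: str  # (objectClass=*) matches any entry


def plan_entry_reads(dns: Iterable[str]) -> List[EntryRead]:
    """Build one base-scope search per DN instead of one filter over all DNs."""
    return [
        EntryRead(base=dn, scope="base", search_filter="(objectClass=*)")
        for dn in dns
    ]


reads = plan_entry_reads([
    "cn=foo,dc=capmon,dc=lan",
    "cn=Fred Flintstone,ou=CapMon,dc=capmon,dc=lan",
])
for read in reads:
    print(read)
```

Because the DN is the search base rather than part of the filter, this does not depend on a server-specific attribute such as distinguishedName.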

Related

When clustering with OpenRefine, is there a way to "exclude" a string from a cluster? Right now it feels like it either clusters everything or nothing

When using the clustering function in OpenRefine, you can select the "Merge?" option to merge the strings that were grouped together by the method of your choice. But what if the method clusters most of them correctly, except for one string that I manually identify as not belonging in the cluster? Is there a way to exclude that specific string from the rest of the cluster?
Unfortunately there is not currently a way of excluding or selecting a subset of terms from a cluster. The only two options I can think of are:
a) Modify the clustering algorithm you are using to try to get better clustering which doesn't include the incorrect terms.
b) Go to 'browse cluster' and mark the rows with the terms you don't want to have in the cluster (e.g. by flagging the rows), exclude the flagged rows in a facet and re-cluster. This will then not include any of the terms you didn't want.

lucene index match

I am trying to use Lucene for doing undup or dedup match. Essentially I have a file with records which I want to group based on certain fields (fuzzy search) and get back a result with a match key that tells me which records within that file matched to each other.
Is this possible?
This can be done (if I understand this correctly). You would index the terms/records that will be searched on in one pass. In the second pass, you will search for each term and log the results.
While pre-processing the document you can generate a hash that aggregates those fields and store it (as NOT_ANALYZED). This way you just have to search by one field with a known size; take a look at MessageDigest. This is what I normally do for duplicate detection of file content (since the content might be too big for a single query).
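The hashing idea can be sketched outside Lucene as follows. This is only an illustration in Python, with hashlib standing in for Java's MessageDigest; the function names, the normalization (lower-casing and collapsing whitespace) and the sample field names are assumptions. Note that a hash key only groups records whose normalized fields are identical, so it covers exact duplicates rather than true fuzzy matches.

```python
import hashlib
from collections import defaultdict


def match_key(record, fields):
    """Hash the normalized values of the chosen fields into one fixed-size key."""
    normalized = [
        " ".join(str(record.get(field, "")).lower().split())
        for field in fields
    ]
    # Join with a separator that is unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same key.
    joined = "\x1f".join(normalized)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def group_duplicates(records, fields):
    """Return lists of record indexes that share the same match key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[match_key(record, fields)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


records = [
    {"name": "John  Smith", "city": "Paris"},
    {"name": "john smith", "city": "PARIS"},
    {"name": "Jane Doe", "city": "Rome"},
]
print(group_duplicates(records, ["name", "city"]))
```

In a Lucene index the value returned by match_key would be stored in its own non-analyzed field, so finding candidate duplicates becomes a single exact-term lookup.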
If what you are looking for is creating a more complex query, try using CachingWrapperFilter, this way subsequent calls to your deduplication algorithm will be much faster.

LDAP filter boolean expression maximum number of arguments

I was writing a small test case to see what's more efficient, multiple small queries or a single big query, when I encountered this limitation.
The query looks like this:
(| (clientid=1) (clientid=2) (clientid=3) ...)
When the number of clients goes beyond 2103 ?! the LDAP server throws an error:
error code 1 - Operations Error
As far as I can tell the actual filter string length does not matter ~69KB (at least for Microsoft AD the length limit is 10MB). I tried with longer attribute names and got the same strange limit: 2103 operands
Does anyone have more information about this limitation?
Is this something specified in the LDAP protocol specification or is it implementation specific?
Is it configurable?
I tested this against IBM Tivoli Directory Server V6.2 using both the UnboundID and JNDI Java libraries.
It cannot be more than 8099 characters. See http://www-01.ibm.com/support/docview.wss?uid=swg21295980
Also, what you are doing is not a good practice. If these entries share common attributes (e.g., country code, department number, location, etc.), try to retrieve the results using criteria based on those attributes. If not, divide your search filter into smaller ones, each with only a few predicates, and execute multiple searches. How you do this depends on the programming language you're using, but try to execute each search in a separate thread to speed up your data retrieval process.
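Splitting one huge OR filter into several smaller ones could look roughly like this. The helper names and the default chunk size of 500 are arbitrary choices for the example, not values taken from any server documentation; the escaping follows the filter string rules of RFC 4515. Each generated filter would then be sent as its own search, possibly in parallel.

```python
from typing import Iterable, Iterator

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape special characters so the value cannot break the filter syntax."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def chunked_or_filters(attr: str, values: Iterable, chunk_size: int = 500) -> Iterator[str]:
    """Yield OR filters, each holding at most chunk_size (attr=value) terms."""
    batch = []
    for value in values:
        batch.append(f"({attr}={escape_filter_value(str(value))})")
        if len(batch) == chunk_size:
            yield "(|" + "".join(batch) + ")"
            batch = []
    if batch:  # emit the final, possibly shorter, chunk
        yield "(|" + "".join(batch) + ")"


for ldap_filter in chunked_or_filters("clientid", range(1, 6), chunk_size=2):
    print(ldap_filter)
```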

What do people use for CN with inetOrgPerson in LDAP directories

I've been using givenName+" "+surname for the CN field and I woke up screaming last night 'what about John Smith'? I can imagine any large organization employing multiple people with the same name. So of course this isn't going to work. What do people use instead?
EDIT Note: in inetOrgPerson the CN is part of the DN.
EDIT Note: in this situation I am expecting to grow to hundreds of thousands of user entries.
In an LDAP directory, whether it's OpenLDAP or Active Directory, the rule is that a Distinguished Name (DN) must be unique, independently of the attribute (or attributes) used to form the Relative Distinguished Name (RDN).
How do people make sure that it's unique?
I would say that in a small business the person who creates the entry in the directory guarantees that it's unique, first by knowledge, second by a preliminary search. If a duplicate appears, he finds some solution like 'John E Smith'. With this solution, if the name changes (marriage, divorce, etc.), the LDAP record has to "move" from one DN to another. It's better to avoid changing the DN of an entry whenever possible, but in a small directory it's not important.
In a medium business the uniqueness is most of the time given by the employee ID coming from human resources, for example FR12345678. I have seen, in big companies, people logging in with their employee ID. For the case I describe here, it's more standard to use the uid attribute instead of cn to name an object (but some directories don't let you choose the naming attribute; I think it's an X.500 feature).
In most directories (not in AD) you can use more than one attribute to compose the RDN. For example sn=Assin+TelephoneNumber=1234 is a valid RDN in OpenLDAP, and it can make sense in a PBX.
One more thing
In some directories (designed for system administration) some attributes are checked by the server as unique across the whole tree. That's the case for sAMAccountName and userPrincipalName in Active Directory, which are used for logging in. Using the CN attribute with "given name + surname" obliges the administrators to guarantee uniqueness themselves. In OpenLDAP you can have the server enforce uniqueness of an attribute with the unique overlay. In the database definition in slapd.conf, add:
# index since the unique overlay will search for matching mail attributes
index mail eq
overlay unique
unique_attributes mail
If the unique overlay is not compiled in, you'll need to recompile with:
./configure ... --enable-unique
Adding to JPBlanc's answer with some of my experience. We have several ldap servers/trees where I work. Our AD server is using the DisplayName as the value of the CN. Out of 4K+ users we have only had a few instances where duplicates have occurred. I believe the default action there is to tack a 1 on the value if there is a dupe. It is surprisingly rare even with a high turn over rate in the largest section of that user base. We have two different e-directory trees that are linked to each other and those use the username. Username is first initial + last name. Any duplicates there have an incrementing number attached to them. As you can imagine that happens a lot with the Browns and the Smiths and other common names. Another tree that is an ADLDS (formerly ADAM) directory uses a uniquely generated number for each new entry as the CN. It is basically an auto-incremented number that is controlled by an external loading process. Lastly we have a directory for external partners (think independent agents) that uses a combination of email address + an id number as the CN.
I do a lot of maintenance work on the user bases and my least favorite scheme is the externally generated number. If I get a support call about Joe Brown in all of the other systems I can at least have an idea of where I need to browse to find him. Sure a simple search filter will give me all of the Browns but I still have to write it and execute it. So my advice is to use some part of the name for the CN and ensure uniqueness somehow. From an administration point of view it will be a bit easier. Really the CN is important but you'll be dealing with the rest of the user attributes far more so don't sweat it too badly.
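The "first initial + last name, with an incrementing number for duplicates" scheme mentioned above can be sketched as below. The function name and the set of already-taken names are hypothetical, and starting the suffix at 1 is an assumption. In a real directory the check and the add would not be atomic, so the server (for example through a uniqueness constraint) should remain the final authority.

```python
def unique_username(given_name: str, surname: str, taken: set) -> str:
    """Return first initial + surname, adding 1, 2, 3... if the name is taken."""
    base = (given_name[:1] + surname).lower().replace(" ", "")
    if base not in taken:
        return base
    suffix = 1
    while f"{base}{suffix}" in taken:
        suffix += 1
    return f"{base}{suffix}"


taken = {"jsmith", "jsmith1"}
print(unique_username("John", "Smith", taken))
```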

Compound Queries with Redis

For learning purposes I'm trying to write a simple structured document store in Redis. In my example application I'm indexing millions of documents that look a little like the following.
<book id="1234">
<title>Quick Brown Fox</title>
<year>1999</year>
<isbn>309815</isbn>
<author>Fred</author>
</book>
I'm writing a little query language that allows me to say YEAR = 1999 AND TITLE="Quick Brown Fox" (again, just for my learning, I don't care that I'm reinventing the wheel!) and this should return the IDs of the matching documents (1234 in this case). The AND and OR expressions can be arbitrarily nested.
For each document I'm generating keys as follows
BOOK_TITLE.QUICK_BROWN_FOX = 1234
BOOK_YEAR.1999 = 1234
I'm using SADD to plop these documents in a series of sets in the form KEYNAME.VALUE = { REFS }.
When I do the querying, I parse the expression into an AST. A simple expression such as YEAR=1999 maps directly to a SMEMBERS command which gets me the set of matching documents back. However, I'm not sure how to most efficiently perform the AND and OR parts.
Given a query such as:
(TITLE=Dental Surgery OR TITLE=DIY Appendectomy)
AND
(YEAR = 1999 AND AUTHOR = FOO)
I currently make the following requests to Redis to answer these queries.
-- Stage one generates the intermediate results and returns RANDOMLY_GENERATED_KEY3
SUNIONSTORE RANDOMLY_GENERATED_KEY1 BOOK_TITLE.DENTAL_SURGERY BOOK_TITLE.DIY_APPENDECTOMY
SINTERSTORE RANDOMLY_GENERATED_KEY2 BOOK_YEAR.1999 BOOK_AUTHOR.FOO
SINTERSTORE RANDOMLY_GENERATED_KEY3 RANDOMLY_GENERATED_KEY1 RANDOMLY_GENERATED_KEY2
-- Retrieving the top level results just requires the last key generated
SMEMBERS RANDOMLY_GENERATED_KEY3
When I encounter an AND I use SINTERSTORE based on the two child keys (and similarly for OR I use SUNIONSTORE). I randomly generate a key to store the results in (and set a short TTL so I don't fill Redis up with cruft). By the end of this series of commands the return value is a key that I can use to retrieve the results with SMEMBERS. The reason I've used the store functions is that I don't want to transport all the matching document references back to the server, so I use temporary keys to store the result on the Redis instance and then only bring back the matching results at the end.
My question is simply, is this the best way to make use of Redis as a document store?
I'm using a similar approach with sorted sets to implement full text indexing. The overall approach is good, though there are a couple of fairly simple improvements you could make.
Rather than using randomly generated keys, you can use the query (or a short form thereof) as the key. That lets you reuse the sets that have already been calculated, which could significantly improve performance if you have queries across two large sets that are commonly combined in similar ways.
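One possible way to derive the destination key from the query itself is sketched below. It only compiles a nested query into the list of Redis commands to send and does not talk to a server. The tuple-based AST, the QUERY. key prefix and the SHA-1 digest truncated to 12 characters are all assumptions made for this example. Child keys are sorted so that A AND B and B AND A end up with the same key.

```python
import hashlib


def leaf_key(field, value):
    """Key of the set holding the document IDs for one FIELD=VALUE pair."""
    return f"BOOK_{field.upper()}.{value.upper().replace(' ', '_')}"


def compile_query(node, commands):
    """Append the store commands needed for node and return its result key.

    node is ("EQ", field, value) or ("AND" | "OR", child, child, ...).
    """
    op = node[0]
    if op == "EQ":
        _, field, value = node
        return leaf_key(field, value)  # leaf sets already exist, no command
    child_keys = sorted(compile_query(child, commands) for child in node[1:])
    redis_command = {"AND": "SINTERSTORE", "OR": "SUNIONSTORE"}[op]
    # Deterministic key: the same sub-query always maps to the same key.
    digest = hashlib.sha1(" ".join([op] + child_keys).encode("utf-8")).hexdigest()[:12]
    destination = f"QUERY.{digest}"
    command = (redis_command, destination, *child_keys)
    if command not in commands:  # do not recompute a repeated sub-query
        commands.append(command)
    return destination


query = ("AND",
         ("OR", ("EQ", "title", "Dental Surgery"), ("EQ", "title", "DIY Appendectomy")),
         ("AND", ("EQ", "year", "1999"), ("EQ", "author", "Foo")))
commands = []
result_key = compile_query(query, commands)
for command in commands:
    print(" ".join(command))
print("SMEMBERS", result_key)
```

Before issuing a store command, the client could check whether the destination key already exists and skip the computation. A TTL on the cached keys is still needed so that stale results disappear after the underlying sets change.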
Handling title as a complete string will result in a very large number of single member sets. It may be better to index individual words in the title and filter the final results for an exact match if you really need it.