Full text address matching - SQL

I'm looking for duplicate records. I have a Property table with the fields street, number, city, state, county and zip. The records get geo-coded based on location, but there are some holes in the data. The problem is that if someone makes a simple typing error or omits certain fields, the records won't come up as matches.
As of now, a straight = comparison and LIKE aren't doing a very good job, but Jaro-Winkler and similar edit-distance algorithms run with extremely poor performance.
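A common workaround, sketched here under assumptions that are not in the question (a surrogate id column on Property, and a database that provides SOUNDEX, such as SQL Server or MySQL), is to block candidate pairs on cheap exact matches and run the expensive edit-distance comparison only within each block:

-- Cheap blocking pass: only rows sharing a zip, house number and street
-- phonetic code become candidate pairs for the expensive fuzzy comparison.
SELECT a.id AS left_id, b.id AS right_id
FROM   Property a
JOIN   Property b
  ON   a.zip = b.zip
 AND   a.number = b.number
 AND   SOUNDEX(a.street) = SOUNDEX(b.street)
WHERE  a.id < b.id;   -- each unordered pair appears only once

Jaro-Winkler (or whatever fuzzy measure you prefer) then only has to score the pairs this query returns, rather than every pair in the table.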

The CASS-Certified Scrubbing Service from SmartyStreets offers deduplication as part of its address verification process. Just upload the data in a delimited text file and the duplicates will be marked in the output file you download. There's always a free preview for each file you process, so you don't have to purchase anything before you're satisfied with the results. I'm a software developer for SmartyStreets and helped write the application; I'm pretty pleased with both its functionality and ease of use. We also have an API you could use, but the deduplication would be your responsibility (just compare the full 12-digit Delivery Point Barcode, which serves as a unique identifier for addresses).
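If you go the API route, the barcode comparison is a single grouping query, assuming (this is not in the original answer) that the verified barcode has been written back to a dpbc column on Property:

-- Any Delivery Point Barcode appearing more than once marks a set of duplicate addresses.
SELECT dpbc, COUNT(*) AS copies
FROM   Property
GROUP  BY dpbc
HAVING COUNT(*) > 1;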


I need to store HTML emails in a database. Is that a bad idea?

The templates for these HTML emails are all the same, but there are just different variables for say, first name, last name and such.
Would it just make sense to store the most minimal data that I need, and load the template in and replace the variables every time?
Another option would be to actually create the HTML file and store a reference to it, which would probably be the easiest to do, except that managing the files might be a pain, and it adds complexity with regard to migration, file permissions, et cetera.
Looking for opinions from people who've done this before...
GOAL/PURPOSE/USE:
I have a booking engine. When users make a booking, they are sent a confirmation email, generated from the sessionized booking data.
This email provides a "Cannot view this email? See it here" link which provides a web view of the email, in addition to a plaintext view.
I need to display the same email that was sent out, in addition to the plaintext view.
The template is subject to change, but I think because of that very fact I should have a table of templates and map the data to a template.
That's what I would do, because the template layout may change over time, but the person information should remain the same. So it makes sense to just store the person information in the database and leave the template out of the database.
In fact, it would be even better if you used a template engine such as Velocity (in Java) to construct your HTML emails; it's very easy, by the way.
On the one hand, CPU is more expensive than memory, so it is often better to store more data to reduce the CPU power spent on computation.
But in your case I would save the minimal data, the emails or whatever you are trying to save, because it allows you to easily remodel your templates and to reuse the data in multiple places in your application.
You would be persisting redundant data (especially because of the template), which is in no way normalized. I would not suggest doing that. But, as mentioned in the comment, what you want to do with that data matters.
If you only save the data you need, you could, for example, easily exchange that template for another one.
Yes, you're right on track. I did a similar thing; all dynamic/runtime variables started with a ## symbol.
So in the database you would have one Template table, one table for dynamic/runtime variables, and one table for the mapping between templates and dynamic/runtime variables.
tblTemplate - TemplateID, TemplateValue
tblRuntimeVariables - RuntimeVariableID, VariableString, VariableSQL
tblMapping - TemplateID, RuntimeVariableID, RuntimeVariableValue
The advantage of the extra mapping table is that adding new dynamic variables to an existing template requires no change to the existing schema; only more rows are added to tblMapping.
In my case I also had one extra column in tblRuntimeVariables for storing a SQL statement, in case the value for a runtime variable is fetched from the database.
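A minimal DDL sketch of those three tables; the column types and foreign keys are assumptions, since the original only lists column names:

CREATE TABLE tblTemplate (
    TemplateID    INT PRIMARY KEY,
    TemplateValue TEXT NOT NULL                -- the HTML containing ##placeholders
);

CREATE TABLE tblRuntimeVariables (
    RuntimeVariableID INT PRIMARY KEY,
    VariableString    VARCHAR(100) NOT NULL,   -- e.g. '##FirstName'
    VariableSQL       TEXT NULL                -- optional SQL used to fetch the value
);

CREATE TABLE tblMapping (
    TemplateID           INT NOT NULL REFERENCES tblTemplate(TemplateID),
    RuntimeVariableID    INT NOT NULL REFERENCES tblRuntimeVariables(RuntimeVariableID),
    RuntimeVariableValue TEXT NULL,
    PRIMARY KEY (TemplateID, RuntimeVariableID)
);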

How does enterprise search display results for the user and hide unauthorized results?

I am looking to understand how enterprise search solutions tackle the issue of user-permissions.
My question is about displaying search results to users. The naive approach would display the search results to the user, and then, if the user clicks a document he is not authorized to see, he will simply fail to open it. However, it is forbidden even to display a document's title or excerpt if the user does not have permission to read it. So do the various enterprise search engines:
1. index each document together with its ACL, or
2. index all documents with no permission info, but check each link in every search result to see whether the querying user has permission to view it?
Option #2 makes more sense to me, but also seems much slower than option #1.
Option #1 suffers from the need to constantly update the changes in permissions on the indexed documents.
I am looking to understand what is the common approach in the existing solutions in the market today. Is there a third option?
I'm surprised to see that this 5-year-old question hasn't received any answers, as I think it's quite a common and important problem in enterprise search.
As outlined in the question there are two common approaches to deal with document-level security:
early-binding security: indexing ACLs along with the content, and
late-binding security: handling security at query time, by filtering out protected results.
Handling security only on the content side is never recommended, as by that point confidential information might already have been revealed (e.g. the title or a preview of a document in the search results).
The advantage of implementing security with a late-binding approach is that it's very flexible, because there is no need to re-index content when ACLs change. The biggest drawback, however, is that confidential information might be leaked via facet values, and it's not possible to retrieve and display correct facet counts. It is also more difficult to properly populate the result list and handle pagination. Last but not least, this approach can significantly slow down query performance.
The advantage of implementing security with an early-binding approach is that it addresses all of the above disadvantages, at the price of re-indexing the content as soon as ACLs change. However, leaks are still possible, e.g. when a group membership or ACL has just changed and isn't yet reflected in the search index. To close this gap, the early-binding and late-binding approaches are often combined.
Last but not least, there might be a third option, depending on the Enterprise Search Platform you are using: Attivio's Active Security is based on query-time joins, which allow security information to be indexed independently of the document itself; at query time the two documents are merged to ensure that only authorised content makes it into the search results.
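Purely as an illustration of the late-binding/query-time idea (relational tables standing in for a search index; every name here is hypothetical), the filter amounts to joining candidate hits against the current ACLs before anything is rendered:

-- Keep only the hits the querying user is allowed to see, evaluated at query time.
SELECT h.doc_id, h.title
FROM   candidate_hits h
JOIN   doc_acl     a ON a.doc_id   = h.doc_id
JOIN   user_groups g ON g.group_id = a.group_id
WHERE  g.user_id = 42;   -- the querying user

With early binding, the same group information is baked into the index instead, which is why the index must be refreshed whenever an ACL or group membership changes.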

Designing a database with a flexible user profile

I am working on a design where I can have flexible attributes for users, and I am confused about how to continue the design of the schema.
I made a table where I kept system needed information:
Table name: users
id
username
password
Now I wish to create a profile table, in a one-to-one relation with users, holding all the other attributes such as email, first name, last name, etc. My question is: is there a way to add a third table in which profiles will be flexible? In other words, if my client needs to create a new attribute, he/she won't need any customization to the code.
You're looking for a normalized table: that is, a table with user_id, key and value columns, which produces a 1:N relationship between User and this new table. Look into http://en.wikipedia.org/wiki/Database_normalization for a little more information. Performance isn't amazing with normalized tables, and it can take some interesting planning to optimize your code, but it's a very standard practice.
Keep the fixed parts of the profile in a standard table to make it easy to query, add constraints, etc.
For the configurable parts it sounds like you are looking for an entity-attribute-value model. The extra configurability comes at a high cost though: everything will have to be stored as strings and you will have to do any data validation in the application, not in the database.
How will these attributes be used? Are they simply a bag of data or would the user expect that the system would do something with these values? Are there ever going to be any reports against them?
If the system must do something with these attributes then you should make them columns since code will have to be written anyway that does something special with the values. However, if the customers just want them to store data then an EAV might be the ticket.
If you are going to implement an EAV, I would suggest adding a DataType column to your attributes table. This enables you to do some rudimentary validation on the entered data and dynamically change the control used for entry.
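A minimal EAV sketch along those lines, assuming the users table above; the table names, column types and data_type values are illustrative:

CREATE TABLE profile_attributes (
    attribute_id INT PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    data_type    VARCHAR(20)  NOT NULL      -- e.g. 'string', 'int', 'date'
);

CREATE TABLE profile_values (
    user_id      INT NOT NULL REFERENCES users(id),
    attribute_id INT NOT NULL REFERENCES profile_attributes(attribute_id),
    value        VARCHAR(4000) NULL,        -- everything is stored as a string
    PRIMARY KEY (user_id, attribute_id)
);

Adding a new attribute is then just an INSERT into profile_attributes, with no schema change.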
If you are going to use an EAV, then the one rule you must follow is to never write any code where you specify a particular attribute. If these custom attributes are nothing more than a wad of data, then an EAV for this one portion of your system will work. You could even consider creating an XML column to store these attributes. SQL Server actually has an XML data type but all databases have some form of large text data type that will also work. On reports, the data would only ever be spit out. You would never place specific values in specific places on reports nor would you ever do any kind of numerical operation against the data.
The price of an EAV is vigilance and discipline. You have to have discipline among yourself, the other developers and especially the report writers to never filter on a specific attribute, no matter how much pressure you get from management. The moment a client wants to filter or do operations on a specific attribute, it must become a first-class attribute as a column. If you feel that this kind of discipline cannot be maintained, then I would simply create columns for each attribute; that would mean an adjustment to code, but it will create less of a mess down the road.

How should I format user uploaded pictures' filenames?

My website deals with pictures that users upload. I'm kind of conflicted about what my picture filenames should consist of. I'm mainly worried about scalability, and possibly security. Maybe someone out there deals with the same thing and can tell me what they use on their site?
Currently, my filename convention is
{pictureId}_{userId}_{salt}_{variant}.{fileExt}
where salt is a token generated server-side (not sure why I decided to put this here, maybe for security purposes, I don't know) and variant is something like t, which signifies it's a thumbnail. So it would look something like
12332_22_hb8324jk_t.jpg
Please advise, thanks.
In addition to the previous comments, you may want to consider creating a directory hierarchy for your files. Depending on volume and the particular OS hosting the files, you can easily reach a point where you have an unreasonably large number of files in a single directory. There may be limits on the number of files allowed per folder. If you ever need to do any manual QA or maintenance on your files, this may be problematic (especially if such maintenance is not scripted).
I once worked on a project with a high volume of images. We decided to record a subpath in our database in addition to the filename of each file. Our folder names looked like this:
a/e/2/f/9
3/3/2/b/7
Essentially, we created folders 5 deep with a single hex value as the folder name. The depth was probably excessive, but effective. I suppose this could have led to us reaching a limit on the number of folders on a volume (not sure if such a limit exists).
I would also consider storing a drive in addition to a path (assuming you have a bunch of disks for storage). This way you can move images around and then update your database (assuming you have one) as part of the move.
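A minimal sketch of that bookkeeping; all table and column names and sizes are assumptions:

-- Record where each image physically lives, so files can be moved later
-- and the database updated as part of the move.
CREATE TABLE picture_files (
    picture_id INT PRIMARY KEY,
    drive      VARCHAR(16)  NOT NULL,    -- drive letter or mount point
    subpath    VARCHAR(32)  NOT NULL,    -- e.g. 'a/e/2/f/9'
    filename   VARCHAR(255) NOT NULL
);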
My 2 pence worth: there is a bit of a conflict between scalability and security in this problem, I would say.
If you have real security concerns, then you should not rely at all on the filename of the target image: this is just security by obscurity, and somebody could eventually guess the name (even with your salt idea, which makes it harder).
Instead, you should at least have a login mechanism that creates a session between client and server, to make sure content can only be fetched once the user has authenticated. Even then the traffic is sniffable; if security really is a concern, then I would say you have to use SSL.
Regarding scalability: I would suggest you actually do give your images sequential numbers and store them in 'bins' of (say) 500 images each. As you fill up a bin, create a new one. Store the bin information (min-image-id, max-image-id) in one DB table and the image numbers in another: you can then comparatively cheaply find which bin a particular image lives in from its id. This is a fairly common solution for storing lots of docs/images.
You could then map your URLs to the bin + image id, but then, to avoid the problem noted by Jason Williams (sequential numbering makes it easy to probe), you really should address security separately, as in point 1.
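A minimal sketch of those two tables and the bin lookup; names and types are assumptions:

CREATE TABLE image_bins (
    bin_id       INT PRIMARY KEY,
    min_image_id INT NOT NULL,
    max_image_id INT NOT NULL
);

CREATE TABLE images (
    image_id INT PRIMARY KEY,
    user_id  INT NOT NULL,
    filename VARCHAR(255) NOT NULL
);

-- Which bin does image 12332 live in?
SELECT bin_id
FROM   image_bins
WHERE  12332 BETWEEN min_image_id AND max_image_id;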
You might like to consider replacing the underscores with (e.g.) minuses. (Underscores are used as wildcards in SQL, so you could potentially run into trouble one day in a LIKE comparison). (And of course, underscores are just plain evil :-)
It looks from your example like you're avoiding spaces and upper-case characters - good move. I'd keep everything lowercase and use case-insensitive comparisons to eliminate any potential case-sensitivity issues with different file systems.
Scalability should be fine as long as you can cope with any number of digits in your user, picture and type IDs. You're very unlikely to hit any filename length limits with this scheme.
Security could be an issue if you use sequential IDs, as someone could potentially tweak the numbers and request a picture they shouldn't be able to access - but the salt should make it virtually impossible for someone to guess the correct filename for another picture. If users can't see/access the internal filename in any way, that may be an unnecessary measure though.
The first thing to do is to setup a directory structure that models your use case. In your case you have a user that uploads a picture. You would probably have a directory structure like this (probably on a network share somewhere):
-Pictures
    -UserID1
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
    -UserID2
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
Pictures - simply the root directory for the following.
UserID - the database user ID.
PictureID is simply the picture ID from the database (assuming you record the filename of each uploaded picture in a database.)
~^~ - This is simply a delimiter. You can use a one-character or multi-character sequence. I like three characters, as the name is easily handled with a split function and the delimiter is readily distinguishable in the file name.
Sometimes I like to add the size of the picture into the file name, e.g. .256.jpg or .1024.jpg.
At any rate, all of this depends on your use case. The most important thing is setting up the directory structure properly. That will make it easier to access/serve and manage the pictures.
You can add any other information you need into the filename as long it doesn't exceed the maximum filename length on your system.

What is a manageable way to store e-mails for extended periods of time?

If you have a site which sends out emails to the customer, and you want to save a copy of the mail, what is an effective strategy?
If you save it to a table in your database (e.g. create a table called Mail), it gets very large very quickly.
Some strategies I've seen are:
Save it to the file system
Run a scheduled task to clear old entries from the database - but then you wind up not having a copy;
Create a separate table for each time frame (one each year, or one each month)
What strategies have you used?
I don't agree that Gmail is an effective backup for business data.
Why trust your business information to a provider who makes no guarantees of service, and over whom you have no control whatsoever?
Makes no sense to me.
Depending on how frequently you need to access this information, I'd say go with the filesystem or database archive. At least that way, you have control over your own data.
Data you want to save belongs in a database. The only justified exception is large binary data (images, videos). Who cares how large the table gets? If the mails are automated and template-based, you only have to save the variable parts anyway. The size will be about the same wherever you save it, but you probably already have a mechanism to back up your database, so you won't have to invent one to handle millions of files.
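A minimal sketch of saving only the variable parts; the table name, column types and the choice of a single text column for the variables are assumptions:

CREATE TABLE sent_mail (
    mail_id     INT PRIMARY KEY,
    template_id INT NOT NULL,             -- which template the mail was rendered from
    recipient   VARCHAR(255) NOT NULL,
    variables   TEXT NOT NULL,            -- serialized name/value pairs for the placeholders
    sent_at     TIMESTAMP NOT NULL
);

Re-rendering the template with the stored variables reproduces the mail that was sent, as long as old template versions are kept.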
Lots of assumptions:
1. You're running Windows / would like an archive on Windows.
2. The ability to search the mails is important.
Since you are sending mails to your customers, there isn't any reason you can't BCC a mail account of your own. Assuming you have a suitable account on your own server, I'd look at using MailStore Home to pull the mails out of that account and put them into its own compressed database.
Another option (depending on the email content) is to not save the email, but make sure you can recreate the email by archiving the original content that went into generating the email.
It depends on the content of your email. If it contains large images, I would plump for the file system. Otherwise, if your Mail table is getting very large very quickly, I would go for the separate-table approach, archiving off dead customers.
We save the email to a database table. It really doesn't get that big that quickly. We have a table with 32,000 emails in it (they're biggish emails too, at around 50 KB per email) and, with compression, the file only uses 16 MB.
If you're sending a shedload of email, then know that Gmail (free) currently only allows 7 GB of data. I'd be happy holding that on a disk.
I'd think about putting in place some sort of general archiving functionality. How you implement that depends on your specific retrieval needs.
For example, if you just wish to retrieve emails sent to a particular customer in a certain month, then storing them in an appropriate hierarchy on the file system (zipped up if necessary) should be simple to do. You might want to record a list of sent emails in a database table with a pointer to the appropriate directory, but a naming convention for your directories and files might be sufficient.
You might need to access very old emails only very infrequently, so you could archive these to DVD, for example, if online storage is a problem.
If you want to search the actual content of emails often, then you're going to have to put the content in a DB table or use an indexer like Lucene to examine the files stored on disk.