I need to match about 25 million address records against about 200,000 other address records. I would also like to allow a small degree of fuzziness, so comparing for exact matches alone is pretty much out. The addresses are parsed into components in both data sets, and both are stored in a SQL Server 2008 database.
I had an idea to do the comparisons in batches (grouping the batches by state) until I reached the end, dumping matches into a temporary database. This would be done in a .NET application, but I don't think it is very efficient, since I have to pull the data from SQL Server into the application and iterate over it one row at a time. Threading could speed up the process, but I don't know by how much.
I also thought about indexing the 25 million records into a Lucene index and using its filtering to narrow down potential matches.
Is either of these a good approach? What other options are there?
For a first pass, do an exact match on the parsed components.
For fuzzy matching you can use Levenshtein distance.
Levenshtein Distance TSQL
You can also run Levenshtein in .NET.
It might make sense to bring the 200,000 records into a .NET collection
and then compare the 25 million, one at a time, against the 200,000.
I assume the .NET implementation is faster; I just don't know how much faster.
C# Levenshtein
MSSQL has SOUNDEX, but it is way too fuzzy.
Hopefully you have a valid state to filter on.
Hopefully you have some valid zip codes.
If the zip code is valid, then filter to only that zip code.
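To sketch that first pass in T-SQL: the table and column names below (Addresses25M, Addresses200K, state, zip, house_number, street_name) are only placeholders for your parsed components, and dbo.Levenshtein stands for a scalar UDF like the one in the T-SQL link above:

-- hypothetical table and column names; dbo.Levenshtein is assumed to be
-- a scalar UDF such as the one from the T-SQL link above
SELECT big.id AS big_id, small.id AS small_id
FROM dbo.Addresses25M AS big
JOIN dbo.Addresses200K AS small
    ON  small.state = big.state          -- cheap exact filters first
    AND small.zip = big.zip
    AND small.house_number = big.house_number
WHERE dbo.Levenshtein(small.street_name, big.street_name) <= 2;   -- fuzzy only on what is left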
I have two very large files, and in both of them Field1 contains similar data. I want to use a query to compare these two files and output all rows of a particular column (Field3) where file1.Field1 == file2.Field1. This can of course be done with a simple query. I cannot use a join since there is too much text in the fields, so I use WHERE, but this apparently slows down the query by a lot.
Query:
SELECT Category_list.Field3
FROM Category_list, Uri_list
WHERE (Category_list.Field1=Uri_list.Field1);
Now this query is running, but I have no idea how long it will take. It could be a few hours, but it wouldn't surprise me if it takes days. Is it possible to see in Access how far into the query it is, so that I can get at least an idea of what the runtime will be?
Category_list has about 2.8 million rows and Uri_list has about 4 million rows. If needed I could reduce Uri_list to about 100,000 rows minimum, but that depends on the runtime.
Thanks in advance for the help :)
My suggestion:
Import both files into Access as local tables, using the text import wizard
Create an index in both tables on Field1. If the values are unique, use a unique index or, even better, a primary key; if not, a non-unique index.
Create a simple query that joins the two tables on Field1 (see the example below)
This should run fast enough.
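For reference, here is the question's query rewritten with an explicit join on Field1 (same tables and fields as above):

SELECT Category_list.Field3
FROM Category_list
INNER JOIN Uri_list
    ON Category_list.Field1 = Uri_list.Field1;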
Is it possible to see in access how far into the query it is, so that I can get at least an idea of what the runtime will be?
Sometimes (especially for Update queries), Access shows a progress bar. But if not, no - there is no way to see the progress.
I have an MS SQL 2012 DB with a table for documents.
In the application there are users.
Users can be document managers.
One user can be a manager for many documents.
One document can have many managers.
There is a limit of 50 users for the app.
I was wondering what would be the best (or fastest) way to search for documents whose manager is some user (or a few users).
1) One table for documents and an additional table for document manager records, then search like:
select dk.* from document dk
join documentmanager dm on dm.dokid=dk.id and dm.userlogin='xxx'
or
2) Do not use an additional table for manager records; instead assign each user a managernumber from 1 to 50, then when searching use:
SELECT * FROM documents where (managers & CAST(manager AS BIGINT) <> 0)
where manager is 2^managernumber.
The second one seems to be faster and simpler, and it does not require the additional table, so it also needs less space. But I don't know whether, if I put indexes on that additional table, option 1) might then be faster than 2). Of course there is a limitation of 63 users, but let's say that's not important.
It's hard to tell which one would be faster, at least when the number of records is small. The second approach has a simpler query, but it can't make use of any indexes as it has to calculate the value of the expression for each document.
The second approach may seem easier, but it's actually quite unconventional. Looking at the table design of the first approach, anyone with a bit of database experience can immediately tell how it's supposed to work. Anyone looking at the second approach needs to examine the query to figure out what the "magic" numbers in the table are supposed to mean.
Even if the number of users is limited so that the second approach would be usable, the number of documents is likely to grow over time. As the query in the second approach has to examine every document, it will get slower as the number of documents grows. The query in the first approach, on the other hand, can make use of indexes, so the execution time depends mostly on the number of records returned, not so much on the number of records in the tables. It can easily handle tables with upwards of millions of records before you would even notice any difference in performance.
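For example, with the documentmanager table from the question (column names taken from its query), an index along these lines would let the join in option 1) seek straight to a given user's documents instead of scanning; adjust the names to your actual schema:

CREATE INDEX IX_documentmanager_userlogin
ON documentmanager (userlogin, dokid);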
The first idea is how a relational database is typically designed. There is a reason -- it is the better design for a database.
You say the limit on the number of users does not matter because you won't need more than 63. In my opinion, if you have fewer than 63 of anything you don't need a database; you can load it from any file and keep all the information in memory. If size and scalability don't matter, then don't even use a database.
In every other case use the standard relational design that has been proven robust over many years.
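A minimal sketch of that standard design, reusing the table and column names from the question's first option (the title column is just a placeholder for your real document attributes):

CREATE TABLE documents (
    id INT IDENTITY PRIMARY KEY,
    title NVARCHAR(200) NOT NULL    -- placeholder; your real document columns go here
);

CREATE TABLE documentmanager (
    dokid INT NOT NULL REFERENCES documents (id),
    userlogin NVARCHAR(50) NOT NULL,
    PRIMARY KEY (dokid, userlogin)  -- one row per document/manager pair
);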
I have the following scenario where a search returns a list of user_id values (1, 2, 3, 4, 5, 6... etc.). If the search were to be run again, the results are guaranteed to change after some time. However, I need to store that instance of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
search_id unsigned int FK, PK (clustered index)
user_id unsigned int FK
This is an unacceptable approach, as this table has grown to millions of records. I've considered partitioning the table, but I would end up with a huge number of partitions (1000s).
I've already optimized things so that search results expire unless they're used elsewhere, so all of the remaining search results are referenced elsewhere.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking for a way to store the search result information efficiently, such that it can be accessed later without being burdened by the number of records.
EDIT: Thank you for the answers. I don't have any problems running the searches themselves, but the result set of a search gets used, in this case, for recipient lists, which will be reused over and over again; the purpose of storing it is exactly to have a snapshot of the data at the given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptably fast.
If you can't make the queries faster using better SQL and/or indexes etc, I recommend using lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: it was caching query results from the production relational database in the session object in an attempt to speed up queries, but it was a mess and wasn't much faster anyway - before my time, a "senior" Java developer (whose name started with Jam.. and ended with .illiams) who was actually a moron decided it was a good idea.
I put in Solr (a Java search server built on Lucene), kept it up to date with the relational database (using work queues), and the web queries now take just a few milliseconds.
Is there a reason why you need to store every search? Surely you would want the most up-to-date information available for the user?
I'll admit first, this isn't a great solution.
Set up another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
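Roughly, the save and retrieve scripts could look like this (the id 12345 and the source of the result set are placeholders; in practice the table name would be spliced in with dynamic SQL):

-- save: materialize one search's results into its own table in the other DB
SELECT user_id
INTO SYS_Searches.dbo.Results_12345
FROM dbo.current_search_results;    -- placeholder for whatever produces the user ids

-- retrieve: a trivial read of one small table
SELECT user_id
FROM SYS_Searches.dbo.Results_12345;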
Benefits:
Every search is neatly packed into its own table, [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will end up with a table for every one of (x users * y searches each user can store).
This could get very silly very quickly unless there is management involved to expire results or the user can only have 1 cached search result set.
Not pretty, but I can't think of another way.
Say tableA has 1 row to be returned but will have 100 columns returned, while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key to tableA.
Will a left join of tableA to tableB return 100*100 cells of data, while 2 separate queries return 100 + 100 cells of data (roughly 50 times less), or is that a misunderstanding of how it works?
Is it ever more efficient to use many simple queries rather than fewer more complex ones?
First and foremost, I would question a table with 100 columns and suggest that there is possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned by one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that it should be considered.
However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:
1 - What is the speed of the link between the DB server and the application server? If it is very slow, then you are probably better off minimizing the amount of data returned vs. the number of queries you run. If it is not slow, then you will likely spend more time executing two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.
2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows columns 2 through 101 are duplicated for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own environment to know for sure.
As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.
What is best at joining: a database engine or client code? That said, I use both techniques; it depends on the client and how the data will be used.
Where the data requires some processing to, say, render on a web page, I'd probably split header and detail recordsets. We use this approach because we have some business logic between the DB and the HTML.
Where it's consumed simply and linearly, I'd join in the database to avoid unnecessary processing - for example, simple reports or exports.
It depends. If you only take into account SQL efficiency, then obviously several simpler queries with smaller result sets will be more efficient.
But you need to take the whole process into account: if the join would otherwise be done on the client, or you need to filter the results after the join, then the DBMS will probably be more efficient than doing it in your code.
Coding is always a trade-off between different systems, DB vs. client, RAM vs. CPU... you need to be aware of this and try to find the best solution.
In this case two queries will probably outperform one, but that is not a general rule.
I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.
As for the number of rows returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie rows together on the given conditions and, for outer joins, preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
For very query-intensive applications, performance may be better with less normalized tables. However, as always with optimizations, I'd only consider going in that direction after seeing real, measurable problems (e.g. with a profiling tool).
In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.
Generally fewer queries make for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number of queries.
There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returned - for example what the fields actually contain, i.e. the actual amount of data.
As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
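To make that concrete, here is a sketch with made-up orders/order_lines tables, contrasting the per-record loop with a single join:

-- anti-pattern: run once per order fetched by the application (N+1 queries in total)
--   SELECT * FROM order_lines WHERE order_id = @current_order_id;
-- better: one query that returns the parent rows and their related rows together
SELECT o.order_id, o.order_date, l.product_id, l.quantity
FROM orders AS o
JOIN order_lines AS l ON l.order_id = o.order_id
WHERE o.customer_id = 42;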
“Is it ever more efficient to use many simple queries rather than fewer more complex ones?”
The query that requires the least amount of data to be traversed, and gives you no more than what you need, is the more efficient one. Beyond this, there can be RDBMS-specific conditions that make one approach more efficient on one RDBMS than on another. At a very low level, when you deal with less data your results can be retrieved much more quickly, so efficient queries are those that work with only the least amount of data needed to get the result you are looking for.
I have a DB holding text file attributes and text file primary key IDs, and
I have indexed around 1 million text files along with their IDs (primary keys in the DB).
Now, I am searching at two levels.
First is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).
Then I make a Boolean query, for instance like the following:
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and search it in my Index file.
The problem is that such a query (having 2 million clauses) takes far too much time to return results and consumes far too much memory.
Is there any optimization solution for this problem?
Assuming you can reuse the dbid part of your queries:
Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
Make both parts into queries
Convert the pkid query to a filter (by using QueryWrapperFilter)
Convert the filter into a cached filter (using CachingWrapperFilter)
Hang onto the filter, perhaps via some kind of dictionary
Next time you do a search, use the overload that allows you to use a query and filter
As long as the pkid search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even work through commit points (I understand the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with the +Text:"test*" query and then limit the results by running a DB query on the Lucene hits.
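For example, assuming the Lucene hits come back as a manageable list of primary keys, the DB side of that second step can be as simple as the following (table and column names are made up; for large hit lists, bulk-insert the IDs into a temp table and join instead of using IN):

SELECT f.pkID, f.FileName
FROM dbo.TextFiles AS f
WHERE f.pkID IN (1, 4, 100, 115, 1041);    -- the IDs returned as Lucene hits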