How to model data for a CouchDB geocoder - Lucene

I am working on a CouchDB-based geocoding application using a large national dataset that is supplied relationally. There are some 250 million records split over 9 tables (the ER diagram can be viewed at http://bit.ly/1dlgZBt). I am quite new to NoSQL document databases, and CouchDB in particular, and am considering how to model this. I have currently loaded the data into a CouchDB database per table, with a type field indicating which kind of record it is. The _id attribute is set to the primary key for tables [A] and [C]; for everything else it is auto-generated by Couch. I plan on setting up Lucene with Couch for indexing and full-text search. The X and Y point coordinates are all stored in table [A], but to find these I will need to search using data in [Table E], [Tables B, C & D combined] and/or [Table I], with the option of filtering results based on data in [Table F].
My original intention was to create a single CouchDB database which would combine all of these tables into a single structure, with [Table A] as the root and all related tables nested under it. I would then build my various search indexes on this, and also set up a spatial index using GeoCouch for reverse geocoding. However, I have read articles that suggest view collation as an alternative approach.
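For illustration only, the kind of nested document I originally had in mind might look roughly like this (a Python-style sketch; every field name below is a placeholder, not the real schema):
# Hypothetical shape of a single nested document, with [Table A] as the root.
# Field names and values are invented purely for illustration.
nested_doc = {
    "_id": "tableA-pk-0001",          # primary key from [Table A]
    "type": "location",
    "x": 530123.0,                    # point coordinates from [Table A]
    "y": 181234.0,
    "table_e": {"text": "..."},       # searchable fields from [Table E]
    "address": {"text": "..."},       # fields from [Tables B, C & D combined]
    "table_i": {"text": "..."},       # searchable fields from [Table I]
    "table_f": {"category": "..."},   # fields from [Table F], used for filtering
}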
An important factor here, I guess, is reads vs writes. The plan is that this data will never be updated, only read. Data is released every quarter, at which time the existing DB would be blown away and a new DB created.
I would welcome any suggestions for how best to setup and organise this from any experienced Couch or related document database users.
Many thanks in advance for any assistance.

guygrange,
While I am far from an expert in document database design, the key thing to recognize about document DBs is that everything is about making your queries fast by keeping all of the necessary information in a single document. Hence, you need to look at your queries and how you expect to access this data. For example, I can easily imagine a geocoding application not needing access to everything in each table for its most frequent queries. So, to save on bandwidth, you could make a main document that holds the information you most frequently care about, along with a key for the rest of the appropriate data. You could then fetch the remaining data with that key and merge the dictionaries for easy handling in your client code.
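As a rough sketch of that idea - assuming the plain CouchDB HTTP API via Python's requests library, and with made-up database and field names - it could look something like this:
import requests

COUCH = "http://localhost:5984/geocoder"   # hypothetical database name

def fetch_location(main_id):
    # Fetch the small "main" document that holds the frequently used fields.
    main = requests.get(f"{COUCH}/{main_id}").json()
    # The main document carries a key pointing at the bulkier detail document.
    detail_id = main.get("detail_id")      # invented field name
    if not detail_id:
        return main
    detail = requests.get(f"{COUCH}/{detail_id}").json()
    # Merge the two dictionaries client-side; fields from the main doc win.
    detail.update(main)
    return detail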
Anon,
Andrew

Related

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. The scraper will pull some data that is static and some data that varies with time, so there will be one table for each type of data (static/variable). I'm trying to decide whether there should be a separate table for variable data that is scraped versus variable data that is pulled through the API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), so both will be keyed by the same ProductID and date. It therefore seems like I could just combine the schemas of both tables to save on join time when querying the data later for processing. The obvious downside is that I would have to manage whether rows need to be created or updated whenever one of the processes runs (i.e. which of the scraper vs API tools creates or updates rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, just to summarize: separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a time stamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean reading a single record from each of your tables, each time using a primary key, which is fast. But still, reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way to collect data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard of the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only gets inserted for a product that already has static data.
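As a minimal sketch of that insert-or-update logic - assuming MySQL with the mysql-connector-python driver, and an invented product_variable table that has a unique key on (product_id, snapshot_date) - either tool can simply upsert its own columns:
import mysql.connector

conn = mysql.connector.connect(user="app", password="...", database="products")
cur = conn.cursor()

# Whichever tool runs first (scraper or API puller) inserts the row for that
# product and date; the other tool's run then updates the same row.
cur.execute(
    """
    INSERT INTO product_variable (product_id, snapshot_date, price, page_hits)
    VALUES (%s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        price     = COALESCE(VALUES(price), price),
        page_hits = COALESCE(VALUES(page_hits), page_hits)
    """,
    (12345, "2014-01-31", 250000.00, None),   # the API tool would pass price=None
)
conn.commit()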

Why does WordPress have separate 'usersmeta' and 'users' SQL tables? Why not combine them?

Alongside the users table, WordPress has a usersmeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usersmeta table, regardless of whether or not the rows have a filled-in meta_value. Given that, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20, as stated in the question), but there can be any number. If all users have one thousand meta-data entries, there are simply one thousand entries in the usermeta table for each user; the database structure itself places no limit on the number of meta-data entries a user can have. It also permits one user to have one thousand meta-data entries while all others have 20, and still store the data efficiently - or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and this is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com, or on a shared host where the db user has no ALTER permission). MySQL also has hard limits of 4096 columns and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, along the way creating a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
WordPress is quite tied to MySQL, so changing the datastore isn't a realistic option.
Further WP info
If you aren't using any (or many) plugins, it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add its own meta-data for users; the number of entries added may not be trivial, and all of this data goes into the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
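And, for completeness, a sketch of the matching lookup (a raw query for illustration only, not WordPress's own API; it assumes the standard wp_users/wp_usermeta table names and a MySQL connection via mysql-connector-python):
import mysql.connector

conn = mysql.connector.connect(user="wp", password="...", database="wordpress")
cur = conn.cursor()

# Find users whose 'favorite_color' meta value is 'blue' by joining the EAV
# table back onto the users table - no schema change required.
cur.execute(
    """
    SELECT u.ID, u.user_login
    FROM wp_users AS u
    JOIN wp_usermeta AS m ON m.user_id = u.ID
    WHERE m.meta_key = %s AND m.meta_value = %s
    """,
    ("favorite_color", "blue"),
)
for user_id, login in cur.fetchall():
    print(user_id, login)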
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer
Since there is a huge literature about this and there are a lot of differences, I will just give some examples of why this might happen - it boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep them separate - space could be an issue, speed of queries may depend on this, and it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading
This is a deep topic and you may want to learn more - there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database: Database design: one huge table or separate tables? and First-time database design: am I overengineering?, or Database Normalization Basics on About.com.

Improving query performance of a database table with a large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5-6 tables, with a query service built on top of it using Hibernate, and we are getting very poor performance. Loading the data into the cache would take in the range of 10-15 hours.
So we have decided to create a denormalized table of 50-60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC so the time to build the cache is lower. This gives us a lot better performance, and now we can build the cache in around an hour, but that still does not meet our requirement of building the cache within 5 minutes. The denormalized table is queried using the following query:
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from users where user_location in (...)
and created a non-unique index on user_location as well, but that did not help either.
So, is there a way we can make the queries faster? If not, we are also open to considering some NoSQL solutions.
Which NoSQL solution would be suited to our needs? Apart from querying the large table, we would be making around 1mm updates to it on a daily basis.
I have read about MongoDB and it seems that it might work, but no one has posted any experience with MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write-heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations, there is no obvious reason it will not work either.
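As a minimal sketch of that kind of denormalized cache lookup - assuming the PyMongo driver and invented database, collection and field names - the equivalent of your IN (...) query would be:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.cache_db.users               # hypothetical database/collection

# Index the lookup keys once, up front (_id is indexed automatically).
users.create_index("user_id")
users.create_index("user_location")

# Equivalent of "select * from users where user_id in (...)".
ids = [101, 102, 103]
for doc in users.find({"user_id": {"$in": ids}}):
    print(doc["user_id"])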

SQL -> MongoDB Export Performance Issues

I'm trying to set up an automated process to regularly transform and export a large MS SQL 2008 database to MongoDB.
There is not a 1-1 correspondence between tables in SQL and collections in MongoDB -- for example the Address table in SQL is translated into an array embedded in each customer's record in Mongo and so on.
Right now I have a 3 step process:
Export all the relevant portions of the database to XML using a FOR XML query.
Translate XML to mongoimport-friendly JSON using XSLT
Import to mongo using mongoimport
The bottleneck right now seems to be #2. XML->JSON conversion for 3 million customer records (each with demographic info and embedded address and order arrays) takes hours with libxslt.
It seems hard to believe that there's not already some pre-built way to do this, but I can't seem to find one anywhere.
Questions:
A) Are there any pre-existing utilities I could use to do this?
B) If no, is there a way I could speed up my process?
C) Am I approaching the whole problem the wrong way?
Another approach is to go through each table and add information to Mongo on a record-by-record basis and let Mongo do the denormalizing! For instance, to add the phone numbers, just go through the phone number table and do a '$addToSet' of each phone number into the corresponding record.
You can also do this in parallel and do tables separately. This may speed things up but may 'fragment' the mongo database more.
You may want to add any required indexes before you start; otherwise adding the indexes at the end may cause a long delay.
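A rough sketch of that record-by-record approach - using PyMongo, with invented collection and field names, and assuming the SQL side yields (customer_id, phone) pairs - might look like:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client.crm.customers            # hypothetical target collection

# Create the required indexes before the load, as suggested above.
customers.create_index("customer_id")

def load_phone_numbers(sql_rows):
    # sql_rows: an iterable of (customer_id, phone) tuples read straight
    # out of the SQL phone number table.
    for customer_id, phone in sql_rows:
        customers.update_one(
            {"customer_id": customer_id},
            {"$addToSet": {"phone_numbers": phone}},
            upsert=True,    # create the customer document if it doesn't exist yet
        )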

Can anyone please explain "storing" vs "indexing" in databases?

What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there any thorough article about this, I have made some searches without luck!
Thanks
Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index', which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
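For a concrete sketch (using Python's built-in sqlite3 module purely so the example is self-contained and runnable; the same idea applies to MySQL), storing is the INSERT and indexing is the separate CREATE INDEX step:
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway database for the sketch
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

# Storing: the INSERT simply writes the row into the table.
cur.execute("INSERT INTO customers (id, email) VALUES (?, ?)", (1, "a@example.com"))

# Indexing: build a separate lookup structure over the email column so that
# searches by email no longer have to scan every row of the table.
cur.execute("CREATE INDEX idx_customers_email ON customers (email)")

# This query can now use idx_customers_email instead of a full table scan.
cur.execute("SELECT id FROM customers WHERE email = ?", ("a@example.com",))
print(cur.fetchone())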
Storing is just putting data in the tables.
Storing vs. indexing is a SOLR concept.
In SOLR, a field that is only stored (and not indexed) cannot be searched on or sorted by; it can only be retrieved in the results of a query that searches on an indexed field.
In MySQL, by contrast, you can search and sort on unindexed fields too: it will just be slower, but still possible (unlike SOLR).
Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.
Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating some separate data structure to optimize the location and retrieval of that data, in a faster way than simply reading the entire database (or the entire table) and looking at each and every record until the search algorithm finds what you asked for... Generally, databases use what are called balanced-tree (B-tree) indices, which are an extension of the concept of a binary tree. Look up 'binary tree' on Google/Wikipedia to get a more in-depth understanding of how this works...
Data
L1. This
L2. Is
L3. My Data
And the index is
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.
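A toy Python sketch of the same idea - building a tiny inverted index over the numbered lines of data from the example above - would be:
# Toy inverted index over the three "lines" of data shown above.
data = {1: "This", 2: "Is", 3: "My Data"}

index = {}
for line_no, text in data.items():
    for word in text.split():
        index.setdefault(word, set()).add(line_no)

print(index)              # {'This': {1}, 'Is': {2}, 'My': {3}, 'Data': {3}}
print(index.get("Data"))  # jump straight to line 3 without scanning every line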