What would be the best way to index and search my data using Lucene? - lucene

I’ve found multiple questions on SO and elsewhere that ask questions along the lines of “How can I index and then search relational data in Lucene”. Quite rightly these questions are met with the standard response that Lucene is not designed to model data like this. This quote I found sums it up…
A Lucene Index is a Document Store. In a Document Store, a single
document represents a single concept with all necessary data stored to
represent that concept (compared to that same concept being spread
across multiple tables in an RDBMS requiring several joins to
re-create).
So I will not ask that question and instead provide my high level requirements and see if any Lucene gurus out there can help me.
We have data on People (Name, Gender, DOB, Nationality, etc)
And data on Companies (Name, Country, City, etc).
We also have data about how these two types of entity relate to each other where a person worked at the company (Person, Company, Role, Date Started, Date Ended, etc).
We have two entities – Person and Company – that have their own properties and then properties exist for the many-to-many link between them.
Some example searches could be as follows…
Find all Companies in Australia
Find all People born between two dates
Find all People who have worked as a .Net Developer
Find all males who have worked as a .Net Developer in London
Find all People who have worked as a .Net Developer between 2008 and 2010
The criteria span all three sets of data. Our requirement is to provide a Faceted Search over the data that accepts any combination of the various properties, of which I have given some examples.
I would like to use Lucene.Net for this. We are a .Net software house and so feel slightly intimidated by Java. However, all suggestions are welcome.
I am aware of the idea that the Index should be constructed with the search in mind. But I can’t seem to come up with a sensible index that would meet all the combinations of search criteria.
What classes native to Lucene or what extension points can we make use of?
Are there any established techniques for doing this kind of thing?
Are there any third-party open source contributions that I have missed that will help us here?
For now I won’t describe the scenarios we have considered because I don’t want to bloat out this question and make it too intimidating. Please ask me to elaborate where necessary.

To store both companies and people in a single index, you could create documents with a type field that identifies the type of entities they describe.
Birthdays can be stored as date fields.
You could give each person a simple text field containing the names of companies that they worked for. Note that you won't get an error if you enter a company that is not represented by a document in your index. Lucene is not a relational DB tool, but you knew that.
(Sorry that I've not posted any links to the API; I'm familiar with Lucene Core but not Lucene.NET.)
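To make the single-index idea a bit more concrete, here is a rough sketch in plain Python. It is not Lucene or Lucene.NET code, and every name in it (the `type` values, the field names, `build_documents`, `search`) is made up for illustration. It also goes one step beyond the text-field suggestion above: each employment becomes its own document with the person and company attributes copied in, so that a role and the city it was held in stay paired within one document.

```python
from datetime import date

# Hypothetical source rows; in practice these would come from the relational database.
people = {
    1: {"name": "Ann", "gender": "F", "dob": date(1980, 5, 1)},
    2: {"name": "Bob", "gender": "M", "dob": date(1975, 3, 9)},
}
companies = {
    10: {"name": "Acme", "country": "Australia", "city": "Sydney"},
    11: {"name": "Globex", "country": "UK", "city": "London"},
}
employments = [
    {"person": 1, "company": 10, "role": ".Net Developer", "start": 2007, "end": 2009},
    {"person": 2, "company": 11, "role": ".Net Developer", "start": 2011, "end": 2013},
    {"person": 2, "company": 10, "role": "Tester", "start": 2008, "end": 2010},
]


def build_documents():
    """Return one flat document per company, person, and employment."""
    docs = []
    for company_id, company in companies.items():
        docs.append({"type": "company", "company_id": company_id, **company})
    for person_id, person in people.items():
        docs.append({"type": "person", "person_id": person_id, **person})
    for job in employments:
        person = people[job["person"]]
        company = companies[job["company"]]
        # Denormalise: copy the person and company attributes into the employment.
        docs.append({
            "type": "employment",
            "person_id": job["person"],
            "person_name": person["name"],
            "gender": person["gender"],
            "role": job["role"],
            "city": company["city"],
            "country": company["country"],
            "start": job["start"],
            "end": job["end"],
        })
    return docs


def search(docs, **criteria):
    """Return documents whose fields satisfy every criterion.

    A criterion is either a plain value (equality test) or a callable predicate.
    """
    def matches(doc):
        for field, wanted in criteria.items():
            if field not in doc:
                return False
            if callable(wanted):
                if not wanted(doc[field]):
                    return False
            elif doc[field] != wanted:
                return False
        return True

    return [doc for doc in docs if matches(doc)]
```

With this layout, "all males who have worked as a .Net Developer in London" becomes `search(docs, type="employment", gender="M", role=".Net Developer", city="London")`, and a date-range search becomes two predicates on `start` and `end`. The price is duplicated person and company data, which has to be re-indexed whenever the source rows change.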

Related

Library Database MS Access

I am currently designing a database for a library as a project for my database class. Below is the ER diagram so you can see the basic formatting of the database:
I have completed all of the requirements of the assignment, so please don't think that I am asking for someone to do my homework. What I am asking for assistance with is an idea. We can add extra features such as a report generator, entry form, etc. I have added a report generator to show the most active member/most popular book, and an entry form. But I cannot think of anything else I can add to increase the usefulness. Any suggestions would be appreciated.
There are a lot of possibilities:
Popularity can be split by gender, which could be interesting.
You could also work on popularity by address to make a geographical analysis, but if this is a local library you probably won't get much out of that.
Another stat that could be worthwhile is the average length of a loan.
But the more interesting reports could be those that combine popularity, popularity by gender, average length of borrow with the catalog attributes... so you could have a view by author, by publisher or by publication year.
That is with the current structure. If you were to add genre or media type (book, magazine, CD, DVD) attributes, you would open up a whole lot of new dimensions.

How to design a database for efficient search-ability?

I am trying to design a database with search-ability at its core. My knowledge of database design and SQL is all self-taught and still fairly beginner-level, so my questions may possibly have easy answers.
Suppose I have a single table containing a large number of records. For example, suppose that each record contains details of a different computer application (name, developer, version number, etc). A list of keywords is associated with each record, such as a list of programming languages used to write the application.
I wish to be able to enter one or more keywords (each separated by a space) into a search box, and I wish to have all associated records returned. How should I design the database to store the keywords, and what SQL query would I need to apply to the search text? (The search should be case-insensitive.)
My next challenge would then be to order search results by relevance, and to allow entire key-phrases as well as keywords to be associated with each record. For example, if I type "Visual Basic" into the search field, I want the first results to have exactly the key-phrase "Visual Basic" associated with them. The next results should all have both keywords "Visual" and "Basic" associated with them, and the remaining results should have only one of these keywords. Again, please could anyone advise on how to implement this?
The final challenge I believe would be much harder: how much 'intelligent interpretation' can I design my database and SQL code to handle? For example, if I search for "CSS", can I get the records with the key-phrase "Cascading Style Sheets" to appear? Can I also get SQL to identify and search for similar words, such as plurals of search phrases or, for example, "programmer" or "programming" when "program" is input? Thanks!
Learn relational algebra, normalization rules, and SQL.
Start with entity relationships. Sounds like you could have an APPLICATION table as parent for a FEATURE child table, with a one-to-many relationship between the two. You'll query them by JOINing one to the other:
SELECT A.NAME, F.NAME
FROM APPLICATION AS A
JOIN FEATURE AS F
ON F.APP_ID = A.ID
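For what it's worth, the join above can be turned into a runnable keyword search with Python's built-in `sqlite3` module. This is only a sketch: the table names come from the query above, but the column layout, the sample rows, and the helpers `make_db` and `search_apps` are assumptions. It matches keywords case-insensitively and ranks applications by how many of the entered terms they match; it does not attempt the key-phrase ranking asked about in the question.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one keyword per FEATURE row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE APPLICATION (ID INTEGER PRIMARY KEY, NAME TEXT NOT NULL);
        CREATE TABLE FEATURE (
            APP_ID INTEGER NOT NULL REFERENCES APPLICATION(ID),
            NAME   TEXT NOT NULL
        );
        INSERT INTO APPLICATION VALUES (1, 'Editor'), (2, 'Compiler'), (3, 'Browser');
        INSERT INTO FEATURE VALUES
            (1, 'Visual'), (1, 'Basic'),
            (2, 'Basic'),  (2, 'C'),
            (3, 'CSS');
    """)
    return conn


def search_apps(conn, text):
    """Return (application name, matched term count) rows, best match first."""
    terms = [term.lower() for term in text.split()]
    if not terms:
        return []
    # One bound parameter per search term; user input never enters the SQL text.
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT A.NAME, COUNT(DISTINCT LOWER(F.NAME)) AS HITS
        FROM APPLICATION AS A
        JOIN FEATURE AS F
          ON F.APP_ID = A.ID
        WHERE LOWER(F.NAME) IN ({placeholders})
        GROUP BY A.ID, A.NAME
        ORDER BY HITS DESC, A.NAME
    """
    return conn.execute(sql, terms).fetchall()
```

Searching for `visual BASIC` returns the application tagged with both keywords ahead of the one tagged only with `Basic`.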
Your challenges would not suggest SQL and relations to me. I would think more in terms of a parser, an indexer and search engine like Lucene, and a NoSQL document database like MongoDB.
I've come to the conclusion, after a LOT of research, that @duffymo's answer is hinting in the right direction. For the benefit of other n00bs like me, here's the conclusion I've drawn:
Many open source search engine server apps are out there to install for free. Lucene was the first of them I had ever heard of, but others do exist and I think my favourite at the moment is Sphinx. As far as I can tell, the 'indexer' that @duffymo mentions is built into it. I have learnt that the indexer is the program that will examine my database for keywords and will automatically keep a record of which results should be returned for different input queries. I have also now learnt that the terminology for the behaviour I was looking for (and which Sphinx has) is 'stemming'. I'm still not sure what role a parser plays in all this...
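For other beginners, here is a toy sketch of what an indexer with stemming does, in plain Python. It is not how Sphinx or Lucene are implemented; the `stem` function is a deliberately naive suffix stripper written for this example (real engines use proper algorithms such as Porter stemming), and all names are assumptions.

```python
from collections import defaultdict


def stem(word):
    """Very naive stemmer: lower-case, strip one common suffix, undo doubling."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "programm" -> "program".
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def build_index(records):
    """Map each stemmed word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.split():
            index[stem(word)].add(record_id)
    return index


def lookup(index, query):
    """Return the ids of records that contain every (stemmed) query word."""
    result = None
    for word in query.split():
        ids = index.get(stem(word), set())
        result = ids if result is None else result & ids
    return result or set()
```

Because both the stored text and the query go through the same `stem` function, a search for "program" finds records that mention "programming" or "programmer". Synonyms such as "CSS" for "Cascading Style Sheets" are a separate problem that stemming does not solve.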
A more basic approach would be to use SQL itself. Whilst I was already aware of the most basic of these (ie. using the LIKE keyword with 'wildcards'), I also discovered something a little more powerful: natural language / full-text search. For anyone not interested in installing a server app, I recommend you look this up.
Also, I see no reason why I would need to use NoSQL instead of SQL (as @duffymo has suggested), and so I'm going to stick with SQL for the moment (at least until I come across some good entry-level books to learn NoSQL from). Furthermore, I have very little intention to learn relational algebra until I know why I should and how it would be useful. The message here is that other beginners shouldn't be put off by these things, as I don't think Sphinx requires any knowledge of them.
While I like @duffymo's answer, I will also suggest you research SPARQL and the WordNet project for your semantic equivalence questions.
If you choose Oracle, you can use the spatial option triple store to implement the SPARQL endpoint and do some very nice searching like your CSS = Cascading Style Sheets example.

Should I use EAV database design model or a lot of tables

I started a new application and now I am looking at two paths and don't know which is the better way to continue.
I am building something like an eCommerce site. I have categories and subcategories.
The problem is that there are different types of products on the site and each has different properties. And the site must be filterable by those product properties.
This is my initial database design:
Products{ProductId, Name, ProductCategoryId}
ProductCategories{ProductCategoryId, Name, ParentId}
CategoryProperties{CategoryPropertyId, ProductCategoryId, Name}
ProductPropertyValues{ProductId, CategoryPropertyId, Value}
Now after some analysis I see that this design is actually an EAV model, and I read that people usually don't recommend this design.
It seems that dynamic sql queries are required for everything.
That's one way and I am looking at it right now.
Another way that I see would probably be called the 'lot of work' way, but if it's better I want to go there.
It would mean creating a table
Product{ProductId, CategoryId, Name, ManufacturerId}
and using table inheritance in the database, which means creating tables like
Cpus{ProductId ....}
HardDisks{ProductId ....}
MotherBoards{ProductId ....}
etc. for each product type (1-to-1 relation).
I understand that this will be a very large database and a very large application domain, but is it better, easier, and faster than option one with the EAV design?
EAV is rarely a win. In your case I can see the appeal of EAV given that different categories will have different attributes and this will be hard to manage otherwise. However, suppose someone wants to search for "all hard drives with more than 3 platters, using a SATA interface, spinning at 10k rpm?" Your query in EAV will be painful. If you ever want to support a query like that, EAV is out.
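To show why that query hurts, here is a sketch against a cut-down version of the schema from the question, using Python's `sqlite3` module. The sample rows, the hard-coded property ids, and the helper names `make_db` and `find_fast_sata_disks` are assumptions for illustration, and the category tables are left out to keep it short.

```python
import sqlite3


def make_db():
    """Build a small in-memory EAV data set with three hard disks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Products (ProductId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE CategoryProperties (CategoryPropertyId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE ProductPropertyValues (
            ProductId INTEGER, CategoryPropertyId INTEGER, Value TEXT
        );
        INSERT INTO Products VALUES (1, 'Disk A'), (2, 'Disk B'), (3, 'Disk C');
        INSERT INTO CategoryProperties VALUES (1, 'Platters'), (2, 'Interface'), (3, 'Rpm');
        INSERT INTO ProductPropertyValues VALUES
            (1, 1, '4'), (1, 2, 'SATA'), (1, 3, '10000'),
            (2, 1, '2'), (2, 2, 'SATA'), (2, 3, '10000'),
            (3, 1, '5'), (3, 2, 'SAS'),  (3, 3, '10000');
    """)
    return conn


def find_fast_sata_disks(conn):
    """More than 3 platters, SATA interface, 10k rpm: one self-join per attribute."""
    sql = """
        SELECT p.Name
        FROM Products AS p
        JOIN ProductPropertyValues AS platters
          ON platters.ProductId = p.ProductId AND platters.CategoryPropertyId = 1
        JOIN ProductPropertyValues AS iface
          ON iface.ProductId = p.ProductId AND iface.CategoryPropertyId = 2
        JOIN ProductPropertyValues AS rpm
          ON rpm.ProductId = p.ProductId AND rpm.CategoryPropertyId = 3
        WHERE CAST(platters.Value AS INTEGER) > 3   -- values are stored as text
          AND iface.Value = 'SATA'
          AND CAST(rpm.Value AS INTEGER) = 10000
        ORDER BY p.Name
    """
    return [row[0] for row in conn.execute(sql)]
```

Every extra filter adds another join on the same value table, and every numeric comparison needs a cast because all values share one text column. With a dedicated `HardDisks` table the same search would be a single `WHERE` clause on typed columns.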
There are other approaches however. You could consider an XML field with extended data or, if you are on PostgreSQL 9.2, a JSON field (XML is easier to search though). This would give you a significantly larger range of possible searches without the headaches of EAV. The tradeoff would be that schema enforcement would be harder.
This question seems to discuss the issue in greater detail.
Apart from the performance, extensibility and complexity discussed there, also take into account:
SQL databases such as SQL Server have full-text search features, so if you have a single field describing the product, full-text search will index it and will be able to provide advanced semantic searches.
Take a look at the NoSQL systems that are all the rage right now; scalability should be quite good with them, and they provide support for non-structured data such as yours. Hadoop and Cassandra are good starting points.
You could very well work with the EAV model.
We do something similar with a Logistics application. It is built on .net though.
Apart from the tables, your application code has to handle the objects correctly.
See if you can add a generic table for each object. It works for us.

Sharepoint: Using multiple content types in list. Pros and Cons

I am a newbie in SharePoint development.
I have a hierarchical structure like an internet forum:
Forum
Post
Comment
For each of these entities I create a content type.
I see that SharePoint allows different content types to be stored in one list, so I could store all forums with their posts and comments in a single list (Forum and Post would be 'Folder', Comment would be 'Item').
On the other hand, I could create separate lists for each content type:
Forums List, Posts List, Comments List, and link them in some way.
Can anybody outline the pros and cons of both solutions? I have about 2 weeks of experience with SharePoint and can't decide which way is best.
P.S. Sorry for my English.
The short answer is: it depends.
First, they need to logically fit together. A user should expect items of these various types to be grouped together (or at least wouldn't be surprised that they have been grouped together). And in terms of design, they should have some common intersection of list type and fields. Combining Documents, Discussions, and Events into a single list wouldn't be a good idea. Likewise, I'm not sure Posts and Comments (as you mention above) would be a good fit for a single list. They just don't logically fit and their schemas probably do not have enough in common.
Once that has been determined, I would put multiple Content Types in the same list if they are meant to be used together. Will you want to show all of these items, regardless of Content Type, together in a view? Do all of these items share the same workflows, policies, permissions, etc? If the answer is no for any of these, then split the Content Types into different lists.
As I said, it depends. I'm not sure there really is a hard-and-fast rule for this. I see it a little like database normalization. We know the forms and the options. But depending on the project, sometimes we normalize a little more, sometimes we denormalize a little more, but we almost never (I hope) have one monster table that contains every type of row in the database.

How to document a database [closed]

(Note: I realize this is close to "How do you document your database structure?", but I don't think it's identical.)
I've started work at a place with a database with literally hundreds of tables and views, all with cryptic names with very few vowels, and no documentation. They also don't allow gratuitous changes to the database schema, nor can I touch any database except the test one on my own machine (which gets blown away and recreated regularly), so I can't add comments that would help anybody.
I tried using "Toad" to create an ER diagram, but after leaving it running for 48 hours straight it still hadn't produced anything visible and I needed my computer back. I was talking to some other recent hires and we all suggested that whenever we've puzzled out what a particular table or some of its columns mean, we should update it in the developers' wiki.
So what's a good way to do this? Just list tables/views and their columns and fill them in as we go? The basic tools I've got to hand are Toad, Oracle's "SQL Developer", MS Office, and Visio.
In my experience, ER (or UML) diagrams aren't the most useful artifact - with a large number of tables, diagrams (especially reverse engineered ones) are often a big convoluted mess that nobody learns anything from.
For my money, some good human-readable documentation (perhaps supplemented with diagrams of smaller portions of the system) will give you the most mileage. This will include, for each table:
Descriptions of what the table means and how it's functionally used (in the UI, etc.)
Descriptions of what each attribute means, if it isn't obvious
Explanations of the relationships (foreign keys) from this table to others, and vice-versa
Explanations of additional constraints and / or triggers
Additional explanation of major views & procs that touch the table, if they're not well documented already
With all of the above, don't document for the sake of documenting - documentation that restates the obvious just gets in people's way. Instead, focus on the stuff that confused you at first, and spend a few minutes writing really clear, concise explanations. That'll help you think it through, and it'll massively help other developers who run into these tables for the first time.
As others have mentioned, there are a wide variety of tools to help you manage this, like Enterprise Architect, Red Gate SQL Doc, and the built-in tools from various vendors. But while tool support is helpful (and even critical, in bigger databases), doing the hard work of understanding and explaining the conceptual model of the database is the real win. From that perspective, you can even do it in a text file (though doing it in Wiki form would allow several people to collaborate on adding to that documentation incrementally - so, every time someone figures out something, they can add it to the growing body of documentation instantly).
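If you go the wiki route, the empty skeleton (one section per table or view, one line per column) can be generated so that people only have to fill in the meaning. Below is a minimal sketch in Python. It uses SQLite's catalog as a stand-in so the example runs anywhere; against Oracle you would read a dictionary view such as `ALL_TAB_COLUMNS` instead. The function name `wiki_skeleton`, the `==` heading markup, and the `TODO` placeholders are all assumptions.

```python
import sqlite3


def wiki_skeleton(conn):
    """Return wiki-style text with one section per table/view and one line per column."""
    lines = []
    # Skip SQLite's internal objects (the LIKE pattern is deliberately loose).
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        lines.append(f"== {name} ({kind}) ==")
        lines.append("Purpose: TODO")
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, column_type, *_rest in conn.execute(
            f'PRAGMA table_info("{name}")'
        ):
            lines.append(f"* {column} ({column_type or 'untyped'}): TODO")
        lines.append("")
    return "\n".join(lines)
```

Paste the output into the wiki once, then replace each `TODO` as somebody works out what a table or column is for. Columns that turn out to be obvious can simply lose their line.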
One thing to consider is the COMMENT facility built into the DBMS. If you put comments on all of the tables and all of the columns in the DBMS itself, then your documentation will be inside the database system.
Using the COMMENT facility does not change the structure of any table or column. In Oracle it only stores the text in the data dictionary, where it can be read back through views such as USER_TAB_COMMENTS and USER_COL_COMMENTS.
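In case it helps, the statements can be generated from whatever notes you already have. The sketch below only builds the SQL text in Oracle's `COMMENT ON TABLE` / `COMMENT ON COLUMN` syntax and does not connect to a database. The function name, the table name `CST_MSTR`, and the sample descriptions are hypothetical.

```python
def comment_statements(table, table_comment, column_comments):
    """Build COMMENT statements for one table and its columns.

    Table and column names are interpolated as-is, so pass only trusted identifiers.
    """
    def quote(text):
        # SQL string literal: wrap in single quotes and double any embedded quote.
        return "'" + text.replace("'", "''") + "'"

    statements = [f"COMMENT ON TABLE {table} IS {quote(table_comment)};"]
    for column, text in column_comments.items():
        statements.append(f"COMMENT ON COLUMN {table}.{column} IS {quote(text)};")
    return statements
```

Calling `comment_statements("CST_MSTR", "Customer master", {"NM": "Customer's name"})` yields one statement for the table and one per column, ready to hand to whoever is allowed to run them.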
In our team we arrived at a useful approach to documenting large legacy Oracle and SQL Server databases. We use Dataedo for documenting database schema elements (data dictionary) and creating ER diagrams. Dataedo comes with a documentation repository, so your whole team can work on the documentation and read the latest version online. And you don’t need to touch the database itself (Oracle comments or SQL Server MS_Description).
First you import the schema (all tables, views, stored procedures and functions, with triggers, foreign keys etc.). Then you define logical domains/modules and group all objects (drag & drop) into them, so that you can analyze and work on smaller chunks of the database. For each module you create an ERD and write a top-level description. Then, as you discover the meaning of tables and views, write a short description for each. Do the same for each column. Dataedo lets you add a meaningful title for each object and column, which is useful if object names are vague or invalid. The Pro version lets you describe foreign keys, unique keys/constraints and triggers, which is useful but not essential for understanding a database.
You can access the documentation through the UI, or you can export it to PDF or interactive HTML (the latter is available only in the Pro version).
What is described here is a continuous process rather than a one-time job. If your database changes (e.g. new columns, views), you should sync your documentation on a regular basis (a couple of clicks with Dataedo).
See sample documentation:
http://dataedo.com/download/Dataedo%20repository.pdf
Some guidelines on documentation process:
Diagrams:
Keep your diagrams small and readable: include only the tables, relations and columns that matter for understanding the big picture (primary/business keys, important attributes and relations),
Use different color for key tables in a diagram,
You can have more than one diagram per module,
You can add diagram to description of most important tables/with most relations.
Descriptions:
Don’t document the obvious – don’t write description “Document date” for document.date column. If there’s nothing meaningful to add just leave it blank,
If objects stored in tables have types or statuses it’s good to list them in general description of a table,
Define format that is expected, eg. “mm/dd/yy” for a date that is stored in text field,
List all known/important values and their meanings, e.g. for a status column it could be something like this: “Document status: A – Active, C – Cancelled, D – Deleted”,
If there’s any API to a table (a view that should be used to read data and functions/procedures to insert/update data), list it in the description of the table,
Describe where rows’ and columns’ values come from (procedure, form, interface etc.),
Use “[deprecated]” mark (or similar) for columns that should not be used (title column is useful for this, explain which field should be used instead in description field).
We use Enterprise Architect for our DB definitions. We include stored procedures, triggers, and all table definitions defined in UML. The three brilliant features of the program are:
Import UML Diagrams from an ODBC Connection.
Generate SQL Scripts (DDL) for the entire DB at once
Generate Custom Templated Documentation of your DB.
You can edit your class / table definitions within the UML tool, and generate a fully descriptive document with pictures included. The autogenerated document can be in multiple formats including MS Word. We have just under 100 tables in our schema, and it's quite manageable.
I've never been more impressed with any other tool in my 10+ years as a developer. EA supports Oracle, MySQL, SQL Server (multiple versions), PostgreSQL, Interbase, DB2, and Access in one fell swoop. Any time I've had problems, their forums have answered my problems promptly. Highly recommended!!
When DB changes come in, we make them in EA, generate the SQL, and check it into our version control (svn). We use Hudson for building, and it auto-builds the database from scripts when it sees you've modified the checked-in SQL.
(Mostly stolen from another answer of mine)
This answer extends Kieveli's above, which I upvoted. If your version of EA supports Object Role Modeling (conceptual design, vs. logical design = ERD), reverse engineer to that and then fill out the model with the expressive richness it gives you.
The cheap and lighter-weight option is to download Visiomodeler for free from MS, and do the same with that.
The ORM (call it ORMDB) is the only tool I've ever found that supports and encourages database design conversations with non-IS stakeholders about BL objects and relationships.
Reality check - on the way to generating your DDL, it passes through a full-stop ERD phase where you can satisfy your questions about whether it does anything screwy. It doesn't. It will probably show you weaknesses in the ERD you designed yourself.
ORMDB is a classic case of the principle that the more conceptual the tool, the smaller the market. Girls just want to have fun, and programmers just want to code.
A wiki solution supports hyperlinks and collaborative editing, but a wiki is only as good as the people who keep it organized and up to date. You need someone to take ownership of the document project, regardless of what tool you use. That person may involve other knowledgeable people to fill in the details, but one person should be responsible for organizing the information.
If you can't use a tool to generate an ERD by reverse engineering, you'll have to design one by hand using Toad or Visio.
Any ERD with hundreds of objects is probably useless as a guide for developers, because it'll be unreadable with so many boxes and lines. In a database with so many objects, it's likely that there are "sub-systems" of a few dozen tables and views each. So you should make custom diagrams of these sub-systems, instead of expecting a tool to do it for you.
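One way to get a first guess at those sub-systems is to treat the foreign keys as a graph and look at its connected components. This is only a sketch of that idea, not something a tool above does for you: the function name `subsystems` and the sample table names are made up, and the list of foreign keys would have to come from your own catalog query.

```python
from collections import defaultdict


def subsystems(tables, foreign_keys):
    """Group tables into connected components of the foreign-key graph.

    foreign_keys is an iterable of (child_table, parent_table) pairs.
    """
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    seen, groups = set(), []
    for table in sorted(tables):
        if table in seen:
            continue
        # Depth-first walk over everything reachable from this table.
        group, stack = set(), [table]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(neighbours[current] - seen)
        groups.append(sorted(group))
    return groups
```

Treat the result as a starting point only. A shared lookup or audit table can glue unrelated areas into one big component, and relationships that exist only by convention, without a declared constraint, will not show up at all.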
You can also design a pseudo-ERD, where groups of tables are represented by a single object in one diagram, and that group is expanded in another diagram.
A single ERD or a set of ERDs is not sufficient to document a system of this complexity, any more than a class diagram would be adequate to document an OO system. You'll have to write a document, using the ERDs as illustrations. You need text descriptions of the meaning and use of each table, each column, and the relationships between tables (especially where such relationships are implicit instead of represented by referential integrity constraints).
All of this is a lot of work, but it will be worth it. If there's a clear and up-to-date place where the schema is documented, the whole team will benefit from it.
Since you have the luxury of working with fellow developers who are in the same boat, I would suggest asking them what they feel would convey the needed information most easily. My company has over 100 tables, and my boss gave me an ERD for a specific set of tables that all connect. So you might also want to try breaking one massive ERD into a bunch of smaller, manageable ERDs.
Well, a picture tells a thousand words so I would recommend creating ER diagrams where you can view the relationship between tables at a glance, something that is hard to do with a text-only description.
You don't have to do the whole database in one diagram, break it up into sections. We use Visual Paradigm at work but EA is a good alternative as is ERWIN, and no doubt there are lots of others that are just as good.
If you have the patience, then using HTML to document the tables and columns makes your documentation easier to access.
If describing your databases to your end users is your primary goal Ooluk Data Dictionary Manager can prove useful. It is a web-based multi-user software that allows you to attach descriptions to tables and columns and allows full text searches on those descriptions. It also allows you to logically group tables using labels and browse tables using those labels. Tables as well as columns can be tagged to find similar data items across your database/databases.
The software allows you to import metadata information such as table name, column name, column data type, foreign keys into its internal repository using an API. Support for JDBC data sources comes built-in and can be extended further as the API source is distributed under ASL 2.0. It is coded to read the COMMENTS/REMARKS from many RDBMSs. You can always manually override the imported information. The information you can store about tables and columns can be extended using custom fields.
The Data Dictionary Manager uses the "data object" and "attribute" terminology instead of table and column because it isn't designed specifically for relational databases.
Notes
If describing technical aspects of your database such as triggers, indexes, statistics is important this software isn't the best option. It is however possible to combine a technical solution with this software using hyperlink custom fields.
The software doesn't produce an ERD
Disclosure: I work at the company that develops this product.