Using Solr for indexing different types of data

I'm considering the use of Apache Solr for indexing data in a new project. The data consists of several different, independent types, so there are for example
botanicals
animals
cars
computers
to index. Should I use a separate index for each type, or does it make more sense to use only one index? And how does using many indexes affect performance?
Or is there any other possibility to achieve this?
Thanks.

Both are legitimate approaches, but there are tradeoffs. First, how big is your dataset? If it is large enough that you may want to partition it across multiple servers, it probably makes sense to have different indexes.
Second, how important is performance - indexing it all together will likely result in worse performance, but the degree depends on how much data there is and how complex the queries can get.
Third, do you need to query for multiple data types in the same search? If so, indexing everything together can be a convenient way to allow this. Technically this could be achieved with separate indexes, but getting the most relevant results for the query could be a challenge (not that it isn't already).
Fourth, a single index with a single schema and configuration can simplify the life of whoever will be deploying and maintaining the system.
One other thing to consider is IDs - do all of the different objects have a unique identifier across all types? If not, you will probably need to generate one if you want to index them together.
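If you do go with one index, here is a rough sketch of what that can look like over Solr's HTTP API, in Python with the requests library. The core name ("catalog"), the field names (id, type, name) and the sample data are all assumptions for illustration; your schema would need matching fields.

# Sketch: indexing heterogeneous records into a single Solr core with a
# "type" field and a generated, globally unique id. Core and field names
# here are invented for illustration.
import requests

SOLR = "http://localhost:8983/solr/catalog"

records = [
    ("cars",      {"local_id": 17, "name": "Roadster"}),
    ("computers", {"local_id": 17, "name": "ThinkPad X230"}),
]

docs = []
for record_type, record in records:
    docs.append({
        # Prefix the per-type id with the type name so ids never collide
        # across the independent datasets.
        "id": f"{record_type}-{record['local_id']}",
        "type": record_type,
        "name": record["name"],
    })

# Send the documents as a JSON array and commit in the same request.
requests.post(f"{SOLR}/update", json=docs, params={"commit": "true"}).raise_for_status()

# Restricting a search to one type is then just a filter query:
resp = requests.get(f"{SOLR}/select",
                    params={"q": "*:*", "fq": "type:cars", "wt": "json"})
print(resp.json()["response"]["numFound"])
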

Related

Are SQL tuning techniques always the same for different DB engines?

I used Oracle for the past half year and learned some SQL tuning tricks, but now our DB is moving to Greenplum and the project manager has suggested we change some of the code that was written for Oracle SQL, for efficiency or grammar reasons.
I am curious: are SQL tuning techniques the same for different DB engines, like Oracle, PostgreSQL, MySQL and so on? If yes or no, why? Any suggestions are welcome!
Some examples of what I mean (there is a small sketch after this list):
in or exists
count(*) or count(column)
use index or not
use exact column instead of select *
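To make these concrete, here is a small sketch of the kind of alternatives I mean, using SQLite only because it runs anywhere; the schema is invented.

# Sketch of the query alternatives from the list above, on an invented
# schema. Each engine may plan these pairs differently; EXPLAIN QUERY PLAN
# is just SQLite's way of showing its choice.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

queries = {
    "IN":            "SELECT name FROM customers WHERE id IN (SELECT customer_id FROM orders)",
    "EXISTS":        "SELECT name FROM customers c WHERE EXISTS "
                     "(SELECT 1 FROM orders o WHERE o.customer_id = c.id)",
    "COUNT(*)":      "SELECT COUNT(*) FROM orders",
    "COUNT(column)": "SELECT COUNT(total) FROM orders",   # skips NULL totals
    "SELECT *":      "SELECT * FROM orders WHERE customer_id = 1",
    "exact columns": "SELECT id, total FROM orders WHERE customer_id = 1",
}

for label, sql in queries.items():
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print(label, "->", plan)
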
For the most part the syntax that is used will remain the same. There may be small differences from one engine to another, and you may run into different terms and constructs for some of the more specific output or the more complex tasks. To achieve parity you will need to learn those new terms.
As far as tuning goes, this will vary from system to system. Going from Oracle to Greenplum specifically, you are moving from a database where query efficiency is often driven by putting an index on the data, to a parallel execution system where efficiency is gained by effectively distributing the data across multiple segments and querying them in parallel. In Greenplum, indexing is an additional layer that usually does not add benefit, just additional overhead.
Even within a single system, changing the storage engine type can result in different ways to optimize a query. In practice, queries are often moved to a new platform and work, but are far from optimal because they don't take advantage of that platform's optimizations. I would strongly suggest getting an understanding of the new platform, and you should not assume that a query optimized for one platform is the optimal way to run it on another.
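As a rough, hedged illustration of that difference: in Greenplum the main tuning lever is usually the distribution clause chosen when the table is created, not an index added afterwards. The sketch below uses psycopg2 (Greenplum speaks the PostgreSQL protocol); the connection details, table and columns are invented.

# Hedged sketch: distribution, not indexing, is where Greenplum tuning
# usually starts. Connection string and table names are invented.
import psycopg2

conn = psycopg2.connect("host=gp-master dbname=sales user=etl")
cur = conn.cursor()

# Spread rows across segments by customer_id so joins and aggregates keyed
# on customer_id can run locally on each segment, in parallel.
cur.execute("""
    CREATE TABLE order_facts (
        order_id    bigint,
        customer_id bigint,
        amount      numeric
    ) DISTRIBUTED BY (customer_id)
""")

# The Oracle instinct -- CREATE INDEX ... ON order_facts (customer_id) --
# would typically just add overhead here; this query is answered by all
# segments scanning their slice of the data in parallel.
cur.execute("SELECT customer_id, sum(amount) FROM order_facts GROUP BY customer_id")
print(cur.fetchall())
conn.commit()
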
Getting into the specifics of why they differ requires someone who is an expert in both in order to compare them. I don't claim to know much about Greenplum.
The basic principles, which I would expect all developers to learn over time, don't really change. But there are "quirks" of individual engines which make specific differences. From your question I would personally expect the first and fourth items (IN vs EXISTS, and selecting exact columns instead of SELECT *) to remain the same.
Indexing is something which does vary. For example, the ability to use two indexes in one query was not (is not?) ubiquitous. I wouldn't like to guess which DBMSs can or can't make use of columns from the second field onwards in a composite index. And the way indexes are maintained is very different from one DBMS to the next.
From my own experience I've also seen differences caused by:
Different capabilities in the data access path. As an example, one optimisation is for a DBMS to create a bitmap of rows (matching and not matching), then combine multiple bitmaps to select rows. A DBMS with this feature can use multiple indexes in a single query; one without it can't.
Availability (or lack) of optimizer hints. Not all DBMSs support them; I know they are very common in Oracle.
Different locking strategies. This is a big one and can really affect update and insert queries.
In some cases DBMS have very specific capabilities for certain types of data such as geographic data or searchable free text (natural language). In these cases the way of working with the data is entirely different from one DBMS to the next.

Elasticsearch: querying multiple types and grouping results by type?

Suppose I am to search against two types [cars] and [buildings], and I would want the results to be separated. Is there a way one can group results by types?
I understand one simple way would be to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way, or a hacky way (like using sort), to achieve this?
This type of grouping behavior is (currently) not available in Elasticsearch. It has been a long-standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those yourself.
Using multi-search (the _msearch API, sketched below). This lets you easily pass down some number of queries in a single batch, but has potential scaling problems if the number of queries gets too large.
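Here is a hedged sketch of the multi-search approach over plain HTTP. The index, type and field names are invented, and mapping types no longer exist in recent Elasticsearch releases, so treat the per-type header as era-specific.

# Sketch: one _msearch call that runs a separate query per type, so the
# results come back already separated. Index/type/field names are invented.
import json
import requests

types = ["cars", "buildings"]
lines = []
for t in types:
    lines.append(json.dumps({"index": "catalog", "type": t}))    # header line
    lines.append(json.dumps({"query": {"match": {"name": "tower"}}, "size": 10}))

body = "\n".join(lines) + "\n"    # _msearch bodies are newline-delimited JSON
resp = requests.post("http://localhost:9200/_msearch",
                     data=body,
                     headers={"Content-Type": "application/x-ndjson"})

# responses[i] corresponds to the i-th query, i.e. the i-th type.
for t, result in zip(types, resp.json()["responses"]):
    print(t, result["hits"]["total"])
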
Result grouping is one feature that Solr has and Elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since Elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search across different types. Then it would be reasonable to add a different type, say [price_aggregator], and put the fields [type] and [price] into it. You could then easily build your query against just one type. This requires some additional work while indexing and more memory to store the index, but it is much more performant when you search.
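A rough sketch of that indexing step, again over plain HTTP with invented names; on Elasticsearch versions without mapping types you would keep the original type in an ordinary field instead of a real _type.

# Sketch: writing the denormalized "price_aggregator" documents with the
# bulk API. All names are invented; _type only applies to old versions.
import json
import requests

source_docs = [
    {"type": "cars",      "id": "car-1",      "price": 23000},
    {"type": "buildings", "id": "building-7", "price": 450000},
]

lines = []
for doc in source_docs:
    lines.append(json.dumps({"index": {"_index": "catalog",
                                       "_type": "price_aggregator",
                                       "_id": doc["id"]}}))
    lines.append(json.dumps({"source_type": doc["type"], "price": doc["price"]}))

resp = requests.post("http://localhost:9200/_bulk",
                     data="\n".join(lines) + "\n",
                     headers={"Content-Type": "application/x-ndjson"})
resp.raise_for_status()
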

Emulating join behavior with Rails and Mongoid

I just wanted to ask some advice when building a database with MongoDB. I have been reading a lot that if you have a database with a lot of joins it's better to go with, say, PostgreSQL.
So if I wanted flexibility and needed my data to join multiple times, should I go with PostgreSQL? I know MongoDB has fast reads/writes but needs to query multiple times to emulate joins. So when would this become a performance hit? Does MongoDB limit your ability to create new complex relationships on your data that did not previously exist?
I guess the attractiveness of mongodb is its javascript syntax and similarity to json :)
I will start from the end:
I guess the attractiveness of mongodb is its javascript syntax and
similarity to json :)
Not only that, and the JSON style is not the main advantage. The main advantages of MongoDB are the ability to embed documents, high performance and full scalability, full index support, map/reduce, and so on.
So if I wanted flexibility and needed my data to join multiple times,
should I go with Postgresql?
It depends on the concrete task. For example, if you are designing a reporting system, I would prefer a relational database. But sometimes, instead of joins and separate collections, you can embed documents; MongoDB is also a good fit for data denormalization (and in many situations you can denormalize in the background to avoid joins).
I know mongodb has fast reads / writes but needs to query multiple
times to emulate joins. So when would this become a performance hit?
If you use MongoDB as a regular relational database (without embedding and denormalization) you will never achieve the best performance.
Does mongodb limit your ability to create new complex relationships on
your data that did not previously exist?
No, MongoDB does not limit you, because it does not contain any constraints between collections like the foreign keys in a SQL database, and it allows you to embed and easily denormalize data to fit your business needs and achieve the best performance.
Another alternative would be to denormalize your data.
You store copies of data in multiple tables/collections. In doing so, you avoid the need for JOINs and lookups needed to stitch together related pieces of data.
You avoid joins and you’re storing more data - but your overall application can be faster.
In Mongoid there are two great gems to make this easier:
mongoid_alize &
mongoid_denormalize
http://blog.joshdzielak.com/blog/2012/05/03/releasing-mongoid-alize-comprehensive-field-denormalization-for-mongoid/
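Those gems are Mongoid/Ruby specific; purely to illustrate the idea at the driver level, here is a hedged sketch with pymongo (database, collection and field names are invented):

# Sketch of manual denormalization with pymongo: copy the author's name
# into each post document so rendering a post list needs no "join".
# Database, collection and field names are invented for illustration.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["blog"]

author = {"_id": 1, "name": "Alice", "bio": "..."}
db.users.insert_one(author)

db.posts.insert_one({
    "title": "Denormalization in practice",
    "author_id": author["_id"],      # keep the reference for detail views
    "author_name": author["name"],   # denormalized copy for cheap listing
})

# Reads are now a single query...
for post in db.posts.find({}, {"title": 1, "author_name": 1}):
    print(post["title"], "by", post["author_name"])

# ...at the price of propagating updates yourself (often in the background).
db.users.update_one({"_id": 1}, {"$set": {"name": "Alice B."}})
db.posts.update_many({"author_id": 1}, {"$set": {"author_name": "Alice B."}})
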
You can always use:
http://www.mongodb.org/display/DOCS/MapReduce
Or
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
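For the aggregation route, the rough modern equivalent of the group link above is the aggregation pipeline; here is a minimal pymongo sketch with invented collection and field names:

# Minimal sketch of the aggregation framework's $group stage with pymongo;
# collection and field names are invented.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

pipeline = [
    {"$match": {"status": "complete"}},                  # filter first
    {"$group": {"_id": "$customer_id",                   # group key
                "order_count": {"$sum": 1},
                "total_spent": {"$sum": "$amount"}}},
    {"$sort": {"total_spent": -1}},
]

for row in db.orders.aggregate(pipeline):
    print(row["_id"], row["order_count"], row["total_spent"])
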

Should I be concerned that ORMs, by default, return all columns?

In my limited experience in working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure that this applies with it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First of all: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main thing for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like the following (there is a rough sketch after the list):
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent tables doesn't grab all those large blobs of data
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
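Here is a rough sketch of the first idea, using SQLite just for convenience; the table and column names are invented.

# Rough sketch: keep wide/binary columns out of the base table so "load the
# whole entity" stays cheap. SQLite used only because it runs anywhere;
# table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id    INTEGER PRIMARY KEY,
        title TEXT,
        owner TEXT
    );

    -- 1:1 side table holding the heavy payload
    CREATE TABLE document_contents (
        document_id INTEGER PRIMARY KEY REFERENCES documents(id),
        body        BLOB
    );
""")

conn.execute("INSERT INTO documents (id, title, owner) VALUES (1, 'Q3 report', 'alice')")
conn.execute("INSERT INTO document_contents VALUES (1, ?)", (b"\x00" * 1024 * 1024,))

# The ORM's usual select-all-columns on the base table no longer drags the
# megabyte blob along; fetch the payload only when it is actually needed.
print(conn.execute("SELECT id, title, owner FROM documents WHERE id = 1").fetchone())
blob, = conn.execute("SELECT body FROM document_contents WHERE document_id = 1").fetchone()
print(len(blob))
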
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30) or if a column held a lot of data, for example over 5,000 characters in a field.
The approach I have used is to just map another object to the existing table, but with only the fields I need. So for a search that populates a table with 100 rows I would return a MyObjectLite, but when I click to view the details of that row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a projection, Transformers.AliasToBean and a DTO. Here is how it looks in the Criteria API (assuming a mapped Package entity and a PackageNameDTO with Id and Caption properties):

// Package is assumed to be the mapped entity here.
var dtos = session.CreateCriteria<Package>()
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption"))
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

DB Design: Does having 2 Tables (One is optimized for Read, one for Write) improve performance?

I am thinking about a DB Design Problem.
For example, I am designing this stackoverflow website where I have a list of Questions.
Each Question contains certain meta data that will probably not change.
Each Question also contains certain data that will be consistently changing (Recently Viewed Date, Total Views...etc)
Would it be better to have a Main Table for reading the constant meta data and doing a join
and also keeping the changing values in a different table?
OR
Would it be better to keep everything all in one table.
I am not sure if this is the case, but when updating, does the ROW lock?
When designing a database structure, it's best to normalize first and change for performance after you've profiled and benchmarked your queries. Normalization aims to prevent data-duplication, increase integrity and define the correct relationships between your data.
Bear in mind that performing the join comes at a cost as well, so it's hard to say if your idea would help any. Proper indexing with a normalized structure would be much more helpful.
And regarding row-level locks, that depends on the storage engine - some use row-level locking and some use table-locks.
Your initial database design should be based on conceptual and relational considerations only, completely independent of physical considerations. Database software is designed and intended to support good relational design. You will hardly ever need to relax those considerations to deal with performance. Don't even think about the costs of joins, locking, and activity type at first; further along, put off these considerations until all other avenues have been explored.
Your rdbms is your friend, not your adversary.
You should have the two tables separated out, as you might want to record the history of the question: the main Question table is indexed by question ID, and the Status table is indexed by question ID plus a date/time stamp and contains a row for each time the status changes (sketched below).
I don't know that the updates are really significant unless you were using pessimistic locking, where the row would be locked for a period of time.
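A minimal sketch of that two-table layout, with SQLite only for convenience and invented names: the questions table holds the stable metadata and question_stats gets one timestamped row per change.

# Sketch of the split described above: stable question metadata in one
# table, a timestamped row per status change in the other. Names invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (
        id    INTEGER PRIMARY KEY,
        title TEXT,
        body  TEXT
    );

    CREATE TABLE question_stats (
        question_id INTEGER REFERENCES questions(id),
        recorded_at TEXT,
        total_views INTEGER,
        PRIMARY KEY (question_id, recorded_at)
    );
""")

conn.execute("INSERT INTO questions VALUES (1, 'DB design question', '...')")
conn.execute("INSERT INTO question_stats VALUES (1, '2013-01-01 10:00:00', 10)")
conn.execute("INSERT INTO question_stats VALUES (1, '2013-01-02 09:30:00', 25)")

# Latest stats joined back onto the unchanging metadata:
row = conn.execute("""
    SELECT q.title, s.total_views
    FROM questions q
    JOIN question_stats s ON s.question_id = q.id
    ORDER BY s.recorded_at DESC
    LIMIT 1
""").fetchone()
print(row)
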
I would look at caching your results, either locally with ASP.NET caching or using memcached.
This would certainly be a bad idea if you were using Oracle. In Oracle, you can quite happily read records while other sessions are modifying them, due to its multi-version concurrency control. You would incur an extra performance penalty for the join for no savings.
A design pattern that is useful, however, is to pre-join tables, pre-calculate aggregates, or pre-apply WHERE clauses using materialized views.
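A hedged sketch of the materialized-view idea, shown with PostgreSQL syntax via psycopg2 because it is easy to demonstrate (Oracle's syntax and refresh options differ); the connection details, tables and columns are invented.

# Hedged sketch of a pre-joined, pre-aggregated materialized view.
# PostgreSQL syntax shown (Oracle differs, e.g. REFRESH FAST ON COMMIT);
# connection details and names are invented.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=qa user=app")
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW question_summary AS
    SELECT q.id, q.title, count(v.question_id) AS total_views
    FROM questions q
    LEFT JOIN question_views v ON v.question_id = q.id
    GROUP BY q.id, q.title
""")

# Readers hit the precomputed summary instead of re-running the join:
cur.execute("SELECT title, total_views FROM question_summary ORDER BY total_views DESC LIMIT 10")
print(cur.fetchall())

# Refresh on whatever schedule your staleness budget allows:
cur.execute("REFRESH MATERIALIZED VIEW question_summary")
conn.commit()
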
As already said, better to start with a clean, normalized design. It's just easier to denormalize later than to go the other way around. Experience teaches that you will never normalize that one big table! You will just throw more columns in as needed, you will need more and more indexes, and updates will get slower and slower.
You should also take a look at the expected loads: will there be more new answers or just more querying? What other operations will you have? When it comes to optimization, you can use the features of your DBMS: indexing, views, ...
Eran Galperin already provided most of my answer. In addition, the structure you propose really wouldn't help you in terms of locking. If there are relatively static and dynamic attributes in the same row, breaking the static and dynamic parts into two tables isn't of much benefit. It doesn't matter if static data is being locked, since no one is trying to change it anyway.
In fact, you may actually do worse with this design. Some database engines use page locking. If a table has fewer/smaller columns, more rows will fit on a page, and the more rows there are on a page, the more likely lock contention becomes. By having the static data mixed in with the dynamic data, the rows are bigger, therefore there are fewer rows per page, and therefore fewer waits on page locks.
If you have two independent sets of dynamic attributes, and they are normally modified by different actors, then you might get some benefit by breaking them into different tables. This is a pretty unusual case, however.
I'd also point out that breaking the table into a static and dynamic portion may not be of benefit in a relatively small environment, but in a large distributed environment it may be useful to cache and replicate the dynamic data at different rates than the static data.