Is full-text indexing for high-transaction responses a good idea?

For example, in order to provide an effective way to query respondents' answers to a dynamic questionnaire, where responses are stored as keyword/response pairs.
I am aware that there may be some latency in updating the catalogue/full-text index as new entries are added, but this may not matter if reporting/querying is not a real-time concern (i.e. it is performed at some later date).
So in answer to my own question, the transactional aspect of this doesn't actually matter, does it?

I would distinguish between data consistency in the selected storage and the gap between data arriving and appearing in search results for the user. You might use an external or even remote search solution for your application, and the index update might take significant time, depending on the case.
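As a rough illustration, here is a minimal sketch assuming SQL Server full-text search, with hypothetical table, key and catalog names; it defers index population so the transactional insert path carries no indexing cost:
CREATE FULLTEXT CATALOG QuestionnaireCatalog;
CREATE FULLTEXT INDEX ON dbo.Responses (ResponseText)
    KEY INDEX PK_Responses           -- the table's unique key index
    ON QuestionnaireCatalog
    WITH CHANGE_TRACKING MANUAL;     -- new rows are not indexed immediately
-- Later, e.g. shortly before reporting runs, push pending changes into the index:
ALTER FULLTEXT INDEX ON dbo.Responses START UPDATE POPULATION;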


Buffer table in a database, Good or not?

I have a question!
I need to make a university project, and in this project I will have one database table like this:
This table will have a LOT of records!
And to manage this I need to create a validation system.
Which is better (and why): creating a buffer table like this:
Or adding a column to my table like this:
Thank you!
Your question does not have enough information to provide a real answer. Here is some guidance on how to think about the situation. Which approach is better depends on the nature of your application and especially on what "validation" means.
One reasonable interpretation is that "validation" is part of a work-flow process, so it happens only once (or 99% of the time only once). And you never want to see unvalidated advertisements when you look at advertisements. If this is the case, then there would typically be additional information about the validation process.
This scenario suggests two reasonable approaches:
Do the validation inside a transaction. This would be reasonable if the validation process were entirely in the database and measured in seconds.
Have a separate table for advertisements being validated. Perhaps even a separate table per "user" or "entity" responsible for them. Depending on the nature of the validation process, this could be a queue that feeds them to people doing the validation.
Putting them in the "advertisements" table doesn't make sense, because there is likely to be additional information involved with the validation process -- who, what, where, when, how.
If an advertisement can be validated and invalidated multiple times, then the best approach may be to put them in the same table. Once again, there are questions about the nature of the process.
Getting access to the two groups without a full table scan is tricky. If 10% of the rows are invalidated and 90% are validated, then a normal index would require a full table scan for reading either group. To get faster access to the smaller group, here are two options:
A clustered index on the validation flag.
Separate partitions for validated and invalidated rows.
In both cases, changing the validation flag for a record is relatively expensive, because it involves reading and writing the record on different data pages. Unless dozens of changes are made per second, this is probably not a big deal.
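As a rough sketch of the partitioning option (SQL Server syntax; the table and all names here are illustrative, not from the question), partitioning on the flag keeps the two groups on separate sets of pages:
-- Partition 1 holds valid = 0, partition 2 holds valid = 1.
CREATE PARTITION FUNCTION pf_valid (bit) AS RANGE LEFT FOR VALUES (0);
CREATE PARTITION SCHEME ps_valid AS PARTITION pf_valid ALL TO ([PRIMARY]);
CREATE TABLE dbo.advertisements (
    id     int           NOT NULL,
    title  nvarchar(200) NOT NULL,
    valid  bit           NOT NULL,
    CONSTRAINT PK_advertisements PRIMARY KEY (id, valid)  -- partitioning column must be part of the clustered key
) ON ps_valid (valid);
Flipping valid from 0 to 1 then moves the row to the other partition, which is the read-and-rewrite cost described above.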
Here, there is no need to have a separate "buffer table". You can just properly index the valid field. So the following index would essentially automatically create a buffer table:
create unique index x on y (id)
    include (col1, col2 /* list the remaining columns; there is no literal "all columns" shorthand */)
    where valid = 0
This index effectively creates a copy of the not-yet-validated data. You can do lots of variations, such as
create unique index x on y (valid, id)
There's really no need for a separate table. Indexes are very easy compared to partitioning, or even manually partitioning into separate tables: much less work, more general, more flexible, and less potential for human error.
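For example (column names are hypothetical), a query for the unvalidated subset is then answered from that small filtered index alone:
-- Reads only the "valid = 0" slice; the validated 90% is never touched.
SELECT id
FROM y
WHERE valid = 0;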
Either approach is valid, and which will perform better will depend more on the type of database you are using rather than the theoretical question of whether it is more correct to use a boolean or partition this into two tables.
I actually prefer the partitioning approach (your buffer table idea), but it will be more complex to code around. This may be a significant point to consider. Most modern databases will handle the boolean criteria very well with an index, but sometimes you can be surprised.
The most important thing from a development perspective right now is to pick one and run with it instead of paralyzing your project while you decide the "right" one.

Row Inserted and Updated Time in Fact Table

I see the importance of having row-inserted and row-last-updated fields in a fact table. But I could not find any data warehousing standard or reference which says that this is a good thing to do. I am uncertain whether this is because it is a bad practice; if so, why should it be so? If it is because of the data size, I see it is only 8 bytes for a full date field.
Any help is greatly appreciated!
There's no standard that says whether it's a good or bad practice, because we include creation time and updated time only if we need them or ever will need them.
It's a "good thing to do" if you need to access those columns and a "bad thing to do" if your table will never require those columns.
The inclusion of insert and update timestamps in your data warehouse tables allows you to report from an "as was" and/or "as is" perspective with regard to the data warehouse. These timestamps would be in addition to any timestamps that may be captured from the source.
They also make troubleshooting easier and, in a worst-case scenario, give you the ability to back out a set of data from a specific run of an ETL process.
At a previous client, the data model we implemented included upwards of 6 different timestamps to provide slowly changing history, as-is/as-was reporting, and source-related timestamps. It made for very flexible reporting but also increased the learning curve of how to get exactly what you wanted from the table(s).
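As a sketch of what this can look like (SQL Server syntax; all names are made up for illustration), the warehouse-side timestamps and a load identifier sit alongside whatever dates come from the source:
CREATE TABLE dbo.fact_sales (
    sale_id          bigint        NOT NULL,
    customer_key     int           NOT NULL,
    amount           decimal(12,2) NOT NULL,
    source_sale_date date          NULL,                                -- timestamp captured from the source system
    etl_batch_id     int           NOT NULL,                            -- which ETL run loaded the row
    row_inserted_at  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),   -- warehouse insert time
    row_updated_at   datetime2     NOT NULL DEFAULT SYSUTCDATETIME()    -- warehouse last-update time
);
-- Backing out one bad run (hypothetical batch id) becomes a single statement:
DELETE FROM dbo.fact_sales WHERE etl_batch_id = 42;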

Permanent table, temp tables or PHP session?

My web app offers personalized recommendations. When a user starts using it, about 1000+ rows are inserted into one big recommendation table, correlated with other tables in the database. Every item the user votes for affects all of those 1000+ rows.
Since the recommendation info is only useful during the session, and since the recommendation table is getting huge, we'd like to switch to a more appropriate method. There's the possibility of deleting the relevant rows as soon as the user session is over. I guess a PHP session array or temp tables would be better for this case?
One temp table per session will lead to catalog pollution, so it's not really recommended.
Have you considered actually keeping the data, so you can periodically mine it to improve the suggestions?
First: consider redesigning your data structure, I think it is not optimal.
Store a user's recommendations in a table of (user, recommended item, score): I don't see any need for a temp table or anything else.
Otherwise, you could start using sessions, but you should encapsulate the code carefully, making it easy to change if/when this solution is no longer maintainable.
I suspect that the method is flawed - 1000+ recommendations per user? How many of them do they ever look at? If you don't know the answer to that question - then you need to spend some time thinking about why you don't know the answer.
Every item the user votes for affects all of those 1000+ rows
Are you sure your data is properly normalised?
But leaving that aside for the moment: the right place to generate and store that data is in the database. A relational database is explicitly designed for, and a lot more efficient at, generating and maintaining tabular sets of data than a conventional programming language.
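A sketch of that normalized, in-database approach (MySQL syntax; names are illustrative), with a scheduled purge standing in for per-session temp tables:
CREATE TABLE user_recommendation (
    user_id    INT       NOT NULL,
    item_id    INT       NOT NULL,
    score      FLOAT     NOT NULL,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, item_id)
);
-- If the data really is only useful during the session, purge stale rows on a schedule
-- instead of creating and dropping a table per session:
DELETE FROM user_recommendation
WHERE updated_at < NOW() - INTERVAL 1 DAY;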

Storing multiple choice values in database

Say I let a user check off the languages she speaks and store them in a db. Important side note: I will not search the db for any of those values, as I will have a separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high-load and we are trying to eliminate any overhead where possible, so in order to avoid joins with the main member table when showing results in the UI, I was thinking of storing the languages for a user in the main table, comma-separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
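To make the "all users who speak x" point concrete, here is a sketch against the normalized UserLanguages table from the question versus the comma-separated column (the Users table and Languages column names are assumed for illustration):
-- Normalized: a plain, indexable join.
SELECT u.UserID
FROM Users AS u
JOIN UserLanguages AS ul ON ul.UserID = u.UserID
WHERE ul.LookupLanguageID = 12;
-- Comma-separated column: string matching that no normal index can help with.
-- SELECT UserID FROM Users WHERE ',' + Languages + ',' LIKE '%,12,%';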
You might not be missing anything now, but when your requirements change you might regret that decision. You should store it normalized, like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or other will definitely come back and say, "Hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / postback, which results in more client-side code.
You lose all pretense of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the SQL going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't, then you have to do a code deploy for each little change. Bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low-use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for its profile page?
I generally stay away from the solution you described; you are asking for trouble when you store relational data in such a fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
-- and so on, in powers of 2
So if someone selected English and Russian the value would be 17 (1 + 16), and you could easily query the values with bitwise operators.
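For example (table and column names are hypothetical), "who speaks Russian" then becomes a bitwise AND against the stored integer:
-- Users whose Languages bitmask has the Russian bit (16) set:
SELECT UserID
FROM Users
WHERE Languages & 16 = 16;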
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the performance necessary for a particular problem domain, one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo: changing your data model once it is in production takes a lot more effort than when it is initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceeds a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like planning a multi-day cross-country trip and worrying about one extra 10-minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view in this debate, look at WordPress. Its tables are chock-full of delimited data, and it's a great, simple platform.

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different WHERE all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
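A small sketch of what "index well" might mean here (table and column names are made up for illustration):
-- An index matching the columns used by the varying WHERE clauses lets the
-- server return the 1-2 thousand matching rows without scanning all 100K.
CREATE INDEX ix_orders_customer_status ON orders (customer_id, status);
SELECT *
FROM orders
WHERE customer_id = 42 AND status = 'open';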
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developer's code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. ("Premature optimization is the root of all evil", as Knuth famously said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
First of all, let us dismiss #2. Searching tables is a database server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is being done. If it's in the application code, then you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and did not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
if you do this:
SELECT * FROM users;
MySQL first has to look up the table's field list to expand the *, and then bring back the data you asked for.
doing
SELECT id, email, password FROM users;
MySQL can go straight to the data, since the fields are explicit.
About limits: it's always best to query only the quantity of rows you will need, no more, no less. More data means more time to move it.
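For instance (MySQL syntax, continuing the example above), fetch only the rows a page will actually display:
SELECT id, email
FROM users
ORDER BY id
LIMIT 50;   -- no more rows than the page needs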