SQL Full Text Search - Design Decision: Multiple Tables or One Huge One

I'm somewhat new to SQL. I'm creating a database with one table of movie metadata, and I need to do full-text searches on the movie scripts, which are currently organized into large tables, one per movie, with columns for line number, timestamp, and a body of text (which needs to be searchable for keywords and phrases). My question is whether my searches would run faster using one massive table for all of the scripts instead of one table per movie. I'm using SQLite and Python, with fts4 for the full-text search capabilities.

Having each movie in a separate table is not an easy architecture to maintain, as you have to add or delete tables whenever a movie is added or removed.
Searching a single massive table can be efficient, but only with indexes that have a good clustering factor.
Looking at the problem as a whole, I think a document-oriented database like MongoDB would give you better storage and search for this kind of data.
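That said, if you stay with SQLite, the single-table design the question describes is straightforward with FTS4. A minimal sketch, with illustrative table and column names rather than your real schema:

CREATE VIRTUAL TABLE script_lines USING fts4(
    movie_id,      -- points back at the movie metadata table
    line_number,
    timestamp,
    body           -- the searchable script text
);

-- Search every script at once:
SELECT movie_id, line_number, snippet(script_lines) AS hit
FROM script_lines
WHERE body MATCH 'rosebud';

-- Or restrict the same search to a single movie:
SELECT line_number, body
FROM script_lines
WHERE body MATCH 'rosebud' AND movie_id = '42';

Every FTS4 column is stored and indexed as text, so the movie_id filter above is a plain text comparison; the point is that adding or removing a movie becomes an INSERT or DELETE rather than a CREATE TABLE or DROP TABLE.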

Related

Creating a DB for comment sections with multiple page tags

I'm having a tough time choosing the correct database (SQL, NoSQL) for this use case even though it's so common.
This is the main record -
Page:
Contains a number of fields (which will probably change and be updated in the future).
Connected to a list of tags (which can contain up to 50 tags).
Connected to a comment section.
Page records will be queried by using the tags.
In general, the reads are more important (so writes could be more expensive) and the availability should be high.
The reason not to choose a MongoDB-style DB is the comment section: there are no joins, so the comment section must be embedded in the page document, and the document size could grow too much.
Also, in CAP terms MongoDB leans towards consistency rather than availability, and availability is important to me.
The reason not to choose SQL is that the schema of the Page could be updated and there is no fixed schema.
Also because of the tags system - another relational table would have to be created and, as I understand it, that's bad for performance.
What's the best approach here?
Take a look at Postgres and you can have the best of both.
Postgres supports jsonb, which allows indexing of jsonb data types, so your search by tags can be executed pretty efficiently; alternatively, keep the tags as an array data type.
If you're concerned about embedding the comments, then link off to another table and benefit from joins, which are first-class citizens.
Given your use-case, you could have a Pages table with main, well known columns and a few foreign keys to Authors etc, tags as an array or jsonb column, some page attributes in jsonb and your comments in a separate Comments table with foreign key to Users and Pages.
Both MongoDB and Postgres are great choices.
PS, I have built far more on MongoDB than Postgres, but I was really impressed by Postgres after a recent evaluation for a new project.
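A minimal sketch of that layout, assuming Postgres 9.4+ and using illustrative table and column names:

-- Well-known columns stay relational, flexible attributes go in jsonb,
-- tags live in a text array, and comments get their own table.
CREATE TABLE pages (
    id         bigserial PRIMARY KEY,
    title      text NOT NULL,
    attributes jsonb NOT NULL DEFAULT '{}',  -- fields that may change over time
    tags       text[] NOT NULL DEFAULT '{}'
);

CREATE TABLE comments (
    id      bigserial PRIMARY KEY,
    page_id bigint NOT NULL REFERENCES pages (id),
    author  text NOT NULL,
    body    text NOT NULL
);

-- GIN indexes make tag lookups and jsonb containment queries fast.
CREATE INDEX pages_tags_idx ON pages USING gin (tags);
CREATE INDEX pages_attributes_idx ON pages USING gin (attributes);

-- All pages carrying a given tag:
SELECT id, title FROM pages WHERE tags @> ARRAY['business'];

-- A page plus its comments via an ordinary join:
SELECT p.title, c.author, c.body
FROM pages p
JOIN comments c ON c.page_id = p.id
WHERE p.id = 1;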

Retrieve and interpret large amounts of data from a SQL Server database

I'm an Electrical Test Engineer, with programming experience in C, mostly for devices with 256 B of RAM or less, and not a lot of experience with SQL databases...
We got a database with production data, serial numbers and testing results.
When the database was created, no tools were built to retrieve the data.
If we can't retrieve the data, the database may as well not exist.
We have the data, the database exists. I want to create tools to retrieve and interpret data. And in the future do statistical analysis on the data.
The database has over 500k unique devices. With over 10 million measurements.
My question is: what's the most sensical way to retrieve and display the data?
For instance: a program that loops through every entry and records the data would be complicated to write and would take days to complete.
The program and queries get complicated very fast.
We have Device types, Batch numbers, Serial numbers.
For every DISTINCT (DeviceType)
For Every DISTINCT (Batch number)
COUNT DISTINCT (Serial number) where...
User <> 'development'...
AND Testing result <> 'FAIL'...
AND Date between ... and ...
Not to mention the measurement data, as each device may be tested multiple times. It seemed a trivial task; I'm now overwhelmed by the complexity.
I will create the code and queries myself. What I'm asking for is help finding a strategy.
Ask yourself what questions you want to answer by recourse to the data. If the data is recorded in the most granular way possible, then it may be appropriate to consider common grouping or aggregation methods. These might include grouping by device, or location, or something else - each of these dimensions will have a business interpretation.
Writing down 3 top business requests should give you a starting point for constructing your extract/analysis strategy.
Next, try and draw together a data model, find out what tables exist, what references and relations they have to one another.
Together, between the questions you want to answer and the table-design, you should then be in a position to start constructing sensible general use queries.
Sometimes you may find different business questions can be answered using a common view of the data - when you're happy with an extract path, you can write this out using a common query language called SQL and - if appropriate, create a VIEW using that language. This abstracts the problem and makes it more convenient for users to actually get the answers they're looking for.
Your database will provide tools to write and run SQL statements, and you will need to refer to the documentation for your database to figure out how that happens - it's usually similar, but implementations differ across databases.
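For example, the grouped count sketched in the question above typically collapses into a single GROUP BY statement. The table and column names below are assumptions for illustration only, not the real schema:

-- Distinct passing serial numbers per device type and batch,
-- excluding development users and limited to a date range.
SELECT
    t.DeviceType,
    t.BatchNumber,
    COUNT(DISTINCT t.SerialNumber) AS PassingDevices
FROM dbo.TestResults AS t
WHERE t.[User] <> 'development'
  AND t.TestingResult <> 'FAIL'
  AND t.TestDate BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY t.DeviceType, t.BatchNumber
ORDER BY t.DeviceType, t.BatchNumber;

Once a query like this is stable, it can be wrapped in a VIEW as described above, so that users do not have to rewrite the aggregation each time.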

Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags, e.g. "difficult", "easy", "red", "blue", "business", etc.
These available tags are stored in a table, call it "tags" that contains all available tags.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
---+-------+--------+-------------
1 | title | TG | "[1,4,7,12]"
though I'm sure this is a bad idea for a number of reasons, is there ever a reasonable reason to do the above?
I think you should read about Database normalization and decide for yourself. In short though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row 1? You have to parse the list, check whether the tag is already present, then update the row to append the new tag.
Worse still, deleting a tag: search the list, check the tag is present, then re-create the whole tags value.
What if a tag is to change name - some moderation process, perhaps?
Worse again, what about different people specifying a tag name differently - it'd be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: the client has to parse the tags field in order to use it. And what about the separator character? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you describe, IMO, is that you're capturing the data as a one-off and it's informational only - that is, it makes sense to a user but not to the system per se. Which is another way of saying it's probably best avoided (again, IMO).
It seems to me that you want a separate junction table holding foreign keys that relate the tag records back to their parent records in the articles table (this is referred to as "normalizing" the database structure).
Cramming the tags into one field as you have suggested may seem to make sense now, but it will prove difficult to maintain and difficult/time-consuming to pull the values out efficiently as your application grows in size or the amount of data grows much larger.
I would say that there are very few reasons to do what you have suggested, given how straightforward it is to create another table and set up a relationship linking keys between the two tables to maintain referential integrity.
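A minimal sketch of that normalized layout, with illustrative names (your real column set will differ):

CREATE TABLE articles (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    author TEXT NOT NULL
);

CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE article_tags (
    article_id INTEGER NOT NULL REFERENCES articles (id),
    tag_id     INTEGER NOT NULL REFERENCES tags (id),
    PRIMARY KEY (article_id, tag_id)  -- a tag applies at most once per article
);

-- Adding or removing a tag is a single INSERT or DELETE, no string parsing:
INSERT INTO article_tags (article_id, tag_id) VALUES (1, 4);
DELETE FROM article_tags WHERE article_id = 1 AND tag_id = 12;

-- Querying by tag stays a plain join:
SELECT a.id, a.title
FROM articles a
JOIN article_tags atg ON atg.article_id = a.id
JOIN tags t           ON t.id = atg.tag_id
WHERE t.name = 'business';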
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: the reason I agree is that I like to use the Azure Search API to index these types of data, so the query to do a lookup based on tags is not done via SQL. (Using the Azure Search service is not strictly necessary, but in my experience you will get much better performance and scalability by using a search index that lives outside of the database.)
If your primary query language will be SQL (relational queries), then you are better off creating a child table that has a row for each tag; otherwise you will wear a performance hit when your query has to perform logic on each value to split it for analysis.
Tagging is a concept we use to get around relational data or hierarchical mapping, so to get the best performance do not try to use relational concepts to query the tags. Tagging is often best implemented in NoSQL data stores because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string and use an external indexing service to provide search and insights into your data. This is a good trade-off between keeping CRUD data access fast and the effort of managing data and indexes optimised for searching. Sure, you can optimise the DB and the search queries to make it work in SQL, but it can take effort to get it right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance you will find that external indexing is an awesome investment in your time now, to save you time and resources later.

Oracle views and network traffic

I can't really understand what this line from the Oracle E-Business Suite Developer's Guide means. When using views, "network traffic is minimized because all foreign keys are denormalized on the server."
Could anyone please throw some light on when the query associated with a view is parsed?
Is the query associated with a view already parsed?
I can't find the answers. Please provide any Oracle documentation links that would be helpful.
The quote is talking about E-Business Suite, and specifically, how to build EBS (i.e. Forms) applications in a performant fashion. The pertinent context is this:
"In general, complex blocks are based on views while simple setup blocks are based on tables."
Take this scenario: we have a table with many columns, including three which are foreign keys to lookup tables. We want to display data from this table in a Form. To be user-friendly, our Form needs to show the meanings from the lookup tables, not the codes from the main table. It is more efficient to execute a single query joining to the reference tables than to issue four separate queries, because of network traffic among other considerations.
So we should build the Form's data block on a view which joins all four tables, rather than building it just on the main table and using Post-Query triggers to issue three separate queries which retrieve the codes' descriptions. This is especially relevant with multi-row blocks: we definitely want to avoid issuing several queries for each record returned.
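A sketch of the kind of view meant here, with made-up table and column names rather than anything from EBS:

-- One view joins the base table to its three lookup tables, so the Form's
-- block issues a single query per fetch instead of one query plus three
-- Post-Query lookups per record.
CREATE OR REPLACE VIEW orders_v AS
SELECT o.order_id,
       o.order_date,
       o.status_code,
       st.meaning AS status_meaning,
       o.ship_method_code,
       sm.meaning AS ship_method_meaning,
       o.priority_code,
       pr.meaning AS priority_meaning
FROM   orders o
JOIN   status_lookup st      ON st.code = o.status_code
JOIN   ship_method_lookup sm ON sm.code = o.ship_method_code
JOIN   priority_lookup pr    ON pr.code = o.priority_code;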
Although the context for the quote is Oracle Forms, the point is pertinent to most data retrieval applications. That said, I suspect these days using a ref cursor to pass a result set is a more popular solution than using views.
tl;dr
It is a statement about application design, not database optimization.

Database model refactoring for combining multiple tables of heterogeneous data in SQL Server?

I took over the task of re-developing a database of scientific data that is used by a web interface. The original author had taken a 'table-per-dataset' approach, which didn't scale well and is now fairly difficult to manage, with more than 200 tables having been created. I've spent quite a bit of time trying to figure out how to wrangle the thing, but the datasets contain heterogeneous values, so it is not reasonably possible to combine them into one table with a fixed schema of column definitions.
I've explored the possibility of EAV, XML columns, and ended up attempting to go with a table with many sparse columns since the database is running on SQL Server 2008. The DBAs are having some issues with my recently created sparse columns causing some havoc with their backup scripts, so I'm left wondering again if there isn't a better way to do this. I know that EAV does not lead to decent performance, and my experiments with XML data types also demonstrated poor performance, probably thanks to the large number of records in some of the tables.
Here's the summary:
Around 200 tables, most of which have a few columns containing floats and small strings
Some tables have as many as 15,000 records
Table schemas are not consistent, as the columns depended on the number of samples in the original experimental data.
SQL Server 2008
I'll be treating most of this data as legacy in the new version I'm developing, but I still need to be able to display it and query it- and I'd rather not have to do so by dynamically specifying the table name in my stored procedures as it would be with the current multi-table approach. Any suggestions?
I would suggest that the first step is to rationalise the data through views: attempt to consolidate similar data sets into logical pools.
You could then look at refactoring the code to read from the views, and see whether the web platform operates effectively. From there you could decide whether or not the view structure is beneficial and, if so, look at physically rationalising the data into new tables.
The benefit of using views in this manner is that you should be able to squeeze a little performance out of indexes on the views, and it should also give you a better handle on the data (that said, since you are developing the new version, I'd assume you are perfectly capable of understanding the problem domain).
With 200 tables as simple raw data sets, and considering you believe your version will be taking over, I would go through the prototype exercise of seeing whether you can write the views so that they carry the names your final tables will have in V2; that way you can also back-test whether your new database structure is in fact going to work.
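As a hypothetical sketch (the dataset table and column names below are invented), a consolidation view over two similar datasets might look like this:

-- Fold per-dataset tables with similar shapes into one view named after
-- the intended V2 table; pad columns a dataset lacks with NULL.
CREATE VIEW dbo.Measurements AS
SELECT 'dataset_001' AS DatasetName,
       SampleId,
       MeasuredValue,
       CAST(NULL AS float) AS Temperature  -- column missing in this dataset
FROM   dbo.Dataset001
UNION ALL
SELECT 'dataset_002',
       SampleId,
       MeasuredValue,
       Temperature
FROM   dbo.Dataset002;

The web code can then query dbo.Measurements, and the view can later be swapped out for a real consolidated table without changing the callers.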
Finally, a word to the wise: when someone has built a database in the way you've described - and without looking at the data or really knowing the problem set, I can only guess - they usually did it for a reason. Either it was bad design, or there was a cause for what now appears on the surface to be bad design; you raise consistency as an issue - look to wrap the data and see how consistent you can make it.
Good luck!