I am a newbie when it comes to the effect of database design on performance. I am creating an app which uses Hibernate to query the DB. I am trying to create a DB schema for the application, and I'm confused between using CLOBs and separate tables for some of the data. I have the following tables:
Person(id, name)
Address(person_id,address(varchar2))
products_bought(person_id, product_name(varchar2))
places_visited(person_id, place_name(varchar2))
..few more tables
I am sure I will need to write/read the data to/from all these tables every time. I'm thinking of designing them this way to reduce the number of tables, thereby reducing the number of joins I need to make, and making it easy for Hibernate to fetch the info in one go:
Person(id, name, products_bought(CLOB), places_visited(CLOB))
Address(person_id,address(varchar2))
..few more tables
Now I have come across many posts online arguing that performance will take a hit when using CLOBs. The same goes for JOINs. How do I decide which is better, given that the person table will not have more than 10K rows?
Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) which have a one-to-one relationship with a user.
Since SQL databases don't store complex structs in table fields (some allow JSON fields with an undefined format), is it OK to just add such tables to store individual (complex) data for each user? Will the extra joins hurt query performance?
And in the distributed-database case, will those (connected) tables be stored randomly on different nodes, causing redundant requests between them and decreasing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a,b,c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM, it'll just need a little extra code.
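A rough sketch of the wide-table idea, using SQLite via Python purely for illustration (the table and column names here are made up, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Wide table": settings live as columns on users rather than in a
# separate UserSettings table joined 1:1.
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        setting_theme TEXT DEFAULT 'light',
        setting_notifications INTEGER DEFAULT 1
    )
""")
conn.execute("INSERT INTO users (name, setting_theme) VALUES ('alice', 'dark')")

# A single-row fetch returns both the user and its "settings object";
# the ORM (or a little mapping code) can split them apart.
row = conn.execute(
    "SELECT name, setting_theme, setting_notifications FROM users"
).fetchone()
user = {"name": row[0]}
settings = {"theme": row[1], "notifications": bool(row[2])}
print(user, settings)  # {'name': 'alice'} {'theme': 'dark', 'notifications': True}
```

The two dicts behave like separate objects in application code, but reading them costs one row lookup and no join.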
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
The main downside of using JSON columns is that they're harder to query efficiently. If you want all users with a certain setting, or want to read just one setting for a user, you'll take at least a minor performance hit: either the database has to parse a JSON column to figure that out, or you have to fetch the entire blob and parse it in your app. If JSON columns are more convenient for other reasons, though, you can work around this by adding inverted indexes on them, or expression indexes on the specific values you're interested in. Indexes can have a similar cost to 1:1 joins, but you can mitigate that in CockroachDB by using the STORING keyword to tell the DB to write a copy of all the user columns to the index.
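An illustration of the expression-index workaround, sketched in SQLite via Python (assuming an SQLite build with the JSON functions; CockroachDB's syntax for inverted and expression indexes differs, and the names below are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, settings TEXT)")
conn.execute("""INSERT INTO users (settings) VALUES
    ('{"theme": "dark", "beta": true}'),
    ('{"theme": "light", "beta": false}')""")

# Expression index on one specific JSON value, so "all users with this
# setting" doesn't require parsing every row's JSON blob at query time.
conn.execute("CREATE INDEX idx_theme ON users (json_extract(settings, '$.theme'))")

rows = conn.execute(
    "SELECT id FROM users WHERE json_extract(settings, '$.theme') = 'dark'"
).fetchall()
print(rows)  # [(1,)]
```

The trade-off stated above still applies: the index must be written on every update to the column, much like maintaining a 1:1 side table.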
This looks like a standard task: syncing records from an SQL server (the primary data source) to NoSQL (Elasticsearch) to support the fast, advanced search functionality that NoSQL databases provide.
There is already a standard solution for this using Logstash for Elasticsearch. But the main challenge here is converting the normalized data stored in SQL to denormalized data in Elasticsearch.
For example, SQL might have the following normalized tables for a Person entity:
Person
PersonAddress
PersonJob
PersonContact
PersonSalary
But when we store this in Elasticsearch, we tend to create a single denormalized entity called Person. I see the following challenges:
An SQL query to convert multiple rows from the normalized tables into a single denormalized entity. We could write a complex query to join all the tables, but I am looking for any standard approach used to solve this problem. Is there any built-in support in SQL?
The update time for the entity. Each row in the SQL tables has its own update time, but a change in any table for this entity should count as an update to the entity. Even the query for finding changes across all the normalized tables is complex.
Any references/ideas to solve the above problems will be appreciated. Thanks in advance.
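One common way to approach both problems is to let the database aggregate child rows into a JSON document and take the maximum update time across all of an entity's tables: PostgreSQL has json_agg, and SQL Server has FOR JSON. A minimal sketch using SQLite's json_group_array via Python (the columns and sample data are invented; only the table names come from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    CREATE TABLE PersonAddress (person_id INTEGER, city TEXT, updated_at TEXT);
    INSERT INTO Person VALUES (1, 'Ann', '2023-01-01');
    INSERT INTO PersonAddress VALUES (1, 'Oslo', '2023-03-01'), (1, 'Bergen', '2023-02-01');
""")

# One denormalized document per person: child rows are collapsed into a
# JSON array, and the entity's "update time" is the max over all tables.
row = conn.execute("""
    SELECT p.name,
           (SELECT json_group_array(city)
              FROM PersonAddress WHERE person_id = p.id) AS addresses,
           max(p.updated_at,
               (SELECT max(updated_at)
                  FROM PersonAddress WHERE person_id = p.id)) AS entity_updated_at
    FROM Person p
    WHERE p.id = 1
""").fetchone()
print(row)
```

The same shape extends to PersonJob, PersonContact, and PersonSalary with one correlated subquery each; the resulting rows are what you would feed to Logstash, using entity_updated_at as the incremental-sync watermark.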
I am developing an MSSQL DB for stores that are in different cities. Is it better to have a table for each city, or to house them all in one table? I also don't want users from one city accessing data from cities that are not theirs.
SQL is designed to handle large tables, really big tables. It is not designed to handle a zillion little tables. The clear answer to your question is that all examples of a particular entity should go in a single table. There are lots of good reasons for this:
You want to be able to write a query that will return data about any city or all cities. This is easy if the data is in one table; hard if the data is in multiple tables.
You want to optimize your queries by choosing correct indexes and data types and collecting statistics and defragging indexes and so on. Why multiply the work by multiplying the number of tables?
Foreign key relationships should be properly declared. You cannot do that if the foreign key could be to multiple tables.
Lots of small tables result in lots of partially filled data pages, which just makes the database bigger and slows it down.
I could go on. But you probably get the idea by now that one table per entity is the right way to go (at least under most circumstances).
Your issue of limiting users to see data only in one city can be handled in a variety of ways. Probably the most common is simply to use views.
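For example, a per-city view (sketched here in SQLite via Python with made-up names; in MSSQL you would additionally GRANT SELECT on each view to the appropriate users, or look at row-level security):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (id INTEGER PRIMARY KEY, city TEXT, revenue INTEGER);
    INSERT INTO stores (city, revenue) VALUES
        ('Boston', 100), ('Denver', 200), ('Boston', 50);

    -- One view per city: users query the view, never the base table,
    -- so they only ever see their own city's rows.
    CREATE VIEW boston_stores AS
        SELECT * FROM stores WHERE city = 'Boston';
""")

rows = conn.execute(
    "SELECT revenue FROM boston_stores ORDER BY revenue"
).fetchall()
print(rows)  # [(50,), (100,)]
```

All the data still lives in one table, so queries across all cities, indexing, and foreign keys keep working as described above.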
In the relational model, why do we not keep all our data in a single table? Why do we need to create multiple tables?
It depends on what your purpose is. For many analytic purposes, a single table is the simplest method.
However, relational databases are really designed to maintain data integrity. One aspect of data integrity is that any given item of data is stored in only one place. For instance, a customer name is stored in the customer table, so it does not need to be repeated in the orders table.
This ensures that the customer name is always correct, because it is stored in one place.
In addition, cramming everything into a single table often means duplicating data, and that would make the table much larger than needed and hence slow everything down.
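A minimal sketch of the customer-name point (SQLite via Python; the names are illustrative): the name lives in exactly one place, so correcting it is a single-row update that every order immediately reflects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The customer name is stored in exactly one place...
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    -- ...and orders reference it by id instead of repeating it.
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount INTEGER
    );
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO orders (customer_id, amount) VALUES (1, 10), (1, 20);
""")

# Renaming the customer touches one row; every order "sees" the change.
conn.execute("UPDATE customers SET name = 'Acme Inc' WHERE id = 1")
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON c.id = o.customer_id
    ORDER BY o.amount
""").fetchall()
print(rows)  # [('Acme Inc', 10), ('Acme Inc', 20)]
```

In a single flat table, that update would have to find and rewrite the name on every order row, with the risk of missing some.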
I am not an expert, but I think this would consume a lot of time and resources if there is a lot of data in the table. If we separate the data into multiple tables, we can make operations a lot easier.
Taken from https://www.bbc.co.uk/bitesize/guides/zvq634j/revision/1
A single flat-file table is useful for recording a limited amount of
data. But a large flat-file database can be inefficient as it takes up
more space and memory than a relational database. It also requires new
data to be added every time you enter a new record, whereas a
relational database does not. Finally, data redundancy – where data is
partially duplicated across records – can occur in flat-file tables,
and can more easily be avoided in relational databases.
Therefore, if you have a large set of data about many different
entities, it is more efficient to create separate tables and connect
them with relationships.
I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
I think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, go with one table; as other users have rightly pointed out, databases can handle millions of rows without a problem. Well, it depends on the data that is in them: row size can make a big difference, for example if you store VARCHAR(MAX) or VARBINARY(MAX) columns, and several of them per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier against a single table, and maintenance is easier too from an archival perspective.
But if you rarely access the old data and you need performance on the primary table, some sort of archive table might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer and will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b) but..
with new column describing the time period (e.g. 04.2016, 05.2016, ...)
Please don't. Querying the different periods will become an unnecessary pain. Just put the date in one column, put an index on that column, and you will probably be able to execute fast queries on it.
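A quick sketch of that layout, using SQLite via Python here rather than PostgreSQL, purely for illustration (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("2016-04-15", 1.0), ("2016-05-03", 2.0), ("2016-05-20", 3.0)],
)

# A real date column plus an index: selecting a period becomes a plain
# indexed range scan instead of string games on '04.2016'-style labels.
conn.execute("CREATE INDEX idx_ts ON measurements (ts)")

may = conn.execute("""
    SELECT value FROM measurements
    WHERE ts >= '2016-05-01' AND ts < '2016-06-01'
    ORDER BY ts
""").fetchall()
print(may)  # [(2.0,), (3.0,)]
```

Loading a new month is then just another batch of inserts into the same table, with no schema change.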
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write, or for the database to execute? An example would help us get a picture of your actual requirements.