where is the best place to store big data schema - sql

The question title may seem irrelevant, but I couldn't find one that better fits my intent.
I am working on an e-commerce project and using a relational database for storage, but I have a performance issue with my product domain queries. What we want to accomplish is fairly complex, so the database schema we designed is fairly complex as well. We have over 40 tables for the product domain, and we are having performance problems when querying the database to render our product pages. To show even simple information about a product we need to join at least 5-7 tables, and this imposes a huge penalty on the website, since we serve hundreds of thousands of requests per day.
By the way, our data is not big. The problem is that we have many tables containing little data, and we have to join many of them at a time.
Is there a better way or place to store such data? I have looked at NoSQL databases, but as I understand it they might not be the best fit for my situation. Could a graph database like Neo4j be helpful in my case?
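To give a rough idea of the shape of the problem (table and column names below are simplified placeholders, not our real schema), a typical product-page read looks something like the first query; a single denormalized read table, refreshed whenever products change, is the kind of thing we are wondering whether we should move to:

```sql
-- Simplified illustration only; the real schema has 40+ product tables.
-- Typical product-page read today: 5-7 joins per product.
SELECT p.id, p.name, b.name AS brand, c.name AS category,
       pr.amount AS price, s.quantity AS stock, m.url AS image_url
FROM product p
JOIN brand b          ON b.id = p.brand_id
JOIN category c       ON c.id = p.category_id
JOIN product_price pr ON pr.product_id = p.id
JOIN product_stock s  ON s.product_id = p.id
JOIN product_media m  ON m.product_id = p.id AND m.is_primary = 1
WHERE p.id = ?;

-- One option: a denormalized read table so the product page becomes a
-- single-table lookup, kept in sync when the underlying tables change.
CREATE TABLE product_summary (
    product_id INT PRIMARY KEY,
    name       VARCHAR(255),
    brand      VARCHAR(255),
    category   VARCHAR(255),
    price      DECIMAL(10,2),
    stock      INT,
    image_url  VARCHAR(512)
);
```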
Thanks for any help...

Related

In MongoDB, if my queries do not involve any joins, can I assume that it will scale?

I have an app that will be demanding in terms of pulling data: each time a user logs in, data is pulled; each time a new page is visited, data is pulled; and so on.
Let's suppose that these queries will never involve joins.
Can I assume then that the queries will scale?
No, it does not follow that using MongoDB and not using joins means "your queries will scale." That's a myth told by MongoDB marketing, not real software engineering.
It depends on what your query is doing. Every query has a cost, no matter what brand of datastore you use. Every data access needs to use resources on the server, and that resource usage adds up. Do your queries scan thousands or millions of documents in the MongoDB datastore? Do they need to do map-reduce? How many documents are in the query response? Are they pulling data that is cached, or will it cost I/O overhead to pull that data? How many requests per second do you need to serve? Can MongoDB support the rate of queries you need to do? Are you configuring a MongoDB replica set or a sharded cluster? How many shards do your queries need to visit to get their result? How powerful are the servers hosting each node?
These are some examples of the types of questions you need to understand and analyze for your queries and your MongoDB cluster (the list is not complete).
You don't need to give me the answers to these questions. I'm just using them to illustrate why it's a naive question to ask "will it scale?"
It's like asking "I need to drive my car to my brother's house; will I have to refill my fuel tank?" That's not enough information to answer the question. How far away is your brother's house? What type of vehicle do you have? What is its fuel efficiency? Is your vehicle laden with a lot of heavy cargo? How many times do you need to make the trip? How fast are you driving? How rough are the roads on the route?
There are probably many things to consider depending on your needs, but I think the main difference comes from the document data model (which MongoDB is built to support and scale on):
Document => more related data in one place:
- fewer joins (expensive, especially if the data live on different machines)
- fewer transactions (single-document updates are atomic)
- a simpler, smaller schema, more tailored to your application
- a data model similar to the way programmers keep their data in objects (maps) and arrays
If you have many applications, or too many different ways of accessing the same data, you may end up normalizing your data further into a more general representation, losing some of the above benefits, or duplicating some of your data to serve the different needs.

To normalize or not normalize? What performs better? [duplicate]

Would multiple, joined, normalized tables return queries faster than one denormalized table? I'm interested in the performance of read (SELECT) statements, not insert, delete, or update.
I believe that normalized, joined tables return SELECT queries faster, but I've also heard that, since all of the data sits in one row, a single denormalized table returns queries faster.
I'm trying to find this out so I can improve visualization rendering in Tableau, so I'm concerned with read operations on the table, not writes.
Any clearing up on this confusion would be appreciated.
If you are dealing with a static data warehouse, sometimes it IS better to deal with denormalized data, especially for any type of aggregations / roll-up values you may be interested in within the data. Having pre-summarized tables on very large datasets is good, but without knowing more of the context of your data, that is the best I can offer as an answer.
To clarify from your comment...
Let's say you are dealing with (for example, something I worked with in the past) government contract and grants data for the years 2010-2012. The data itself is not going to change: who was awarded, government sector, small/large business classification, amount awarded, etc. Since these values won't really change, if you want to know which companies were awarded how much per congressional district, per state, per industry, and so on, having pre-aggregated totals would save time.
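As a rough sketch of what I mean by pre-summarized tables (table and column names are made up for illustration, and the CREATE TABLE ... AS SELECT syntax varies by database), you would build something like this once and let the reports hit it instead of re-aggregating the detail rows every time:

```sql
-- Pre-aggregate the totals people actually ask for from the static detail data.
CREATE TABLE award_totals_by_geo_industry AS
SELECT award_year,
       state,
       congressional_district,
       industry,
       business_size,            -- small / large classification
       COUNT(*)    AS award_count,
       SUM(amount) AS total_awarded
FROM contract_awards
GROUP BY award_year, state, congressional_district, industry, business_size;

-- Reports then read the small summary table instead of millions of detail rows:
SELECT state, SUM(total_awarded) AS total_awarded
FROM award_totals_by_geo_industry
WHERE award_year = 2011
GROUP BY state;
```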
If you have a read-only display system (querying only) fed from another system that performs the data entry (such as sales activity that DOES the inserts/updates/deletes), you should obviously stay in normalized form, since the underlying data IS changing, even though you are only giving read-only inquiry access to it.
It should be pretty obvious that the fastest way to get a query result is if it has already been pre-built and is sitting ready for retrieval in a single table.
However, from a maintenance perspective that is not practical.
It is generally good advice to keep most data in normalized tables, but see DRapp's answer for scenarios where denormalization is sometimes used.
That's very dependent on the situation, as others have pointed out. The best thing you can do if you need top-notch performance is generate some tests to see how things work out and then implement the fastest solution. Create one set of denormalized tables, one set of normalized, and run some queries and see how fast they execute. Go from there.
However, unless you have TONS of data, speed is probably not your biggest concern. Modern RDBMSs are extremely efficient, especially with the appropriate indexes, etc. in place. You might be better off asking whether normalized or denormalized tables make more logical sense for the work you are doing. You might also consider that one of the biggest arguments for normalized tables is that they help prevent data errors. Consider doing some background reading on normalization for an explanation of this. If you want to make sure your data is as clean as possible, you may want to normalize, even if you take a small performance hit.
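To illustrate the data-error point with a quick hypothetical example: in a normalized design the database itself can enforce facts that a single wide denormalized row cannot.

```sql
-- Normalized: each customer's city is stored exactly once,
-- so it cannot disagree with itself across rows.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    city        VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customer (customer_id),
    order_date  DATE NOT NULL
);

-- Denormalized equivalent: nothing stops two rows from recording two
-- different cities for the same customer.
-- CREATE TABLE orders_flat (order_id INT, customer_name VARCHAR(100),
--                           customer_city VARCHAR(100), order_date DATE);
```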

How to handle big database?

I have a database with over a million users, and each user has an enormous amount of data stored.
Needless to say, the performance has decayed.
(Each user has its own website and CMS)
How do I handle the database for many users?
I heard of an idea of storing each user's information in its own database, instead of in shared tables with foreign keys.
What are your thoughts of this idea? What are the advantages and disadvantages?
What other ways should I be considering?
One million users with referenced data is not big data.
If the performance is bad then you might have a look at your SQL code or front-end code.
Also use indexes to reduce query execution time. Most of the time, indexes and optimization of the code are the trick. A lot of other things also play a big role, such as your CPU, memory, disk, etc.
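For example (hypothetical table and column names), an index on the columns your slow queries filter on is usually the first thing to try, and the execution plan will tell you whether it is actually being used:

```sql
-- Suppose the slow pages filter a user's content by user and date:
CREATE INDEX idx_content_user_date
    ON user_content (user_id, created_at);

-- Check the plan to confirm the index is used
-- (EXPLAIN in MySQL/PostgreSQL; other databases have equivalents):
EXPLAIN
SELECT *
FROM user_content
WHERE user_id = 12345
  AND created_at >= '2015-01-01';
```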
I would first have a look at the code and see if you can optimize anything, and only if that doesn't help, separate the data into multiple databases.
Even if you do this, you might still have performance problems if the databases are hosted on the same server.
Good luck!

Database Normalisation and Searching it Quickly

I'm working on the technical architecture for a content solution integration. The data from the solution provider runs to millions of rows and is normalised to 3NF. It is updated on a regular schedule (daily, most likely) and is split down to a very granular level of atomicity.
I need to search and query this data, and my current inclination is to leave the normalised data alone and create a denormalised database from it (OLTP to OLAP). The 'transfer' can be a custom-built program that contains the necessary business logic in addition to the raw copying, and can be run on a set schedule as required. The denormalised database would then reduce the atomicity and allow the keyword searches and queries to run efficiently. I was looking at using Lucene.NET for the keyword work on the denormalised database.
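As a rough sketch of what I mean by the 'transfer' (table names here are made up; the real feed is far more granular), the custom program would essentially do something like this on the daily schedule:

```sql
-- Flatten the granular normalised feed into one wide, search-friendly table.
TRUNCATE TABLE content_search;

INSERT INTO content_search (content_id, title, body, author, category, published_on)
SELECT c.content_id,
       c.title,
       c.body,
       a.display_name,
       cat.name,
       c.published_on
FROM content c
JOIN author a     ON a.author_id = c.author_id
JOIN category cat ON cat.category_id = c.category_id;

-- Lucene.NET (or similar) would then index content_search for the keyword work.
```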
So before I sing loudly from the hills that this is the way forward, I wanted some expert opinion on this and on what the perceived "best practise" is. Is the method I have suggested the best way forward, considering the data I will be provided with? It was suggested that perhaps I could use a 'search engine' to search the normalised data. This scared the hell out of me, but raised the question: what search engine, and how?
Opinions, flames, bad language and help appreciated :)
I have built reporting databases and data warehouses based on data stored in normalized form. There is quite a bit of work involved in the transfer program (ETL). Given your description of the data feed, maybe some of that work has been done for you by the feeder.
Millions of rows isn't a lot, these days. You may be able to get away with report oriented views into the existing database. Try it and see.
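For instance (hypothetical names; DATE_TRUNC is PostgreSQL syntax, so adjust for your DBMS), a report-oriented view costs almost nothing to set up and will tell you quickly whether the normalised tables are fast enough on their own:

```sql
-- A report-oriented view over the existing normalised tables; try it before
-- committing to a separate denormalised database.
CREATE VIEW v_items_per_category_month AS
SELECT cat.name                            AS category,
       DATE_TRUNC('month', c.published_on) AS month,
       COUNT(*)                            AS item_count
FROM content c
JOIN category cat ON cat.category_id = c.category_id
GROUP BY cat.name, DATE_TRUNC('month', c.published_on);
```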
The biggest benefit to building an OLAP oriented database is not speed. It's flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of manufacturing category? Bam! Done!" And so on.
A reasonably normalized model (3NF/BCNF) provides the best average performance and the least amount of modification anomalies for the largest number of scenarios. That's big, so I would start from there. As your requirements are fuzzy, it seems like the most sensible option.
Actually, the most sensible thing would be to go over the requirements until they are a bit more "crisp" ;)
Also, if you can get your hands on a few early extracts from your data provider, you can experiment with them and get a feeling for the data distributions (not all people live in one country, and some countries hold more people than others; not all people have children, and the number of children per person varies greatly by country). This is a major point, and it is crucial that the optimizer can make good decisions.
Other than that, I agree with everything Walter said and also gave him my vote.

Efficient Ad-hoc SQL OLAP Structure

Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes-galore...). However, my situation today is kind of in the middle of the 2, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating just to the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating just to the sites. Basically it's a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and the results need to come back as quickly as possible, because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; it doesn't change at all during the day, and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
I see three options: 1. break the tables into smaller divisible units and sub-query everything; 2. make one flat table and really go to town on the indexing; 3. create an OLAP cube and sub-query the rest based on whatever filter values I don't put in as cube dimensions. I have not done much with OLAP cubes, so I frankly don't even know if that is an option, but from what I've done with them in the past I think it might be. Also, just to clarify, what I mean when I say "sub-query everything" is that instead of having a WHERE clause on the outer SELECT, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product (see the sketch below). As for the second option of the one large table, I have heard and seen conflicting results with that approach, as it saves on joins but at the same time a table scan takes much longer.
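To make the "sub-query everything" idea concrete (the filter columns here are simplified examples, not my real fields), I mean something along these lines rather than one big WHERE clause on the outer SELECT:

```sql
-- Filter each table first, then INNER JOIN the already-reduced sets,
-- instead of joining everything and filtering at the end.
SELECT c.ContactID, c.LastName, s.SiteID, s.SiteName
FROM (SELECT ContactID, LastName
      FROM Contacts
      WHERE State = 'TX' AND IsActive = 1) AS c
INNER JOIN (SELECT ContactID, SiteID
            FROM SiteContacts) AS sc
        ON sc.ContactID = c.ContactID
INNER JOIN (SELECT SiteID, SiteName
            FROM Sites
            WHERE Region = 'South') AS s
        ON s.SiteID = sc.SiteID;
```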
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may simply be the association of sites with contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
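A minimal sketch of that layout (the names are illustrative, since I'm guessing at what your fact and attributes would be):

```sql
-- Dimension tables describe the things being analyzed.
CREATE TABLE DimContact (
    ContactKey  INT IDENTITY PRIMARY KEY,
    ContactName VARCHAR(100),
    City        VARCHAR(100),
    State       CHAR(2)
    -- ... plus the rest of the ~50 contact attributes
);

CREATE TABLE DimSite (
    SiteKey  INT IDENTITY PRIMARY KEY,
    SiteName VARCHAR(100),
    Region   VARCHAR(50)
    -- ... plus the rest of the ~150 site attributes
);

-- Fact table: here the "transaction" may simply be the association of a
-- contact with a site, with a date key if one exists.
CREATE TABLE FactSiteContact (
    ContactKey        INT NOT NULL REFERENCES DimContact (ContactKey),
    SiteKey           INT NOT NULL REFERENCES DimSite (SiteKey),
    AssociatedDateKey INT NULL
);
```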
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts behind a given aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (ie. circa 3 million records - most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.
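A short sketch of the factless-fact variant (names are illustrative): the fact rows carry no measures at all, they simply record that a given contact is linked to a given site, and most questions against it reduce to COUNT(*).

```sql
-- Factless fact table: no measures, just the dimension keys recording the link.
CREATE TABLE FactSiteContactLink (
    SiteKey    INT NOT NULL,   -- key into the Sites dimension
    ContactKey INT NOT NULL,   -- key into the Contacts dimension
    PRIMARY KEY (SiteKey, ContactKey)
);

-- Typical question: how many contact links does each site have?
SELECT SiteKey, COUNT(*) AS ContactLinks
FROM FactSiteContactLink
GROUP BY SiteKey;
```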