Closed. This question needs to be more focused. It is not currently accepting answers.
I often hear comments like: "we need to get the data from the cube". A quick Google search shows ways to create a cube, but no definition of what a cube is.
What do my coworkers mean by "extract the data from the cube"? Is a "cube" just a specific structure of tables?
What in the world is a "cube" in SQL Server?
This is a very broad question, but imagine a simple data set, perhaps census data. The figure you care about is the number of residents in the US.
Now, we can subdivide that number into states, list each state on a row, and show the respective count for each state. That's one 'dimension' of the data.
We could further interrogate this set by ethnicity. We'd list ethnicities in columns, changing the table into a crosstab, and the count of residents for each ethnicity/state would be listed in the intersection of the corresponding column and row.
Finally, if we wanted a third dimension, maybe religion, we'd need some third direction (not rows or columns) to list these categories. Our crosstab would become... a cube.
'Cube' is the shorthand name for a kind of database that has been specifically built to handle the various efficiency issues that come with analyzing datasets on many different dimensions -- slicing and aggregating those measures of data across the several available characteristics.
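As a rough illustration of the idea in plain T-SQL, the CUBE grouping option computes the measure for every combination of the listed dimensions. This is only a sketch against a hypothetical Census table, not how an SSAS cube is actually built or queried:

```sql
-- Hypothetical table: Census(state, ethnicity, religion, resident_count).
-- GROUP BY CUBE returns the resident count for every combination of the
-- three dimensions, including subtotals and the grand total.
SELECT state, ethnicity, religion, SUM(resident_count) AS residents
FROM dbo.Census
GROUP BY CUBE (state, ethnicity, religion);
```

A real SSAS cube goes much further, pre-computing and storing these kinds of aggregations so they can be sliced interactively, and it is queried with MDX rather than T-SQL.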
For Microsoft, the tool that builds cubes is called SSAS (SQL Server Analysis Services).
Going into more detail would take hours and hours. You will need to find some kind of tutorial resources and expect to make quite an investment of time if you want to learn the strengths and weaknesses of the various types of cubes and how to get information from them.
A cube is a dataset that can be built from a single table or from a query joining multiple tables. A cube holds pre-populated data that can be fetched in real time.
Closed. This question needs to be more focused. It is not currently accepting answers.
I'm building an app, and this is my first time working with databases. I went with MongoDB because originally I thought my data structure would be fitting for it. After more research, I've become a bit lost in all the possible ways I could structure my data, and which of those would be best for performance vs best for my DB type (currently MongoDB, but could change to PostgreSQL). Here are all my data structures and iterations:
Note: I understand that the "Payrolls" collection is somewhat redundant in the below example. It's just there as something to represent the data hierarchy in this hypothetical.
Original Data Structure
The structure here is consistent with what NoSQL is good at, quickly fetching everything in a single document. However, I intend for my employee object to hold lots of data, and I don't want that to encroach on the document size limit as a user continues to add employees and data to those employees, so I split them into a separate collection and tied them together using reference (object) IDs:
Second Data Structure
It wasn't long after that I wanted to be able to manipulate the clients, locations, departments, and employees all independent of one another, but still maintain their relationships, and I arrived at this iteration of my data structure:
Third and Current Data Structure
It was at this point that I began to realize I had been shifting away from the NoSQL philosophy. Now, instead of executing one query against one collection in a database (1st iteration), or executing one query with a follow-up population (2nd iteration), I was doing 4 queries in parallel when grabbing my data, despite all the data being tied to each other.
My Questions
Is my first data structure suitable to continue with MongoDB? If so, how do I compensate for the document size limit in the event the employees field grows too large?
Is my second data structure more suitable to continue with MongoDB? If so, how can I manipulate the fields independently? Do I create document schemas/models for each field and query them by model?
Is my third data structure still suitable for MongoDB, or should I consider a move to a relational database with this level of decentralized structure? Does this structure allow me any more freedom or ease of access to manipulate my data than the others?
Your question is a bit broad, but I will answer by saying that MongoDB should be able to handle your current data structure without too much trouble. The maximum size of a BSON Mongo document is 16MB (see the documentation). This is quite a lot of text, and it is unlikely that, e.g., a single employee would need 16MB of storage.
In the event that a single object needs to occupy more than the 16MB BSON maximum, you can use GridFS. GridFS uses special collections (files and chunks) which do not have any storage limit (other than the maximum database size). With GridFS, you can write objects of any size and MongoDB will accommodate the operations.
Closed. This question is opinion-based. It is not currently accepting answers.
I have multiple databases that sometimes interact with each other but are mostly independent. I now need to build a new application that allows users to search through the data of the other applications (a sort of search through their history).
So I'm going to need a dozen or so stored procedures/views that will access data from various databases.
Should I have each stored procedure/view on the database that is being queried? Or do I have a brand new database for this part of the application that gathers data from all other databases in views/SPs and just query that?
I think it should be the first option, but then where do I put the Login table that tracks user logins into this new report application? It doesn't belong in any other database. (Each database has its own login table; it's just the way it was set up.)
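To make the setup concrete, the kind of view in question might look roughly like this, whichever database it ends up living in; all database, table, and column names below are hypothetical.

```sql
-- A view that reads other databases on the same server via three-part names.
CREATE VIEW dbo.vCustomerOrderHistory
AS
SELECT c.CustomerName,
       o.OrderId,
       o.CreatedAt,
       o.Total
FROM SalesDb.dbo.Orders AS o
JOIN CrmDb.dbo.Customers AS c
    ON c.CustomerId = o.CustomerId;
```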
What you are asking here fits under the broad umbrella of business intelligence.
The problem you are going to hit quickly: reporting queries tend to be few in number but relatively resource-intensive (from a hardware point of view). If you will, low volume, high intensity.
The databases you are hitting are most likely high-transaction databases, i.e. they deal with a large number of smaller queries, either single (or batched) inserts or quick selects. If you will, high volume, low intensity.
Of course, these two models conflict heavily when you try to optimize them. Running a reporting query that joins multiple tables and runs for several minutes will often lock tables or consume resources that prevent (or severely inhibit) the database from performing its day-to-day job. If the system is configured for a high number of small transactions, then your reporting query simply isn't going to get the resources it requires, and the timelines on reporting results will be horribly long.
The answer here is a centralized data warehouse that collects the data from the several sources and brings it together so it can be reported on. It usually has three components: a centralized data model, an ETL platform to load that data model from the several data sources, and a reporting platform that interacts with this data. There are several third-party products (listed in the comments) that somewhat mimic the functionality of all three, or you can build each piece separately.
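For a rough sense of what the ETL piece does, a single load step might be no more than this minimal sketch (all database, table, and column names are hypothetical):

```sql
-- Copy rows added since the last load from one source application database
-- into a fact table in the central reporting database.
DECLARE @LastLoadTime DATETIME;
SET @LastLoadTime = ISNULL(
    (SELECT MAX(OrderDate) FROM ReportingDb.dbo.FactOrders), '19000101');

INSERT INTO ReportingDb.dbo.FactOrders (OrderId, CustomerId, OrderDate, Amount)
SELECT o.OrderId, o.CustomerId, o.OrderDate, o.Amount
FROM SalesApp.dbo.Orders AS o
WHERE o.OrderDate > @LastLoadTime;
```

A real ETL platform adds scheduling, transformation, and error handling on top of steps like this, but the core movement of data is that simple.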
There are a few scenarios (usually due to an abundance of resources or a lack of traffic) where reporting directly from the production data of multiple data sources works, but those scenarios are few and far between (and usually not in an actual production environment).
Closed. This question needs to be more focused. It is not currently accepting answers.
I have a program with a database of information about people that contains millions of records.
One of the tasks was to filter the results by birth date, then group them by city, and finally compare the population of each city with the given numbers.
I started to write everything as one SQL query, but then I began to wonder whether that might make the server too busy, and whether it would be better to do some of the calculations in the application itself.
I would like to know if there are any rules/recommendations for:
when to use the server to do the calculations?
when to use tools like LINQ in the application?
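For reference, the filtering and grouping part done entirely on the server might look roughly like this (table and column names are hypothetical); the comparison against the given numbers could then happen either in SQL or back in the application, which is exactly the trade-off being asked about.

```sql
-- Hypothetical schema: people(id, name, birth_date, city).
-- The server returns one small row per city instead of millions of person rows.
SELECT city, COUNT(*) AS person_count
FROM people
WHERE birth_date BETWEEN @from_date AND @to_date
GROUP BY city;
```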
For such requirements there is no fixed rule or strategy; it is driven by the application/business requirements. A couple of suggestions that may help:
Normally a SQL query does a good job of churning through lots of data to deliver a smaller result set after filtering/grouping/sorting. However, it needs correct table design and indexing to be optimized, and as the data size increases SQL may underperform.
Transferring data over the network, from the hosted database to the application, is what kills performance, since the network can be a big bottleneck, especially once the data grows beyond a certain size.
In-memory processing using LINQ to Objects can be very fast for repetitive calls that need to apply filters, sort data, and do further processing.
If the UI is a rich client, then you can afford to bring lots of data into memory and keep working on it using LINQ as part of in-memory data structures; if the UI is web-based, then you need to cache the data.
To get the same operations as SQL on in-memory data for multiple types, you need custom code, preferably using expression trees along with LINQ; for a known, fixed type, plain LINQ will do.
I have a similar design in one of my web applications; normally a combination works best in most practical scenarios.
Closed. This question needs to be more focused. It is not currently accepting answers.
Suppose I have an online store application that contains millions of items maintained by the application. The application is so popular that millions of items are sold each hour. I store all this information in a database, say Oracle DB.
Now if I want to show the top 5 items sold in the last hour, I can write a query along these lines:
Get the list of products that were sold in the last hour.
Count each product in the above result, order by that count, and display the top 5 records.
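In SQL, that description might translate to something roughly like the following (table and column names are hypothetical):

```sql
-- Hypothetical table: sales(item_id, sold_at).
SELECT item_id, sales_count
FROM (
    SELECT item_id, COUNT(*) AS sales_count
    FROM sales
    WHERE sold_at >= SYSDATE - INTERVAL '1' HOUR
    GROUP BY item_id
    ORDER BY sales_count DESC
)
WHERE ROWNUM <= 5;
```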
This seems like a workable query, but the problem is that if millions of items are sold each hour, then running it against the table that contains all the transactional information will definitely hit performance issues. How can we fix such issues? Is there another way of implementing this?
As a note, Amazon at its peak on Cyber Monday is selling a bit over a million items per hour. You must have access to an incredible data store.
Partitioning is definitely one solution, but it can be a little complicated. When you say "the last hour", that can cross a partition boundary. Not a big deal, but it would mean accessing multiple partitions for each query.
Even one million items an hour is just a few hundred items per second. This might give you enough leeway to add a trigger (or probably logic to an existing trigger) that would maintain a summary table of what you are looking for.
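As a very rough sketch of that idea (all table and column names are hypothetical, and the details would need care under real load):

```sql
-- Hypothetical tables: sales(item_id, sold_at) and
-- hourly_item_sales(item_id, sale_hour, sold_count).
CREATE OR REPLACE TRIGGER trg_sales_hourly_summary
AFTER INSERT ON sales
FOR EACH ROW
BEGIN
  MERGE INTO hourly_item_sales h
  USING (SELECT :NEW.item_id              AS item_id,
                TRUNC(:NEW.sold_at, 'HH') AS sale_hour
         FROM dual) n
    ON (h.item_id = n.item_id AND h.sale_hour = n.sale_hour)
  WHEN MATCHED THEN
    UPDATE SET h.sold_count = h.sold_count + 1
  WHEN NOT MATCHED THEN
    INSERT (item_id, sale_hour, sold_count)
    VALUES (n.item_id, n.sale_hour, 1);
END;
/
```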
I offer this as food-for-thought.
I doubt that you are actually querying the real operational system. My guess is that any environment handling even a dozen sales per second is not going to have such queries running on the operational system. The architecture is more likely a feed into a decision support system, and that gives you the leeway to maintain an additional summary table as data goes into the system. This is not a question of creating triggers on a load. It is, instead, a question of loading detailed data into one table and summary information into another, based on how the information is passed from the original operational system to the decision support system.
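In that kind of feed, the summary table can be maintained as part of each load rather than by triggers; a minimal sketch, with hypothetical staging and summary tables:

```sql
-- sales_staging holds the batch currently being fed into the decision
-- support system; hourly_item_sales is the summary table being maintained.
-- (In practice this would be a MERGE if one hour can span multiple batches.)
INSERT INTO hourly_item_sales (item_id, sale_hour, sold_count)
SELECT item_id, TRUNC(sold_at, 'HH'), COUNT(*)
FROM sales_staging
GROUP BY item_id, TRUNC(sold_at, 'HH');
```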
I think you should try partitioning.
E.g. you can split the data for each month/week/whatever into different partitions using range partitioning, and then for the last hour it is quite easy to run the query only against the specific, most recent partition. See partition-wise joins to learn more about it.
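A minimal sketch of what that could look like, assuming a hypothetical sales table and Oracle's interval (range) partitioning:

```sql
-- Each day's rows land in their own partition, so a query over the last hour
-- only has to touch the newest one or two partitions.
CREATE TABLE sales (
    item_id NUMBER NOT NULL,
    sold_at DATE   NOT NULL
)
PARTITION BY RANGE (sold_at)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_start VALUES LESS THAN (DATE '2015-01-01')
);
```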
Of course, you'll need to perform some specific implementation steps, but every war can require some sacrifice...
Closed. This question is opinion-based. It is not currently accepting answers.
I've been tasked with a project to collect server configuration metadata from Windows servers and store it in a DB for reporting purposes. I will be collecting data for over 100 configuration fields per server.
One of the things the client wants to be able to do is compare config data for either the same server at different points in time, or two different servers that have the same function (e.g. Exchange servers), to see whether there are any differences and what those differences may be.
As for the DB design, I would normally just normalize all of the data into an OLTP-type schema, where similar config items would be persisted to a table for their specific area (e.g. hardware info). But I'm thinking this may be a bad move and that I should be looking at some kind of OLAP-type data warehouse instead.
I'm just not sure which way to go with the DB design, so could do with some direction on this. Should I go with normalizing the data and creating lots of tables, or one massive table with no normalization and over 100 fields, or should I look into a star schema or something completely different (EAV)?
I am limited to using .Net and MSSQL server 2005.
Edit: The tool to collect and store the data will be run on an as required basis, rather than just grabbing the config data every day/week. Would be looking to keep the data for a couple of years at least.
A star schema is best for reporting purposes, in my experience. It is not necessary to use a star schema for storage, because the star schema can be a set of views (indexed for performance) that you design later. The storage model should be a set of event tables that record configuration changes. You can start from a flat log-file structure and normalize it iteratively to find good structures for storage and queries. A storage model is good if you can define model constraints; a reporting model is good if it supports fast ad-hoc queries. You should focus on the storage model, because the reporting model is a denormalization of the storage model and it is easier to denormalize later. EAV structures are useless for both models because you cannot define any constraints, yet the queries are complex anyway.
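As a minimal sketch of that storage side (all names hypothetical): one typed event table per configuration area, keyed by server and collection time, with the star-schema reporting views layered on afterwards.

```sql
-- Hypothetical event table for one configuration area (hardware info).
-- Each collection run appends one row per server; comparing a server with
-- itself at another point in time, or with another server, is a self-join.
CREATE TABLE dbo.HardwareConfigEvent (
    ServerName  NVARCHAR(128) NOT NULL,
    CollectedAt DATETIME      NOT NULL,
    CpuCount    INT           NOT NULL,
    TotalRamMb  INT           NOT NULL,
    DiskCount   INT           NOT NULL,
    OsVersion   NVARCHAR(64)  NOT NULL,
    CONSTRAINT PK_HardwareConfigEvent PRIMARY KEY (ServerName, CollectedAt)
);
```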