Getting top most sold items from millions of transactional data [closed] - sql

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Suppose I have an online store application that maintains millions of items. The application is so popular that millions of items are sold every hour. I store all this information in a database, say Oracle DB.
Now, if I want to show the top 5 items sold in the last hour, I can write a query that does something like this:
Get the list of products that were sold in the last hour.
Count the occurrences of each product in that result, order by the count, and display the top 5 records.
This seems to be a working query, but the problem is that with millions of items sold every hour, running it against the table that holds all the transactional information will definitely hit performance issues. How can we fix such issues? Is there any other way of implementing it?

As a note, Amazon at its peak on Cyber Monday is selling a bit over a million items per hour. You must have access to an incredible data store.
Partitioning is definitely one solution, but it can be a little complicated. When you say "the last hour" that can go over a partitioning boundary. Not a big deal, but it would mean accessing multiple partitions for each query.
Even one million items an hour is just a few hundred items per second (1,000,000 / 3,600 is about 278). This might give you enough leeway to add a trigger (or probably logic to an existing trigger) that would maintain a summary table of what you are looking for.
I offer this as food-for-thought.
I doubt that you are actually querying the real operational system. My guess is that any environment that is handling even a dozen sales per second is not going to have such queries running on the operational system. The architecture is more likely a feed into a decision support system. And that gives you the leeway to implement an additional summary table as data goes into the system. This is not a question of creating triggers on a load. It is, instead, a question of loading detailed data into one table and summary information into another table, based on how the information is being passed from the original operational system to the decision support system.
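To make the summary-table idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the real database. The table names (sales_detail, sales_summary), the one-minute bucket size, and the function names are all assumptions made for illustration and are not part of the question. Each load writes the detail row and bumps a per-minute counter, so the "top 5" query only has to read the small summary table.

```python
import sqlite3

BUCKET_SECONDS = 60  # assumed summary granularity: one row per product per minute


def create_schema(conn):
    """Create the hypothetical detail and summary tables."""
    conn.execute("CREATE TABLE sales_detail (product_id INTEGER, sold_at INTEGER)")
    conn.execute(
        "CREATE TABLE sales_summary ("
        " bucket_start INTEGER, product_id INTEGER, sold_count INTEGER,"
        " PRIMARY KEY (bucket_start, product_id))"
    )


def load_sale(conn, product_id, sold_at):
    """Store one sale (sold_at is epoch seconds) and update its summary bucket."""
    bucket_start = sold_at - sold_at % BUCKET_SECONDS
    with conn:  # both writes commit together or not at all
        conn.execute("INSERT INTO sales_detail VALUES (?, ?)", (product_id, sold_at))
        updated = conn.execute(
            "UPDATE sales_summary SET sold_count = sold_count + 1"
            " WHERE bucket_start = ? AND product_id = ?",
            (bucket_start, product_id),
        )
        if updated.rowcount == 0:  # first sale of this product in this bucket
            conn.execute(
                "INSERT INTO sales_summary VALUES (?, ?, 1)", (bucket_start, product_id)
            )


def top_sellers(conn, now, window_seconds=3600, limit=5):
    """Return (product_id, total) for the best sellers in the recent window."""
    rows = conn.execute(
        "SELECT product_id, SUM(sold_count) AS total FROM sales_summary"
        " WHERE bucket_start >= ? GROUP BY product_id"
        " ORDER BY total DESC, product_id LIMIT ?",
        (now - window_seconds, limit),
    )
    return rows.fetchall()
```

Because counts are kept per bucket, the window is only accurate to the bucket size; a bucket that started just before the cutoff is ignored in this sketch. Smaller buckets tighten the window at the cost of more summary rows.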

I think you should try the partitioning.
E.g., you can split the data by month, week, or whatever into different partitions, perhaps using range partitioning; then for the last hour it is quite easy to run the query against only the specific, most recent partition. See partition-wise joins to learn more about it.
Of course, you'll need to perform some specific implementation steps, but every war can require some sacrifice...
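One detail worth checking with any range scheme is which partitions a "last hour" window actually touches, since the window can straddle a boundary. The sketch below assumes daily partitions keyed by date purely for illustration; the function name is hypothetical and nothing here is Oracle-specific.

```python
from datetime import datetime, timedelta


def partitions_for_window(end, window=timedelta(hours=1)):
    """Return the daily partition keys (ISO dates) touched by a window ending at `end`."""
    start = end - window
    keys = []
    day = start.date()
    while day <= end.date():
        keys.append(day.isoformat())
        day += timedelta(days=1)
    return keys
```

For most of the day this returns a single key, so the query prunes down to one partition; only in the first hour after midnight does it return two.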

Related

NoSQL or SQL for Data Structure [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm building an app, and this is my first time working with databases. I went with MongoDB because originally I thought my data structure would be fitting for it. After more research, I've become a bit lost in all the possible ways I could structure my data, and which of those would be best for performance vs best for my DB type (currently MongoDB, but could change to PostgreSQL). Here are all my data structures and iterations:
Note: I understand that the "Payrolls" collection is somewhat redundant in the below example. It's just there as something to represent the data hierarchy in this hypothetical.
Original Data Structure
The structure here is consistent with what NoSQL is good at, quickly fetching everything in a single document. However, I intend for my employee object to hold lots of data, and I don't want that to encroach on the document size limit as a user continues to add employees and data to those employees, so I split them into a separate collection and tied them together using reference (object) IDs:
Second Data Structure
It wasn't long after that I wanted to be able to manipulate the clients, locations, departments, and employees all independent of one another, but still maintain their relationships, and I arrived at this iteration of my data structure:
Third and Current Data Structure
It was at this point that I began to realize I had been shifting away from the NoSQL philosophy. Instead of executing one query against one collection in a database (1st iteration), or executing one query with a follow-up population (2nd iteration), I was now running 4 queries in parallel to grab my data, despite all the data being tied together.
My Questions
Is my first data structure suitable to continue with MongoDB? If so, how do I compensate for the document size limit in the event the employees field grows too large?
Is my second data structure more suitable to continue with MongoDB? If so, how can I manipulate the fields independently? Do I create document schemas/models for each field and query them by model?
Is my third data structure still suitable for MongoDB, or should I consider a move to a relational database with this level of decentralized structure? Does this structure allow me any more freedom or ease of access to manipulate my data than the others?
Your question is a bit broad, but I will answer by saying that MongoDB should be able to handle your current data structure without too much trouble. The maximum document size for a BSON Mongo document is 16MB (q.v. the documentation). This is quite a lot of text, and it is probably unlikely that, e.g., an employee would need 16MB of storage.
If a single object needs to occupy more than the 16MB BSON maximum, you may use GridFS. GridFS splits the object into chunks and stores it in two special collections (files and chunks), so it has no storage limit other than the maximum database size. With GridFS, you may write objects of any size, and MongoDB will accommodate the operations.

Tools for real-time data visualization in a table? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
So this might be a bit of a strange one, but I'm trying to find a tool that would help me visualize real time data in a form of a table rather than a graph/chart. There are a lot of tools out there like Grafana, Kibana, Tableau that kind of fit a similar purpose, but for a very different application and they're primarily made for aggregated data visualization.
I am essentially looking to build something like a departure board at an airport. You've got flight AAA that landed 20 minutes ago and flight XXX departing in 50 minutes; once flight AAA is clear, it disappears from the departure board, and so on. Only I want to have that in real time, as the input will be driven by actions users perform on the shop floor with their RF guns.
I'd be connecting to a HANA database for this. I know it's definitely possible to build it using HTML5, Ajax and WebSockets, but before I set out on the journey of building it myself I want to see if there's anything out there that somebody else has already done better.
Surely there's something there already - especially in the manufacturing/warehousing space where having real-time information on big screens is of big benefit?
Thanks,
Linas M.
Based on your description I think you might be looking for a dashboard solution.
Dashboards are used in many scenarios, especially where an overview of the past/current/expected state of a process is required.
This can be aggregated data (e.g. how long a queue is, how many tellers are occupied/available, what the throughput of your process is, etc.) or individual data (e.g. which cashier desk is open, which team player is online, etc.).
The real-time part of your question really boils down to what you define to be real-time.
Commonly, it’s something like “information delivery that is quick enough to be able to make a difference”.
So if, for example, I have a dashboard that tells me that I will likely be short of, say, service staff tomorrow evening (based on my reservations), then to make a difference I need to know this as soon as possible (so I can call in more staff for tomorrow's shift). It won't matter much if the data takes 5 or 10 minutes to get from system entry to the dashboard, but if I only learn about it tomorrow afternoon, that's too late.
Thus, if you’re after a dashboard solution, then there are in fact many tools available and you already mentioned some of them.
Others would be e.g. SAP tools like Business Objects Platform or SAP Cloud Analytics. To turn those into “real-time” all you need to do is to define how soon the data needs to be present in the dashboard and set the auto-refresh period accordingly.

What is a Cube in SQL Server? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I often hear comments like: "we need to get the data from the cube". A quick Google search shows ways to create a cube, but no definition of what a cube is.
What do my coworkers mean by "extract the data from the cube"? Is a "cube" just a specific structure of tables?
What in the world is a "cube" in SQL Server?
This is a very broad question, but imagine a simple data set, perhaps census data. The measure you care about is the number of residents in the US.
Now, we can subdivide that number into states, list each state on a row, and show the respective count for each state. That's one 'dimension' on the data.
We could further interrogate this set by ethnicity. We'd list ethnicities in columns, changing the table into a crosstab, and the count of residents for each ethnicity/state would be listed in the intersection of the corresponding column and row.
Finally, if we wanted a third dimension, maybe religion, we'd need some third direction (not rows or columns) to list these categories. Our crosstab would become... a cube.
'Cube' is the shorthand name for a kind of database that has been specifically built to handle the various efficiency issues that come with analyzing datasets on many different dimensions -- slicing and aggregating those measures of data across the several available characteristics.
For Microsoft, the tool that generates cubes is called SSAS (SQL Server Analysis Services).
Going into more detail would take hours and hours. You will need to find some kind of tutorial resources and expect to make quite an investment of time if you want to learn the strengths and weaknesses of the various types of cubes and how to get information from them.
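As a rough illustration of the dimension idea only (not of how SSAS stores or queries anything), here is a small Python sketch. The fact rows, the placeholder labels, and the aggregate function are all made up for this example. The same measure (residents) is summed along whichever dimensions you ask for, which is the kind of slicing a cube is built to answer quickly.

```python
from collections import defaultdict

# Made-up fact rows: (state, ethnicity, religion, residents).
DIMENSIONS = ("state", "ethnicity", "religion")
FACTS = [
    ("State A", "Group 1", "Religion X", 100),
    ("State A", "Group 1", "Religion Y", 50),
    ("State A", "Group 2", "Religion X", 30),
    ("State B", "Group 1", "Religion X", 70),
    ("State B", "Group 2", "Religion Y", 20),
]


def aggregate(facts, by):
    """Sum the residents measure, grouped by the named dimensions."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


by_state = aggregate(FACTS, ("state",))              # one dimension: a list
crosstab = aggregate(FACTS, ("state", "ethnicity"))  # two dimensions: a crosstab
cube = aggregate(FACTS, DIMENSIONS)                  # three dimensions: the "cube"
```

This sketch recomputes every total on demand; an actual cube product typically precomputes and stores such aggregates so they do not have to be rebuilt from the detail rows each time.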
A cube is a dataset that can be the result of a query on one table or of a joined query on multiple tables. A cube holds already-populated data, and that data can be fetched, or pushed, in real time.

Should I create a new database or use an existing database? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I have multiple databases that sometimes interact with each other but are mostly independent. I now need to build a new application that allows users to search through the data of the other applications (sort of searching through their history).
So I'm going to need a dozen or so stored procedures/views that will access data from various databases.
Should I have each stored procedure/view on the database that is being queried? Or do I have a brand new database for this part of the application that gathers data from all other databases in views/SPs and just query that?
I think it should be the first option, but then where do I put the Login table that tracks user logins into this new report application? It doesn't belong in any other database. (Each database has its own login table; it's just the way it was set up.)
What you are asking here fits into the wide umbrella of business intelligence.
The problem you are going to hit quickly is that reporting queries tend to be few in number and relatively resource-intensive (from a hardware point of view): low volume, high intensity, if you will.
The databases you are hitting are most likely high-transaction databases, i.e. they deal with a large number of smaller queries, either a large number of single (or multiple) inserts or quick selects: high volume, low intensity, if you will.
Of course, these two models conflict heavily when trying to optimize them. Running a reporting query that joins multiple tables and runs for several minutes will often lock tables or consume resources that prevent (or severely inhibit) the database from performing its day to day job. If the system is configured for high number of small transactions, then your reporting query simply isn't going to get the resources it requires and the time lines on reporting results will be horribly long.
The answer here is a centralized data warehouse that collects the data from several sources and brings it together so it can be reported on. It usually has 3 components: a centralized data model, an ETL platform to load that data model from the several data sources, and a reporting platform that interacts with this data. There are several third-party options (listed in comments) that somewhat mimic the functionality of all three, or you can create these separately.
There are a few scenarios (usually due to an abundance of resources or a lack of traffic) where reporting directly from the production data of multiple data sources works, but those scenarios are few and far between (and hardly ever found in an actual production environment).
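As a toy sketch of those three pieces, here is a Python example using in-memory SQLite databases as stand-ins for the real servers. Every name in it (orders_db, support_db, activity_history, the columns, the sample rows) is invented for illustration. Two source databases with different schemas are extracted, mapped onto one shared model, and loaded into a separate warehouse that the search feature then queries.

```python
import sqlite3


def extract(source, query):
    """Read rows from one operational database, already shaped for the shared model."""
    return source.execute(query).fetchall()


def load(warehouse, source_name, rows):
    """Append the extracted rows to the central table, tagged with their origin."""
    with warehouse:
        warehouse.executemany(
            "INSERT INTO activity_history (source, user_name, action, happened_at)"
            " VALUES (?, ?, ?, ?)",
            [(source_name, *row) for row in rows],
        )


def search(warehouse, user_name):
    """The reporting side: it queries only the warehouse, never the sources."""
    return warehouse.execute(
        "SELECT source, action, happened_at FROM activity_history"
        " WHERE user_name = ? ORDER BY happened_at",
        (user_name,),
    ).fetchall()


# Two stand-in operational databases with different schemas (made-up data).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer TEXT, placed_on TEXT)")
orders_db.execute("INSERT INTO orders VALUES ('alice', '2020-01-02')")

support_db = sqlite3.connect(":memory:")
support_db.execute("CREATE TABLE tickets (opened_by TEXT, subject TEXT, opened_on TEXT)")
support_db.execute("INSERT INTO tickets VALUES ('alice', 'late delivery', '2020-01-05')")

# The centralized data model.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE activity_history"
    " (source TEXT, user_name TEXT, action TEXT, happened_at TEXT)"
)

load(warehouse, "orders",
     extract(orders_db, "SELECT customer, 'placed order', placed_on FROM orders"))
load(warehouse, "support",
     extract(support_db, "SELECT opened_by, 'ticket: ' || subject, opened_on FROM tickets"))
```

In a real setup the load would run on a schedule and copy only new rows. The point here is just that the heavy search queries run against the warehouse copy and leave the transactional databases alone.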

LINQ or SQL queries. What is more efficient? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a program with a database of information about people that contains a million records.
One of the tasks was to filter the results by birth date, then group them by city, and finally compare the population of each city with the given numbers.
I started to write everything as an SQL query, but then I started to wonder whether that might make the server too busy and whether it would be better to do some of the calculations in the application itself.
I would like to know if there are any rules/recommendations on:
when to use the server to make calculations?
when to use tools like LINQ in the application?
For such requirements there is no fixed rule or strategy; it is driven by the application and business requirements. A couple of suggestions that may help:
Normally an SQL query does a good job of churning through lots of data to deliver a smaller result set after filtering, grouping, and sorting. However, it needs correct table design and indexing to perform well; as the data size increases, SQL may underperform.
Transferring data over the network from a hosted database to the application is what kills performance, since the network can be a big bottleneck, especially once the data is beyond a certain size.
In-memory processing using Linq2Objects can be very fast for repetitive calls that need to apply filters, sort data, and do some more processing.
If the UI is a rich client, you can afford to bring lots of data into memory and keep working on it using LINQ as part of in-memory data structures; if the UI is web-based, you need to cache the data.
To get the same operations as SQL on in-memory data for multiple types, you need custom code, preferably using expression trees along with LINQ; for a known, fixed type, simple LINQ will do.
I have a similar design in one of my web applications. Normally it is a combination that works best in most practical scenarios.
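To show the trade-off on the task from the question, here is a small sketch in Python with the standard-library sqlite3 module. Python's in-memory filtering stands in for LINQ to Objects, and the people table, its columns, and the sample rows are all made up for illustration. The first function lets the database filter and group, so only one row per city crosses the connection; the second pulls every row and does the same work in the application.

```python
import sqlite3
from collections import Counter


def make_demo_db():
    """Build a tiny stand-in for the people database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, city TEXT, birth_date TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?)",
        [
            ("Ann", "Oslo", "1980-05-01"),
            ("Bob", "Oslo", "1991-07-12"),
            ("Cid", "Rome", "1985-03-03"),
            ("Dee", "Rome", "1975-01-01"),
            ("Eve", "Lima", "1988-11-30"),
        ],
    )
    return conn


def counts_in_database(conn, born_from, born_to):
    """Filter and group on the server; only one row per city comes back."""
    rows = conn.execute(
        "SELECT city, COUNT(*) FROM people"
        " WHERE birth_date BETWEEN ? AND ? GROUP BY city",
        (born_from, born_to),
    )
    return dict(rows.fetchall())


def counts_in_application(conn, born_from, born_to):
    """Fetch every row, then filter and group in memory."""
    rows = conn.execute("SELECT city, birth_date FROM people").fetchall()
    # ISO-formatted dates compare correctly as plain strings.
    return dict(Counter(city for city, born in rows if born_from <= born <= born_to))


def compare_with_expected(counts, expected):
    """Return {city: (counted, expected)} for cities where the numbers differ."""
    return {
        city: (counts.get(city, 0), number)
        for city, number in expected.items()
        if counts.get(city, 0) != number
    }
```

Both functions give the same counts; what differs is how much data moves. With a million rows the second version ships all of them to the application on every call, which only pays off if the rows are fetched once, cached, and reused across many different filters.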