What are ASO and BSO, and what is the advantage of using them? - Essbase

What are ASO and BSO, and what is the difference between aggregate storage and block storage?
When should the aggregate storage technique be used, and when the block storage technique?

Oracle's answer: http://docs.oracle.com/cd/E26232_01/doc.11122/esb_dbag/frameset.htm?ainaggr.html
In short: if you have a very sparse cube and users do not need to update values in cells, use ASO.

A fundamental and frequent question in Essbase interviews is: what is the difference between ASO and BSO applications?
Here are a few differences between ASO and BSO.
Essbase has two distinct storage options, the Aggregate Storage Option (ASO) and the Block Storage Option (BSO), and each has its own unique significance.
Characteristics of ASO:
High dimensionality.
No calculation scripts.
Only one database can be created under one application.
Application names must follow naming conventions: they cannot be metadata, temp, log, or default.
Dynamic Time Series and Time Balance properties are not available.
If the dimension build process adds any new member, the data is erased; otherwise the data is preserved.
Only one type of partition is available (transparent).
There is no concept of sparse and dense dimensions.
No Boolean attribute tag.
Only the Store Data, Never Share, and Label Only data storage properties are available.
Characteristics of BSO:
Fewer dimensions, but they model the business.
Special functionality for the Accounts and Time dimensions, such as Dynamic Time Series, Time Balance, and variance reporting.
Three types of partitions: replicated, transparent, and linked.
Currency conversion is possible.
There is no restriction on the number of databases under one application, but there is a performance cost.
Complex calculations can be achieved using calc scripts.

In ASO, data can be loaded only at level 0, whereas in BSO data can be loaded at any level.

Note: outside Essbase, ASO also stands for App Store Optimization, the process of optimizing mobile apps to rank higher in an app store's search results so that they are more visible to potential customers. That meaning is unrelated to the Essbase storage option discussed here.

Related

How to model Storage Capacity in BPMN?

I am currently trying to model a warehouse with import and export processes. My problem is that I do not know how to model the capacity of the different storage places in the warehouse. There are processes where vehicles arrive with different loads, all of which need to be stored in the warehouse, which has limited capacity; otherwise the arriving goods have to be declined.
I am modeling this process in a BPM suite and was thinking about using Python to attack the problem. I thought I could simply use variables and if clauses to check the capacity of each storage place. But when I simulate the process with this approach, the variables are re-instantiated with their start values each time and do not hold the actual value, because the script is included in the model as a script task.
Does anyone have other ideas for modeling capacity in BPMN?
Have you considered not using BPMN, since it clearly adds more complexity than benefit in your case? Look at Cadence Workflow, which allows you to specify orchestration logic as normal code and would support your requirements directly, without any ugly workarounds.

What is the difference between table distribution and table partitioning in SQL?

I am still struggling to identify how the concept of table distribution in Azure SQL Data Warehouse differs from the concept of table partitioning in SQL Server.
The definitions of both seem to achieve the same results.
Azure DW has up to 60 compute nodes as part of its MPP architecture. When you store a table on Azure DW, you are storing it among those nodes. Your table's data is distributed across these nodes (using hash distribution or round-robin distribution, depending on your needs). You can also choose to have your table (preferably a very small one) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
Partitioning is completely divorced from this concept of distribution. When we partition a table, we decide which rows belong to which partitions based on some scheme (such as partitioning an order table by order.create_date). The chunk of records for each create_date range is then stored in its own table, separate from any other create_date set of records (invisibly, behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Microsoft's documentation gives an example where horizontal partitioning is done on a name column, with two "shards" based on the names' alphabetical order.
Table distribution is a concept that is only available on MPP-type RDBMSs such as Azure DW or Teradata. It is easiest to think of it as a hardware concept that is somewhat divorced from the data; Azure gives you a lot of control here, where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not), and it is easiest to think of it as a storage/software concept that is defined by, and dependent on, the data in the table.
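To make the contrast concrete, here is a rough sketch of the kind of DDL Azure SQL Data Warehouse accepts; the table, columns, and boundary dates are hypothetical, chosen only to illustrate the two clauses:

    -- Hypothetical fact table. DISTRIBUTION spreads rows across the
    -- compute nodes by hashing CustomerId; PARTITION then groups rows
    -- within each distribution into date ranges.
    CREATE TABLE dbo.FactOrders
    (
        OrderId    BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        CreateDate DATE          NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (CustomerId),
        PARTITION ( CreateDate RANGE RIGHT FOR VALUES
            ('2019-01-01', '2019-02-01', '2019-03-01') )
    );

    -- A date-bounded query like this can then skip every partition
    -- outside the requested window (the "10 days of orders" case above).
    SELECT OrderId, Amount
    FROM dbo.FactOrders
    WHERE CreateDate >= '2019-03-01' AND CreateDate < '2019-03-11';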
In the end, they both work to solve the same problem. But then, nearly every RDBMS concept (indexing, disk storage, optimization, partitioning, distribution, etc.) is there to solve the same problem, namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts to match your data retrieval needs, you can make your SQL requests CRAZY fast, even against monstrously huge data.
Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about the entire history of the world. It is the size of a 42-story building.
Now what if the librarian splits that book into one book per year? That makes it much easier to find all the information you need for a few specific years, because you can leave the other books on the shelves.
A small book is easier to carry, too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is useful for the majority of the queries and has a reasonably even distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book and sends sets of pages to many different libraries?
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
Conceptually they are the same: the basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 60 underlying databases within which each table you define is created. You do nothing beyond defining the keys; the distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There is even more to it, but you get the core idea. These are different processes and mechanisms that, at the macro level, arrive at a similar end point, but the processes they support are very different: distribution assists in increased performance, while partitioning is primarily a means of improved data management (rolling windows, etc.). They are very different things with different intents, even though they look similar.
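As a point of comparison, here is a minimal sketch of what "defining everything yourself" looks like on plain SQL Server; names and boundary values are hypothetical. You create the partition function, then the partition scheme, and only then the table on top of them:

    -- Boundary values split the data into date ranges.
    CREATE PARTITION FUNCTION pfOrderDate (DATE)
        AS RANGE RIGHT FOR VALUES ('2019-01-01', '2020-01-01');

    -- The scheme maps each partition to storage (all to PRIMARY here).
    CREATE PARTITION SCHEME psOrderDate
        AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

    -- The table is then created on the scheme, keyed by the date column.
    CREATE TABLE dbo.Orders
    (
        OrderId    BIGINT NOT NULL,
        CreateDate DATE   NOT NULL
    ) ON psOrderDate (CreateDate);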

Data model design guidelines with Geode

We are soon going to start something with Geode for reference data, and I would like to get some guidelines for it.
As you know, in the financial reference data world there are complex relationships between various reference data entities, such as Instrument, Account, and Client, which might be stored in a database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what is the best way to handle them in an in-memory grid?
Case 1:
Create separate regions for all the tables in your database and then do a similar join using OQL, as you would in the database?
Even if you do so, you will have to design with great care so that related entities are always co-located within the same partition.
Model 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query with equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee and Department, using emp.deptId = dept.deptId. [OK, fantastic: one region with such a view model exists.]
(2) I have another join query requiring Employee, Department, Salary, and Address joins, to address a different requirement.
So I then have to create a view model for (2), which will contain Employee and Department data similar to (1). This may soon hit a memory threshold.
Changes in the database can still be managed by event listeners, but what are the recommendations for that?
Thanks,
Dharam
I think your general question is pretty broad, and there isn't just one recommended approach that covers all use cases (primarily all the analytical views/models of your data required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
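As a sketch, a collocated equi-join in OQL might look like the query below. The region names and fields are made up for illustration, and the two partitioned regions are assumed to be collocated on deptId:

    SELECT e.name, d.deptName
    FROM /Employees e, /Departments d
    WHERE e.deptId = d.deptId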
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite optimal compared with Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
Also, you might consider how expensive it is to join the data on the fly. That is, if you are able to aggregate, transform, and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).
You should know, and I am sure you are aware, that GemFire is a key-value store, so mapping a complex object graph into separate regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to load them eagerly vs. lazily is a hard problem, especially in a distributed, replicated data store such as GemFire, where consistency and availability tradeoffs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
No doubt any design choice you make will undoubtedly involve a hybrid approach. Often times the path is not clear since it really "depends" (i.e. depends on the application and data access patterns, load, all that).
But one thing is certain: make sure you have a good cursory knowledge and understanding of the underlying data store and its data management capabilities, particularly as they pertain to consistency and availability, beginning with this.
Note that there is also a GemFire Slack channel, as well as an Apache DEV mailing list, that you can use to reach the GemFire experts and the community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

One or many SQL tables for persisting "families" of properties about one object?

Our application (using a SQL Server 2008 R2 back-end) stores data about remote hardware devices reporting back to our servers over the Internet. There are a few "families" of information we have about each device, each stored by a different server application into a shared database:
static configuration information entered by users through our web app, e.g. physical location, friendly name, etc.
logged information about device behavior, e.g. last reporting time, date the device first came online, whether device is healthy, etc.
expensive information re-computed by scheduled jobs, e.g. average signal strength, average length of transmission, historical failure rates, etc.
These properties are all scalar values reflecting the most current data we have about a device. We have a separate way to store historical information.
The largest number of device instances we have to worry about will be around 100,000, so this is not a "big data" problem. In most cases a database will have 10,000 devices or less to worry about.
Writes to the data about an individual device happen infrequently, typically every few hours. It is theoretically possible for a scheduled task, user-entered configuration changes, and dynamic data to all update the same device at the same time, but this seems very rare. Reads are more frequent: probably ten reads per minute against at least one device in a database, and a full scan of some properties of all devices in a database several times per hour.
Deletes are relatively rare; in fact, in many cases we only "soft delete" devices so we can use them for historical reporting. New device inserts are more common, perhaps a few every day.
There are (at least) two obvious ways to store this data in our SQL database:
The current design of our application stores each of these families of information in separate tables, each with a clustered index on a Device ID primary key. One server application writes to one table each.
An alternative implementation that has been proposed is to use one large table and create covering indexes as needed to accelerate queries for groups of properties (e.g. all static info, all reliability info, etc.) that are frequently queried together.
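For concreteness, here is a minimal T-SQL sketch of the two options; all table and column names are hypothetical stand-ins for the families described above:

    -- Option 1: one table per family, sharing the DeviceId key.
    CREATE TABLE dbo.DeviceConfig
    (
        DeviceId     INT PRIMARY KEY,   -- clustered by default
        FriendlyName NVARCHAR(100),
        Location     NVARCHAR(200)
    );

    CREATE TABLE dbo.DeviceStatus
    (
        DeviceId        INT PRIMARY KEY,
        LastReportTime  DATETIME2,
        FirstOnlineDate DATETIME2,
        IsHealthy       BIT
    );

    -- Option 2: one wide table holding every family, with covering
    -- indexes added per frequently queried group of columns.
    CREATE TABLE dbo.Device
    (
        DeviceId          INT PRIMARY KEY,
        FriendlyName      NVARCHAR(100),
        Location          NVARCHAR(200),
        LastReportTime    DATETIME2,
        FirstOnlineDate   DATETIME2,
        IsHealthy         BIT,
        AvgSignalStrength FLOAT
    );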
My question: is there a clearly superior option? If the answer is "it depends" then what are the circumstances which would make "one large table" or "multiple tables" better?
Answers should consider: performance, maintainability of the DB itself, maintainability of the code that reads/writes rows, and reliability in the face of unexpected behavior. Maintainability and reliability are probably a higher priority for us than performance, if we have to trade off.
I don't know of a clearly superior option, and I don't know SQL Server architecture in particular. But I would go for the first option, with separate tables for different families of data. Some advantages could be:
granting access to specific sets of data (may be desirable for future applications)
archiving different families of data at different rates
partial functionality of the application in the case of maintenance on a part (some tables available while another is restored)
indexing and partitioning/sharding can be performed on different attributes (static information could be partitioned on device id, logging information on date)
different families can be assigned to different cache areas (so the static data can remain in a more "static" cache, and more rapidly changing logging type data can be in another "rolling" cache area)
smaller rows pack more rows into a block which means fewer block pulls to scan a table for a specific attribute
less chance of row chaining when altering a table to add a column, and easier maintenance if you do
easier to understand the data when it is separated into logical units (families)
I wouldn't consider table joins a disadvantage when the tables are properly indexed. But more tables do mean more moving parts and a need for greater awareness and documentation of what is going on.
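For example, using the hypothetical per-family tables sketched earlier, reassembling a device's properties is a single keyed join:

    SELECT c.FriendlyName, s.LastReportTime, s.IsHealthy
    FROM dbo.DeviceConfig c
    JOIN dbo.DeviceStatus s ON s.DeviceId = c.DeviceId
    WHERE c.DeviceId = 42;   -- 42 is an arbitrary example ID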
The first option is the recognized "standard" way to store such data in a relational database, although a good design would probably result in even more tables. Relational database software such as SQL Server was designed to store and retrieve data from multiple tables quickly and efficiently.
In addition such designs allow for great flexibility, both in terms of changing the database to store extra data, and, in allowing unexpected/unusual queries against the data stored.
The single-table option sounds beguilingly simple to practitioners unfamiliar with relational databases. In practice such designs perform very badly, are difficult to manage, and lead to a high number of deadlocks and timeouts.
They also lead to development paralysis. You cannot add a requested feature because it cannot be done without a total redesign of the "simple" database schema.

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier: you just add rows instead of physically creating columns in the fact table. I can also see how it adds work to the ETL process, adds another join to the star schema, leaves one generic column in the fact table to hold all measure data, and so on.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
The EAV model is generally considered to be a headache to query and maintain
Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
I suspect you would end up creating views to give the appearance of multiple fact tables anyway
You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
What about measures with different data types?
Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
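To illustrate what is being compared, here is a minimal sketch of the two designs; all names and types are hypothetical:

    -- Conventional star-schema fact table: one column per measure.
    CREATE TABLE dbo.FactSales
    (
        DateKey     INT NOT NULL,
        ProductKey  INT NOT NULL,
        SalesAmount DECIMAL(18,2),
        UnitsSold   INT
    );

    -- Measure-dimension (EAV-style) alternative: one row per measure,
    -- one generic Value column, and an extra join to the dimension.
    CREATE TABLE dbo.DimMeasure
    (
        MeasureKey  INT PRIMARY KEY,
        MeasureName NVARCHAR(50)      -- e.g. 'SalesAmount', 'UnitsSold'
    );

    CREATE TABLE dbo.FactSalesEAV
    (
        DateKey    INT NOT NULL,
        ProductKey INT NOT NULL,
        MeasureKey INT NOT NULL,      -- FK to DimMeasure
        Value      DECIMAL(18,2)      -- every measure forced into one type
    );

    -- Getting the measures back side by side requires pivoting, which is
    -- roughly the "views over the fact table" point raised above.
    SELECT f.DateKey, f.ProductKey,
           MAX(CASE WHEN m.MeasureName = 'SalesAmount' THEN f.Value END) AS SalesAmount,
           MAX(CASE WHEN m.MeasureName = 'UnitsSold'  THEN f.Value END) AS UnitsSold
    FROM dbo.FactSalesEAV f
    JOIN dbo.DimMeasure m ON m.MeasureKey = f.MeasureKey
    GROUP BY f.DateKey, f.ProductKey;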
SSAS will do this, and I know of a major vendor of insurance policy administration software whose M.I. solution for their system works like this. You do get some flexibility from the approach, in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure, [Value], [Amount], or whatever it's called. If your tool won't let you inject calculated measures at the front end, then you can't sort the whole data set on the value of one of your attribute types. ProClarity and Report Builder >= 2.0 will do this, but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of difference to the cube, it will be slow to query on the database and will increase storage requirements. It is also fiddly to query directly against the database.