Hello, gurus!
I've inherited an SSAS 2014 multidimensional cube at work. I've been doing SQL Server database work (queries, tables, stored procs, etc) for many years now. But I'm a complete SSAS newbie. And, even in my ignorance, I can tell that this cube I've inherited is a mess!
I've been able to keep the thing updated with new data each month, but now our company has rolled out a new product and I'm having to add five new fields to the fact table / view for keys related to that product, along with the related dimension views. I've taken a couple of shots at it, but wind up hitting numerous errors when I process the fact table partitions.
BTW, heading off the natural question, there's no way I can roll the "five new fields" data into fields that already exist unless I completely rebuild the cube from scratch, which is out of the question right now.
So, I'll try to boil down what I THINK is the problem here. Hoping someone can answer my question.
The fact data is located in four different data warehouse databases (names changed to protect company data) -
DB_Current
DB_2018
DB_2017
DB_2016
There is a fact view within each of those databases to stage the fact data. That view is called "vw_fact" and is identical across all databases. When that view gets pulled into the cube, it gets partitioned into four different partitions (per month-year) due to data size.
The new product was just rolled out this year, so I added the five new fields to "vw_fact" only in "DB_Current". I didn't change the prior years' views in their respective databases. My shot-in-the-dark guess was that the prior years' views would automatically join the matching field names to the current year's view without needing the new fields.
When I tried processing the four years' worth of partitions, I ran into numerous "field doesn't exist" errors.
So, my questions are these:
Do I have to add the five new fields to ALL FOUR views? That is, the individual views within all four years' databases?
If I have to do #1 above, do I then need to run a "Process Full" on all partitions for all four years? Or do I need to run one of the other process options?
Thank you so much in advance for any advice you can offer here!
Joel
You need matching result sets for all partitions' source queries. That doesn't mean you necessarily have to add the fields to all views, though: you can edit the source query for the individual partitions in Visual Studio. If you for some reason don't want to edit the four views (which is what I would probably do), you could hard-code the surrogate key for the unknown member, or something similar, in the queries of the partitions where the new fields are not relevant (if it's dimension foreign keys we're talking about; alternatively 0 or similar if it's measures). If you have new dimensions, I would go for a FULL process.
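As a rough illustration of the hard-coding idea, a prior-year partition's source query could look something like the following. The vw_fact name comes from the question, but the existing columns, the five new key column names, and the -1 unknown-member key are placeholder assumptions:
-- Sketch only: column names and the -1 unknown-member surrogate key are assumptions.
SELECT
    f.ExistingDimKey1,
    f.ExistingDimKey2,
    f.MeasureAmount,
    -1 AS NewProductKey1,   -- hard-coded surrogate key of the Unknown member
    -1 AS NewProductKey2,
    -1 AS NewProductKey3,
    -1 AS NewProductKey4,
    -1 AS NewProductKey5
FROM dbo.vw_fact AS f;      -- partition source in DB_2016 / DB_2017 / DB_2018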
I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.
ER Diagram
In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?
I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to query the data to find points of correlation and make predictions about the future. I want to try to connect all of the tables over time.
The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.
Example Result: When motivation levels were low in 2018, spending was also down and length of stay was shorter relative to 'n' period.
As a newbie, any and all help is greatly appreciated.
Thank you
Simplify simplify simplify.
Start with one table, with at least some columns you comprehend. Load it into some dbms (pick one with geospatial capabilities and windowing functions, you may want them later: recent versions of MariaDB, MySQL and PostgreSQL are fine choices). Import your table. This can be a pain in the axx neck to get right, but do your best to get it right anyhow.
Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.
Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new Datagrip from Jetbrains is worth a look. Avoid Microsoft's SQL Server Management Studio: no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query history feature is vital.)
Then fiddle around with queries, especially aggregates ... e.g.
SELECT COUNT(*), year, origin, destination
FROM trip
GROUP BY year, origin, destination;
Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.
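For example, once a second table is loaded, a join-based aggregate might look like the following (the region table and its columns are purely illustrative; substitute whatever your second table actually holds):
-- Sketch: assumes trip carries a region_id that matches a region lookup table.
SELECT r.region_name, t.year, COUNT(*) AS trip_count
FROM trip AS t
JOIN region AS r ON r.region_id = t.region_id
GROUP BY r.region_name, t.year;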
That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.
And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.
I have a curious question, and as my name suggests I am a novice, so please bear with me. Oh, and hi to you all; I have learned so much using this site already.
I have an MSSQL database for customers in which I am trying to track their status on a daily basis, with various attributes being recorded in several tables. These are then joined together using a data table to create a master table which yields approximately 600 million rows.
As you can imagine, querying this beast on a middling server (Intel i5, SSD for the OS, 2 TB 7200 RPM HD, SQL Server 2017 Standard) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes which have somewhat sped up the process, but it's still not fast enough. A simple SELECT DISTINCT on customer ID for a given attribute is still taking 12 minutes on average for a first run.
The whole point of having a daily view is to make it easy for something like Tableau or Qlik to connect to a single table, so the end user can create reports by just dragging in the required columns. I have thought of taking the main query that creates the master table and parameterizing it, but visualization tools aren't great for passing many variables.
This is a snippet of the table, there are approximately 300,000 customers and a row per day is created for customers who join between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data types for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns, and a lot of them, once they are set to either 0 or 1, seldom change. Is there another way to achieve this and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.
We are in the process of building a health data warehouse and have been having discussions about its basic structure. I need your suggestions on the pros and cons of the structures below. The DWH will be used for reporting and research purposes. It will be a near real-time data warehouse with a latency of around 5-10 minutes.
The source database has one Encounter/Visit table. Everything is saved in this table; it's the central table which links everything. So if I need to get a patient's journey in the production database, I just go to the encounter/visit table and see how many times a patient has come for treatment, has been admitted, has gone back from emergency, has been admitted from emergency, etc.
Model 1 ->
Encounter/Visit table having the common fields (like encounter_id, arrival_date, care_type, etc.),
and then further tables can be built per encounter type with encounter-specific fields:
Encounter_Emergency (emergency-specific fields such as emergency diagnosis, triage category, etc.)
Encounter_Inpatient
Encounter_outpatient
Model 2 ->
Having separate tables as base tables and then creating a view on top which includes all the encounter types together:
Encounter_Emergency (emergency-specific fields such as emergency diagnosis, triage category, etc.)
Encounter_Inpatient
Encounter_outpatient
Model 3 ->
Encounter/Visit table having all the fields of the source database,
and views created per encounter type with encounter-specific fields:
view_Encounter_Emergency
view_Encounter_Inpatient
view_Encounter_outpatient
These views can be further combined with the emergency_diagnosis table to get the diagnoses, or the emergency_alerts table to access the alerts, etc.
A prime consideration would be how often there will be additions, deletions, or alterations to Encounter Types.
Model B will require extensive rework in advance of any such change just to make sure the data continues to be captured. Either of the other two models will continue to capture reclassed data, but will require rework to report on it.
As between A and C, the question becomes traffic. Views are comparatively easy to spin up/down, but they'll be putting load on that big base table. That might be acceptable if the DW won't have tons of load on it. But if there will be extensive reporting (pro tip: there's always more extensive reporting than the business tells you there will be), it may be more advantageous to break the data out into stand-alone tables.
There is, of course, ETL overhead to maintaining all of those tables.
For speed of delivery, perhaps build Model C, but architect Model A in case consumption requires the more robust model. For the record, you could build Views that don't have any kind of vw_ prefix, or any other identifier in their names that lets users know that they're views. Then, later, you can replace them with tables of the same name, and legacy queries against the old views will continue to work. I've done just the same thing in the opposite direction, sneaking in views to replace redundant tables.
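As a rough sketch of what a Model C-style view without a vw_ prefix could look like (table, column, and filter names here are illustrative, not taken from the actual source system):
CREATE VIEW dbo.Encounter_Emergency
AS
SELECT e.encounter_id,
       e.arrival_date,
       e.care_type,
       d.emergency_diagnosis,
       d.triage_category
FROM dbo.Encounter AS e
JOIN dbo.emergency_diagnosis AS d
    ON d.encounter_id = e.encounter_id     -- join key is an assumption
WHERE e.care_type = 'Emergency';           -- filter value is an assumption
Swapping this for a physical Encounter_Emergency table later would not break queries that reference the name.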
Problem to be solved:
I'm new to databases and I'm trying to find out the best way to store changes in a table that is a daily snapshot of some statuses: e.g. a "hotel_room_rentals" table (with 20 columns, every one of which can change).
I'd like to be able to generate that table for a selected day (the data changes on production, so I have to store it somewhere else), or do some other transformations on it (e.g. the average number of days rented in a period).
My theoretical example - detailed:
Let's say that I'm creating a DB for a hotel.
In the production system I have a table that shows info for all 10 000 rooms in the hotel.
This is a daily snapshot - let's assume that the table is updated once per day.
Some attributes of a room change often: e.g. is_rented, customer_number, rate_usd.
Some attributes don't change too often: e.g. disabled_room, room_color, type_of_furniture.
Room_number obviously does not change (primary key).
Now I want to find the best way to track changes in this table, the best way to create statistics based on this table (e.g. average number of days rented in a period), and to be able to generate the table for a selected date (e.g. 2013-01-01).
MY IDEA:
Since I have no clue about databases, my idea is to copy the whole table every day, with one more column, called "DB_dump_date" (with a date). This is a pretty straightforward approach, which will probably require a lot of space, since my 10k-room table will have to be copied 365 times a year.
OTHER SOLUTIONS:
On some other website, I was recommended to create two tables:
"Reservation" table with these columns: Startdate Enddate Room Rate Occupant_name
Then to transform this table into a FactReservations table: Date Room Is_occupied Rate Occupant_name
I do not understand how this helps me... in fact, I assume I would have to make 20 intermediary tables and then 20 fact tables (since I have 20 columns in my database).
QUESTIONS:
What are the recommended ways to deal with such problems?
Is there any DB schema that is prepared to deal with it, without the user making magic ETLs? (e.g. a DB that can optimize the problem by itself)
What are the alternatives?
How would you, smart people, do this? (preferably in MS Access... or some freeware technology)
Edit:
One more thing - everything can change in the table, not only room reservations, everything; and I want to be able to track the changes.
stop - slow down - and take a breath.
do not - repeat, do not - make copies of tables each day. this approach is way off base.
your problem is a normalization problem. as you indicate - you have other suggestions on how to normalize - this is the direction you want to go.
Your goal will be to find a structure that accommodates the SQL statements that can answer your questions (and hopefully many more that you haven't thought up yet). This will be one static model where the tables do not change or get copied; the only thing that changes is the data inside the tables. (Ideally, to me, there will also be few to no updates, only inserts.)
You will certainly need a ROOM table and a CUSTOMER table, and then a relation between them, possibly RESERVATION.
These can then fill up - and you can get all the answers to the questions you posed without any copying or materialization or anything... just SQL.
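A minimal sketch of that static model, in generic SQL (column names and types are illustrative; adapt them to your real attributes and your chosen DBMS):
CREATE TABLE room (
    room_number       INT PRIMARY KEY,
    room_color        VARCHAR(20),
    type_of_furniture VARCHAR(30),
    disabled_room     SMALLINT          -- 0/1 flag; use a boolean type if your DBMS has one
);

CREATE TABLE customer (
    customer_number INT PRIMARY KEY,
    customer_name   VARCHAR(100)
);

CREATE TABLE reservation (
    reservation_id  INT PRIMARY KEY,
    room_number     INT NOT NULL,
    customer_number INT NOT NULL,
    start_date      DATE NOT NULL,
    end_date        DATE NOT NULL,
    rate_usd        DECIMAL(10,2),
    FOREIGN KEY (room_number)     REFERENCES room (room_number),
    FOREIGN KEY (customer_number) REFERENCES customer (customer_number)
);

-- "Which rooms were rented on 2013-01-01?" becomes plain SQL, with no daily copies:
SELECT r.room_number, res.customer_number, res.rate_usd
FROM room AS r
LEFT JOIN reservation AS res
       ON res.room_number = r.room_number
      AND '2013-01-01' BETWEEN res.start_date AND res.end_date;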
You need to focus on the requirements and start there. So far, the requirements I see are:
- Generate that table for a selected day
- Average number of days rented in a period
If we consider two extremes of design, at the more complex end would be a datamart with SCD tables, tracking changes to rooms, and at the simple end would be some kind of log table, along the lines of what you have already mentioned.
Reading between the lines, I don't really see any requirement for knowing the attributes of a room on a given day, but I do see a requirement for analysis of historical transactions.
So my suggestion is have a good hard think about your requirements before you start designing the database.
There is no magic design to cover this automatically. Dimensional design is a standard way of modelling business data to allow for easy analysis, but it might be over the top for your requirement.
Welcome to the world of databases! With that in mind – take almost everything that you know about Excel and throw it out the window. Whereas in Excel it's much more difficult to define relationships between two sheets of a workbook and report off those two sheets (so the majority of the time it's easier to simply copy the same data down a single sheet), it's trivially easy to do in Access or any other relational database.
Typically what you’d want to do is create several normalized tables and define a relationship between them. Then, when querying the view, you can easily join between the tables to get the data that you need.
So, working off of the assumption that you’re building this for simple reporting and not to create a property management system (if you are looking at that – I’d recommend that you look at some of the players in the industry, like Micros or Agilysys), based on my experience working in the industry, I’d recommend the following table layout:
Reservations – this holds the reservation information (guest name, arrival date, departure date, check-in date, check-out date, rate if you use a blended rate, etc.)
Rooms – this holds information on your rack (number, wing code, max guests, # beds, smoking/non, view, type, etc.)
Room Status – only if you need to track if a room is on reserve/hold/OOO/OTM (status type, date start, date end)
Room Status Types – types of room status holds and how they affect inventory (type, out of inventory flag)
Rates (if you don't use a blended rate) – one entry per reservation per night (guest, rate)
Personally, I’m a huge fan of using surrogate keys for the unique identifiers, because all too often I've been burned where something changes in the business process and a natural key that was previously unique all of a sudden can be duplicated. In that vein, each table would have a surrogate key and the joins would be as follows:
Reservations – Rooms (many to one)
Rooms – Room Status (one to many)
Room Status – Room Status Types (many to one)
Reservations – Rates (one to many)
If you define the relationships properly in Access (i.e. foreign key relationships in other DBMS), it should automatically use them to build your joins when creating your queries (called Views in just about every other DBMS) or reports.
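For instance, a simple occupancy query joining two of those tables in Access SQL might look like the following (the table and field names are assumptions based on the layout above, not an actual schema; #1/1/2013# is the Access date-literal syntax):
SELECT Rooms.RoomNumber, Reservations.GuestName,
       Reservations.ArrivalDate, Reservations.DepartureDate
FROM Rooms INNER JOIN Reservations
       ON Reservations.RoomID = Rooms.RoomID
WHERE Reservations.ArrivalDate <= #1/1/2013#
  AND Reservations.DepartureDate > #1/1/2013#;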
For learning about databases I’d recommend that you review:
Wikipedia on Join types
Wikipedia on Slowly Changing Dimension (you could use some of these techniques to record changes in room information over time)
Wikipedia on Relational Databases
Office documentation on Access
Kimball Group Design Tips (great for data warehouse/datamart design)
If you need to use your existing table, then the following is not applicable. If the data can be migrated to a new schema, then this will readily address the challenge. TRE is an approach which uses the current-view paradigm for development but fully supports the time dimensions of data (system time = when the data goes into the DB, and valid time = the business time which applies to the data). Working in the current-view approach of TRE makes this sort of problem straightforward. Take a look at: http://youtu.be/V1EcsuJxUno
I am on SQL Server 2008 R2 and I am currently developing a database structure which contains seasonal values for some products.
By seasonal I mean that those values won't be useful after a particular date in terms of customer use. But those values will still be used for statistical results by internal staff.
On the sales web site, we will add a product search feature, and one of my aims is to make this search as optimized as possible. But the more rows there are in the table, the slower the search will become. So I am considering archiving the unused values.
I can handle auto archiving with SQL Server Jobs automatically. No problem there. But I am not sure how I should archive those values.
The best way I can come up with is to create another table inside the same database, with the same columns, and put them there.
Example:
My main table is named ProductPrices and a primary key has been defined for it. I have created another table named ProductPrices_archive. I created a primary key field for this table as well, and gave it the same columns as the ProductPrices table except for the ProductPrices primary key value. I don't think it is useful to archive that value (am I thinking correctly?).
For internal use, I am considering putting the two tables' values together with UNION (is that the correct way?).
This database is meant to be used for a long time, and it should be designed with the best structure. I am not sure if I am missing something here for the long run.
Any advice would be appreciated.
I'd consider one of two options initially:
Use partitioning to separate the single table into current working set and archive data.
No need to use an archive table
Add ValidFrom, ValidTo columns to implement a type 2 SCD
Then add an indexed view for ValidTo IS NULL to get the current set of data
I wouldn't have 2 separate tables if all data has to be "on-line" in one database.
This leads to a 3rd option: an entirely separate database with all data. Only "current" data stays in the live database (as #Mike_Walsh's answer explains).
The indexed view option is the easiest and works with Standard Edition (with the NOEXPAND hint).
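A minimal sketch of the type 2 / indexed-view combination (the ProductPrices name comes from the question, but the other column names are assumptions):
-- ValidTo IS NULL marks the current row (type 2 SCD style).
CREATE VIEW dbo.CurrentProductPrices
WITH SCHEMABINDING
AS
SELECT ProductPriceID, ProductID, Price, ValidFrom
FROM dbo.ProductPrices
WHERE ValidTo IS NULL;
GO
-- The unique clustered index is what materializes ("indexes") the view.
CREATE UNIQUE CLUSTERED INDEX IX_CurrentProductPrices
    ON dbo.CurrentProductPrices (ProductPriceID);
GO
-- On Standard Edition, reference it with NOEXPAND so the index is actually used:
SELECT ProductID, Price
FROM dbo.CurrentProductPrices WITH (NOEXPAND)
WHERE ProductID = 42;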
gbn brings up some good approaches. I think the "right" longer-term answer for you is the 3rd option, though.
It sounds like you have two business use cases for your data:
1.) Real-time Online Transaction Processing (OLTP). This is the POS transactions, inventory management, quick "how did receipts look today, how is inventory, are we having any operational problems?" kind of questions, and it keeps the business running day to day. Here you want the data necessary to conduct operations, and you want a database optimized for updates/inserts/etc.
2.) Analytical questions/reporting. This is looking at month-over-month numbers, year-over-year numbers, running averages. These are the questions you ask that are strategic and look at a complete picture of your history - you'll want to see how last year's Christmas seasonal items did against this year's, maybe even compare those numbers with the seasonal items from the same period 5 years ago. Here you want a database that contains a lot more data than your OLTP. You want to throw away as little history as possible, and you want a database largely optimized for reporting and answering questions - probably more denormalized. You want the ability to see things as they were at a certain time, so the type 2 SCDs mentioned by gbn would be useful here.
It sounds to me like you need to create a reporting database. You can call it a data warehouse, but that term scares people these days. It doesn't need to be scary; if you plan it properly, it doesn't have to take you 6 years and 6 million dollars to build ;-)
This is definitely a longer term answer but in a couple years you'll be happy you spent the time creating one. A good book to understand the concept of dimensional modeling and thinking about data warehouses and their terminology is The Data Warehouse Toolkit.