What to do with missing values when missingness has a clear meaning?

I am trying to model the daily demand for a retailer's products based on the prices of competitors offering the same products (among other variables). I have 20 columns with the prices of various competitors for a given product on the same date. The retailer, however, carries over 80,000 products during the observed period, and of course not all competitors offer all of those products. In fact, in most cases only 2 or 3 competitors offer a given product. This leads to a lot of missing values, each indicating that a competitor does not offer that product at that moment.
I do not want to simply impute these missing values, given their large proportion, and because they contain valuable information on their own: whether a competitor offers the product at all.
Is there a supervised learning algorithm that handles missing values as specific cases?

I also agree that imputation is not a good idea in this situation, as the information about missingness gets lost. However, following the idea of creating additional features that indicate whether a value is known or missing, you could try filling the missing values with the mean and then using LogisticRegression on your data.
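A minimal sketch of that feature construction in SQL, assuming a hypothetical product_prices table with one competitor price column, comp_a_price (repeat for each competitor column):

-- Availability indicator plus mean-filled price for one competitor.
-- AVG() ignores NULLs, so the window average is the mean of observed prices.
select product_id
,obs_date
,case when comp_a_price is null then 0 else 1 end as comp_a_offers
,coalesce(comp_a_price, avg(comp_a_price) over ()) as comp_a_price_filled
from product_prices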

Related

Data Warehouse - Storing unique data over time

Basically, we are building a reporting dashboard for our software, giving clients the ability to view basic reporting information.
Example (I've removed 99% of the complexity of our actual system from this example, as it should still get across what I'm trying to do):
One example metric is the number of unique products viewed over a certain time period. For instance, if 5 products were each viewed by customers 100 times over the course of a month, a report for that month should just say 5 for the number of unique products viewed.
Are there any recommendations on how to store data in such a way that it can be queried for any time range and return a unique count of products viewed? For the sake of this example, let's say there is a rule that the application cannot query the source tables directly, so we have to store summary data in a different database and query it from there.
As a side note, we store tons of other metrics, aggregated by day. This particular metric is different because of the uniqueness issue.
I personally don't think it's possible. Our current solution is to offer 4 pre-computed time ranges for which metrics affected by uniqueness are available. If you use a custom time range, that metric is no longer available because the data isn't pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that it will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance:
select dim_prod.pcode
,count(*) as sale_count
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than the transactional fact; since it would scan less data, it would run faster, possibly even meeting your need.
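A sketch of such a daily aggregate and the same query against it (table and column names are illustrative):

-- One row per product per day; far smaller than the transactional fact.
create table agg_sale_daily (
    dkey       int not null,
    pkey       int not null,
    sale_count int not null,
    primary key (dkey, pkey)
);

select dim_prod.pcode
,sum(agg.sale_count) as sale_count
from agg_sale_daily agg
join dim_prod on dim_prod.pkey = agg.pkey
join dim_date on dim_date.dkey = agg.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode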
From the information you have provided, I think you are trying to measure 'the number of unique products viewed over a month', for example.
I'm not sure whether you are using Kimball methodology to design your fact tables, but I believe Kimball methodology would recommend an accumulating snapshot fact table to meet such a requirement.
I might be preaching to the converted (apologies in that case), but if not, the following link is where the experts explain the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from the Kimball Group, which explains the different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts. More than happy to answer any questions, to the best of my ability.
Cheers
Nithin

Quickly compute millions of values for a search

Let's say I have a database of millions of widgets with a price attribute. Widgets belong to suppliers, and I sell widgets to customers by first buying them from suppliers and then selling them to the customer. With this basic setup, if a customer asks me for every widget less than $50, it's trivial to list them.
However, I mark up the price of widgets from individual suppliers differently. So I may mark up widgets from Supplier A by 10%, and I may mark up widgets from Supplier B by a flat rate of $5. In a database, these markups would be stored in a join table with my ID, the supplier ID, a markup type (flat, percentage), and a markup rate. On top of this, suppliers may add their own markups when they sell to me (these markups would be in the same join table with the supplier's ID, my ID, and a markup type/rate).
So if I want to sell a $45 widget from Supplier A, it might get marked up by the supplier's 10% markup (to $49.50), and then my own $10 flat markup (to $59.50). This widget would not show up in the client's search for widgets costing less than $50. However, it's possible that an $80 widget could get marked down to $45 by the time it reaches the client, and should be returned in results. These markups are subject to change, and let's assume I'm one of hundreds of people in this system selling widgets to customers through suppliers, all with their own markup relationships in that markup table.
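For concreteness, the markup join table described above might look something like this (a hypothetical sketch; names are illustrative):

-- One row per markup relationship between two parties.
create table markups (
    seller_id   int not null,            -- the party applying the markup
    buyer_id    int not null,            -- the party paying the marked-up price
    markup_type varchar(10) not null,    -- 'flat' or 'percentage'
    markup_rate decimal(10,4) not null,  -- e.g. 5.00 flat, or 0.10 for 10%
    primary key (seller_id, buyer_id)
);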
Is there any precedent for performing calculations like this quickly across millions of objects? I realize this is a huge, non-trivial problem, but I'm curious how one would start addressing a problem like this.
Add columns to your database and store the computed results, updating them when the related records change. You cannot calculate these values on the fly for millions of records.
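One way to sketch that, assuming hypothetical widget and markup tables, is a precomputed price table keyed by widget and reseller:

-- Precomputed final prices; refreshed whenever a markup or base price changes.
create table widget_prices (
    widget_id   int not null,
    reseller_id int not null,
    final_price decimal(10,2) not null,
    primary key (reseller_id, widget_id)
);
create index ix_widget_prices on widget_prices (reseller_id, final_price);

-- A customer's "under $50" search then needs no markup math at query time:
select widget_id
from widget_prices
where reseller_id = 42          -- hypothetical reseller id
  and final_price < 50.00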
Is there any precedent for performing calculations like this quickly across millions of objects?
Standard. Seriously. Data warehouses, risk projections, stuff like that - your problem is small. Precalculate all combinations, store them in a proper higher-level database server, finished.
It is not huge - seriously. It is only huge for a small server, but once you get a calculation grid going, it is quite trivial. Millions of objects? At 100,000 objects per minute per machine, 10 million objects take 100 machine-minutes. And you don't have THAT many changes.

Database design - when to split tables?

Sometimes creating a separate table would produce much more work; should I split it anyway?
For example: in my project I have a table of customers. Each customer has his own special price for each product (there are only 5 products, and more are not planned for the future), and each customer also has specific days of the week on which the company delivers products to him.
Many operations, like changing the days or prices for a customer, or displaying the days and prices of all customers, would be much easier if the days and product prices were columns in the customers table rather than separate tables. So is it wrong to create just one big customers table in such a case? What are the drawbacks?
UPDATE: They just informed me that after a year or so there's a chance they will add more products, though they say their business won't exceed 20-30 products in any event.
I still can't understand why, in a case where product prices have no relation to one another (each customer has his own special price), adding rows to a Products table is better than adding columns to the Customers table.
The only benefit I can think of is that a customer who has only 5 products won't have to 'carry' 20 nullable product columns (saving space on the server). I don't have much experience, so maybe I'm missing the obvious?
Clearly, just saying that one should always normalize is not pragmatic. No advice is always true.
If you can say with certainty that 5 "items" will be enough for a long time, I think it is perfectly fine to just store them as columns if it saves you work.
If your prediction fails and a 6th item needs to be stored, you can add a new column. As long as the number of columns is very unlikely to get out of hand, this should not be a problem.
Just be careful with such tactics, as the ability of many programmers to predict the future turns out to be very limited.
In the end only one thing counts: Delivering the requested solution at the lowest cost. Purity of code is not a goal.
Normalization is all about data integrity (consistency), nothing else; not about hard, easy, fast, slow, efficient, and other murky attributes. The current design almost certainly allows data anomalies. If not right now, then the moment you try to track price changes, invoices, orders, etc., it becomes a dead end.
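For concreteness, a normalized sketch (hypothetical names) that avoids those anomalies:

create table customers (
    customer_id int primary key,
    name        varchar(100) not null
);

create table products (
    product_id int primary key,
    name       varchar(100) not null
);

-- One row per customer/product pair; adding a 6th product is an insert,
-- not a schema change.
create table customer_prices (
    customer_id int not null references customers (customer_id),
    product_id  int not null references products (product_id),
    price       decimal(10,2) not null,
    primary key (customer_id, product_id)
);

-- One row per delivery day per customer.
create table delivery_days (
    customer_id int not null references customers (customer_id),
    day_of_week smallint not null,  -- 1 = Monday ... 7 = Sunday
    primary key (customer_id, day_of_week)
);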

Which table design is better for a Balance that is broken up into multiple parts?

I have a database where the Balance and Payments need to be broken down into different "money buckets" to show how they are allocated. For example, there is principal, interest, late fees, bounced check fees, Misc, etc. There are up to 10 different money buckets.
Which of these two methods is the better way of designing a database for this, and why?
Option A
PAYMENTS
AccountId
// Other payment-related columns
TotalPaid
PrincipalPaid
InterestPaid
MiscPaid
BadCheckChargesPaid
...
Option B
PAYMENTS
AccountId
// Other payment-related columns
TotalPaid
PAYMENT_DETAILS
PaymentId
PaymentTypeId
AmountPaid
In most cases, only 1-3 of the different balance types are used.
Option B is the better normalized, more flexible option (easy to add a new bucket later) and would get my vote.
While the normalization fairy can often tempt you in the direction of the latter (as she does me), the former is probably the more sensible. You're only talking about 10 columns (not 500), and no normalization rules are really being broken. Unless there's a strong possibility that this list of payment allocation buckets will grow, I would stay away from the EAV structure, just because of the headaches (and innumerable joins in some queries) it can produce.
Option B seems better to me. A clincher would be whether your application is designed to show the details like this:
Item Amount
-------------- ---------------
Principal $10.00
Interest $1.11
If so, the normalized version is not only "righter" but actually stores the data in a format closer to what your application requires.
To me, the big question is whether you store the payment total in the payment record or derive it from the details.
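Under Option B, the total can be derived with a simple aggregate (a sketch, assuming PAYMENTS carries a PaymentId key):

select p.PaymentId
,p.AccountId
,sum(d.AmountPaid) as TotalPaid
from PAYMENTS p
join PAYMENT_DETAILS d on d.PaymentId = p.PaymentId
group by p.PaymentId, p.AccountId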

Is this a textbook design pattern, or did I invent something new?

I've just finished designing a set of tables, and I came up with an architecture that I'm very pleased with! I've never seen it anywhere before, so I'd love to know if I've just reinvented the wheel (most probable) or if this is a genuine innovation.
Here's the problem statement: I have Employees who can each sign a different contract with the company. Each employee can perform different Activities, and each activity may have a different pay rate: sometimes a fixed amount for completing one activity, sometimes an hourly rate, and sometimes a tiered rate. There may also be a specific customer who particularly likes the employee, so when he works with that specific customer, he gets a higher rate. And if no rate is defined, he gets the company default rate.
Don't fuss about the details: the main point is that there are a lot of pay rates that can be defined, each in a fairly complicated way. And the pay rates all have the following in common:
Service Type
Pay Scale Type (Enum: Fixed Amount/Hourly Rate/Tiered Rate)
Fixed Amount (if PayScaleType = FA)
Hourly Rate (if PayScaleType = HR) - yes, could be merged into one field, but for reasons I won't go into here, I've kept them separate
Tiers (1->n relationship, with all the tiers and the amount to pay once you have gone over the tier threshold)
These pay rates apply to:
Default company rate
Employee rate
Employee override rate (defined per customer)
If I had to follow the simple brute-force approach, I would have to create PayRate and PayRateTier clone tables for each of the 3 tables above, plus their corresponding LINQ classes, plus logic to calculate the rates in 3 separate places, somehow refactored to reuse the calculation logic. Ugh. That's like using copy and paste, just on the database.
So instead, what did I do? I created an intermediary table, which I called PayRatePackage, consisting only of an ID field. I have a single PayRate table with a mandatory FK to PayRatePackage, and a PayRateTier table with a mandatory FK to PayRate. Then DefaultCompanyPayRate has a mandatory FK to PayRatePackage, as do EmployeeRate and EmployeeOverrideRate.
So simple - and it works!
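A minimal DDL sketch of that shape (column types, and any names beyond those in the question, are illustrative):

create table PayRatePackage (
    PayRatePackageId int primary key
);

create table PayRate (
    PayRateId        int primary key,
    PayRatePackageId int not null references PayRatePackage (PayRatePackageId),
    ServiceType      varchar(50) not null,
    PayScaleType     varchar(10) not null,  -- 'FA', 'HR', or tiered
    FixedAmount      decimal(10,2) null,    -- used when PayScaleType = 'FA'
    HourlyRate       decimal(10,2) null     -- used when PayScaleType = 'HR'
);

create table PayRateTier (
    PayRateTierId int primary key,
    PayRateId     int not null references PayRate (PayRateId),
    Threshold     decimal(10,2) not null,
    Amount        decimal(10,2) not null
);

-- Each consumer needs only a foreign key to a package, not cloned tables.
create table EmployeeRate (
    EmployeeId       int primary key,
    PayRatePackageId int not null references PayRatePackage (PayRatePackageId)
);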
(Pardon me for not attaching diagrams; that would be a lot of effort to go to for a SO question where I've already solved the main problem. If a lot of people want to see a diagram, please say so in the comments, and I'll throw something together.)
Now, I'm pretty sure that something this simple and effective must be in a formal design pattern somewhere, and I'd love to know what it is. Or did I just invent something new? :)
I'm pretty sure this is the Strategy Pattern:
"Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it."
Sounds like relational database design to me. You broke out specific logic into specific entities, and keyed them back to the original tables... Standard normalization...