Does a column of only zeros provide any information in a data analysis?. What if it has missing values? - data-science

I have a tricky question someone asked me:
I got a bunch of columns with data to predict some future sales.
There are a bunch of these columns that use a lot of memory and only got zeros. The question is: Can I just remove these columns from my analysis?
Second part. What if the columns that only have zeros also have missing values. What do you do?

Analyzing datasets with large numbers of zeros can indeed be a waste of computational resources (see sparse matrices computation), especially since they may not be contributing any meaningful information. In fact, they might even add noise to your dataset, obscuring any relationships you might find otherwise.
But there are cases were zeros can be incredibly meaningful. For example, if you were trying to predict future sales of products based on units sold (count data), with each column representing a month of sales, you might want to keep the zeros as they provide insight into sales of your product during those specific months.
Removing missing values is definitely tricky, and they can often be a sign that you might need to re-examine the data collection process for explanation; is there a reason why the data is missing, or does the missing value mean something (e.g. sometimes zeros can be coded as NAs)?
How you deal with zeros/missing values varies greatly based on the domain you're in, the specific question you're trying to answer, and what exactly the meaning of your data/columns are. So it's kind of hard to answer without knowing much about the data itself.

Related

How do you deal with missing data when it's missing like 60%?

My data has a lot of missing values and I have to predict those values. One way is to take the average of those values. But I want to hear an other perspective on it. How experienced data scientist solve such kind of issue?
Are your missing values categorical or continuous?
One way is to remove the samples entirely, however this may lead to a sampling bias, since the missing values could have been the result of some causal effect, that is the missing values are not missing completely at random.
If your data has enough dimensionality, you can treat your missing values as the output and try to apply a predicting model and hope that it can faithfully estimate the missing values, given the explanatory variables you already have.
Picking the most frequent value, the median, or averaging as you point out could also be an option, however be careful with outliers when averaging as these can have a tremendous effect on the mean.
It depends on nature of variables, it may be some statistics like mean or median. Another practice is assign to missing variables some value different from others for example 0, -1 or something like this.
The hardest approach is to impute the dataset and not deviate too far from the truth. A test to validate how well you have done this is the following. If the other parameters provide enough evidenced insight to impute with a level of precision for missing data....it should be able to do it with existing data.
So if 60 percent of the column is missing, take the row observations where this column is PRESENT.
Next, randomly choose to remove 60% of this subsetted data. Now run imputation methods of your choosing.
Compare the imputed dataset to the real data set for similarity. Decide if they are close enough for you to then run this against the full data set. At least this approach will give you a leg to stand on if you need to defend yourself.
Fight the Good Fight.

Oracle String Conversion - Alpha String to Numeric Score, Fuzzy Match

I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream "Snug" my initial thought to this was to convert Sung and Snug to where each character equals a number then the sums would be the same, so even if they transverse a character, I'd be able to bucket these appropriately.
The other is where in one stream it comes in as "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these such that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets such that I can stop having so much noise in my primary task of finding where people are truly different people as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (Wikipedia here). This measures the "edit" distance between two strings -- number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together -- assuming your names are predominantly English (the soundex algorithm was invented by the US Census Bureau, so it would work best on names in America).

Neural Network Input and Output Data formatting

and thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now compared to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because thats just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and loses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function as there will be a "discontinuity" between short price winners and short price losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data aka "clamping" or "winsorizing" is something I use for similar data. For example "days since last run" for a horse could be anything from just one day to several years but tends to centre in the region of 20 to 30 days. Some experts use a figure of say 63 days to indicate a "spell" so you could have an indicator variable like "> 63 =1 else 0" for example. One clue is to look at outliers say the upper or lower 5% of any variable and clamp these.
If you use odds/dividends anywhere make sure you use the probabilities ie 1/(odds+1) and a useful idea is to normalize these to 100%.
The odds or parimutual prices tend to swamp other predictors so one technique is to develop separate models, one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which, have meaning (ex. A blank response for a variable dealing with the ratio of satisfactory accounts to toal accounts, thus the individual does not have any accounts if they do not have a response in this column, whereas a response of 0 means the individual has no satisfactory accounts).
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields with a value that is not in the range of the data (ex. if we go back to the above ratio variable, we could use 9999 or -1 as these values are not included in the range of a ratio variable (0 to 1)). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to logical restrictions on your experimental design and the fact that it will weaken the power of your experiment some relative to having the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging a few numbers in. See this page for more information. Ultimately this is probably a better question for Cross-Validated at least until you have figured out the experimental design issues.

precision gains where data move from one table to another in sql server

There are three tables in our sql server 2008
transact_orders
transact_shipments
transact_child_orders.
Three of them have a common column carrying_cost. Data type is same in all the three tables.It is float with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1 - transact_orders this column has value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000..... here.
Table 2 - transact_shipments three rows are fetching carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000..... here also.
Table 3 - transact_child_orders is summing up those three carrying costs from transact_shipments. And the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this stable. And its showing that precision gained value in ui also. Though ui is only fetching the value, not doing any conversion. In the java code the variable which is fetching the value from the db is defined as double.
The code in step 3, to sum up the three carrying_costs is simple ::
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step ? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - For example, finance.
Floats can scale from very small numbers, to very high numbers. But they don't do that without losing a degree of accuracy. You can look the details up on line, there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
I'm not sure what is causing your performance issue with the DECIMAL data type, often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause; nothing is faster than integer arithmetic. So, store your values are integers? £1.10 could be stored as 110p. Or, if you know you'll get some fractions of a pence for some reason, 11000dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is ALOT written on line about how to use floats, and more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of time people have screwed up data using that unfortunately common but naive assumption.)