I have looked everywhere I can, but I couldn't find an answer to my question regarding the rpart package.
I have built a regression tree using rpart with around 700 variables. I want to get the variables actually used to build the tree, including the surrogates. I can find the variables used via tree$variable.importance, but I also need the surrogates because I need them to predict on my test set. I do not want to keep all 700 variables in the test set because the data is very large (20 million observations) and I am running out of memory.
The list variable.importance in an rpart object does show the surrogate variables, but it only shows the top variables limited by a minimum importance value.
The splits matrix in an rpart object lists all of the split variables and their surrogate variables, along with some other data such as the index, the value on which it splits (for a continuous variable) or the categories that are split (for a categorical variable), and the count of observations that split applies to. It doesn't give a hierarchy of which surrogates apply to which split, but it does list every variable. To get the hierarchy, you have to use summary(rpart_object).
Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering and grouping.
I understand that a 40-character string costs much more RAM and time to compare than a simple integer.
But I would have some overhead because of a string-to-integer lookup table for translating between the two types and for knowing which integer belongs to which string "number".
Maybe .astype('category') can help me here? Isn't this stored as integers internally? Does this speed up some operations?
The user guide has the following about categorical data use cases:
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
The book, Python for Data Analysis by Wes McKinney, has the following on this topic:
The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified.
Some example transformations that can be made at relatively low cost are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.
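To make the memory/speed point concrete, here is a minimal sketch (the column name, sizes, and string format are illustrative assumptions only) showing that a category column is backed by small integer codes and typically takes far less memory than the equivalent string column:
import pandas as pd
import numpy as np

# 1 million rows drawn from 1,000 distinct 40-ish character ID strings (hypothetical)
ids = np.random.choice([f"person-{i:035d}" for i in range(1000)], size=1_000_000)
df = pd.DataFrame({"person_id": ids})
df["person_id_cat"] = df["person_id"].astype("category")

# Each value is stored as a small integer code plus one copy of each
# distinct string in .cat.categories
print(df["person_id_cat"].cat.codes.dtype)   # int16 for 1,000 categories
print(df.memory_usage(deep=True))            # the plain string column is much larger

# GroupBy and equality comparisons operate on the integer codes internally
print(df.groupby("person_id_cat", observed=True).size().head())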
My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my Colab session (allegedly running on a TPU) won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of data science/machine learning that is an art, not a science. A couple of ideas:
One-hot encode everything, then use a dimensionality reduction algorithm (PCA, SVD, etc.) to reduce the number of columns.
Only one-hot encode the most common values (say, limit it to 10 or 100 categories rather than 53,000), and lump the rest into an "other" category (see the sketch after this list).
If it's possible to construct an embedding for these variables (not always possible), you can explore this.
Group/bin the values in the column by some underlying feature. E.g. if the feature is something like days_since_X, bin it by 100 or so. Or if it's names of animals, group it by type instead (mammal, reptile, etc.).
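As a minimal sketch of the second idea (the function name, the cutoff of 100, and the column name are assumptions, not a fixed recipe), you can keep only the most frequent categories and map everything else into an "other" bucket before one-hot encoding:
import pandas as pd

def one_hot_top_n(df, col, n=100, other_label="other"):
    # Keep the n most frequent categories; map everything else to "other"
    top = df[col].value_counts().nlargest(n).index
    reduced = df[col].where(df[col].isin(top), other_label)
    return pd.get_dummies(reduced, prefix=col)

# Hypothetical usage:
# dummies = one_hot_top_n(train, "high_card_feature", n=100)
# train = pd.concat([train.drop(columns=["high_card_feature"]), dummies], axis=1)
This turns a 53,000-level column into at most n + 1 dummy columns, at the cost of losing detail about the rare categories.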
It seems I am struggling to understand the difference between a set and a tuple in MDX. I've read very fancy definitions comparing the two, but the only difference I can see is that 'a set has same-type members' and 'a tuple has non-same-type members'. Other than that, any definition I read or come across (talking about dimensional space or what-not) seems to make no sense. The 'one-item' case I get:
# Tuple
[Team].[Hierarchy].[Code].[DET]
And then multiple items of that same type (dimensionality) form a set, OK:
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DAL]}
But here are a few examples that don't make sense to me:
# How is this a set? It just has the exact same item twice!
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DET]}
And another example:
# Tuple (again, same thing -- now adding a duplicate attribute)
(
{[Team].[Hierarchy].[Code].[DET],[Team].[Hierarchy].[Code].[DET]},
[Team].[Name].[Name].[Detroit Lions]
)
Now since both of these are almost doing the same thing (and neither references a measure, so neither would be self-sufficient to pull a 'value'), what is the actual difference between a tuple and a set? These seem to be so loosely defined in the language (for example, above I can have duplicate members in a set, which is usually not allowed in a set).
A related question (some of the answers cover the basics of a one-level set/tuple difference but don't go into too much detail on nesting): Difference between tuple and set in mdx. Also, most of the links on that page are broken.
An MDX set is an ordered collection of zero or more tuples with the same dimensionality (note that a member is considered to be a tuple containing a single element). Unlike a mathematical set, an MDX set may contain duplicates; it is more of a list of elements. More details here.
And perhaps as a refresher on MDX concepts, here is a gentle introduction to MDX.
Can anyone explain to me exactly what is meant by the dummy variable trap? And why do we want to remove one column to avoid that trap? Please provide some links or explain this; I am not clear about this process.
In regression analysis there is often talk about the issue of multicollinearity, which you might be familiar with already. The dummy variable trap is simply perfect collinearity between two or more variables. This can arise if, for one binary variable, two dummies are included. Imagine that you have a variable x which is equal to 1 when something is True. If you include x in your regression model along with another variable z that is the opposite of x (i.e. 1 when that same thing is False), you have two perfectly negatively correlated variables.
Here's a simple demonstration. Let's say your x is a column with True/False values in a pandas DataFrame. See what happens when you use pd.get_dummies(df.x) below. The two dummies that are created mirror each other, so one of them is redundant. In simpler terms, you only need one of them, since you can always infer the value of the other from the one you have.
import pandas as pd
df = pd.DataFrame({'x': [True, False]})
pd.get_dummies(df.x)
   False  True
0      0     1
1      1     0
The same applies if you have a categorical variable that can take on more than two values. Whether binary or not, there is always a "base scenario" that is fully determined by the variation in the other case(s). This "base scenario" is therefore redundant and will only introduce perfect collinearity into the model if included.
So what's the issue with multicollinearity/linear dependence? The short answer is that if there is imperfect multicollinearity among your explanatory variables, your coefficient estimates become imprecise and unstable (their standard errors are inflated). If there is perfect multicollinearity (which is the case with the dummy variable trap), you can't estimate your model at all. Think of it like this: if you have a variable that can be perfectly explained by another variable, your sample data only contains valuable information about one, not two, truly unique variables, so it would be impossible to obtain two separate coefficient estimates for the same variable.
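In practice, a common way to avoid the trap with pandas is to drop one dummy per variable, which pd.get_dummies can do for you via drop_first=True:
import pandas as pd

df = pd.DataFrame({'x': [True, False, True]})

# drop_first=True keeps k-1 dummies for a k-level variable;
# the dropped level becomes the implicit "base scenario"
print(pd.get_dummies(df.x, drop_first=True))
Only the True column remains here; a 0/False in that column simply means x was False.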
Further Reading
Multicollinearity
Dummy Variable Trap
I was able to find a few, but I was wondering: are there more algorithms that are based on data encoding/modification instead of complete encryption? Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know it works mostly with databases. It adds noise to the sensitive records yet still allows reading general and public fields, like the sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data, there are other methods similar to data perturbation; they fall under the umbrella of Data Anonymization [source]:
Data masking—hiding data with altered values. You can create a mirror version of a database and apply modification techniques such as character shuffling, encryption, and word or character substitution. For example, you can replace a value character with a symbol such as “*” or “x”. Data masking makes reverse engineering or detection impossible.

Pseudonymization—a data management and de-identification method that replaces private identifiers with fake identifiers or pseudonyms, for example replacing the identifier “John Smith” with “Mark Spencer”. Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.

Generalization—deliberately removes some of the data to make it less identifiable. Data can be modified into a set of ranges or a broad area with appropriate boundaries. You can remove the house number in an address, but make sure you don’t remove the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.

Data swapping—also known as shuffling and permutation, a technique used to rearrange the dataset attribute values so they don’t correspond with the original records. Swapping attributes (columns) that contain identifier values such as date of birth, for example, may have more impact on anonymization than membership type values.

Data perturbation—modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range of values needs to be in proportion to the perturbation. A small base may lead to weak anonymization while a large base can reduce the utility of the dataset. For example, you can use a base of 5 for rounding values like age or house number because it’s proportional to the original value. You can multiply a house number by 15 and the value may retain its credence. However, using higher bases like 15 can make the age values seem fake.

Synthetic data—algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets instead of altering the original dataset or using it as is and risking privacy and security. The process involves creating statistical models based on patterns found in the original dataset. You can use standard deviations, medians, linear regression or other statistical techniques to generate the synthetic data.
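For a concrete feel for a couple of these techniques, here is a minimal pandas sketch of pseudonymization, generalization, and perturbation (the column names, bin width, and noise scale are illustrative assumptions only):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["John Smith", "Jane Doe"],
    "age":    [34, 57],
    "salary": [52000.0, 61000.0],
})
rng = np.random.default_rng(0)

# Pseudonymization: replace real identifiers with opaque pseudonyms
df["name"] = ["person_" + str(i) for i in range(len(df))]

# Generalization: replace exact ages with 10-year ranges
df["age_range"] = pd.cut(df["age"], bins=range(0, 101, 10))

# Perturbation: round salaries to the nearest 1,000 and add small random noise
df["salary"] = df["salary"].round(-3) + rng.normal(0, 500, size=len(df)).round()

print(df.drop(columns=["age"]))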
Is this what you are looking for?
EDIT: added link to the source and quotation.