I was able to find a few, but I was wondering, is there more algorithms that based on data encoding/modification instead of complete encryption of it. Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know it works mostly with databases. Adds noise to the sensitive records yet allows to read general and public fields, like sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data there are other methods similars to data perturbation, its called Data Anonymization [source]:
Data masking—hiding data with altered values. You can create a mirror
version of a database and apply modification techniques such as
character shuffling, encryption, and word or character substitution.
For example, you can replace a value character with a symbol such as
“*” or “x”. Data masking makes reverse engineering or detection
impossible.
Pseudonymization—a data management and de-identification method that
replaces private identifiers with fake identifiers or pseudonyms, for
example replacing the identifier “John Smith” with “Mark Spencer”.
Pseudonymization preserves statistical accuracy and data integrity,
allowing the modified data to be used for training, development,
testing, and analytics while protecting data privacy.
Generalization—deliberately removes some of the data to make it less
identifiable. Data can be modified into a set of ranges or a broad
area with appropriate boundaries. You can remove the house number in
an address, but make sure you don’t remove the road name. The purpose
is to eliminate some of the identifiers while retaining a measure of
data accuracy.
Data swapping—also known as shuffling and permutation, a technique
used to rearrange the dataset attribute values so they don’t
correspond with the original records. Swapping attributes (columns)
that contain identifiers values such as date of birth, for example,
may have more impact on anonymization than membership type values.
Data perturbation—modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range
of values needs to be in proportion to the perturbation. A small base
may lead to weak anonymization while a large base can reduce the
utility of the dataset. For example, you can use a base of 5 for
rounding values like age or house number because it’s proportional to
the original value. You can multiply a house number by 15 and the
value may retain its credence. However, using higher bases like 15 can
make the age values seem fake.
Synthetic data—algorithmically manufactured information that has no
connection to real events. Synthetic data is used to create artificial
datasets instead of altering the original dataset or using it as is
and risking privacy and security. The process involves creating
statistical models based on patterns found in the original dataset.
You can use standard deviations, medians, linear regression or other
statistical techniques to generate the synthetic data.
Is this what are you looking for?
EDIT: added link to the source and quotation.
Related
Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering and grouping.
I understand that a 40 chars strings costs much more RAM and time to compare instead of a simple integer.
But I would have some overhead because of a str-to-int-table for translating between two types and to know which integer number belongs to which string "number".
Maybe .astype('categorical') can help me here? Isn't this an integer internally? Does this speed up some operations?
The user guide has the following about categorical data use cases:
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
The book, Python for Data Analysis by Wes McKinney, has the following on this topic:
The categorical representation can yield significant performance
improvements when you are doing analytics. You can also perform
transformations on the categories while leaving the codes unmodified.
Some example transformations that can be made at relatively low cost are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.
My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my COLAB with (allegedly) TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of Data Science/Machine Learning that is an art, not a science. A couple ideas:
One hot encode everything, then use a dimensionality reduction algorithm to remove some of the columns (PCA, SVD, etc).
Only one hot encode some values (say limit it to 10 or 100 categories, rather than 53,000), then for the rest, use an "other" category.
If it's possible to construct an embedding for these variables (Not always possible), you can explore this.
Group/bin the values in the columns by some underlying feature. I.e. if the feature is something like days_since_X, bin it by 100 or something. Or if it's names of animals, group it by type instead (mammal, reptile, etc.)
I'm using a data set that consists of mostly nominal values from SFDC (e.g. EE Names, Title, Role, Lead Source, Account Name, etc.) and am trying to correlate the features to a boolean class of whether a Sales Lead was converted to a Sales Contact.
I wanted to run this data through some basic feature selection algorithms, but most require numerical values only. I could map each of the unique classifications to a new field(feature) with a boolean mapping scheme, but then i'll generate an extremely large number of new features and I'm not sure if that will give a meaningful output. Admittedly the best solution might be to run the data through a decision tree, but wanted to see if there were any other strategies that others have come up with in the community for handling data sets of mostly nominal data that have been successfully used on real world applications.
I'm using python with scipy/numpy/pandas/scikit-learn to do my analysis.
I would first try to use sklearn.feature_extraction.DictVectorizer and then try Chi2 univariate feature selection that can work with sparse data representations. For instance there is an application of chi2 feature selection on sparse text data here in scikit-learn: http://scikit-learn.org/dev/auto_examples/document_classification_20newsgroups.html
Unfortunately, scikit-learn's decision trees and ensemble do not work on sparse representations yet.
Problem:
A relational database (Postgres) storing timeseries data of various measurement values. Each measurement value can have a specific "measurement type" (e.g. temperature, dissolved oxygen, etc) and can have specific "measurement units" (e.g. Fahrenheit/Celsius/Kelvin, percent/milligrams per liter, etc).
Question:
Has anyone built a similar database such that dimensional integrity is conserved? Have any suggestions?
I'm considering building a measurement_type and a measurement_unit table, both of these would have text two columns, ID and text. Then I would create foreign keys to these tables in the measured_value table. Text worries me somewhat because there's the possibility for non-unique duplicates (e.g. 'ug/l' vs 'µg/l' for micrograms per liter).
The purpose of this would be so that I can both convert and verify units on queries, or via programming externally. Ideally, I would have the ability later to include strict dimensional analysis (e.g. linking µg/l to the value 'M/V' (mass divided by volume)).
Is there a more elegant way to accomplish this?
I produced a database sub-schema for handling units an aeon ago (okay, I exaggerate slightly; it was about 20 years ago, though). Fortunately, it only had to deal with simple mass, length, time dimensions - not temperature, or electric current, or luminosity, etc. Rather less simple was the currency side of the game - there were a myriad different ways of converting between one currency and another depending on date, currency, and period over which conversion rate was valid. That was handled separately from the physical units.
Fundamentally, I created a table 'measures' with an 'id' column, a name for the unit, an abbreviation, and a set of dimension exponents - one each for mass, length, time. This gets populated with names such as 'volume' (length = 3, mass = 0, time = 0), 'density' (length = 3, mass = -1, time = 0) - and the like.
There was a second table of units, which identified a measure and then the actual units used by a particular measurement. For example, there were barrels, and cubic metres, and all sorts of other units of relevance.
There was a third table that defined conversion factors between specific units. This consisted of two units and the multiplicative conversion factor that converted unit 1 to unit 2. The biggest problem here was the dynamic range of the conversion factors. If the conversion from U1 to U2 is 1.234E+10, then the inverse is a rather small number (8.103727714749e-11).
The comment from S.Lott about temperatures is interesting - we didn't have to deal with those. A stored procedure would have addressed that - though integrating one stored procedure into the system might have been tricky.
The scheme I described allowed most conversions to be described once (including hypothetical units such as furlongs per fortnight, or less hypothetical but equally obscure ones - outside the USA - like acre-feet), and the conversions could be validated (for example, both units in the conversion factor table had to have the same measure). It could be extended to handle most of the other units - though the dimensionless units such as angles (or solid angles) present some interesting problems. There was supporting code that would handle arbitrary conversions - or generate an error when the conversion could not be supported. One reason for this system was that the various international affiliate companies would report their data in their locally convenient units, but the HQ system had to accept the original data and yet present the resulting aggregated data in units that suited the managers - where different managers each had their own idea (based on their national background and length of duty in the HQ) about the best units for their reports.
"Text worries me somewhat because there's the possibility for non-unique duplicates"
Right. So don't use text as a key. Use the ID as a key.
"Is there a more elegant way to accomplish this?"
Not really. It's hard. Temperature is it's own problem because temperature is itself an average, and doesn't sum like distance does; plus F to C conversion is not a multiply (as it is with every other unit conversion.)
A note about conversions: a lot of units are linearly related, and can be converted using a formula like "y = A + Bx", where A and B are constants which could be stored in the database for each pair of units that you need to convert between. For example, for Celsius to Farenheit the constants are A=32, B=1.8.
However, there are also rare exceptions. Converting between logarithmic and non-logarithmic units, for example. Or converting between mass-per-volume and molar-mass-per-volume (in which case you would need to know the molar mass of the compound being measured).
Of course, if you are sure that all the conversions required by the system are linear, then there's no need for over-engineering, just store the two constants. You can then extract standardized results from the database using straight SQL joins with calculated fields.
What are your thoughts on this? I'm working on integrating some new data that's in a tab-delimited text format, and all of the decimal columns are kept as single integers; in order to determine the decimal amount you need to multiply the number by .01. It does this for things like percentages, weight and pricing information. For example, an item's price is expressed as 3259 in the data files, and when I want to display it I would need to multiply it in order to get the "real" amount of 32.59.
Do you think this is a good or bad idea? Should I be keeping my data structure identical to the one provided by the vendor, or should I make the database columns true decimals and use SSIS or some sort of ETL process to automatically multiply the integer columns into their decimal equivalent? At this point I haven't decided if I am going to use an ORM or Stored Procedures or what to retrieve the data, so I'm trying to think long term and decide which approach to use. I could also easily just handle this in code from a DTO or similar, something along the lines of:
public class Product
{
// ...
private int _price;
public decimal Price
{
get
{
return (this._price * .01);
}
set
{
this._price = (value / .01);
}
}
}
But that seems like extra and unnecessary work on the part of a class. How would you approach this, keeping in mind that the data is provided in the integer format by a vendor that you will regularly need to get updates from.
"Do you think this is a good or bad idea?"
Bad.
"Should I be keeping my data structure identical to the one provided by the vendor?"
No.
"Should I make the database columns true decimals?"
Yes.
It's so much simpler to do what's right. Currently, the data is transmitted with no "." to separate the whole numbers from the decimals; that doesn't any real significance.
The data is decimal. Decimal math works. Use the decimal math provided by your language and database. Don't invent your own version of Decimal arithmetic.
Personally I would much prefer to have the data stored correctly in my database and just do a simple conversion every time an update comes in.
Pedantically: they aren't kept as ints either. They are strings that require parsing.
Philisophically: you have information in the file and you should write data into the database. That means transforming the information in any ways necessary to make it meaningful/useful. If you don't do this transform up front, then you'll be doomed to repeat the transform across all consumers of the database.
There are some scenarios where you aren't allowed to transform the data, such as being able to answer the question: "What was in the file?". Those scenarios would require the data to be written as string - if the parse failed, you wouldn't have an accurate representation of the file.
In my mind the most important facet of using Decimal over Int in this scenario is maintainability.
Data stored in the tables should be clearly meaningful without need for arbitrary manipulation. If manipulation is required is should be clearly evident that it is (such as from the field name).
I recently dealt with data where days of the week were stored as values 2-8. You can not imagine the fall out this caused (testing didn't show the problem for a variety of reasons, but live use did cause political explosions).
If you do ever run in to such a situation, I would be absolutely certain to ensure data can not be written to or read from the table without use of stored procedures or views. This enables you to ensure the necessary manipulation is both enforced and documented. If you don't have both of these, some poor sod who follows you in the future will curse your very name.