How to distinguish between master data and calculated (interpolated) data? - sql

I receive a number of vectors, each holding data points for a fixed set of time points. Below is an example of one such vector, with one value per time point:
1D:2
2D:
7D:5
1M:6
6M:6.5
But alas, a value is not available for every time point. All vectors are stored in a database, and a trigger calculates the missing values by interpolation, or possibly by a more advanced algorithm. I want to be able to tell which data points were calculated and which were originally delivered to us. Of course I can add a flag column to the table indicating whether a value is a master value or a calculated one, but I'm wondering whether there is a more sophisticated way. We probably won't need to make this distinction on a regular basis, so CPU cycles are not an issue, either for the lookup or for insertion.
The example above shows some nice-looking numbers, but in reality a value would look more like 3.1415966533.
The database used for storage is Oracle 10.
cheers.

Could you deactivate the trigger temporarily?
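For what it's worth, here is a minimal Python sketch of the flag-column idea from the question: fill the gaps by linear interpolation and record, per point, whether it was delivered or calculated. The tenor-to-day mapping (1M = 30 days, 6M = 180 days), the `Point` class and the `fill_missing` function are assumptions made for illustration; the actual trigger logic is not shown in the question.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed day counts per tenor; 1M = 30 and 6M = 180 are simplifications.
TENOR_DAYS = {"1D": 1, "2D": 2, "7D": 7, "1M": 30, "6M": 180}


@dataclass
class Point:
    tenor: str
    value: float
    is_calculated: bool  # False = delivered (master), True = interpolated


def fill_missing(vector: Dict[str, Optional[float]]) -> List[Point]:
    """Linearly interpolate missing values and flag every calculated point."""
    known = sorted((TENOR_DAYS[t], v) for t, v in vector.items() if v is not None)
    result = []
    for tenor in sorted(vector, key=TENOR_DAYS.get):
        value = vector[tenor]
        if value is not None:
            result.append(Point(tenor, value, False))
            continue
        day = TENOR_DAYS[tenor]
        left = [(d, v) for d, v in known if d < day]
        right = [(d, v) for d, v in known if d > day]
        if not left or not right:
            raise ValueError(f"cannot interpolate {tenor}: missing a neighbour")
        (d0, v0), (d1, v1) = left[-1], right[0]
        interpolated = v0 + (v1 - v0) * (day - d0) / (d1 - d0)
        result.append(Point(tenor, interpolated, True))
    return result


points = fill_missing({"1D": 2, "2D": None, "7D": 5, "1M": 6, "6M": 6.5})
for p in points:
    print(p.tenor, p.value, "calculated" if p.is_calculated else "master")
```

Stored this way, the flag travels with the value, so a later recalculation with a better algorithm can simply overwrite every row where the flag is set.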

Related

Non-cryptography algorithms to protect the data

I was able to find a few, but I was wondering: are there more algorithms that are based on encoding or modifying the data rather than fully encrypting it? Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know, it is mostly used with databases. It adds noise to the sensitive records while still allowing general and public fields to be read, such as the sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data, there are other methods similar to data perturbation; collectively they are called data anonymization [source]:
Data masking—hiding data with altered values. You can create a mirror
version of a database and apply modification techniques such as
character shuffling, encryption, and word or character substitution.
For example, you can replace a value character with a symbol such as
“*” or “x”. Data masking makes reverse engineering or detection
impossible.
Pseudonymization—a data management and de-identification method that
replaces private identifiers with fake identifiers or pseudonyms, for
example replacing the identifier “John Smith” with “Mark Spencer”.
Pseudonymization preserves statistical accuracy and data integrity,
allowing the modified data to be used for training, development,
testing, and analytics while protecting data privacy.
Generalization—deliberately removes some of the data to make it less
identifiable. Data can be modified into a set of ranges or a broad
area with appropriate boundaries. You can remove the house number in
an address, but make sure you don’t remove the road name. The purpose
is to eliminate some of the identifiers while retaining a measure of
data accuracy.
Data swapping—also known as shuffling and permutation, a technique
used to rearrange the dataset attribute values so they don’t
correspond with the original records. Swapping attributes (columns)
that contain identifiers values such as date of birth, for example,
may have more impact on anonymization than membership type values.
Data perturbation—modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range
of values needs to be in proportion to the perturbation. A small base
may lead to weak anonymization while a large base can reduce the
utility of the dataset. For example, you can use a base of 5 for
rounding values like age or house number because it’s proportional to
the original value. You can multiply a house number by 15 and the
value may retain its credence. However, using higher bases like 15 can
make the age values seem fake.
Synthetic data—algorithmically manufactured information that has no
connection to real events. Synthetic data is used to create artificial
datasets instead of altering the original dataset or using it as is
and risking privacy and security. The process involves creating
statistical models based on patterns found in the original dataset.
You can use standard deviations, medians, linear regression or other
statistical techniques to generate the synthetic data.
Is this what you are looking for?
EDIT: added link to the source and quotation.

How do you deal with missing data when it's missing like 60%?

My data has a lot of missing values and I have to predict them. One way is to fill them with the average of the observed values, but I would like to hear other perspectives. How do experienced data scientists solve this kind of problem?
Are your missing values categorical or continuous?
One way is to remove the samples entirely. However, this may lead to sampling bias, since the missing values could be the result of some causal effect; that is, the values are not missing completely at random.
If your data has enough dimensionality, you can treat the missing values as the output, fit a predictive model, and hope that it can faithfully estimate the missing values from the explanatory variables you already have.
Picking the most frequent value, the median, or the average, as you point out, could also be an option. Be careful with outliers when averaging, though, as they can have a tremendous effect on the mean.
It depends on the nature of the variables; a statistic such as the mean or the median may do. Another practice is to assign the missing entries a value that differs from all others, for example 0, -1 or something similar.
The hardest part is to impute the dataset without deviating too far from the truth. Here is a test to validate how well you have done that: if the other parameters carry enough information to impute the missing data with some level of precision, the same method should also be able to reproduce data you already have.
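To make the outlier caveat concrete, here is a tiny Python illustration. The numbers are made up for the example; only the effect matters.

```python
from statistics import mean, median

# Observed (non-missing) values of one column; 250.0 is a made-up outlier.
observed = [21.0, 22.5, 23.0, 24.0, 250.0]

mean_fill = mean(observed)      # pulled far away from the bulk of the data
median_fill = median(observed)  # stays with the bulk of the data

print(f"mean fill:   {mean_fill}")
print(f"median fill: {median_fill}")
```

A single extreme value moves the mean well outside the range where most observations lie, while the median is unaffected, which is why the median is often the safer default fill for skewed columns.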
So if 60 percent of the column is missing, take the row observations where this column is PRESENT.
Next, randomly choose to remove 60% of this subsetted data. Now run imputation methods of your choosing.
Compare the imputed dataset to the real data set for similarity. Decide if they are close enough for you to then run this against the full data set. At least this approach will give you a leg to stand on if you need to defend yourself.
Fight the Good Fight.
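The hold-out check described in this answer can be sketched roughly as follows. This is a simplified, single-column version: `masked_rmse` is a hypothetical helper name, and the imputer here only sees the remaining values of the same column, whereas a real imputer would usually also use the other columns.

```python
import random
from statistics import mean


def masked_rmse(complete_values, impute, missing_fraction=0.6, seed=0):
    """Hide a fraction of known values, impute them, and return the RMSE.

    `complete_values` holds only rows where the column is present.
    `impute` takes the still-visible values and returns one fill value.
    """
    rng = random.Random(seed)
    n_hidden = int(len(complete_values) * missing_fraction)
    if n_hidden == 0:
        raise ValueError("not enough rows to hide any values")
    hidden_idx = set(rng.sample(range(len(complete_values)), n_hidden))
    visible = [v for i, v in enumerate(complete_values) if i not in hidden_idx]
    fill = impute(visible)
    squared_errors = [(fill - complete_values[i]) ** 2 for i in hidden_idx]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


known = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print("RMSE of mean imputation:", masked_rmse(known, mean))
```

Running the same function with different imputers (mean, median, a model-based one) gives comparable error figures, which is the "leg to stand on" mentioned above.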

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space used by my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, with <column_name> replaced by the field you are interested in
SELECT <column_name>
FROM YourTable
and looking at the validation message, which reports the size of that column.
Important: you do not need to run the query. Just check the validation message for bytesProcessed; that is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such column profiling for many tables, or for a table with many columns, you can script it in your preferred language: use the Tables.get API to get the table schema, loop through all fields, build the respective SELECT statement for each, dry-run it, and read totalBytesProcessed, which as you already know is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For variable-length types such as string, you could get the average length by querying a sample, e.g. the first 1000 values, and use that for your storage calculations.
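A rough Python sketch of that approximation is below. The per-type byte sizes are what I understand the data size table on the linked pricing page to list (8 bytes for INT64, FLOAT64, DATE and TIMESTAMP, 1 byte for BOOL, and 2 bytes plus the UTF-8 length for STRING); please verify them against the page before relying on the result. `estimate_column_bytes` is a hypothetical helper, and it ignores NULLs and nested or repeated fields.

```python
# Assumed per-value sizes in bytes; check them against the pricing page.
FIXED_BYTES = {"INT64": 8, "FLOAT64": 8, "BOOL": 1, "DATE": 8, "TIMESTAMP": 8}


def estimate_column_bytes(column_type, row_count, sample_strings=()):
    """Estimate the size of one column from its type and the table row count."""
    if column_type in FIXED_BYTES:
        return FIXED_BYTES[column_type] * row_count
    if column_type == "STRING":
        if not sample_strings:
            raise ValueError("STRING columns need a sample to estimate length")
        avg_len = sum(len(s.encode("utf-8")) for s in sample_strings) / len(
            sample_strings
        )
        # Assumed: 2 bytes of overhead plus the UTF-8 encoded length per value.
        return round((2 + avg_len) * row_count)
    raise ValueError(f"no size rule for type {column_type}")


print(estimate_column_bytes("INT64", 1_000_000))
print(estimate_column_bytes("STRING", 1_000_000, ["alpha", "beta", "gamma"]))
```

For the hierarchy with many repeated records mentioned in the question, the dry-run approach from the other answer is likely to be more accurate, since the row count alone does not say how many repeated values each row holds.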

Assigning values to missing data for use in binary logistic regression in SAS

Many of the variables in the data I use on a daily basis have blank fields, some of which have meaning. For example, take a variable holding the ratio of satisfactory accounts to total accounts: a blank means the individual does not have any accounts, whereas a value of 0 means the individual has accounts but none of them are satisfactory.
Currently, these records do not get included into logistic regression analyses as they have missing values for one or more fields. Is there a way to include these records into a logistic regression model?
I am aware that I can assign these blank fields with a value that is not in the range of the data (ex. if we go back to the above ratio variable, we could use 9999 or -1 as these values are not included in the range of a ratio variable (0 to 1)). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
You can impute values for the missing fields, subject to the logical restrictions of your experimental design, and keeping in mind that imputation somewhat weakens the power of your experiment relative to the same experiment with no missing values.
SAS offers a few ways to do this. The simplest is to use PROC MI and PROC MIANALYZE, but even those are certainly not a simple matter of plugging in a few numbers. See this page for more information. Ultimately this is probably a better question for Cross Validated, at least until you have figured out the experimental design issues.
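Separately from multiple imputation, a blank that carries meaning can be encoded explicitly instead of being replaced by an out-of-range number such as 9999. The sketch below shows a missing-indicator encoding in Python purely as an illustration; it is not what PROC MI does, `encode_ratio` is a hypothetical helper, and the same two derived columns could be built in a SAS DATA step.

```python
from typing import Optional, Tuple


def encode_ratio(ratio: Optional[float]) -> Tuple[int, float]:
    """Split a possibly blank ratio into (has_accounts, ratio_filled).

    A blank (None) means the individual has no accounts, so the indicator
    is 0 and the ratio is filled with a neutral 0.0. The model can then
    learn a separate effect for "no accounts" via the indicator.
    """
    if ratio is None:
        return 0, 0.0
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    return 1, ratio


for raw in (None, 0.0, 0.75):
    print(raw, "->", encode_ratio(raw))
```

With both columns in the model, a record with no accounts and a record with zero satisfactory accounts remain distinguishable, and neither is dropped for having a missing value.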

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I am unsure how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign. No element can have a negative composition, of course, but I can interpret a negative value as "less than the specified value". Another option is to add another column for each element, which I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of titanium is below the measurable limit (<0.0004, for example), the number is still important; only the formula will differ slightly in this case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. C-style strings, after all, use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string functions return -1 to indicate that what you're looking for isn't there.
0 might work, because if the element isn't there at all there shouldn't even be a record. However, you face the problem that it might be mistaken for actually meaning 0. -1 is another option, and it obviously doesn't have that problem.
Another column to indicate if the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (eg <1%, <0.1%, <0.01%, etc). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe: it yields NULL rather than raising an error.
NULLs are ignored by the aggregation functions, so queries like this one:
SELECT metal, SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
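Here is a small runnable check of that NULL behaviour. It uses Python's built-in sqlite3 module as a stand-in engine, and the sample rows are made up; the `alloys` table and its columns follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alloys (metal TEXT, metal_percent REAL)")
conn.executemany(
    "INSERT INTO alloys VALUES (?, ?)",
    [("Ti", 0.5), ("Ti", None), ("Ti", 1.5), ("Cu", None)],  # made-up rows
)

# SUM and COUNT(column) skip NULLs; COUNT(*) counts every row.
rows = conn.execute(
    "SELECT metal, SUM(metal_percent), COUNT(metal_percent), COUNT(*) "
    "FROM alloys GROUP BY metal ORDER BY metal"
).fetchall()

for row in rows:
    print(row)
```

Note the group that contains only NULLs: its sum is itself NULL, not 0, so downstream code still has to handle that case.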
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints on the values are well defined (an element cannot have a negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
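A minimal sketch of how application code might wrap the negative-value convention, so that the sign trick stays in one place. The function names are hypothetical.

```python
from typing import Tuple


def encode_concentration(percent: float, below_detection_limit: bool) -> float:
    """Store "<percent" as a negative number, a measured value as-is."""
    if percent < 0:
        raise ValueError("a composition percentage cannot be negative")
    return -percent if below_detection_limit else percent


def decode_concentration(stored: float) -> Tuple[float, bool]:
    """Return (percent, is_upper_bound) for a stored value."""
    if stored < 0:
        return -stored, True
    return stored, False


stored = encode_concentration(0.0004, below_detection_limit=True)
print(stored, decode_concentration(stored))
print(decode_concentration(1.2))
```

One limitation of this convention: a bound of exactly 0 cannot be marked, because negative zero does not compare as less than zero.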
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
I would have a table modeling the certificate, in a one to many relation with another table, storing the values for elements. Then, I would still have the elements table containing the value in one column and a flag (less than) as a separate column.
Draft:
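-- One row per certificate.
create table CERTIFICATES
(
  PK_ID integer primary key,
  NAME varchar(128)
);
-- One row per element measured on a certificate.
create table ELEMENTS
(
  ELEMENT_ID varchar(2),
  CERTIFICATE_ID integer references CERTIFICATES (PK_ID),
  CONCENTRATION number,
  -- Assumed convention: 1 = measured value, 0 = CONCENTRATION is a "less than" bound.
  MEASURABLE integer,
  primary key (CERTIFICATE_ID, ELEMENT_ID)
);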
Depending on the database engine you're using, the types of the columns may vary.
Why not add another column to store whether or not it's a trace amount?
This also allows you to save the amount that the trace is less than.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
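As a rough illustration of the marker-column design, here is a sketch using Python's built-in sqlite3 module as a stand-in engine. The table and column names (`elements`, `amount`, `amount_kind`) and the sample rows are assumptions for the example; for a "Trace" row, `amount` holds the bound reported by the lab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE elements (
        certificate_id INTEGER NOT NULL,
        element_id     TEXT    NOT NULL,
        amount         REAL    NOT NULL,
        -- 'Quantity' = measured value, 'Trace' = amount is an upper bound
        amount_kind    TEXT    NOT NULL
                       CHECK (amount_kind IN ('Quantity', 'Trace')),
        PRIMARY KEY (certificate_id, element_id)
    )
    """
)
conn.executemany(
    "INSERT INTO elements VALUES (?, ?, ?, ?)",
    [(1, "Fe", 98.2, "Quantity"), (1, "Ti", 0.0004, "Trace")],  # made-up rows
)

# Render each amount the way a lab report would show it.
labels = [
    (element, f"<{amount}" if kind == "Trace" else f"{amount}")
    for element, amount, kind in conn.execute(
        "SELECT element_id, amount, amount_kind FROM elements ORDER BY element_id"
    )
]
print(labels)
```

The CHECK constraint keeps the marker to the two documented values, and the amount column stays numeric, so the formulas mentioned in the question can branch on the marker without any string parsing.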
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.