What is the best data model to represent a mathematical range (in a database, XML, JSON...)? - sql

A mathematical range, for example:
greater than or equal to 50 and smaller than 100 (>=50 && <100)
smaller than 10 or greater than 40 (<10 || >40)
I have been thinking about how to represent a mathematical range in a file and in a database. The range may be entered by a non-programmer, so I need to keep the input simple, but it also has to be easy to convert to data and easy to validate. For example, "<10 || >100" looks the simplest, but the string is harder for me to parse, and I also need to handle badly formatted input.
I have been considering some input methods in key-value form, using >=50 && <100 as the example:
1. Use one string to represent the whole range:
<rangeInString> >=50 && < 100 </rangeInString>
2. Use two separate strings, one for the lower bound and one for the upper bound, then parse each string in the program:
<lowerBound> >=50 </lowerBound>
<upperBound> <100 </upperBound>
3. Separate the lower and upper bounds, and also separate the sign from the number:
<lowerBound>
<sign> >= </sign>
<data>50</data>
</lowerBound>
<upperBound>
<sign> < </sign>
<data>100</data>
</upperBound>
4. Separate the lower and upper bounds, separate the sign, and also store separately whether the bound includes equality:
<lowerBound>
<sign> > </sign>
<isIncludeEqual>true</isIncludeEqual>
<data>50</data>
</lowerBound>
<upperBound>
<sign> < </sign>
<isIncludeEqual>false</isIncludeEqual>
<data>100</data>
</upperBound>
5. Auto-detect "&&" or "||" from the values: given >= A and < B, if A < B it must be "&&", e.g. (>= 50 && < 100); otherwise it is "||", e.g. (>= 100 || < 50):
<A>
<sign> > </sign>
<isIncludeEqual>true</isIncludeEqual>
<data>50</data>
</A>
<B>
<sign> < </sign>
<isIncludeEqual>false</isIncludeEqual>
<data>100</data>
</B>
6. Use an "isAnd" field to distinguish >=50 && <100 (true) from <=50 || >100 (false), instead of using a sign field with "<" and ">":
<lowerBound>
<isIncludeEqual>true</isIncludeEqual>
<data>50</data>
</lowerBound>
<upperBound>
<isIncludeEqual>false</isIncludeEqual>
<data>100</data>
</upperBound>
<isAnd>true</isAnd>
7. Some other data model...
I need to consider several things:
1. It should be easy for a non-programmer to enter.
2. It should be easy to convert or parse into data in the program.
3. It should be easy to check for errors. For example, parsing a string makes both converting the data and detecting a bad format more complex, and there may be other invalid inputs, e.g. <=50 && >100 should not be valid. I could auto-detect "&&" or "||" from the signs in the input, but that may make the code more complex.
Does anyone have an idea?

Why "encode" it? There's no benefit or need and some hassle to use it.
Just store the exclusive range end values
low_end int,
high_end int,
You can then convert these raw values to usable expressions either in SQL or application code. For integer values you don't need to consider inclusive bounds, because "n exclusive" === "n inclusive - 1" for the low end and "n exclusive" === "n inclusive + 1" for the high end.
Here's an SQL implementation:
where (low_end is null or col > low_end)
and (high_end is null or col < high_end)
If the range end values need to be floating point numbers, the +1/-1 trick no longer works, so you'll need a little more:
low_end float,
low_inclusive boolean,
high_end float,
high_inclusive boolean,
And more code:
where (low_end is null or col > low_end or (low_inclusive and col = low_end))
and (high_end is null or col < high_end or (high_inclusive and col = high_end))
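For comparison, the same "two ends plus inclusive flags" model can be sketched in application code. This is only a minimal Python sketch under assumptions: the `Range` class and its field names are hypothetical, and `None` stands in for a NULL (unbounded) end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    """Hypothetical range model: None means that end is unbounded."""
    low: Optional[float] = None
    high: Optional[float] = None
    low_inclusive: bool = False
    high_inclusive: bool = False

    def __post_init__(self):
        # Reject inputs such as ">100 && <=50", which can never match.
        if self.low is not None and self.high is not None and self.low > self.high:
            raise ValueError("low end must not exceed high end")

    def contains(self, value: float) -> bool:
        if self.low is not None:
            if value < self.low or (value == self.low and not self.low_inclusive):
                return False
        if self.high is not None:
            if value > self.high or (value == self.high and not self.high_inclusive):
                return False
        return True


r = Range(low=50, high=100, low_inclusive=True)   # >=50 && <100
print(r.contains(50), r.contains(100))            # True False
```

Validating in the constructor keeps the error check in one place, whatever format the range was originally typed in.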

This is a good question. What about a combination of interval notation, as suggested by Gordon, and a given character for infinity? Combined with separate fields (or a parsing algorithm), this could define any range.
For example, the range (3<x<20) could be written as (3,20). The range (x<=10 || x>30) could be written as the combination of
(-_,10],(30,_).
Where _ represents infinity. Or use the actual Infinity symbol character, ∞, Unicode U+221E.
This way would be pretty clear for those with a mathematics background, I believe, and would provide infinite flexibility.
I hope you find this helpful.
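A parsing algorithm for this notation could look roughly like the following. It is a sketch under assumptions, not a complete grammar: the function names are hypothetical, `_` and `∞` are both accepted for infinity, and intervals are separated by commas as in the example above.

```python
import math
import re

# Hypothetical grammar: "(" or "[", low, ",", high, ")" or "]".
_INTERVAL = re.compile(r"\s*([\[(])\s*([^,\s]+)\s*,\s*([^,\s\])]+)\s*([\])])\s*")


def _number(token: str) -> float:
    if token in ("-_", "-∞"):
        return -math.inf
    if token in ("_", "+_", "∞", "+∞"):
        return math.inf
    return float(token)  # raises ValueError on malformed numbers


def parse_intervals(text: str):
    """Parse e.g. "(-_,10],(30,_)" into (low, low_incl, high, high_incl) tuples."""
    result = []
    pos = 0
    while True:
        m = _INTERVAL.match(text, pos)
        if m is None:
            raise ValueError(f"bad interval at position {pos}")
        low, high = _number(m.group(2)), _number(m.group(3))
        if low > high:
            raise ValueError("lower bound exceeds upper bound")
        result.append((low, m.group(1) == "[", high, m.group(4) == "]"))
        pos = m.end()
        if pos == len(text):
            return result
        if text[pos] != ",":
            raise ValueError(f"expected ',' at position {pos}")
        pos += 1


print(parse_intervals("(-_,10],(30,_)"))
# [(-inf, False, 10.0, True), (30.0, False, inf, False)]
```

The list of tuples is read as a union ("or") of intervals, so a single interval covers the "&&" case and two intervals cover the "||" case.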

PostgreSQL does ranges natively.
The representation looks like this:
[low, high)
[ or ] = inclusive
( or ) = exclusive
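Unbounded looks like this: [low-value,) with the bound simply omitted. Element types that have an infinity value, such as timestamps, also accept [low-value, infinity]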
http://www.postgresql.org/docs/9.4/static/rangetypes.html

Specifically addressing your options:
1. Why represent it in a format that you have to parse? A case could be made that you store it in a format that your code can parse, but what if you need to access it with a different programming language?
2. Same as 1.
3. Getting close, but you would need to subsume the bounds within a range object that includes && or ||. Also, no need for element, which is implied by "lower" and "upper" and could be replaced by an inclusive flag like you have in 4.
4. No need for
5. Unnecessary abstraction...it's just a range
6. That could work
7. Other data model:
The data is structured, so could work in json, xml, relational, or even as a set of semantic triples.
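As one illustration of the "range object" idea in JSON, here is a minimal Python sketch. The layout and every field name (`intervals`, `lower`, `upper`, `value`, `inclusive`) are assumptions made for the example: intervals are joined by "or", and the two bounds inside one interval are joined by "and".

```python
import json

# Hypothetical layout; a missing bound (None / JSON null) means unbounded.
range_doc = {
    "intervals": [
        {"lower": None, "upper": {"value": 10, "inclusive": False}},   # < 10
        {"lower": {"value": 40, "inclusive": False}, "upper": None},   # > 40
    ]
}


def matches(doc: dict, x: float) -> bool:
    """Return True if x falls inside any interval of the document."""
    def above(bound):
        return bound is None or x > bound["value"] or (bound["inclusive"] and x == bound["value"])

    def below(bound):
        return bound is None or x < bound["value"] or (bound["inclusive"] and x == bound["value"])

    return any(above(i["lower"]) and below(i["upper"]) for i in doc["intervals"])


text = json.dumps(range_doc)      # what would be stored in a file or a column
restored = json.loads(text)
print(matches(restored, 5), matches(restored, 20), matches(restored, 41))  # True False True
```

Because the structure carries no operator strings, the same document can be read from any language without a custom parser.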

Related

Remove a prefix from sizes in a size range with different size formats

I have a column in a df that holds a size range in different size formats.
      artikelkleurnummer  size
6725  0161810ZWA          B080
6726  0161810ZWA          B085
6727  0161810ZWA          B090
6728  0161810ZWA          B095
6729  0161810ZWA          B100
The size range also contains other size formats, such as XS - XXL, 36-50, 36/38 - 52/54, ONE, XS/S - XL/XXL, 363-545.
I have tried to strip the prefix '0' from all sizes that start with a letter in the range A to K. For example, I want to change B080 into B80, while B100 stays B100.
Steps:
1. Look for items in column ['size'] whose first letter is in the range A to K.
2. If so, replace the second character of the string with ''.
For the range I use:
from string import ascii_letters
def range_alpha(start_letter, end_letter):
    return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) + 1]
Then I tried a for loop:
for items in df['size']:
    if df.loc[df['size'].str[0] in range_alpha('A','K'):
        df.loc[df['size'].str[1] == ''
message
SyntaxError: unexpected EOF while parsing
what's wrong?
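You can do it with a regex and pd.Series.str.replace:
import pandas as pd
df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer size".split()
# Rebuild the match from every group except the second one (the zeros)
replacement = lambda mpat: ''.join(g for g in mpat.groups() if mpat.groups().index(g) != 1)
# regex=True is needed because recent pandas versions treat the pattern as a literal by default
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d+)', replacement, regex=True)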
Output
artikelkleurnummer size size_cleaned
0 0161810ZWA B080 B80
1 0161810ZWA B085 B85
2 0161810ZWA B090 B90
3 0161810ZWA B095 B95
4 0161810ZWA B100 B100
TL;DR
Find a pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.
Slightly longer explanation
Regexes are very handy but also hard. In the solution above, we are trying to find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts -
A letter from A-K (upper or lower case)
Zero or more 0's
Some more digits
In regex terms - this can be written as r'([a-kA-K])(0*)(\d+)'. Note that the 3 brackets make up the 3 parts - they are called groups. It might make a little or no sense depending on how exposed you have been to regexes in the past - but you can get it from any introduction to regexes online.
Once we have the parts, what we want to do is retain everything else except part-2, which is the 0s.
The pd.Series.str.replace documentation has the details on the replacement portion. In essence replacement is a function that takes all the matching groups as the input and produces an output.
In the first part we identified three groups, or parts. These groups are accessed with the mpat.groups() function, which returns a tuple containing the match for each group. We want to reconstruct the string with the middle part excluded, which is what the replacement function does.
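The same group-dropping idea can be tried on plain strings with Python's standard `re` module. This is a minimal sketch; `strip_leading_zeros` is a hypothetical helper name, and it picks groups 1 and 3 by number rather than filtering the tuple from `groups()`.

```python
import re

PATTERN = re.compile(r"([a-kA-K])(0*)(\d+)")


def strip_leading_zeros(size: str) -> str:
    # Keep group 1 (the letter) and group 3 (the digits); drop group 2 (the zeros).
    return PATTERN.sub(lambda m: m.group(1) + m.group(3), size)


for size in ["B080", "B100", "XS", "36/38"]:
    print(size, "->", strip_leading_zeros(size))
# B080 -> B80
# B100 -> B100
# XS -> XS
# 36/38 -> 36/38
```

Sizes that do not match the pattern, such as XS or 36/38, pass through unchanged.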
sizes = [{"size": "B080"},{"size": "B085"},{"size": "B090"},{"size": "B095"},{"size": "B100"}]
def range_char(start, stop):
    return (chr(n) for n in range(ord(start), ord(stop) + 1))
for s in sizes:
    if s['size'][0].upper() in range_char("A", "K"):
        s['size'] = s['size'][0]+s['size'][1:].lstrip('0')
print(sizes)
Using a List/Dict here for example.

How to check if weight is between two values

What's the best way to check whether a weight is within a range using an If condition?
Ex:
If textbox.text (between) value X - value Z then
You can use the standard comparison operators like this:
If (Val(TextBox.Text) >= ValueX) And (Val(TextBox.Text) <= ValueZ) Then
' etc...
The Val function returns the number found at the start of a string.
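For readers outside Visual Basic, the same inclusive check can be sketched in Python. This is only an illustration under assumptions: `weight_in_range` is a hypothetical name, and unlike Val, which yields 0 for text that does not start with a number, the sketch rejects non-numeric input outright.

```python
def weight_in_range(text: str, low: float, high: float) -> bool:
    """Return True if text parses as a number within [low, high]."""
    try:
        weight = float(text)
    except ValueError:
        return False  # non-numeric input is rejected instead of being read as 0
    return low <= weight <= high


print(weight_in_range("72.5", 50, 100))  # True
print(weight_in_range("abc", 0, 100))    # False
```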

How to find correct min / max values of a list in Perl 6

New to Perl6, trying to figure out what I'm doing wrong here. The problem is a simple checksum that takes the difference of the max and min values for each row in the csv
The max and min values it returns are completely wrong. For the first row in the csv, it returns the max as 71, and the min as 104, which is incorrect.
Here's the link to the repo for reference, and the link to the corresponding problem.
#!/usr/bin/env perl6
use Text::CSV;
sub checksum {
    my $sum = 0;
    my @data = csv(in => "input.csv");
    for @data -> @value {
        $sum += (max @value) - (min @value);
    }
    say $sum;
}
checksum
I assume your input contains numbers, but since CSV is a text format, the input is being read in as strings. min and max are operating based on string sorting, so max("4", "10") is 4 but max("04", "10") is 10. To solve this, you can either cast each element to Numeric (int, floating point, etc.) before you get the min/max:
@input.map(*.Numeric).max
or pass a conversion function to min and max so each element is parsed as a number as it's compared:
@input.max(*.Numeric)
The first solution is better for your case, since the second solution is an ephemeral conversion, converting internally but still returning a string. Final note: in normal code I would write +* or { +$_ } to mean "treat X as a number", but in this case I prefer being explicit: .Numeric.
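The same pitfall exists in other languages. As an illustration only, here it is in Python rather than Perl 6, with made-up row values: comparing as strings, comparing through a conversion function, and converting first give three different results.

```python
row = ["4", "10", "71", "104"]     # made-up values, as a CSV reader would return them

by_string = max(row)               # compared character by character
by_key = max(row, key=int)         # compared as numbers, but still returns a string
converted = max(map(int, row))     # converted first, so the result is a number

print(repr(by_string), repr(by_key), repr(converted))  # '71' '104' 104
```

Only the last form gives a value that can be used directly in the subtraction.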

SQL - Create Unique AlphaNumeric based on a 10-digit integer stored as VARCHAR

I'm trying to emulate a function in SQL that a client has produced in Excel. In effect, they have a unique, 10-digit numeric value (VARCHAR) as the primary key in one of their enterprise database systems. Within another database, they require a unique, 5-digit alphanumeric identifier. They want that 5-digit alphanumeric value to be a representation of the 10-digit number. So what they did in excel was to split the 10-digit number into pairs, then convert each of those pairs into a hexadecimal value, then stitch them back together.
The EXCEL equation is:
=IF(VALUE(MID(A2,1,4))>0,DEC2HEX(VALUE(MID(A2,3,2)))&DEC2HEX(VALUE(MID(A2,5,2)))&DEC2HEX(VALUE(MID(A2,7,2)))&DEC2HEX(VALUE(MID(A2,9,2))),DEC2HEX(VALUE(MID(A2,5,2)))&DEC2HEX(VALUE(MID(A2,7,2)))&DEC2HEX((VALUE(MID(A2,9,2)))))
I need the SQL equivalent of this. Of course, should someone out there know a better way to accomplish their goal of "a 5-digit alphanumeric identifier" based off the 10-digit number, I'm all ears.
ADDED 8/2/2011
First of all, thank you to everyone for the replies. Nice to see folks willing to help and even enjoying it! Based on all the responses, I'm apt to tell my client their intent is sound, only their method is off kilter. I'd also like to recommend a solution. So the challenge remains, just modified slightly:
CHALLENGE: Within SQL, take a 10 digit, unique NUMERIC string and represent it ALPHANUMERICALLY in as few characters as possible. The resulting string must also be unique.
Note that the first 3-4 characters in the 10-digit string are likely to be zeros, and that they could be stripped to shorten the resulting alphanumeric string. Not required, but perhaps helpful.
This problem is inherently impossible. You have a 10 digit numeric value that you want to convert to a 5 digit alphanumeric value. Since there are 10 numeric characters, there are 10^10 = 10 000 000 000 unique values for your 10 digit number. Since there are 36 alphanumeric characters (26 letters + 10 numbers), there are 36^5 = 60 466 176 unique values for your 5 digit number. You cannot map a set of 10 billion elements one-to-one into a set of around 60 million.
Now, let's take a closer look at what your client's code is doing:
So what they did in excel was to split the 10-digit number into pairs, then convert each of those pairs into a hexadecimal value, then stitch them back together.
This isn't 100% accurate. The excel code never uses the first 2 digits, but performs this operation on the remaining 8. There are two main problems with this algorithm which may not be intuitively obvious:
Two 10 digit numbers can map to the same 5 digit number. Consider the numbers 1000000117 and 1000001701. The last four digits of 1000000117 get mapped to 1 11, where the last four digits of 1000001701 get mapped to 11 1. This causes both to map to 00111.
The 5 digit number may not even end up being 5 digits! For example, 1000001616 gets mapped to 001010.
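Both problems can be reproduced with a small Python re-implementation of the Excel formula. This is a sketch for checking the examples above, under the assumption that DEC2HEX returns unpadded uppercase hex; `excel_id` is a hypothetical name.

```python
def excel_id(number: str) -> str:
    """Mimic the Excel formula: hex-encode two-digit pairs and concatenate."""
    def pair_hex(start: int) -> str:
        # MID(A2, start, 2) is 1-based; the pair is converted without padding.
        return format(int(number[start - 1:start + 1]), "X")

    # The IF branch uses digits 3-10 when the first four digits are non-zero,
    # otherwise only digits 5-10.
    starts = (3, 5, 7, 9) if int(number[:4]) > 0 else (5, 7, 9)
    return "".join(pair_hex(s) for s in starts)


print(excel_id("1000000117"), excel_id("1000001701"))  # 00111 00111
print(excel_id("1000001616"))                          # 001010
```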
So, what is a possible solution? Well, if you don't care if that 5 digit number is unique or not, in MySQL you can use something like:
hex(<NUMERIC VALUE> % 0xFFFFF)
The log of 10^10 base 2 is 33.219280948874
> return math.log(10 ^ 10) / math.log(2)
33.219280948874
> = 2 ^ 33.21928
9999993422.9114
So, it takes 34 bits to represent this number. In hex this will take 34/4 = 8.5, i.e. 9 characters, much more than 5.
> return math.log(10 ^ 10) / math.log(16)
8.3048202372184
The Excel formula is ignoring the first 2 (or 4) characters of the 10 character string.
You could try encoding in base 36 instead of 16. This will get you to 7 characters or less.
> return math.log(10 ^ 10) / math.log(36)
6.4254860446923
The popular base 64 encoding will get you to 6 characters
> return math.log(10 ^ 10) / math.log(64)
5.5365468248123
Even Ascii85 encoding won't get you down to 5.
> return math.log(10 ^ 10) / math.log(85)
5.1829075929158
You need base 100 to get to 5 characters
> return math.log(10 ^ 10) / math.log(100)
5
There aren't 100 printable ASCII characters, so this is not going to work, as zkhr explained as well, unless you're willing to go beyond ASCII.
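The base 36 option can be sketched as follows. The question asks for SQL, so this Python version only illustrates the arithmetic; the function names and the uppercase 0-9, A-Z alphabet are assumptions.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase   # 36 symbols: 0-9, A-Z


def to_base36(number: str) -> str:
    """Encode a numeric string in base 36; leading zeros are dropped."""
    n = int(number)
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def from_base36(code: str) -> str:
    """Decode back to the original 10-digit string, restoring leading zeros."""
    return str(int(code, 36)).zfill(10)


print(to_base36("9999999999"))   # 4LDQPDR
```

The largest 10-digit value needs 7 characters, in line with the estimate above, and the mapping is reversible, so uniqueness is preserved.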
I found your question interesting (although I don't claim to know the answer) - I googled a bit for you out of interest and found this which may help you http://dpatrickcaldwell.blogspot.com/2009/05/converting-decimal-to-hexadecimal-with.html

What is the difference between +n and (n) in bit operations?

I've found two parameters defined like these:
&TM_PERIOD+4&/&TM_PERIOD(4)&
It's to pass data from a database to a form.
If the format of the data is DDMMYYYY, what is the difference between those two parameters?
If TM_PERIOD is in the form DDMMYYYY, then
TM_PERIOD(4) equals DDMM
TM_PERIOD+4 equals YYYY
the (4) means 4 characters
the +4 means after the 4th character
TM_PERIOD+1(2) = DM
(2 characters after the first)
These are not bit operations. +n specifies a string offset and (n) specifies the length.
They can be used independently of each other as well, so you can use just +n or just (n).
So:
data: lv_text(20) type c.
lv_text = 'Hello'.
write: / lv_text+2(3).
would output 'llo', for example.
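For readers more familiar with slicing, the offset and length rules map onto Python as shown below. This is an analogy only, not ABAP; `substr` is a hypothetical helper name.

```python
from typing import Optional


def substr(text: str, offset: int = 0, length: Optional[int] = None) -> str:
    """Mimic ABAP's text+offset(length) using Python slicing."""
    end = None if length is None else offset + length
    return text[offset:end]


period = "DDMMYYYY"
print(substr(period, length=4))   # DDMM, like TM_PERIOD(4)
print(substr(period, offset=4))   # YYYY, like TM_PERIOD+4
print(substr(period, 1, 2))       # DM,   like TM_PERIOD+1(2)
print(substr("Hello", 2, 3))      # llo,  like lv_text+2(3)
```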