Use of "precision_step" in elasticsearch mappings - lucene

In an Elasticsearch mapping there is an optional field called precision_step. What does it mean?
I searched on Google and could not find any solid information about it.
Can anyone please explain what precision_step is and what it is used for?
Thanks in advance!

It is part of the mapping for dates and numbers.
It is a Lucene concept; you can read more here: Lucene doco
From the Lucene doco:
Good values for precisionStep are depending on usage and data type:
The default for all data types is 4, which is used, when no precisionStep is given.
Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.
Ideal value in most cases for 32 bit data types (int, float) is 4.
For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).

Suitable values are between 1 and 8. A good starting point to test is 4, which is the default value for all Numeric* classes.
In practical terms, lower values consume more disk space but speed up searching. Lower step values mean more precision levels and so more terms in the index (and the index gets larger).
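If it helps to see where that value ends up, here is a minimal sketch against the Lucene 3.x API (the field name "price", the values, and the step of 8 are just example choices; in the Elasticsearch versions that expose precision_step, the mapping value is handed to this same Lucene machinery):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

public class PrecisionStepExample {
    public static void main(String[] args) {
        int precisionStep = 8; // example value; the Lucene default is 4
        // Index side: a numeric field indexed with the chosen precision step.
        Document doc = new Document();
        doc.add(new NumericField("price", precisionStep, Field.Store.YES, true)
                .setLongValue(1299L));
        // Query side: the same step is passed when building range queries on the field.
        NumericRangeQuery<Long> query =
                NumericRangeQuery.newLongRange("price", precisionStep, 1000L, 2000L, true, true);
        System.out.println(query);
    }
}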

Related

Looking for up-to-date documentation on the Oracle SQL NUMBER data type to answer a scale question

I have been assuming based on old posts I found here and elsewhere that a NUMBER data type with no scale specified defaults to zero. However, when I look at our data on our server, I see rational numbers (read non-integers) even with no scale specified. Is this expected behavior?
To give some context, I am a technical writer and I am documenting various things associated with the data we house. I have been assuming that all NUMBERS with no scale specified have been integers. If this is incorrect, I'll need to update my documentation.
I have basically used the following link as a citation for my reasoning concerning scale, but now I believe that it is outdated. Link
I have also read the Oracle documentation here and here. The last link also states that if no scale is specified, then the default is set to zero.
I think the Oracle documentation is misleading. If neither precision nor scale is specified, then the number is stored as-is; the default scale of 0 only applies when a precision is specified.
So, the following code stores the first few digits of pi:
create table t (n number);
insert into t(n) values (3.14159265358979);
But this does not:
create table t2 (n number(5));
insert into t2(n) values (3.14159265358979);
Here is a db<>fiddle illustrating this.
This is rather indelibly marked in my memory, from a fun time porting an Oracle database to BigQuery (which did not even have numeric at the time). The Oracle number data type was one of the most difficult parts of the transition. In the end, we needed to move that into strings.

Precision Step in Apache Lucene [Request for comment to improve solution]

What exactly is precision step?
I am new to Lucene and had some difficulty understanding the concept of precision step, which is used in NumericField and NumericRangeQuery. After going through the Lucene docs and various Stack Overflow questions I got the concept. I am sharing my understanding and explanation here; I hope it helps others understand precision step quickly and easily. This is open for discussion and correction. Please add your knowledge here and help improve it.
Precision Step
Lucene, strictly speaking, deals only with strings: all data types are converted to strings and then processed further. For numeric fields and queries, Lucene uses a string encoding of the numbers; the encoded numbers are then indexed and queried accordingly. The precision step value is used here for indexing terms and for query optimization.
The precision step is the number of bits of the indexed value after which a new term starts.
For example, for an int of 32 bits:
A precision step of 26 will give two terms:
the full 32 bits itself, and 32-26 = 6 bits.
Similarly, a precision step of 8 will create 4 terms in total:
the full 32 bits itself,
32-8 = 24 bits,
24-8 = 16 bits,
16-8 = 8 bits.
Thus, with a lower precision step value there are more precision levels and more terms in the index, and a range query can be covered with fewer term matches, which improves query performance.
In short: lower precision step value => more precision levels => more terms in the index => fewer terms to match per range query => faster searches.
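To make the counting above concrete, here is a small illustrative sketch in plain Java. It only mirrors the arithmetic of the example; it is not the actual Lucene term encoding (which additionally prefix-codes the shift into each term and flips the sign bit to keep values sortable):

public class PrecisionStepDemo {
    // Print the prefix terms a 32-bit value produces for a given precision step.
    static void printPrefixTerms(int value, int precisionStep) {
        for (int shift = 0; shift < 32; shift += precisionStep) {
            int prefix = value >>> shift;  // keep only the top (32 - shift) bits
            System.out.printf("shift=%2d -> %2d bits of precision, prefix=%d%n",
                    shift, 32 - shift, prefix);
        }
    }

    public static void main(String[] args) {
        printPrefixTerms(123456, 8);   // 4 terms: 32, 24, 16 and 8 bit prefixes
        printPrefixTerms(123456, 26);  // 2 terms: the full 32 bits and a 6 bit prefix
    }
}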

Precision gained when data moves from one table to another in SQL Server

There are three tables in our SQL Server 2008 database:
transact_orders
transact_shipments
transact_child_orders
All three have a common column, carrying_cost. Its data type is the same in all three tables: float, with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1, transact_orders, this column has the value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000... here.
In table 2, transact_shipments, three rows fetch carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000... here as well.
Table 3, transact_child_orders, sums up those three carrying costs from transact_shipments, and the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this table, and the UI shows that extra-precision value as well, even though the UI only fetches the value and does no conversion. In the Java code, the variable that fetches the value from the database is declared as double.
The code in step 3 that sums up the three carrying costs is simple:
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - for example, finance.
Floats can scale from very small numbers to very large numbers, but they don't do that without losing a degree of accuracy. You can look the details up online; there is a host of good material out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
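To see the same drift outside SQL Server, here is a quick Java illustration (float(53) in SQL Server and the double in the question's Java code are both 64-bit IEEE floating point; the class name is just for this example):

import java.math.BigDecimal;

public class FloatDrift {
    public static void main(String[] args) {
        double sum = 5.1 + 5.1 + 5.1;
        System.out.println(sum);                 // 15.299999999999999
        System.out.println(new BigDecimal(sum)); // the exact binary value actually stored
        // Exact decimal arithmetic keeps the expected result:
        System.out.println(new BigDecimal("5.1")
                .add(new BigDecimal("5.1"))
                .add(new BigDecimal("5.1")));    // 15.3
    }
}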
I'm not sure what is causing your performance issue with the DECIMAL data type, often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause, nothing is faster than integer arithmetic. So, store your values as integers. £1.10 could be stored as 110p. Or, if you know you'll get fractions of a penny for some reason, as 1100dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is a lot written online about how to use floats and, more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of times people have screwed up data using that unfortunately common but naive assumption.)
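And as a minimal sketch of the store-it-as-pennies suggestion (the class name and the way the leftover penny is reported are just illustrative choices):

public class PennySplit {
    public static void main(String[] args) {
        long totalPence = 1000;                 // £10.00 held as whole pennies
        int people = 3;
        long share = totalPence / people;       // 333p each
        long remainder = totalPence % people;   // 1p left over - decide explicitly where it goes
        System.out.printf("%d people get £%d.%02d each; %dp remains to allocate%n",
                people, share / 100, share % 100, remainder);
    }
}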

Is it possible to affect a Lucene rank based on a numeric value?

I have content with various numeric values, and a higher value indicates (theoretically) more valuable content, which I want to rank higher.
For instance:
Average rating (0 - 5)
Number of comments (0 - whatever)
Number of inbound link references from other pages (0 - whatever)
Some arbitrary number I apply to indicate how important I feel the content is (1 - whatever)
These can be indexed by Lucene as a numeric value, but how can I tell Lucene to use this value in its ranking algorithm?
You can set this value using Field.setBoost while indexing.
Depending on how exactly you want to proceed, you can set a boost while indexing as suggested by @L.B, or, if you want to make it dynamic, i.e. applied at search time rather than indexing time, you can use ValueSourceQuery and CustomScoreQuery.
You can see example in the question I asked some time ago:
Lucene custom scoring for numeric fields (the example was tested with Lucene 3.0).
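For the index-time route, here is a minimal sketch against the Lucene 3.x API (the field names and the 1 + rating boost formula are illustrative assumptions, not a recommendation):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;

public class BoostByRating {
    // Index a document whose score is lifted by its average rating (0 - 5).
    static void addContent(IndexWriter writer, String title, float rating) throws Exception {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new NumericField("rating", Field.Store.YES, true).setFloatValue(rating));
        doc.setBoost(1.0f + rating); // higher rating -> higher score for matches on this document
        writer.addDocument(doc);
    }
}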

Is there any reason for numeric rather than int in T-SQL?

Why would someone use numeric(12, 0) datatype for a simple integer ID column? If you have a reason why this is better than int or bigint I would like to hear it.
We are not doing any math on this column, it is simply an ID used for foreign key linking.
I am compiling a list of programming errors and performance issues about a product, and I want to be sure they didn't do this for some logical reason. If you follow this link:
http://msdn.microsoft.com/en-us/library/ms187746.aspx
... you can see that the numeric(12, 0) uses 9 bytes of storage and, being limited to 12 digits, gives a total of about 2 trillion numbers if you include negatives. WHY would a person use this when they could use a bigint and get roughly 10 million times as many numbers with one byte less storage? Furthermore, since this is being used as a product ID, the 4 billion numbers of a standard int would have been more than enough.
So before I grab the torches and pitchforks - tell me what they are going to say in their defense?
And no, I'm not making a huge deal out of nothing, there are hundreds of issues like this in the software, and it's all causing a huge performance problem and using too much space in the database. And we paid over a million bucks for this crap... so I take it kinda seriously.
Perhaps they're used to working with Oracle?
In Oracle, all numeric types, including ints, are normalized to a single standard representation across all platforms.
There are many reasons to use numeric - for example, financial data and other data that needs to be accurate to a certain number of decimal places. However, for the example you cited above, a simple int would have done.
Perhaps sloppy programmers who didn't know how to design a database?
Before you take things too seriously, what is the data storage requirement for each row or set of rows for this item?
Your observation is correct, but you probably don't want to present it too strongly if you're reducing storage from 5000 bytes to 4090 bytes, for example.
You don't want to blow your credibility by bringing this up and having them point out that any measurable savings are negligible. ("Of course, many of our lesser-experienced staff also make the same mistake.")
Can you fill in these blanks?
with the data type change, we use
____ bytes of disk space instead of ____
____ ms per query instead of ____
____ network bandwidth instead of ____
____ network latency instead of ____
That's the kind of thing which will give you credibility.
How old is this application you are looking into?
Prior to SQL Server 2000 there was no bigint. Maybe it's just something that has made it from release to release for many years without being changed, or the database schema was copied from an application that was that old?
In your example I can't think of any logical reason why you wouldn't use INT. I know there are probably reasons for other uses of numeric, but not in this instance.
According to: http://doc.ddart.net/mssql/sql70/da-db_1.htm
decimal
Fixed precision and scale numeric data from -10^38 + 1 through 10^38 - 1.
numeric
A synonym for decimal.
int
Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647).
It is impossible to know whether there is a reason for them using decimal, though, since we have no code to look at.
In some databases, using a decimal(10,0) creates a packed field which takes up less space. I know there are many tables around my work that use that. They probably had the same kind of thought here, but you have gone to the documentation and proven that to be incorrect. More than likely, I would say it will boil down to a case of "that's the way we have always done it, because someone one time said it was better".
It is possible they spent a LOT of time in MS Access, saw 'Number' often, and just figured: it's a number, why not use numeric?
Based on your findings, it doesn't sound like they are the optimization experts, and just didn't know. I'm wondering if they used schema generation tools and just relied on them too much.
I wonder how an index on a decimal value (even with scale 0) used as a primary key compares in efficiency to one on a pure integer value.
Like Mark H. said, other than the indexing factor, this particular scenario likely isn't growing the database THAT much, but if you're looking for ammo, I think you did find some to belittle them with.
In your citation, the decimal shows precision of 1-9 as using 5 bytes. Your column apparently has 12,0 - using 4 bytes of storage - same as integer.
Moreover, the INT datatype only goes up to a power of 31:
-2^31 (-2,147,483,648) to 2^31 - 1 (2,147,483,647)
while decimal goes much larger, up to a power of 38:
-10^38 + 1 through 10^38 - 1
So the software creator was actually providing more while using the same amount of storage space.
Now, with the basics out of the way: the software creator actually limited the column to just 12 digits, e.g. 123,456,789,012 (just an example of the place holders, not a maximum value). If they had used INT they could not scale this column; with decimal it can be widened, up to the full 38 digits. Perhaps there is a business reason to limit this column and the associated columns to 12 digits.
An INT is an INT, while a DECIMAL is scalable.
Hope this helps.
PS:
The whole number argument is:
A) Whole numbers are 0..infinity
B) Counting (Natural) numbers are 1..infinity
C) Integers are -infinity .. +infinity
D) I would not cite WikiANYTHING for anything. Come on, use a real source! May as well be http://MyPersonalMathCite.com