How to make a query that matches values within a specified range of non standard types? - lucene

For standard ones I find it pretty straightforward
NumericRangeQuery.NewIntRange(item.Name, item.MinValue, item.MaxValue, true, true))
It works great with most common numeric types.
But what I would like to do is to make a range query with such datatypes as Date and decimal.
How could I achieve this?

For dates, store them as ints. So 2016 July 23 = 20160723
If you want to the hour or minute or second, just add those digits to the right. You may need to switch to long (Int64) for the longer versions.
If you want finer grain then store Ticks.
After all that just use the appropriate NumericRange query.
In Lucene.net 3.0.3 the best float accuracy is with Double

Related

Cloudsearch range failing for negative integers

I have records in Amazon's cloudsearch that are timestamped with an int representing milliseconds since the epoch. I call the field time. This can be negative for dates before 1970. When I perform a structured query using time:[0,}, it's returning negative as well as positive timestamps, which is wrong. The docs say that ints are 64-bit signed, so I don't see why this wouldn't be valid. My query syntax works fine with other fields that are only positive-valued. Are range searches actually restricted to positive numbers?
(aside: I know I could use a date string format, but I want to use an integer for consistency with other parts of my system. Also I want to be able to represent BCE dates and I'm not sure whether YYYY:MM:DD formats behave safely when YYYY is negative.)
It turns out Cloudsearch queries work fine with negative numbers, as you'd expect. My problem was that I'd previously defined this field as a text field (and a text comparison won't numerically order strings like '123', '-555', '-1', etc.) I'd changed the field to an int, but I'd forgotten to re-index, so Cloudsearch was still secretly treating it as text.
To re-index after changing field types you can use:
aws cloudsearch index-documents --domain-name mycloudsearch
or you can do it from the web interface.

What use are SQL dates without date functions?

I've worked with various ORMs and database abstractions designed to make it easy to work with multiple databases, both relational and not. The more comprehensive solutions will usually give you access to some date functions that boil down to actual SQL (or whatever, in the case of non-SQL dbs). On the other hand, many of these abstractions don't provide direct access to SQL functions and you lose the ability to deal with dates directly. Instead, you're expected to use the upper-level language (PHP, Python, whatever) to do your date-wrangling, and finally only insert, select, what-have-you the formatted date.
So my question is this: if the SQL server never gets to do anything with the date itself, am I better off just using an int and putting epoch timestamps in it, or is there additional value to the database server "knowing" it's a date?
If you are using dates, store them as dates.
Not only does this make it easier to translate between the database and application, but when you need to do anything based on the dates (and you will, otherwise why have dates stored at all?).
That is, when you need to sort or query using the dates, you will not need to go trough special effort to re-convert to dates.
Other than what #Oded said, if you never ever use any date related functions, Still there are some issues;
At the moment, you cannot store epoch timestamp in milliseconds into an INT field (overflows).
Timestamp without milliseconds will overflow INT on Tue Jan 19 2038 # 03:14:08 GMT+0000 (GMT) as it will be greater than 2147483647.
BUT, Integer takes 4 bytes and Datetime takes 8 bytes. You are better off 4 bytes if you are within above two limitations.

precision gains where data move from one table to another in sql server

There are three tables in our sql server 2008
transact_orders
transact_shipments
transact_child_orders.
Three of them have a common column carrying_cost. Data type is same in all the three tables.It is float with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1 - transact_orders this column has value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000..... here.
Table 2 - transact_shipments three rows are fetching carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000..... here also.
Table 3 - transact_child_orders is summing up those three carrying costs from transact_shipments. And the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this stable. And its showing that precision gained value in ui also. Though ui is only fetching the value, not doing any conversion. In the java code the variable which is fetching the value from the db is defined as double.
The code in step 3, to sum up the three carrying_costs is simple ::
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step ? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - For example, finance.
Floats can scale from very small numbers, to very high numbers. But they don't do that without losing a degree of accuracy. You can look the details up on line, there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
I'm not sure what is causing your performance issue with the DECIMAL data type, often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause; nothing is faster than integer arithmetic. So, store your values are integers? £1.10 could be stored as 110p. Or, if you know you'll get some fractions of a pence for some reason, 11000dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is ALOT written on line about how to use floats, and more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of time people have screwed up data using that unfortunately common but naive assumption.)

When to use VARCHAR and DATE/DATETIME

We had this programming discussion on Freenode and this question came up when I was trying to use a VARCHAR(255) to store a Date Variable in this format: D/MM/YYYY. So the question is why is it so bad to use a VARCHAR to store a date. Here are the advantages:
Its faster to code. Previously I used DATE, but date formatting was a real pain.
Its more power hungry to use string than Date? Who cares, we live in the Ghz era.
Its not ethically correct (lolwut?) This is what the other user told me...
So what would you prefer to use to store a date? SQL VARCHAR or SQL DATE?
Why not put screws in with a hammer?
Because it isn't the right tool for the job.
Some of the disadvantages of the VARCHAR version:
You can't easily add / subtract days to the VARCHAR version.
It is harder to extract just month / year.
There is nothing stopping you putting non-date data in the VARCHAR column in the database.
The VARCHAR version is culture specific.
You can't easily sort the dates.
It is difficult to change the format if you want to later.
It is unconventional, which will make it harder for other developers to understand.
In many environments, using VARCHAR will use more storage space. This may not matter for small amounts of data, but in commercial environments with millions of rows of data this might well make a big difference.
Of course, in your hobby projects you can do what you want. In a professional environment I'd insist on using the right tool for the job.
When you'll have database with more than 2-3 million rows you'll know why it's better to use DATETIME than VARCHAR :)
Simple answer is that with databases - processing power isn't a problem anymore. Just the database size is because of HDD's seek time.
Basically with modern harddisks you can read about 100 records / second if they're read in random order (usually the case) so you must do everything you can to minimize DB size, because:
The HDD's heads won't have to "travel" this much
You'll fit more data in RAM
In the end it's always HDD's seek times that will kill you. Eg. some simple GROUP BY query with many rows could take a couple of hours when done on disk compared to couple of seconds when done in RAM => because of seek times.
For VARCHAR's you can't do any searches. If you hate the way how SQL deals with dates so much, just use unix timestamp in 32 bit integer field. You'll have (basically) all advantages of using SQL DATE field, you'll just have to manipulate and format dates using your choosen programming language, not SQL functions.
Two reasons:
Sorting results by the dates
Not sensitive to date formatting changes
So let's take for instance a set of records that looks like this:
5/12/1999 | Frank N Stein
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
If we were to store the data your way, but sorted on the dates in assending order SQL will respond with the resultset that looks like this:
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
Where if we stored the dates as a DATETIME, SQL will respond correctly ordering them like this:
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
Additionally, if somewhere down the road you needed to display dates in a different format, for example like YYYY-MM-DD, then you would need to transform all your data or deal with mixed content. When it's stored as a SQL DATE, you are forced to make the transform in code, and very likely have one spot to change the format to display all dates--for free.
Between DATE/DATETIME and VARCHAR for dates I would go with DATE/DATETIME everytime. But there is a overlooked third option. Storing it as a INTEGER unsigned!
I decided to go with INTEGER unsigned in my last project, and I am really satisfied with making that choice instead of storing it as a DATE/DATETIME. Because I was passing along dates between client and server it made the ideal type for me to use. Instead of having to store it as DATE and having to convert back every time I select, I just select it and use it however I want it. If you want to select the date as a "human-readable" date you can use the FROM_UNIXTIME() function.
Also a integer takes up 4 bytes while DATETIME takes up 8 bytes. Saving 50% storage.
The sorting problem that Berin proposes is also solved using integer as storage for dates.
I'd vote for using the date/datetime types, just for the sake of simplicity/consistency.
If you do store it as a character string, store it in ISO 8601 format:
http://www.iso.org/iso/date_and_time_format
http://xml.coverpages.org/ISO-FDIS-8601.pdf
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Among other things, ISO 8601 date/time string (A) collate properly, (B) are human readable, (C) are locale-indepedent, and (D) are readily convertable to other formats. To crib from the ISO blurb, ISO 8601 strings offer
representations for the following:
Date
Time of the day
Coordinated universal time (UTC)
Local time with offset to UTC
Date and time
Time intervals
Recurring time intervals
Representations can be in one of two formats: a basic format
that has a minimal number of characters and an extended format
that adds characters to enhance human readability. For example,
the third of January 2003 can be represented as either 20030103
or 2003-01-03.
[and]
offer the following advantages over many of the locally used
representations:
Easily readable and writeable by systems
Easily comparable and sortable
Language independent
Larger units are written in front of smaller units
For most representations the notation is short and of constant length
One last thing: If all you need to do is store a date, then storing it in the ISO 8601 short form YYYYMMDD in a char(8) column takes no more storage than a datetime value (and you don't need to worry about the 3 millisecond gap between the last tick of the one day and the first tick of the next. But that's a matter for another discussion. If you break it up into 3 columns — YYYY char(4), MM char(2), DD char(2) you'll use up the same amount of storage, and get more options for indexing. Even better, store the fields as a short for yyyy (4 bytes), and a tinyint for each of MM and DD — now you're down to 6 bytes for the date. The drawback, of course, to decomposing the date components into their constituent parts is that conversion to proper date/time data types is complicated.

List of Best Practice MySQL Data Types

Is there a list of best practice MySQL data types for common applications. For example, the list would contain the best data type and size for id, ip address, email, subject, summary, description content, url, date (timestamp and human readable), geo points, media height, media width, media duration, etc
Thank you!!!
i don't know of any, so let's start one!
numeric ID/auto_increment primary keys: use an unsigned integer. do not use 0 as a value. and keep in mind the maximum value of of the various sizes, i.e. don't use int if you don't need 4 billion values when the 16 million offered by mediumint will suffice.
dates: unless you specifically need dates/times that are outside the supported range of mysql's DATE and TIME types, use them! if you instead use unix timestamps, you have to convert them to use the built-in date and time functions. if your app needs unix timestamps, you can always convert the standard date and time data types on the way out using unix_timestamp().
ip addresses: use inet_aton() and inet_ntoa() since it easily compacts an ip address in to 4 bytes and gives you the ability to do range searches that utilize indexes.
Integer Display Width You likely define your integers something like this "INT(4)" but have been baffled by the fact that (4) has no real effect on the stored numbers. In other words, you can store numbers like 999999 just fine. The reason is that for integers, (4) is the display width, and only has an effect if used with the ZEROFILL modifier. Further, this is for display purposes only, so you could define a column as "INT(4) ZEROFILL" and store 99999. If you stored 999, the mysql REPL (console) would output 0999 when you've selected this column.
In other words, if you don't need the ZEROFILL stuff, you can leave off the display width.
Money: Use the Decimal data type. Based on real-world production scenarios I recommend (19,8).
EDIT: My original recommendation was (19,4); however, I've recently run into a production issue where the client reported that they absolutely needed decimal with a "scale" of "8"; thus "4" wasn't enough and was causing improper tax calculations. I now recommend (19,8) based on a real-world scenario. I would love to hear stories needing a more granular scale.