Cloudsearch range failing for negative integers - amazon-cloudsearch

I have records in Amazon's cloudsearch that are timestamped with an int representing milliseconds since the epoch. I call the field time. This can be negative for dates before 1970. When I perform a structured query using time:[0,}, it's returning negative as well as positive timestamps, which is wrong. The docs say that ints are 64-bit signed, so I don't see why this wouldn't be valid. My query syntax works fine with other fields that are only positive-valued. Are range searches actually restricted to positive numbers?
(aside: I know I could use a date string format, but I want to use an integer for consistency with other parts of my system. Also I want to be able to represent BCE dates and I'm not sure whether YYYY:MM:DD formats behave safely when YYYY is negative.)

It turns out Cloudsearch queries work fine with negative numbers, as you'd expect. My problem was that I'd previously defined this field as a text field, and a text comparison won't numerically order strings like '123', '-555', '-1', etc. I'd since changed the field to an int, but I'd forgotten to re-index, so Cloudsearch was still secretly treating it as text.
To re-index after changing field types you can use:
aws cloudsearch index-documents --domain-name mycloudsearch
or you can do it from the web interface.
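To see why a field that is still indexed as text gives odd range results, here is a small Python sketch of lexicographic versus numeric comparison. It only illustrates string ordering in general; it makes no claim about how Cloudsearch compares text internally, and the sample values and the helper names are made up for the illustration.

```python
# Sample timestamps (ms since the epoch); negative means before 1970.
# These values are illustrative, not taken from a real index.
timestamps = [123, -555, -1, 0, 42]

def at_least_numeric(values, lower):
    """Keep values >= lower using integer comparison."""
    return [v for v in values if v >= lower]

def at_least_text(values, lower):
    """Keep values >= lower using string (lexicographic) comparison."""
    return [v for v in values if str(v) >= str(lower)]

# Sorting as strings does not match numeric order.
text_order = sorted(str(v) for v in timestamps)
numeric_order = sorted(timestamps)

numeric_hits = at_least_numeric(timestamps, -2)
text_hits = at_least_text(timestamps, -2)

print(text_order)     # string order
print(numeric_order)  # numeric order
print(numeric_hits)   # correct range result
print(text_hits)      # wrong: includes -555 and drops -1
```

With a lower bound of -2, the string comparison keeps -555 and drops -1, which is the kind of wrong answer you get until the field is really indexed as an int.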

Related

How to make a query that matches values within a specified range of non standard types?

For standard ones I find it pretty straightforward
NumericRangeQuery.NewIntRange(item.Name, item.MinValue, item.MaxValue, true, true)
It works great with most common numeric types.
But what I would like to do is to make a range query with such datatypes as Date and decimal.
How could I achieve this?
For dates, store them as ints. So 2016 July 23 = 20160723.
If you want resolution down to the hour, minute, or second, just add those digits to the right. You may need to switch to long (Int64) for the longer versions.
If you want a finer grain than that, store Ticks.
After all that, just use the appropriate NumericRange query.
In Lucene.net 3.0.3, the best floating-point accuracy is with Double.
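The digit-packing scheme described above can be sketched in Python (not Lucene.net code; the function names date_key and datetime_key are hypothetical). It shows that the packed integers sort the same way the dates do, and that the second-resolution key no longer fits in 32 bits.

```python
from datetime import date, datetime

def date_key(d: date) -> int:
    """Pack a date as the integer YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day

def datetime_key(dt: datetime) -> int:
    """Pack a datetime as the integer YYYYMMDDhhmmss (needs 64 bits)."""
    return date_key(dt) * 1_000_000 + dt.hour * 10000 + dt.minute * 100 + dt.second

day = date_key(date(2016, 7, 23))
moment = datetime_key(datetime(2016, 7, 23, 14, 5, 9))
print(day)
print(moment)

# A range check on the keys is equivalent to a range check on the dates.
low, high = date_key(date(2016, 7, 1)), date_key(date(2016, 7, 31))
print(low <= day <= high)
```

These keys are only good for ordering and range matching; subtracting two of them does not give a meaningful duration.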

SQL equals does not work for timestamps?

My table has a column 'timestamp' where the timestamps are formatted 2015-06-22 18:59:59
However, using DBVisualizer Free 9.2.8 and Vertica, when I try to pull up rows by timestamp with a
SELECT * FROM table WHERE timestamp = '2015-06-22 18:59:59';
(directly copy-pasting the stamp), nothing comes up. Why is this happening and is there a way around it?
FYI, saying "the timestamps are formatted 2015-06-22 18:59:59" is incorrect if you are indeed using a TIMESTAMP type. Such types have their own internal representation of a date-time value, almost always a count since epoch. In your case with Vertica, 8 bytes are used for such storage. The formatting of the date-time value happens when a string representation is generated. Never confuse the string representation with the date-time value. Conflating the two may well be related to your problem/confusion.
A few different thoughts about possible problems…
String Literals
Are you sure Vertica takes strings as timestamp literals? That format you used is common SQL format. But given that Vertica seems to be a specialized database, I would double-check that.
If strings are not allowed, you may need to call some kind of function to transform the string into a date-time value.
Fractional Second
As the comment by Martin Smith points out, the doc for Timestamp-related data types in Vertica 7.1 says those types can have a fractional second to resolution of microseconds. That means up to 6 decimal places of a fraction.
So if you are searching for "2015-06-22 18:59:59" but the stored value is "2015-06-22 18:59:59.012345", no match on the query.
Half-Open
The fractional seconds issue described above is often the cause of problems people have when handling a span of time. If you naïvely try to pinpoint the ending time, you are likely to have problems. Seeing the "59:59" in your example string makes me think this applies to you.
The better approach to spans of time is "Half-Open" (or Half-Closed, whatever) where the beginning is inclusive while the ending is exclusive. Common notation for this is [). In comparison logic this means: value >= start AND value < stop. Notice the lack of EQUALS SIGN in the stop comparison. In English we would say "look for an hour's worth of invoices starting at 2:00 PM and going up to, but not including, 3:00 PM".
Half-Open for a week means Monday-Monday, for a month the first of one month to the first of the next month, and for a year the January 1 of one year to January 1 of the following year.
Half-Open means not using BETWEEN in SQL. SQL's BETWEEN has often been criticized. Instead do something like the following to look for an hour's worth of invoices. Notice the Z on the end of the string literal, which means "UTC time zone" ("Z" for "Zulu"). (But verify, as my SQL syntax may need fixing.)
SELECT *
FROM some_table_
WHERE invoice_received_ >= '2015-06-22 18:00:00Z'
AND invoice_received_ < '2015-06-22 19:00:00Z'
;
This query will catch any values such as '2015-06-22 18:59:59.654321', which seem to be eluding you.
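The same Half-Open logic can be sketched outside SQL. This Python snippet (the helper name in_half_open is made up for the illustration, and time zones are left out to keep it short) shows an equality test missing a value with fractional seconds while the half-open comparison catches it.

```python
from datetime import datetime, timedelta

def in_half_open(value: datetime, start: datetime, stop: datetime) -> bool:
    """True when start <= value < stop: inclusive start, exclusive stop."""
    return start <= value < stop

start = datetime(2015, 6, 22, 18, 0, 0)
stop = start + timedelta(hours=1)

# A stored value carrying microseconds, as a TIMESTAMP column might.
stored = datetime(2015, 6, 22, 18, 59, 59, 654321)

print(stored == datetime(2015, 6, 22, 18, 59, 59))  # equality misses it
print(in_half_open(stored, start, stop))            # half-open catches it
print(in_half_open(stop, start, stop))              # the stop itself is excluded
```

Because the stop is exclusive, consecutive hours never overlap and never leave a gap between them.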
Reserved Word
I hope you have not really named your table 'table' and your column 'timestamp'. Such use of keywords and reserved words can cause explicit errors or more subtle weird problems.
Tip: The easy way to avoid any of the over a thousand reserved words in various databases is to append a trailing underscore. The SQL standard explicitly promises never to use a trailing underscore in its reserved words. So use "timestamp_" rather than "timestamp". Another example: "invoice_" table and "received_" column. I recommend doing that as a habit on everything you name in SQL: columns, tables, constraints, indexes, and so on.
Time Zone
You are using the TIMESTAMP type, which is short for TIMESTAMP WITHOUT TIME ZONE. Or so I presume; the Vertica doc is vague but that is the common usage as seen in the Postgres doc, and may even be standard SQL.
Anyways, TIMESTAMP WITHOUT TIME ZONE is usually the wrong type for most business purposes. The WITH time zone type is misnamed and often misunderstood as a consequence: It means "with respect for time zone", where data inputs that include an offset from UTC or other time zone information are adjusted to UTC during the INSERT/UPDATE operations. The WITHOUT type simply ignores any such offset or time zone information.
The WITHOUT type should only be used for the concept of a date-time generally without being tied to any one locality. For example, saying "Christmas this year starts at beginning of December 25, 2015". That means in any time zone rather than a specific time zone. Obviously Christmas starts earlier in Paris, for example, than in Montréal.
If you are timestamping legal documents such as invoices, or booking appointments with people across time zones, or scheduling shipments in various localities, you should be using WITH time zone type.
So back to your possible problem: Test how Vertica or your client app or your database driver is handling your input string. It may be adjusting time zones as part of the parsing of the string using your client machine’s current default time zone. When sent to the database, that value will not match the stored value if during storage no adjustment to UTC was made.
Tip: Generally best practice is to do all your storage and business logic in UTC, adjusting to local time zones only where expected by user.

Swedish "personnummer" (personal identity number) in SQL

This is a specific instance of an old problem: How to store "numbers" (e.g. phone numbers, IP addresses, social security numbers) in SQL databases?
Background: In Sweden, Personal Identity Numbers ("personnummer") are extremely common: You use them when communicating with the government, the bank, your employer, etc. People born in Sweden are assigned them when born. My immigrant friends lament the dark couple of weeks before they got a personnummer and could finally get a debit card and start looking for jobs.
My organization needs to store personnummer of our members. We have a SQL database for this. How should I store the data?
From Wikipedia, regarding the format of a personnummer:
The personal identity number consists of 10 digits and a hyphen. The first six correspond to the person's birthday, in YYMMDD form. They are followed by a hyphen. People over the age of 100 replace the hyphen with a plus sign. The seventh through ninth are a serial number. An odd ninth number is assigned to males and an even ninth number is assigned to females. Some county authorities, such as Stockholm, and some banks, have started using 12 digit numbers to allow YYYYMMDD. This format is also used on some Swedish ID-cards[clarification needed] and on the Swedish European Health Insurance Cards but not on state-issued identity documents.
The tenth digit is a checksum which was introduced in 1967 when the system was computerized.
So, a personnummer could be "120101-3842" for a person born this year. This is also commonly formatted as "20120101-3842" because of Y2K and "replacing the hyphen with a plus sign" is not well-known.
In a database column, I imagine I can:
Store it as a VARCHAR, formatted as "120101-3842", "20120101-3842" or "201201013842" (shaving off a byte by getting rid of the superfluous hyphen in the YYYYMMDD-format).
Store the full YYYYMMDDXXXX as an INTEGER, which is too big for 32 bits but fits without problems in 64 bits.
There won't be any issues with leading zeroes in this case, and using a VARCHAR is almost twice the size. Unlike IP addresses, storing this number as an INTEGER does not make it harder to read for a human (i.e. "127.0.0.1" compared to 2130706433).
I appreciate the "strictness" of an INTEGER column but also feel that this might run into unseen issues.
EDIT: We have a real need to validate this input with the checksum et cetera, which requires doing math on the individual digits (multiplying, summing etc). Since digits aren't really ... uh... part of a quantity, but of decimal formatting, it might make sense to consider it a varchar after all.
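For the digit math mentioned in the EDIT, here is a minimal Python sketch. It assumes the checksum is the Luhn algorithm applied to the nine digits YYMMDDNNN, which the quoted Wikipedia excerpt does not spell out, so verify that before relying on it. The function names are hypothetical, and the sketch checks only the checksum, not whether the date part is a real date.

```python
def personnummer_check_digit(first_nine: str) -> int:
    """Luhn check digit for YYMMDDNNN (assumed algorithm, see lead-in)."""
    total = 0
    for i, ch in enumerate(first_nine):
        # Weights alternate 2, 1, 2, 1, ... starting from the first digit.
        product = int(ch) * (2 if i % 2 == 0 else 1)
        # Add the digits of the product, so 14 contributes 1 + 4.
        total += product // 10 + product % 10
    return (10 - total % 10) % 10

def is_valid_personnummer(text: str) -> bool:
    """Accept YYMMDD-NNNC, YYYYMMDD-NNNC or YYYYMMDDNNNC; check only the checksum."""
    digits = "".join(c for c in text if c in "0123456789")
    if len(digits) == 12:
        digits = digits[2:]  # the checksum covers the 10-digit form
    if len(digits) != 10:
        return False
    return personnummer_check_digit(digits[:9]) == int(digits[9])

print(is_valid_personnummer("120101-3842"))
print(is_valid_personnummer("201201013842"))
print(is_valid_personnummer("120101-3843"))
```

The validation walks the value character by character, which is natural on a string and would need repeated division and modulo on an INTEGER column value.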
Use VARCHAR with a fixed length because it is the simplest approach. And I don't think your organisation will store the numbers of all 9.5 million inhabitants, so is saving space really a design goal? :)
So, as I understand it, the hyphen / plus signs are only required for the format with 2 digit year.
If I were you, I would convert to the 4 digit year format on the application side (and drop the hyphen), then store the resulting value as an integer. As you have stated, this will save space, and will allow you to mathematically transform the values (although I imagine that on personal numbers this may be irrelevant).
I think the key here is that you should choose a single format rather than trying to manage two different formats in the database. This will also help to lead to application consistency. When it comes to external applications that require one or another format, you can place a transform into the transfer code.
On a side note, it should be fairly trivial to create a trigger that automatically converts the 2 digit year format to the 4 digit year format (as long as you replace the hyphen / plus with a digit).
I would store the canonical form 201201013842 as a CHAR (rather than a VARCHAR).
The bottom line is that you do not control the semantics of the number (Swedish authorities do). If at some point they decide to add non numeric characters to the number (as the number already does in the older format), you will be better equipped to deal with the change.
We have the same problem and we currently store it as yyyyMMdd-xxxx, but if I were to redesign this today I would store the yyyyMMdd in a date field, as that would handle the validation of the date. Then I would store the 4 other digits in an nchar(4) and add a constraint to ensure it contains only numbers.

precision gains where data move from one table to another in sql server

There are three tables in our sql server 2008
transact_orders
transact_shipments
transact_child_orders.
All three have a common column, carrying_cost. The data type is the same in all three tables: float with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1 - transact_orders this column has value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000..... here.
Table 2 - transact_shipments three rows are fetching carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000..... here also.
Table 3 - transact_child_orders is summing up those three carrying costs from transact_shipments. And the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this table. And it's showing that precision-gained value in the UI also, though the UI is only fetching the value, not doing any conversion. In the Java code, the variable which fetches the value from the db is defined as double.
The code in step 3, to sum up the three carrying_costs is simple ::
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step ? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors, for example in finance.
Floats can scale from very small numbers to very large numbers. But they don't do that without losing a degree of accuracy. You can look the details up online; there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
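The effect in the question can be reproduced in a few lines. This Python sketch uses Python's float, which is the same IEEE 754 double as SQL Server's float(53) and Java's double, and compares it with exact decimal arithmetic. The variable names are illustrative only.

```python
from decimal import Decimal

costs = [5.1, 5.1, 5.1]

# Binary floating point: 5.1 is stored as the nearest representable double,
# and the summed result is not the double nearest to 15.3.
float_total = sum(costs)

# Exact decimal arithmetic: 5.1 is stored exactly.
decimal_total = sum(Decimal("5.1") for _ in costs)

print(repr(float_total))     # slightly below 15.3
print(float_total == 15.3)   # False
print(decimal_total)         # exactly 15.3
print(Decimal(5.1))          # the value the double 5.1 really holds
```

Each stored 5.1 already carries a tiny error that a short display hides; the sum just makes it large enough to show.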
I'm not sure what is causing your performance issue with the DECIMAL data type, often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause, nothing is faster than integer arithmetic. So, store your values as integers? £1.10 could be stored as 110p. Or, if you know you'll get some fractions of a penny for some reason, 1100dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
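A rough Python sketch of the integer approach follows, including one possible answer to the "where does the last 1p go" question. Giving the leftover pennies to the first few shares is just one policy chosen for the illustration; the function names are hypothetical and amounts are assumed to be non-negative.

```python
def split_pence(total_pence: int, people: int) -> list[int]:
    """Split an amount in pence so that no penny is lost.

    The remainder is handed out one penny at a time to the first shares.
    """
    base, remainder = divmod(total_pence, people)
    return [base + 1 if i < remainder else base for i in range(people)]

def format_pounds(pence: int) -> str:
    """Render non-negative integer pence as pounds for display."""
    pounds, rest = divmod(pence, 100)
    return f"£{pounds}.{rest:02d}"

shares = split_pence(1000, 3)   # £10 between 3 people
print(shares)
print(sum(shares))              # still 1000, nothing lost
print([format_pounds(s) for s in shares])
print(format_pounds(110))
```

The important property is that the shares always add back up to the original total, whichever policy decides who gets the extra penny.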
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is a lot written online about how to use floats, and more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of times people have screwed up data using that unfortunately common but naive assumption.)

List of Best Practice MySQL Data Types

Is there a list of best practice MySQL data types for common applications? For example, the list would contain the best data type and size for id, ip address, email, subject, summary, description content, url, date (timestamp and human readable), geo points, media height, media width, media duration, etc.
Thank you!!!
i don't know of any, so let's start one!
numeric ID/auto_increment primary keys: use an unsigned integer. do not use 0 as a value. and keep in mind the maximum value of the various sizes, i.e. don't use int if you don't need 4 billion values when the 16 million offered by mediumint will suffice.
dates: unless you specifically need dates/times that are outside the supported range of mysql's DATE and TIME types, use them! if you instead use unix timestamps, you have to convert them to use the built-in date and time functions. if your app needs unix timestamps, you can always convert the standard date and time data types on the way out using unix_timestamp().
ip addresses: use inet_aton() and inet_ntoa() since it easily compacts an ip address in to 4 bytes and gives you the ability to do range searches that utilize indexes.
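To show what the 4-byte form looks like and why it makes range searches simple, here is a Python sketch that uses the standard ipaddress module as a stand-in for MySQL's inet_aton() and inet_ntoa(). The wrapper names ip_to_int and int_to_ip are made up for the illustration, and it covers IPv4 only.

```python
import ipaddress

def ip_to_int(text: str) -> int:
    """Dotted-quad IPv4 string to its 32-bit unsigned integer value."""
    return int(ipaddress.IPv4Address(text))

def int_to_ip(value: int) -> str:
    """32-bit unsigned integer back to a dotted-quad string."""
    return str(ipaddress.IPv4Address(value))

loopback = ip_to_int("127.0.0.1")
print(loopback)
print(int_to_ip(loopback))

# A subnet becomes a plain integer range, which an index can serve.
low, high = ip_to_int("10.0.0.0"), ip_to_int("10.255.255.255")
print(low <= ip_to_int("10.1.2.3") <= high)
print(low <= ip_to_int("11.0.0.1") <= high)
```

The values go up to 4294967295, so the column holding them needs to be an unsigned 32-bit integer rather than a signed one.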
Integer Display Width: You likely define your integers something like this "INT(4)" but have been baffled by the fact that (4) has no real effect on the stored numbers. In other words, you can store numbers like 999999 just fine. The reason is that for integers, (4) is the display width, and only has an effect if used with the ZEROFILL modifier. Further, this is for display purposes only, so you could define a column as "INT(4) ZEROFILL" and store 99999. If you stored 999, the mysql REPL (console) would output 0999 when you've selected this column.
In other words, if you don't need the ZEROFILL stuff, you can leave off the display width.
Money: Use the Decimal data type. Based on real-world production scenarios I recommend (19,8).
EDIT: My original recommendation was (19,4); however, I've recently run into a production issue where the client reported that they absolutely needed decimal with a "scale" of "8"; thus "4" wasn't enough and was causing improper tax calculations. I now recommend (19,8) based on a real-world scenario. I would love to hear stories needing a more granular scale.
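As a rough illustration of how scale can change a tax result, here is a Python sketch using the decimal module. The rate 0.07125 and the amount 1000 are made-up numbers, not from the production case mentioned above, the helper name store is hypothetical, and half-up rounding on storage is an assumption about how the column rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

def store(value: str, scale: int) -> Decimal:
    """Round a value to the given number of decimal places, like a DECIMAL(p, scale) column might."""
    quantum = Decimal(1).scaleb(-scale)  # scale 4 gives 0.0001
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

amount = Decimal("1000")

rate_scale_4 = store("0.07125", 4)   # the fifth decimal place is rounded away
rate_scale_8 = store("0.07125", 8)   # the full rate is kept

tax_4 = amount * rate_scale_4
tax_8 = amount * rate_scale_8
print(rate_scale_4, tax_4)
print(rate_scale_8, tax_8)
```

Any rate or unit price with more than four decimal places is altered before the multiplication even happens, and the difference grows with the amount.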