I'm contributing to a project that runs SQL queries on a variety of database types, some of which I don't have access to for testing. A SUM() operation is overflowing for INT columns, and the proposed fix is to cast the column to BIGINT before the SUM: CAST(col AS BIGINT).
That works for most SQL databases. But I see that in BigQuery, INT and BIGINT are both aliases for INT64.
Do I need to worry about an error or a performance hit from this operation? Or will the CAST succeed without a performance hit, i.e., will BigQuery ignore it since the column is already that data type behind the scenes? The Druid docs indicate that's what happens in Druid.
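For concreteness, a minimal sketch of the proposed fix (the table and column names here are placeholders):

-- Cast before aggregating so the running sum is carried as a 64-bit integer.
-- my_table / col are placeholder names.
SELECT SUM(CAST(col AS BIGINT)) AS col_total
FROM my_table;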
Related
I am currently working on a transaction engine, and we are having some issues with our locking mechanism and wait times. In order to fix the issue, I am trying to simulate the process in a separate application, so I started the SQL profiler and processed a transaction. While looking at the queries, I noticed the following:
In the database my columns are defined as follows:
EVENTNBR (int),
DEPFILENBR (int),
DEPFILESEQ (int),
USERID (varchar(20)),
STATUS (varchar(1)),
CREATION_DT (datetime),
MOD_DT (datetime),
SOURCE_TYPE (varchar(100)),
SOURCE_GROUP (varchar(100)),
SOURCE_REFID (varchar(100)),
SOURCE_DATE (datetime),
CHECKSUM (varchar(50)),
However, when I look at the query in the SQL profiler, say an insert statement, somehow, somewhere, the code is sending the following:
INSERT INTO TABLE_DATA WITH (ROWLOCK) (EVENTNBR, DEPFILENBR, DEPFILESEQ, USERID, STATUS, CREATION_DT, MOD_DT, SOURCE_TYPE, SOURCE_GROUP, SOURCE_REFID, SOURCE_DATE, CHECKSUM)
@mod_dt varchar(19),
@source_date varchar(8000),
@depfilenbr varchar(7),
@eventnbr varchar(1),
@source_refid varchar(8000),
@creation_dt varchar(19),
@source_group varchar(8000),
@source_type varchar(7),
@userid varchar(7),
@checksum varchar(44),
@status varchar(1),
@depfileseq varchar(1)
Now the query is successful and works fine. I am wondering how expensive this type conversion happening on the SQL side is. If a million of these inserts are happening, does correcting this make a difference?
Thanks
Data type conversion can be super expensive under some circumstances. This is particularly true in on clauses and where clauses where the conversion impedes the use of an index.
In your case, you are simply converting between scalar types, mostly different lengths of strings and an occasional date. There is some overhead in the conversion. I am guessing, though, that the overhead for the type conversion is rather smaller than what already needs to be done for the insert -- in terms of allocating data pages, structuring the data to fit on the page, maintaining indexes, and any additional checks or triggers.
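For what it's worth, here is a tiny, made-up repro of the pattern the profiler shows (a trimmed two-column stand-in for the table, with invented values), just to see where the conversion happens: the parameters arrive as varchar and are implicitly converted to the column types as part of the insert.

-- Hypothetical, trimmed-down version of the table.
CREATE TABLE TABLE_DATA_MINI (EVENTNBR int, CREATION_DT datetime);

-- Parameters declared as varchar, as in the trace; SQL Server implicitly
-- converts them to int and datetime while performing the insert.
EXEC sp_executesql
    N'INSERT INTO TABLE_DATA_MINI (EVENTNBR, CREATION_DT) VALUES (@eventnbr, @creation_dt)',
    N'@eventnbr varchar(1), @creation_dt varchar(19)',
    @eventnbr = '7',
    @creation_dt = '2020-01-01 00:00:00';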
Just to add to that, Microsoft publishes a list of data types and their precedence for implicit conversion (e.g. if you use an int variable in a WHERE clause against a bigint column, SQL Server "implicitly converts" the int to a bigint). It's primarily in cases where you're joining or filtering on these fields that you can run into performance issues (for example, you might have something like WHERE [IntAsString] = @Int, or vice versa). If SQL Server changes the type of the variable, you're good to go. If it has to change the type of the column (again, see the data type precedence link), it will have to do a LOT of work.
Don't quote me on this, but I would say as far as inserts go, it's probably not something to worry about. But when in doubt, look at the execution plan and look for costly operators and implicit casts.
https://learn.microsoft.com/en-us/sql/t-sql/data-types/data-type-precedence-transact-sql?view=sql-server-2017
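To make that concrete, a hedged sketch with made-up table and column names: when the column side has lower precedence than the value it's compared to, the column gets converted and an index on it is much harder to use.

-- Hypothetical table: the order reference is stored as a string but compared to an int.
CREATE TABLE dbo.Orders (OrderRef varchar(20) NOT NULL);
CREATE INDEX IX_Orders_OrderRef ON dbo.Orders (OrderRef);

DECLARE @Int int = 12345;

-- int outranks varchar, so the COLUMN is implicitly converted on every row,
-- which generally prevents an index seek on IX_Orders_OrderRef.
SELECT * FROM dbo.Orders WHERE OrderRef = @Int;

-- Converting the VARIABLE instead leaves the column (and its index) untouched.
SELECT * FROM dbo.Orders WHERE OrderRef = CAST(@Int AS varchar(20));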
I have to change a column's data type from smallmoney to money, because values being added to it exceed the smallmoney limit and I can't truncate them. Could any problems occur from this? I know the table is accessed in multiple locations, but there are too many of them to go through all of them and make sure the conversion won't break anything.
Changing the data type from smallmoney to money won't affect the data already stored or the queries referring to it, provided you have not used any functions on that column in the queries. It will surely affect the code base if an ORM has been used: the data type needs to be changed everywhere in the codebase. Certainly, refactoring tools can be used for that, but from a regression perspective the operations performed on the column in code need to be revisited.
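For reference, the schema change itself is a single statement; the table and column names below are assumptions, and the NOT NULL should be adjusted to match the column's existing nullability:

-- Widen the column in place; smallmoney values convert to money without loss.
ALTER TABLE dbo.Payments ALTER COLUMN Amount money NOT NULL;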
I've worked with many databases over the last 20 years and have only run into this "interesting" type of implicit data conversion problem with SQL Server.
If I create a table with one smallint column, insert two rows with the values 1 and 2, and then run the query "Select Avg(Column) From table", I get a truncated result instead of the 1.5 that I would get from pretty much any other DB on the planet, which would automatically upsize the data type to hold the entire result rather than truncating/rounding to the column's data type. Now I know I can cast my way around this for every possible scenario, but that is not a good dynamic solution, especially for data analytics with analytics products, i.e. Cognos/MicroStrategy etc.
I am in data warehousing and have fact tables with millions of rows in them. I would love to store small columns and have proper aggregation results. My current approach to work around this nuance is to define the smallest quantifiable columns as Numeric(19,5) to account for all situations, even though these columns often only store 1 or 0, for which a tinyint would be great but will not naturally aggregate well.
Is there not any directive that tells SQL Server to do what every other DB (Oracle/DB2/Informix/Access etc.) does? Which is promote to a larger type, show the entire result, and let me do what I want with it?
You could create views on the tables which cast the smallint or tinyint to float, and only publish these views to the users. This keeps the small storage footprint. The conversion should be no overhead compared to other database systems, which must do the same thing if they use a different data type for aggregation.
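A minimal sketch of that approach, with invented table, column, and view names:

-- Hypothetical fact table keeping the narrow tinyint column on disk.
CREATE TABLE dbo.FactSales (Qty tinyint NOT NULL);
INSERT INTO dbo.FactSales VALUES (1), (2);
GO

-- Publish a view that widens the measure, so aggregates come back fractional.
CREATE VIEW dbo.vFactSales AS
SELECT CAST(Qty AS float) AS Qty
FROM dbo.FactSales;
GO

SELECT AVG(Qty) FROM dbo.FactSales;   -- 1   (integer average on the base table)
SELECT AVG(Qty) FROM dbo.vFactSales;  -- 1.5 (float average through the view)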
While it might frustrate you, a lot of programming languages also behave this way with ints: 1 / 2 will spit out 0. See:
With c++ integers, does 1 divided by 2 reliably equal 0, and 3/2 = 1, 5/2 = 2 etc.?
It's a design quirk; it'd break a lot of things if they changed it. You're asking whether you can change a fairly fundamental way SQL Server behaves, and thus potentially break anyone else's code running on the server.
Simply put, no you can't.
And you're wrong that every other DB product behaves that way; Derby also does the same thing:
http://docs.oracle.com/javadb/10.6.2.1/ref/rrefsqlj32693.html
In the Oracle docs they specifically warn you that AVG will return a float regardless of the original type. This is because every language has to make the choice: do I return the original type or the most precise answer? To stop overflows, a lot of languages chose the former, to the constant frustration of programmers everywhere.
So in SQL Server, to get a float out, put a float in.
To the best of my knowledge, the fastest way would be to do an implicit cast: SELECT AVG(Field * 1.0). You could of course do an explicit cast the same way. As far as I know, there is no way to tell SQL Server that you want integers converted to floats when you average them, and arguably that's actually correct behavior.
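A quick sketch of the truncating and non-truncating variants side by side, against a throwaway table with made-up names:

-- Hypothetical demo table with a narrow integer column.
CREATE TABLE dbo.AvgDemo (Val smallint NOT NULL);
INSERT INTO dbo.AvgDemo VALUES (1), (2);

SELECT AVG(Val)                AS avg_truncated,  -- 1   (integer average)
       AVG(Val * 1.0)          AS avg_implicit,   -- 1.5 (implicit widening via the literal)
       AVG(CAST(Val AS float)) AS avg_explicit    -- 1.5 (explicit cast)
FROM dbo.AvgDemo;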
Situation:
varchar(20) seems to truncate silently in Teradata, and not to expand or complain when encountering strings longer than 20 characters... This is a bit of a surprise, as I expected either automatic expansion of the column to fit longer strings, say 30 characters, OR an error to be thrown if a longer string were encountered. Silent truncation seems to give me the worst of all worlds...
Complication:
For my application (a prototype analytics design) I don't know in advance how large the data I will be ingesting over the course of a few weeks will be. That seems to rule out using varchar(N), except with N at its maximum.
Questions:
So now I have a few choices, and am looking for some guidance:
Q1. User error? Am I misunderstanding a key concept about varchar(N)?
If this is in fact how Teradata handles varchar fields, then
Q2. Why would anyone specify anything less than varchar(max), especially when it is not clear in advance how many characters might need to be stored in the field?
Q3. Is there a different data type that permits flexible sizing of the string -- i.e. a true variable length character string?
If I recall, other SQL dialects implement varchar(n) as a recommended initial size for the string but allow it to expand as needed to fit the maximum length of the data strings thrown in. Is there a similar data type in Teradata?
(Note: since I'm prototyping the tables, I am less concerned about performance efficiency at this point; more concerned about quick but safe designs that allow the prototype to progress.)
I am not familiar with any dialect of SQL that implements a varchar(n) that behaves as you suggest -- a recommended initial size that is then allowed to grow. This applies to Oracle, SQL Server, MySQL, and Postgres. In all these databases, varchar(n) behaves pretty much as you see it behave in Teradata in SELECT statements with explicit casts. I don't believe any of them cause a truncation error when a longer string is cast into a shorter one.
As Branko notes in his comment, the behavior is different in data modification steps, where an implicit cast does cause an error.
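A small illustration of that split, in SQL Server terms (the table name is made up):

-- Explicit cast in a SELECT: silently truncated to '12345'.
SELECT CAST('123456' AS varchar(5)) AS truncated_value;

-- Data modification: with the default ANSI_WARNINGS ON this raises
-- "String or binary data would be truncated" instead of truncating.
CREATE TABLE dbo.TruncDemo (str varchar(5));
INSERT INTO dbo.TruncDemo (str) VALUES ('123456');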
I am not familiar with all the details of Teradata. In SQL Server, there is historically a world of difference between varchar(max) and varchar(8000). The former would be allocated on a separate data page, with the latter allocated on the same page as the data. (The rules have been modified in more recent versions so varchars can spill off the data page.)
In other words, there may be other considerations when using varchar(max), involving how the data is stored on pages, how indexes are built on them, and perhaps other considerations.
My suggestion is that you pick a reasonably large size, say 1000 or so, and let the application continue from there. If you want real flexibility, then use varchar(max). You should also investigate through Teradata documentation and/or technical contacts what the issues are with declaring very large strings.
Teradata works in two modes: Teradata mode (BT; ... ET;) and ANSI mode (COMMIT;). They have a list of differences, and you've met one of them during development -- Teradata mode allows truncation of display data. On the contrary, ANSI mode forbids such truncation, so you'll see an error.
To get the idea, just use this simple example:
create table check_exec_mode (str varchar(5));
select * from check_exec_mode;
insert into check_exec_mode values ('123456');
If you configure the connection of your Teradata client (e.g., Teradata Studio Express) with TMODE (transaction mode) = TERA, then you'll end up with one truncated row in the table ('12345').
Changing the transaction mode to ANSI and executing the insert statement will instead lead you to the error "Right truncation of string data".
I've got a table called customers where the PK is an int.
Are there any performance problems or issues with joining this to a table where the field is a BIGINT? When I say joining, I mean an inner join.
I know it is bad practice to have different types, but it is not my project.
Thanks
Yes. You'll get an implicit widening conversion according to datatype precedence rules.
Any index on the int column will most likely be ignored, and given this is the PK it could perform very poorly indeed. The same applies if you explicitly CAST, too.
Unfortunately, the options are either to fix the design or to add a computed, indexed column so it's bigint JOIN bigint. If you can't change the table, then run it and see: if you have a few hundred or a few thousand rows, you may be OK. If it's millions, you're what's technically known as bollixed.
Sounds like there is nothing you can do about it (except copy the data to a table with matching data types), so I'm not sure what to say. It will have an impact on performance, but probably not nearly as bad as converting from varchar or double.
If your primary key is int and this bigint is meant to be a foreign key to it, then the bigint should never hold anything outside the range of int, so casting the bigint down to int (instead of upcasting from int to bigint) is never going to produce a problem.
Also, depending upon your query and its execution plan, the performance hit might be minimized: it hinges on things like the direction of the join, inner vs. outer join, cardinality/statistics, which indexes are available, etc.
My advice would be to do the conversion explicitly: add a computed column that is nothing but the bigint value converted to int, and put an index on it to resolve any performance issues.
Refer to the following link -
http://www.sqlservercentral.com/scripts/T-SQL+Aids/31906/
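A rough sketch of that computed-column idea, with assumed table and column names (it also assumes the bigint values always fit in an int, which should hold if they reference the int PK):

-- Hypothetical child table with a bigint reference to customers(id int).
ALTER TABLE dbo.orders
    ADD customer_id_int AS CAST(customer_id AS int) PERSISTED;

CREATE INDEX IX_orders_customer_id_int ON dbo.orders (customer_id_int);

-- Join on matching types so the index on the computed column can be used.
SELECT c.*
FROM dbo.customers AS c
INNER JOIN dbo.orders AS o
    ON o.customer_id_int = c.id;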
I have no idea, as I have never built a system like that! All I can suggest is to clone the tables, profile the system, change the data types to match, and then re-profile.
However, my gut feeling is that there will be some overhead in converting types. So perhaps you could also mock this up the other way around: profile a known system (say PK int to int), then insert an explicit cast (say PK int to int-cast-as-bigint) in the join clause and see what happens.