Background: The application I'm working on is not using any character delimiters. Fields are fixed length. Alphanumeric fields have to be left justified and space filled to the right, and numeric fields are right justified and zero-filled to the left.
I've been trying to accomplish this by using the RPAD and LPAD functions. The problem I'm running into is the error Teradata is displaying, "Response Row size or Constant Row size overflow". Each record is 4000 bytes, and (from what I've read) the maximum size for each record in Teradata is 64KB, so I'm well under the maximum Teradata-allowed length.
Here is a small sample of the code that is generating the error:
SELECT
RPAD(t1.MemberNbr, 20, ' ') AS MemberNbr
,RPAD(t1.LastName, 35, ' ') AS LastName
,RPAD(t1.FirstName, 25, ' ') AS FirstName
,CAST(t1.B_Day AS DATE FORMAT 'YYYYMMDD') (char(8)) AS BirthDay
FROM someTable AS t1
Can anyone explain to me why this isn't working? Thanks
When you check the resulting data type (SELECT TYPE(RPAD(t1.MemberNbr, 20, ' '))), you will notice it's either VARCHAR(32000) CHARACTER SET UNICODE or VARCHAR(64000) CHARACTER SET LATIN. A few columns of that size are enough to exceed the row size limit, so you need to reduce it with a cast:
CAST(RPAD(t1.MemberNbr, 20, ' ') AS CHAR(20))
I know it's stupid, but RPAD and LPAD are not built-in functions; they are FastPath UDFs, so the parser/optimizer doesn't seem to know the actual result size. (Other UDFs, e.g. LTRIM/RTRIM, are fine in this respect.)
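For checking the exported file outside Teradata, the fixed-length layout described in the question can be sketched in Python. This is only an illustration under assumptions: the helper names and the sample values are made up, and the field widths are the ones from the sample query.

```python
def alnum_field(value, width):
    """Left-justify and space-fill to the right, truncating to the width."""
    return str(value)[:width].ljust(width, " ")


def numeric_field(value, width):
    """Right-justify and zero-fill to the left (no truncation is attempted)."""
    return str(value).rjust(width, "0")


def build_record(member_nbr, last_name, first_name, birth_day):
    """Concatenate the four fields of the sample query into one record."""
    return (
        alnum_field(member_nbr, 20)
        + alnum_field(last_name, 35)
        + alnum_field(first_name, 25)
        + numeric_field(birth_day, 8)
    )


# Hypothetical sample values, not taken from the original table.
record = build_record("M123", "Smith", "Ann", 19800115)
print(len(record))  # 88 = 20 + 35 + 25 + 8
```

Every record comes out at the same length, which is the property the CAST to CHAR(n) restores on the Teradata side.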
I have a Cast Procedure for a table with "raw" data. Any time a record comes from any of our locations into the raw table, my procedure "cleans" the data and loads it into a new table. The original raw table is all varchars and my procedure converts date and number fields to the proper data types. From the clean table, a Java program selects any new records on a daily basis and FTPs them off in a file to another dept. Have just learned that a few of the fields accept input from users and on a rare occasion, someone uses a pipe in what they input. A pipe symbol happens to be the delimiter that the other dept is using and whenever a pipe shows up in the middle of a field, it throws a wrench on their end.
I've never used REGEX or REGEXP_REPLACE in Oracle before. There are only three fields where the users can input data - MISTINTCOMMENT, PALETTE, COLORID. How do I use REGEX or REGEXP_REPLACE to replace any pipes with a space? Do I want to do it on each field? Or is this something I should "wrap around" the entire statement (in case there's a field I missed where someone might be able to input a pipe)?
Here is the portion of the procedure where the Values are cleaned and inserted into new table. How to best use RegEx with this?
VALUES (CASE
WHEN THECOSTCENTER IS NOT NULL
THEN THECOSTCENTER
ELSE (SUBSTR(TRIM(THESENDING_QMGR), -6))
END,
CASE
WHEN THESTORENBR = '0' AND (SUBSTR(THESENDING_QMGR, 1, 5) = 'PDPOS')
THEN TO_NUMBER(SUBSTR(THESENDING_QMGR, 8, 4))
WHEN THESTORENBR = '0' AND (SUBSTR(THESENDING_QMGR, 1, 8) = 'PROD_POS')
THEN TO_NUMBER(SUBSTR(THESENDING_QMGR, 9, 4))
ELSE TO_NUMBER(NVL(THESTORENBR,'0'))
END,
TO_NUMBER(NVL(THECONTROLNBR,'0')), TO_NUMBER(NVL(THELINENBR,'0')), THESALESNBR, TO_NUMBER(NVL(THEQTYMISTINT,'0')), THEREASONCODE, THEMISTINTCOMMENT,
THESIZECODE, THETINTERMODEL, THETINTERSERIALNBR, TO_NUMBER(NVL(THEEMPNBR,'0')), TO_DATE(THETRANDATE,'YYYY-MM-DD'), THETRANTIME, THECDSADLFLD,
THEPRODNBR, THEPALETTE, THECOLORID, TO_DATE(THEINITTRANDATE,'YYYY-MM-DD'), TO_NUMBER(NVL(THEGALLONSMISTINTED,'0'),'999999999.99'), THEUPDATEEMPNBR,
TO_DATE(THEUPDATETRANDATE,'YYYY-MM-DD'), TO_NUMBER(NVL(THEGALLONS,'0'),'999999999.99'), THEFORMSOURCE, THEUPDATETRANTIME, THESOURCEIND,
TO_DATE(THECANCELDATE,'YYYY-MM-DD'), THECOLORTYPE, TO_NUMBER(NVL(THECANCELEMPNBR,'0')), TO_BOOLEAN(THENEEDEXTRACTED), TO_BOOLEAN(THEMISTINTMQXTR),
THEDATASOURCE, THETRANGUID, TO_NUMBER(NVL(THETERMNBR,'0')), TO_NUMBER(NVL(THETRANNBR,'0')), TO_NUMBER(NVL(THETRANID,'0')), THEID, THETINTABLESALESNBR,
TO_NUMBER(NVL(THERETURNQTY,'0')), THECREATED_TS, THEXMIT_GUID, THESENDING_QMGR, THEMSG_ID, THEPUT_TS,
THEBROKER_NAME, THECHECKSUM);
If you have to use a REGEXP_REPLACE to replace pipes, escape them:
REGEXP_REPLACE(x, '\|', ' ')
This is useful to know when your more complex expressions include a pipe.
In this case, REPLACE, which performs a literal text search and replace, will suffice:
REPLACE(x, '|', ' ')
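The difference between the escaped regular expression and the literal replace can be tried out in Python, whose regex syntax also treats an unescaped pipe as alternation. This is a rough sketch, not the Oracle procedure itself; the function names and the sample row are hypothetical, only the three column names come from the question.

```python
import re


def strip_pipes_regex(text):
    """Regex variant: the pipe must be escaped to match a literal '|'."""
    return re.sub(r"\|", " ", text)


def strip_pipes_literal(text):
    """Literal variant: no escaping needed, like REPLACE(x, '|', ' ')."""
    return text.replace("|", " ")


# Hypothetical row; only the user-editable fields are cleaned.
row = {"MISTINTCOMMENT": "wrong|base", "PALETTE": "A|B", "COLORID": "123"}
cleaned = {name: strip_pipes_literal(value) for name, value in row.items()}
print(cleaned["MISTINTCOMMENT"])  # wrong base

# An unescaped '|' is an alternation of two empty patterns, so it only
# matches empty strings and leaves the literal pipe in place.
unescaped = re.sub("|", " ", "a|b")
```

Applying the replacement per field, as in the dictionary comprehension, keeps the change limited to the columns that users can actually type into.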
I have the following test table in SQL Server 2005:
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] NOT NULL,
[TestField] [varchar](100) NOT NULL
)
Populated with:
INSERT INTO TestTable (ID, TestField) VALUES (1, 'A value'); -- Len = 7
INSERT INTO TestTable (ID, TestField) VALUES (2, 'Another value      '); -- Len = 13 + 6 spaces
When I try to find the length of TestField with the SQL Server LEN() function it does not count the trailing spaces - e.g.:
-- Note: The results grid view also does not show the trailing spaces of TestField (SQL Server 2005).
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField -- Does not include trailing spaces
FROM
TestTable
How do I include the trailing spaces in the length result?
This is clearly documented by Microsoft in MSDN at http://msdn.microsoft.com/en-us/library/ms190329(SQL.90).aspx, which states LEN "returns the number of characters of the specified string expression, excluding trailing blanks". It is, however, an easy detail to miss if you're not wary.
You need to instead use the DATALENGTH function - see http://msdn.microsoft.com/en-us/library/ms173486(SQL.90).aspx - which "returns the number of bytes used to represent any expression".
Example:
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField, -- Does not include trailing spaces
DATALENGTH(TestField) As DataLengthOfTestField -- Shows the true length of data, including trailing spaces.
FROM
TestTable
You can use this trick:
LEN(Str + 'x') - 1
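Why this works can be seen with a small Python model. The function sql_len below is a hypothetical stand-in that only mimics the documented "excluding trailing blanks" behaviour of LEN; it is not how SQL Server implements it.

```python
def sql_len(value):
    """Hypothetical model of LEN: count characters, ignoring trailing spaces."""
    return len(value.rstrip(" "))


def len_with_trailing(value):
    """Append a non-space sentinel so the spaces are no longer trailing,
    then subtract the sentinel again."""
    return sql_len(value + "x") - 1


padded = "Another value" + " " * 6
print(sql_len(padded))            # 13
print(len_with_trailing(padded))  # 19
```

The model ignores column length limits, so it does not show the truncation issue that can occur when the string is already at its maximum declared length.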
I use this method:
LEN(REPLACE(TestField, ' ', '.'))
I prefer this over DATALENGTH because this works with different data types, and I prefer it over adding a character to the end because you don't have to worry about the edge case where your string is already at the max length.
Note: I would test the performance before using it against a very large data set; though I just tested it against 2M rows and it was no slower than LEN without the REPLACE...
"How do I include the trailing spaces in the length result?"
You get someone to file a SQL Server enhancement request/bug report, because nearly all the workarounds listed here for this amazingly simple issue have some deficiency or are inefficient. This still appears to be true in SQL Server 2012. The auto-trimming feature may stem from ANSI/ISO SQL-92, but there seem to be some holes (or a lack of counting them).
Please vote up "Add setting so LEN counts trailing whitespace" here:
https://feedback.azure.com/forums/908035-sql-server/suggestions/34673914-add-setting-so-len-counts-trailing-whitespace
Retired Connect link:
https://connect.microsoft.com/SQLServer/feedback/details/801381
There are problems with the two top voted answers. The answer recommending DATALENGTH is prone to programmer errors. The result of DATALENGTH must be divided by 2 for NVARCHAR types, but not for VARCHAR types. This requires knowledge of the type you're getting the length of, and if that type changes, you have to diligently change the places where you used DATALENGTH.
There is also a problem with the most upvoted answer (which I admit was my preferred way to do it until this problem bit me). If the thing you are getting the length of is of type NVARCHAR(4000), and it actually contains a string of 4000 characters, SQL will ignore the appended character rather than implicitly cast the result to NVARCHAR(MAX). The end result is an incorrect length. The same thing will happen with VARCHAR(8000).
What I've found works, is nearly as fast as plain old LEN, is faster than LEN(@s + 'x') - 1 for large strings, and does not assume the underlying character width, is the following:
DATALENGTH(@s) / DATALENGTH(LEFT(LEFT(@s, 1) + 'x', 1))
This gets the datalength, and then divides by the datalength of a single character from the string. The append of 'x' covers the case where the string is empty (which would otherwise give a divide by zero). This works whether @s is VARCHAR or NVARCHAR. Doing the LEFT of 1 character before the append shaves some time when the string is large. The problem with this, though, is that it does not work correctly with strings containing surrogate pairs.
There is another way mentioned in a comment to the accepted answer, using REPLACE(@s,' ','x'). That technique gives the correct answer, but is a couple of orders of magnitude slower than the other techniques when the string is large.
Given the problems introduced by surrogate pairs on any technique that uses DATALENGTH, I think the safest method I know of that gives correct answers is the following:
LEN(CONVERT(NVARCHAR(MAX), @s) + 'x') - 1
This is faster than the REPLACE technique, and much faster with longer strings. Basically this technique is the LEN(@s + 'x') - 1 technique, but with protection for the edge case where the string has a length of 4000 (for nvarchar) or 8000 (for varchar), so that the correct answer is given even for that. It also should handle strings with surrogate pairs correctly.
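The byte-count pitfalls described above can be reproduced in Python as a rough analogy. The two helper functions are hypothetical: they assume NVARCHAR data is stored as UTF-16 and model VARCHAR with Latin-1, which stands in for whatever single-byte code page the column actually uses.

```python
def datalength_nvarchar(value):
    """Bytes used when the text is encoded as UTF-16 (2 bytes per code unit)."""
    return len(value.encode("utf-16-le"))


def datalength_varchar(value):
    """Bytes used in a single-byte code page (Latin-1 is an assumption)."""
    return len(value.encode("latin-1"))


text = "abc  "  # three letters plus two trailing spaces
print(datalength_varchar(text))         # 5
print(datalength_nvarchar(text))        # 10, so it must be divided by 2
print(datalength_nvarchar(text) // 2)   # 5

# A character outside the Basic Multilingual Plane is stored as a
# surrogate pair, so "bytes divided by 2" counts it twice.
emoji = "\U0001F600"
print(datalength_nvarchar(emoji) // 2)  # 2, although it is one character
```

The last line is the surrogate-pair case: any formula built on a byte count and a fixed width per character overcounts such strings.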
LEN cuts trailing spaces by default, so I found that reversing the string works, because it moves them to the front:
LEN(REVERSE(TestField))
So if you wanted to, you could say
SELECT
t.TestField,
LEN(REVERSE(t.TestField)) AS [Reverse],
LEN(t.TestField) AS [Count]
FROM TestTable t
WHERE LEN(REVERSE(t.TestField)) <> LEN(t.TestField)
Don't use this for leading spaces of course.
You also need to ensure that your data is actually saved with the trailing blanks. When ANSI_PADDING is OFF (non-default):
"Trailing blanks in character values inserted into a varchar column are trimmed."
You should define a CLR function that returns the String's Length field, if you dislike string concatenation.
I use LEN('x' + @string + 'x') - 2 in my production use-cases.
If you dislike DATALENGTH because of n/varchar concerns, how about:
select DATALENGTH(@var)/isnull(nullif(DATALENGTH(left(@var,1)),0),1)
which is just
select DATALENGTH(@var)/DATALENGTH(left(@var,1))
wrapped with divide-by-zero protection.
By dividing by the DATALENGTH of a single char, we get the length normalised.
(Of course, still issues with surrogate-pairs if that's a concern.)
This is the best algorithm I've come up with which copes with the maximum length and variable byte count per character issues:
ISNULL(LEN(STUFF(@Input, 1, 1, '') + '.'), 0)
This is a variant of the LEN(@Input + '.') - 1 algorithm, but by using STUFF to remove the first character we ensure that the modified string doesn't exceed the maximum length, and we remove the need to subtract 1.
ISNULL(..., 0) is added to deal with the case where @Input = '', which causes STUFF to return NULL.
This does have the side effect that the result is also 0 when @Input is NULL, which is inconsistent with LEN(NULL), which returns NULL, but this could be dealt with by logic outside this function if need be.
Here are the results using LEN(@Input), LEN(@Input + '.') - 1, LEN(REPLACE(@Input, ' ', '.')) and the above STUFF variant, using a sample of @Input = CAST(' S' + SPACE(3998) AS NVARCHAR(4000)) over 1000 iterations:
Algorithm   DataLength  ExpectedResult  Result  ms
----------  ----------  --------------  ------  ---
LEN         8000        4000            2       14
+DOT-1      8000        4000            1       13
REPLACE     8000        4000            4000    514
STUFF+DOT   8000        4000            4000    0
In this case the STUFF algorithm is actually faster than LEN()!
I can only assume that internally SQL looks at the last character and if it is not a space then optimizes the calculation
But that's a good result eh?
Don't use the REPLACE option unless you know your strings are small - it's hugely inefficient
use
SELECT DATALENGTH('string ')
I want to extract a specific part of column values.
The target column and its values look like
TEMP_COL
---------------
DESCOL 10MG
TEGRAL 200MG 50S
COLOSPAS 135MG 30S
The resultant column should look like
RESULT_COL
---------------
10MG
200MG
135MG
This can be done using a regular expression:
SELECT regexp_substr(TEMP_COL, '[0-9]+MG')
FROM the_table;
Note that this is case sensitive and it always returns the first match.
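The same pattern can be checked quickly in Python against the three sample values from the question. This is a small sketch only; first_dose is a hypothetical helper name, and Python's re module is used here in place of Oracle's regexp_substr.

```python
import re

# Same pattern as in the query: one or more digits followed by "MG".
DOSE = re.compile(r"[0-9]+MG")


def first_dose(text):
    """Return the first match, or None when nothing matches."""
    match = DOSE.search(text)
    return match.group(0) if match else None


rows = ["DESCOL 10MG", "TEGRAL 200MG 50S", "COLOSPAS 135MG 30S"]
print([first_dose(row) for row in rows])  # ['10MG', '200MG', '135MG']
print(first_dose("descol 10mg"))          # None, the match is case sensitive
```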
I would probably approach this using REGEXP_SUBSTR() rather than base functions, because the structure of the prescription text varies from record to record.
SELECT TRIM(REGEXP_SUBSTR(TEMP_COL, '(\s)(\S*)', 1, 1))
FROM yourTable
The pattern (\s)(\S*) will match a single space followed by any number of non-space characters. This should match the second term in all cases. We use TRIM() to remove a leading space which is matched and returned.
How do you know which part you want to extract? How do you know where it begins and where it ends? By the white-spaces?
If so, you can use substr for cutting the data and instr for finding the white-spaces.
Example:
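select substr(tempcol, -- string
instr(tempcol, ' ', 1) + 1, -- first character after the first white-space
instr(tempcol, ' ', 1, 2) - instr(tempcol, ' ', 1) - 1) -- number of characters before the next white-space
-- note: this returns NULL when there is no second white-space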
from dual
another solution is using regexp_substr (but it might be harder on performance if you have a lot of rows):
SELECT REGEXP_SUBSTR (tempcol, '(\S*)(\s*)', 1, 2)
FROM dual;
edit: fixed the regular expression to include expressions that don't have space after the parsed text. sorry about that.. ;)
I'm trying to remove hidden characters from a varchar column. These hidden characters (i.e. period, space) were taken from a scanned bar code and are not visible in the result set once the query is executed. I have tried the script below, but it failed to remove the hidden characters (see attached screenshot for reference).
Any help is highly appreciated.
SELECT Replace(Replace(LTrim(RTrim(mycolumn)), '.', ''), ' ', '')
FROM MyTable
WHERE serialno = '123456789'
One thing that has worked for me is to select the column with the special characters, then paste the data into notepad++ then turn on View>Show Symbol>Show All Characters. Then I could copy the special characters from Notepad++ into the second argument of the REPLACE() function in SQL.
I am using Oracle 11g. I am using the Scott account and the demo EMP table. I inserted one record with ENAME BRUCE WILLIAM. My aim is to show the first name and last name in two columns. I used this code:
select trim(rpad(ename, instr(ename,' '))) "First Name",
trim(substr(ename, instr(ename,' '))) "Last Name"
from emp;
This gives a weird result. The First Name is extended to second line. I used
select trim(substr(ename, 1, instr(ename,' '))),
trim(substr(ename, instr(ename, ' ')))
from emp;
I got the expected output. My question is why the first line of query is giving extra spaces?
You are not getting extra spaces in your string, and if you were then the trim() would remove them again. SQL*Plus is just formatting the results in a way you don't expect. The documentation mentions the default formatting for column types, and can usually work it out for system functions (though the characterset can make it bigger than you expect).
It seems like SQL*Plus, and SQL Developer, can't determine a sensible default for your rpad case, but can for your substr. Well, SQL*Plus is really just getting a result set cursor from the database, and using the cursor metadata to determine the default widths to apply to the fields for display, so it isn't getting the length you expect from that metadata. But what length should it use?
The database only knows how big the rpad value can be if the padding length is a simple value - it doesn't even mind zero (it returns null, which you're relying on). If the padding length is determined by a function then there's no way to tell how big the result could be, apart from calculating it for every value in the result set before returning the metadata and the actual data, which isn't practical, and would produce inconsistent output as the data changed.
It also wouldn't be practical to try to determine a theoretical maximum, even though it looks superficially straightforward in your case. substr can't ever return something longer than the original value; but rpad could potentially produce something huge even from a short input value, so it has to allow for that possibility if it can't easily determine a limit (i.e. from a fixed value).
So it plays safe and allows for it being up to the maximum length for a varchar2, which is 4000 characters, as this dynamic SQL demonstrates:
declare
l_curid integer;
l_desctab dbms_sql.desc_tab3;
l_colcnt integer;
begin
l_curid := dbms_sql.open_cursor;
dbms_sql.parse(l_curid, 'select rpad(ename, instr(ename,'' '')), '
|| 'rpad(ename, 4), '
|| 'substr(ename, 1, instr(ename,'' '')) '
|| 'from emp where ename like ''B%''' , dbms_sql.native);
dbms_sql.describe_columns3(l_curid, l_colcnt, l_desctab);
for i in 1 .. l_colcnt loop
dbms_output.put_line('column ' || i
|| ' ' || l_desctab(i).col_name
|| ' type ' || l_desctab(i).col_type
|| ' length ' || l_desctab(i).col_max_len
);
end loop;
dbms_sql.close_cursor(l_curid);
end;
/
column 1 RPAD(ENAME,INSTR(ENAME,'')) type 1 length 4000
column 2 RPAD(ENAME,4) type 1 length 16
column 3 SUBSTR(ENAME,1,INSTR(ENAME,'')) type 1 length 40
As you can see, it knows the length for a fixed-length rpad and a substr (note the size is four times the actual string length due to the multibyte characterset), but falls back to the maximum for the rpad using a function.
What you're seeing is SQL*Plus showing a 4000-char column. If you did this in SQL Developer you would see the header for that column is indeed 4000 characters. SQL*Plus helps a bit by reducing the displayed column header to the line size, and wraps the next column onto a separate line.
lpad ('string', n [, 'string_pad'])
rpad ('string', n [, 'string_pad'])
string is left padded to length n with string_pad. If string_pad is omitted, a space will be used as default.
rpad is similar, but pads right instead of left.
from http://www.adp-gmbh.ch/ora/sql/rpad.html
Here is a good example for understanding it:
begin
for i in 1 .. 15 loop
dbms_output.put_line(
rpad('string', i) || '<'
);
end loop;
end;
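What the loop prints can be previewed with a Python model of rpad. This is a sketch under assumptions: the function below is a hypothetical re-implementation that pads on the right and, when the requested length is shorter than the string, truncates to that length, returning None for a non-positive length to stand in for NULL.

```python
def rpad(text, n, pad=" "):
    """Model of rpad: pad on the right to n characters, or truncate to n."""
    if n <= 0:
        return None  # stands in for NULL
    if len(text) >= n:
        return text[:n]
    # Repeat the pad string and cut it to exactly the missing length.
    fill = (pad * n)[: n - len(text)]
    return text + fill


# Mirrors the PL/SQL loop: the '<' marks where the padded value ends.
for i in range(1, 16):
    print(rpad("string", i) + "<")
```

For i below 6 the value is cut short ("s<", "st<", ...), at 6 it is unchanged, and from 7 onward the '<' moves one column to the right per iteration.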