I have a CHAR column that contains messy OCR'd scans of printed integers.
I need to run SUM() on that column, but I'm unable to cast it properly.
-- Good
sqlite> select CAST("123" as integer);
123
-- No good, should be 323999
sqlite> select CAST("323,999" as integer);
323
I believe SQLite interprets the comma as marking the end of "the longest possible prefix of the value that can be interpreted as an integer number".
I'd prefer to avoid the agony of writing Python scripts to clean this column. Is there any clever way to do it strictly in SQL?
If you are trying to ignore commas, then remove them before the conversion:
select cast(replace('323,999', ',', '') as integer)
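Applied to the SUM() use case from the question, a minimal sketch might look like this (the table name scans and column name amount_text are placeholders, not from the question):
-- Strip commas, cast to integer, then let SUM() aggregate the results.
-- 'scans' and 'amount_text' are hypothetical names.
SELECT SUM(CAST(REPLACE(amount_text, ',', '') AS INTEGER)) AS total
FROM scans;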
Related
I received a CSV file with a column called "Amount", which should end up as a MONEY type in my table.
The first step I took was loading the CSV file as is, so my table uses a string type for that Amount column, because I know it will not be formatted as money at the source. Due to some spaces in that column, I can't convert from NVARCHAR to MONEY.
Here's the initial table structure:
CREATE TABLE #TestReplace (
Amount NVARCHAR(100)
)
Here's an example to what the client inserted as value for the column:
INSERT INTO #TestReplace VALUES('2 103.74')
Because there is a space in that string, I need to remove it so I can convert the value to the MONEY type.
However, if I try the REPLACE SQL function, nothing happens; it's as if the value does not change:
SELECT REPLACE(Amount, ' ','') FROM #TestReplace
Amount after the replace command is still: 2 103.74
Am I missing something that does not catch the space after the number 2? Is there a better way to remove that space and convert from NVARCHAR to MONEY?
Appreciate all the help!
You have a character that is not a space but looks like one. If you are using 8-bit ASCII characters, you can determine what the value is using:
select ascii(substring(amount, 2, 1))
If this is an nvarchar() (as in your example):
select unicode(substring(amount, 2, 1))
Once you know what the character is, you can replace it.
To add to Gordon's answer, once you know the integer ASCII/Unicode value, you can use that in the replace function with the CHAR() and NCHAR() functions like so:
--For ASCII:
REPLACE(Amount, CHAR( /*int value*/ ), '')
--For Unicode:
REPLACE(Amount, NCHAR( /*int value*/ ), '')
I have a column xy of type NVARCHAR2(40) in table ABC.
The column consists mainly of numerical strings.
How can I write a
select to_number(trim(xy)) from ABC
query that ignores the non-numerical strings?
In general in relational databases, the order of evaluation is not defined, so it is possible that the select functions are called before the where clause filters the data. I know this is the case in SQL Server. Here is a post that suggests that the same can happen in Oracle.
The case statement, however, does cascade, so it is evaluated in order. For that reason, I prefer:
select (case when NOT regexp_like(xy,'[^[:digit:]]') then to_number(xy)
end)
from ABC;
This will return NULL for values that are not numbers.
You could use regexp_like to find out if it is a number (with/without plus/minus sign, decimal separator followed by at least one digit, thousand separators in the correct places if any) and use it like this:
SELECT TO_NUMBER( CASE WHEN regexp_like(xy,'.....') THEN xy ELSE NULL END )
FROM ABC;
However, as the built-in function TO_NUMBER is not able to deal with all numbers (it fails at least when a number contains thousand separators), I would suggest to write a PL/SQL function TO_NUMBER_OR_DEFAULT(numberstring, defaultnumber) to do what you want.
EDIT: You may want to read my answer on using regexp_like to determine if a string contains a number here: https://stackoverflow.com/a/21235443/2270762.
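As an illustration only (the '.....' above is a placeholder), a simple pattern for an optional sign, digits, and an optional decimal part might look like this; it deliberately does not handle thousand separators:
SELECT TO_NUMBER( CASE WHEN regexp_like(xy, '^[-+]?[0-9]+(\.[0-9]+)?$') THEN xy ELSE NULL END )
FROM ABC;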
You can add a WHERE clause:
SELECT TO_NUMBER(TRIM(xy)) FROM ABC WHERE REGEXP_INSTR(xy, '[A-Za-z]') = 0
The WHERE clause skips rows whose xy contains letters. See the documentation for REGEXP_INSTR.
I want to be able to differentiate between a string that is alphanumeric and a string that is in hex format.
My current query is:
<columnName> LIKE '?_____=' + REPLICATE('[0-9A-Fa-f]',16)
I found this method of searching for hex IDs online and I thought it was working. However, after getting a significantly larger sample size, I can see a high false positive rate in my results. The problem is that this gives me all the results I do want, but it also gives me a bunch of results I don't care about. For example:
I want to see:
<url>.php?mains=d7ad916d1c0396ff
but i dont want to see:
<url>.php?mblID=2007012422060265
The difference between the two strings is that the 16 characters at the end of the second one are all numeric, rather than a hex ID like the ones I want to collect. What are some ways you use to limit the results to hex IDs only? Thanks in advance.
UPDATE:
Juergen brought up a good point: the second number could be a hex value too, since not all hex numbers contain [a-f]. I would like to rephrase the question to state that I am looking for an ID with both letters and numbers in it, not just numbers.
The simplest way is just to add a separate clause for that restriction:
<columnName> LIKE '?_____=' + REPLICATE('[0-9A-Fa-f]',16)
AND <columnName> NOT LIKE '?_____=' + REPLICATE('[0-9]',16)
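In a full query this might look like the following sketch; the table name urls and column name url are made up, and a leading % is assumed so the pattern can match anywhere in the string:
SELECT url
FROM urls
WHERE url LIKE '%?_____=' + REPLICATE('[0-9A-Fa-f]', 16)
  AND url NOT LIKE '%?_____=' + REPLICATE('[0-9]', 16);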
It should be fairly simple to determine if a string contains only numbers...
Setting up a test table:
CREATE TABLE #Temp (Data char(32) not null)
INSERT #Temp
values ('<url>.php?mains=d7ad916d1c0396ff')
,('<url>.php?mblID=2007012422060265 ')
Write a query:
SELECT
right(Data, 16) StringToCheck
,isnumeric(right(Data, 16)) IsNumeric
from #Temp
Get results:
StringToCheck IsNumeric
d7ad916d1c0396ff 0
2007012422060265 1
So, if the IsNumeric function returns 0, it could be a hex string.
This makes several assumptions:
The rightmost 16 characters are what you want to check
You only ever hit 16 characters. I don't know when the string would get too long to check.
A non-numeric character means hex. Any chance of "Q" or "~" being embedded in the string?
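If the concern in the last point matters (stray characters like "Q" or "~"), a stricter check is possible with LIKE character classes instead of ISNUMERIC. This is just a sketch building on the same #Temp table:
SELECT
    right(Data, 16) AS StringToCheck
FROM #Temp
WHERE right(Data, 16) NOT LIKE '%[^0-9A-Fa-f]%'  -- only hex characters allowed
  AND right(Data, 16) LIKE '%[A-Fa-f]%'          -- and at least one letter, per the update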
We are trying to load a file created by FastExport into an Oracle database.
However, the float column is being exported like this: 1.47654345670000000000 E010.
How do you configure SQL*Loader to import a value in that format?
I expect the control file to look something like this:
OPTIONS(DIRECT=TRUE, ROWS=20000, BINDSIZE=8388608, READSIZE=8388608)
UNRECOVERABLE LOAD DATA
infile 'data/SOME_FILE.csv'
append
INTO TABLE SOME_TABLE
fields terminated by ','
OPTIONALLY ENCLOSED BY '"' AND '"'
trailing nullcols (
FLOAT_VALUE CHAR(38) "???????????????????",
FILED02 CHAR(5) "TRIM(:FILED02)",
FILED03 TIMESTAMP "YYYY-MM-DD HH24:MI:SS.FF6",
FILED04 CHAR(38)
)
I tried to_number('1.47654345670000000000 E010', '9.99999999999999999999 EEEE')
Error: ORA-01481: invalid number format model
I tried to_number('1.47654345670000000000 E010', '9.99999999999999999999EEEE')
Error: ORA-01722: invalid number
These are the solutions I came up with in order of preference:
to_number(replace('1.47654345670000000000 E010', ' ', ''))
to_number(TRANSLATE('1.47654345670000000000 E010', '1 ', '1'))
I would like to know if there are any better performing solutions.
As far as I'm aware there is no way to have to_number ignore the space, and nothing you can do in SQL*Loader to prepare it. If you can't remove it by pre-processing the file, which you've suggested isn't an option, then you'll have to use a string function at some point. I wouldn't expect it to add a huge amount of processing, above what to_number will do anyway, but I'd always try it and see rather than assuming anything - avoiding the string functions sounds a little like premature optimisation. Anyway, the simplest is possibly replace:
select to_number(replace('1.47654345670000000000 E010',' ',''),
'9.99999999999999999999EEEE') from dual;
or just for display purposes:
column num format 99999999999
select to_number(replace('1.47654345670000000000 E010',' ',''),
'9.99999999999999999999EEEE') as num from dual
NUM
------------
14765434567
You could define your own function to simplify the control file slightly, but I'm not sure it'd be worth it.
Two other options come to mind. (a) Load into a temporary table as a varchar, and then populate the real table using the to_number(replace()); but I doubt that will be any improvement in performance and might be substantially worse. Or (b) if you're running 11g, load into a varchar column in the real table, and make your number column a virtual column that applies the functions.
Actually, a third option... don't use SQL*Loader at all, but use the CSV file as an external table, and populate your real table from that (a rough sketch follows below). You'll still have to do the to_number(replace()), but you might see a difference in performance over doing it in SQL*Loader. The difference could be that it's worse, of course, but it might be worth trying.
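A sketch of the external-table route, assuming an Oracle directory object named DATA_DIR pointing at the data directory; the staging table name and the column sizes are guesses based on the control file above:
CREATE TABLE some_file_ext (
  FLOAT_VALUE VARCHAR2(38),
  FILED02     VARCHAR2(5),
  FILED03     VARCHAR2(30),
  FILED04     VARCHAR2(38)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DATA_DIR
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('SOME_FILE.csv')
);

-- Populate the real table, doing the replace/to_number on the way in.
INSERT INTO SOME_TABLE
SELECT to_number(replace(FLOAT_VALUE, ' ', ''), '9.99999999999999999999EEEE'),
       trim(FILED02),
       to_timestamp(FILED03, 'YYYY-MM-DD HH24:MI:SS.FF6'),
       FILED04
FROM some_file_ext;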
Change the number display width with "set numw" in SQL*Plus:
select num from blabla;
-- result: 1,0293E+15
set numw 20
select num from blabla;
-- result: 1029301200000021
Here is the solution I went with:
OPTIONS(DIRECT=TRUE, ROWS=20000, BINDSIZE=8388608, READSIZE=8388608)
UNRECOVERABLE LOAD DATA
infile 'data/SOME_FILE.csv'
append
INTO TABLE SOME_TABLE
fields terminated by ','
OPTIONALLY ENCLOSED BY '"' AND '"'
trailing nullcols (
FLOAT_VALUE CHAR(38) "REPLACE(:FLOAT_VALUE,' ','')",
FILED02 CHAR(5) "TRIM(:FILED02)",
FILED03 TIMESTAMP "YYYY-MM-DD HH24:MI:SS.FF6",
FILED04 CHAR(38)
)
In my solution the conversion to a number is implicit:
"REPLACE(:FLOAT_VALUE,' ','')"
In Oracle 11g there is no need to convert the number specially.
Just use integer external in the .ctl-file:
I tried the following in my Oracle DB:
The field MYNUMBER has type NUMBER.
Inside the .ctl file I used the following definition:
MYNUMBER integer external
In the data file the value for MYNUMBER is: -1.61290E-03
As for the result, sqlldr loaded the scientific notation correctly; the MYNUMBER field contains -0.00161290.
I am not sure if it's a bug or a feature; but it works in Oracle 11g.
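For context, a minimal control-file fragment using that definition might look like this; the file name, table name, and delimiter are only assumptions for the sketch:
LOAD DATA
infile 'data/some_file.csv'
append
INTO TABLE some_table
fields terminated by ','
trailing nullcols (
MYNUMBER integer external
)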
Is it possible to order result rows by a varchar column cast to integer in Postgres 8.3?
It's absolutely possible.
ORDER BY varchar_column::int
Be sure to have valid integer literals in your varchar column for each entry or you get an exception invalid input syntax for integer. Leading and trailing white space is ok - that's trimmed automatically.
If that's the case, though, then why not convert the column to integer to begin with? Smaller, faster, cleaner, simpler.
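If the column really does contain only clean integer literals, that one-time conversion could be a sketch like this (tbl is a placeholder table name; the USING clause rewrites the stored values):
ALTER TABLE tbl ALTER COLUMN varchar_column TYPE integer USING varchar_column::integer;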
How to avoid exceptions?
To remove non-digit characters before the cast and thereby avoid possible exceptions:
ORDER BY NULLIF(regexp_replace(varchar_column, '\D', '', 'g'), '')::int
The regexp_replace() expression effectively removes all non-digits, so only digits remain or an empty string. (See below.)
\D is shorthand for the character class [^[:digit:]], meaning all non-digits ([^0-9]).
In old Postgres versions with the outdated setting standard_conforming_strings = off, you have to use Posix escape string syntax E'\\D' to escape the backslash \. This was default in Postgres 8.3, so you'll need that for your outdated version.
The 4th parameter g is for "globally", instructing to replace all occurrences, not just the first.
You may want to allow a leading dash (-) for negative numbers; a possible variant for that is sketched below.
If the string has no digits at all, the result is an empty string, which is not valid for a cast to integer. Convert empty strings to NULL with NULLIF. (You might consider 0 instead.)
The result is guaranteed to be valid. This procedure is for a cast to integer as requested in the body of the question, not for numeric as the title mentions.
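On the leading-dash point above, one possible variant is to extract the first signed run of digits instead of stripping characters. This is only a sketch, and it behaves differently when a value contains several separate groups of digits (it keeps just the first one; no match yields NULL):
ORDER BY substring(varchar_column from '-?[0-9]+')::int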
How to make it fast?
One way is an index on an expression.
CREATE INDEX tbl_varchar_col2int_idx ON tbl
(cast(NULLIF(regexp_replace(varchar_column, '\D', '', 'g'), '') AS integer));
Then use the same expression in the ORDER BY clause:
ORDER BY
cast(NULLIF(regexp_replace(varchar_column, '\D', '', 'g'), '') AS integer)
Test with EXPLAIN ANALYZE whether the functional index actually gets used.
Also, in case you want to order by a text column that holds something convertible to a float, this does it:
select *
from your_table
order by cast(your_text_column as double precision) desc;