Finding MySQL errors from LOAD DATA INFILE

I am running a LOAD DATA INFILE command in MySQL and one of the files is showing errors at the mysql prompt.
How do I check the warnings and errors? Right now the only thing I have to go by is the fact that the prompt reports 65,535 warnings on import.
mysql> use dbname;
Database changed
mysql> LOAD DATA LOCAL INFILE '/dump.txt'
-> INTO TABLE table
-> (id, title, name, accuracy);
Query OK, 897306 rows affected, 65535 warnings (16.09 sec)
Records: 897306 Deleted: 0 Skipped: 0 Warnings: 0
How do I get mysql to show me what those warnings are? I looked in the error log but I couldn't find them. Running the "SHOW WARNINGS" command only returned 64 results, which means that the remaining 65,000 warnings must be somewhere else.
| Warning | 1366 | Incorrect integer value: '' for column 'accuracy' at row 20382 |
| Warning | 1366 | Incorrect integer value: '' for column 'accuracy' at row 20383 |
| Warning | 1366 | Incorrect integer value: '' for column 'accuracy' at row 20384 |
| Warning | 1366 | Incorrect integer value: '' for column 'accuracy' at row 20386 |
| Warning | 1366 | Incorrect integer value: '' for column 'accuracy' at row 20387 |
+---------+------+----------------------------------------------------------------+
64 rows in set (0.00 sec)
How do I find these errors?

The MySQL SHOW WARNINGS command only shows you a subset of the warnings. You can raise the limit on how many warnings are kept by changing the max_error_count system variable.
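For example, a minimal sketch (65535 here just mirrors the count your import reported; the server caps how high max_error_count can go):
SET SESSION max_error_count = 65535;
-- re-run the LOAD DATA statement, then:
SHOW WARNINGS;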

Getting that many errors suggests that you have the wrong delimiter or extraneous quote marks that are making MySQL read the wrong columns from your input.
You can probably fix that by adding
[{FIELDS | COLUMNS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
after the tablename and before the column list.
Something like:
LOAD DATA LOCAL INFILE '/dump.txt'
INTO TABLE table
fields terminated by ' ' optionally enclosed by '"'
(id, title, name, accuracy);
By default, if you don't specify this, MySQL expects the tab character to terminate fields.

There could be a blank entry in the data file, and the target table doesn't allow null values, or doesn't have a valid default value for the field in question.
I'd check that the table has a default for accuracy - and if it doesn't, set it to zero and see if that clears up the errors.
Or you could pre-process the file with 'awk' or similar and ensure there is a valid numeric value for the accuracy field in all rows.
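If you stay inside MySQL, LOAD DATA can also do that mapping during the load by reading the field into a user variable and rewriting it with a SET clause. A sketch reusing the statement from the question (MySQL 5.0+; the empty-string-to-zero mapping is an assumption about your data):
LOAD DATA LOCAL INFILE '/dump.txt'
INTO TABLE table
(id, title, name, @accuracy)
SET accuracy = IF(@accuracy = '', 0, @accuracy);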

Related

How to find float rounding errors in SQL server

I've narrowed down a data issue on a legacy SQL Server 2008 database.
The column is a 'float'. SSMS shows four of the records as '0.04445', but when I query for all records that match the first value, only three of the four are returned. The last record is somehow different; I suspect it is off by 0.0000000001 or something and the SSMS GUI is rounding it for display(?). Using the '<' operator has similar results ('where my_column < 0.04445' returns three of the four). This is causing some catastrophic calculation errors in the calling app.
I tried casting it to a decimal ('SELECT CAST(my_column as DECIMAL(38,20)) FROM...') but all four records just come back 0.044450000000000000000000000000
I suspect that there are many other similar errors in this same column, as the data has been entered in various ways over the years.
Is there any way to see this column in its full value/precision/mantissa, rather than the rounded value?
I can't change the schema or the app.
Update - using the 'Edit Top 200 Rows' feature, I can see that about three quarters of them are 0.044449999999999996 and the other quarter are exactly 0.04445. But I can't get it to display that level of accuracy in a regular query result.
You can use CONVERT(VARBINARY(8), my_column) to see the number in its original form. What you get back should be 0x3FA6C226809D4952 or 0x3FA6C226809D4951. And what number is that, really? 3FA6C226809D4951 in binary is
0 01111111010 0110110000100010011010000000100111010100100101010001
0 => number is positive
01111111010 => 1018-1023 = -5 is exponent (so we get 2^-5)
1.0110110000100010011010000000100111010100100101010001 => 6405920109971793*2^-52
so the 0x3FA6C226809D4951 is exactly 6405920109971793*2^-57, which is 0.044449999999999996458388551445750636048614978790283203125
and 0x3FA6C226809D4952 is exactly 6405920109971794*2^-57, which is 0.04445000000000000339728245535297901369631290435791015625
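To see which of your four rows holds which bit pattern, a query along these lines should work (my_table is a stand-in for your actual table name):
SELECT my_column, CONVERT(VARBINARY(8), my_column) AS raw_bits
FROM my_table
WHERE my_column > 0.0444 AND my_column < 0.0445;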
So your question is really about SSMS, not about your application or SQL Server itself: you want to see the actual float values in SSMS without the rounding.
By design SSMS rounds float during display. For example, see this answer.
But, you can see the actual value that is stored in the column if you convert it to a string explicitly using CONVERT function.
float and real styles
For a float or real expression, style can have one of the values shown in the following table. Other values are processed as 0.
0 (default) - A maximum of 6 digits. Use in scientific notation, when appropriate.
1 - Always 8 digits. Always use in scientific notation.
2 - Always 16 digits. Always use in scientific notation.
3 - Always 17 digits. Use for lossless conversion. With this style, every distinct float or real value is guaranteed to convert to a distinct character string.
It looks like style 3 is just what you need:
convert(varchar(30), my_column, 3)
Here is my test:
DECLARE @v1 float = 0.044449999999999996e0;
DECLARE @v2 float = 0.044445e0;
SELECT @v1, @v2, convert(varchar(30), @v1, 3), convert(varchar(30), @v2, 3);
Result that I see in SSMS:
+------------------+------------------+-------------------------+-------------------------+
| (No column name) | (No column name) | (No column name)        | (No column name)        |
+------------------+------------------+-------------------------+-------------------------+
| 0.04445          | 0.044445         | 4.4449999999999996e-002 | 4.4444999999999998e-002 |
+------------------+------------------+-------------------------+-------------------------+
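Applied to the question's column (my_table is a stand-in, and this assumes a server version that supports style 3), that would be something like:
SELECT my_column, CONVERT(varchar(30), my_column, 3) AS exact_value
FROM my_table;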

"String or binary data would be truncated." for NVARCHAR but not VARCHAR in LIKE operation

In SQL Server, nvarchar takes twice the space of varchar, and its pre-page-pointer limit is 4000 compared to varchar's 8000.
So, why does the following like comparison give a String or binary data would be truncated. error...
select 1 where '' like cast(replicate('x', 4001) as nvarchar(max))
...while casting as a massively larger varchar does not?
select 1 where '' like cast(replicate('x', 123456) as varchar(max))
In fact, why does the top line give a truncation error at all when it's clearly declared as nvarchar(max), which has a size limit of about 2 GB?
From the description of the LIKE operator:
pattern
Is the specific string of characters to search for in
match_expression, and can include the following valid wildcard
characters. pattern can be a maximum of 8,000 bytes.
This query shows the actual character counts:
select len(replicate('x', 123456)) as CntVarchar,
len(replicate('x', 4001)) as CntNVarchar;
+------------+-------------+
| CntVarchar | CntNVarchar |
+------------+-------------+
| 8000       | 4001        |
+------------+-------------+
The first case has 8,000 characters, which is 8,000 bytes: replicate with a plain varchar input caps its result at 8,000 characters, so the 123,456 never materializes. The second case has 4,001 two-byte characters, which is 8,002 bytes, and that violates the rule "pattern can be a maximum of 8,000 bytes".
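You can check the byte counts directly with DATALENGTH, which counts bytes where LEN counts characters:
select datalength(replicate('x', 123456)) as VarcharBytes,
datalength(cast(replicate('x', 4001) as nvarchar(max))) as NVarcharBytes;
This returns 8000 and 8002, confirming that only the nvarchar pattern crosses the 8,000-byte limit.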

Evaluating a variable using the IN() Function

I'm trying to resolve a datastep variable in the in() function. I have a dataset that looks like the following:
|Run|Sample Level|Samples Tested|
| 1 | 1 | 1-5 |
| 1 | 2 | 1-5 |
...etc
| 1 | 5 | 1-5 |
---------------------------------
| 2 | 1 | 1-4 |
| 2 | 2 | 1-4 |
The samples tested vary by run. Normally the only sample levels in the dataset are the ones in the range provided by "Samples Tested". However occasionally this is not the case, and it can get messy. For example the last one I worked on looked like this:
|Run|Sample Level|Samples Tested|
| 1 | 1 |2-9, 12-35, 37-40|
In this case I'd want to drop all rows with sample levels that were not included in Samples Tested, which I did by manually adding the code:
Data Want;
set Have;
if sample_level not in (2:9, 12:35, 37:40) then delete;
run;
But what I want is to have this done automatically by looking at the Samples Tested column. It's easy enough to turn a "-" into a ":", but where I'm stuck is getting the IN() function to recognize or resolve a variable. I would like code that looks like this:
if sample_level not in (Samples_Tested) then delete;
where Samples_Tested has been transformed into something that the IN() function can handle. I'm also not opposed to using proc sql if anyone has a solution that they think will work. I know you can do things like
Proc sql;
Create table want as
select * from HAVE
where Sample_Level in (Select Samples_Tested from Have);
Quit;
But the problem is that the samples tested varies by run and there could be 16 different runs. Hopefully I've explained the challenge clearly enough. Thanks for taking the time to read this and thanks in advance for your help!
Assuming the value of SAMPLES_TESTED is constant for each value of RUN, you could use it to generate the selection criteria. For example, you could use a data _null_ step to write a WHERE statement to a file and then %include that code into another data step.
filename code temp;
data _null_;
  file code;
  if eof then put ';' ;               /* close the WHERE statement after the last record */
  set have end=eof;
  by run;
  if first.run;                       /* one condition per RUN value */
  if _n_=1 then put 'where ' @ ;      /* trailing @ holds the line open */
  else put ' or ' @ ;
  samples_tested=translate(samples_tested,':','-');   /* turn 1-5 into 1:5 */
  put '(' run= 'and sample_level in (' samples_tested '))';
run;
data want;
  set have;
  %include code;
run;
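For the sample data in the question, the generated file would contain something along these lines (layout approximate):
where (run=1 and sample_level in (1:5 ))
 or (run=2 and sample_level in (1:4 ))
;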
Note: IN is an operator and not a function.
Good to see SAS code ;-)
That would work with one range:
select * from HAVE where level in (tested);
For multiple ranges I would use SUBSTRING_INDEX in MySQL, or just a combination of SUBSTRING and INDEX, to find the next condition.
select * from HAVE where level in (tested1) or level in (tested2) or level in (tested3);
Where you replace tested1, for example, with substr(tested, 1, index(tested, ',')).
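A sketch of the SUBSTRING_INDEX idea against the sample table below, where tested1 and tested2 are just illustrative aliases for the first two comma-separated ranges:
select run, level, tested,
trim(substring_index(tested, ',', 1)) as tested1,
trim(substring_index(substring_index(tested, ',', 2), ',', -1)) as tested2
from have;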
I used the following to generate the sample:
create table have
(run int,
level int,
tested varchar(20));
INSERT INTO have (run, level, tested)
VALUES (1, 1, "3-5");
INSERT INTO have (run, level, tested)
VALUES (1, 3, "3-5, 12:35");
INSERT INTO have (run, level, tested)
VALUES (1, 20, "3-5, 12-35");

Invalid digits on Redshift

I'm trying to load some data from stage to relational environment and something is happening I can't figure out.
I'm trying to run the following query:
SELECT
    CAST(SPLIT_PART(some_field, '_', 2) AS BIGINT) AS cmt_par
FROM
    public.some_table;
The some_field is a column that has data with two numbers joined by an underscore like this:
some_field -> 38972691802309_48937927428392
And I'm trying to get the second part.
That said, here is the error I'm getting:
[Amazon](500310) Invalid operation: Invalid digit, Value '1', Pos 0,
Type: Long
Details:
-----------------------------------------------
error: Invalid digit, Value '1', Pos 0, Type: Long
code: 1207
context:
query: 1097254
location: :0
process: query0_99 [pid=0]
-----------------------------------------------;
Execution time: 2.61s
Statement 1 of 1 finished
1 statement failed.
It's literally saying some numbers are not valid digits. I've already pulled the exact data that is throwing the error, and it appears to be a normal field, just like I was expecting. It happens even if I throw out NULL fields.
I thought it would be an encoding error, but I've not found any references to solve that.
Does anyone have any idea?
Thanks everybody.
I just ran into this problem and did some digging. Seems like the error Value '1' is the misleading part, and the problem is actually that these fields are just not valid as numeric.
In my case they were empty strings. I found the solution to my problem in this blogpost, which is essentially to find any fields that aren't numeric, and fill them with null before casting.
select cast(colname as integer) from
(select
case when colname ~ '^[0-9]+$' then colname
else null
end as colname
from tablename) t;  -- Redshift requires an alias on the derived table
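A shorter variant of the same idea (still assuming that anything non-numeric should be treated as null) leans on CASE returning null when no branch matches:
select cast(case when colname ~ '^[0-9]+$' then colname end as integer)
from tablename;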
Bottom line: this Redshift error is completely confusing and really needs to be fixed.
When you are using a Glue job to upsert data from any data source to Redshift:
Glue will rearrange the data and then copy it, which can cause this issue. This happened to me even after using ApplyMapping.
In my case, the datatypes were not an issue at all; in the source they were typecast to exactly match the fields in Redshift.
Glue was rearranging the columns into alphabetical order of column names and then copying the data into the Redshift table (which will obviously throw an error, because my first column is an ID key, unlike the other string columns).
To fix the issue, I used a SQL query within Glue to run a select command with the correct order of the columns in the table.
It's weird that Glue did that even after using ApplyMapping, but the workaround helped.
For example: the source table has fields ID|EMAIL|NAME with values 1|abcd@gmail.com|abcd, and the target table also has fields ID|EMAIL|NAME. But when Glue upserts the data, it rearranges the columns by name before writing, so Glue tries to write abcd@gmail.com|1|abcd into ID|EMAIL|NAME. This throws an error because ID expects an int value and EMAIL expects a string. I used a SQL query transform with the query "SELECT ID, EMAIL, NAME FROM data" to rearrange the columns before writing the data.
Hmmm. I would start by investigating the problem. Are there any non-digit characters?
SELECT some_field
FROM public.some_table
WHERE SPLIT_PART(some_field, '_', 2) ~ '[^0-9]';
Is the value too long for a bigint? A bigint tops out at 9,223,372,036,854,775,807, so anything beyond 18 digits may overflow:
SELECT some_field
FROM public.some_table
WHERE LEN(SPLIT_PART(some_field, '_', 2)) > 18;
If you need more digits of precision than a bigint can hold, consider a decimal rather than a bigint.
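For example, a sketch using Redshift's widest exact numeric:
SELECT CAST(SPLIT_PART(some_field, '_', 2) AS DECIMAL(38,0)) AS cmt_par
FROM public.some_table;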
If you get error message like “Invalid digit, Value ‘O’, Pos 0, Type: Integer” try executing your copy command by eliminating the header row. Use IGNOREHEADER parameter in your copy command to ignore the first line of the data file.
So the COPY command will look like below:
COPY orders
FROM 's3://sourcedatainorig/order.txt'
credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>'
delimiter '\t'
IGNOREHEADER 1;
For my Redshift SQL, I had to wrap my columns with Cast(col As Datatype) to make this error go away.
For example, setting my columns datatype to Char with a specific length worked:
Cast(COLUMN1 As Char(xx)) = Cast(COLUMN2 As Char(xxx))

How is input to the MySQL function md5 handled?

I'm having problems understanding how input to the md5 function in MySQL 4.1.22 is handled. Basically I'm not able to recreate the md5sum of a specific value combination for comparison. I guess it has something to do with the format of the input data.
I have set up a table with two columns of type double (direction and elevation), plus a third for storing an md5 sum.
With a setup script I add data to the direction and elevation fields and create a checksum using the following syntax:
insert into polygons (
direction,
elevation,
md5sum
)
values (
(select radians(20.0)),
(select radians(30.0)),
( md5( (select radians(20.0)) + (select radians(30.0)) ) )
)
which ends up as: 0.349065850398866, 0.523598775598299, '0c2cd2c2a9fe40305c4e3bd991812df5'
Later I compare the stored md5sum with a newly calculated one. The new checksum is created using md5('0.349065850398866' + '0.523598775598299'), and I get the following checksum:
'8c958bcf912664d6d27c50f1034bdf34'
If I modify the last decimal in the passed "string" from a 9 to an 8 (0.523598775598298), I get the same checksum as previously stored, '0c2cd2c2a9fe40305c4e3bd991812df5'. Any value of the last decimal from 8 down to 0 gives the same checksum.
Using BINARY, md5( (select BINARY(radians(20.0))) + (select BINARY(radians(30.0))) ) in the setup script creates the same checksum as my original "runtime calculation".
Worth mentioning is that the original method works for all other rows I have (55).
I guess I'm using the function in a somewhat strange way, but I'm not sure of a better way, so in order to find a better way I feel I need to understand why the current is failing.
The two numbers you are adding are stored in binary form, but displayed in decimal form. There is no guarantee that you will get exactly the same number back if you give the decimal form back to the machine.
In this case, this causes the addition to give a slightly different result, which gives an entirely different MD5 sum:
mysql> select radians(20.0) + radians(30.0), '0.349065850398866' + '0.523598775598299';
+-------------------------------+-------------------------------------------+
| radians(20.0) + radians(30.0) | '0.349065850398866' + '0.523598775598299' |
+-------------------------------+-------------------------------------------+
|              0.87266462599716 |                           0.87266462599717 |
+-------------------------------+-------------------------------------------+
1 row in set (0.00 sec)
If you want to consistently get the same result, you need to store the results of radians(20.0) and radians(30.0) in variables somewhere, never relying on their printed representations.
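A sketch of that idea with MySQL user variables, so the exact binary values, never their printed forms, feed both the columns and the MD5:
SET @dir = RADIANS(20.0), @ele = RADIANS(30.0);
INSERT INTO polygons (direction, elevation, md5sum)
VALUES (@dir, @ele, MD5(@dir + @ele));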
The output of radians(20.0) is computed with many more digits than are shown in the printed output. When the result is passed to the md5 function, the full, non-truncated value is used, whereas the printed value will only show a limited number of digits. Thus, it is not the same value being passed into the md5 function in the two cases.