Extra spaces after loading from a text file with SQL*Loader - sql

I'm using Oracle 11g, and I'm trying to load data from a text file with SQL*Loader.
Here is a sample of the data (there are many more columns):
123456789876543212,100,333,432,02/05/2014,02/05/2014,02/05/2014,1.1,AA
I want to load the data into the DB first as VARCHAR2, and then convert it to the correct datatypes in the DB with a query. That's much easier, in my opinion.
Here is my table (MyTable):
create table MyTable
(
A varchar2(500),
B varchar2(500),
C varchar2(500),
D varchar2(500),
E varchar2(500),
F varchar2(500),
G varchar2(500),
H varchar2(500),
I varchar2(500)
);
Here is my loading script:
load data
infile 'D:\MyFile.txt'
into table MyTable
fields terminated by ','
trailing nullcols
(
A char(4000),
B char(4000),
C char(4000),
D char(4000),
E char(4000),
F char(4000),
G char(4000),
H char(4000),
I char(4000)
)
Here is what the data looks like after being loaded into the DB:
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2,1 0 0,3 3 3,4 3 2,0 2 / 0 5 / 2 0 1 4,0 2 / 0 5 / 2 0 1 4,0 2 / 0 5 / 2 0 1 4, 1 . 1,A A
Why does my data look like this? What are these spaces? I don't have a lot of experience with data loading.
I'm guessing that the problem is the data types of the table in the DB and in the loading file. What is the right way to define such data? I want to load the data as is into the DB. I'll do the conversion in the DB with a query. Please note that the first column has 18 digits.

The usual reason for "spaces" appearing between every character after a load is that there is a nul (ASCII 0) after every character in your original text file. If you open the file in a hex editor you should be able to see this (each nul is shown as 00). You can also inspect the loaded data in your table using the DUMP() function.
Without extra parameters, DUMP() returns the data-type code of the value you pass it, the length of the data in bytes, and the internal representation of the expression. There are a few other options, which are explained in the documentation.
In the example below, the data-type code is 96, which represents a CHAR; the length is 1, i.e. the string is 1 byte long; and the internal representation is 97, which is the ASCII code for a.
SQL> select dump('a')
2 from dual;
DUMP('A')
----------------
Typ=96 Len=1: 97
In your case you'd expect to see a code of 0 for each nul.
After you've double-checked, I'd go back to your supplier and ask them to remove the characters, as you won't be able to tell whether they're actual nul characters or part of a multi-byte character. I've previously written about the strategies for removing nuls from the database should you be unable to get the file fixed.
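One way to do that double-check outside the database is to look at the raw bytes of the file. The following is only a minimal Python sketch, not part of the original answer, and the helper name describe_nuls is hypothetical. It counts nul bytes and reports whether they sit at every second position. That is the pattern you would get if plain ASCII text had been saved as UTF-16, which is a common cause but only an assumption about your file.

```python
def describe_nuls(raw: bytes) -> dict:
    """Summarise NUL (0x00) bytes in a chunk of raw file data."""
    nul_positions = [i for i, b in enumerate(raw) if b == 0]
    return {
        "length": len(raw),
        "nul_count": len(nul_positions),
        # True when every odd-indexed byte is NUL and no other byte is:
        # the layout of ASCII text encoded as UTF-16LE (without a BOM).
        "looks_like_utf16le": (
            bool(raw)
            and len(raw) % 2 == 0
            and nul_positions == list(range(1, len(raw), 2))
        ),
        # First few byte codes, comparable to what DUMP() lists after "Len=n:".
        "codes": list(raw[:8]),
    }


# Simulated file content (an assumption, not your real file).
sample = "123,AA".encode("utf-16-le")
print(describe_nuls(sample))
```

To try it on the real file, you could read the first few bytes with open(path, 'rb').read(64) and pass them in. A real UTF-16 file may start with a two-byte byte-order mark, which this simple check does not account for.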

Related

How to get char from regular expression matching list

I want to get all the characters that match a regular expression matching list, using Oracle SQL.
Example: the regular expression matching list is '[a-c2-5]'
The result should be a table with these rows:
a
b
c
2
3
4
5
This sounds very much like an XY problem (google the phrase to learn what it means). What is the real problem you are trying to solve this way?
In any case, here is one way to solve this. I assume you are only interested in single-byte characters (codes between 1 and 255; strictly speaking, ASCII stops at 127); you may generalize this if needed.
select chr(level) as matched_character
from dual
where regexp_like(chr(level), '[a-c2-5]')
connect by level <= 255
;
MATCHED_CHARACTER
-----------------
2
3
4
5
a
b
c
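The same brute-force idea (try every candidate character against the bracket expression and keep the ones that match) can be sketched outside the database. This is only an illustration in Python, with a hypothetical helper name chars_matching. Note that Python's re module does not understand POSIX classes such as [[:alpha:]], so it only mirrors the Oracle query for simple ranges like the one in the question.

```python
import re


def chars_matching(char_class: str, max_code: int = 255) -> list:
    """Return every character with code 1..max_code that matches the bracket expression."""
    pattern = re.compile(char_class)
    return [
        chr(code)
        for code in range(1, max_code + 1)
        if pattern.fullmatch(chr(code))
    ]


print(chars_matching("[a-c2-5]"))  # ['2', '3', '4', '5', 'a', 'b', 'c']
```

As in the SQL version, the characters come back in code order, so the digits precede the letters.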

display non-printable ascii characters in SQL as :ascii: or :print: does not work

I am trying to fetch all non-printable ASCII characters from the DESCRIPTION field in a table using SQL in TOAD; however, the query below is not working.
select
regexp_instr(a.description,'[^[:ascii:]]') as description from
poline a where a.ponum='XXX' and a.siteid='YYY' and
regexp_instr(a.description,'[^[:ascii:]]') > 0
The above query raised the error ORA-12729: invalid character class in regular expression. I tried :print: instead of :ascii:, but it didn't return any results. Below is the description for this record, which has non-printable characters.
Sherlock 16 x 6.5” Wide Wheelbarrow wheel .M100P.10R – Effluent care bacteria and enzyme formulation
:ascii: is not a valid character class, and even if it were, it doesn't appear to be what you are trying to get here (ascii does contain non-printable characters). Valid classes can be found here.
Actually if you replace :ascii: with :print: in your original query, it will indeed return the first position in each POLINE.DESCRIPTION that is a non-printable character. (If it returns nothing for you, it may be because your DESCRIPTION data is actually all printable.)
But since you stated that you want to identify every non-printable char in each DESCRIPTION in POLINE, some changes would be needed. I'll include an example that gets every match as a starting place.
In this example, each DESCRIPTION will be decomposed to its individual constituent characters, and each char will be checked for printability. The location within the DESCRIPTION string along with the ASCII number of the non-printable character will be returned.
This example assumes there is a unique identifier for each row in POLINE, here called POLINE_ID.
First, create the test table:
CREATE TABLE POLINE(
POLINE_ID NUMBER PRIMARY KEY,
PONUM VARCHAR2(32),
SITEID VARCHAR2(32),
DESCRIPTION VARCHAR2(256)
);
And load some data. I inserted a couple non-printing chars in the example Sherlock string you provided, #23 and #17. An example string composed of only the first 64 ASCII chars (of which the first 31 are not in :print:) is also included, and some fillers to fall through the PONUM and SITEID predicates.
INSERT INTO POLINE VALUES (1,'XXX','YYY','Sherlock'||CHR(23)||' 16 x 6.5” Wide Wheelbarrow wheel .M100P.10R –'||CHR(17)||' Effluent care bacteria and enzyme formulation');
DECLARE
V_STRING VARCHAR2(64) := CHR(1);
BEGIN
FOR POINTER IN 2..64 LOOP
V_STRING := V_STRING||CHR(POINTER);
END LOOP;
INSERT INTO POLINE VALUES (2, 'XXX','YYY',V_STRING);
INSERT INTO POLINE VALUES (3, 'AAA','BBB',V_STRING);
END;
/
INSERT INTO POLINE VALUES(4,'XXX','YYY','VOLTRON');
Now we have 4 rows total. Three of them contain (multiple) non-printable characters, but only two of them should match all the restrictions.
Then run a query. Two example queries follow. The first uses REGEXP_INSTR as in your initial query (with :cntrl: in place of :print:). The second is an alternative that simply checks whether each char is among the first 31 ASCII chars.
Both example queries index every char of each DESCRIPTION, check whether it is printable, and collect the ASCII number and location of each non-printable character in each candidate DESCRIPTION. The example table here has DESCRIPTIONs that are up to 256 chars long, so this is used as the max index in the cartesian join.
Please note, these are not efficient, and are designed to get EVERY match. If you end up needing only the first match after all, your original query with :print: substituted will perform much better. This could also be tuned by dropping into PL/SQL or perhaps going recursive (if PL/SQL is allowed in your use case, or you are on 11gR2+, etc.). Also, some predicates here, such as REGEXP_LIKE, do not impact the end result and serve only as preliminary filtering. These could be superfluous (or worse) for you, depending on your data set.
First example, using regex and :cntrl:
SELECT
POLINE_ID,
STRING_INDEX AS NON_PRINTABLE_LOCATION,
ASCII(REGEXP_SUBSTR(SUBSTR(DESCRIPTION, STRING_INDEX, 1), '[[:cntrl:]]', 1, 1)) AS NON_PRINTABLE_ASCII_NUMBER
FROM POLINE
CROSS JOIN (SELECT LEVEL AS STRING_INDEX
FROM DUAL
CONNECT BY LEVEL < 257) CANDIDATE_LOCATION
WHERE PONUM = 'XXX'
AND SITEID = 'YYY'
AND REGEXP_LIKE(DESCRIPTION, '[[:cntrl:]]')
AND REGEXP_INSTR(SUBSTR(DESCRIPTION, STRING_INDEX, 1), '[[:cntrl:]]', 1, 1, 0) > 0
AND STRING_INDEX <= LENGTH(DESCRIPTION)
ORDER BY 1 ASC, 2 ASC;
Second example, using ASCII numbers:
SELECT
POLINE_ID,
STRING_INDEX AS NON_PRINTABLE_LOCATION,
ASCII(SUBSTR(DESCRIPTION, STRING_INDEX, 1)) AS NON_PRINTABLE_ASCII_NUMBER
FROM POLINE
CROSS JOIN (SELECT LEVEL AS STRING_INDEX
FROM DUAL
CONNECT BY LEVEL < 257) CANDIDATE_LOCATION
WHERE PONUM = 'XXX'
AND SITEID = 'YYY'
AND REGEXP_LIKE(DESCRIPTION, '[[:cntrl:]]')
AND ASCII(SUBSTR(DESCRIPTION, STRING_INDEX, 1)) BETWEEN 1 AND 31
AND STRING_INDEX <= LENGTH(DESCRIPTION)
ORDER BY 1 ASC, 2 ASC;
In our test data, these queries will produce equivalent output. We should expect this to have two hits (for chrs 17 and 23) in the Sherlock DESCRIPTION, and 31 hits for the first-64-ascii DESCRIPTION.
Result:
POLINE_ID NON_PRINTABLE_LOCATION NON_PRINTABLE_ASCII_NUMBER
1 9 23
1 56 17
2 1 1
2 2 2
2 3 3
2 4 4
2 5 5
2 6 6
2 7 7
2 8 8
2 9 9
2 10 10
2 11 11
2 12 12
2 13 13
2 14 14
2 15 15
2 16 16
2 17 17
2 18 18
2 19 19
2 20 20
2 21 21
2 22 22
2 23 23
2 24 24
2 25 25
2 26 26
2 27 27
2 28 28
2 29 29
2 30 30
2 31 31
33 rows selected.
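For comparison, the "report every hit" logic of these queries can be sketched in a few lines of Python. This is an illustration only, not Oracle code. It assumes, like the second query, that "non-printable" means character codes 1 to 31, and the names CONTROL_CHARS and non_printable_hits are made up for the example.

```python
import re

# Assumption: "non-printable" means control codes 1-31, as in the second query.
CONTROL_CHARS = re.compile(r"[\x01-\x1f]")


def non_printable_hits(description: str) -> list:
    """Return (1-based position, character code) for every control character."""
    return [
        (match.start() + 1, ord(match.group()))
        for match in CONTROL_CHARS.finditer(description)
    ]


# Shortened test string with the same two inserted control characters.
sample = "Sherlock" + chr(23) + " 16 x 6.5" + chr(17) + " wheel"
print(non_printable_hits(sample))  # [(9, 23), (19, 17)]
```

Positions are 1-based to match what SUBSTR and REGEXP_INSTR report in the queries above.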
EDIT In response to comments, here is some elaboration on what we can expect from [[:cntrl:]] and [^[:cntrl:]] with regexp_instr.
[[:cntrl:]] will match any of the first 31 ascii characters, while [^[:cntrl:]] is the logical negation of [[:cntrl:]], so it will match anything except the first 31 ascii characters.
To compare these, we can start with the simplest case of only one character, ascii #31. Since there's only one character, the result can only be either match or miss. One will expect the following to return 1 for the match:
SELECT REGEXP_INSTR(CHR(31),'[[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
1
But 0 for the miss with negating [^[:cntrl:]] :
SELECT REGEXP_INSTR(CHR(31),'[^[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
0
Now if we include two (or more) characters that are a mix of printable and non-printable, there are more possible outcomes. Both [[:cntrl:]] and [^[:cntrl:]] can match, but they can only match different things. If we move from only ascii #31 to ascii #64#31, we will still expect [[:cntrl:]] to match (since there is a non-printable character in the string), but it should now return 2, since the non-printable is in the second position.
SELECT REGEXP_INSTR(CHR(64)||CHR(31),'[[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
2
And now [^[:cntrl:]] also has the opportunity to match (at the first position):
SELECT REGEXP_INSTR(CHR(64)||CHR(31),'[^[:cntrl:]]',1,1,0) AS MATCH_INDEX FROM DUAL;
MATCH_INDEX
1
When there is a mix of printable and control characters, both [[:cntrl:]] and [^[:cntrl:]] can match, but they will match at different indices.

Loading null values to Hive table

I have a .txt file that has the following rows:
Steve,1 1 1 1 1 5 10 20 10 10 10 10
When I created an external table, loaded the data and ran select *, I got null values. How can I get the number values to show instead of null? I very much appreciate the help!
create external table Teller(Name string, Bill array<int>)
row format delimited
fields terminated by ','
collection items terminated by '\t'
stored as textfile
location '/user/training/hive/Teller';
load data local inpath '/home/training/hive/input/*.txt' overwrite into table Teller;
output:
Steve [null]
It seems the integers are separated by spaces and not tabs, so the collection delimiter should be ' ' rather than '\t':
bash
hdfs dfs -mkdir -p /user/training/hive/Teller
echo Steve,1 1 1 1 1 5 10 20 10 10 10 10 | hdfs dfs -put - /user/training/hive/Teller/data.txt
hive
hive> create external table Teller(Name string, Bill array<int>)
> row format delimited
> fields terminated by ','
> collection items terminated by ' '
> stored as textfile
> location '/user/training/hive/Teller';
OK
Time taken: 0.417 seconds
hive> select * from teller;
OK
Steve [1,1,1,1,1,5,10,20,10,10,10,10]
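Why the wrong delimiter produces [null] rather than an error can be illustrated with a rough Python imitation of the parsing. This is a sketch of the idea only, not how Hive is implemented, and parse_row is a hypothetical helper. With a tab as the item separator the whole "1 1 1 ..." string stays as one item, which cannot be converted to an int and so becomes a null.

```python
def parse_row(line: str, field_sep: str = ",", item_sep: str = "\t"):
    """Mimic, very roughly, how a delimited row becomes (string, array<int>)."""
    name, _, bill = line.partition(field_sep)
    items = []
    for token in bill.split(item_sep):
        try:
            items.append(int(token))
        except ValueError:
            items.append(None)  # stands in for the NULL Hive shows for a failed conversion
    return name, items


row = "Steve,1 1 1 1 1 5 10 20 10 10 10 10"
print(parse_row(row, item_sep="\t"))  # ('Steve', [None])
print(parse_row(row, item_sep=" "))   # ('Steve', [1, 1, 1, 1, 1, 5, 10, 20, 10, 10, 10, 10])
```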

Print limited number of elements in collect_set array using printf function

I want to printf() just the first 3 patients in collect_set() of patient numbers.
A. I have created "patient_list" using collect_set
collect_set(distinct patient_seq) AS patient_list
which yields arrays of patient numbers of varying length (4, 5 or 6 digits)
Example:
["16189","26599","406622","419117","5551"]
["223587","224663","232072","326504","433430","436673","54540","58188","74118"]
B. I then stripped out the commas and quotes and separated by '*' (in order to grab just the first 3 patients, in the next step):
concat_ws('*', patient_list) AS pat_list
This produces:
16189*26599*406622*419117*5551
223587*224663*232072*326504*433430*436673*54540*58188*74118
C. I tried to use SUBSTRING_INDEX() to create a new variable (pat_list_short) containing just the first 3 patients, but this function is not supported in hive 1.1.0 (not supported until 1.3.0).
substring_index(pat_list, '*', 3) AS pat_list_short
What other option do I have?
I want to feed the pat_list_short into the PRINTF using %s in order to print out just the first three patient numbers for review team. Since the patient num varies in length I can't just limit the print to a certain length
Thanks
Using the data you provided
--------------
key | pat_id
--------------
1 16189
1 26599
1 406622
1 419117
1 5551
2 223587
2 224663
2 232072
2 326504
2 433430
2 436673
2 54540
2 58188
2 74118
you can use this UDF here to truncate an array to a desired length. There are instructions on the main page how to build and use the jar.
Query:
add jar /path/to/jar/brickhouse-0.7.1.jar;
create temporary function trunc_array as 'brickhouse.udf.collect.TruncateArrayUDF';
select key
, concat_ws(' ', trunc_array(collect_set( pat_id ), 3)) pat_list_short
from db.tbl
group by key
Output:
----------------------
key | pat_list_short
----------------------
1 5551 26599 16189
2 232072 58188 223587
I must admit I'm a bit unclear as to how printf() plays a part in this problem, as the query returns a result and prints it. It is also worth noting that in your query in A, the distinct in collect_set(distinct) is redundant, as collect_set's purpose is to collect distinct elements.
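To make the intended truncation concrete, here is a small Python sketch of what "keep the first three items of the '*'-separated list" means, i.e. the behaviour the question wanted from substring_index with a positive count. It is an illustration for checking results by hand, not Hive code, and first_n is a hypothetical name.

```python
def first_n(pat_list: str, n: int = 3, sep: str = "*") -> str:
    """Keep only the first n separator-delimited items of the string."""
    return sep.join(pat_list.split(sep)[:n])


print("%s" % first_n("16189*26599*406622*419117*5551"))  # 16189*26599*406622
```

Because collect_set does not promise any particular element order, which three patients count as "first" depends on the order Hive happens to return, as the output above shows.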

SQL Replace Commas in the a row of a table

I have a table test( ID Numeric(11,0), report varchar(255) )
and the data looks like this:
1 ,Age,,,,,,family_status,,,,,,
2 ,,,,,,,,retaliation,hostile environment,,,,
3 ,,,,,,,,,,,,,
4 ,,,,,,,,retaliation,,,,,
5 ,,,,,,,,,hostile environment,,,,
6 ,Age,,,,,,,,,,,,
7 ,,,,national_origin,,,,,,,,,
8 Sex,,,,,,,,,,,,,
9 ,,,,national_origin,,disability,,retaliation,,,,,
10 Sex,,,,,,,,retaliation,,,,,
11 ,,,,,,,,
and I would like to update this table, using replace or any other data scrubbing technique, to remove the extra commas so that the data looks like this:
1 Age,family_status
2 retaliation,hostile environment
3
4 retaliation
5 hostile environment
6 Age
7 national_origin,
8 Sex
9 national_origin,disability,retaliation
10 Sex,retaliation
11
I tried the statement below, but I'm not sure how to loop so that it checks and removes all the extra commas:
UPDATE test SET report = replace(report , ',,', ',')
If you are just doing this as a one-off task (rather than a scripted process you expect to use repeatedly), you could simply run this query repeatedly until you get 0 rows updated:
UPDATE test SET report = replace(report , ',,', ',')
WHERE report like '%,,%'
If you need to do this over and over, or put it in a program, I recommend using your procedural (non-SQL) code to do the replace, where you have better text manipulation commands.
If you aren't thrilled with that, check out this blog article I wrote on a similar problem of replacing repeating spaces from a string.
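As a rough idea of what the procedural route could look like, here is a minimal Python sketch. It assumes the rows are read and written back by application code that is not shown, and squeeze_commas is a hypothetical helper name. It collapses every run of commas into one in a single pass and also strips the leading and trailing commas, which the repeated replace on its own would leave behind.

```python
import re


def squeeze_commas(report: str) -> str:
    """Collapse runs of commas into one and drop leading/trailing commas."""
    return re.sub(r",{2,}", ",", report).strip(",")


print(squeeze_commas(",Age,,,,,,family_status,,,,,,"))  # Age,family_status
```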