Solved: the integer column was set to auto-increment, so I just left it empty in the import...
We have a lot of data that is currently in Excel. I made a VBA script that builds a CSV to import into our PostgreSQL table, and I'm trying to import it with the import/export feature of pgAdmin.
The table has columns of type ([PK] integer, string, string, JSON). When I try to import, it throws an error right at the beginning saying that »2« is not a valid integer.
The file is UTF-8 encoded.
This is the command pgAdmin generates:
--command " "\copy public.stocknew (stockid, stockname, stockbarcode, stockjson) FROM '//DESKTOP-G86U473/temp/Test.csv' DELIMITER ',' CSV ENCODING 'UTF8' QUOTE '"' ESCAPE '''';""
I'm not a regular question asker, so please comment if anything needs clarification.
Here is the first entry of the CSV file.
2,"W12345","35","{
'"Manufacturer'":'"ExampleValue'",
'"Supplier'":'"ExampleValue'",
'"SupplierName'":'"ExampleValue'",
'"Category'":'"ExampleValue'",
'"SubCategory'":'"ExampleValue'",
'"Partvalue'":'"868MHz - 928MHz, 2.400MHz - 2.500MHz'",
'"Tolerance'":'"ExampleValue'",
'"Dimension'":'"10,4 x 49,6mm'",
'"Temperature'":'"-34°C + 76°C'",
...*This keeps going for a while*...
'"Example'":2,
'"Example'":3,
'"Example'":4
}"
The following data loads successfully:
"2","W12345","35","{\"Manufacturer\":\"ExampleValue\",\"Supplier\":\"ExampleValue\",\"SupplierName\":\"ExampleValue\",\"Category\":\"ExampleValue\",\"SubCategory\":\"ExampleValue\",\"Partvalue\":\"868MHz - 928MHz, 2.400MHz - 2.500MHz\",\"Tolerance\":\"ExampleValue\",\"Dimension\":\"10,4 x 49,6mm\",\"Temperature\":\"-34°C + 76°C\"}"
postgres=# create table pew (i1 int, s1 varchar(30), s2 varchar(30), j1 jsonb);
postgres=# copy pew from '/tmp/somedata.csv' with (format CSV, quote '"', escape '\');
COPY 1
postgres=# select * from pew;
 i1 |   s1   | s2 |                                           j1
----+--------+----+---------------------------------------------------------------------------------
  2 | W12345 | 35 | {"Category": "ExampleValue", "Supplier": "ExampleValue", "Dimension": "10,4 x 49,6mm", "Partvalue": "868MHz - 928MHz, 2.400MHz - 2.500MHz", "Tolerance": "ExampleValue", "SubCategory": "ExampleValue", "Temperature": "-34°C + 76°C", "Manufacturer": "ExampleValue", "SupplierName": "ExampleValue"}
(1 row)
postgres=# select version();
version
------------------------------------------------------------------------------------------------------------------
PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
(1 row)
It looks like your CSV has extraneous single quotes and is missing the escaping for the double quotes inside the JSON.
Pardon my ugly SQL
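For completeness, here is a minimal sketch of how such a row could be generated so the quoting comes out as valid CSV. It is written in Python rather than VBA and uses made-up values, but the idea carries over: serialize the JSON first, then let the CSV writer escape the embedded double quotes (it doubles them, which COPY ... CSV accepts with its default ESCAPE character).
import csv
import json

# Hypothetical values for one (stockid, stockname, stockbarcode, stockjson) row
stock_json = {
    "Manufacturer": "ExampleValue",
    "Partvalue": "868MHz - 928MHz, 2.400MHz - 2.500MHz",
    "Dimension": "10,4 x 49,6mm",
    "Temperature": "-34°C + 76°C",
}

with open("Test.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # defaults: delimiter ',', quotechar '"', inner quotes doubled
    # json.dumps emits proper "key": "value" quoting; csv.writer handles the CSV escaping
    writer.writerow([2, "W12345", "35", json.dumps(stock_json)])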
Related
I have a hive table with the following structure and data:
Table structure:
CREATE EXTERNAL TABLE IF NOT EXISTS db_crprcdtl.shcar_dtls (
ID string,
CSK string,
BRND string,
MKTCP string,
AMTCMP string,
AMTSP string,
RLBRND string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/on/hadoop/dir/';
-------------------------------------------------------------------------------
ID | CSK | BRND | MKTCP | AMTCMP
-------------------------------------------------------------------------------
782 flatn,grpl,mrtn hnd,mrc,nsn 34555,56566,66455 38900,59484,71450
1231 jikl,bngr su,mrc,frd 56566,32333,45000 59872,35673,48933
123 unsrvl tyt,frd,vlv 25000,34789,33443 29892,38922,36781
I'm trying to push this data into SQL Server, but while doing so I get the following error message:
SQL Error [107090] [S0001]: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Not enough columns in this line.
What I tried:
There's an online article where the author has documented similar issues. I tried to implement one of the fixes ("Looked in Excel and found two columns that had carriage returns"), but that didn't help either.
Any suggestion/help would be really appreciated. Thanks
If I understand your issue correctly, your comma-separated data is getting divided into multiple columns rather than staying in one column on SQL Server, something like:
------------------------------
ID |CSK |BRND |MKTCP |AMTCMP
------------------------------
782 flatn grpl mrtn hnd mrc nsn 345 56566 66455 38900 59484 71450
1231 jikl bngr su mrc frd 56566 32333 45000 59872 35673 48933
123 unsrvl tyt frd vlv 25000 34789 33443 29892 38922 36781
On Hive there are only 5 columns, and I presume the SQL Server table has the same (you haven't shared its schema). If that's the case, then more than 5 values are being passed per line while the schema definition has only 5 columns.
That is why the error appears.
Refer to this document by Microsoft and try to create a FILE_FORMAT with FIELD_TERMINATOR = '\t',
like:
CREATE EXTERNAL FILE FORMAT <name>
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR ='\t',
| STRING_DELIMITER = string_delimiter
| First_Row = integer -- ONLY AVAILABLE SQL DW
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| Encoding = {'UTF8' | 'UTF16'} )
);
Hope that helps resolve your issue :)
I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
An inconsistent schema is when values in some rows are of a different data type. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file with a WHERE clause on the $PATH pseudo-column:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best-case scenario, when we already know which files are bad. However, you can employ this approach to manually narrow down which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of this approach is the cost and time of executing the queries, especially if it is done file by file.
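If the files are small enough to pull down locally, a cheaper alternative is to validate them outside Athena. Below is a rough sketch (assuming line-delimited JSON copies in a local directory and a hypothetical field age that is expected to be an integer) that reports the file and line of every offending record:
import json
from pathlib import Path

DATA_DIR = Path("downloaded_json")  # hypothetical local copies of the S3 objects
FIELD = "age"                       # hypothetical field expected to be an int

for path in sorted(DATA_DIR.glob("*.json")):
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"{path}:{lineno}: malformed JSON")
                continue
            if FIELD in record and not isinstance(record[FIELD], int):
                print(f"{path}:{lineno}: {FIELD} is not an int: {record[FIELD]!r}")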
Malformed records
In the eyes of AWS Athena, good records are those which are formatted as a single JSON per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
when you create the table. With that in place, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use just a single col_1 IS NULL check if you are 100% sure that the column doesn't contain empty fields other than in corrupted rows.
In general, malformed records are not that big of a deal, provided that 'ignore.malformed.json' = 'true'.
For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
then the following query will still succeed:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
With 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
I'm trying to export data from PostgreSQL to CSV.
First I created the query and tried exporting from pgAdmin with File -> Export to CSV. The CSV is wrong; for example, it contains:
The header: Field1;Field2;Field3;Field4
The rows begin well, except that the last field is put on another line:
Example :
Data1;Data2;Data3;
Data4;
The problem is that I get an error when trying to import the data into another server.
The data comes from a view I created.
I also tried
COPY view(field1,field2...) TO 'C:\test.csv' DELIMITER ',' CSV HEADER;
It exports the same file.
I just want to export the data to another server.
Edit:
When trying to import the CSV I get the error:
ERROR : Extra data after the last expected column. Context Copy
actions, line 3: <<"Data1, data2 etc.">>
So the first line is the header, the second line is the first row of data minus the last field, and the last field sits alone on the third line.
In order to export the file to another server you have two options:
Creating a shared folder between the two servers, so that the database also has access to this directory:
COPY (SELECT field1,field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
Triggering the export from the target server using the STDOUT of COPY. Using psql you can achieve this by running the following command:
psql yourdb -c "COPY (SELECT * FROM your_table) TO STDOUT" > output.csv
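If you would rather drive the transfer from a script than from the shell, roughly the same thing can be done with psycopg2 (this is only a sketch; the connection strings, view and table names are placeholders). Streaming COPY ... TO STDOUT straight into COPY ... FROM STDIN keeps the data in CSV form, so fields containing line feeds stay safely quoted:
import io
import psycopg2  # assumes psycopg2 is installed on the machine running the transfer

# Placeholder connection strings for the source and target servers
src = psycopg2.connect("host=source-host dbname=sourcedb user=me")
dst = psycopg2.connect("host=target-host dbname=targetdb user=me")

buf = io.StringIO()
with src.cursor() as cur:
    # CSV quoting keeps embedded line feeds inside quoted fields
    cur.copy_expert("COPY (SELECT * FROM your_view) TO STDOUT WITH CSV", buf)

buf.seek(0)
with dst.cursor() as cur:
    cur.copy_expert("COPY your_table FROM STDIN WITH CSV", buf)
dst.commit()

src.close()
dst.close()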
EDIT: Addressing the issue of fields containing line feeds (\n)
If you want to get rid of the line feeds, use the REPLACE function.
Example:
SELECT E'foo\nbar';
?column?
----------
foo +
bar
(1 row)
Removing the line feed:
SELECT REPLACE(E'foo\nbaar',E'\n','');
replace
---------
foobaar
(1 row)
So your COPY should look like this:
COPY (SELECT field1,REPLACE(field2,E'\n','') AS field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
The export procedure described above is OK, e.g.:
t=# create table so(i int, t text);
CREATE TABLE
t=# insert into so select 1,chr(10)||'aaa';
INSERT 0 1
t=# copy so to stdout csv header;
i,t
1,"
aaa"
t=# create table so1(i int, t text);
CREATE TABLE
t=# copy so1 from stdout csv header;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> i,t
1,"
aaa"
>> >> >> \.
COPY 1
t=# select * from so1;
i | t
---+-----
1 | +
| aaa
(1 row)
I have gz files in a folder. I need only 3 columns from these files, but each line has over 100 of them. At the moment I create a view this way:
drop table MAK_CHARGE_RCR;
create external table MAK_CHARGE_RCR
(LINE string)
STORED as SEQUENCEFILE
LOCATION '/apps/hive/warehouse/mydb.db/file_rcr';
drop view VW_MAK_CHARGE_RCR;
create view VW_MAK_CHARGE_RCR as
Select LINE[57] as CREATE_DATE, LINE[64] as SUBS_KEY, LINE[63] as RC_TERM_NAME
from
(Select split(LINE, '\\|') as LINE
from MAK_CHARGE_RCR) a;
The view has the fields I need. Now I have to do the same, but without CTAS and I am not sure how to go about it. What can I do?
I was told the table must look like this
create external table MAK_CHARGE_RCR
(CREATE_DATE string, SUBS_KEY string, RC_TERM_NAME etc)
I could split the line like this
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\\|'
but then I'd have to list every column. I have another group of files with over 1,000 columns, and I'd need to list all of them. That seems a bit excessive, so I wondered whether it is possible to do
create external table arstel.MAK_CHARGE_RCR
(split(LINE, '\\|')[57] string,
split(LINE, '\\|')[64] string
etc)
This obviously doesn't work, but maybe there are workarounds?
RegexSerDe
For educational purposes
P.S.
I intend to create an enhanced version of the CSV SerDe that accepts an additional parameter with the positions of the requested columns.
Demo
bash
echo {a..c}{1..100} | xargs -n 100 | tr ' ' '|' | \
hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create external table mytable
(
col58 string
,col64 string
,col65 string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "^(?:([^|]*)\\|){58}(?:([^|]*)\\|){6}([^|]*)\\|.*$")
stored as textfile
location '/user/hive/warehouse/mytable'
;
select * from mytable
;
+---------------+---------------+---------------+
| mytable.col58 | mytable.col64 | mytable.col65 |
+---------------+---------------+---------------+
| a58 | a64 | a65 |
| b58 | b64 | b65 |
| c58 | c64 | c65 |
+---------------+---------------+---------------+
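If the repeated capturing group looks surprising, here is a quick way to convince yourself of what the pattern does (sketched in Python; the behaviour is the same for the SerDe's Java regex): a group inside an {n} repetition keeps only its last match, so the three groups land exactly on fields 58, 64 and 65.
import re

# Build a sample line like the bash demo above: a1|a2|...|a100
line = "|".join(f"a{i}" for i in range(1, 101))

# Same idea as the RegexSerDe pattern: the group repeated {58} times keeps only
# its last repetition (field 58), the next group keeps field 64, then field 65.
pattern = r"^(?:([^|]*)\|){58}(?:([^|]*)\|){6}([^|]*)\|.*$"
m = re.match(pattern, line)
print(m.group(1), m.group(2), m.group(3))  # -> a58 a64 a65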
I have been learning pyparsing over the last few weeks. I plan to use it to get table names from SQL statements.
I have looked at http://pyparsing.wikispaces.com/file/view/simpleSQL.py, but I intend to keep the grammar simple because I am not trying to parse every part of the select statement; I am only after the table names. It is also quite involved to define the complete grammar for a commercial, modern-day database like Teradata.
#!/usr/bin/env python
from pyparsing import *
import sys
semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')
# Keyword definition
update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
map(lambda x: Keyword(x, caseless=True), \
['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])
# Teradata SQL allows the SELECT as well as the SEL keyword
select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)
# list of reserved keywords
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
select_kw | from_kw | where_kw | join_kw |
left_kw | right_kw | cross_kw | on_kw | insert_kw |
into_kw)
# Identifier can be used as table or column names. They can't be reserved words
ident = ~reserved_words + Word(alphas, alphanums + '_')
# Recursive definition for table
table = Forward()
# a simple table name can be an identifier or a qualified identifier, e.g. schema.table
simple_table = Combine(Optional(ident + Literal('.')) + ident)
# a table name can also be a complete select statement used as a table
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
# table can be simple table or nested table
table << (nested_table | simple_table)
# comma delimited list of tables
table_list = delimitedList(table)
# Building the from clause only, because table name(s) will always appear after it
from_clause = from_kw.suppress() + table_list
txt = """
SELECT p, (SELECT * FROM foo),e FROM a, d, (SELECT * FROM z), b
"""
for token, start, end in from_clause.scanString(txt):
print(token)
One thing worth mentioning here: I use "SkipTo(from_kw)" to jump over the column list in the SQL statement. This is primarily to avoid defining a grammar for the column list, which can be a comma-delimited list of identifiers, function calls, DW analytical functions and what not. With this grammar I am able to parse the above statement, as well as any level of nesting in the SELECT column list or table list.
['foo']
['a', 'd', 'z', 'b']
I am facing a problem when the nested SELECT has a WHERE clause:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
When a WHERE clause is present, the same statement may look like:
SELECT ... FROM a,d , (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p
I thought of changing "nested_table" definition to:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + Optional(where_kw + SkipTo(rparen)) + rparen
But this does not work, since it matches the right parenthesis following "c1 = 1". What I would like to know is how to skip to the right parenthesis that matches the left parenthesis right before "SELECT * FROM z...". I don't know how to do that with pyparsing.
On a different note, I'd also appreciate advice on the best way to get table names from complex nested SQL. Any help is really appreciated.
Thanks
Abhijit
Considering that you are also trying to parse out nested SELECTs, I don't think you'll be able to avoid writing a fairly complete SQL parser. Fortunately, there is a more complete example on the pyparsing wiki Examples page, select_parser.py. I hope that gets you further along.
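That said, for the narrower question of skipping to the parenthesis that matches the one opening the nested SELECT, pyparsing's nestedExpr can be passed to SkipTo as an ignore expression, so inner (...) groups inside the WHERE clause are stepped over rather than terminating the skip. Here is a trimmed-down sketch of the grammar from the question with that change (only the parts relevant to the problem are kept):
from pyparsing import (CaselessKeyword, Combine, Forward, Literal, Optional,
                       SkipTo, Word, alphanums, alphas, delimitedList, nestedExpr)

lparen, rparen = Literal('('), Literal(')')
select_kw = CaselessKeyword('SELECT') | CaselessKeyword('SEL')
from_kw = CaselessKeyword('FROM')
where_kw = CaselessKeyword('WHERE')

ident = Word(alphas, alphanums + '_')
simple_table = Combine(Optional(ident + Literal('.')) + ident)

table = Forward()
# nestedExpr() consumes a balanced ( ... ) group, so SkipTo never stops at an
# inner ')' belonging to a condition like (c1 = 1)
where_clause = where_kw + SkipTo(rparen, ignore=nestedExpr())
nested_table = (lparen.suppress() + select_kw.suppress()
                + SkipTo(from_kw).suppress() + from_kw.suppress()
                + table + Optional(where_clause).suppress()
                + rparen.suppress())
table <<= (nested_table | simple_table)
from_clause = from_kw.suppress() + delimitedList(table)

sql = "SELECT c1, c2 FROM a, d, (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p"
for tokens, start, end in from_clause.scanString(sql):
    print(tokens.asList())  # -> ['a', 'd', 'z', 'p']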