Postgres Function to Insert Arrays

I am trying to INSERT data via a Postgres function, and I can't quite get it working. I am getting an error stating:
ERROR: function unnest(integer) does not exist
SQL state: 42883
Hint: No function matches the given name and argument types. You might need to add explicit type casts.
I am using Postgres 9.5, and my function is as follows:
CREATE FUNCTION insert_multiple_arrays(
    some_infoid INTEGER[],
    other_infoid INTEGER[],
    some_user_info VARCHAR,
    OUT new_user_id INTEGER
)
RETURNS INTEGER AS $$
BEGIN
    INSERT INTO user_table (user_info) VALUES ($3) RETURNING user_id INTO new_user_id;
    INSERT INTO some_info_mapper (user_id, some_info_id) SELECT new_user_id, unnest($1);
    INSERT INTO other_info_mapper (user_id, other_info_id) SELECT new_user_id, unnest($2);
END;
$$ LANGUAGE plpgsql;
I will be calling the stored procedure from my backend via a SELECT statement. An example is like so:
createUser(user, callback){
    let client = this.getDb();
    client.query("SELECT insert_multiple_arrays($1, $2, $3)",
        [user.some_info_ids, user.other_info_ids, user.info], function(err, results){
            if(err){
                return callback(err); // return so the success callback below is not also invoked
            }
            callback(null, results);
        });
}
The output that I am expecting would be as follows:
user_table
 user_id | user_info
---------+-----------
       1 | someInfo

some_info_mapper
 user_id | some_info_id
---------+--------------
       1 | 33
       1 | 5

other_info_mapper
 user_id | other_info_id
---------+---------------
       1 | 8
       1 | 9
       1 | 22
       1 | 66
       1 | 99
How do I handle this error? Do I need to do some sort of processing to my data to put it into a format that postgres accepts?

You're calling insert_multiple_arrays with three parameters, but you show a definition with four. Perhaps an old, buggy 3-parameter version is still lurking there, and you are trying to find the bug in the 4-parameter version that is not actually in use?

After exploring @cachiques' comments, it appears that the data was not being sent correctly after all. As it turns out, the data being passed to the back end was an array of objects that needed to be parsed further than I realized. Once parsed, the SQL worked fine. Here is the code I used to parse on the server side, which would be sent to the SQL query:
user.other_info_ids = req.body.other_info.map( function(obj) { return obj.info_id; } );
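For reference, once proper arrays are passed, the function can be sanity-checked directly from SQL. A minimal sketch, using the array values from the expected output above (the literals are illustrative):

-- hedged example: explicit integer[] literals; the OUT parameter new_user_id
-- comes back as the result column of the SELECT
SELECT insert_multiple_arrays(
    ARRAY[33, 5],                 -- some_infoid
    ARRAY[8, 9, 22, 66, 99],      -- other_infoid
    'someInfo'                    -- some_user_info
);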

Related

Change null to empty array in Databricks SQL?

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read Parquet, infer the schema of the JSON column, convert the column from a JSON string to a deeply nested structure, and explode any arrays within. I am working in SQL with Python-defined UDFs (json_exists() checks the schema to see if the key is possible to use; json_get() gets a key from the column or returns a default) and want to do the following:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM
JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL                                                          |
|----|-------------------------------------------------------------------|
| 1  | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]}  |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |
Example table that does not cause error:
| ID | JSON_COL                                                          |
|----|-------------------------------------------------------------------|
| 1  | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]}      |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |
This question seems like it might hold the answer, but I was not able to get anything working from it.
You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM (
SELECT *
FROM JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
)
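If the column itself has already been inferred as a string, filtering alone may not change its type. One possible workaround (a sketch only, using Spark's built-in get_json_object and from_json rather than the question's Python UDFs; JSON_STR stands for the pre-conversion JSON string column and is hypothetical) is to parse the array with an explicit schema so nothing depends on inference:

SELECT
    ID,
    EXPLODE(
        from_json(
            get_json_object(JSON_STR, '$.array'),       -- extract the array as a JSON string
            'array<struct<test:INT,`from`:INT>>'        -- explicit schema instead of inference
        )
    ) AS SINGLE_ARRAY_VALUE
FROM JSON_TABLE
WHERE JSON_STR IS NOT NULL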

HIVE-SQL_SERVER: HadoopExecutionException: Not enough columns in this line

I have a Hive table with the following structure and data:
Table structure:
CREATE EXTERNAL TABLE IF NOT EXISTS db_crprcdtl.shcar_dtls (
    ID string,
    CSK string,
    BRND string,
    MKTCP string,
    AMTCMP string,
    AMTSP string,
    RLBRND string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/on/hadoop/dir/';
-------------------------------------------------------------------------------
ID   | CSK             | BRND        | MKTCP             | AMTCMP
-------------------------------------------------------------------------------
782  | flatn,grpl,mrtn | hnd,mrc,nsn | 34555,56566,66455 | 38900,59484,71450
1231 | jikl,bngr       | su,mrc,frd  | 56566,32333,45000 | 59872,35673,48933
123  | unsrvl          | tyt,frd,vlv | 25000,34789,33443 | 29892,38922,36781
Trying to push this data into SQL Server, but while doing so I am getting the following error message:
SQL Error [107090] [S0001]: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Not enough columns in this line.
What I tried:
There's an online article where the author has documented similar kinds of issues. I tried to implement one of them ("Looked in Excel and found two columns that had carriage returns"), but this didn't help either.
Any suggestion/help would be really appreciated. Thanks
If I understand your issue correctly, it seems that your comma-separated data is getting divided into multiple columns, rather than one column, on the SQL Server, something like:
------------------------------
ID |CSK |BRND |MKTCP |AMTCMP
------------------------------
782 flatn grpl mrtn hnd mrc nsn 34555 56566 66455 38900 59484 71450
1231 jikl bngr su mrc frd 56566 32333 45000 59872 35673 48933
123 unsrvl tyt frd vlv 25000 34789 33443 29892 38922 36781
So if you look at Hive there are only 5 columns, and presumably the same on SQL Server (I presume this since you haven't shared the schema). But if that's the case, then more than 5 values are being passed, while the schema definition has only 5 columns. Hence the error.
Refer to this document by MS and try to create a FILE_FORMAT with FIELD_TERMINATOR = '\t',
like:
CREATE EXTERNAL FILE FORMAT <name>
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '\t'
        [ , STRING_DELIMITER = string_delimiter ]
        [ , FIRST_ROW = integer ]                    -- only available in SQL DW
        [ , DATE_FORMAT = datetime_format ]
        [ , USE_TYPE_DEFAULT = { TRUE | FALSE } ]
        [ , ENCODING = { 'UTF8' | 'UTF16' } ]
    )
);
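Once the file format exists, the external table can reference it. A rough sketch of how the pieces fit together (the data source and format names below are hypothetical, and the column types are illustrative):

-- hedged sketch: assumes an external data source already points at the Hadoop cluster
CREATE EXTERNAL TABLE dbo.shcar_dtls_ext (
    ID     varchar(50),
    CSK    varchar(200),
    BRND   varchar(200),
    MKTCP  varchar(200),
    AMTCMP varchar(200),
    AMTSP  varchar(200),
    RLBRND varchar(200)
)
WITH (
    LOCATION = '/on/hadoop/dir/',
    DATA_SOURCE = my_hadoop_source,   -- hypothetical external data source
    FILE_FORMAT = my_tsv_format       -- the tab-delimited format created above
);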
Hope that helps to resolve your issue :)

How do I identify problematic documents in S3 when querying data in Athena?

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
An inconsistent schema is when values in some rows are of a different data type. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with:
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file with a WHERE clause:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file         | name     | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos  | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best-case scenario, where we already know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time:
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of such an approach is the cost of executing the queries and the time spent, especially if it is done file by file.
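For the specific error in the question (a floating-point string arriving in a column declared int), another option may be to declare the suspect field as string in a scratch copy of the table and let the engine locate the offenders. A sketch, assuming a hypothetical table my.data_raw in which profile.score is declared as string (TRY_CAST is Presto/Athena syntax):

-- hedged sketch: TRY_CAST returns NULL instead of failing,
-- so files containing unparseable values can be pinpointed
SELECT DISTINCT "$PATH"
FROM my.data_raw
WHERE profile.score IS NOT NULL
  AND TRY_CAST(profile.score AS INTEGER) IS NULL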
Malformed records
In the eyes of AWS Athena, good records are those which are formatted as a single JSON object per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
when you create the table. Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use just col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other than in corrupted rows.
In general, malformed records are not that big of a deal, provided that 'ignore.malformed.json' = 'true'. For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
  source_s3_file              | name     | age | address
------------------------------|----------|-----|-------------
1 s3://path/to/malformed.json | 2Patrick | 35  | North Street
2 s3://path/to/malformed.json |          |     |
3 s3://path/to/malformed.json |          |     |
4 s3://path/to/malformed.json |          |     |
5 s3://path/to/malformed.json |          |     |
6 s3://path/to/malformed.json |          |     |
7 s3://path/to/malformed.json | 2Fabiana | 22  | Main Street
While with 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

DECLARING GLOBAL TEMPORARY TABLE

DB2/400 SQL: I'm working on a SQL function which uses a global temporary table. I just have a problem declaring this table: SQL gives me an error, but I don't see where the problem is. Can someone tell me what this error is about?
Function declaring a global temporary table
I don't speak or read French, but it appears that the error is telling you that your function definition expects a return value, but the function body does not return anything.
Now, when creating a SQL scalar function, the function body starts out with declarations. These declarations are for variables and cursors. It is somewhat unfortunate that you create a global temporary table using a statement that starts with DECLARE. It doesn't belong in the declarations, but in the procedure body.
          .-NOT ATOMIC-.
>>-+--------+--BEGIN--+------------+---------------------------->
   '-label:-'         '-ATOMIC-----'

>--+---------------------------------------+-------------------->
   | .-----------------------------------. |
   | V                                   | |
   '---+-SQL-variable-declaration--+--;--+-'
       +-SQL-condition-declaration-+
       +-return-codes-declaration--+
       '-INCLUDE-statement---------'

>--+--------------------------------------+--------------------->
   | .----------------------------------. |
   | V                                  | |
   '---+-DECLARE-CURSOR-statement-+--;--+-'
       '-INCLUDE-statement--------'

>--+---------------------------------+-------------------------->
   | .-----------------------------. |
   | V                             | |
   '---+-handler-declaration-+--;--+-'
       '-INCLUDE-statement---'

   .---------------------------------.
   V                                 |
>----+-----------------------------+-+--END--+--------+--------><
     '-SQL-procedure-statement--;--'         '-label--'
It is part of the SQL procedure statement!
The declaration section must come before the creation of any tables.
While it seems counterintuitive, DECLARE GLOBAL TEMPORARY TABLE is not a normal variable declaration.
Try moving the other variable declarations above the table declaration.
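A minimal sketch of the resulting shape (illustrative names only, assuming a DB2 for i SQL scalar function; this is not the original function):

-- hedged sketch: variable declarations first, then the temporary table
-- as an ordinary procedure statement, then a RETURN so the body returns a value
CREATE FUNCTION my_func ()
RETURNS INTEGER
LANGUAGE SQL
MODIFIES SQL DATA
BEGIN
    DECLARE v_result INTEGER DEFAULT 0;          -- declarations come first

    DECLARE GLOBAL TEMPORARY TABLE work_tbl (    -- a procedure statement,
        id INTEGER                               -- not a declaration
    ) WITH REPLACE;

    INSERT INTO SESSION.work_tbl VALUES (1);
    SET v_result = (SELECT COUNT(*) FROM SESSION.work_tbl);
    RETURN v_result;                             -- the body must return a value
END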

PostgreSQL: Sub-select inside insert

I have a table called map_tags:
map_id | map_license | map_desc
And another table (widgets) whose records contain a foreign key reference (1 to 1) to a map_tags record:
widget_id | map_id | widget_name
Given the constraint that all map_licenses are unique (though they are not set up as keys on map_tags), if I have a map_license and a widget_name, I'd like to perform the insert on widgets all inside of the same SQL statement:
INSERT INTO
widgets
(
    map_id,
    widget_name
)
VALUES (
    (
        SELECT
            mt.map_id
        FROM
            map_tags mt
        WHERE
            -- This should work and return a single record because map_license is unique
            mt.map_license = '12345'
    ),
    'Bupo'
)
I believe I'm on the right track, but I know right off the bat that this is incorrect SQL for Postgres. Does anybody know the proper way to achieve this in a single query?
Use the INSERT INTO ... SELECT variant, including whatever constants you need right in the SELECT statement.
The PostgreSQL INSERT syntax is:
INSERT INTO table [ ( column [, ...] ) ]
{ DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
Take note of the query option at the end of the second line above.
Here is an example for you.
INSERT INTO
widgets
(
map_id,
widget_name
)
SELECT
mt.map_id,
'Bupo'
FROM
map_tags mt
WHERE
mt.map_license = '12345'
Quick Answer:
You don't have "a single record"; you have a "set with 1 record".
If this were javascript: you have an "array with 1 value", not "1 value".
In your example, one record may be returned in the sub-query, but you are still trying to unpack an "array" of records into a place that takes only 1 parameter.
It took me a few hours to wrap my head around the "why not", as I was trying to do something very similar. Here are my notes:
tb_table01: (no records)
+---+---+---+
| a | b | c | << column names
+---+---+---+
tb_table02:
+---+---+---+
| a | b | c | << column names
+---+---+---+
|'d'|'d'|'d'| << record #1
+---+---+---+
|'e'|'e'|'e'| << record #2
+---+---+---+
|'f'|'f'|'f'| << record #3
+---+---+---+
--This statement will fail:
INSERT into tb_table01
( a, b, c )
VALUES
( 'record_1.a', 'record_1.b', 'record_1.c' ),
( 'record_2.a', 'record_2.b', 'record_2.c' ),
-- This sub-query returns multiple
-- rows. They are NOT automatically
-- unpacked like in javascript,
-- where you can spread an array
-- into a variadic function.
(
SELECT a,b,c from tb_table02
)
;
Basically, don't think of VALUES as a variadic function that can unpack an array of records. There is no argument unpacking here like you would have in a javascript function such as:
function takeValues( ...values ){
    values.forEach((v)=>{ console.log( v ) });
};
var records = [ [1,2,3],[4,5,6],[7,8,9] ];
takeValues( ...records );  // the spread operator unpacks the array into separate arguments
//:RESULT:
//: console.log #1 : [1,2,3]
//: console.log #2 : [4,5,6]
//: console.log #3 : [7,8,9]
Back to your SQL question:
The fact that this functionality does not exist does not change just because your sub-selection contains only one result. It is a "set with one record", not "a single record".