Check for null keys in a map? Presto SQL

I have a query that had been running fine for a while and that does the following:
MAP_AGG(key, value) AS k_v1,        -- value is itself a map
MAP_CONCAT(
    k_v,                            -- some map
    MAP_UNION_SUM(
        MAP(ARRAY[K], ARRAY[V])
    )
) AS k_v2
With some source data that looks like this:
key     value                k_v                K      V
id_2    {"KEY2": "20"}       {"KEY4": "100"}    KEY8   100
id_1    {"KEY1": "96.25"}    {"KEY5": "150"}    KEY8   150
which produces a table like this:
k_v1                                                   k_v2
{"id_2": {"KEY2": "20"}, "id_1": {"KEY1": "96.25"}}    {"KEY4": "100", "KEY5": "150", "KEY8": "250"}
But now, as a new job runs, I get an error stating:
"Failure": "map key cannot be null"
I'm trying to understand how to catch such a case in Presto, since it seems like a pretty verbose process to have to unnest these structures just to check for null keys. Is there an easier or built-in way to do this kind of check and remove the null keys from the mapping?
Edit: I have hundreds of thousands of records that need to be processed. The sample data above is only to illustrate the schema.

Not sure where and how you want to apply UNNEST, but my guess is that the source of the issue is MAP(ARRAY[K], ARRAY[V]) with some of the K values being null (MAP_AGG should ignore null keys, and the other functions operate on existing maps). For this case you can use a conditional expression to skip such rows by producing empty maps instead - if(K is null, MAP(), MAP(ARRAY[K], ARRAY[V])):
MAP_CONCAT(
    k_v,    -- some map
    MAP_UNION_SUM(
        if(K is null, MAP(), MAP(ARRAY[K], ARRAY[V]))
    )
) AS k_v2
or substitute the null key with some default using coalesce(K, 'KEYDEFAULT'):
MAP_CONCAT(
    k_v,    -- some map
    MAP_UNION_SUM(
        MAP(ARRAY[coalesce(K, 'KEYDEFAULT')], ARRAY[V])
    )
) AS k_v2
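Another option, if you prefer not to build empty maps, is to filter the null keys out before they ever reach MAP(). The following is only a sketch - the table name source_table, the grouping column id, and the use of ARBITRARY(k_v) are assumptions, since the question doesn't show the full query - but Presto's FILTER clause on aggregate functions drops the offending rows entirely:
SELECT
    id,
    MAP_CONCAT(
        ARBITRARY(k_v),
        MAP_UNION_SUM(MAP(ARRAY[K], ARRAY[V]))
            FILTER (WHERE K IS NOT NULL)    -- rows with a null key never reach MAP()
    ) AS k_v2
FROM source_table
GROUP BY id
Note that if every K in a group is null, the filtered aggregate returns NULL rather than an empty map, so a COALESCE(..., MAP()) may still be needed; the if()-based version above sidesteps that edge case.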

Related

JSON stored in a SUPER type fails to select a camelCase element. Too long to be serialized. How can I select it?

Summary:
I am working with a large JSON document that is stored in a Redshift SUPER column.
Context
This issue is nearly identical to the question posted here for T-SQL. My schema:
chainId BIGINT
properties SUPER
Sample data:
{
"chainId": 5,
"$browser": "Chrome",
"token": "123x5"
}
I have this as a column in my table called properties.
Desired behavior
I want to be able to retrieve the value 5 from the chainId key and store it in a BIGINT column.
What I've tried
I have referenced the following aws docs:
https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html
https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html
https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html
I have tried the following, none of which worked for me:
SELECT
properties.chainId::varchar as test1
, properties.chainId as test2
, properties.chainid as test3
, properties."chainId" as test4
, properties."chainid" as test5
, json_extract_path_text(json_serialize(properties), 'chainId') serial_then_extract
, properties[0].chainId as testval1
, properties[0]."chainId" as testval2
, properties[0].chainid as testval3
, properties[0]."chainid" as testval4
, properties[1].chainId as testval5
, properties[1]."chainId" as testval6
FROM clean
Of these attempts, serial_then_extract returned a non-null, correct value, but not all of the values in my properties field are short enough to serialize, so this only works on some of the rows.
All others return null.
Referencing the following docs: https://docs.aws.amazon.com/redshift/latest/dg/query-super.html#unnest I have also attempted to iterate over the SUPER type using PartiQL:
SELECT ps.*
, p.chainId
from clean ps, ps.properties p
where 1=1
But this returns no rows.
I also tried the following:
select
properties
, properties.token
, properties."$os"
from base
And this returned rows with values. I know that there is a chainId value as I've checked the corresponding key and am working with sample data.
What am I missing? What else should I be trying?
Does anyone know if this has to do with the way that the JSON key is formatted? (camelCase)
You need to enable case-sensitive identifiers. By default, Redshift maps everything to lowercase for table and column names. If you have mixed-case identifiers, as in your SUPER field, you need to enable case sensitivity with:
SET enable_case_sensitive_identifier TO true;
See: https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
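As a minimal sketch of the whole flow, reusing the clean table and properties column from the question (only the session setting is new):
SET enable_case_sensitive_identifier TO true;
SELECT properties."chainId"::bigint AS chain_id
FROM clean;
With the setting off, "chainId" is folded to chainid, which doesn't exist inside the SUPER value, so the navigation silently returns NULL - which matches the all-null results in the question.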

Handling null or missing attributes in JSON using PostgreSQL

I'm learning how to handle JSON in PostgreSQL.
I have a table with some columns. One column is a JSON field. The data in that column has at least these three variations:
Case 1: {"createDate": 1448067864151, "name": "world"}
Case 2: {"createDate": "", "name": "hello"}
Case 3: {"name": "sky"}
Later on, I want to select the createDate.
TO_TIMESTAMP((attributes->>'createDate')::bigint * 0.001)
That works fine for Case 1 when the data is present and it is convertible to a bigint. But what about when it isn't? How do I handle this?
I read this article. It explains that we can add check constraints to perform some rudimentary validation. Alternatively, I could do schema validation before the data is inserted (on the client side). There are pros and cons with both ideas.
Using a Check Constraint
CONSTRAINT validate_createDate CHECK ((attributes->>'createDate')::bigint >= 1)
This forces a non-nullable field (Case 3 fails). But I want the attribute to be optional. Furthermore, if the attribute doesn't convert to a bigint because it is blank (Case 2), this errors out.
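For what it's worth, a constraint can be written so that the attribute stays optional. This is only a sketch (not from the article, and my_table is a placeholder); it assumes the column is of type json so json_typeof is available, and it validates the value only when it really is a number:
ALTER TABLE my_table ADD CONSTRAINT validate_createDate CHECK (
    CASE json_typeof(attributes -> 'createDate')
        WHEN 'number' THEN (attributes ->> 'createDate')::bigint >= 1
        ELSE true    -- missing keys and string values (including empty strings) are allowed through
    END
);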
Using JSON schema validation on the client side before insert
This works, in part, because the schema validation makes sure that incoming data conforms to the schema. In my case, I can control which clients access this table, so this is OK. But it doesn't matter for the SQL later on, since my validator will let all three cases pass.
Basically, you need to check whether the createDate attribute is present and not empty:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp((attributes->>'createDate')::bigint * 0.001) FROM data
WHERE
(attributes->>'createDate') IS NOT NULL
AND
(attributes->>'createDate') != '';
Output:
to_timestamp
----------------------------
2015-11-20 17:04:24.151-08
(1 row)
Building on Dmitry's answer, you can also check the JSON type with the json_typeof function. Note the -> operator, which returns json, rather than the ->> operator, which always casts the value to text.
By doing the check in the SELECT with a CASE conditional instead of in the WHERE clause, we also keep the rows that have no createDate. Depending on your use case, this might be better.
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT
CASE WHEN (json_typeof(attributes->'createDate') = 'number')
THEN to_timestamp((attributes->>'createDate')::bigint * 0.001)
END AS created_date
FROM data
;
Output:
created_date
----------------------------
"2015-11-21 02:04:24.151+01"
""
""
(3 rows)
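A third, more compact variant (a sketch, not from either answer above) is to turn the empty string into NULL with NULLIF; the NULL then propagates through the cast and to_timestamp, so both the empty and the missing attribute simply yield NULL:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp(NULLIF(attributes->>'createDate', '')::bigint * 0.001) AS created_date
FROM data;
This keeps all three rows, like the CASE version, but it will still fail if createDate ever holds a non-numeric, non-empty string.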

Handle null values within SQL IN clause

I have the following SQL query in my hbm file. SCHEMA is the schema, and A and B are two tables.
select
*
from SCHEMA.A os
inner join SCHEMA.B o
on o.ORGANIZATION_ID = os.ORGANIZATION_ID
where
case
when (:pass = 'N' and os.ORG_ID in (:orgIdList)) then 1
when (:pass = 'Y') then 1
end = 1
and (os.ORG_SYNONYM like :orgSynonym or :orgSynonym is null)
This is a pretty simple query. I had to use the CASE ... WHEN to handle a null "orgIdList" parameter (passing null to a SQL IN gives an error). Below is the relevant Java code which sets the parameters.
if (_orgSynonym.getOrgIdList().isEmpty()) {
query.setString("orgIdList", "pass");
query.setString("pass", "Y");
} else {
query.setString("pass", "N");
query.setParameterList("orgIdList", _orgSynonym.getOrgIdList());
}
This works and gives me the expected output, but I would like to know if there is a better way to handle this situation (orgIdList sometimes becomes null).
There must be at least one element in the comma-separated list that defines the set of values for the IN expression. In other words, regardless of Hibernate's ability to parse the query and pass an empty IN(), and regardless of whether particular databases support that syntax (PostgreSQL doesn't, according to the Jira issue), the best practice is to build the query dynamically if you want your code to be portable (and I usually prefer to use the Criteria API for dynamic queries).
If not, you need some other workaround like the one you have, or you can wrap the list in a custom list implementation.
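If keeping everything in the HQL itself is the goal, the CASE guard can also be written as a plain OR; this is just a sketch reusing the named parameters from the question, with the Java side left exactly as posted (it still binds a dummy value for :orgIdList when the list is empty):
select *
from SCHEMA.A os
inner join SCHEMA.B o
on o.ORGANIZATION_ID = os.ORGANIZATION_ID
where (:pass = 'Y' or os.ORG_ID in (:orgIdList))
and (os.ORG_SYNONYM like :orgSynonym or :orgSynonym is null)
When :pass is 'N' the IN condition must hold, and when it is 'Y' the whole predicate is true, which is the same behaviour as the original CASE.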

Group by expression in pig

Say I have a dataset with tuples (f1, f2). I want to split my data into two bags: one where f1 is null and the other where f1 is not null. I try:
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
I expect to get two groups with keys true and false. When I run it in grunt I get the following:
2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 1046, column 25> Syntax error, unexpected symbol at or near 'f1'
I can do a workaround:
raw_group = GROUP raw BY (f1 is null)?0:1;
, but I would really like to understand what's going on here, as I have just started learning Pig. According to the Pig documentation I can use expressions as a grouping key. Am I missing something here, or are nulls treated differently in Pig?
The boolean datatype was not introduced until Pig 0.10. The expression f1 is null yields a boolean, which can't appear as a field in a relation - and that is exactly what it would become if it were the value of group. Prior to Pig 0.10, booleans could only be used in FILTER statements or in the ternary operator, as you showed in your workaround.
While I haven't tried this out, presumably if you were to attempt the same thing in Pig 0.10 or later, your original attempt would succeed.

What is the best way to run N independent column updates in PostgreSQL? What is the best way to do it in the SQL spec?

I'm looking for a more efficient way to run many column updates on the same table, like this:
UPDATE table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col ~ 'foo';
Such that foo and bar will be a combination of 40 different regex replaces. I doubt even 25% of the dataset needs to be updated at all, but what I want to know is whether it is possible to cleanly achieve the following in SQL:
A single pass update
A single match of the regex, triggers a single replace
Not running all possible regexp_replaces if only one matches
Not updating all columns if only one needs the update
Not updating a row if no column has changed
I'm also curious: I know that in MySQL (bear with me)
UPDATE foo SET bar = 'baz'
has an implicit WHERE bar != 'baz' clause.
However, I know this doesn't exist in PostgreSQL. I think I could at least answer one of my questions if I knew how to skip a single row's update when the target columns weren't changed.
Something like
UPDATE TABLE table
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*
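There is no such inline-variable syntax in PostgreSQL's UPDATE, but the same effect can be had by repeating the expression in the WHERE clause. A sketch only (my_table and col are placeholders, and the regexp is evaluated twice per row):
UPDATE my_table
SET col = regexp_replace(col, 'foo', 'bar')
WHERE col IS DISTINCT FROM regexp_replace(col, 'foo', 'bar');
Rows that the replacement would leave unchanged are never written, which is the PostgreSQL equivalent of the implicit MySQL check mentioned above.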
Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.
Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.
Experimental Results
In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:
in a transaction:
opened up a cursor (select for update)
while reading 100 rows from the cursor returns any rows:
for each row:
for each regular expression:
do the gsub on the text column
update the row
This takes 1.16 seconds with a locally connected database.
Then the "big replace," a single mega-regex update:
update foo set t =
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t,
E'\bcommit\b', E'COMMIT'),
E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b',
E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b<cworth#cworth.org>\b',
E'<CWORTH#CWORTH.ORG>'), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:53:13\b', E'04:53:13'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bUpdate\b', E'UPDATE'),
E'\bversion\b', E'VERSION'),
E'\bto\b', E'TO'), E'\b2.9.1\b',
E'2.9.1'), E'\bcommit\b', E'COMMIT'),
E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b',
E'61C89E56F361FA860F18985137D6BF53F48C16AC'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b<cworth#cworth.org>\b',
E'<CWORTH#CWORTH.ORG>'), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:51:58\b', E'04:51:58'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bNEWS:\b', E'NEWS:'),
E'\bAdd\b', E'ADD'), E'\bnotes\b',
E'NOTES'), E'\bfor\b', E'FOR'),
E'\bthe\b', E'THE'), E'\b2.9.1\b',
E'2.9.1'), E'\brelease.\b',
E'RELEASE.'), E'\bThanks\b',
E'THANKS'), E'\bto\b', E'TO'),
E'\beveryone\b', E'EVERYONE'),
E'\bfor\b', E'FOR')
The mega-regex update takes 0.94 seconds to update.
At 0.94 seconds compared to 1.16, it's true that the mega-regex update is faster, running in 81% of the time of doing it in code. It is not, however, a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?
Code
The code used was:
def stupid_regex_replace
  sql = Select.new
  sql.select('id')
  sql.select('t')
  sql.for_update
  sql.from(TABLE_NAME)
  Cursor.new('foo', sql, {}, @db) do |cursor|
    until (rows = cursor.fetch(100)).empty?
      for row in rows
        # run every regex over the text column, then write the row back
        for regex, replacement in regexes
          row['t'] = row['t'].gsub(regex, replacement)
        end
        sql = Update.new(TABLE_NAME, @db)
        sql.set('t', row['t'])
        sql.where(['id = %s', row['id']])
        sql.exec
      end
    end
  end
end
I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regexes so you can see them. Each pair is a regex and the corresponding replacement string:
[[/\bcommit\b/, "COMMIT"],
[/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
"9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth#cworth.org>\b/, "<CWORTH#CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:53:13\b/, "04:53:13"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bUpdate\b/, "UPDATE"],
[/\bversion\b/, "VERSION"],
[/\bto\b/, "TO"],
[/\b2.9.1\b/, "2.9.1"],
[/\bcommit\b/, "COMMIT"],
[/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
"61C89E56F361FA860F18985137D6BF53F48C16AC"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth#cworth.org>\b/, "<CWORTH#CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:51:58\b/, "04:51:58"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bNEWS:\b/, "NEWS:"],
[/\bAdd\b/, "ADD"],
[/\bnotes\b/, "NOTES"],
[/\bfor\b/, "FOR"],
[/\bthe\b/, "THE"],
[/\b2.9.1\b/, "2.9.1"],
[/\brelease.\b/, "RELEASE."],
[/\bThanks\b/, "THANKS"],
[/\bto\b/, "TO"],
[/\beveryone\b/, "EVERYONE"],
[/\bfor\b/, "FOR"]]
If this were a hand-generated list of regexes, and not automatically generated, my question still stands: which would you rather have to create or maintain?
For the skip-update part, look at the suppress_redundant_updates_trigger() function - see http://www.postgresql.org/docs/8.4/static/functions-trigger.html.
This is not necessarily a win - but it might well be in your case.
Or perhaps you can just add that implicit check as an explicit one?
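A minimal sketch of wiring that trigger up, assuming the experiment's table is called foo (PostgreSQL 8.4 or later):
CREATE TRIGGER zz_suppress_redundant_updates
BEFORE UPDATE ON foo
FOR EACH ROW
EXECUTE PROCEDURE suppress_redundant_updates_trigger();
The zz_ prefix keeps the trigger sorting (and therefore firing) after any other row-level BEFORE UPDATE triggers, so they get a chance to modify the row first.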