Summary:
I am working with a large JSON object that is stored in a Redshift SUPER column.
Context
This issue is nearly identical to the question posted here for T-SQL. My schema:
chainId BIGINT
properties SUPER
Sample data:
{
"chainId": 5,
"$browser": "Chrome",
"token": "123x5"
}
I have this as a column in my table called properties.
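For anyone reproducing this, the sample row can be loaded roughly like so (a sketch of my setup, not my actual load process; JSON_PARSE is how a SUPER value is produced from a JSON string):
CREATE TABLE clean (
    chainId BIGINT,
    properties SUPER
);
INSERT INTO clean
SELECT 5, JSON_PARSE('{"chainId": 5, "$browser": "Chrome", "token": "123x5"}');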
Desired behavior
I want to be able to retrieve the value 5 from the chainId key and store it in a BIGINT column.
What I've tried
I have referenced the following aws docs:
https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html
https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html
https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html
I have tried the following which haven't worked for me:
SELECT
properties.chainId::varchar as test1
, properties.chainId as test2
, properties.chainid as test3
, properties."chainId" as test4
, properties."chainid" as test5
, json_extract_path_text(json_serialize(properties), 'chainId') serial_then_extract
, properties[0].chainId as testval1
, properties[0]."chainId" as testval2
, properties[0].chainid as testval3
, properties[0]."chainid" as testval4
, properties[1].chainId as testval5
, properties[1]."chainId" as testval6
FROM clean
Of these attempts, only serial_then_extract returned a correct, non-null value, but not all of the values in my properties field are short enough to serialize, so this only works on some of the rows.
All others return null.
Referencing the following docs: https://docs.aws.amazon.com/redshift/latest/dg/query-super.html#unnest I have also attempted to iterate over the SUPER type using PartiQL:
SELECT ps.*
, p.chainId
from clean ps, ps.properties p
where 1=1
But this returns no rows.
I also tried the following:
select
properties
, properties.token
, properties."$os"
from base
And this returned rows with values. I know that there is a chainId value as I've checked the corresponding key and am working with sample data.
What am I missing? What else should I be trying?
Does anyone know if this has to do with the way the JSON key is formatted (camelCase)?
You need to enable case-sensitive identifiers. By default, Redshift folds table and column names to lowercase. If you have mixed-case identifiers, as in your SUPER field, you need to enable case sensitivity with
SET enable_case_sensitive_identifier TO true;
See: https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
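Once that's set for the session, the quoted camelCase path should resolve. A minimal sketch against your clean table (the ::bigint cast at the end is my assumption of how you'd land the value in a BIGINT column):
SET enable_case_sensitive_identifier TO true;
SELECT properties."chainId"::bigint AS chain_id
FROM clean;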
I'm learning how to handle JSON in PostgreSQL.
I have a table with some columns. One column is a JSON field. The data in that column has at least these three variations:
Case 1: {"createDate": 1448067864151, "name": "world"}
Case 2: {"createDate": "", "name": "hello"}
Case 3: {"name": "sky"}
Later on, I want to select the createDate.
TO_TIMESTAMP((attributes->>'createDate')::bigint * 0.001)
That works fine for Case 1 when the data is present and it is convertible to a bigint. But what about when it isn't? How do I handle this?
I read this article. It explains that we can add check constraints to perform some rudimentary validation. Alternatively, I could do schema validation before the data is inserted (on the client side). There are pros and cons with both ideas.
Using a Check Constraint
CONSTRAINT validate_createDate CHECK ((attributes->>'createDate')::bigint >= 1)
This forces a non-nullable field (Case 3 fails). But I want the attribute to be optional. Furthermore, if the attribute doesn't convert to a bigint because it is blank (Case 2), this errors out.
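For illustration, a constraint that would tolerate the missing and empty-string cases might look like the sketch below (my assumption of a workaround, using CASE so the bigint cast is only attempted when a non-empty value is actually present):
CONSTRAINT validate_createDate CHECK (
    CASE
        WHEN attributes->>'createDate' IS NULL THEN TRUE   -- Case 3: attribute missing
        WHEN attributes->>'createDate' = ''    THEN TRUE   -- Case 2: attribute blank
        ELSE (attributes->>'createDate')::bigint >= 1      -- Case 1: must be a sane epoch value
    END
)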
Using JSON schema validation on the client side before insert
This works, in part, because the schema validation makes sure that incoming data conforms to the schema. In my case, I can control which clients access this table, so this is OK. But it doesn't help the SQL later on, since my validator will let all three cases pass.
Basically, you need to check whether the createDate attribute is empty:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp((attributes->>'createDate')::bigint * 0.001) FROM data
WHERE
(attributes->>'createDate') IS NOT NULL
AND
(attributes->>'createDate') != '';
Output:
to_timestamp
----------------------------
2015-11-20 17:04:24.151-08
(1 row)
Building on Dmitry's answer, you can also check the JSON type with the json_typeof function. Note the use of the -> operator, which returns json, instead of the ->> operator, which always casts the value to text.
By doing the check in the SELECT with a CASE conditional instead of in the WHERE clause, we also keep the rows that don't have a createDate. Depending on your use case, this might be better.
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT
CASE WHEN (json_typeof(attributes->'createDate') = 'number')
THEN to_timestamp((attributes->>'createDate')::bigint * 0.001)
END AS created_date
FROM data
;
Output:
created_date
----------------------------
"2015-11-21 02:04:24.151+01"
""
""
(3 rows)
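A related variation, purely for illustration (it isn't from either answer above): NULLIF folds the empty string into NULL before the cast, which also keeps all three rows, though unlike the json_typeof check it would still fail on a non-empty, non-numeric string:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp(NULLIF(attributes->>'createDate', '')::bigint * 0.001) AS created_date
FROM data;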
Suppose I have a dataset of tuples (f1, f2). I want to get my data in two bags: one where f1 is null and the other where f1 is not null. I try:
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
I expect to get two groups with keys true and false. When I run it in grunt I get the following:
2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 1046, column 25> Syntax error, unexpected symbol at or near 'f1'
I can do a workaround:
raw_group = GROUP raw BY (f1 is null)?0:1;
but I would really like to understand what's going on here, as I've just started to learn Pig. According to the Pig documentation, I can use expressions as a grouping key. Am I missing something here, or are nulls treated differently in Pig?
The boolean datatype was only introduced in Pig 0.10. The expression f1 is null evaluates to a boolean, and before 0.10 a boolean can't appear as a field in a relation, which is exactly what it would become as the value of group. Prior to 0.10, booleans could only be used in FILTER statements or in the ternary operator, as you showed in your workaround.
While I haven't tried this out, presumably your original statement would succeed in Pig 0.10 or later.
I'm looking for a more efficient way to run many column updates on the same table, like this:
UPDATE tbl
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col ~ 'foo';
Here, foo and bar stand in for a combination of 40 different regex replaces. I doubt even 25% of the dataset needs to be updated at all, but what I want to know is whether it is possible to cleanly achieve the following in SQL:
A single pass update
A single match of the regex, triggers a single replace
Not running all possible regexp_replaces if only one matches
Not updating all columns if only one needs the update
Not updating a row if no column has changed
I'm also curious: I know that in MySQL (bear with me)
UPDATE foo SET bar = 'baz'
has an implicit WHERE bar != 'baz' clause.
However, in PostgreSQL I know this doesn't exist. I think I could at least answer one of my questions if I knew how to skip a single row's update when the target columns weren't actually changed.
Something like
UPDATE tbl
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*
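To make the effect I'm after concrete, here is a rough sketch for a single regex pair (whether something like this stays clean across 40 replacements is exactly what I'm asking):
UPDATE tbl
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col IS DISTINCT FROM regexp_replace( col, 'foo', 'bar' );
IS DISTINCT FROM also skips NULL columns, at the cost of evaluating the replace twice.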
Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.
Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.
Experimental Results
In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:
in a transaction:
    opened up a cursor (select for update)
    while reading 100 rows from the cursor returns any rows:
        for each row:
            for each regular expression:
                do the gsub on the text column
            update the row
This takes 1.16 seconds with a locally connected database.
Then the "big replace," a single mega-regex update:
update foo set t =
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t,
E'\bcommit\b', E'COMMIT'),
E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b',
E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b<cworth@cworth.org>\b',
E'<CWORTH@CWORTH.ORG>'), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:53:13\b', E'04:53:13'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bUpdate\b', E'UPDATE'),
E'\bversion\b', E'VERSION'),
E'\bto\b', E'TO'), E'\b2.9.1\b',
E'2.9.1'), E'\bcommit\b', E'COMMIT'),
E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b',
E'61C89E56F361FA860F18985137D6BF53F48C16AC'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b<cworth@cworth.org>\b',
E'<CWORTH@CWORTH.ORG>'), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:51:58\b', E'04:51:58'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bNEWS:\b', E'NEWS:'),
E'\bAdd\b', E'ADD'), E'\bnotes\b',
E'NOTES'), E'\bfor\b', E'FOR'),
E'\bthe\b', E'THE'), E'\b2.9.1\b',
E'2.9.1'), E'\brelease.\b',
E'RELEASE.'), E'\bThanks\b',
E'THANKS'), E'\bto\b', E'TO'),
E'\beveryone\b', E'EVERYONE'),
E'\bfor\b', E'FOR')
The mega-regex update takes 0.94 seconds to run.
At 0.94 seconds compared to 1.16, it's true that the mega-regex update is faster, running in 81% of the time of doing it in code. It is not, however, a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?
Code
The code used was:
def stupid_regex_replace
  sql = Select.new
  sql.select('id')
  sql.select('t')
  sql.for_update
  sql.from(TABLE_NAME)

  Cursor.new('foo', sql, {}, @db) do |cursor|
    until (rows = cursor.fetch(100)).empty?
      for row in rows
        # Run every regex/replacement pair over the text column.
        for regex, replacement in regexes
          row['t'] = row['t'].gsub(regex, replacement)
        end
        # Write the row back.
        sql = Update.new(TABLE_NAME, @db)
        sql.set('t', row['t'])
        sql.where(['id = %s', row['id']])
        sql.exec
      end
    end
  end
end
I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regexes so you can see them. Each pair is a regex and the corresponding replacement string:
[[/\bcommit\b/, "COMMIT"],
[/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
"9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth@cworth.org>\b/, "<CWORTH@CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:53:13\b/, "04:53:13"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bUpdate\b/, "UPDATE"],
[/\bversion\b/, "VERSION"],
[/\bto\b/, "TO"],
[/\b2.9.1\b/, "2.9.1"],
[/\bcommit\b/, "COMMIT"],
[/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
"61C89E56F361FA860F18985137D6BF53F48C16AC"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth@cworth.org>\b/, "<CWORTH@CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:51:58\b/, "04:51:58"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bNEWS:\b/, "NEWS:"],
[/\bAdd\b/, "ADD"],
[/\bnotes\b/, "NOTES"],
[/\bfor\b/, "FOR"],
[/\bthe\b/, "THE"],
[/\b2.9.1\b/, "2.9.1"],
[/\brelease.\b/, "RELEASE."],
[/\bThanks\b/, "THANKS"],
[/\bto\b/, "TO"],
[/\beveryone\b/, "EVERYONE"],
[/\bfor\b/, "FOR"]]
If this were a hand-generated list of regexes, rather than an automatically generated one, my question still stands: which would you rather have to create or maintain?
For skipping no-op updates, look at the suppress_redundant_updates_trigger() trigger function - see http://www.postgresql.org/docs/8.4/static/functions-trigger.html.
This is not necessarily a win, but it might well be in your case.
Or perhaps you can just add that implicit check as an explicit one?
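For illustration, the trigger route might look roughly like this, assuming a table named tbl as in the question (a sketch; suppress_redundant_updates_trigger() is the function documented on the page above):
CREATE TRIGGER tbl_suppress_redundant_updates
BEFORE UPDATE ON tbl
FOR EACH ROW
EXECUTE PROCEDURE suppress_redundant_updates_trigger();
The trigger silently discards any UPDATE that would not change the row, which covers the "don't rewrite unchanged rows" part, though you still pay the cost of evaluating the regexes for every row.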