GREL: quick question about templating export and nulls (OpenRefine)

Is there a more elegant expression than this to avoid printing nulls and the literal string "NULL"?
Reading the OpenRefine docs at https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language, I came up with the template below, but it feels odd and complex.
Cheers.
{
"rows" : [
{{ if(isNull(cells["supplierID"].value),
"",
" \"supplierID\" : " + jsonize(cells["supplierID"].value)
)}},
{{ if(or(isNull(cells["homePage"].value), cells["homePage"].value == "NULL"),
"",
" \"homePage\" : " + jsonize(cells["homePage"].value)
)}}
]
}

Usually I'd aim to make the column consistent before doing an export (i.e. either containing a null value in all cases or the string "NULL" in all cases, but not a mix of the two). So you could first run a transform on the column like:
if(value == "NULL", null, value)
If you were to do this first, then in the export template you can use:
forNonBlank(cells["homePage"].value, v, " \"homePage\" : " + jsonize(v), "")
However, if you don't want to make this change to the data for some reason, you can achieve a slightly more elegant option than the one you have by using the coalesce function (available from OpenRefine 3.0 onwards), which chooses the first non-null value from a list:
{{ if(coalesce(cells["homePage"].value, "NULL") == "NULL", "", " \"homePage\" : " + jsonize(cells["homePage"].value)) }}
coalesce will use cells["homePage"].value if it is non-null, or fall back to the string "NULL" if it is null. So whether you feed it a cell containing the string "NULL" or a cell that is actually null, coalesce outputs the string "NULL" in both cases, which means you only have to check for a single condition.
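Putting it together, here's a sketch of the full template with coalesce applied to the homePage column (supplierID keeps the plain isNull check, since that column only ever contains real nulls):
{
"rows" : [
{{ if(isNull(cells["supplierID"].value), "", " \"supplierID\" : " + jsonize(cells["supplierID"].value)) }},
{{ if(coalesce(cells["homePage"].value, "NULL") == "NULL", "", " \"homePage\" : " + jsonize(cells["homePage"].value)) }}
]
}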
coalesce is documented at https://github.com/OpenRefine/OpenRefine/wiki/GREL%20Other%20Functions

Related

Pattern match using regexp_extract_all

I am trying to build an array from this string and need help with the pattern for REGEXP_EXTRACT_ALL.
Here is my input string containing keyword/value pairs:
BEGIN
DECLARE p_JSON STRING DEFAULT """
{
"instances": [{
"LT_20MN_SalesContrctCnt": 388.0,
"Pyramid_Index": '',
"MARKET": "'Growth Markets','Europe'",
"SERVICE_DIM": "'S&C','F&M'",
"SG_MD": "'All Service Group'"
}]}
""";
SELECT split(x, ":")[OFFSET(0)] AS keyword, split(x, ":")[OFFSET(1)] AS keyword_value
FROM unnest(split(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'), r'([\'\"\[\]{}])', ''))) AS x;
END;
The above SQL is failing at SPLIT due to the commas within the data.
All I am trying to do here is build two columns, keyword and keyword_value.
The idea is that if I can extract each row using REGEXP_EXTRACT_ALL without the trailing ",", then I should be able to split it into the keyword and keyword_value columns. Btw, the names and number of keywords/values are not fixed.
Intended output from REGEXP_EXTRACT_ALL:
"LT_20MN_SalesContrctCnt": 388.0
"Pyramid_Index": ''
"MARKET": "'Growth Markets','Europe'"
"SERVICE_DIM": "'S&C','F&M'"
"SG_MD": "'All Service Group'"
I'd appreciate it if you can suggest a better way to handle this.
Thanks in advance.
Using your sample data, I just added an extra REGEXP_REPLACE to replace ," with #" so we can avoid splitting on the commas inside the values. See the approach below:
SELECT
  SPLIT(arr, ":")[OFFSET(0)] AS keyword,
  SPLIT(arr, ":")[OFFSET(1)] AS keyword_value
FROM sample_data,
  UNNEST(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'), r'[\[\]{}]', ''), r',"', '#"'), '#')) arr
Output:
keyword                      keyword_value
"LT_20MN_SalesContrctCnt"    388.0
"Pyramid_Index"              ''
"MARKET"                     "'Growth Markets','Europe'"
"SERVICE_DIM"                "'S&C','F&M'"
"SG_MD"                      "'All Service Group'"
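Alternatively, if you want to stay closer to your original REGEXP_EXTRACT_ALL idea, a sketch like the one below may work. It makes an assumption about your data: values are either double-quoted strings or tokens free of commas, braces, and whitespace. The non-capturing (?:...) group matters, because with a capturing group REGEXP_EXTRACT_ALL would return only the captured part instead of the whole pair.
-- Assumes p_JSON is declared as in the question.
SELECT
  SPLIT(pair, ':')[OFFSET(0)] AS keyword,
  SPLIT(pair, ':')[OFFSET(1)] AS keyword_value
FROM UNNEST(REGEXP_EXTRACT_ALL(
  JSON_EXTRACT(p_JSON, '$.instances'),
  r'"[A-Za-z0-9_]+":\s*(?:"[^"]*"|[^,}\s\[]+)'
)) AS pair;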

PostgreSQL ON CONFLICT error "column EXCLUDED.column_name does not exist" with psycopg2.sql Placeholders

I'm using psycopg2 with psycopg2.sql.
import psycopg2
from psycopg2 import sql
I re-wrote some static sql code to be more dynamic by using sql.Placeholder and sql.Identifier.
However, even when there is no conflict, I get an error:
Error inserting into table: column "EXCLUDED.domain_name" does not exist
My sql query looks like this:
query = sql.SQL("insert into dns ({}) values ({}) "
"ON CONFLICT ({}) "
"DO UPDATE SET ({}) = ({}) "
"WHERE {} >= ({});").format(
sql.SQL(', ').join(map(sql.Identifier, dns_cols)),
sql.SQL(', ').join(sql.Placeholder() * len(dns_vals)),
sql.SQL(', ').join(map(sql.Identifier, conflict_cols)),
sql.SQL(', ').join(map(sql.Identifier, dns_cols)),
sql.SQL(', ').join(map(sql.Identifier, excluded_names)),
sql.Identifier('EXCLUDED.updated_date_time'),
sql.Identifier('dns.updated_date_time')
)
mogrify prints out the following:
b'insert into dns ("domain_name", "tld", "subdomain", "https", "cf_url", "updated_date_time") values (\'example\', \'org\', \'\', false, \'http://example.org\', \'2018-12-06 23:12:00\') ON CONFLICT ("domain_name", "tld", "subdomain") DO UPDATE SET ("domain_name", "tld", "subdomain", "https", "cf_url", "updated_date_time") = ("EXCLUDED.domain_name", "EXCLUDED.tld", "EXCLUDED.subdomain", "EXCLUDED.https", "EXCLUDED.cf_url", "EXCLUDED.updated_date_time") WHERE "EXCLUDED.updated_date_time" >= ("dns.updated_date_time");'
dns_cols, excluded_names, and dns_vals are all lists and their values appear to be showing up just fine in the mogrify print out.
I have never needed to create EXCLUDED columns; they are always accessible when ON CONFLICT is triggered.
How do I reference EXCLUDED columns when using psycopg2.sql Placeholders?
You are referencing an unqualified column whose name contains a literal period. If you want to quote this, you would have to quote the two parts separately, not together. Also, EXCLUDED has to be in lower case if you do quote it (I am kind of surprised that it works at all when quoted).
"excluded"."domain_name"
The best way to do this is probably to hard code the EXCLUDED into the query string and remove it from the sql.Identifier call:
"DO UPDATE SET ({}) = (EXCLUDED.{}) "
...
sql.Identifier('updated_date_time'),
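One caveat with the joined identifier list: a single EXCLUDED. prefix in the format string would only qualify the first column in the list. Below is a sketch of the whole query with each column qualified individually instead, relying on sql.Identifier accepting multiple strings (psycopg2 2.8+) to render qualified names like "excluded"."domain_name". The column and value lists are assumptions reconstructed from the mogrify output above:
import psycopg2
from psycopg2 import sql

# Hypothetical lists reconstructed from the mogrify output above.
dns_cols = ["domain_name", "tld", "subdomain", "https", "cf_url", "updated_date_time"]
dns_vals = ["example", "org", "", False, "http://example.org", "2018-12-06 23:12:00"]
conflict_cols = ["domain_name", "tld", "subdomain"]

query = sql.SQL(
    "insert into dns ({cols}) values ({vals}) "
    "ON CONFLICT ({conflict}) "
    "DO UPDATE SET ({cols}) = ({excluded}) "
    "WHERE {excluded_ts} >= {dns_ts};"
).format(
    cols=sql.SQL(", ").join(map(sql.Identifier, dns_cols)),
    vals=sql.SQL(", ").join(sql.Placeholder() * len(dns_vals)),
    conflict=sql.SQL(", ").join(map(sql.Identifier, conflict_cols)),
    # "excluded" must be lower case when quoted; Identifier with two
    # arguments renders the dot-separated, individually quoted name.
    excluded=sql.SQL(", ").join(sql.Identifier("excluded", c) for c in dns_cols),
    excluded_ts=sql.Identifier("excluded", "updated_date_time"),
    dns_ts=sql.Identifier("dns", "updated_date_time"),
)

# with conn.cursor() as cur:
#     cur.execute(query, dns_vals)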

Can we use CASE in a ColdFusion Query-of-Query?

I am applying CASE in a ColdFusion query-of-query, but it's throwing an error.
Query:
<cfquery name="qEmployees1" dbtype="query">
select (
case
when ISNUMERIC(u.userdefined)=1
then right('00000000'+u.userdefined,8)
else userdefined
end
) as hello
from all_employees
order by hello ASC
</cfquery>
Error message:
Encountered "when" at line 3, column 22. Was expecting one of:
"AND" ... "BETWEEN" ... "IN" ... "IS" ... "LIKE" ... "NOT" ...
"OR" ... ")" ... "=" ... "." ... "!=" ... "#" ... "<>" ...
">" ... ">=" ... "<" ... "<=" ... "+" ... "-" ... "*" ...
"||" ... "/" ... "**" ... "(" ...
Update:
The original suggestion isn't going to work because it only looks at a single row. Really you need to loop through your all_employees recordset and apply the logic to each individual row.
You might be able to achieve this without QoQ if you are just outputting the results to the page. Like this:
<cfoutput>
<cfloop query="all_employees">
<cfif isNumeric(all_employees.userdefined)>
#Right('00000000'&all_employees.userdefined,8)#
<cfelse>
#all_employees.userdefined#
</cfif>
</cfloop>
</cfoutput>
Original Answer:
How about something like this?:
<cfquery name="qEmployees1" dbtype="query">
SELECT
<cfif isNumeric([all_employees].[u.userdefined])>
right('00000000'+u.userdefined,8)
<cfelse>
u.userdefined
</cfif> AS hello
FROM all_employees
ORDER by hello
</cfquery>
I have not tested this but I don't think having dot notation in the SQL column name will work correctly in this case. I enclosed it in square brackets anyway.
In case anyone else decides to try the QoQ below, one very important thing to note is that even if it executes without error, it's NOT doing the same thing as CASE. A CASE statement applies logic to the values within each row of a table - individually. In the QoQ version, the CFIF expression does not operate on all values within the query. It only examines the value in the 1st row and then applies the decision for that one value to ALL rows in the query.
Notice how the QoQ below (incorrectly) reports that all of the values are numeric? While the database query (correctly) reports a mix of "Numeric" and "Non-numeric" values. So the QoQ code is not equivalent to CASE.
TestTable Data:
id userDefined
1 22
2 AA
3 BB
4 CC
Database Query:
SELECT CASE
WHEN ISNUMERIC(userDefined)=1 THEN 'Number: '+ userDefined
ELSE 'Not a number: ' + userDefined
END AS TheColumnAlias
FROM TestTable
ORDER BY ID ASC
Database Query Result:
TheColumnAlias
Number: 22
Not a number: AA
Not a number: BB
Not a number: CC
QoQ
<cfquery name="qQueryOfQuery" dbtype="query">
SELECT
<cfif isNumeric(qDatabaseQuery2.userDefined)>
'Number: '+ userDefined
<cfelse>
'Not a number: ' + userDefined
</cfif>
AS TheColumnAlias
FROM qDatabaseQuery2
ORDER by ID
</cfquery>
QoQ Result (note that every row incorrectly comes out as a number):
TheColumnAlias
Number: 22
Number: AA
Number: BB
Number: CC
EDIT:
I thought about this one and decided to change it to an actual answer. Since you're using CF2016+, you have access to some of the more modern features that CF offers. First, Query of Query is a great tool, but it can be very slow, especially for lower record counts. And if there are a lot of records in your base query, it can eat up your server's memory, since it's an in-memory operation. We can accomplish our goal without the need for a QoQ.
One way we can sort of duplicate the functionality you're looking for is with some of the newer CF functions: filter, each and sort all work on a query object. These are the member-function versions, but I think they look cleaner. Plus, I've used cfscript syntax.
I mostly reused my original cfscript query (all_employees) that creates the query object, but I added an f column to it, which holds the text to be filtered on.
all_employees = QueryNew("userdefined,hello,f", "varchar,varchar,varchar",
[
["test","pure text","takeMe"],
["2","number as varchar","takeMe"],
["03","leading zero","takeMe"],
[" 4 ","leading and trailing spaces","takeMe"],
["5 ","extra trailing spaces","takeMe"],
[" 6","extra leading spaces","takeMe"],
["aasdfadsf","adsfasdfasd","dontTakeMe"],
["165e73","scientific notation","takeMe"],
["1.5","decimal","takeMe"],
["1,5","comma-delimited (or non-US decimal)","takeMe"],
["1.0","valid decimal","takeMe"],
["1.","invalid decimal","takeMe"],
["1,000","number with comma","takeMe"]
]
) ;
The original base query didn't have a WHERE clause, so no additional filtering was being done on the initial results. But if we needed to, we could duplicate that with QueryFilter or .filter.
filt = all_employees.filter( function(whereclause){ return ( whereclause.f == "takeMe"); } ) ;
This takes the all_employees query and applies a function that will only return rows that match our function requirements. So any row of the query where f == "takeMe". That's like WHERE f = 'takeMe' in a query. That sets the new filtered results into a new query object filt.
Then we can use QueryEach or .each to go through every row of our new filtered query to modify what we need to. In this case, we're building a new array for the values we want. A for/in loop would probably be faster; I haven't tested.
retval = [] ; // collect the transformed values
filt.each(
    function(r) {
        retval.append(
            ISNUMERIC(r.userDefined) ? right("00000000" & ltrim(rtrim(r.userdefined)), 8) : r.userDefined
        ) ;
    }
) ;
Now that we have a new array with the results we want, the original QoQ wanted to order those results. We can do this with ArraySort or .sort.
retval.sort("textnocase") ;
In my test, CF2016's retval.sort() seemed to return a boolean rather than the sorted array, but CF2018 returned the array. This is expected behavior, since the return type was changed in CF2018. Regardless, both will sort the retval array in place, so that when we dump the retval array, it's in the chosen order.
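A quick sketch of that difference (sorted is just a hypothetical variable name):
retval.sort("textnocase") ;           // sorts retval in place on both engines
sorted = retval.sort("textnocase") ;  // CF2018+ returns the sorted array; CF2016 returns a boolean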
And as I always suggest, load test on your system with your data. Like I said, this is only one way to go about what you're trying to do. There are others that may be faster.
https://cffiddle.org/app/file?filepath=dedd219b-6b27-451d-972a-7af75c25d897/54e5559a-b42e-4bf6-b19b-075bfd17bde2/67c0856d-bdb3-4c92-82ea-840e6b8b0214.cfm
(CF2018) > https://trycf.com/gist/2a3762dabf10ad695a925d2bc8e55b09/acf2018?theme=monokai
https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-m-r/queryfilter.html
https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-m-r/queryeach.html
https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-a-b/arraysort.html
ORIGINAL:
This is more of a comment than an answer, but it's much too long for a comment.
I wanted to mention a couple of things to watch out for.
First, ColdFusion's isNumeric() can sometimes have unexpected results. It doesn't really check whether a value is a number; it checks whether a string can be converted to a number. So there are all sorts of values that isNumeric() will see as numeric. EX: 1e3 is scientific notation for 1000, so isNumeric("1e3") will return true.
My second suggestion is about dealing with leading and trailing spaces in a "numeric" value, EX: " 4 ". isNumeric() will return true for this one, but when you pad and trim for your final value, it will come out as "000000 4". My suggestion for these is to use val() or ltrim(rtrim()) around your column. val() will reduce it to a basic number (" 1.0 " >> "1"), but ltrim(rtrim()) will retain the number while getting rid of the spaces (" 1.0 " >> "1.0") and also retain the "scientific notation" value (" 1e3 " >> "1e3"). Both still miss 1,000, so if that's a concern you'll need to handle it. The method you use totally depends on the values your data contains. Number verification isn't always as easy as it seems it should be.
I've always been a firm believer in GIGO -- Garbage In, Garbage Out. I see basic data cleansing as part of my job. But if it's extreme or regular, I'll tell the source to fix it or their stuff won't work right. When it comes to data, it's impossible to account for all possibilities, but we can check for common expectations. It's always easier to whitelist than it is to blacklist.
<cfscript>
all_employees = QueryNew("userdefined,hello", "varchar,varchar",
[
["test","pure text"],
["2","number as varchar"],
["03","leading zero"],
[" 4 ","leading and trailing spaces"],
["5 ","extra trailing spaces"],
[" 6","extra leading spaces"],
["165e73","scientific notation"],
["1.5","decimal"],
["1,5","comma-delimited (or non-US decimal)"],
["1.0","valid decimal"],
["1.","invalid decimal"],
["1,000","number with comma"]
]
)
//writedump(all_employees) ;
retval = [] ;
for (r in all_employees) {
retval.append(
{
"1 - RowInput" : r.userdefined.replace(" ","*","all") , // Replace space with * for output visibility.
"2 - IsNumeric?" : ISNUMERIC(r.userdefined) ,
"3 - FirstOutput": ( ISNUMERIC(r.userDefined) ? right("00000000"&r.userdefined,8) : r.userDefined ) ,
"4 - ValOutput" : ( ISNUMERIC(r.userDefined) ? right("00000000"&val(r.userdefined),8) : r.userDefined ) ,
"5 - TrimOutput" : ( ISNUMERIC(r.userDefined) ? right("00000000"&ltrim(rtrim((r.userdefined))),8) : r.userDefined )
}
) ;
}
writeDump(retval) ;
</cfscript>
https://trycf.com/gist/03164081321977462f8e9e4916476ed3/acf2018?theme=monokai
What are you trying to do exactly? Please share some context of the goal for your post.
To me it looks like your query may not be formatted properly. It would evaluate to something like:
select ( 0000000099
) as hello
from all_employees
order by hello ASC
Try doing this: put a <cfabort> right here... and then let me know what query was produced on the screen when you run it.
<cfquery name="qEmployees1" dbtype="query">
select (
case
when ISNUMERIC(u.userdefined)=1
then right('00000000'+u.userdefined,8)
else userdefined
end
) as hello
from all_employees
order by hello ASC
<cfabort>
</cfquery>
<cfquery name="qEmployees1" dbtype="query">
SELECT
(
<cfif isNumeric(all_employees.userdefined)>
right('00000000'+all_employees.userdefined,8)
<cfelse>
all_employees.userdefined
</cfif>
) AS hello
FROM all_employees
ORDER by hello
</cfquery>
This is the syntax-error-free answer, thanks to @volumeone.

How to read each row in a groovy-sql statement?

I am trying to read a table having five rows and columns. I have used the sql.eachRow function to read each row and assign a value to a String. I am getting the error "Groovy:[Static type checking] - No such property: MachineName for class: java.lang.Object".
My code:
sql.eachRow('select * from [MACHINES] WHERE UpdateTime > :lastTimeRead', [lastTimeRead: Long.parseLong(lastTimeRead)]) { row ->
    def read = row.MachineName
}
But MachineName is my column name. How can I overcome this error?
Using dynamic properties with static type checking is not possible*.
However, eachRow will pass a GroovyResultSet as the first parameter to the Closure. This means that row has the type GroovyResultSet, so you can access the value using getAt:
row.getAt('MachineName')
should work. In Groovy you can also use the []-operator:
row['MachineName']
which is equivalent to the first solution.
*) without a type checking extension.
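Putting it together, a sketch of the whole call with the quoting fixed (table and column names as in the question; sql is assumed to be an existing groovy.sql.Sql instance):
import groovy.sql.Sql

// def sql = Sql.newInstance(jdbcUrl, user, password, driverClass)
sql.eachRow('select * from [MACHINES] WHERE UpdateTime > :lastTimeRead',
        [lastTimeRead: Long.parseLong(lastTimeRead)]) { row ->
    // row is a GroovyResultSet, so getAt / the []-operator pass static type checking
    def machineName = row['MachineName']
    println "MachineName = ${machineName}"
}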
If you know the column name, you can just use the below:
"$row.MachineName"
But if you don't know the column name, or are still having issues, the values can be accessed by column index:
sql.eachRow('select * from [MACHINES] WHERE UpdateTime > :lastTimeRead', [lastTimeRead: Long.parseLong(lastTimeRead)]) { row ->
    log.info "First value = ${row[0]}, next value = ${row[1]}"
}

Handling null or missing attributes in JSON using PostgreSQL

I'm learning how to handle JSON in PostgreSQL.
I have a table with some columns. One column is a JSON field. The data in that column has at least these three variations:
Case 1: {"createDate": 1448067864151, "name": "world"}
Case 2: {"createDate": "", "name": "hello"}
Case 3: {"name": "sky"}
Later on, I want to select the createDate.
TO_TIMESTAMP((attributes->>'createDate')::bigint * 0.001)
That works fine for Case 1 when the data is present and it is convertible to a bigint. But what about when it isn't? How do I handle this?
I read this article. It explains that we can add check constraints to perform some rudimentary validation. Alternatively, I could do schema validation before the data is inserted (on the client side). There are pros and cons to both ideas.
Using a Check Constraint
CONSTRAINT validate_createDate CHECK ((attributes->>'createDate')::bigint >= 1)
This forces a non-nullable field (Case 3 fails). But I want the attribute to be optional. Furthermore, if the attribute doesn't convert to a bigint because it is blank (Case 2), this errors out.
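One way to keep the attribute optional while still validating it when present is to wrap the check in a CASE, which guarantees the cast is never attempted on a missing or empty value (a sketch; the table name my_table is hypothetical):
ALTER TABLE my_table ADD CONSTRAINT validate_createdate CHECK (
  CASE
    WHEN attributes->>'createDate' IS NULL THEN true  -- Case 3: attribute missing (or JSON null)
    WHEN attributes->>'createDate' = '' THEN true     -- Case 2: empty string
    ELSE (attributes->>'createDate')::bigint >= 1     -- Case 1: must convert to a positive bigint
  END
);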
Using JSON schema validation on the client side before insert
This works, in part, because the schema validation makes sure that incoming data conforms to the schema. In my case, I can control which clients access this table, so this is OK. But it doesn't help with the SQL later on, since my validator will let all three cases pass.
Basically, you need to check whether the createDate attribute is empty:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp((attributes->>'createDate')::bigint * 0.001) FROM data
WHERE
(attributes->>'createDate') IS NOT NULL
AND
(attributes->>'createDate') != '';
Output:
to_timestamp
----------------------------
2015-11-20 17:04:24.151-08
(1 row)
Building on Dmitry's answer, you can also check the JSON type with the json_typeof function. Note the JSON operator -> used here to get json, instead of the ->> operator, which always casts the value to string.
By doing the check in the SELECT with a CASE conditional instead of in the WHERE clause, we also keep the rows that have no createDate. Depending on your use case, this might be better.
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT
CASE WHEN (json_typeof(attributes->'createDate') = 'number')
THEN to_timestamp((attributes->>'createDate')::bigint * 0.001)
END AS created_date
FROM data
;
Output:
created_date
----------------------------
"2015-11-21 02:04:24.151+01"
""
""
(3 rows)