KDB+: Formatting Values in Tables

What is the most robust way to specify rules for formatting table values? I want to apply each rule to its corresponding column in the most efficient way. I suppose using the functional form would be helpful in this case.
Here is the sample table:
tbl:flip `GARP`longWgt`shortWgt`longWgtBeta`shortWgtBeta`longWgtRisk`shortWgtRisk`netWgt`netExposure`relativeBeta`relativeRisk`adjBeta`adjRisk!(`GARP_AUTOS_CA`GARP_BANKS_CA`GARP_CHEMICALS_CA`GARP_COMMUNICATIONS_CA`GARP_CONS_DISCR_CA;0.0091686 0.0176234 0.0076484 0.0131509 0.0460397;-0.010305 -0.0470135 0n -0.0078549 -0.0563819;1.3522162 0.6234817 1.3140238 0.7327634 1.1802914;0.1440806 0.7642193 0n 0.7216727 0.6112765;0.3254744 0.1573925 0.2541326 0.2554008 0.350877;0.3079491 0.2218098 0n 0.2594863 0.2758658;-0.0011365 -0.0293902 0.0076484 0.005296 -0.0103422;0.8897173 0.374857 0n 1.67422 0.8165681;9.3851363 0.8158414 0n 1.0153681 1.9308631;1.0569097 0.7095833 0n 0.9842553 1.2719117;8.3501184 -3.269856 0n 1.6999496 1.5766812;-1.0634328 -3.7595078 0n 1.64786 1.0386025)
I want all the numeric columns formatted to 2 decimal places. The longWgt, shortWgt and netWgt columns should be in percent.
I have something like this, but I'm sure there is a better way of doing this:
tbl:update longWgt:100f*longWgt, shortWgt:100f*shortWgt, netWgt:100f*netWgt from tbl;
tbl:update .Q.f[2] each longWgt, .Q.f[2] each shortWgt, .Q.f[2] each longWgtBeta, .Q.f[2] each shortWgtBeta, .Q.f[2] each longWgtRisk, .Q.f[2] each shortWgtRisk, .Q.f[2] each netWgt, .Q.f[2] each netExposure, .Q.f[2] each relativeBeta, .Q.f[2] each relativeRisk, .Q.f[2] each adjBeta, .Q.f[2] each adjRisk from tbl;
tbl:update {x,"%"} each longWgt, {x,"%"} each shortWgt, {x,"%"} each netWgt from tbl;

How about using a functional query:
First multiply the wgtCols below by 100:
wgtCols: `longWgt`shortWgt`netWgt;
![`tbl;();0b;wgtCols!{(*;100f;x)} each wgtCols];
Then format all the columns in allCols (everything except `GARP) to 2 decimal places:
allCols:1_cols tbl;
![`tbl;();0b;allCols!{(each;.Q.f[2];x)} each allCols];
Finally, format the wgtCols as percentages:
![`tbl;();0b;wgtCols!{(each;{x,"%"};x)} each wgtCols]
NB: To find out how to construct your functional query, you can apply parse to the qSQL query of your choice:
parse "update longWgt:100f*longWgt, shortWgt:100f*shortWgt, netWgt:100f*netWgt from tbl"
Output:
!
`tbl
()
0b
`longWgt`shortWgt`netWgt!((*;100f;`longWgt);(*;100f;`shortWgt);(*;100f;`netWgt))

You could also use the 3-argument form of the @ (Amend At) operator, which can be found here https://code.kx.com/wiki/Reference/AtSymbol, since you are only applying functions to columns and not aggregating, filtering or renaming any columns.
It indexes into the item given as the first argument at the indices given by the second argument, applies the function given as the third argument to the resulting elements, and leaves the other elements untouched.
@[`tbl;wgtCols;100*];
@[`tbl;allCols;.Q.f[2]'];
@[`tbl;wgtCols;{x,'"%"}];
These can be nicely merged into one expression using each-both ('), which iterates through the lists of arguments, applying them in turn. Information on each-both can be found at https://code.kx.com/q/ref/adverbs/#each-both
@[`tbl;;]'[(wgtCols;allCols;wgtCols);(100*;.Q.f[2]';{x,'"%"})]
These will all amend in place, i.e. overwrite the tbl variable. If you do not want this to happen you can use this form instead:
@[;;]/[tbl;(wgtCols;allCols;wgtCols);(100*;.Q.f[2]';{x,'"%"})]
This utilises the / (over) accumulator to apply each amend in turn, with the first operating on the initial table value and each result feeding into the next.

Related

Regex comparison in Oracle between 2 varchar columns (from different tables)

I am trying to find a way to capture relevant errors from the Oracle alert log. I have one table (ORA_BLACKLIST) with column values as below (these are the values which I want to ignore from V$DIAG_ALERT_EXT).
Below is sample data in the ORA_BLACKLIST table. This table can grow as additional errors need to be ignored from the alert log.
ORA-07445%[kkqctdrvJPPD
ORA-07445%[kxsPurgeCursor
ORA-01013%
ORA-27037%
ORA-01110
ORA-2154
V$DIAG_ALERT_EXT contains a MESSAGE_TEXT column which contains sample text like below.
ORA-01013: user requested cancel of current operation
ORA-07445: exception encountered: core dump [kxtogboh()+22] [SIGSEGV] [ADDR:0x87] [PC:0x12292A56]
ORA-07445: exception encountered: core dump [java_util_HashMap__get()] [SIGSEGV]
ORA-00600: internal error code arguments: [qercoRopRowsets:anumrows]
I want to write a query something like the one below, to ignore the blacklisted errors and only capture the relevant info.
select
dae.instance_id,
dae.container_name,
err_count,
dae.message_level
from
ORA_BLACKLIST ob,
V$DIAG_ALERT_EXT dae
where
group by .....;
Can someone suggest a way or sample code to achieve it?
I should have provided the exact contents of the blacklist table. It currently contains some Perl regexes, and I want to convert them to Oracle-style regexes and compare them with the V$DIAG_ALERT_EXT MESSAGE_TEXT column. Below are sample Perl regexes from my blacklist table.
ORA-0(,|$| )
ORA-48913
ORA-00060
ORA-609(,|$| )
ORA-65011
ORA-65020
ORA-31(,|$| )
ORA-7452
ORA-959(,|$| )
ORA-3136(,|)|$| )
ORA-07445.[kkqctdrvJPPD
ORA-07445.[kxsPurgeCursor
Your blacklist table looks like it contains LIKE patterns, not regular expressions.
You can write a query like this:
select dae.*   -- or whatever columns you want
from V$DIAG_ALERT_EXT dae
where not exists (select 1
                  from ORA_BLACKLIST ob
                  where dae.message_text like ob.<column name>
                 );
This will not have particularly good performance if the tables are large.
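If the blacklist entries really are regular expressions rather than LIKE patterns, the same NOT EXISTS shape can use REGEXP_LIKE instead. A minimal sketch, assuming the blacklist column is called PATTERN (the real column name was not given) and that the entries are valid POSIX regexes:
select dae.instance_id,
       dae.container_name,
       dae.message_level
from V$DIAG_ALERT_EXT dae
where not exists (select 1
                  from ORA_BLACKLIST ob
                  where regexp_like(dae.message_text, ob.pattern)
                 );
As with the LIKE version, this evaluates every blacklist entry against every message row, so it will slow down as either table grows.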

BigQuery IF condition then append value into Array - Standard SQL

In BigQuery (Standard SQL) I would like to append a value to an existing array if a condition is satisfied.
example
IF (REGEXP_CONTAINS(prodTitle, r'(?i)ecksofa'),ARRAY_CONCAT(prodcategory, ("1102")))
Is this correct and efficient?
Can I use multiple IFs and ARRAY_CONCAT in the same query?
example
IF (REGEXP_CONTAINS(prodTitle, r'(?i)ecksofa'),ARRAY_CONCAT(prodcategory, ("1102")))
IF (REGEXP_CONTAINS(prodTitle, r'(?i)blablan'),ARRAY_CONCAT(prodcategory, ("1103")))
I guess your purpose is something like below for a single IF (I corrected your expression a little):
IF (REGEXP_CONTAINS(prodTitle, r'(?i)ecksofa'),
ARRAY_CONCAT(prodcategory, ["1102"]),
prodcategory)
In order to chain multiple IF and concat the output array, I would use SQL like below:
ARRAY_CONCAT(prodcategory,
IF (REGEXP_CONTAINS(prodTitle, r'(?i)ecksofa'), ["1102"], []),
IF (REGEXP_CONTAINS(prodTitle, r'(?i)blablan'), ["1103"], []),
...
)
To be more efficient, it is better to replace the regular-expression check
REGEXP_CONTAINS(prodTitle, r'(?i)ecksofa')
with a plain substring check:
STRPOS(LOWER(prodTitle), 'ecksofa') != 0
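Putting the pieces together, a minimal sketch of the full query; the table name `project.dataset.products` is a placeholder, while prodTitle and prodcategory are the columns from the question:
SELECT
  prodTitle,
  ARRAY_CONCAT(prodcategory,
    IF(STRPOS(LOWER(prodTitle), 'ecksofa') != 0, ["1102"], []),
    IF(STRPOS(LOWER(prodTitle), 'blablan') != 0, ["1103"], [])
  ) AS prodcategory
FROM `project.dataset.products`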

How do I Process This String?

I have some results in one of my tables and the results vary; each ';' represents one of multiple entries in a single column which I need to split out.
Here is my SQL and the results:
select REGEXP_COUNT(value,';') as cnt,
description
from mytable;
1 {Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};
1 {Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};
2 {Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};
Desired output:
R1:
Managed By: xBoss
Time Requested:2009-10-19 07:53:45.0
Time Arrived: 2009-10-19 07:54:46.0
R2:
Managed By:Own Arrangements
Number: x5876523
Time Requested: 2009-10-19 07:57:46.0
Time Arrived:
R3:
Managed By: xBoss
Time Requested:2009-10-19 08:07:27.0
select
SPLIT_PART(description, '}', 1),
SPLIT_PART(description, '}', 2),
SPLIT_PART(description, '}', 3),
SPLIT_PART(description, '}', 4),
SPLIT_PART(description, '}', 5)
as description_with_tag from mytable;
This is ok when the count is 1, but when there are multiple ; in the description it doesn't give me the results.
Is it possible to put this into an array based on the count?
First, it's worth pointing out that data in this type of format cannot take advantage of all the benefits that Redshift can offer. Amazon Redshift is a columnar database that can provide amazing performance when data is stored in appropriate columns. However, selecting specific text from a text field will always perform poorly.
Therefore, my main advice would be to pre-process the data into normal rows and columns so that Redshift can provide you the best capabilities.
However, to answer your question, I would recommend making a Scalar User-Defined Function:
CREATE FUNCTION f_extract_curly (s TEXT, key TEXT)
RETURNS TEXT
STABLE
AS $$
# List of items in {brackets}
items = s[1:-1].split('}{')
# Dictionary of Key|Value from items
entries = {i.split('|')[0]: i.split('|')[1] for i in items}
# Return desired value
return entries.get(key, None)
$$ LANGUAGE plpythonu;
I loaded sample data with:
CREATE TABLE foo (
description TEXT
);
INSERT INTO foo values('{Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};');
INSERT INTO foo values('{Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};');
INSERT INTO foo values('{Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};');
Then I tested it with:
SELECT
f_extract_curly(description, 'Managed By'),
f_extract_curly(description, 'Time Requested')
FROM foo
and got the result:
xBoss 2009-04-15 20:47:11.0
Modern Management 2009-04-16 14:01:29.0
xBoss
It doesn't know how to handle lines that have the same field specified twice (with semi-colons between). You did not provide enough sample input and output lines for me to figure out what you wanted in such situations, but feel free to tweak the code for your requirements.
There is no array data type in Redshift. There are 3 options:
1) First split_part by ';', then union the results separately for every index of the first split_part output, then split_part those results by '}' and finally pick out what you need (see the sketch after this list).
2) Create a Python UDF and process these strings with Python. I guess this is the best solution for your use case.
3) Transform your data outside Redshift. From your data structure it seems like it's much better to process it before copying to Redshift, unnesting the arrays into rows and extracting keys from your objects into columns.
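A minimal sketch of option 1, using the foo table loaded in the answer above and assuming (as in the sample data) at most two ';'-separated groups per row:
SELECT split_part(grp, '}', 1) AS part1,
       split_part(grp, '}', 2) AS part2,
       split_part(grp, '}', 3) AS part3,
       split_part(grp, '}', 4) AS part4
FROM (
    SELECT split_part(description, ';', 1) AS grp FROM foo
    UNION ALL
    SELECT split_part(description, ';', 2) AS grp FROM foo
) t
WHERE grp <> '';
Each additional possible group needs another UNION ALL branch, which is exactly why the UDF or pre-processing routes scale better.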

How to custom sort this data in SQL Server 2012?

I'm having a hard time figuring out how to custom-sort the data below the way I want. It should be in this order:
201-1-1
201-1-2
201-1-3
.......
201-2-1
and so on, if you know what I mean.
Instead I'm getting this sort order when executing the code below:
select *
from test.dbo.accounts
order by account_name asc
Output:
201-10-1
201-10-2
201-1-1
201-11-1
201-11-2
201-11-3
201-11-4
201-11-6
201-1-2
201-12-1
201-12-2
201-12-3
201-12-4
201-12-6
201-1-3
201-13-1
201-13-2
201-13-3
201-13-4
201-13-6
201-1-4
201-14-1
201-14-2
201-14-4
201-14-6
201-15-1
201-15-2
201-15-3
201-15-4
201-15-6
201-1-6
201-16-1
201-16-2
201-16-3
201-16-4
201-16-6
201-16-7
201-1-7
201-17-1
201-17-2
201-17-4
201-17-6
201-18-1
201-18-2
201-18-3
201-18-4
201-18-6
201-19-1
Thanks
For your sample data, the following trick will work:
order by len(account_name), account_name
This only works because the only variable-length component is the second one, and because the hyphen sorts "smaller" than the digits.
You should really normalize the account names so all the components are the same length, by left-padding the numbers with zeros.
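If changing the stored data isn't an option, you can also sort on the numeric parts at query time. A minimal sketch, assuming every account_name has exactly three hyphen-separated numeric parts; PARSENAME handles up to four dot-separated parts, so the hyphens are swapped for dots first:
select account_name
from test.dbo.accounts
order by
    cast(parsename(replace(account_name, '-', '.'), 3) as int),
    cast(parsename(replace(account_name, '-', '.'), 2) as int),
    cast(parsename(replace(account_name, '-', '.'), 1) as int);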
Ugh. String manipulation in SQL can be extremely cumbersome. There might be a better way to do this, but this does seem to work.
select account_name
from test.dbo.accounts
order by left(account_name,charindex('-',account_name,1)-1)
,replace(right(left(account_name,CHARINDEX('-',account_name,1)+2),2),'-', '')
,REPLACE(right(account_name,2),'-','')
BTW, this is a very expensive process to run. If it's productionalized, you'll want to come up with a better solution.

What is the best way to run N independent column updates in PostgreSQL? What is the best way to do it in the SQL spec?

I'm looking for a more efficient way to run many column updates on the same table, like this:
UPDATE my_table
SET col = regexp_replace( col, 'foo', 'bar' )
WHERE col ~ 'foo';
Such that foo and bar will be a combination of 40 different regex replaces. I doubt even 25% of the dataset needs to be updated at all, but what I want to know is whether it is possible to cleanly achieve the following in SQL:
A single pass update
A single match of the regex, triggers a single replace
Not running all possible regexp_replaces if only one matches
Not updating all columns if only one needs the update
Not updating a row if no column has changed
I'm also curious: I know that in MySQL (bear with me)
UPDATE foo SET bar = 'baz'
has an implicit WHERE bar != 'baz' clause.
However, I know this doesn't exist in PostgreSQL. I think I could at least answer one of my questions if I knew how to skip a single row's update when the target columns weren't actually changed.
Something like
UPDATE my_table
SET col = *temp_var* = regexp_replace( col, 'foo', 'bar' )
WHERE col != *temp_var*
Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.
Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.
Experimental Results
In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:
in a transaction:
opened up a cursor (select for update)
while reading 100 rows from the cursor returns any rows:
for each row:
for each regular expression:
do the gsub on the text column
update the row
This takes 1.16 seconds with a locally connected database.
Then the "big replace," a single mega-regex update:
update foo set t =
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t,
E'\bcommit\b', E'COMMIT'),
E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b',
E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b\b',
E''), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:53:13\b', E'04:53:13'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bUpdate\b', E'UPDATE'),
E'\bversion\b', E'VERSION'),
E'\bto\b', E'TO'), E'\b2.9.1\b',
E'2.9.1'), E'\bcommit\b', E'COMMIT'),
E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b',
E'61C89E56F361FA860F18985137D6BF53F48C16AC'),
E'\bAuthor:\b', E'AUTHOR:'),
E'\bCarl\b', E'CARL'), E'\bWorth\b',
E'WORTH'), E'\b\b',
E''), E'\bDate:\b',
E'DATE:'), E'\bMon\b', E'MON'),
E'\bOct\b', E'OCT'), E'\b26\b',
E'26'), E'\b04:51:58\b', E'04:51:58'),
E'\b2009\b', E'2009'), E'\b-0700\b',
E'-0700'), E'\bNEWS:\b', E'NEWS:'),
E'\bAdd\b', E'ADD'), E'\bnotes\b',
E'NOTES'), E'\bfor\b', E'FOR'),
E'\bthe\b', E'THE'), E'\b2.9.1\b',
E'2.9.1'), E'\brelease.\b',
E'RELEASE.'), E'\bThanks\b',
E'THANKS'), E'\bto\b', E'TO'),
E'\beveryone\b', E'EVERYONE'),
E'\bfor\b', E'FOR')
The mega-regex update takes 0.94 seconds.
At 0.94 seconds compared to 1.16, it's true that the mega-regex update is faster, running in 81% of the time of doing it in code. It is not, however, a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?
Code
The code used was:
def stupid_regex_replace
  sql = Select.new
  sql.select('id')
  sql.select('t')
  sql.for_update
  sql.from(TABLE_NAME)
  Cursor.new('foo', sql, {}, @db) do |cursor|
    until (rows = cursor.fetch(100)).empty?
      for row in rows
        # run every regex over the text column
        for regex, replacement in regexes
          row['t'] = row['t'].gsub(regex, replacement)
        end
        # write the (possibly unchanged) row back
        sql = Update.new(TABLE_NAME, @db)
        sql.set('t', row['t'])
        sql.where(['id = %s', row['id']])
        sql.exec
      end
    end
  end
end
I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regexes so you can see them. Each pair is a regex and the corresponding replacement string:
[[/\bcommit\b/, "COMMIT"],
[/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/,
"9ACF10762B5F3D3B1B33EA07792A936A25E45010"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth@cworth.org>\b/, "<CWORTH@CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:53:13\b/, "04:53:13"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bUpdate\b/, "UPDATE"],
[/\bversion\b/, "VERSION"],
[/\bto\b/, "TO"],
[/\b2.9.1\b/, "2.9.1"],
[/\bcommit\b/, "COMMIT"],
[/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/,
"61C89E56F361FA860F18985137D6BF53F48C16AC"],
[/\bAuthor:\b/, "AUTHOR:"],
[/\bCarl\b/, "CARL"],
[/\bWorth\b/, "WORTH"],
[/\b<cworth@cworth.org>\b/, "<CWORTH@CWORTH.ORG>"],
[/\bDate:\b/, "DATE:"],
[/\bMon\b/, "MON"],
[/\bOct\b/, "OCT"],
[/\b26\b/, "26"],
[/\b04:51:58\b/, "04:51:58"],
[/\b2009\b/, "2009"],
[/\b-0700\b/, "-0700"],
[/\bNEWS:\b/, "NEWS:"],
[/\bAdd\b/, "ADD"],
[/\bnotes\b/, "NOTES"],
[/\bfor\b/, "FOR"],
[/\bthe\b/, "THE"],
[/\b2.9.1\b/, "2.9.1"],
[/\brelease.\b/, "RELEASE."],
[/\bThanks\b/, "THANKS"],
[/\bto\b/, "TO"],
[/\beveryone\b/, "EVERYONE"],
[/\bfor\b/, "FOR"]]
If this were a hand-generated list of regexes, and not automatically generated, my question would still be appropriate: which would you rather have to create or maintain?
For the skip update, look at suppress_redundant_updates_trigger() - see http://www.postgresql.org/docs/8.4/static/functions-trigger.html.
This is not necessarily a win - but it might well be in your case.
Or perhaps you can just add that implicit check as an explicit one?
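A minimal sketch of both ideas; my_table and col are placeholder names:
-- Built-in trigger: silently skips UPDATEs that would not change the row
CREATE TRIGGER z_suppress_redundant_updates
BEFORE UPDATE ON my_table
FOR EACH ROW
EXECUTE PROCEDURE suppress_redundant_updates_trigger();
-- Or make the "only touch rows that actually change" check explicit
UPDATE my_table
SET col = regexp_replace(col, 'foo', 'bar')
WHERE col IS DISTINCT FROM regexp_replace(col, 'foo', 'bar');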