I have some arrays stored in a Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format isn't usable as-is, I need to do some processing on it to get the values out. I've tried using a regex to extract them:
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error back that says:
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easily in Python or PHP or the like, but I'm interested in a pure SQL solution - partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this:
SELECT REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '') AS x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '') AS y
FROM transactions;
I tried this with a Redshift DB and it worked for me.
Detailed Explanation:
SPLIT_PART(breakdown, ',', 1) will give you [50.
SPLIT_PART(breakdown, ',', 2) will give you 50].
REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '') removes the [ and gives just 50.
REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '') removes the ] and gives just 50.
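Putting those two expressions together and casting to integers, so you can do math on the result, might look like this (a sketch, assuming breakdown always holds exactly two comma-separated values):
-- Sketch: assumes breakdown always holds exactly two comma-separated values
SELECT id,
       total,
       REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '')::INT AS x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '')::INT AS y
FROM transactions;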
I know it's an old post, but if someone needs a much easier way:
select json_extract_array_element_text('[100,101,102]', 2);
Output: 102
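Applied to the table from the question, a sketch using the same function (the element index is zero-based; the ::INT casts are my assumption about the data always being integers):
-- Sketch: assumes breakdown is JSON-style array text like '[50,50]'
select id, total,
       json_extract_array_element_text(breakdown, 0)::int as x,
       json_extract_array_element_text(breakdown, 1)::int as y
from transactions;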
I have some results in one of my tables and the results vary; each ';' represents multiple entries in one column, which I need to split out.
Here is my SQL and the results:
select REGEXP_COUNT(value,';') as cnt,
description
from mytable;
1  {Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};
1  {Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};
2  {Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};
Desired output:
R1:
Managed By: xBoss
Time Requested: 2009-10-19 07:53:45.0
Time Arrived: 2009-10-19 07:54:46.0
R2:
Managed By: Own Arrangements
Number: x5876523
Time Requested: 2009-10-19 07:57:46.0
Time Arrived:
R3:
Managed By: xBoss
Time Requested: 2009-10-19 08:07:27.0
select
SPLIT_PART(description, '}', 1),
SPLIT_PART(description, '}', 2),
SPLIT_PART(description, '}', 3),
SPLIT_PART(description, '}', 4),
SPLIT_PART(description, '}', 5)
as description_with_tag from mytable;
This is OK when the count is 1, but when there are multiple ';'s in the description it doesn't give me the results I need.
Is it possible to put this into an array based on the count?
First, it's worth pointing out that data in this type of format cannot take advantage of all the benefits that Redshift can offer. Amazon Redshift is a columnar database that can provide amazing performance when data is stored in appropriate columns. However, selecting specific text from a text field will always perform poorly.
Therefore, my main advice would be to pre-process the data into normal rows and columns so that Redshift can provide you the best capabilities.
However, to answer your question, I would recommend making a Scalar User-Defined Function:
CREATE FUNCTION f_extract_curly (s TEXT, key TEXT)
RETURNS TEXT
STABLE
AS $$
    # Strip the trailing semi-colon and the outer braces,
    # then split into the individual {bracketed} items
    items = s.rstrip(';')[1:-1].split('}{')
    # Build a dictionary of Key|Value pairs from the items
    entries = {i.split('|', 1)[0]: i.split('|', 1)[1] for i in items}
    # Return the value for the requested key (None if absent)
    return entries.get(key, None)
$$ LANGUAGE plpythonu;
I loaded sample data with:
CREATE TABLE foo (
description TEXT
);
INSERT INTO foo values('{Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};');
INSERT INTO foo values('{Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};');
INSERT INTO foo values('{Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};');
Then I tested it with:
SELECT
f_extract_curly(description, 'Managed By'),
f_extract_curly(description, 'Time Requested')
FROM foo
and got the result:
xBoss 2009-04-15 20:47:11.0
Modern Management 2009-04-16 14:01:29.0
xBoss
It doesn't know how to handle lines that have the same field specified twice (with semi-colons between). You did not provide enough sample input and output lines for me to figure out what you wanted in such situations, but feel free to tweak the code for your requirements.
There is no array data type in Redshift. There are three options:
1) First split_part by ';', then union the results separately for every index of the first split_part output, then split_part those results by '}' and finally extract what you need (see the sketch after this list).
2) Create a Python UDF and process these strings with Python. I guess this is the best solution for your use case.
3) Transform your data outside Redshift. From your data structure it seems like it's much better to process it before copying to Redshift, unnesting the arrays into rows and extracting keys from your objects into columns.
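For option 1, a minimal sketch (assuming no description holds more than two ';'-separated records) could look like this:
-- Sketch: assumes at most two records per description
SELECT SPLIT_PART(description, ';', 1) AS record
FROM mytable
WHERE SPLIT_PART(description, ';', 1) <> ''
UNION ALL
SELECT SPLIT_PART(description, ';', 2)
FROM mytable
WHERE SPLIT_PART(description, ';', 2) <> '';
Each resulting record can then be broken down further with SPLIT_PART on '}'.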
I have the following table:
EstimatedCurrentRevenue -- Revenue column value of the current day
EstimatedPreviousRevenue -- Revenue column value of yesterday
crmId
OwnerId
PercentageChange
I am querying two snapshots of similarly structured data in Azure Data Lake and trying to compute the percentage change in Revenue.
Following is my query; I am trying to join on OpportunityId to get the difference between the revenue values:
#opportunityRevenueData = SELECT (((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue)*100)/opty.EstimatedCurrentRevenue) AS PercentageRevenueChange, optyPrevious.EstimatedPreviousRevenue,
opty.EstimatedCurrentRevenue, opty.crmId, opty.OwnerId From #opportunityCurrentData AS opty JOIN #opportunityPreviousData AS optyPrevious on opty.OpportunityId == optyPrevious.OpportunityId;
But I get the following error:
E_CSC_USER_SYNTAXERROR: syntax error. Expected one of: AS EXCEPT FROM
GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
','
at token 'From', line 40
near the ###:
I know this expression is the problem, but I am not sure how to fix it:
(((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue)*100)/opty.EstimatedCurrentRevenue)
Please help; I am completely new to U-SQL.
U-SQL is case-sensitive, with all SQL reserved words in UPPER CASE. So you should capitalise the FROM and ON keywords in your statement, like this:
#opportunityRevenueData =
SELECT (((opty.EstimatedCurrentRevenue - optyPrevious.EstimatedPreviousRevenue) * 100) / opty.EstimatedCurrentRevenue) AS PercentageRevenueChange,
optyPrevious.EstimatedPreviousRevenue,
opty.EstimatedCurrentRevenue,
opty.crmId,
opty.OwnerId
FROM #opportunityCurrentData AS opty
JOIN
#opportunityPreviousData AS optyPrevious
ON opty.OpportunityId == optyPrevious.OpportunityId;
Also, if you are completely new to U-SQL, you should consider working through some tutorials to establish the basics of the language, including case-sensitivity. Start at http://usql.io/.
This same crazy-sounding error message can occur for (almost?) any U-SQL syntax error. The answer above was clearly correct for the provided code.
However, since many folks will probably get to this page from a search for 'AS EXCEPT FROM GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE', I'd say the best advice for handling these is to look closely at the snippet of your code that the error message has marked with '###'.
For example, I got to this page after getting a syntax error for a long query, and it turned out I didn't have a casing issue, but just a malformed query with parens around the wrong thing. Once I looked more closely at where in the snippet the ### symbol was, the error became clear.
I've tried many queries to find... just one word, and I can't even manage that.
It's a DB2 database; I'm using com.ibm.db2.jcc.DB2Driver.
This brings me info:
select *
from JL_ENR
where id_ws = '002'
and dc_dy_bsn = '2014-08-25'
and ai_trn = 2331
The JL_TPE column is the CLOB column where I want to find two strings within that search result (i.e. still filtering by dc_dy_bsn = '2014-08-25' and ai_trn = 2331).
So first I tried with one:
select
    dbms_lob.substr(JL_TPE, dbms_lob.instr(JL_TPE, 'CEMENTO'), 1)
from
    JL_ENR
where
    dbms_lob.instr(JL_TPE, 'CEMENTO') > 0;
didn't work.
SELECT * FROM JL_ENR WHERE dbms_lob.instr(JL_TPE,'CEMENTO')>0
and ai_trn = 2331
and dc_dy_bsn = '2014-08-25'
didn't work.
Select *
From JL_ENR
Where NOT
DBMS_LOB.INSTR(JL_TPE, 'CEMENTO', 1, 1) = 0;
didn't work.
Could someone explain to me how to find two strings, please?
Or share a tutorial link where it is explained how to make this work...
Thanks.
Can you provide some sample data and the version you are using? Your example should work (tested on v10.5.0.1):
db2 "create table test ( x int, y clob(1M) )"
db2 "insert into test (x,y) values (1,cast('The string to find is CEMENTO, how do we do that?')"
db2 "insert into test (x,y) values (2,cast('The string to find is CEMENT, how do we do that?' as clob))"
db2 "select x, DBMS_LOB.INSTR(y, 'CEMENTO', 1) from test where DBMS_LOB.INSTR(y, 'CEMENTO', 1) > 0"
X           2
----------- -----------
          1          23

  1 record(s) selected.
I had to search for a specific value in the WHERE clause. I used TEXTBLOB LIKE '%Search value%' and it worked! This was for DB2, in a CLOB(536870912) column.
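Building on that, a sketch for finding rows whose CLOB contains two strings at once (the second search term 'ARENA' is hypothetical; substitute your own):
-- Sketch: 'ARENA' is a made-up second search term
SELECT *
FROM JL_ENR
WHERE ai_trn = 2331
  AND dc_dy_bsn = '2014-08-25'
  AND JL_TPE LIKE '%CEMENTO%'
  AND JL_TPE LIKE '%ARENA%';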
I have loaded the following test data:
name, age,gender
"John", 33,m
"Sam", 33,m
"Julie",33,f
"Jimbo",, m
with schema: name:STRING,age:INTEGER,gender:STRING and I have confirmed that the Jimbo row shows a null for column "age" in the BigQuery Browser Tool > mydataset > Details > Preview section.
When I run this query :
SELECT AVG(age) FROM [peterprivatedata.testpeople]
I get 24.75, which is incorrect. I expected 33, because the documentation for AVG says "Rows with a NULL value are not included in the calculation."
Am I doing something wrong or is this a known bug? (I don't know if there's a public issues list to check). What's the simplest workaround to this?
This is a known bug where we coerce null numeric values to 0 on import. We're currently working on a fix. These values do, however, show up as not defined (which for various reasons is different from null), so you can check for IS_EXPLICITLY_DEFINED. For example:
SELECT sum(if(is_explicitly_defined(numeric_field), numeric_field, 0)) /
sum(if(is_explicitly_defined(numeric_field), 1, 0))
AS my_avg FROM your_table
Alternately, you could use another column to represent is_null. Then the query would look like:
SELECT sum(if(numeric_field_is_null, 0, numeric_field)) /
sum(if(numeric_field_is_null, 0, 1))
AS my_avg FROM your_table
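Applied to the sample table from the question, the first workaround might look like this (a sketch; age is the numeric field being averaged):
SELECT sum(if(is_explicitly_defined(age), age, 0)) /
       sum(if(is_explicitly_defined(age), 1, 0))
AS my_avg FROM [peterprivatedata.testpeople]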
SQLDF newbie here.
I have a data frame which has about 15,000 rows and 1 column.
The data looks like:
cars
autocar
carsinfo
whatisthat
donnadrive
car
telephone
...
I wanted to use the package sqldf to loop through the column and
pick all values which contain "car" anywhere in their value.
However, the following code generates an error.
> sqldf("SELECT Keyword FROM dat WHERE Keyword="car")
Error: unexpected symbol in "sqldf("SELECT Keyword FROM dat WHERE Keyword="car"
There is no unexpected symbol, so I'm not sure what's wrong.
So first, I want to know all the values which contain 'car'.
Then I want to know only those values which contain just 'car' by itself.
Can anyone help?
EDIT:
All right, there was an unexpected symbol, but the corrected query only gives me just 'car' and not every row which contains 'car'.
> sqldf("SELECT Keyword FROM dat WHERE Keyword='car'")
Keyword
1 car
Using = will only return exact matches.
You should probably use the like operator combined with the wildcards % or _. The % wildcard will match multiple characters, while _ matches exactly one character.
Something like the following will find all instances of car, e.g. "cars", "motorcar", etc:
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
And the following will match four-character values starting with "car", such as "cars" (since _ requires exactly one character, plain "car" itself will not match):
sqldf("SELECT Keyword FROM dat WHERE Keyword like 'car_'")
This has nothing to do with sqldf; your SQL statement is the problem. You need:
dat <- data.frame(Keyword=c("cars","autocar","carsinfo",
"whatisthat","donnadrive","car","telephone"))
sqldf("SELECT Keyword FROM dat WHERE Keyword like '%car%'")
# Keyword
# 1 cars
# 2 autocar
# 3 carsinfo
# 4 car
You can also use regular expressions to do this sort of filtering. grepl returns a logical vector (TRUE / FALSE) indicating whether or not there was a match. You can get very sophisticated to match specific items, but a basic query will work in this case:
#Using #Joshua's dat data.frame
subset(dat, grepl("car", Keyword, ignore.case = TRUE))
Keyword
1 cars
2 autocar
3 carsinfo
6 car
Very similar to the solution provided by @Chase. Because we do not use subset, we do not need a logical vector and can use either grep or grepl:
df <- data.frame(keyword = c("cars", "autocar", "carsinfo", "whatisthat", "donnadrive", "car", "telephone"))
df[grep("car", df$keyword), , drop = FALSE] # or
df[grepl("car", df$keyword), , drop = FALSE]
keyword
1 cars
2 autocar
3 carsinfo
6 car
I took the idea from Selecting rows where a column has a string like 'hsa..' (partial string match)