I have data like below:
Input data
Key data
a [5,2,6,null,null]
b [5,7,9,4,null,null,null]
I want output to be like below.
Output:
Key data
a [6,2,5,null,null]
b [4,9,7,5,null,null,null]
Basically, the elements in the array need to be reversed while keeping the nulls at the end as they are.
Can someone please help me with a Spark SQL query?
My approach: transform NULLs into a sortable value, then transform them back to NULL:
select transform(sort_array(transform(data, x -> coalesce(x, 0)), False), x -> case when x=0 then null else x end) from table1
[EDIT]
I just noticed the transformations are not required if the NULLs are to be sorted to the end in reverse order; sort_array() will work by itself:
sort_array(data, False)
[EDIT 2]
Having had it pointed out that I misunderstood the question, I believe this will work... it's a little convoluted, however:
select
concat(
reverse(array_except(array(5,7,9,4,null,null,null), array(null)))
, array_repeat(null
, aggregate(
transform(array(5,7,9,4,null,null,null), (x, i) -> (case when x is null then 1 else 0 end))
, 0, (acc, x) -> acc + x
)
)
)
The approach counts the number of NULLs, removes them, reverses the array, and appends the NULLs back at the end of the array.
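For reference, here is the same count-remove-reverse-append logic as a small Python sketch (the function name is mine, not part of Spark):

```python
def reverse_keep_trailing_nulls(data):
    """Reverse the non-null elements of a list, keeping all nulls (None) at the end."""
    non_null = [x for x in data if x is not None]  # drop the nulls
    null_count = len(data) - len(non_null)         # count how many were dropped
    return non_null[::-1] + [None] * null_count    # reverse, then pad with nulls
```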
Untested:
reverse(filter(array(0, null, 2, 3, null), x -> x IS NOT NULL))
then append:
filter(array(0, null, 2, 3, null), x -> x IS NULL)
See the documentation for filter.
I'm trying to group BigQuery columns using an array like so:
with test as (
select 1 as A, 2 as B
union all
select 3, null
)
select *,
[A,B] as grouped_columns
from test
However, this won't work, since there is a NULL value in column B, row 2.
In fact, this won't work either:
select [1, null] as test_array
When reading the BigQuery documentation, though, it says NULLs should be allowed:
In BigQuery, an array is an ordered list consisting of zero or more
values of the same data type. You can construct arrays of simple data
types, such as INT64, and complex data types, such as STRUCTs. The
current exception to this is the ARRAY data type: arrays of arrays are
not supported. Arrays can include NULL values.
There doesn't seem to be any attribute or safe prefix that can be used with ARRAY() to handle NULLs.
So what is the best approach for this?
Per the documentation for the ARRAY type:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.
So, for your example, you can use the "trick" below:
with test as (
select 1 as A, 2 as B union all
select 3, null
)
select *,
array(select cast(el as int64) el
from unnest(split(translate(format('%t', t), '()', ''), ', ')) el
where el != 'NULL'
) as grouped_columns
from test t
The above gives the desired output.
Note: this approach does not require explicitly referencing all of the involved columns!
My current solution, and I'm not a fan of it, is to use a combination of IFNULL(), UNNEST() and ARRAY(), like so:
select
*,
array(
select *
from unnest(
[
ifnull(A, ''),
ifnull(B, '')
]
) as grouping
where grouping <> ''
) as grouped_columns
from test
Alternatively, you can replace the NULL values with some non-NULL value using IFNULL(null, 0), as given below:
with test as (
select 1 as A, 2 as B
union all
select 3, IFNULL(null, 0)
)
select *,
[A,B] as grouped_columns
from test
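Outside of BigQuery, the core operation here, collecting a row's column values into an array while dropping NULLs, can be sketched in plain Python (the names are illustrative):

```python
def grouped_columns(row):
    """Collect a row's column values into a list, skipping NULLs (None)."""
    return [v for v in row if v is not None]

rows = [(1, 2), (3, None)]  # mirrors the two-row test table above
grouped = [grouped_columns(r) for r in rows]  # [[1, 2], [3]]
```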
I have a table with a currency field that is currently stored as a string, e.g. "£1.5m". How can I convert the column to the equivalent numeric value, i.e. 1,500,000?
The data is in Postgres, so I could either cast it in the table or convert it using pandas. I'm currently trying in pandas, but I'd ideally like to understand how to do it either way.
I've tried using pandas to_numeric, but it is unable to parse the values.
import pandas as pd
d = {'id': [1, 2, 3, 4],
'name': ["A", "B", "C", "D"],
'assets': ["£472.96k", "£142.6m", "£500", "-£3.38m"]}
df = pd.DataFrame(data=d)
df['assets'] = pd.to_numeric(df['assets'])  # fails: to_numeric cannot parse '£472.96k'
EDIT - the code below works for pandas.
I would still be interested in the Postgres approach, though.
def convert_column(col):
    # Strip the currency symbol
    col = col.str.replace('£', '')
    # Numeric part: drop any trailing k/m suffix and cast to float; multiplier:
    # extract the k/m suffix (defaulting to 1) and map k -> 10**3, m -> 10**6
    return (col.replace(r'[km]+$', '', regex=True).astype(float) *
            col.str.extract(r'[\d\.]+([km]+)', expand=False)
               .fillna(1)
               .replace(['k', 'm'], [10**3, 10**6]).astype(int))

for col in ['assets']:
    df[col] = convert_column(df[col])
A Postgres solution is completely doable; it requires a single SQL statement. The following implements such a solution. The query assumes an array of strings for input, then shows each step (through sub-selects) used to derive the asset value:
1. Separate (unnest) the array into individual elements.
2. Discard the currency symbol (£).
3. Split out, via regular expression, the numeric value and the magnitude code (k, m).
4. Apply the magnitude code to the numeric value for the final value.
Along the way, keep the original value, and at the last step output NULL if the input was not a valid value to begin with.
with test(assets) as
( values (array ['£472.96k', '£142.6m', '£500', '-£3.38m' , 'xxx'] ) )
, exp(re) as
( values ('^(\+|-)?([0-9]*\.?[0-9]*)(m|k)?$'))
select orig_asset
, case when assets ~ re
then case when asset_mag = 'k'
then asset_val * 1000::float
when asset_mag = 'm'
then asset_val * 1000000::float
else asset_val
end
else null
end asset_value
from (select orig_asset,assets, re
, regexp_replace (assets, re,'\1\2')::float asset_val
, regexp_replace (assets, re,'\3') asset_mag
from exp cross join
( select assets orig_asset
, replace(assets,'£','') assets
from ( select unnest(assets) assets from test) a
) b
) c;
Finally, you can wrap the whole query in a SQL function that returns a table, the result of which can be used like any table in a query. See the fiddle here for an example of each step.
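The same parsing steps (strip the symbol, match the numeric part and the magnitude code, multiply, and return NULL for invalid input) can also be sketched outside the database in Python; the helper name and magnitude map here are my own:

```python
import re

# Hypothetical helper mirroring the regex-based SQL above
MAGNITUDE = {'k': 1_000, 'm': 1_000_000}

def parse_currency(s):
    """Parse strings like '£472.96k' or '-£3.38m'; return None for invalid input."""
    m = re.fullmatch(r'(\+|-)?£?([0-9]*\.?[0-9]+)([km])?', s)
    if m is None:
        return None                      # not a valid value: output "NULL"
    sign = -1.0 if m.group(1) == '-' else 1.0
    return sign * float(m.group(2)) * MAGNITUDE.get(m.group(3), 1)
```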
Although I have seen UPDATE statements that update a field based on existing values, I could not find anything similar to this scenario:
Suppose you have a table with only one column of number(4) type. The value in the first record is 1010.
create table stab(
nmbr number(4)
);
insert into stab values(1010);
For each digit
When the digit is 1 -- add 3 to the digit
When the digit is 0 -- add 4 to the digit
end
This operation needs to be completed in a single statement without using PL/SQL.
I think the substr function needs to be used, but I don't know how to go about completing this.
Thanks in advance.
SELECT DECODE(SUBSTR(nmbr,1,1), '1', 1 + 3, '0', 0 + 4) AS Decoded_Nmbr
FROM stab
ORDER BY Decoded_Nmbr
Is that what you are after?
So, it seems you need to convert every 0 and 1 to a 4, and leave all the other digits alone. This seems like a string operation (and the reference to "digits" itself suggests the same thing). So, convert the number to a string, use the Oracle TRANSLATE function (see the documentation), and convert back to number.
update stab
set nmbr = to_number(translate(to_char(nmbr, '9999'), '01', '44'))
;
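The TRANSLATE call maps characters one-for-one ('0' → '4' and '1' → '4', since 0 + 4 = 4 and 1 + 3 = 4), which can be mirrored with Python's str.translate as a quick sanity check (a sketch, not Oracle itself):

```python
# Character map: every '0' and '1' becomes '4'; other digits pass through
TABLE = str.maketrans('01', '44')

def transform_digits(n):
    """Apply the 0 -> 4, 1 -> 4 digit mapping to a whole number."""
    return int(str(n).translate(TABLE))
```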
Assuming it's always a 4-digit number, you could use substring like below:
-- postgres SQL example
SELECT CASE
WHEN a = 0 THEN a + 4
ELSE a + 3
end AS a,
CASE
WHEN b = 0 THEN b + 4
ELSE b + 3
end AS b,
CASE
WHEN c = 0 THEN c + 4
ELSE c + 3
end AS c,
CASE
WHEN d = 0 THEN d + 4
ELSE d + 3
end AS d
FROM ( SELECT Substr( '1010', 1, 1 ) :: INT AS a,
Substr( '1010', 2, 1 ) :: INT b,
Substr( '1010', 3, 1 ) :: INT c,
Substr( '1010', 4, 1 ) :: INT d )a
Another option (tried in PostgreSQL) may be to split the number into rows using regexp_split_to_table, add to each individual digit based on the CASE statement, and then concatenate the digits back into a string:
SELECT array_to_string ( array
(
select
case
WHEN val = 0 THEN val +4
ELSE val +3
END
FROM (
SELECT regexp_split_to_table ( '101010','' ) ::INT val
) a
) ,'' )
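That split/transform/join pipeline can be mirrored in plain Python (the helper name is mine):

```python
def add_per_digit(n):
    """Split into digits, add 4 to each 0 and 3 to every other digit, rejoin."""
    return int(''.join(str(int(c) + (4 if c == '0' else 3)) for c in str(n)))
```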
My answer to the interview question would have been that the DB design violates the rules of normalization (i.e. a bad design) and would not have this kind of "update anomaly" if it were properly designed. Having said that, it can easily be done with an expression using various combinations of single-row functions combined with the required arithmetic operations.
I need to examine ACCT_NUM values in TABLE_1. If the ACCT_NUM is prefixed by "GF0", then I need to disregard the "GF0" prefix and take the rightmost 7 characters of the remaining string. If the resulting value is not found in account_x_master or CW_CLIENT_STAGE, then the record is to be flagged as an error.
The following seems to do the trick, but I have a concern...
UPDATE
table_1
SET
Error_Ind = 'GW001'
WHERE
LEFT(ACCT_NUM, 3) = 'GF0'
AND RIGHT(SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3), 7) NOT IN
(
SELECT
acct_num
FROM
account_x_master
)
AND RIGHT(SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3), 7) NOT IN
(
SELECT
CW_CLIENT_STAGE.AGS_NUM
FROM
dbo.CW_CLIENT_STAGE
)
My concern is that SQL Server may attempt to perform a SUBSTRING operation
SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3)
that results in a computed negative length and causes the SQL to fail. Of course, this wouldn't fail if the SUBSTRING operation were only applied to those records that are at least 3 characters long, which would always be the case if the
LEFT(ACCT_NUM, 3) = 'GF0'
were applied first. If possible, I'd like to avoid adding new columns to the table. Bonus points for simplicity and less overhead :-)
How can I rewrite this UPDATE SQL to protect against this?
As other people said, your concern is valid.
I'd make two changes to your query.
1) To avoid having a negative value in the SUBSTRING parameter, we can rewrite it using STUFF:
SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3)
is equivalent to:
STUFF(ACCT_NUM, 1, 3, '')
Instead of extracting the tail of the string, we replace the first three characters with an empty string. If the string is shorter than 3 characters, the result is an empty string.
By the way, if your ACCT_NUM may end with space(s), they will be trimmed by the SUBSTRING version, because LEN doesn't count trailing spaces.
2) Instead of
LEFT(ACCT_NUM, 3) = 'GF0'
use:
ACCT_NUM LIKE 'GF0%'
If you have an index on ACCT_NUM and only a relatively small number of rows start with GF0, then the index will be used. If you use a function such as LEFT, the index can't be used.
So, the final query becomes:
UPDATE
table_1
SET
Error_Ind = 'GW001'
WHERE
ACCT_NUM LIKE 'GF0%'
AND RIGHT(STUFF(ACCT_NUM, 1, 3, ''), 7) NOT IN
(
SELECT
acct_num
FROM
account_x_master
)
AND RIGHT(STUFF(ACCT_NUM, 1, 3, ''), 7) NOT IN
(
SELECT
CW_CLIENT_STAGE.AGS_NUM
FROM
dbo.CW_CLIENT_STAGE
)
You have a very valid concern, because SQL Server will rearrange the order of evaluation of expressions in the WHERE.
The only way to guarantee the order of operations in a SQL statement is to use case. I don't think there is a way to catch failing calls to substring() . . . there is no try_substring() analogous to try_convert().
So:
WHERE
LEFT(ACCT_NUM, 3) = 'GF0' AND
(CASE WHEN LEN(ACCT_NUM) > 3 THEN RIGHT(SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3), 7) END) NOT IN (SELECT acct_num
FROM account_x_master
) AND
(CASE WHEN LEN(ACCT_NUM) > 3 THEN RIGHT(SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3), 7) END) NOT IN (SELECT CW_CLIENT_STAGE.AGS_NUM
FROM dbo.CW_CLIENT_STAGE
)
This is uglier, and there may be ways around it, say by using LIKE with wildcards rather than string manipulation. But the CASE will guarantee that the SUBSTRING() is only run on strings long enough that no error is generated.
Please try the query below.
Since there is no short-circuiting of AND/OR in a SQL WHERE clause, the only way to achieve this is via the CASE syntax.
I noticed that you had two NOT IN comparisons in different parts of the WHERE, which I have combined into one.
Note that the CASE condition is >= 3 and not > 3, as RIGHT('', x) is allowed.
Also note the proper use of CASE with NOT IN.
UPDATE table_1
SET
Error_Ind = 'GW001'
WHERE
LEFT(ACCT_NUM, 3) = 'GF0'
AND CASE
WHEN LEN(ACCT_NUM)>=3
THEN RIGHT(SUBSTRING(ACCT_NUM, 4, LEN(ACCT_NUM) - 3), 7)
ELSE NULL END NOT IN
(
SELECT acct_num as num
FROM account_x_master
UNION
SELECT CW_CLIENT_STAGE.AGS_NUM as num
FROM dbo.CW_CLIENT_STAGE
)
I have a list of strings:
HEAWAMFWSP
TLHHHAFWSP
AWAMFWHHAW
AUAWAMHHHA
Each of these strings represent 5 pairs of 2 character combinations (i.e. HE AW AM FW SP)
What I am looking to do in SQL is to display all strings that have duplication in the pairs.
Take string number 3 from above; AW AM FW HH AW. I need to display this record because it has a duplicate pair (AW).
Is this possible?
Thanks!
Given the current requirements, yes, this is doable. Here's a version which uses a recursive CTE (the text may need to be adjusted for vendor idiosyncrasies), written and tested on DB2. Please note that this will return multiple rows if there are more than 2 instances of a pair in a string, or more than 1 set of duplicates.
WITH RECURSIVE Pair (rowid, start, pair, text) as (
SELECT id, 1, SUBSTR(text, 1, 2), text
FROM SourceTable
UNION ALL
SELECT rowid, start + 2, SUBSTR(text, start + 2, 2), text
FROM Pair
WHERE start < LENGTH(text) - 1)
SELECT Pair.rowid, Pair.pair, Pair.start, Duplicate.start, Pair.text
FROM Pair
JOIN Pair as Duplicate
ON Duplicate.rowid = Pair.rowid
AND Duplicate.pair = Pair.pair
AND Duplicate.start > Pair.start
Here's a not very elegant solution, but it works and only returns the row once no matter how many duplicate matches there are. The substring function here is for SQL Server; not sure what it is for Oracle.
select ID, Value
from MyTable
where (substring(Value,1,2) = substring(Value,3,2)
or substring(Value,1,2) = substring(Value,5,2)
or substring(Value,1,2) = substring(Value,7,2)
or substring(Value,1,2) = substring(Value,9,2)
or substring(Value,3,2) = substring(Value,5,2)
or substring(Value,3,2) = substring(Value,7,2)
or substring(Value,3,2) = substring(Value,9,2)
or substring(Value,5,2) = substring(Value,7,2)
or substring(Value,5,2) = substring(Value,9,2)
or substring(Value,7,2) = substring(Value,9,2))
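For reference, the pair-duplication check is straightforward to express in plain Python (the names are mine):

```python
def has_duplicate_pair(s):
    """Split a string into 2-character pairs and report whether any pair repeats."""
    pairs = [s[i:i + 2] for i in range(0, len(s), 2)]
    return len(set(pairs)) < len(pairs)

strings = ['HEAWAMFWSP', 'TLHHHAFWSP', 'AWAMFWHHAW', 'AUAWAMHHHA']
duplicated = [s for s in strings if has_duplicate_pair(s)]  # ['AWAMFWHHAW']
```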