Pull specific data from broken JSON in a database, using Redshift SQL

I have a source table where one important column contains broken JSON. Below is a sample of the data:
event_properties
"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"
"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"
"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"
"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"
Each line represents one record. Is there a way to extract the voucher_id and voucher_name information, given that there is more than one variation in the data?
The goal is to extract the voucher ID and voucher name like this:
voucher_id voucher_name
684883298 voucher 1
712001960 voucher 2
638584138 voucher 1
642124374 voucher 3
I'm using Redshift.

You can try this:
First, verify whether the JSON is valid with is_valid_json().
If it is not, inspect what is required to make it valid; in this case, the leading and trailing " must be removed.
Use trim() to strip the redundant " characters.
Then use json_extract_path_text() to obtain the values.
SQL:
with json_data as (
    select '"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"'::text as j union
    select '"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"'::text union
    select '"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"'::text union
    select '"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"'::text
)
select j,
       is_valid_json(j) as valid_before,
       trim('"' from j) as j_trimmed,
       is_valid_json(j_trimmed) as valid_after,
       json_extract_path_text(j_trimmed, 'voucher_id') as voucher_id
from json_data;
This yields the voucher_id values. Use the same method to get the other keys' values.
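Outside the database, the same repair can be sketched in plain Python (the sample value mirrors one stored row; whether the backslashes are literally present in your stored data is an assumption, so the unescaping step may or may not be needed):

```python
import json

# One stored event_properties value: a JSON object wrapped in an extra pair
# of quotes, with the inner quotes backslash-escaped (Python escaping shown).
raw = '"{\\"source\\":\\"barcode\\",\\"voucher_id\\":684883298,\\"voucher_name\\":\\"voucher 1\\"}"'

# Mirror of trim('"' from j): strip the outer quotes, then unescape \" so a
# strict JSON parser accepts the result.
repaired = raw.strip('"').replace('\\"', '"')
obj = json.loads(repaired)
print(obj["voucher_id"], obj["voucher_name"])
```

In the Redshift answer above, trimming the outer quotes alone was enough for is_valid_json() to pass; inspect your actual stored bytes before deciding whether the unescape step applies.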


Multiple Between Dates from table column

There is yearly data in the source table. I need to exclude from it the data that is in another table, whose row count is not static.
Source data:
Dates to be excluded:
There can be 2 rows or 5 rows of data to be excluded, so it needs to be dynamic; the two tables can be joined on the DISPLAY_NAME column.
I am trying to do it with a plain query and don't want to use a stored procedure. Is there a way, or is a stored procedure the only choice?
Maybe a CASE WHEN per excluded row, returning 1/0, and only keep a source row if all the new CASE WHEN columns are 1; but I don't know how many CASE WHEN expressions I would need, since the exclude table's row count is not static.
Are you looking for not exists?
select s.*
from source s
where not exists (select 1
                  from excluded e
                  where e.display_name = s.display_name and
                        s.start_datetime >= e.start_date and
                        s.end_datetime < e.end_date
                 );
Note: Your question does not explain how the end_date should be handled. This assumes that the data on that date should be included in the result set. You can tweak the logic to exclude data from that date as well.
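As a sanity check, here is the same anti-join pattern run against SQLite from Python. The schema and sample dates are assumptions, since the question does not show its data:

```python
import sqlite3

# Minimal in-memory tables mirroring the NOT EXISTS query above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source (display_name TEXT, start_datetime TEXT, end_datetime TEXT);
    CREATE TABLE excluded (display_name TEXT, start_date TEXT, end_date TEXT);
    INSERT INTO source VALUES
        ('A', '2023-01-10', '2023-01-11'),
        ('A', '2023-03-05', '2023-03-06'),
        ('B', '2023-01-10', '2023-01-11');
    INSERT INTO excluded VALUES ('A', '2023-01-01', '2023-02-01');
""")

# Keep a source row only if no exclusion row with the same display_name
# covers its date range.
rows = con.execute("""
    SELECT s.* FROM source s
    WHERE NOT EXISTS (SELECT 1 FROM excluded e
                      WHERE e.display_name = s.display_name
                        AND s.start_datetime >= e.start_date
                        AND s.end_datetime  <  e.end_date)
""").fetchall()
print(rows)
```

A's January row is removed by the single exclusion row; A's March row and B's row survive, however many exclusion rows there are, which is why no per-row CASE WHEN is needed.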

Find a match between 2 columns with only the first part of a multi-digit integer

I'm writing a search function in Python 3 that looks into an SQLite3 database.
I have a table called 'numberseries' with various columns, among them customername, numbers_from, numbers_to, and total.
The numbers_from and numbers_to fields contain the starting point and end point of a series of numbers.
For instance:
"Customer XYX" has numbers in the range from 80201110 to 80201129.
Those 2 numbers are entered into numbers_from and numbers_to respectively.
The total column contains the numerical difference between the entries, in this case 20.
I've made a simple query that returns the row based on:
Select * from numberseries where :value BETWEEN numbers_from and numbers_to;
This works fine with a full 8-digit number.
However, the users want a function that returns the same data when they enter only the first 5 digits of the number.
For instance, if they enter 80201, it should return all rows whose range between numbers_from and numbers_to contains any number with 80201 as its first 5 digits.
It's not enough to apply a LIKE to the numbers_from column alone, as the total might span 1000+ numbers.
I hope this makes sense.
I'm by no means an SQL expert and I'm having a hard time with this problem.
UPDATE:
I got a request for clarification.
The sample data is a table with a number of rows; for simplicity it has only the following columns:
id
customer name
numbers_from
numbers_to
total
A sample row could be:
1, "the best customer", 80201110, 80201149, 40
The input could be 8020113 (last digit missing), and I would still need to match the example row, as 8020113x is in the range between 80201110 and 80201149.
Based on your last comment on the other answer, what you actually want is this:
select * from numberseries
where :value in (
substr(numbers_from, 1, length(:value)),
substr(numbers_to, 1, length(:value))
)
or:
select * from numberseries
where :value BETWEEN substr(numbers_from, 1, length(:value)) and numbers_to || ''
Assuming that the numbers in the data are always 8 digits, you can treat the input as a string prefix: pad it with zeros to get the smallest number with that prefix and with nines to get the largest, then test whether that span overlaps the stored range:
where cast(substr(:value || '00000000', 1, 8) as int) <= numbers_to
  and cast(substr(:value || '99999999', 1, 8) as int) >= numbers_from;
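A quick way to check the pad-and-overlap idea is to run it against SQLite from Python, assuming 8-digit numbers; the sample row is the one from the question's update:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE numberseries "
            "(id INTEGER, customername TEXT, numbers_from INTEGER, "
            " numbers_to INTEGER, total INTEGER)")
con.execute("INSERT INTO numberseries VALUES "
            "(1, 'the best customer', 80201110, 80201149, 40)")

def find(value):
    pad = 8 - len(value)
    lo = int(value + "0" * pad)   # smallest 8-digit number with this prefix
    hi = int(value + "9" * pad)   # largest 8-digit number with this prefix
    # A row matches if its stored range overlaps [lo, hi].
    return con.execute(
        "SELECT id FROM numberseries WHERE ? <= numbers_to AND ? >= numbers_from",
        (lo, hi)).fetchall()

print(find("8020113"))   # 8020113x overlaps 80201110..80201149
print(find("80202"))     # no overlap with the stored range
```

A full 8-digit input degenerates to lo == hi, so the same query also covers the original exact BETWEEN lookup.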

Is there a way to check if any items in a string array are in a string in Snowflake/Redshift?

I am looking for a way to check if a string contains any words in another field which is a single string that holds a list of items. Something like this...
id items (STRING)
1 burger;hotdog
I have a second dataset that might look like...
transaction_id description amount
10 cheeseburger 10
Now I need to grab the amount if the description matches any item in the first table; in this case it matches on the string burger. However, I can't seem to get the SQL right: with LIKE ANY in Snowflake I would need to pass ('%burger%', '%hotdog%'), which are two separate string literals, and I can't make explicit calls since each id/items permutation may be different in the first table. In Redshift, when I try to use
CASE WHEN lower(t.description) SIMILAR TO '%(' || replace(items,';','|') || ')%' then amount END
I get the following error: Specified types or functions (one per INFO message) not supported on Redshift tables.
Thanks in advance!
If you're after a Snowflake answer:
WITH keys AS (
    SELECT * FROM VALUES (1, 'burger;hotdog') a(id, items)
), data AS (
    SELECT * FROM VALUES (10, 'cheeseburger', 10) b(transaction_id, description, amount)
), seq_keys AS (
    SELECT s.seq_id, f.value AS key
    FROM (
        SELECT seq8() AS seq_id, k.*
        FROM keys AS k
    ) AS s
    , LATERAL FLATTEN(input => split(s.items, ';')) f
)
SELECT d.*, sk.*
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%' || sk.key || '%'
gives:
TRANSACTION_ID DESCRIPTION AMOUNT SEQ_ID KEY
10 cheeseburger 10 0 "burger"
If you then take DISTINCT on the SEQ_ID, you can de-dupe cases where multiple keys match. I would also be inclined to add an id to the data table.
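The split-then-substring-match logic is easy to sanity-check outside the warehouse. A plain-Python sketch of the same join, using the sample rows from the question:

```python
# keys: (id, items) where items is a ';'-separated list, as in the question.
keys = [(1, "burger;hotdog")]
# data: (transaction_id, description, amount)
data = [(10, "cheeseburger", 10)]

matches = [
    (txn_id, desc, amount, key_id, item)
    for key_id, items in keys
    for item in items.split(";")       # analogue of LATERAL FLATTEN(split(items, ';'))
    for txn_id, desc, amount in data
    if item.lower() in desc.lower()    # analogue of ILIKE '%' || key || '%'
]
print(matches)
```

'burger' is a substring of 'cheeseburger', so the transaction's amount comes along; 'hotdog' produces no row.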

SQL command(s) to transform data

For the SQL language gurus, a challenge; hopefully not too hard. If I have data that contains an asset identifier followed by 200 data elements for that asset, what SQL snippet would transform that to a vertical format?
Current:
Column names:
Asset ID, Column Header 1, Column Header 2, ... Column Header "n"
Data Row:
abc123, 1234, 2345, 3456, ...
Desired:
Asset ID, Column Header 1, 1234
Asset ID, Column Header 2, 2345
Asset ID, Column Header 3, 3456
...
Asset ID, Column Header n, 9876
The SQL implementation that I am using (DashDB based on DB2 in Bluemix) does not support a "pivot" command. And I would like the code snippet to work unchanged if column headers are changed, or additional columns are added to the "current" data format. I.e. I would prefer not to hard code to a fixed list of columns.
What do you think? Can it be done with an SQL code snippet?
Thanks!
You can do this by composing a pivoted table for each row and performing a Cartesian product between the source table and the composed table:
SELECT assetId, colname, colvalue
FROM yourtable T,
     TABLE(VALUES ('ColumnHeader1', T.ColumnHeader1),
                  ('ColumnHeader2', T.ColumnHeader2),
                  ('ColumnHeader3', T.ColumnHeader3),
                  ...
                  ('ColumnHeaderN', T.ColumnHeaderN)
     ) AS pivot(colname, colvalue);
This requires only a single scan of yourtable, so it is quite efficient.
The canonical way is union all:
select assetId, 'ColumnHeader1' as colname, ColumnHeader1 as value from t union all
select assetId, 'ColumnHeader2' as colname, ColumnHeader2 as value from t union all
. . .
There are other methods but this is usually the simplest to code. It will require reading the table once for each column, which could be an issue.
Note: You can construct such a query using a spreadsheet and formulas. Or, even construct it using another SQL query.
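The transform both queries produce is simple to state outside SQL; here is a sketch in plain Python on the question's sample row (the column names are assumed from the example):

```python
# One wide row: asset id followed by its data elements, plus the headers.
headers = ["ColumnHeader1", "ColumnHeader2", "ColumnHeader3"]
row = ("abc123", 1234, 2345, 3456)

# Emit one (asset_id, column_name, value) triple per data column.
asset_id, *values = row
unpivoted = [(asset_id, name, value) for name, value in zip(headers, values)]
print(unpivoted)
```

Because the headers drive the output, nothing changes when columns are renamed or added, which is the "no hard-coded column list" property the asker wants; in SQL, generating the query text from the catalog's column list achieves the same effect.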

How to find the next sequence number in oracle string field

I have a database table with document names stored as a VARCHAR, and I need a way to figure out the lowest available sequence number. There are many gaps.
name partial seq
A-B-C-0001 A-B-C- 0001
A-B-C-0017 A-B-C- 0017
In the above example, it would be 0002.
The distinct name values total 227,705. The number of partial combinations is quite large (A=150, B=218, C=52), so 1,700,400 potential combinations.
I found a way to iterate from min to max per distinct value and list all the missing (i.e. available) values, but this seems inefficient, given we are using nowhere close to the maximum potential partial combinations (10,536 out of 1,700,400).
I'd rather have a table, based on existing data, with each partial value and its next available sequence value; a non-existent partial means 0001.
Thanks
Hmmmm, you can try this:
select coalesce(min(to_number(seq)), 0) + 1
from t
where partial = 'A-B-C-' and
      not exists (select 1
                  from t t2
                  where t2.partial = t.partial and
                        to_number(t2.seq) = to_number(t.seq) + 1
                 );
EDIT:
For all partials you need a group by. You can use to_char() to convert the result back to a character value, if necessary.
select partial, coalesce(min(to_number(seq)), 0) + 1
from t
where not exists (select 1
                  from t t2
                  where t2.partial = t.partial and
                        to_number(t2.seq) = to_number(t.seq) + 1
                 )
group by partial;
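The gap-finding logic is portable; here is the grouped query run against SQLite from Python, using the question's two sample rows (CAST stands in for Oracle's to_number):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (partial TEXT, seq TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("A-B-C-", "0001"), ("A-B-C-", "0017")])

# For each partial: among rows whose successor seq + 1 is missing, the
# smallest seq marks the start of the first gap, so min(seq) + 1 is the
# lowest available sequence number.
rows = con.execute("""
    SELECT partial, MIN(CAST(seq AS INTEGER)) + 1
    FROM t
    WHERE NOT EXISTS (SELECT 1 FROM t t2
                      WHERE t2.partial = t.partial
                        AND CAST(t2.seq AS INTEGER) = CAST(t.seq AS INTEGER) + 1)
    GROUP BY partial
""").fetchall()
print(rows)
```

With rows 0001 and 0017, the row 0001 has no successor 0002, so the answer for A-B-C- is 2, matching the expected output in the question.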