Extracting string and converting columns to rows in SQL (Redshift)

I have a column called "Description" in a table called "Food" which includes multiple food item names delimited by commas, such as chicken, soup, bread, coke.
How can I extract each item from the column and create multiple rows?
e.g. Currently it's like
{FoodID, FoodName, Description} ==> {123, Meal, "chicken, soup, bread, coke"}
and what I need is
{FoodID, FoodName, Description} ==> {123, Meal, chicken},
{123, Meal, soup},
{123, Meal, bread} etc.
In Redshift, I first did a split of the "Description" column as
select FoodID, FoodName, Description,
SPLIT_PART(Description, ',', 1) AS Item1,
SPLIT_PART(Description, ',', 2) AS Item2,
SPLIT_PART(Description, ',', 3) AS Item3, ... up to Item10
FROM Food
Consider that a maximum of 10 items can be there, hence Item10.
What's the best method to convert these columns Item1 to Item10 into rows? I tried UNION ALL, but it's taking a long time given the huge volume of data.
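(For illustration, the UNION ALL approach was presumably something along these lines; note that it scans Food once per item slot, which is likely why it is slow on a large table.)
select FoodID, FoodName, SPLIT_PART(Description, ',', 1) AS Item FROM Food
UNION ALL
select FoodID, FoodName, SPLIT_PART(Description, ',', 2) FROM Food
UNION ALL
-- ... repeated for each slot up to 10
select FoodID, FoodName, SPLIT_PART(Description, ',', 10) FROM Food;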

Your question is answered in detail elsewhere specifically for Redshift: split the delimited string against a numbers table. You just need to map your queries to the example queries provided there.
Your query will be something like below.
select (row_number() over (order by true))::int as n into numbers from food limit 100;
This will create the numbers table (it draws row numbers from food, so food needs at least 100 rows to yield numbers 1 through 100).
Your query would become:
select foodId, foodName, split_part(Description,',',n) as descriptions from food cross join numbers where split_part(Description,',',n) is not null and split_part(Description,',',n) != '';
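Since the example description has a space after each comma, you will probably want to trim the extracted items; a small variation of the same query (assuming Redshift's BTRIM function):
select foodId, foodName, btrim(split_part(Description, ',', n)) as descriptions
from food cross join numbers
where split_part(Description, ',', n) != '';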
Now, coming back to your original question about performance:
"it's taking a longer time considering huge load of data."
Considering the typical data warehouse use case of heavy reads and infrequent writes, you should keep the raw food data you mentioned in a staging table, say stg_food.
You should use the following kind of query to do a one-time insert into the actual food table, something like below.
insert into food select foodId, foodName, split_part(Description,',',n) as descriptions from stg_food cross join numbers where split_part(Description,',',n) is not null and split_part(Description,',',n) != '';
This writes the data once and makes your select queries faster.

Related

Creating a view that contains all records from one table that match the comma-separated field content in another table

I have two tables au_postcodes and groups.
Table groups contains a field called PostCodeFootPrint
that contains the postcode set making up the footprint.
Table au_postcodes contains a field called poa_code that
contains a single postcode.
The records in groups.PostCodeFootPrint look like:
PostCodeFootPrint
2529,2530,2533,2534,2535,2536,2537,2538,2539,2540,2541,2575,2576,2577,2580
2640
3844
2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2079, 2080, 2081, 2082, 2083, 2119, 2120, 2126, 2158, 2159
2848, 2849, 2850, 2852
Some records have only one postcode, some have multiple separated by a "," or ", " (comma and space).
The records in au_postcode.poa_code look like:
poa_code
2090
2092
2093
829
830
836
2080
2081
Single postcode (always).
The objective is to:
Get all records from au_postcode where the poa_code appears in groups.PostCodeFootPrint, into a view.
I tried:
SELECT
au_postcodes.poa_code,
groups."NameOfGroup"
FROM
groups,
au_postcodes
WHERE
groups."PostcodeFootprint" LIKE '%au_postcodes.poa_code%'
But no luck
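(The likely reason this returns nothing: the column reference sits inside the quoted string, so LIKE compares against the literal text 'au_postcodes.poa_code' rather than the value of poa_code. A hedged sketch of a LIKE-based version that concatenates the value in, wrapping both sides in commas to avoid partial matches:
SELECT p.poa_code, g."NameOfGroup"
FROM groups g
JOIN au_postcodes p
  ON ',' || replace(g."PostcodeFootprint", ' ', '') || ',' LIKE '%,' || p.poa_code || ',%';
)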
You can use regex for this. Take a look at this fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=739592ef262231722d783670b46bd7fa
Where I form a regex from the poa_code and the word boundary (to avoid partial matches) and compare that to the PostCodeFootPrint.
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on g.PostCodeFootPrint ~ concat('\y', p.poa_code, '\y')
Depending on your data, this may be performant enough. I also believe that in Postgres you have access to the array data type, and so it might be better to store the post code lists as arrays.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=ae24683952cb2b0f3832113375fbb55b
Here I stored the post code lists as arrays, then used ANY to join with.
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on p.poa_code = any(g.PostCodeFootPrint);
In these two fiddles I use explain to show the cost of the queries, and while the array solution is more expensive, I imagine it might be easier to maintain.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7f16676825e10625b90eb62e8018d78e
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=e96e0fc463f46a7c467421b47683f42f
I changed the underlying data type to integer in this fiddle, expecting it to reduce the cost, but it didn't, which seems strange to me.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=521d6a7d0eb4c45471263214186e537e
It is possible to reduce the query cost with the # operator (see the last query here: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=edc9b07e9b22ee72f856e9234dbec4ba):
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on (g.PostCodeFootPrint # p.poa_code) > 0;
but it is still more expensive than the regex. However, I think you can probably rearrange the way the tables are set up and radically change performance. See the first and second queries in the fiddle, where I take each post code in the footprint and insert it as a row in a table, along with an identifier for the group it was in:
select p.poa_code, g.which
from groups2 g
join au_postcode p
on g.footprint = p.poa_code;
The explain plan for this indicates that query cost drops significantly (from 60752.50 to 517.20, or two orders of magnitude) and the execution times go from 0.487 to 0.070. So it might be worth looking into changing the table structure.
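For reference, a groups2 table of that shape could be populated from the existing data with unnest; a sketch, assuming the group is identified by a NameOfGroup column as in the question (quote identifiers as needed for your actual schema):
CREATE TABLE groups2 AS
SELECT g.NameOfGroup AS which,
       trim(unnest(string_to_array(g.PostCodeFootPrint, ','))) AS footprint
FROM groups g;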
Since the values of PostCodeFootPrint are separated by a common character, you can easily create an array out of it. From there, use unnest to convert the array elements to records, and then join them with au_postcode:
SELECT * FROM au_postcode au
JOIN (SELECT trim(unnest(string_to_array(PostCodeFootPrint,',')))
FROM groups) fp (PostCodeFootPrint) ON fp.PostCodeFootPrint = au.poa_code;
Demo: db<>fiddle
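Since the stated objective is a view, the same query can simply be wrapped in a view definition (the view name here is hypothetical):
CREATE VIEW au_postcode_groups AS
SELECT au.poa_code, fp.PostCodeFootPrint
FROM au_postcode au
JOIN (SELECT trim(unnest(string_to_array(PostCodeFootPrint, ','))) AS PostCodeFootPrint
      FROM groups) fp ON fp.PostCodeFootPrint = au.poa_code;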

How to query multiple STRUCTs in BigQuery in wildcard-like way

I am struggling to query multiple STRUCTs which share the same record fields.
Let me show you what the table looks like.
Table with multiple STRUCTs sharing the same record fields
Each of the mango, melon, apple, banana STRUCTs (RECORDs) shares the same fields: qty, price.
Now I want to query them at once, like "Show me the qty > 5."
Is there any wildcard-like way to do this? Maybe something like SELECT %.qty > 5. Of course that is invalid (just an example).
I know that the best way is to change the schema to something like fruit, fruit.qty, fruit.price and put mango and the others into the fruit field as data, rather than keeping each one as its own field.
However, for some reasons I want to keep that schema and query multiple RECORDs at once. Is that possible?
Thank you.
Consider below approach
with temp as (
select
trim(fruit, '"') as fruit,
cast(json_extract(info, '$.qty') as int64) as qty,
cast(json_extract(info, '$.price') as float64) as price
from your_table t,
unnest(split(trim(to_json_string(t), '{}'), '},')) record,
unnest([struct(
split(record, ':{')[offset(0)] as fruit,
'{' || split(record, ':{')[offset(1)] || '}' as info)
])
)
select *
from temp
where qty > 5
If applied to sample data like in your question, the output contains only the fruits with qty > 5.
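For a quick self-contained test of the approach, the sample data can be faked inline; the struct names and values below are assumptions matching the described schema, not the real table. With these made-up values, only melon and apple (qty > 5) come out.
with your_table as (
  select
    struct(3 as qty, 1.25 as price) as mango,
    struct(7 as qty, 2.50 as price) as melon,
    struct(9 as qty, 0.75 as price) as apple,
    struct(2 as qty, 1.10 as price) as banana
),
temp as (
  select
    trim(fruit, '"') as fruit,
    cast(json_extract(info, '$.qty') as int64) as qty,
    cast(json_extract(info, '$.price') as float64) as price
  from your_table t,
  unnest(split(trim(to_json_string(t), '{}'), '},')) record,
  unnest([struct(
    split(record, ':{')[offset(0)] as fruit,
    '{' || split(record, ':{')[offset(1)] || '}' as info)
  ])
)
select *
from temp
where qty > 5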

single CASE for multiple columns data

Items with quantity and price are queried from SQL Server for Excel and Crystal Reports. The quantity and price are bulk quantities (pounds). I need to convert them to bag quantity and bag price. The pounds-per-bag data is not in SQL Server and is different for each item. I have only around 20 items. I cannot create a permanent or temporary table in SQL Server to store the pounds-per-bag data. I can use CASE in SQL to calculate the bag quantity and price, but that needs two CASEs. How can I use one CASE, or another method, to simplify the SQL and keep it easy to maintain? My current SQL:
SELECT Item, Quantity, Price,
CASE Item
WHEN 'Item1' THEN Quantity/32
WHEN 'Item2' THEN Quantity/33
...
ELSE Quantity
END AS QtyPerBag,
CASE Item
WHEN 'Item1' THEN Price*32
WHEN 'Item2' THEN Price*33
...
ELSE Price
END AS PricePerBag
FROM MasterTable
DhruvJoshi's approach is a good one. Using the VALUES() constructor, it is even simpler:
SELECT mt.Item, mt.Quantity, mt.Price,
mt.Quantity / factors.factor AS QtyPerBag,
mt.Price * factors.factor AS PricePerBag
FROM MasterTable mt LEFT JOIN
(VALUES ('Item1', 32), ('Item2', 33)
) factors(item, factor)
ON factors.item = mt.item;
Note: If your quantity is stored as an integer, then you should use decimal points for the factors (unless you want integer division).
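One small difference from the original CASE version: with a LEFT JOIN, items that have no matching factor get NULL instead of falling back to the raw Quantity/Price. A sketch combining that fallback (COALESCE defaulting to 1.0) with decimal factors:
SELECT mt.Item, mt.Quantity, mt.Price,
       mt.Quantity / COALESCE(factors.factor, 1.0) AS QtyPerBag,
       mt.Price * COALESCE(factors.factor, 1.0) AS PricePerBag
FROM MasterTable mt LEFT JOIN
     (VALUES ('Item1', 32.0), ('Item2', 33.0)
     ) factors(item, factor)
     ON factors.item = mt.item;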
As already pointed out in comments you should consider using table/set approach.
One way if you want to do that inside of a query is like below:
SELECT Item, Quantity, Price, Quantity/Factor AS QtyPerBag,
Price * Factor AS PricePerBag
FROM MasterTable LEFT JOIN
(
SELECT 'Item1' as Item, 32 as Factor
UNION
SELECT 'Item2' as Item, 33 as Factor
-- and so on ...
) T
ON T.item=masterTable.item

Splitting pairs from Column Data in SQL Server

I have a solution in Java for a simply defined problem, but I want to improve the time needed to execute the data handling. The problem is to take a series of words held in a column on a relational database and split the words into pairs, which are then inserted into a dictionary of pairs. The pairs themselves relate to a product identified by partid.
Hence the Part table has
PartID (int), PartDesc (nvarchar)
and the dictionary has
DictID (int), WordPair (nvarchar).
The logic is therefore:
insert into DictPair (wordpair, partid)
select wordpairs, partid from Part
A word pair is defined as two adjacent words, so words will be repeated, e.g.
red car with 4 wheel drive
will pair to
{red, car},{car, with}, {with,4}, {4, wheel}, {wheel, drive}
Hence the final dictionary for say partid 45 will have (partid, dictionarypair):
45, red car
45, car with
45, with 4
45, 4 wheel
45, wheel drive
This is used in product classification and hence word order matters (but pair order does not matter).
Has anyone any ideas on how to solve this? I was thinking in terms of stored procedures, and using some kind of parsing. I want the entire solution to be implemented in SQL for efficiency reasons.
Basically, find a split() function on the web that returns the position of a word in a string.
Then do:
select s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
from parts p outer apply
dbo.split(p.partDesc, ' ') as s(word, pos);
This will put NULL for the last pair, which you don't seem to want. So:
insert into DictPair (wordpair, partid)
select word + ' ' + nextword, partid
from (select p.*, s.word,
             lead(s.word) over (partition by p.partId order by s.pos) as nextword
      from parts p outer apply
           dbo.split(p.partDesc, ' ') as s(word, pos)
     ) t
where nextword is not null;
Several split() implementations that return word positions can be found by searching for "SQL Server split"; StackOverflow and many other sites have examples.
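If you would rather not pull one off the web, here is a minimal sketch of a position-returning split function (a hypothetical dbo.split built as a recursive CTE; for descriptions longer than about 100 words the calling query may need OPTION (MAXRECURSION 0)):
CREATE FUNCTION dbo.split (@s nvarchar(max), @sep nchar(1))
RETURNS TABLE
AS
RETURN
WITH splitter AS (
    -- anchor: first word plus the remainder of the string
    SELECT 1 AS pos,
           LEFT(@s, CHARINDEX(@sep, @s + @sep) - 1) AS word,
           STUFF(@s, 1, CHARINDEX(@sep, @s + @sep), '') AS rest
    UNION ALL
    -- recursive step: peel off the next word until nothing is left
    SELECT pos + 1,
           LEFT(rest, CHARINDEX(@sep, rest + @sep) - 1),
           STUFF(rest, 1, CHARINDEX(@sep, rest + @sep), '')
    FROM splitter
    WHERE rest <> ''
)
SELECT word, pos FROM splitter;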

looping through a numeric range for secondary record ID

So, I figure I could probably come up with some wacky solution, but I figure I might as well ask up front.
each user can have many orders.
each desk can have many orders.
each order has maximum 3 items in it.
Trying to set things up so a user can create an order, the order auto-generates a reference number, and each item gets a reference letter. The reference number is 0-99 and loops back around to 0 once it hits 99, so orders throughout the day are easy for the desks to reference.
So, user places an order for desk #2 of 3 items:
78A: red stapler
78B: pencils
78C: a kangaroo foot
Not sure if this would be done in the program logic or at the SQL level somehow.
I was thinking something like neworder = order.last + 1 and somehow tying that into a range on order create. Pretty fuzzy on the specifics.
Without knowing the answer to my comment above, I will assume you want to have the full audit stored, rather than wiping historic records; as such the 78A 78B 78C type orders are just a display format.
If you have a single Order table (containing your OrderId, UserId, DeskId, times and any other top-level stuff) and an OrderItem table (containing your OrderItemId, OrderId, LineItemId -- showing 1,2 or 3 for your first and optional second and third line items in the order, and ProductId) and a Product table (ProductId, Name, Description)
then this is quite simple (thankfully) using the modulo operator, which gives the remainder of a division, allowing you in this case to count in groups of 3 and 100 (or any other number you wish).
Just do something like the following:
(you will want to join the items into a single column, I have just kept them distinct so that you can see how they work)
Obviously join/query/filter on user, desk and product tables as appropriate
select
    o.OrderId,
    o.UserId,
    o.DeskId,
    o.OrderId % 100 + 1 as OrderNumber,
    case when oi.LineItemId % 3 = 1 then 'A'
         when oi.LineItemId % 3 = 2 then 'B'
         when oi.LineItemId % 3 = 0 then 'C'
    end as ItemLetter,
    oi.ProductId
from tb_Order o inner join tb_OrderItem oi on o.OrderId = oi.OrderId
Alternatively, you can add the ItemLetter (A, B, C) and/or the OrderNumber (1-100) as computed (and persisted) columns on the tables themselves, so that they are calculated once when inserted, rather than recalculated/formatted every time they are selected.
This sort of breaks the best practice of storing raw data in the DB and formatting on retrieval; but if you are not going to update the data, and you are going to read the data more often than you write it, then I would break this rule and calculate the formatting at insert time.
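A sketch of that alternative, using the table and column names from the query above (adjust to your actual schema); the expressions mirror the select:
ALTER TABLE tb_Order
    ADD OrderNumber AS (OrderId % 100 + 1) PERSISTED;

ALTER TABLE tb_OrderItem
    ADD ItemLetter AS (CASE LineItemId % 3
                           WHEN 1 THEN 'A'
                           WHEN 2 THEN 'B'
                           ELSE 'C'
                       END) PERSISTED;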