I'm curious about something in a SQL Server database. My current query pulls data about my employer's items for sale. It finds information for just under 105,000 items, which is correct. However, it returns over 155,000 rows, because each item has other things related to it. Right now, I run that data through a loop in Python, manually flattening it out by checking if the item the loop is working on is the same one it just worked on. If it is, I start filling in that item's extra information. Ideally, the SQL would return all this data already put into one row.
Here is an overview of the setup. I'm leaving out a few details for simplicity's sake, since I'm curious about the general theory, not looking for something I can copy and paste.
Item: contains the item ID, SKU, description, vendor ID, weight, and dimensions.
AttributeName: contains attr_id and attr_text. For instance, "color", "size", or "style".
AttributeValue: contains attr_value_id and attr_text. For instance, "blue" or "small".
AttributeAssign: contains item_id and attr_id. This ties attribute names to items.
attributeValueAssign: contains item_id and attr_value_id, tying attribute values to items.
A series of attachments is set up in a similar way, but with attachment and attachmentAssignment. Attachments can have only values, no names, so there is no need for the extra complexity of a third table as there is with attributes.
Vendor is simple: the ID is used in the item table. That is:
select item_id, vendorName
from item
join vendor on vendor_id = item.vendorNumber
gets you the name of an item's vendor.
Now, the fun part: items may or may not have vendors, attributes, or attachments. If they have either of the latter two, there's no way to know how many they have. I've seen items with 0 attributes and items with 5. Attachments are simpler, as there can only be 0 or 1 per item, but the possibility of 0 still demands a left outer join so I am guaranteed to get all the items.
That's how I get multiple rows per item. If an item has three attributes, I get either four or seven rows for just that item--I'm not sure if it's a row per name/value or a row per name AND a row per value. Either way, this is the kind of thing I'd like to stop. I want each row in my result set to contain all attributes, with a cap at seven and null for any missing attribute. That is, something like:
item_id; item_title; item_sku; ... attribute1_name; attribute1_value; attribute2_name; attribute2_value; ... attribute7_value
1; some random item; 123-45; ... color; blue; size; medium; ... null
Right now, I'd get multiple rows for that, such as (only ID and attributes):
ID; attribute 1 name; attribute 1 value; attribute 2 name; attribute 2 value
1; color; blue; null; null
1; color; blue; size; medium
I'm after the second row only--all the information put together into one row per unique item ID. Currently, though, I get multiple rows, and Python has to put everything together. I'm outputting this to a spreadsheet, so information about an item has to be on that item's row.
I can just keep using Python if this is too much bother. But I wondered if there was a way to do it that would be relatively easy. My script works fine, and execution time isn't a concern. This is more for my own curiosity than a need to get anything working. Any thoughts on how--or if--this is possible?
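For reference, the Python-side flattening described above can be sketched roughly as follows. The row shape (item_id, attribute name, attribute value) is an assumption made for illustration, as is the requirement that rows arrive sorted by item ID; the real query returns more columns.

```python
from itertools import groupby
from operator import itemgetter

MAX_ATTRS = 7  # cap taken from the question


def flatten(rows):
    """Collapse (item_id, attr_name, attr_value) rows into one row per item.

    Assumes rows are sorted by item_id, as an ORDER BY item_id would give.
    """
    flat = []
    for item_id, group in groupby(rows, key=itemgetter(0)):
        # Keep real attributes only; an item with none arrives as (id, None, None).
        pairs = [(name, value) for _, name, value in group if name is not None]
        pairs = pairs[:MAX_ATTRS]
        # Pad with None so every output row has the same width.
        pairs += [(None, None)] * (MAX_ATTRS - len(pairs))
        flat.append([item_id] + [cell for pair in pairs for cell in pair])
    return flat


# Hypothetical sample rows: item 1 has two attributes, item 2 has none.
rows = [
    (1, "color", "blue"),
    (1, "size", "medium"),
    (2, None, None),
]
result = flatten(rows)
```

Using groupby replaces the "is this the same item as last time?" check, but it only groups adjacent rows, so the sort order matters.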
Here is @WCWedin's answer modified to use a CTE.
WITH attrib_rn as
(
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from attributes
)
select
i.item_id,
attr1.name as attribute1_name, attr1.value as attribute1_value,
...
attr7.name as attribute7_name, attr7.value as attribute7_value
from items i
left join attrib_rn as attr1 ON attr1.item_id = i.item_id AND attr1.row_number = 1
left join attrib_rn as attr2 ON attr2.item_id = i.item_id AND attr2.row_number = 2
left join attrib_rn as attr3 ON attr3.item_id = i.item_id AND attr3.row_number = 3
left join attrib_rn as attr4 ON attr4.item_id = i.item_id AND attr4.row_number = 4
left join attrib_rn as attr5 ON attr5.item_id = i.item_id AND attr5.row_number = 5
left join attrib_rn as attr6 ON attr6.item_id = i.item_id AND attr6.row_number = 6
left join attrib_rn as attr7 ON attr7.item_id = i.item_id AND attr7.row_number = 7
Since you only want the first 7 attributes and you want to keep all of the logic in the SQL query, you're probably looking at using row_number. Subqueries will do the job directly with multiple joins, and the performance will probably be pretty good since you're only joining so many times.
select
i.item_id,
attr1.name as attribute1_name, attr1.value as attribute1_value,
...
attr7.name as attribute7_name, attr7.value as attribute7_value
from
items i
left join (
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from
attributes
) AS attr1 ON
attr1.item_id = i.item_id
AND attr1.row_number = 1
...
left join (
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from
attributes
) AS attr7 ON
attr7.item_id = i.item_id
AND attr7.row_number = 7
In SQL Server, you can tackle this with a subquery containing ROW_NUMBER() OVER, plus a few CASE expressions to map the top 7 into columns.
A little tricky, but post your full query that returns the big list and I'll demonstrate how to transpose it.
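In the meantime, here is a minimal sketch of the ROW_NUMBER plus CASE idea. It runs against an in-memory SQLite database from Python rather than SQL Server (the window-function syntax used here is the same, and SQLite 3.25 or later is assumed). The items and attributes tables are hypothetical stand-ins for the real schema, and only two attribute slots are shown instead of seven.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY);
    CREATE TABLE attributes (item_id INTEGER, name TEXT, value TEXT);
    INSERT INTO items VALUES (1), (2);
    INSERT INTO attributes VALUES (1, 'color', 'blue'), (1, 'size', 'medium');
""")

# Number each item's attributes, then let MAX(CASE ...) pick the n-th one
# into its own column. Items with no attributes keep NULL in every slot.
query = """
    SELECT i.item_id,
           MAX(CASE WHEN a.rn = 1 THEN a.name  END) AS attribute1_name,
           MAX(CASE WHEN a.rn = 1 THEN a.value END) AS attribute1_value,
           MAX(CASE WHEN a.rn = 2 THEN a.name  END) AS attribute2_name,
           MAX(CASE WHEN a.rn = 2 THEN a.value END) AS attribute2_value
    FROM items i
    LEFT JOIN (
        SELECT item_id, name, value,
               ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY name) AS rn
        FROM attributes
    ) a ON a.item_id = i.item_id
    GROUP BY i.item_id
    ORDER BY i.item_id
"""
pivoted = conn.execute(query).fetchall()
```

Each CASE yields a value for exactly one numbered row and NULL for the rest, so the aggregate collapses the group to that single value.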
Using Oracle SQL Developer, I have three tables with some common data that I need to join.
Appreciate any help on this!
Please refer to https://i.stack.imgur.com/f37Jh.png for the input and desired output (table formatting doesn't work on all tables).
These tables are made up in order to anonymize them, and in reality contain other data with millions of entries, but you could think of them as representing:
Product = Main product categories in a grocery store.
Subproduct = Subcategory products to the above. Each time the table is updated, the main product category may lose some subproducts or have new ones assigned to it. E.g. you can see that from May to June the Pulled pork entered while the Fishsoup was thrown out.
Issues = Status of the products, for example an apple is bad if it has brown spots on it.
What I need to find is: for each P_NAME, find the latest updated set of subproducts (SP_ID and SP_NAME), and append that information with the latest updated issue status (STATUS_FLAG).
Please note that each main product category gets its set of subproducts updated at individual occasions i.e. 1234 and 5678 might be "latest updated" on different dates.
I have tried multiple queries but failed each time. I am using combos of SELECT, LEFT OUTER JOIN, JOIN, MAX and GROUP BY.
Latest attempt, which gives me the combo of the first two tables, but missing the third:
SELECT
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUBPRODUCT.SP_VALUE_DATE
FROM SUBPRODUCT
LEFT OUTER JOIN PRODUCT ON PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
JOIN(SELECT SP_PRODUCT_ID, MAX(SP_VALUE_DATE) AS latestdate FROM SUBPRODUCT GROUP BY SP_PRODUCT_ID) sub ON
sub.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID AND sub.latestDate = SUBPRODUCT.SP_VALUE_DATE;
Trying to find the row with a max value is a common SQL pattern. You can do it with a join, like your example, but it's usually clearer to use a subquery or a window function.
Correlated subquery example
select
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUBPRODUCT.SP_VALUE_DATE,
ISSUES.STATUS_FLAG, ISSUES.STATUS_LAST_UPDATED
from PRODUCT
join SUBPRODUCT
on PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
and SUBPRODUCT.SP_VALUE_DATE = (select max(S2.SP_VALUE_DATE) as latestDate
from SUBPRODUCT S2
where S2.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID)
join ISSUES
on ISSUES.ISSUE_ID = SUBPRODUCT.SP_ID
and ISSUES.STATUS_LAST_UPDATED = (select max(I2.STATUS_LAST_UPDATED) as latestDate
from ISSUES I2
where I2.ISSUE_ID = ISSUES.ISSUE_ID)
Window function / inline view example
select
PRODUCT.P_NAME,
S.SP_PRODUCT_ID, S.SP_NAME, S.SP_ID, S.SP_VALUE_DATE,
I.STATUS_FLAG, I.STATUS_LAST_UPDATED
from PRODUCT
join (select SUBPRODUCT.*,
max(SP_VALUE_DATE) over (partition by SP_PRODUCT_ID) as latestDate
from SUBPRODUCT) S
on PRODUCT.P_ID = S.SP_PRODUCT_ID
and S.SP_VALUE_DATE = S.latestDate
join (select ISSUES.*,
max(STATUS_LAST_UPDATED) over (partition by ISSUE_ID) as latestDate
from ISSUES) I
on I.ISSUE_ID = S.SP_ID
and I.STATUS_LAST_UPDATED = I.latestDate
This often performs a bit better, but window functions can be tricky to understand.
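To make the window-function step less opaque, here is a small sketch that shows what the inline view produces before the filter is applied. It uses Python's sqlite3 module instead of Oracle (SQLite 3.25 or later is assumed for window functions), a cut-down SUBPRODUCT table, and made-up dates; only the product IDs and subproduct names echo the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subproduct (sp_product_id INTEGER, sp_name TEXT, sp_value_date TEXT);
    -- Hypothetical sample rows; the dates are invented for illustration.
    INSERT INTO subproduct VALUES
        (1234, 'Fishsoup',    '2020-05-01'),
        (1234, 'Pulled pork', '2020-06-01'),
        (5678, 'Apple',       '2020-04-01');
""")

# Step 1: every row is kept, and each one is tagged with its partition's max date.
annotated = conn.execute("""
    SELECT sp_product_id, sp_name, sp_value_date,
           MAX(sp_value_date) OVER (PARTITION BY sp_product_id) AS latest_date
    FROM subproduct
    ORDER BY sp_product_id, sp_value_date
""").fetchall()

# Step 2: keeping only rows whose own date equals that max leaves the latest set.
latest = conn.execute("""
    SELECT sp_product_id, sp_name
    FROM (
        SELECT s.*,
               MAX(s.sp_value_date) OVER (PARTITION BY s.sp_product_id) AS latest_date
        FROM subproduct s
    ) t
    WHERE sp_value_date = latest_date
    ORDER BY sp_product_id
""").fetchall()
```

Unlike GROUP BY, the window version does not collapse rows, which is why each product can be "latest" on a different date without any extra join.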
I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):
SELECT
media_files.album,
media_files.artist,
ARRAY_AGG(media_files.id) AS media_file_ids
FROM
media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
playlist_media_files.playlist_id = 1
GROUP BY
media_files.album,
media_files.artist
ORDER BY
media_files.album ASC
It works fine; the goal was to extract album/artist combinations and, in the result set, have an array of media file IDs for each combination.
The problem is that I have another column in media files, which is artwork.
artwork is unique for each media file (even in the same album) but in the result set I need to return just the first of the set.
So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like to return just the first (or a randomly picked one for that collection).
Is it possible to do this with only SQL/window functions (first_value over ...)?
Yes, it's possible. First, let's tweak your query by adding aliases and explicit column qualifiers so it's clear what comes from where - assuming I've guessed correctly, since I can't be sure without table definitions:
SELECT
mf.album,
mf.artist,
ARRAY_AGG (mf.id) AS media_file_ids
FROM
"media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
pmf.playlist_id = 1
GROUP BY
mf.album,
mf.artist
ORDER BY
mf.album ASC
Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON will be so convoluted as not to be worth it.
What you really want is something like a pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate and it isn't really worth implementing it for the job. You could use min(artwork) or max(artwork) and you may find that this actually performs better than the later solutions.
To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:
(SELECT mf2.artwork
FROM media_files mf2
WHERE mf2.artist = mf.artist
AND mf2.album = mf.album
LIMIT 1) AS picked_artwork
You can at a performance cost randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above.
Alternately, here's a quick and dirty way to implement selection of a random row in-line:
(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)]
Since there's no sample data I can't test these modifications. Let me know if there's an issue.
"First" pick
Wouldn't it be simpler / cheaper to just use min():
SELECT m.album
,m.artist
,array_agg(m.id) AS media_file_ids
,min(m.artwork) AS artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
GROUP BY m.album, m.artist
ORDER BY m.album, m.artist;
Arbitrary / random pick
If you are looking for a random selection, @Craig already provided a solution with truly random picks.
You could also use a CTE to avoid additional scans on the (possibly big) base table and then run two separate (cheap) subqueries on the small result set.
For arbitrary selection - not truly random, the result will depend on the physical order of rows in the table and implementation-specifics:
WITH x AS (
SELECT m.album, m.artist, m.id, m.artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
)
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM (
SELECT album, artist, array_agg(id) AS media_file_ids
FROM x
GROUP BY album, artist
) a
JOIN (
SELECT DISTINCT ON (1,2) album, artist, artwork
FROM x
) b USING (album, artist);
For truly random results, you can add an ORDER BY .. random() like this to subquery b:
JOIN (
SELECT DISTINCT ON (1, 2) album, artist, artwork
FROM x
ORDER BY 1, 2, random()
) b USING (album, artist);
Hi everybody of the stackoverflow community! I've been visiting this site for years and here comes my first post
Let's say I have a database with three tables:
groups (GroupID,GroupType,max1,size)
candies (candyID,name,selected)
members (groupID,nameID)
Example: The candy factory.
In the candy factory 10 types of candy bags are produced out of 80 different candies.
So: there are 10 unique group types (bags) in 3 different sizes (4, 5, 6); a group is a combination drawn from the 80 unique candies.
From this I build a database (with some rules about which candy combinations get into a group).
At this point I have a database with 40791 unique candy bags.
Now I want to compare a collection of candies with all the candy bags in the DB. As a result, I want the bags from the DB that are missing 3 or fewer candies relative to the collection being compared.
-- restore candy status
update candies set selected = 0, blacklisted = 0;
-- set status for candies to be selected
update candies set selected = 1 where name in ('candy01','candy02','candy03','candy04');
select groupId, GroupType, max, count(*) as remainingNum, group_concat(name,', ') as remaining
from groups natural join members natural join candies
where not selected
group by groupid having count(*) <= 3
UNION -- Union with groups which dont have any remaining candies and have a 100% match
select groupid, GroupType, max, 0 as remainingNum, "" as remaining
from groups natural join members natural join candies
where selected
group by groupid having count(*) =groups.size;
The above query does this. But what I am trying to accomplish is to do this without the union, because speed is of the essence. Also, I am new to SQL and am very eager to learn/see new methods.
Greetings, Rutger
I'm not 100% sure about what you are accomplishing through these queries, so I haven't looked at a fundamentally different approach. If you can include example data to demonstrate your logic, I can have a look at that. But, in terms of simply combining your two queries, I can do that. There is a note of caution first, however...
SQL is compiled into query plans. If the query plan for each query is significantly different from the other, combining them into a single query may be a bad idea. What you may end up with is a single plan that works for both cases, but is not very efficient for either. One poor plan can be a lot worse than two good plans => Shorter, more compact code does not always give faster code.
You can put selected into your GROUP BY instead of your WHERE clause; the fact that you have two UNIONed queries shows that you are treating them as two separate groups already.
Then, the only difference between your queries is the filter on count(*), which you can accommodate with a CASE WHEN expression...
SELECT
groups.groupID,
groups.GroupType,
groups.max,
CASE WHEN Candies.Selected = 0 THEN count(*) ELSE 0 END as remainingNum,
CASE WHEN Candies.Selected = 0 THEN group_concat(candies.name,', ') ELSE '' END as remaining
FROM
groups
INNER JOIN
members
ON members.GroupID = groups.GroupID
INNER JOIN
candies
ON Candies.CandyID = members.CandyID
GROUP BY
Groups.GroupID,
Groups.GroupType,
Groups.max,
Candies.Selected
HAVING
CASE
WHEN Candies.Selected = 0 AND COUNT(*) <= 3 THEN 1
WHEN Candies.Selected = 1 AND COUNT(*) = Groups.Size THEN 1
ELSE 0
END
=
1
The layout changes are simply because I disagree with using NATURAL JOIN for maintenance reasons. They are a short-cut in initial build and a potential disaster in later development. But that's a different issue; you can read about it online if you want to.
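A different way to fold the two branches into a single pass, and not the approach shown above, is conditional aggregation: group by bag only, count the unselected candies with a CASE inside the aggregate, and filter on that count. The sketch below uses Python's sqlite3 module with a simplified, hypothetical schema (members joined directly to candies on candy_id, no groups table) and made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candies (candy_id INTEGER PRIMARY KEY, name TEXT, selected INTEGER);
    CREATE TABLE members (group_id INTEGER, candy_id INTEGER);
    INSERT INTO candies VALUES
        (1, 'candy01', 1), (2, 'candy02', 1), (3, 'candy03', 0),
        (4, 'candy04', 0), (5, 'candy05', 0), (6, 'candy06', 0);
    INSERT INTO members VALUES
        (1, 1), (1, 2),                  -- fully matched bag
        (2, 1), (2, 3),                  -- one candy missing
        (3, 3), (3, 4), (3, 5), (3, 6);  -- four missing, should be filtered out
""")

# One grouped pass: a fully matched bag simply has a remaining count of 0,
# so it passes the same "<= 3" filter as the partially matched bags.
query = """
    SELECT m.group_id,
           SUM(CASE WHEN c.selected = 0 THEN 1 ELSE 0 END) AS remaining_num,
           COALESCE(GROUP_CONCAT(CASE WHEN c.selected = 0 THEN c.name END, ', '), '')
               AS remaining
    FROM members m
    JOIN candies c ON c.candy_id = m.candy_id
    GROUP BY m.group_id
    HAVING SUM(CASE WHEN c.selected = 0 THEN 1 ELSE 0 END) <= 3
    ORDER BY m.group_id
"""
matches = conn.execute(query).fetchall()
```

Whether this is actually faster than the UNION depends on the plan the database picks, so it is worth timing both on the real data.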
Don't update the database when you're doing a select. Your first update (update candies set selected = 0, blacklisted = 0;) applies to the entire table and rewrites every record. Try doing this without the selected column, and change your UNION to UNION ALL. Beyond that, you could try INNER JOIN instead of NATURAL JOIN (though I don't know your schema for candy to members):
select groups.groupid, GroupType, max, count(*) as remainingNum, group_concat(name,', ') as remaining
from groups
inner join members on members.groupid = groups.groupid
inner join candies on candies.candyid = members.candyid
where name NOT in ('candy01','candy02','candy03','candy04')
group by groups.groupid
having count(*) <= 3
UNION ALL -- Union with groups which don't have any remaining candies and have a 100% match
select groups.groupid, GroupType, max, 0 as remainingNum, '' as remaining
from groups
inner join members on members.groupid = groups.groupid
inner join candies on candies.candyid = members.candyid
where name in ('candy01','candy02','candy03','candy04')
group by groups.groupid
having count(*) = groups.size;
This should at least perform better than updating all records in the table before querying it.
I am putting together a nice little database for adding values to options. All of these are set up through a map (Has and Belongs to Many) table, because many options point to a single value.
So I am trying to specify 3 option.ids and a single id in a value table - four integers to point to a single value. Three tables. And I am running into a problem with the WHERE part of the statement, because if multiple values share an option there are many results. And I need just a single result.
SELECT value.id, value.name FROM value
LEFT JOIN (option_map_value, option_table)
ON (value.id = option_map_value.value_id AND option_map_value.option_table_id = option_table.id)
WHERE option_table.id IN (5, 2, 3) AND value.y_axis_id = 16;
The problem with the statement seems to be the IN on the WHERE clause. If one of the numbers is different in the IN() part, then there are multiple results - which is not good.
I have tried DISTINCT, which again works if there is one result, but returns many if there are many. The closest we have gotten is adding a count - to return the value with the most options at the top.
So is there a way to make the WHERE more specific? I cannot break it out into option_table.id = 5 AND option_table.id = 2 - because that one fails. But can the WHERE clause be more specific?
Maybe it is me being pedantic, but I would like to be able to return just the single result, instead of a count of results... Any ideas?
The problem with the statement seems to be the IN on the WHERE clause. If one of the numbers is different in the IN() part, then there are multiple results - which is not good. I have tried DISTINCT, which again works if there is one result, but returns many if there are many. The closest we have gotten is adding a count - to return the value with the most options at the top.
You were very close, considering the DISTINCT:
SELECT v.id,
v.name
FROM VALUE v
LEFT JOIN OPTION_MAP_VALUE omv ON omv.value_id = v.id
LEFT JOIN OPTION_TABLE ot ON ot.id = omv.option_table_id
WHERE ot.id IN (5, 2, 3)
AND v.y_axis_id = 16
GROUP BY v.id, v.name
HAVING COUNT(*) = 3
You were on the right track, but needed to use GROUP BY instead in order to be able to use the HAVING clause to count the DISTINCT list of values.
Caveat emptor:
The GROUP BY/HAVING COUNT version of the query is dependent on your data model having a composite key, unique or primary, defined for the two columns involved (value_id and option_table_id). If this is not in place, the database will not stop duplicates being added. If duplicate rows are possible in the data, this version can return false positives because a value_id could have 3 associations to the option_table_id 5 - which would satisfy the HAVING COUNT(*) = 3.
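The false positive is easy to reproduce. The sketch below uses Python's sqlite3 module and a hypothetical OPTION_MAP_VALUE table with no unique key, then compares COUNT(*) against COUNT(DISTINCT option_table_id). The DISTINCT variant is offered here as one possible guard when the key cannot be added; it is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE option_map_value (value_id INTEGER, option_table_id INTEGER);
    INSERT INTO option_map_value VALUES
        (1, 5), (1, 2), (1, 3),   -- value 1 really has all three options
        (2, 5), (2, 5), (2, 5);   -- value 2 has option 5 mapped three times
""")


def matching_values(count_expr):
    """Return value_ids whose count expression equals 3 for options 5, 2, 3."""
    sql = f"""
        SELECT value_id
        FROM option_map_value
        WHERE option_table_id IN (5, 2, 3)
        GROUP BY value_id
        HAVING {count_expr} = 3
        ORDER BY value_id
    """
    return [row[0] for row in conn.execute(sql)]


naive = matching_values("COUNT(*)")                        # counts duplicate rows
safe = matching_values("COUNT(DISTINCT option_table_id)")  # counts distinct options
```

Here naive includes value 2 even though it only has option 5, while safe returns value 1 alone.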
Using JOINs:
A safer, though more involved, approach is to join onto the table that can have multiple options, as often as you have criteria:
SELECT v.id,
v.name
FROM VALUE v
JOIN OPTION_MAP_VALUE omv5 ON omv5.value_id = v.id
AND omv5.option_table_id = 5
JOIN OPTION_MAP_VALUE omv2 ON omv2.value_id = v.id
AND omv2.option_table_id = 2
JOIN OPTION_MAP_VALUE omv3 ON omv3.value_id = v.id
AND omv3.option_table_id = 3
WHERE v.y_axis_id = 16
GROUP BY v.id, v.name
Need help on a query using sql server 2005
I have two tables:
code (chargecode, chargeid, orgid)
entry (chargeid, itemNo, rate)
I need to list the chargeids in the entry table when entry contains rows for several different chargeids that are listed in the code table under the same chargecode.
data :
code
100,1,100
100,2,100
100,3,100
101,11,100
101,12,100
entry
1,x1,1
1,x2,2
2,x3,2
11,x4,1
11,x5,1
Using the above data, the query should list chargeids 1 and 2 and not 11.
I found a way to count how many rows in entry satisfy the criteria, but I'm failing to get the chargeids:
select count (distinct chargeId)
from entry where chargeid in (select chargeid from code where chargecode = (SELECT A.chargecode
from code as A join code as B
ON A.chargecode = B.chargeCode and A.chargetype = B.chargetype and A.orgId = B.orgId AND A.CHARGEID = b.CHARGEid
group by A.chargecode,A.orgid
having count(A.chargecode) > 1)
)
First off: I apologise for my completely inaccurate original answer.
The solution to your problem is a self-join. Self-joins are used when you want to select more than one row from the same table. In our case we want to select two charge IDs that have the same charge code:
SELECT DISTINCT c1.chargeid, c2.chargeid FROM code c1
JOIN code c2 ON c1.chargeid != c2.chargeid AND c1.chargecode = c2.chargecode
JOIN entry e1 ON e1.chargeid = c1.chargeid
JOIN entry e2 ON e2.chargeid = c2.chargeid
WHERE c1.chargeid < c2.chargeid
Explanation of this:
First we pick any two charge IDs from 'code'. The DISTINCT avoids duplicates. We make sure they're two different IDs and that they map to the same chargecode.
Then we join on 'entry' (twice) to make sure they both appear in the entry table.
This approach gives (for your example) the pairs (1,2) and (2,1). So we also insist on an ordering; this cuts the result set down to just (1,2), as you described.
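As a quick check, the self-join can be run against the sample data from the question. The sketch below uses Python's sqlite3 module rather than SQL Server 2005, assumes the column order given in the table listing, and drops the != condition because the < in the WHERE clause already implies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE code (chargecode INTEGER, chargeid INTEGER, orgid INTEGER);
    CREATE TABLE entry (chargeid INTEGER, itemNo TEXT, rate INTEGER);
    INSERT INTO code VALUES
        (100, 1, 100), (100, 2, 100), (100, 3, 100), (101, 11, 100), (101, 12, 100);
    INSERT INTO entry VALUES
        (1, 'x1', 1), (1, 'x2', 2), (2, 'x3', 2), (11, 'x4', 1), (11, 'x5', 1);
""")

# Pair up charge IDs that share a chargecode and both occur in entry.
pairs = conn.execute("""
    SELECT DISTINCT c1.chargeid, c2.chargeid
    FROM code c1
    JOIN code c2 ON c1.chargecode = c2.chargecode
    JOIN entry e1 ON e1.chargeid = c1.chargeid
    JOIN entry e2 ON e2.chargeid = c2.chargeid
    WHERE c1.chargeid < c2.chargeid
""").fetchall()
```

Charge ID 11 drops out because its only partner under chargecode 101, charge ID 12, never appears in entry.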