How can I group objects retrieved from database tables that have the same properties? - sql

I am working on an application that allows users to build a "book" from a number of "pages" and then place them in any order that they'd like. It's possible that multiple people can build the same book (the same pages in the same order). The books are built by the user prior to them being processed and printed, so I need to group books together that have the same exact layout (the same pages in the same order). I've written a million queries in my life, but for some reason I can't grasp how to do this.
I could simply write a big SELECT query, and then loop through the results and build arrays of objects that have the same pages in the same sequence, but I'm trying to figure out how to do this with one query.
Here is my data layout:
dbo.Books
    BookId
    Quantity
dbo.BookPages
    BookId
    PageId
    Sequence
dbo.Pages
    PageId
    DocName

So, I need some clarification on a few things:
Once a user orders the pages the way they want, are they saved back down to a database?
If yes, then is the question to run a query to group book orders that have the same page-numbering, so that they are sent to the printers in an optimal way?
Or does the user lay out the pages and then send the order directly to the printer? If so, it seems more complicated and less efficient to capture requested print jobs and order them on the fly on the way out to the printers.
What language/technology are you using to create this solution? .NET? Java?
With the answers to these questions, I can better gauge what you need.
Until then, I am assuming the following:
You are using some type of many-to-many table to store customer page ordering. If so, then you'll need to write a query to select distinct page-orderings, and group by those page orderings. This is possible with a single SQL query.
However, if you feel you want more control over how this data is joined, then doing this programmatically may be the way to go, although you will lose performance by reading in all the data, and then outputting that data in a way that is consumable by your printers.

Two books are identical only if the number of matching pages (same page at the same sequence position) equals the page count of each book.
It was tagged TSQL when I started. This may not be the same syntax in other SQL dialects.
;WITH BookPageCount
AS
(
-- number of pages in each book
select b1.BookId, COUNT(*) as [individualCount]
from dbo.BookPages b1 with (nolock)
group by b1.BookId
),
BookCombinedCount
AS
(
-- number of positions where two books have the same page
select b1.BookId as [book1ID], b2.BookId as [book2ID], COUNT(*) as [combinedCount]
from dbo.BookPages b1 with (nolock)
join dbo.BookPages b2 with (nolock)
on b1.BookId < b2.BookId
and b1.Sequence = b2.Sequence
and b1.PageId = b2.PageId
group by b1.BookId, b2.BookId
)
select BookCombinedCount.book1ID, BookCombinedCount.book2ID
from BookCombinedCount
join BookPageCount as book1 on book1.BookId = BookCombinedCount.book1ID
join BookPageCount as book2 on book2.BookId = BookCombinedCount.book2ID
where BookCombinedCount.combinedCount = book1.individualCount
and BookCombinedCount.combinedCount = book2.individualCount
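To see the counting idea above in action, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. The sample rows are invented, the table is created without the dbo schema, and the NOLOCK hints are dropped because SQLite has no equivalent, so treat it as an illustration rather than the T-SQL you would deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BookPages (BookId INTEGER, PageId INTEGER, Sequence INTEGER);
    -- Books 1 and 2: same pages, same order.
    -- Book 3: same pages as book 1, different order.
    -- Book 4: only the first two pages of book 1.
    INSERT INTO BookPages VALUES
        (1, 10, 1), (1, 20, 2), (1, 30, 3),
        (2, 10, 1), (2, 20, 2), (2, 30, 3),
        (3, 20, 1), (3, 10, 2), (3, 30, 3),
        (4, 10, 1), (4, 20, 2);
""")

IDENTICAL_PAIRS = """
WITH BookPageCount AS (
    -- pages per book
    SELECT BookId, COUNT(*) AS individualCount
    FROM BookPages
    GROUP BY BookId
),
BookCombinedCount AS (
    -- positions at which two different books hold the same page
    SELECT b1.BookId AS book1ID, b2.BookId AS book2ID, COUNT(*) AS combinedCount
    FROM BookPages b1
    JOIN BookPages b2
      ON b1.BookId < b2.BookId
     AND b1.Sequence = b2.Sequence
     AND b1.PageId = b2.PageId
    GROUP BY b1.BookId, b2.BookId
)
SELECT c.book1ID, c.book2ID
FROM BookCombinedCount c
JOIN BookPageCount p1 ON p1.BookId = c.book1ID
JOIN BookPageCount p2 ON p2.BookId = c.book2ID
WHERE c.combinedCount = p1.individualCount
  AND c.combinedCount = p2.individualCount
ORDER BY c.book1ID, c.book2ID
"""

identical_pairs = conn.execute(IDENTICAL_PAIRS).fetchall()
print(identical_pairs)
```

Only the pair (1, 2) survives: book 3 matches book 1 at a single position, and book 4 matches at two positions but book 1 has three pages, so the second count check rejects it.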

Related

SQL question: how do I find the count of IDs that are always mapped to a 'true' field in another table

I have a database that collects a list of document packages in one table and each individual page in another table
Each page has a PackageID connecting the two tables.
I'm trying to find the count of all packages where ALL pages connected to it have a boolean field (stored on the page table) of true. Even if 1/20 of the pages connected to the packageID is false, I don't want that packageID counted
Right now all I have is:
SELECT COUNT(DISTINCT pages.package_id)
FROM pages
WHERE boolean_field = true
But I'm not sure how to add the condition that if one page with that package_id has boolean_field != true, then I don't want it counted. I also want to know the count of those packages that have any pages that are false.
I'm not sure if I need a subquery, if statement, having clause, or what.
Any direction even if it's what operators I should study on would be super helpful. Thanks :).
select count(*)
from
(
select package_id
from pages
group by package_id
having min(boolean_field) = 1
) tmp
Another way to express this is:
select count(*)
from packages p
where not exists (select 1
from pages pp
where pp.package_id = p.package_id and
not pp.boolean_field
);
The advantage of this approach is that it avoids aggregation, which can be a big win performance wise. It can also take advantage of an index on pages(package_id, boolean_field).
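Both approaches above can be tried side by side with a small sketch like the following, run against in-memory SQLite from Python. The sample data is made up, and the boolean is stored as 0/1 because SQLite has no separate boolean type; the first query also returns the "any false" count the question asked about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE packages (package_id INTEGER PRIMARY KEY);
    CREATE TABLE pages (
        page_id INTEGER PRIMARY KEY,
        package_id INTEGER,
        boolean_field INTEGER  -- 1 = true, 0 = false
    );
    INSERT INTO packages VALUES (1), (2), (3);
    INSERT INTO pages (package_id, boolean_field) VALUES
        (1, 1), (1, 1),   -- package 1: every page true
        (2, 1), (2, 0),   -- package 2: one false page
        (3, 0);           -- package 3: only false pages
""")

# Aggregation: MIN is 1 only when every page of the package is true.
all_true, any_false = conn.execute("""
    SELECT COALESCE(SUM(all_true), 0), COALESCE(SUM(1 - all_true), 0)
    FROM (
        SELECT package_id, MIN(boolean_field) AS all_true
        FROM pages
        GROUP BY package_id
    ) AS per_package
""").fetchone()

# NOT EXISTS: count packages that have no false page at all.
not_exists_count = conn.execute("""
    SELECT COUNT(*)
    FROM packages p
    WHERE NOT EXISTS (
        SELECT 1 FROM pages pp
        WHERE pp.package_id = p.package_id AND pp.boolean_field = 0
    )
""").fetchone()[0]

print(all_true, any_false, not_exists_count)
```

One difference worth knowing: a package with no pages at all is counted by the NOT EXISTS version but never appears in the GROUP BY version, since it has no rows in pages.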

bigquery best practice for cross query

I am new to Google BigQuery and am trying to understand the best practices here.
I have a (.NET) component that implements some article-reader behavior.
I have two tables.
One is articles and the other is user actions.
Articles is a general table containing thousands of possible articles to read.
User actions simply register when a user reads an article.
I have about 200,000 users in my system.
At a certain time, I need to prepare a bucket of possible articles for each user by taking 1000 articles from the articles table and omitting the ones the user has already read.
As I have over 100,000 users to build a bucket for, I am looking for the best possible solution:
Possible solutions:
1. Do the work in code:
a. Query for all articles.
b. Query for all user actions.
c. Create the user bucket in code, a long operation to omit the articles the user has already read.
That means I perform about (users count) + 1 queries in BigQuery, but I have to perform a large search in my code.
2. Use some smart join, though I am unsure how this would work, that leaves the searching work to BigQuery and also uses fewer query calls than the number of users.
Any help on 2 will be appreciated.
Thank you.
I would do something like this to populate a single table for all readers in one call:
Select User,Article
from
(
Select User,Article,
Row_Number() Over (Partition by User) as NBR -- to extract only 1000 per users
From
(
((Select User From
UserAction
Group Each by User) -- Unique Users table
Cross Join
Articles) as A -- A contains a list of users with all available articles
Left Join Each
(Select User,Article
From UserAction
where activity="read"
Group Each By User,Article
) as B --Using left join to add all available articles and..
On A.User=B.User
and A.Article=B.Article
where B.User Is Null --..filter out already read
)
)
where NBR<=1000 -- filter top 1000 per user
If you want to generate a query per user and can add the user to the query, I'd go for something simpler, such as:
Select Article
from Articles
where Article not in
(Select Article from UserAction where User = "your user here" )
limit 1000
Hope this helps
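The cross join plus left join idea above can be sketched on a small scale as follows. This runs on in-memory SQLite from Python rather than BigQuery, so the syntax differs from the legacy BigQuery SQL in the answer; the column name UserId, the sample rows, and BUCKET_SIZE (standing in for the 1000 in the question) are all assumptions made for the illustration.

```python
import sqlite3

BUCKET_SIZE = 2  # hypothetical stand-in for the 1000 articles per user

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (Article TEXT PRIMARY KEY);
    CREATE TABLE UserAction (UserId TEXT, Article TEXT, activity TEXT);
    INSERT INTO Articles VALUES ('a1'), ('a2'), ('a3'), ('a4');
    INSERT INTO UserAction VALUES
        ('alice', 'a1', 'read'),
        ('alice', 'a2', 'read'),
        ('bob',   'a3', 'read');
""")

# One query for all users: pair every user with every article,
# drop the pairs that have a matching 'read' row, then cap per user.
rows = conn.execute("""
    SELECT UserId, Article
    FROM (
        SELECT u.UserId, a.Article,
               ROW_NUMBER() OVER (PARTITION BY u.UserId ORDER BY a.Article) AS nbr
        FROM (SELECT DISTINCT UserId FROM UserAction) AS u
        CROSS JOIN Articles AS a
        LEFT JOIN UserAction AS r
               ON r.UserId = u.UserId
              AND r.Article = a.Article
              AND r.activity = 'read'
        WHERE r.UserId IS NULL          -- keep only unread articles
    ) AS ranked
    WHERE nbr <= ?
    ORDER BY UserId, Article
""", (BUCKET_SIZE,)).fetchall()

# Group the flat result into one bucket per user on the application side.
buckets = {}
for user_id, article in rows:
    buckets.setdefault(user_id, []).append(article)

print(buckets)
```

The application then issues a single query and only has to split the result by user, instead of running one query per user or filtering everything in code.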

What is the best way to reduce sql queries in my situation

Here is the situation: each page shows 30 topics, so I have to execute at least 1 SQL statement. I also want to show how many replies each topic has and who the author is, so
I have to use 30 statements to count the replies and another 30 statements to find the authors. In total that is 61 statements, and I am really worried about the efficiency.
My tables look like this:
Topic       Reply       User
-------     ----------  ------------
id          id          id
title       topic_id    username
...         ...
author_id
You should look into joining tables during a query.
Joins in SQLServer http://msdn.microsoft.com/en-us/library/ms191517.aspx
Joins in MySQL http://dev.mysql.com/doc/refman/5.0/en/join.html
As an example, I could do the following:
SELECT reply.id, reply.authorid, reply.text, reply.topicid,
topic.title,
user.username
FROM reply
LEFT JOIN topic ON (topic.id = reply.topicid)
LEFT JOIN user ON (user.id = reply.authorid)
WHERE (reply.isactive = 1)
ORDER BY reply.postdate DESC
LIMIT 10
If I read your requirements correctly, you want the result of the following query:
SELECT Topic.title, User.username, COUNT(Reply.topic_id) Replies
FROM Topic, User, Reply
WHERE Topic.id = Reply.topic_id
AND Topic.author_id = User.id
GROUP BY Topic.title, User.username
When I was first starting out with database-driven web applications I had similar problems. I then spent several years working in a database-rich environment where I actually learned SQL. If you intend to continue developing web applications (which I find are very fun to create), it would be worth your time to pick up a book or check out some how-tos on basic and advanced SQL.
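To make the single-query idea concrete, here is a small sketch on in-memory SQLite from Python, using the Topic, Reply and User tables from the question with invented rows. It is a variation on the query above, not the same one: it uses a LEFT JOIN to Reply so that topics with no replies still appear with a count of 0, and it groups by the topic id so two topics with the same title stay separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User  (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE Topic (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE TABLE Reply (id INTEGER PRIMARY KEY, topic_id INTEGER);
    INSERT INTO User  VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO Topic VALUES (1, 'First', 1), (2, 'Second', 2), (3, 'Quiet', 1);
    INSERT INTO Reply (topic_id) VALUES (1), (1), (2);
""")

# One statement replaces the 1 + 30 + 30 statements from the question.
topics = conn.execute("""
    SELECT t.id, t.title, u.username, COUNT(r.id) AS replies
    FROM Topic AS t
    JOIN User AS u ON u.id = t.author_id
    LEFT JOIN Reply AS r ON r.topic_id = t.id   -- keeps topics without replies
    GROUP BY t.id, t.title, u.username
    ORDER BY t.id
    LIMIT 30                                    -- one page of topics
""").fetchall()

for topic_id, title, username, replies in topics:
    print(topic_id, title, username, replies)
```

COUNT(r.id) counts only non-NULL reply ids, which is why the topic without replies reports 0 instead of 1.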
One thing to add, on top of JOINS
It may be that your groups of data do not match or relate, so JOINs won't work. Put another way: you may have 2 main chunks of data that are awkward to join.
Stored procedures can return multiple result sets.
For example, for a summary page you could return one aggregate result set and another "last 20" result set in one SQL call. To JOIN the 2 is awkward because it doesn't "fit" together.
You certainly can use some "left joins" on this one. However, since the output only changes if someone updates or adds to your tables, you could try to cache it in an XML/text file. Another way could be to build in some redundancy by adding extra columns to the topic table that keep the reply count, username, etc., and update them only when changes occur.

What is the best way to implement this SQL query?

I have a PRODUCTS table, and each product can have multiple attributes so I have an ATTRIBUTES table, and another table called ATTRIBPRODUCTS which sits in the middle. The attributes are grouped into classes (type, brand, material, colour, etc), so people might want a product of a particular type, from a certain brand.
PRODUCTS
product_id
product_name
ATTRIBUTES
attribute_id
attribute_name
attribute_class
ATTRIBPRODUCTS
attribute_id
product_id
When someone is looking for a product they can select one or many of the attributes. The problem I'm having is returning a single product that has multiple attributes. This should be really simple I know but SQL really isn't my thing and past a certain point I get a bit lost in the logic. The problem is I'm trying to check each attribute class separately so I want to end up with something like:
SELECT DISTINCT products.product_id
FROM attribproducts
INNER JOIN products ON attribproducts.product_id = products.product_id
WHERE (attribproducts.attribute_id IN (9,10,11)
AND attribproducts.attribute_id IN (60,61))
I've used IN to separate the blocks of attributes of different classes, so I end up with the products which are of certain types, but also of certain brands. From the results I've had it seems to be that AND between the IN statements that's causing the problem.
Can anyone help a little? I don't have the luxury of completely refactoring the database unfortunately, there is a lot more to it than this bit, so any suggestions how to work with what I have will be gratefully received.
Take a look at the answers to the question SQL: Many-To-Many table AND query. It's the exact same problem. Cletus gave 2 possible solutions there, neither of which is trivial (but then again, there simply is no trivial solution).
SELECT DISTINCT p.product_id
FROM products p
INNER JOIN attribproducts ptype on p.product_id = ptype.product_id
INNER JOIN attribproducts pbrand on p.product_id = pbrand.product_id
WHERE ptype.attribute_id IN (9,10,11)
AND pbrand.attribute_id IN (60,61)
Try this:
select * from products p, attribproducts a1, attribproducts a2
where p.product_id = a1.product_id
and p.product_id = a2.product_id
and a1.attribute_id in (9,10,11)
and a2.attribute_id in (60,61);
Your query will return no rows because the WHERE clause is checked against one row at a time, and you're only keeping rows whose attribute_id is in (9, 10, 11) AND in (60, 61).
Because those sets don't intersect, you'll get no rows.
If you use OR instead, it'll give products with attributes that are in the set 9, 10, 11, 60, 61, which isn't what you want either, although you'll then get multiple rows for each product.
You could use that select as a subquery in a GROUP BY statement, grouping by product and ordering the groups by the number of matched attributes. That will give you the best matches first.
Alternatively (as another answer shows), you could join with a new copy of the table for each attribute set, giving you only those products that match all attribute sets.
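The difference between these approaches is easy to see on a few rows. The sketch below uses in-memory SQLite from Python with invented products; the third query (GROUP BY with HAVING) is an extra variation added for comparison and is not taken from any of the answers here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE attribproducts (attribute_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 'type and brand'), (2, 'type only'), (3, 'brand only');
    INSERT INTO attribproducts VALUES
        (9, 1), (60, 1),   -- product 1 matches both classes
        (10, 2),           -- product 2 matches the type class only
        (61, 3);           -- product 3 matches the brand class only
""")

# The original attempt: a single row can never satisfy both IN lists.
single_row_and = conn.execute("""
    SELECT DISTINCT product_id
    FROM attribproducts
    WHERE attribute_id IN (9, 10, 11) AND attribute_id IN (60, 61)
""").fetchall()

# One copy of attribproducts per attribute class.
self_join = conn.execute("""
    SELECT DISTINCT p.product_id
    FROM products p
    JOIN attribproducts ptype  ON ptype.product_id  = p.product_id
    JOIN attribproducts pbrand ON pbrand.product_id = p.product_id
    WHERE ptype.attribute_id IN (9, 10, 11)
      AND pbrand.attribute_id IN (60, 61)
""").fetchall()

# Variation: one pass over attribproducts, one condition per class.
# Relies on SQLite evaluating IN to 0 or 1 so it can be summed.
group_having = conn.execute("""
    SELECT product_id
    FROM attribproducts
    GROUP BY product_id
    HAVING SUM(attribute_id IN (9, 10, 11)) > 0
       AND SUM(attribute_id IN (60, 61)) > 0
    ORDER BY product_id
""").fetchall()

print(single_row_and, self_join, group_having)
```

The self-join needs one extra join per attribute class the user selects, while the HAVING form needs one extra condition, which may be easier to generate when the number of classes varies.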
It sounds like you have a data schema that is GREAT for storage but terrible for selecting/reporting. When you have a data structure of OBJECT, ATTRIBUTE, OBJECT-ATTRIBUTE and OBJECT-ATTRIBUTE-VALUE you can store many objects with many different attributes per object. This is sometimes referred to as "Vertical Storage".
However, when you want to retrieve a list of objects with all of their attribute values, you have to make a variable number of joins. It is much easier to retrieve data when it is stored horizontally (defined columns of data).
I have run into this scenario several times. Since you cannot change the existing data structure, my suggestion would be to write a "layer" of tables on top. Dynamically create a table for each object/product you have. Then dynamically create static columns in those new tables for each attribute. Essentially, you need to "flatten" your vertically stored attributes/values into static columns, converting from a vertical architecture to a horizontal one.
Use the "flattened" tables for reporting, and use the vertical tables for storage.
If you need sample code or more details, just ask me.
I hope this is clear. I have not had much coffee yet :)
Thanks,
- Mark
You can use multiple inner joins -- I think this would work:
select distinct p.product_id
from products p
inner join attribproducts a1 on a1.product_id=p.product_id
inner join attribproducts a2 on a2.product_id=p.product_id
where a1.attribute_id in (9,10,11)
and a2.attribute_id in (60,61)

SQL database problems with addressbook table design

I am writing an address book module for my software right now. So far I have the database set up so that it supports a very flexible address book configuration.
I can create n entries for every type I want. Type here means data like 'email', 'address', 'telephone' etc.
I have a table named 'contact_profiles'.
This only has two columns:
id Primary key
date_created DATETIME
And then there is a table called contact_attributes. This one is a little more complex:
id PK
#profile (Foreign key to contact_profiles.id)
type VARCHAR describing the type of the entry (name, email, phone, fax, website, ...) I should probably change this to a SET later.
value Text (containing the value for the attribute).
I can now link to these profiles, for example from my user's table. But from here I run into problems.
At the moment I would have to create a JOIN for each value that I want to retrieve.
Is there a possibility to somehow create a view that gives me a result with the types as columns?
So right now I would get something like
#profile  type     value
1         email    name@domain.tld
1         name     Sebastian Hoitz
1         website  domain.tld
But it would be nice to get a result like this:
#profile  email            name             website
1         name@domain.tld  Sebastian Hoitz  domain.tld
The reason I do not want to create the table layout like this initially is, that there might always be things to add and I want to be able to have multiple attributes of the same type.
So do you know if there is any possibility to convert this dynamically?
If you need a better description please let me know.
You have reinvented a database design called Entity-Attribute-Value. This design has a lot of weaknesses, including the weakness you've discovered: it's very hard to reproduce a query result in a conventional format, with one column per attribute.
Here's an example of what you must do:
SELECT c.id, c.date_created,
c1.value AS name,
c2.value AS email,
c3.value AS phone,
c4.value AS fax,
c5.value AS website
FROM contact_profiles c
LEFT OUTER JOIN contact_attributes c1
ON (c.id = c1.profile AND c1.type = 'name')
LEFT OUTER JOIN contact_attributes c2
ON (c.id = c2.profile AND c2.type = 'email')
LEFT OUTER JOIN contact_attributes c3
ON (c.id = c3.profile AND c3.type = 'phone')
LEFT OUTER JOIN contact_attributes c4
ON (c.id = c4.profile AND c4.type = 'fax')
LEFT OUTER JOIN contact_attributes c5
ON (c.id = c5.profile AND c5.type = 'website');
You must add another LEFT OUTER JOIN for every attribute. You must know the attributes at the time you write the query. You must use LEFT OUTER JOIN and not INNER JOIN because there's no way to make an attribute mandatory (the equivalent of simply declaring a column NOT NULL).
It's far more efficient to retrieve the attributes as they are stored, and then write application code to loop through the result set, building an object or associative array with an entry for each attribute. You don't need to know all the attributes this way, and you don't have to execute an n-way join.
SELECT * FROM contact_profiles c
LEFT OUTER JOIN contact_attributes ca ON (c.id = ca.profile);
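As a rough illustration of the application-side loop described above, the sketch below reads the rows as stored and builds one dictionary per profile. It uses Python with in-memory SQLite; the sample rows, the dates, and the choice to keep a list of values per attribute type (so repeated types are preserved) are assumptions made for the example.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_profiles (id INTEGER PRIMARY KEY, date_created TEXT);
    CREATE TABLE contact_attributes (
        id INTEGER PRIMARY KEY, profile INTEGER, type TEXT, value TEXT
    );
    INSERT INTO contact_profiles VALUES (1, '2009-01-01'), (2, '2009-01-02');
    INSERT INTO contact_attributes (profile, type, value) VALUES
        (1, 'name', 'Sebastian Hoitz'),
        (1, 'email', 'name@domain.tld'),
        (1, 'email', 'other@domain.tld'),
        (1, 'website', 'domain.tld');
""")

# One plain join, no per-attribute joins and no hard-coded attribute list.
rows = conn.execute("""
    SELECT c.id, ca.type, ca.value
    FROM contact_profiles c
    LEFT OUTER JOIN contact_attributes ca ON c.id = ca.profile
    ORDER BY c.id, ca.id
""")

# profile id -> {attribute type -> [values]}
profiles = {}
for profile_id, attr_type, value in rows:
    attrs = profiles.setdefault(profile_id, defaultdict(list))
    if attr_type is not None:  # profiles without attributes yield NULLs
        attrs[attr_type].append(value)

print({pid: dict(attrs) for pid, attrs in profiles.items()})
```

New attribute types show up as new dictionary keys without any change to the query, and a profile with no attributes simply ends up with an empty dictionary.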
You asked in a comment what to do if you need this level of flexibility, if not use the EAV design? SQL is not the correct solution if you truly need unlimited metadata flexibility. Here are some alternatives:
Store a TEXT BLOB, containing all the attributes structured in XML or YAML format.
Use a semantic data modeling solution like Sesame, in which any entity can have dynamic attributes.
Abandon databases and use flat files.
EAV and all of these alternative solutions are a lot of work. You should consider very carefully whether you truly need this degree of flexibility in your data model, because it's far simpler if you can treat the metadata structure as relatively unchanging.
If you are limiting yourself to displaying a single email, name, website, etc. for each person in this query, I'd use subqueries:
SELECT cp.id profile
,(SELECT value FROM contact_attributes WHERE type = 'name' and profile = cp.id) name
,(SELECT value FROM contact_attributes WHERE type = 'email' and profile = cp.id) email
,(SELECT value FROM contact_attributes WHERE type = 'website' and profile = cp.id) website
,(SELECT value FROM contact_attributes WHERE type = 'phone' and profile = cp.id) phone
FROM contact_profiles cp
If you're using SQL Server, you could also look at PIVOT.
If you want to show multiple emails, phones, etc., then consider that each profile must have the same number of them or you'll have blanks.
I'd also factor out the type column. Create a table called contact_attribute_types which would hold "email", "website", etc. Then you'd store the contact_attribute_types.id integer value in the contact_attributes table.
You will need to generate a query like:
select profile,
max(case when type='email' then value end) as email,
max(case when type='name' then value end) as name,
max(case when type='website' then value end) as website
from contact_attributes
group by profile
However, that will only show one value for each type per #profile. Your DBMS may have a function you can use instead of MAX to concatenate all the values as a comma-separated string, or you may be able to write one.
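Here is a small sketch of that pivot, run on in-memory SQLite from Python with invented rows. SQLite's GROUP_CONCAT is used as the concatenating aggregate mentioned above; other systems have their own equivalents, and the order of the concatenated values is not guaranteed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_attributes (
        id INTEGER PRIMARY KEY, profile INTEGER, type TEXT, value TEXT
    );
    INSERT INTO contact_attributes (profile, type, value) VALUES
        (1, 'email', 'name@domain.tld'),
        (1, 'name', 'Sebastian Hoitz'),
        (1, 'website', 'domain.tld'),
        (2, 'name', 'Second Person'),
        (2, 'email', 'a@example.org'),
        (2, 'email', 'b@example.org');
""")

pivoted = conn.execute("""
    SELECT profile,
           MAX(CASE WHEN type = 'email'   THEN value END) AS email,
           MAX(CASE WHEN type = 'name'    THEN value END) AS name,
           MAX(CASE WHEN type = 'website' THEN value END) AS website,
           -- every email of the profile, comma-separated
           GROUP_CONCAT(CASE WHEN type = 'email' THEN value END) AS all_emails
    FROM contact_attributes
    GROUP BY profile
    ORDER BY profile
""").fetchall()

for row in pivoted:
    print(row)
```

For profile 2, MAX keeps only one of the two emails (the alphabetically last), while the GROUP_CONCAT column carries both; a missing attribute such as the website comes back as NULL.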
This kind of data model is generally best avoided for the reasons you have already mentioned!
Create a view for each contact type.
When you want all the information, you pull from the entire table; when you want a subset of a specific contact type, you pull from the view.
I'd create a stored procedure that takes the intent {all, phone, email, address} as one of the parameters and then derives the data. All my app code would call this stored procedure to get the data. Also, when a new type is added (which should be very infrequent), you create another view and modify only this sproc.
I've implemented a similar design for multiple small/med size systems and have had no issues.
Am I missing something? This seems trivial?
EDIT:
I see what I was missing... You are trying to be normalized and denormalized at the same time. I'm not sure what the rest of your business rules are for pulling records. You could have profiles with multiple or null values for phone/email/addresses etc. I would keep your data format the same and again use a sproc to create the specific view you want. As your business needs change, you leave your data alone and just create another sproc to access it.
There is no one right answer for this question, as one would need to know, for your specific organization or application, how many of those contact methods the business wants to collect, how current they want the information to be, and how much flexibility they are willing to invest in.
Of course, many of us here could make some good guesses as to what the average business would want to do, but the real answer is to find out what your project and your users are interested in.
BTW, all architecture questions about "best"-ness require this sort of cost, benefit, and risk analysis.
Now that document-oriented databases are becoming more and more popular, one could use one of them to store all this information in a single entry, thereby eliminating all those extra joins and queries.