SQL Server: How do I maintain data integrity using aggregate functions with group by?

Here's my question: how do I maintain record integrity using aggregate functions with a group by?
To explain further, here's an example.
I have a table with the following columns: (Think of it as an "order" table)
Customer_Summary (first 10 char of name + first 10 char of address)
Customer_Name
Customer_Address
Customer_Postal_Code
Order_weekday
There is one row per "order", so many rows with the same customer name, address, and summary.
What I want to do is show the customer's name, address, and postal code, as well as the number of orders they've placed on each weekday, grouped by the customer's summary.
So the data should look like:
Summary | Name | Address | PCode | Monday | Tuesday | Wednesday | Thursday | Friday
test custntest addre|test custname|test address|123456 | 1 | 1 | 1 | 1 | 1
I only want to group records of similar customer summary together, but obviously I want one name, address, and postal code to show. I'm using min() at the moment, so my query looks like:
SELECT Customer_Summary, MIN(customer_name), MIN(customer_address), MIN(customer_postal_code)
FROM [Order]
GROUP BY Customer_Summary
I've omitted my weekday logic as I didn't think it was necessary.
My issue is this - some of these customers with the same customer summary have different addresses and postal codes.
So I might have two customers, looking like:
test custntest addre|test custname |test address |323456|
test custntest addre|test custname2|test address2|123456|
Using the group by, my query will return the following:
test custntest addre|test custname |test address |123456|
Since I'm using min, it's going to give me the minimum value for all of the fields, but not necessarily from the same record. So I've lost my record integrity here - the address and name returned by the query do not correctly match the postal code.
So how do I maintain data integrity on non-grouped fields when using a group by clause?
Hopefully I explained it clearly enough, and thanks in advance for the help.
EDIT: Solved. Thanks everyone!

You can always use ROW_NUMBER instead of GROUP BY
WITH A AS (
    SELECT Customer_Summary, customer_name, customer_address, customer_postal_code,
           ROW_NUMBER() OVER (PARTITION BY Customer_Summary ORDER BY customer_name, customer_address) AS rn
    FROM [Order]  -- Order is a reserved word in SQL Server, so it must be bracketed
)
SELECT Customer_Summary, customer_name, customer_address, customer_postal_code
FROM A
WHERE rn = 1
Then you are free to choose which customer to use through the ORDER BY clause. Currently I order them by name and then by address.
Edit:
My solution does what you asked for, but I agree with the others: changing the database structure would be a good idea if you were allowed to, which you are not (I saw your comment). In that case, ROW_NUMBER() is a good way to go.
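To make the difference concrete, here is a minimal sketch that runs both queries against the two sample customers from the question. It assumes SQLite (3.25 or later, for window functions) through Python's sqlite3 module as a stand-in for SQL Server; the table name `orders`, the sample rows, and the reduced set of weekday columns are made up for illustration. The weekday counts are aggregated in their own CTE and joined back, which is one possible way to combine them with ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        customer_summary TEXT, customer_name TEXT,
        customer_address TEXT, customer_postal_code TEXT,
        order_weekday TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        ("test custntest addre", "test custname", "test address", "323456", "Monday"),
        ("test custntest addre", "test custname2", "test address2", "123456", "Monday"),
        ("test custntest addre", "test custname2", "test address2", "123456", "Friday"),
    ],
)

# MIN() is evaluated per column, so the values can come from different rows.
mixed = conn.execute("""
    SELECT customer_summary, MIN(customer_name), MIN(customer_address),
           MIN(customer_postal_code)
    FROM orders
    GROUP BY customer_summary
""").fetchone()

# ROW_NUMBER() keeps one whole row per summary; the weekday counts are
# aggregated separately and joined back on the summary.
intact = conn.execute("""
    WITH ranked AS (
        SELECT customer_summary, customer_name, customer_address,
               customer_postal_code,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_summary
                   ORDER BY customer_name, customer_address
               ) AS rn
        FROM orders
    ),
    counts AS (
        SELECT customer_summary,
               SUM(CASE WHEN order_weekday = 'Monday' THEN 1 ELSE 0 END) AS monday,
               SUM(CASE WHEN order_weekday = 'Friday' THEN 1 ELSE 0 END) AS friday
        FROM orders
        GROUP BY customer_summary
    )
    SELECT r.customer_summary, r.customer_name, r.customer_address,
           r.customer_postal_code, c.monday, c.friday
    FROM ranked r
    JOIN counts c ON c.customer_summary = r.customer_summary
    WHERE r.rn = 1
""").fetchone()

print(mixed)   # name and address from one row, postal code from the other
print(intact)  # name, address and postal code all from the same row
```

In the MIN() result the postal code 123456 is paired with a name and address it never belonged to; the ROW_NUMBER() result keeps 323456 with its own name and address.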

I think you need to re-think your structure.
Ideally you would have a Customer table with a unique ID. Then you would use that unique ID in the Order table. Then you don't need the strange "first 10 characters" method that you are using. Instead, you just group by the unique ID from the Customer table.
You could even then also have a separate table for addresses, relating each address to the customer, with multiple rows (with fields marking them as home address, delivery address, billing address, etc).
This way you separate the Customer information from the Address information and from the Order information, so that if the customer changes name (marriage) or address (moving home) you don't break your data: everything is related by the IDs, not by the data itself.
[This relationship is known as a Foreign Key.]
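A rough sketch of that structure, assuming SQLite through Python's sqlite3 module rather than SQL Server. All table and column names (`customer`, `address`, `customer_order`, `kind`, and so on) and the sample rows are hypothetical; they only illustrate grouping by a customer ID instead of by a truncated-text summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        kind TEXT NOT NULL,            -- e.g. 'home', 'delivery', 'billing'
        line TEXT NOT NULL,
        postal_code TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_weekday TEXT NOT NULL
    );
    INSERT INTO customer VALUES (1, 'test custname'), (2, 'test custname2');
    INSERT INTO address VALUES
        (1, 1, 'billing', 'test address', '323456'),
        (2, 2, 'billing', 'test address2', '123456');
    INSERT INTO customer_order VALUES
        (1, 1, 'Monday'), (2, 2, 'Monday'), (3, 2, 'Friday');
""")

# Group by the customer's ID; name and address always come from that customer.
rows = conn.execute("""
    SELECT c.name, a.line, a.postal_code, COUNT(o.order_id) AS orders
    FROM customer c
    JOIN address a ON a.customer_id = c.customer_id AND a.kind = 'billing'
    LEFT JOIN customer_order o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, a.line, a.postal_code
    ORDER BY c.customer_id
""").fetchall()
print(rows)

# The foreign key rejects an order that points at a customer that does not exist.
try:
    conn.execute("INSERT INTO customer_order VALUES (4, 99, 'Monday')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

With this layout the two customers that shared a summary string are simply two different IDs, so they are never merged by accident.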

Related

Does join combine the same column with same name and data?

I'm reading this article by Miguel Grinberg, and in the 'The Join' part I'm confused by the result.
To sum up the part I'm concerned with: he joined a query and a subquery on the SAME table, on the condition that their customer_id values are equal.
Query selected: id, customer_id, order_date
Subquery selected: customer_id, max(order_date) AS last_order_date
When he joined it I was expecting something like:
id | customer_id | order_date | customer_id | last_order_date
--------------------------------------------------------------
But his result was:
id | customer_id | order_date | last_order_date
-----------------------------------------------
Where is the other customer_id selected from the subquery?
So I would like to confirm whether my understanding is correct: does a JOIN also merge two COLUMNS if they have the same NAME and VALUE?
The fact that the article uses select * when it should be using select orders.*, last_orders.last_order_date already makes me suspicious of anything else in the article.
Most databases would run the query and return two columns with customer_id -- as you suggest should happen. However, there is then a problem in accessing both those columns in an application. They have the same name. So, the columns might be elided in some way.
All that said, this is a rather poor example, because the query is much better written using window functions:
select o.*, max(o.order_date) over (partition by o.customer_id) as last_order_date
from orders o;
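A small sketch of both points, assuming SQLite (3.25 or later) through Python's sqlite3 module; the `orders` rows are made up, and the join is only a reconstruction of the kind of query described in the question, not the article's actual code. In SQLite, `SELECT *` over the join does return both customer_id columns, while the window-function version has no second copy to begin with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 10, '2024-01-05'), (2, 10, '2024-02-01'), (3, 20, '2024-01-20');
""")

# Join of the table with an aggregated subquery on the same table.
cur = conn.execute("""
    SELECT *
    FROM orders
    JOIN (
        SELECT customer_id, MAX(order_date) AS last_order_date
        FROM orders
        GROUP BY customer_id
    ) AS last_orders ON last_orders.customer_id = orders.customer_id
""")
join_columns = [d[0] for d in cur.description]
print(join_columns)  # customer_id appears once per side of the join

# Window-function version: one pass over orders, no join, no duplicate column.
cur = conn.execute("""
    SELECT o.*, MAX(o.order_date) OVER (PARTITION BY o.customer_id) AS last_order_date
    FROM orders o
    ORDER BY o.id
""")
window_columns = [d[0] for d in cur.description]
window_rows = cur.fetchall()
print(window_columns)
print(window_rows)
```

So the join itself does not merge anything; if only one customer_id shows up, something downstream (the select list or the client library) dropped the other.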

Sorting Twice with an SQL Query

So, say I have a table of entries which have a product name, a user, and the product's pricing.
My problem is that I want to obtain a result set that groups the products bought by a single user together, and then sorts those products lexicographically.
So, something where every product bought by a user whose name starts with an A is grouped in its own little block, with each product also appearing in alphabetical order (Candy before Cat food, for example), followed by the block for a user whose name starts with P.
Can someone explain how I might begin to do this?
An SQL query returns a table of rows and columns. You can have one column for the client and another for the product, and sort by client and then by product within each client (ORDER BY client, product). You don't get different "blocks" of data.
If you want this more beautiful, you need some software to create a report (i.e. data with a layout) based on the query.
What you can do with SQL, though, is suppress data, such as:
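select
  case when client = lag(client) over (order by client, product) then null else client end
    as client,
  product
from bought
-- qualify the sort columns: a bare "client" here would refer to the
-- output alias above, which is null on the suppressed rows
order by bought.client, bought.product;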
Sample result:
client | product
-------+--------
Elsa   | mug
       | plate
Max    | cup
       | plate
       | saucer
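Here is a runnable sketch of the suppression trick, assuming SQLite (3.25 or later) through Python's sqlite3 module and a made-up `bought` table. The final ORDER BY uses table-qualified column names so that it sorts on the real client value and not on the output column, which is NULL on the suppressed rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bought (client TEXT, product TEXT);
    INSERT INTO bought VALUES
        ('Max', 'saucer'), ('Elsa', 'plate'), ('Max', 'cup'),
        ('Elsa', 'mug'), ('Max', 'plate');
""")

rows = conn.execute("""
    SELECT CASE WHEN client = LAG(client) OVER (ORDER BY client, product)
                THEN NULL ELSE client END AS client,
           product
    FROM bought
    ORDER BY bought.client, bought.product
""").fetchall()

for client, product in rows:
    # NULL (None) means "same client as the row above".
    print(f"{client or '':<6} | {product}")
```

Each client name is returned only on the first row of its block; the application still has to render the blanks.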

Simple database for product order

I want to make a simple database for ordering products such as chips or drinks (each product is a whole item with no specific details, just a name and a unit price).
I designed this but I'm not sure if it's good:
**Person:**
id
username
name
password
phone
email
street
zip
**order:**
id
person_id
product_id
date
quantity (necessary?)
status (done or not)
**product:**
id
name
price
relations:
[person] 1 --- 1 [order] 1 --- many [product]
(I'm not sure about relations and fields)
It seems that with your design each order can contain only a single product (even if you use the quantity).
I would modify the Order table:
**order:**
id
person_id
date
status (done or not)
And I would add a new table:
**OrderDetails**
id
order_id
product_id
quantity
You may want to read up on database normalization. You should only add columns to a table that are directly related to what that table describes. For instance, date in the order is valid, because it refers to when the order was made. On the other hand, it wouldn't be valid in the person table (unless it referred to the date the person joined). Similarly, the quantity refers to the product within the order (thus it belongs in OrderDetails), not to the Order or the Product.
You will probably need an intermediate table between order and product, so that the same order can be linked to several different products.
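A minimal sketch of the four-table layout, assuming SQLite through Python's sqlite3 module. The snake_case table name `order_details`, the integer `price_cents` column (chosen here to avoid floating-point prices), and the sample data are illustrative assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price_cents INTEGER NOT NULL
    );
    -- "order" is a reserved word, so it has to be quoted
    CREATE TABLE "order" (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(id),
        date TEXT NOT NULL,
        status TEXT NOT NULL
    );
    -- one row per product in an order; quantity lives here
    CREATE TABLE order_details (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES "order"(id),
        product_id INTEGER NOT NULL REFERENCES product(id),
        quantity INTEGER NOT NULL
    );
    INSERT INTO person VALUES (1, 'alice');
    INSERT INTO product VALUES (1, 'chips', 150), (2, 'cola', 200);
    INSERT INTO "order" VALUES (1, 1, '2024-03-01', 'open');
    INSERT INTO order_details VALUES (1, 1, 1, 2), (2, 1, 2, 3);
""")

# One order, two products: 2 x chips + 3 x cola.
order_total_cents = conn.execute("""
    SELECT SUM(d.quantity * p.price_cents)
    FROM order_details d
    JOIN product p ON p.id = d.product_id
    WHERE d.order_id = 1
""").fetchone()[0]
print(order_total_cents)
```

The relations then read: one person has many orders, one order has many detail rows, and each detail row points at exactly one product.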

find the number of rows based on column mapping in sql

I am completely new to SQL and trying to learn, but I'm stuck on something. Say I have a table with two columns, customername and customer address. Multiple customers can be mapped to the same address. How can I retrieve the address with the most customers?
This can be done using grouping (to get the counts), ordering (descending) and limiting (to get the top row). In MySQL for instance, it might look like this:
SELECT customer_address, COUNT(DISTINCT customername) AS number_of_customers
FROM your_table
GROUP BY customer_address
ORDER BY number_of_customers DESC
LIMIT 1;
This will yield something like:
+------------------+---------------------+
| customer_address | number_of_customers |
+------------------+---------------------+
| foo | 42 |
+------------------+---------------------+
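The same query can be tried out with a few made-up rows; this sketch assumes SQLite through Python's sqlite3 module, which accepts the same LIMIT syntax as MySQL (SQL Server would use `SELECT TOP (1)` instead). The table name `customers` and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customername TEXT, customer_address TEXT);
    INSERT INTO customers VALUES
        ('Ann', '1 Main St'), ('Bob', '1 Main St'),
        ('Cy', '2 Oak Ave'), ('Ann', '1 Main St');
""")

# Count distinct customers per address, largest count first, keep the top row.
top = conn.execute("""
    SELECT customer_address, COUNT(DISTINCT customername) AS number_of_customers
    FROM customers
    GROUP BY customer_address
    ORDER BY number_of_customers DESC
    LIMIT 1
""").fetchone()
print(top)
```

Note that when two addresses tie for the highest count, LIMIT 1 returns only one of them, and which one is not defined unless you add a tie-breaker to the ORDER BY.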

Return all Fields and Distinct Rows

What's the best way to do this when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, i.e. address1, city etc., but not include them in the DISTINCT row check.
Then you have to decide what to do when multiple rows share the same value in the column you want the distinct check applied to, but have different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, just write a GROUP BY on the distinct column, with MIN() or MAX() on all the other ones.
EDIT: I agree with the comments from others that as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column attribute in the table is independent (e.g., addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will come from the same address. You are still getting a more or less random selection of one of the multiple addresses, though.
This will not mix and match your city, state, etc., and should even give you the last one added (assuming id increases with each insert):
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the distinct addresses for that name. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
If you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, MAX(Address1), MAX(City)
FROM table
GROUP BY Name, Address
Use @JBrooks's answer below. He has a better answer.
If you're using SQL Server 2005 or above you can use the ROW_NUMBER() function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them to the PARTITION BY section of ROW_NUMBER().
SELECT id, Name, Address, ...
FROM (SELECT id, Name, Address, ...,
             ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
      FROM table) sub
WHERE RowNo = 1
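For comparison, here is a small sketch that runs the ROW_NUMBER() approach next to the MAX(id) self-join from the earlier answer. It assumes SQLite (3.25 or later) through Python's sqlite3 module; the `people` table and its rows are hypothetical and reuse the abc/123 example above. Both approaches return whole rows, so city always matches its own name and address; they differ only in which row they pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
    INSERT INTO people VALUES
        (1, 'abc', '123', 'def'),
        (2, 'abc', '123', 'hij'),
        (3, 'xyz', '456', 'klm');
""")

# ROW_NUMBER(): keep the row with the lowest id for each name.
first_rows = conn.execute("""
    SELECT id, name, address, city
    FROM (SELECT id, name, address, city,
                 ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS row_no
          FROM people) sub
    WHERE row_no = 1
    ORDER BY id
""").fetchall()

# MAX(id) self-join: keep the row with the highest id for each name/address pair.
last_rows = conn.execute("""
    SELECT b.*
    FROM (SELECT MAX(id) AS id
          FROM people
          GROUP BY name, address) AS a
    JOIN people b ON a.id = b.id
    ORDER BY b.id
""").fetchall()

print(first_rows)
print(last_rows)
```

Switching the window's ORDER BY to `id DESC` would make the ROW_NUMBER() version pick the same rows as the MAX(id) join.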