Does a join combine two columns with the same name and data? - sql

I'm reading this article by Miguel Grinberg, and on 'The Join' part I'm a bit confused by the result.
To sum up the part I'm concerned with: he joined a query and a subquery on the SAME table, on the condition that their customer_id values are the same.
Query selected: id, customer_id, order_date
Subquery selected: customer_id, max(order_date) AS last_order_date
When he joined them I was expecting something like:
id | customer_id | order_date | customer_id | last_order_date
--------------------------------------------------------------
But his result was:
id | customer_id | order_date | last_order_date
-----------------------------------------------
Where is the other customer_id that was selected from the subquery?
With that I would like to confirm whether my understanding is correct: does a JOIN also combine two COLUMNS if they have the same NAME and VALUE?

The fact that the article uses select * when it should be using select orders.*, last_orders.last_order_date already makes me suspicious of anything else in the article.
Most databases would run the query and return two columns named customer_id -- as you suggest should happen. However, there is then a problem accessing both of those columns in an application, because they have the same name. So the columns might be elided in some way.
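A minimal sketch of how both customer_id columns could stay visible, assuming the orders table and the columns described in the question (the lo alias and the explicit column aliases are made up for illustration):
-- Sketch only: aliasing both customer_id columns keeps the duplicate visible to the application.
SELECT o.id,
       o.customer_id  AS order_customer_id,
       o.order_date,
       lo.customer_id AS last_order_customer_id,
       lo.last_order_date
FROM orders o
JOIN (
    SELECT customer_id, MAX(order_date) AS last_order_date
    FROM orders
    GROUP BY customer_id
) lo ON o.customer_id = lo.customer_id;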
All that said, this is a rather poor example, because the query is much better written using window functions:
select o.*, max(order_date) over (partition by customer_id) as last_order_date
from orders o;

Related

"Joining" 2 different selects on the same table?

I have a table of orders with products; each product has its own shipping date. How can I retrieve the orders so that each row shows the earliest shipping date for its order?
For example:
Order  Product  Ship date
1      phone    02/03/2019
1      charger  02/07/2019
2      printer  03/01/2019
What would be the sql query to retrieve the following?
Order  Product  Ship date
1      phone    02/03/2019
1      charger  02/03/2019
2      printer  03/01/2019
I.e. on order 1, all ship dates are 02/03/2019 since that is the earliest.
I tried this:
SELECT order,
product,
(SELECT ship_date FROM Tracking ORDER BY ship_date ASC) tbl ON tbl.order = t.order
FROM Tracking t
But I'm getting the error:
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.
Considering the error message, I believe this is SQL Server, and thus window functions are available.
You could use the windowed version of min() to get the minimum shipping date for an order.
SELECT [order],
       [product],
       MIN([ship date]) OVER (PARTITION BY [order]) AS [ship date]
FROM tracking;
To get rid of the "The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." message, you can add "SELECT TOP 100 PERCENT" to the subquery.
But I'd suggest looking into the RANK and DENSE_RANK functions, as they're probably going to be more helpful.
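For reference, a sketch of how the question's subquery-join idea could be completed without hitting that error, by aggregating instead of ordering inside the subquery (table and column names taken from the question; [order] and [ship date] are bracketed because one is a reserved word and the other contains a space):
-- Sketch: join each row to the per-order minimum ship date.
SELECT t.[order],
       t.[product],
       m.[ship date]
FROM Tracking t
JOIN (
    SELECT [order], MIN([ship date]) AS [ship date]
    FROM Tracking
    GROUP BY [order]
) m ON m.[order] = t.[order];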
I am not sure that I understand what you are trying to achieve here. If you just want to query this table ordered by shipping date, I think something like this would work:
SELECT * FROM Tracking ORDER BY ship_date ASC;
You are making it very complicated. The query is pretty simple:
select * from my_table group by shipdate order by shipdate asc

PostgreSQL ON vs WHERE when joining tables?

I have 2 tables, customer and coupons. A customer may or may not have a reward_id assigned, so it's a nullable column. A customer can have many coupons and a coupon belongs to a customer.
+-------------+------------+
| coupons     | customers  |
+-------------+------------+
| id          | id         |
| customer_id | first_name |
| code        | reward_id  |
+-------------+------------+
The customer_id column is indexed.
I would like to make a join between the 2 tables.
My attempt is:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id and cust.reward_id is not null
However, I think there isn't an index on reward_id, so I should move cust.reward_id is not null into the where clause:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id
where cust.reward_id is not null
I wonder if the second attempt would be more efficient than the first attempt.
It would be better if you looked at the execution plan yourself. Add EXPLAIN ANALYZE before your select statement and execute both versions to see the differences.
Here's how:
EXPLAIN ANALYZE select ...
What does it do? It actually executes the select statement and gives you back the execution plan that was chosen by the query optimizer. Without the ANALYZE keyword it would only estimate the execution plan without actually executing the statement in the background.
The database won't use two indexes at one time, so having an index on customer(id) will make it unable to use an index on customer(reward_id). This condition will actually be treated as a filter condition, which is correct behaviour.
You could experiment with the performance of a partial index created as: customer(id) where reward_id is not null. This would decrease the index size, as it would only store those customer ids for which there is a reward_id assigned.
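A sketch of such a partial index (the index name is made up):
-- Hypothetical index name; stores only the ids of customers that have a reward_id.
CREATE INDEX customer_id_with_reward_idx
    ON customer (id)
    WHERE reward_id IS NOT NULL;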
I generally like to split the relationship/join logic from the filter conditions, and I put the latter in the WHERE clause because they are more visible there and easier to read in the future if there are any more changes.
I suggest you see for yourself what the possible performance gain is, because it depends on how much data there is and on the possibly low cardinality of reward_id. For example, if most rows have this column filled with a value, it wouldn't make much of a difference, as the index size (normal vs partial) would be almost the same.
In a PostgreSQL inner join, whether a filter condition is placed in the ON clause or the WHERE clause does not impact the query result or performance.
Here is a guide that explores this topic in more detail: https://app.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins

find the number of rows based on column mapping in sql

I am completely new to SQL. I am trying to learn things in SQL. Just stuck on something. Say I have a table with two columns, customer name and customer address. Multiple customers can be mapped to the same address. How can I retrieve the address with the maximum number of customers?
This can be done using grouping (to get the counts), ordering (descending) and limiting (to get the top row). In MySQL for instance, it might look like this:
SELECT customer_address, COUNT(DISTINCT customer_name) AS number_of_customers
FROM your_table
GROUP BY customer_address
ORDER BY number_of_customers DESC
LIMIT 1;
This will yield something like:
+------------------+---------------------+
| customer_address | number_of_customers |
+------------------+---------------------+
| foo              |                  42 |
+------------------+---------------------+
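If the database happens to be SQL Server rather than MySQL (the question doesn't say which), the same idea works with TOP 1 instead of LIMIT:
-- Same grouping idea, SQL Server flavour (column and table names as assumed above).
SELECT TOP 1 customer_address, COUNT(DISTINCT customer_name) AS number_of_customers
FROM your_table
GROUP BY customer_address
ORDER BY number_of_customers DESC;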

Returning data from a single Child record (sorted by date) with Parent data also

As a SQL noob I have what I am assuming is a basic question about one-to-many child records.
I have an order table and an Order_Status child table.
Order table
ID Order_Number Status Order_Date etc.
Order_Status table
StatusTo StatusFrom Order_ID StatusChange_Date
The child table can have many entries for the status changing on a single parent order.
How do I pull back the following information as a single record with the child table's (os) most recent record for that parent (p)? (p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date).
I need to know because I am concerned that the final os.StatusTo does not match p.Status.
Thanks in advance!
Steve
You can join onto a subquery which gets the most recent order status,
e.g.
SELECT p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date
FROM ORDER p
LEFT JOIN (
    SELECT StatusTo, Order_ID, MAX(StatusChange_Date) AS StatusChange_Date
    FROM Order_Status
    GROUP BY StatusTo, Order_ID
) os ON os.Order_ID = p.ID
I believe this should work, assuming that you only care about those orders that have changes, and where the change is different from what is recorded (should be trivial to modify).
WITH Most_Recent_Change (order_id, statusTo, changedAt, rownum) AS
    (SELECT order_id, statusTo, statusChange_date,
            ROW_NUMBER() OVER (PARTITION BY order_id
                               ORDER BY statusChange_date DESC)
     FROM Order_Status)
SELECT Order.order_number, Order.status, Order.order_date,
       Most_Recent_Change.statusTo, Most_Recent_Change.changedAt
FROM Order
JOIN Most_Recent_Change
  ON Most_Recent_Change.order_id = Order.id
 AND Most_Recent_Change.rownum = 1
 AND Most_Recent_Change.statusTo <> Order.status
(would have an SQLFiddle example, but it's acting weird at the moment)
Please note you should be careful about the isolation level you run this at, as otherwise you may get false positives from rows being concurrently updated.
Other notes:
Don't use reserved words (like ORDER) for identifiers. It's just a hassle in general
Don't suffix columns with their datatypes, especially if those types may change in the future. I'm aware that order_date isn't being strictly named in this fashion, but it's dangerously close. It should probably be something like orderedOn (if of a strict 'solar day' type) or the better orderedAt (timestamp, in UTC or with timezone).

SQL Server: How do I maintain data integrity using aggregate functions with group by?

Here's my question: how do I maintain record integrity using aggregate functions with a group by?
To explain further, here's an example.
I have a table with the following columns: (Think of it as an "order" table)
Customer_Summary (first 10 char of name + first 10 char of address)
Customer_Name
Customer_Address
Customer_Postal Code
Order_weekday
There is one row per "order", so many rows with the same customer name, address, and summary.
What I want to do is show the customer's name, address, and postal code, as well as the number of orders they've placed on each weekday, grouped by the customer's summary.
So the data should look like:
Summary              | Name          | Address      | PCode  | Monday | Tuesday | Wednesday | Thursday | Friday
test custntest addre | test custname | test address | 123456 | 1      | 1       | 1         | 1        | 1
I only want to group records with the same customer summary together, but obviously I want one name, address, and postal code to show. I'm using min() at the moment, so my query looks like:
SELECT Customer_Summary, min(customer_name), min(customer_address), min(customer_postal_code)
FROM Order
GROUP BY Customer_Summary
I've omitted my weekday logic as I didn't think it was necessary.
My issue is this - some of these customers with the same customer summary have different addresses and postal codes.
So I might have two customers, looking like:
test custntest addre|test custname |test address |323456|
test custntest addre|test custname2|test address2|123456|
Using the group by, my query will return the following:
test custntest addre|test custname |test address |123456|
Since I'm using min, it's going to give me the minimum value for all of the fields, but not necessarily from the same record. So I've lost my record integrity here - the address and name returned by the query do not correctly match the postal code.
So how do I maintain data integrity on non-grouped fields when using a group by clause?
Hopefully I explained it clearly enough, and thanks in advance for the help.
EDIT: Solved. Thanks everyone!
You can always use ROW_NUMBER instead of GROUP BY
WITH A AS (
SELECT Customer_Summary, customer_name, customer_address, customer_postal_code,
ROW_NUMBER() OVER (PARTITION BY Customer_Summary ORDER BY customer_name, customer_address) AS rn
FROM Order
)
SELECT Customer_Summary, customer_name, customer_address, customer_postal_code
FROM A
WHERE rn = 1
Then you are free to choose which customer to use via the ORDER BY clause. Currently I am ordering them by name and then address.
Edit:
My solution does what you asked for. But I surely agree with the others: if you are allowed to change the database structure, that would be a good idea... which you are not (I saw your comment). Well, then ROW_NUMBER() is a good way.
I think you need to re-think your structure.
Ideally you would have a Customer table with a unique ID. Then you would use that unique ID in the Order table. Then you don't need the strange "first 10 characters" method that you are using. Instead, you just group by the unique ID from the Customer table.
You could then also have a separate table for addresses, relating each address to the customer, with multiple rows (with fields marking them as home address, delivery address, billing address, etc.).
This way you separate the Customer information from the Address information and from the Order information, such that if the customer changes name (marriage) or address (moving home) you don't break your data - everything is related by the IDs, not the data itself.
[This relationship is known as a Foreign Key.]
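A rough sketch of what that normalized structure could look like in SQL Server (all table names, column names, and types here are illustrative only, not a prescription):
-- Illustrative only: customers, their addresses, and orders related by IDs.
CREATE TABLE Customer (
    id   INT IDENTITY PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE CustomerAddress (
    id           INT IDENTITY PRIMARY KEY,
    customer_id  INT NOT NULL REFERENCES Customer(id),  -- foreign key to Customer
    address      VARCHAR(200) NOT NULL,
    postal_code  VARCHAR(20) NOT NULL,
    address_type VARCHAR(20) NOT NULL                   -- e.g. home, delivery, billing
);

CREATE TABLE CustomerOrder (
    id          INT IDENTITY PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES Customer(id),   -- foreign key to Customer
    ordered_at  DATETIME2 NOT NULL
);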