More efficient way to query shortest string value associated with each value in another column in Hive QL - sql

I have a table in Hive containing store names, order IDs, and User IDs (as well as some other columns including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.
I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.
So, for example, if the data looks like this:
STORE | ORDERID | USERID | ITEMID
------+---------+--------+-------
  a   |    1    | bill   |  abc
  a   |    1    | susan  |  def
  a   |    2    | jane   |  abc
  b   |    1    | scott  |  ghi
  b   |    1    | tony   |  jkl
Then the output would look like this:
STORE | ORDERID | USERID
------+---------+-------
  a   |    1    | bill
  a   |    2    | jane
  b   |    1    | tony
I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?
This is what I have so far:
select
    users.store, users.orderid, users.userid
from
    (select store, orderid, userid, length(userid) as len
     from sales) users
join
    (select distinct store, orderid,
            min(length(userid)) over (partition by store, orderid) as len
     from sales) len
    on users.store = len.store
    and users.orderid = len.orderid
    and users.len = len.len

Check this out; it will probably work for you. It achieves your goal in a single "SELECT" clause, with no extra overhead from a self-join:
select distinct
store, orderid,
first_value(userid) over(partition by store, orderid order by length(userid) asc) f_val
from
sales;
The result will be:
store orderid f_val
a 1 bill
a 2 jane
b 1 tony
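One caveat not covered in the answer above: if two user IDs within an order tie on length, first_value picks between them arbitrarily. A minimal variant that pins the result down, assuming a lexicographic tiebreak is acceptable:
select distinct
    store, orderid,
    first_value(userid) over (partition by store, orderid
                              order by length(userid) asc, userid asc) f_val
from sales;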

Probably rank() is the best way:
select s.*
from (select s.*,
             rank() over (partition by store, orderid
                          order by length(userid)) as seqnum
      from sales s
     ) s
where seqnum = 1;
Note that rank() returns every user ID tied for the shortest length within an order; use row_number() instead if you need exactly one row per store/order pair.

Related

PostgreSQL return multiple rows with DISTINCT though only latest date per second column

Let's say I have the following database table (date truncated for this example only; the two 'id_'-prefixed columns join with other tables)...
+-----------+---------+------+--------------------+-------+
| id_table1 | id_tab2 | date | description | price |
+-----------+---------+------+--------------------+-------+
| 1 | 11 | 2014 | man-eating-waffles | 1.46 |
+-----------+---------+------+--------------------+-------+
| 2 | 22 | 2014 | Flying Shoes | 8.99 |
+-----------+---------+------+--------------------+-------+
| 3 | 44 | 2015 | Flying Shoes | 12.99 |
+-----------+---------+------+--------------------+-------+
...and I have a query like the following...
SELECT id, date, description FROM inventory ORDER BY date ASC;
How do I SELECT all the descriptions, but only once each, keeping only the latest year for each description? So I need the database query to return the first and last rows from the sample data above; the second is not returned because the last row has the same description with a later date.
Postgres has something called distinct on. This is usually more efficient than using window functions. So, an alternative method would be:
SELECT distinct on (description) id, date, description
FROM inventory
ORDER BY description, date desc;
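Run against the sample rows above (assuming the id column maps to id_table1), this keeps exactly one row per description with the latest year:
 id | date | description
----+------+--------------------
  3 | 2015 | Flying Shoes
  1 | 2014 | man-eating-waffles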
The row_number window function should do the trick:
SELECT id, date, description
FROM (SELECT id, date, description,
             ROW_NUMBER() OVER (PARTITION BY description
                                ORDER BY date DESC) AS rn
      FROM inventory) t
WHERE rn = 1
ORDER BY date ASC;

Grouping by two values in same table

I have a table of the format
Ship_type | userid | Message
None of these columns is unique.
I want to count how many (unique) user id's that belong to each ship type, and thus find out which ship type is the most popular.
Example:
Ship_type     | userid | Message
--------------+--------+---------------
Sailboat      | 34241  | hello
Sailboat      | 34241  | hi
Sailboat      | 34241  | I'm on a boat!
Fishingvessel | 31245  | yo
Fishingvessel | 98435  | hi there
Here we see that there are two distinct users with Fishingvessel and one with Sailboat.
If I do the following query:
select ship_type, count(ship_type) FROM db1.MessageType5 GROUP BY ship_type ORDER BY count(ship_type) ASC;
I get
Sailboat | 3
Fishingvessel | 2
which is wrong - as it counts the number of messages belonging to each ship_type.
Desired result:
Fishingvessel | 2
Sailboat | 1
You have to COUNT DISTINCT user ids (and ORDER BY ... DESC if you want the provided result):
SELECT ship_type, COUNT(DISTINCT userid) as cnt
FROM db1.MessageType5
GROUP BY ship_type
ORDER BY cnt DESC

SQL SELECT only rows where a max value is present, and the corresponding ID from another linked table

I have a simple Parts database which I'd like to use for calculating costs of assemblies, and I need to keep a cost history, so that I can update the costs for parts without the update affecting historic data.
So far I have the info stored in 2 tables:
tblPart:
PartID | PartName
1 | Foo
2 | Bar
3 | Foobar
tblPartCostHistory
PartCostHistoryID | PartID | Revision | Cost
1 | 1 | 1 | £1.00
2 | 1 | 2 | £1.20
3 | 2 | 1 | £3.00
4 | 3 | 1 | £2.20
5 | 3 | 2 | £2.05
What I want to end up with is just the PartID for each part, and the PartCostHistoryID where the revision number is highest, so this:
PartID | PartCostHistoryID
1 | 2
2 | 3
3 | 5
I've had a look at some of the other threads on here and I can't quite get it. I can manage to get the PartID along with the highest Revision number, but if I try to then do anything with the PartCostHistoryID I end up with multiple PartCostHistoryIDs per part.
I'm using MS Access 2007.
Many thanks.
Mihai's (very concise) answer will work assuming that the order of both [PartCostHistoryID] and [Revision] for each [PartID] is always ascending.
A solution that does not rely on that assumption would be
SELECT
    tblPartCostHistory.PartID,
    tblPartCostHistory.PartCostHistoryID
FROM
    tblPartCostHistory
    INNER JOIN
    (
        SELECT PartID, MAX(Revision) AS MaxOfRevision
        FROM tblPartCostHistory
        GROUP BY PartID
    ) AS maxRev
    ON maxRev.PartID = tblPartCostHistory.PartID
    AND maxRev.MaxOfRevision = tblPartCostHistory.Revision
SELECT PartID, MAX(PartCostHistoryID) FROM tblPartCostHistory GROUP BY PartID
Here is a query (note that it relies on MySQL's nonstandard handling of non-aggregated columns with GROUP BY, so it is not portable):
select PartCostHistoryId, PartId
from tblCost
where PartCostHistoryId in
    (select PartCostHistoryId
     from (select * from tblCost order by Revision desc) as tbl1
     group by PartId)
Here is SQL Fiddle http://sqlfiddle.com/#!2/19c2d/12

How to find every customer's favourite category with a query

I have a table in MS Access which looks basically like this:
Table Name : Customer_Categories
+----------------------+------------+-------+
| Email | CategoryID | Count |
+----------------------+------------+-------+
| jim#example.com | 10 | 4 |
+----------------------+------------+-------+
| jim#example.com | 2 | 1 |
+----------------------+------------+-------+
| simon#example.com | 5 | 2 |
+----------------------+------------+-------+
| steven#example.com | 10 | 16 |
+----------------------+------------+-------+
| steven#example.com | 5 | 3 |
+----------------------+------------+-------+
In this table there are ≈ 350,000 records. The characteristics are these:
Duplicate values for Email, CategoryID and Count
Count refers to the number of times this customer has ordered from this category
What I want
I want to create a table that consists of a unique email address along with the CategoryID this customer has purchased from the most.
So the above example would be:
+----------------------+------------+
| Email | CategoryID |
+----------------------+------------+
| jim#example.com | 10 |
+----------------------+------------+
| simon#example.com | 5 |
+----------------------+------------+
| steven#example.com | 10 |
+----------------------+------------+
What I have tried
I have written a query that achieves what I want:
SELECT main.Email,
       (SELECT TOP 1 CategoryID
        FROM Customer_Categories
        WHERE main.Email = Email
        GROUP BY CategoryID
        ORDER BY MAX(Count) DESC, CategoryID ASC) AS Category
FROM Customer_Categories AS main
GROUP BY main.Email;
This works a treat and does exactly what I want, returning results in around 8 seconds. However, I need this data in a new table because I then want to update another table with the CategoryID. When I add INTO Customer_Favourite_Categories after the sub-query (to write the results to a new table rather than just return them) and run the query, it never finishes. I've left it running for about 45 minutes and it does nothing.
Is there any way around this?
If select into doesn't work, use insert into:
create table Customer_Favorite_Categories (
    email <email type>,
    FavoriteCategory <CategoryId type>
);
insert into Customer_Favorite_Categories
SELECT main.Email,
       (SELECT TOP 1 CategoryID
        FROM Customer_Categories
        WHERE main.Email = Email
        GROUP BY CategoryID
        ORDER BY MAX(Count) DESC, CategoryID ASC) AS Category
FROM Customer_Categories AS main
GROUP BY main.Email;
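For concreteness, a minimal sketch of that create table with assumed Access types (TEXT for the email, LONG for the category ID; adjust to match your source columns):
create table Customer_Favorite_Categories (
    email TEXT(255),
    FavoriteCategory LONG
);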
Try this:
SELECT Distinct(Email),Max(CategoryID )
FROM Customer_Categories group by Email
I use sub-queries for this quite frequently. Your query in "What I have tried" is close, but just a little off in syntax. Something like the following should get what you are after. Count is in square-brackets since it's a reserved word in SQL. The spacing I use in my SQL is conventional, so edit to your liking.
SELECT m.Email,
       m.CategoryID
FROM MyTable AS m,
     (
       SELECT Email,
              MAX( [Count] ) AS mc
       FROM MyTable
       GROUP BY Email
     ) AS f
WHERE m.Email = f.Email
  AND m.[Count] = f.mc;
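One caveat, not in the original answer: if a customer has two categories tied at the same maximum [Count], the join above returns a row for each. A sketch that breaks such ties by taking the lowest CategoryID (assumption: that tiebreak is acceptable, as in the asker's own ORDER BY):
SELECT m.Email,
       MIN( m.CategoryID ) AS CategoryID
FROM MyTable AS m,
     (
       SELECT Email,
              MAX( [Count] ) AS mc
       FROM MyTable
       GROUP BY Email
     ) AS f
WHERE m.Email = f.Email
  AND m.[Count] = f.mc
GROUP BY m.Email;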

Remove redundant SQL price cost records

I have a table costhistory with fields id, invid, vendorid, cost, timestamp, chdeleted. It looks like it was populated by a trigger every time a vendor updated their list of prices.
It has redundant records - since it was populated regardless of whether price changed or not since last record.
Example:
id | invid | vendorid | cost | timestamp | chdeleted
1 | 123 | 1 | 100 | 1/1/01 | 0
2 | 123 | 1 | 100 | 1/2/01 | 0
3 | 123 | 1 | 100 | 1/3/01 | 0
4 | 123 | 1 | 500 | 1/4/01 | 0
5 | 123 | 1 | 500 | 1/5/01 | 0
6 | 123 | 1 | 100 | 1/6/01 | 0
I would want to remove records with ID 2,3,5 since they do not reflect any change since the last price update.
I'm sure it can be done, though it might take several steps.
Just to be clear, this table has swelled to 100gb and contains 600M rows. I am confident that a proper cleanup will take this table's size down by 90% - 95%.
Thanks!
The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records you want to remove:
select id
from (select id,
             rank() over (partition by invid, vendorid, cost
                          order by timestamp) as rnk
      from costhistory
     ) tmp
where rnk > 1
You can then delete them like this:
delete from costhistory
where id in (
    select id
    from (select id,
                 rank() over (partition by invid, vendorid, cost
                              order by timestamp) as rnk
          from costhistory
         ) tmp
    where rnk > 1
)
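One caveat, since the partition is by invid, vendorid, and cost alone: a price that later returns to an earlier value (id 6 dropping back to 100 in the example) also gets a rank above 1 and would be deleted, even though it does reflect a change. A sketch of an alternative that only flags rows whose cost matches the immediately preceding row (assumption: SQL Server 2012+, since it uses lag()):
select id
from (select id, cost,
             lag(cost) over (partition by invid, vendorid
                             order by timestamp) as prev_cost
      from costhistory
     ) t
where cost = prev_cost;
Against the sample data this flags exactly ids 2, 3, and 5.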
I would suggest that you recreate the table using a group by query. Also, I assume that the "id" column is not used in any other tables; if it is, then you need to fix those tables as well.
Deleting such a large quantity of records is likely to take a long, long time.
The query would look like:
insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
select invid, vendorid, cost, timestamp, chdeleted
from costhistory
group by invid, vendorid, cost, timestamp, chdeleted
If you do opt for a delete, I would suggest:
(1) Fix the code first, so no duplicates are going in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches (see the sketch at the end of this answer).
To find the duplicate ids, use something like:
select *
from (select id,
             row_number() over (partition by invid, vendorid, cost, chdeleted
                                order by timestamp) as seqnum
      from costhistory
     ) t
where seqnum > 1
If you want to keep the most recent version instead, then use "timestamp desc" in the order by clause.
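For step (3), one possible shape for the batched delete (assumptions: SQL Server, and that the duplicate ids from the query above were staged into a work table here called dup_ids):
-- dup_ids(id) is an assumed staging table holding the ids flagged above
declare @rows int;
set @rows = 1;
while @rows > 0
begin
    -- delete in chunks to keep locks and the transaction log manageable
    delete top (10000) ch
    from costhistory ch
    inner join dup_ids d on d.id = ch.id;
    set @rows = @@rowcount;
end;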