Count unique entries for ts_stat count in full text search - sql

I'm struggling with using ts_stat to get the number of unique occurrences of tags in a table, sorted by the highest count.
What I need, though, is to count each entry only one time, so that only unique entries are counted. I tried GROUP BY and DISTINCT, but nothing is working for me.
e.g. table
user_id | tags | post_date
===================================
2 | dog cat | 1580049400
2 | dog | 1580039400
3 | dog | 1580038400
3 | dog dog cat | 1580058400
4 | dog horse | 1580028400
Here is the current query
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('simple', tags) FROM tags WHERE post_date > 1580018400$$)
ORDER BY ndoc DESC
LIMIT 10;
Right now this will produce
word | ndoc | nentry
====================
dog | 5 | 6
cat | 2 | 2
horse| 1 | 1
The result I'm looking for is unique counts, so no single user can count more than once per word, even if they have more than one entry after the cutoff in the post_date condition (which might be irrelevant). Like below:
word | total_count_per_user
===========================
dog | 3 (because there are 3 unique users with this term)
cat | 2 (because there are 2 unique users with this term)
horse| 1 (because there is 1 unique user with this term)
UPDATE: I changed the column name to reflect the output. The point is that no matter how many times a user enters a word, it only counts once for that user. E.g. if a user in that scenario creates 100 entries with dog in the text, it will only count dog 1 time for that user, not 100 counts of dog.

You can use COUNT on a DISTINCT value, if I understand your point correctly. A sample query is below:
SELECT tags, COUNT(DISTINCT user_id)
FROM your_table
GROUP BY tags;
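Note that tags holds space-separated words, so GROUP BY tags treats 'dog cat' as a single tag. A sketch that splits the words first (assuming Postgres, with the table and column names from the question):
-- Split space-separated tags into one row per word, then count
-- distinct users per word
SELECT word, COUNT(DISTINCT user_id) AS unique_users
FROM tags t,
     regexp_split_to_table(t.tags, '\s+') AS word
WHERE t.post_date > 1580018400
GROUP BY word
ORDER BY unique_users DESC
LIMIT 10;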

I guess this one was tough. Just in case someone happens to have a similar requirement, I was able to get this to work. It seems odd to have to get totals with ts_stat and then filter again using DISTINCT, a cross join, etc. so that no matter how many times it finds a word, each user only counts once per word. I'm not sure how efficient it will be on a large data set, but it yields the expected results.
UPDATE: This works without using a CTE. Also, the cross join is the key to filtering on user_id.
SELECT t.word AS tag, count(DISTINCT h.user_id) AS posts
FROM ts_stat($$SELECT hashtagsearch FROM tagstable WHERE post_date > 1580018400$$) t
CROSS JOIN tagstable h
WHERE h.hashtagsearch @@ to_tsquery('simple', t.word)
GROUP BY t.word
HAVING count(DISTINCT h.user_id) > 1
ORDER BY posts DESC
LIMIT 10;
This answer helped quite a bit. https://stackoverflow.com/a/42704207/330987
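For anyone wanting to try this, a minimal setup matching the names used in the query (the schema is assumed from the column names; here hashtagsearch is simply backfilled with an UPDATE, though in practice a trigger or generated column would keep it in sync):
-- Assumed schema for the query above
CREATE TABLE tagstable (
    user_id       int,
    tags          text,
    post_date     bigint,
    hashtagsearch tsvector
);

INSERT INTO tagstable (user_id, tags, post_date) VALUES
    (2, 'dog cat',     1580049400),
    (2, 'dog',         1580039400),
    (3, 'dog',         1580038400),
    (3, 'dog dog cat', 1580058400),
    (4, 'dog horse',   1580028400);

-- Backfill the tsvector column
UPDATE tagstable SET hashtagsearch = to_tsvector('simple', tags);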

Why is INNER JOIN producing more records than original file?

I have two tables. Table A & Table B. Table A has 40516 rows, and records sales by seller_id. The first column in Table A is the seller_id that repeats every time a sale is made.
Example: Table A (40516 rows)
seller_id | item | cost
------------------------
1 | dog    | 5000
1 | cat    | 50
4 | lizard | 80
5 | bird   | 20
5 | fish   | 90
The seller_id is also present in Table B, and also contains the corresponding name of the seller.
Example: Table B (5851 rows)
seller_id | seller_name
-------------------------
1 | Dog and Cat World INC
4 | Reptile Love.com
5 | Ocean Dogs Inc
I want to join these two tables, but only display the seller name from Table B and all other columns from Table A. When I do this with an INNER JOIN I get 40864 rows (348 extra rows). Shouldn't the query produce only the original 40516 rows?
Also not sure if this matters, but the seller_id can contain several zeros before the number (e.g., 0000845, 0000549).
I've looked around on here and haven't really found an answer. I've tried LEFT and RIGHT joins and get the same results for one and way more results for the other.
SQL Code Example:
SELECT public.table_B.seller_name, *
FROM public.table_A
INNER JOIN public.table_B
        ON public.table_A.seller_id = public.table_B.seller_id;
Expected Results:
seller_name | seller_id | item | cost
------------------------------------------------
Dog and Cat World INC | 1 | dog    | 5000
Dog and Cat World INC | 1 | cat    | 50
Reptile Love.com      | 4 | lizard | 80
Ocean Dogs Inc        | 5 | bird   | 20
Ocean Dogs Inc        | 5 | fish   | 90
I expected the results to contain the same number of rows as Table A. Instead I got the names matched up plus an additional 348 rows...
Update:
I changed "unique_id" to "seller_id" in the question.
I guess I should have chosen a better name for unique_id in the original example. I didn't mean it to be unique in the sense of a key. It is just the seller's id that repeats every time there is a sale (in Table A). The seller's ID does repeat in Table A because it is supposed to. I simply want to pair up the seller IDs with the seller names.
Thanks again everyone for their help!
unique_id is already not correctly named in the first table, so there is no reason to assume it is unique in the second table either.
Run this query to find the duplicates:
select unique_id
from table_b
group by unique_id
having count(*) > 1;
You can fix the query using distinct on:
SELECT b.seller_name, a.*
FROM public.table_A a JOIN
     (SELECT DISTINCT ON (b.unique_id) b.*
      FROM public.table_B b
      ORDER BY b.unique_id
     ) b
     ON a.unique_id = b.unique_id;
In this case, you may get fewer records, if there are no matches. To fix that, use a LEFT JOIN.
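The LEFT JOIN variant (a sketch) keeps every table_A row even when there is no match in table_B:
SELECT b.seller_name, a.*
FROM public.table_A a LEFT JOIN
     (SELECT DISTINCT ON (b.unique_id) b.*
      FROM public.table_B b
      ORDER BY b.unique_id
     ) b
     ON a.unique_id = b.unique_id;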
Because the unique_id column is not actually unique.
Gordon Linoff was correct. The seller_id (formerly listed as unique_id) was indeed duplicated throughout the data set. I foolishly assumed otherwise. The seller_name had many duplicates too! In the end I had to use the CONCAT() function to join the seller_id with a second identifier to create a kind of composite key. After I did this the join worked as expected. Thanks everyone!
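For illustration, a sketch of that composite-key join; branch_code is a hypothetical stand-in for whatever second identifier disambiguates the duplicated seller_ids in the real data:
SELECT b.seller_name, a.*
FROM public.table_A a
JOIN public.table_B b
  ON CONCAT(a.seller_id, '-', a.branch_code) =
     CONCAT(b.seller_id, '-', b.branch_code);  -- branch_code is hypothetical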

SQL query to get latest user to update record

I have a postgres database that contains an audit log table which holds a historical log of updates to documents. It contains which document was updated, which field was updated, which user made the change, and when the change was made. Some sample data looks like this:
doc_id | user_id | created_date | field | old_value | new_value
--------+---------+------------------------+-------------+---------------+------------
A | 1 | 2018-07-30 15:43:44-05 | Title | | War and Piece
A | 2 | 2018-07-30 15:45:13-05 | Title | War and Piece | War and Peas
A | 1 | 2018-07-30 16:05:59-05 | Title | War and Peas | War and Peace
B | 1 | 2018-07-30 15:43:44-05 | Description | test 1 | test 2
B | 2 | 2018-07-30 17:45:44-05 | Description | test 2 | test 3
You can see that the Title of document A was changed three times, first by user 1 then by user 2, then again by user 1.
Basically I need to know which user was the last one to update a field on a particular document. So for example, I need to know that User 1 was the last user to update the Title field on document A. I don't really care what time it happened, just the document, field, and user.
So sample output would be something like this:
doc_id | field | user_id
--------+-------------+---------
A | Title | 1
B | Description | 2
Seems like it should be fairly straightforward query to write but I'm having some trouble with it. I would think that group by would be in order but the problem is that if I group by doc_id I lose the user data:
select doc_id, max(created_date)
from document_history
group by doc_id;
doc_id | max
--------+------------------------
B | 2018-07-30 17:45:44-05
A | 2018-07-30 16:05:59-05
I could join these results back to the document_history table, but I would need to do so based on the doc_id and timestamp, which doesn't seem quite right. If two people edited a document at the exact same time, I would get multiple rows back for that document and field. Maybe that's so unlikely I shouldn't worry about it, but still...
Any thoughts on a way to do this in a single query?
You want to filter the records, so think where, not group by:
select dh.*
from document_history dh
where dh.created_date = (select max(dh2.created_date)
                         from document_history dh2
                         where dh2.doc_id = dh.doc_id and dh2.field = dh.field
                        );
In most databases, this will have better performance than a group by, if you have an index on document_history(doc_id, field, created_date).
If your DBMS supports window functions (e.g. PostgreSQL, SQL Server; aka analytic function in Oracle) you could do something like this (SQLFiddle with Postgres, other systems might differ slightly in the syntax):
http://sqlfiddle.com/#!17/981af/4
SELECT DISTINCT
doc_id, field,
first_value(user_id) OVER (PARTITION BY doc_id, field ORDER BY created_date DESC) as last_user
FROM get_last_updated
first_value() OVER (... ORDER BY x DESC) orders each window partition descending and then takes the first value, which corresponds to your latest timestamp.
I added the DISTINCT to get your expected result. The window function just adds a new column to your SELECT result, with the same value everywhere within a partition. If you do not need the DISTINCT, remove it, and then you are able to work with the original data plus the newly derived information.
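For what it's worth, Postgres also has a non-standard DISTINCT ON shortcut for this keep-one-row-per-group pattern; a sketch against the document_history table from the question:
-- DISTINCT ON keeps the first row of each (doc_id, field) partition
-- according to the ORDER BY, i.e. the most recent change
SELECT DISTINCT ON (doc_id, field)
       doc_id, field, user_id
FROM document_history
ORDER BY doc_id, field, created_date DESC;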

Don't understand how queries to retrieve top n records from each group work

I had an issue where I was trying to get the top 'n' records from each group (day) of records in my database. After a bunch of digging I found some great answers, and they did in fact solve my problem.
How to select the first N rows of each group?
Get top n records for each group of grouped results
However, my noob-ness is preventing me from understanding exactly WHY these "counting" solutions work. If someone with better SQL knowledge can explain, that would be really great.
EDIT: here's more details
Let's say I had a table described below with this sample data. (To make things simpler, I have a column that kept track of the time of the next upcoming midnight, in order to group 'per day' better).
id | vote_time | time_of_midnight | name | votes_yay | votes_nay
------------------------------------------------------------------------
1 | a | b | Person p | 24 | 36
1 | a | b | Person q | 20 | 10
1 | a | b | Person r | 42 | 22
1 | c | d | Person p | 8 | 10
1 | c | d | Person s | 120 | 63
There can be tens or hundreds of "People" per day (b, d, ...)
id is some other column I needed in order to group by (you can think of it as an election id if that helps)
I'm trying to calculate the top 5 names that had the highest number of votes per day, in descending order. I was able to use the referenced articles to create a query that would give me the following results (on Oracle):
SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay + votes_nay) AS total_votes
FROM results a
WHERE id = 1
  AND (SELECT COUNT(*)
       FROM results b
       WHERE b.id = a.id
         AND b.time_of_midnight = a.time_of_midnight
         AND (a.votes_yay + a.votes_nay) >= (b.votes_yay + b.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;
name | time_of_midnight | votes_yay | votes_nay | total_votes
------------------------------------------------------------------------
Person s | d | 120 | 63 | 183
Person p | d | 8 | 10 | 18
Person r | b | 42 | 22 | 64
Person p | b | 24 | 36 | 60
Person q | b | 20 | 10 | 30
So I'm not really sure:
Why does this counting method work?
[stupid question]: Why don't I need to also include name in the inner query to make sure it doesn't join the data incorrectly?
Let's begin with the fact that your query is actually calculating the top 5 names that had the lowest number of votes. To get the top 5 with the highest number, you'll need to change this condition:
(a.votes_yay+a.votes_nay) >= (b.votes_yay+b.votes_nay)
into this:
(a.votes_yay+a.votes_nay) <= (b.votes_yay+b.votes_nay)
or, perhaps, this (which is the same):
(b.votes_yay+b.votes_nay) >= (a.votes_yay+a.votes_nay)
(The latter form would seem to me preferable, but merely because it would be uniform with the other two comparisons which have a b column on the left-hand side and an a column on the right-hand side. That is perfectly irrelevant to the correctness of the logic.)
Logically, what's happening is this. For every row in results, the server will look for rows in the same table that match the id and time_of_midnight of the given row and have a total vote count equal to or higher than the given row's. It will then count the found rows and check that the result is not greater than 5, i.e. that no more than 5 rows in the same (id, time_of_midnight) group have a vote count equal to or higher than the given row's.
For example, if the given row happens to be one with the most votes in its group, the subquery will find only that same row (assuming there are no ties) and so the count will be 1. That is fewer than 5 – therefore, the given row will qualify for output.
If the given row is the second most voted item in a group, the subquery will find the same row and the top-voted item (again, assuming no ties), which gives a count of 2. Again, that matches the count <= 5 condition, and so the row will be returned in the output.
In general, if a row is ranked #N in its group according to the total number of votes, it means there are N rows in that group whose vote count is the same as or higher than the given row's (we are still assuming there are no ties). So, when you are counting rows in this way, you are effectively calculating the given row's ranking.
Now, if there are ties, you may get fewer results per group using this method. In fact, if a group had 6 or more rows tied at the maximum number of votes, you would get no rows for that group in the output, because the subquery would never return a count value less than 6.
That is because effectively all the top-voted items would be ranked as 6 (or whatever their number would be) rather than as 1. To rank them as 1 instead, you could try the following modification of the same query:
SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay + votes_nay) AS total_votes
FROM results a
WHERE id = 1
  AND (SELECT COUNT(*) + 1
       FROM results b
       WHERE b.id = a.id
         AND b.time_of_midnight = a.time_of_midnight
         AND (b.votes_yay + b.votes_nay) > (a.votes_yay + a.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;
Now the subquery will be looking only for rows with the higher number of votes than in the given row. The resulting count will be increased by 1 and that will be the given row's ranking (and the value to compare against 5).
So, if the counts were e.g. 10, 10, 8, 7 etc., the rankings would be calculated as 1, 1, 3, 4 etc. rather than as 2, 2, 3, 4 etc., as with the original version.
That, of course, means that the output might now have more than 5 rows per group. For instance, if votes were distributed as 10, 9, 8, 8, 8, 8, 6 etc., you would get 10, 9 and all the 8s (because the rankings would be 1, 2, 3, 3, 3, 3, 7...). To return exactly 5 names per group (assuming there are at least 5 of them), you'd probably need to consider a different method altogether.
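For instance, a sketch with the ROW_NUMBER() window function (available in Oracle), which assigns every row a unique rank within its group, so at most 5 rows come back per group and ties are broken arbitrarily:
SELECT name, time_of_midnight, votes_yay, votes_nay, total_votes
FROM (SELECT r.*, (votes_yay + votes_nay) AS total_votes,
             -- unique per-group rank, highest vote total first
             ROW_NUMBER() OVER (PARTITION BY id, time_of_midnight
                                ORDER BY (votes_yay + votes_nay) DESC) AS rn
      FROM results r
      WHERE id = 1) ranked
WHERE rn <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;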

Select ID given the list of members

I have a table for the link/relationship between two other tables: a table of customers and a table of groups. A group is made up of one or more customers. The link table is like:
APP_ID | GROUP_ID | CUSTOMER_ID
1 | 1 | 123
1 | 1 | 124
1 | 1 | 125
1 | 2 | 123
1 | 2 | 125
2 | 3 | 123
3 | 1 | 123
3 | 1 | 124
3 | 1 | 125
I now need, given a list of customer IDs, to be able to get the group ID for that list of customer IDs. The group ID may not be unique: the same group ID will contain the same list of customer IDs, but that group may exist in more than one app_id.
I'm thinking that
SELECT APP_ID, GROUP_ID, COUNT(CUSTOMER_ID) AS COUNT
FROM GROUP_CUST_REL
WHERE CUSTOMER_ID IN ( <list of ids> )
GROUP BY APP_ID, GROUP_ID
HAVING COUNT(CUSTOMER_ID) = <number of ids in list>
will return all of the group IDs that contain all of the customer IDs in the given list. So for a list of (123,125), group ID 2 would be returned from the above example.
I will then have to link with the app table and use its created timestamp to identify the most recent application the group existed in, so that I can pull the correct/most up-to-date info from the group table.
Does anyone have any thoughts on whether this is the most efficient way to do this? If there is another quicker/cleaner way I'd appreciate your thoughts.
This smells like a division:
Division sample
Other related stack overflow question
Taking a look at the provided links, you'll see solutions to similar issues from relational algebra's point of view; they don't seem to be quicker, and whether they are cleaner is arguable.
I didn't look at your solution at first, and when I solved this it turned out I had solved it the same way you did.
Actually, I thought this:
<number of ids in list>
Could be turned into something like this (so that you don't need the extra parameter):
select count(*) from (<list of ids>) as t
But clearly, I was wrong. I'd stay with your current solution if I were you.
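One caveat: because the WHERE clause discards customers outside the list before counting, the HAVING test also passes for groups that contain the listed customers plus extra members (for the list (123,125), group ID 1 qualifies too, since it contains both). If an exact membership match is required, a sketch using conditional aggregation, with the list and its size of 2 hardcoded for the example:
SELECT APP_ID, GROUP_ID
FROM GROUP_CUST_REL
GROUP BY APP_ID, GROUP_ID
HAVING COUNT(CASE WHEN CUSTOMER_ID IN (123, 125) THEN 1 END) = 2  -- all listed members present
   AND COUNT(*) = 2;                                              -- ...and no extra members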

MySQL query for initial filling of order column

Sorry for the vague question title.
I've got a table containing a huge list of, say, products belonging to different categories. There's a foreign key column indicating which category a particular product belongs to, i.e. in the "bananas" row category might be 3, which indicates "fruits".
Now I've added an additional column "order", which holds the display order within that particular category. I need to do the initial ordering. Since the list is big, I don't want to change every row by hand. Is it possible to do with one or two queries? I don't care what the initial order is as long as it starts at 1 and goes up.
I can't do something like SET order = id because id counts from 1 up regardless of product category, and order must start anew from 1 for every category.
Example of what I need to achieve:
ID | product | category | Order
1 | bananas | fruits | 1
2 | chair | furniture | 1
3 | apples | fruits | 2
4 | cola | drinks | 1
5 | mango | fruits | 3
6 | pepsi | drinks | 2
(category is actually a number because it's a foreign key; in the example I put names just for clarification)
As you see, order numbers start anew from 1 for each different category.
Sounds like something a SQL procedure would be handy for.
Why not just set the order to the category? That is, why not:
update Table
set SortOrder = Category;
As an aside, you cannot have a column named order without quoting it (backticks in MySQL) -- it is a reserved word in SQL.
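If a stored procedure feels heavy, a sketch of the initial numbering in a single statement, assuming MySQL 8+ (for window functions), a table named products, and a sort_order column to sidestep the reserved word:
-- Number products 1..n within each category; ORDER BY id just makes the
-- initial order deterministic, any ordering would do
UPDATE products p
JOIN (SELECT id,
             ROW_NUMBER() OVER (PARTITION BY category ORDER BY id) AS rn
      FROM products) t
  ON t.id = p.id
SET p.sort_order = t.rn;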