I have a large table and I need to check for similar rows. I don't need all column values to be the same, just similar. The rows must not be "distant" (determined by a query over another table), no value may be too different (I have already run the queries for these conditions), and most other values must be the same. I have to expect some ambiguity, so one or two differing values shouldn't break the "similarity" (I could get better performance by accepting only completely equal rows, but that simplification could cause errors; I will offer it as an option).
The way I am going to solve this is through PL/pgSQL: a FOR loop iterating through the results of the previous queries. For each column, an IF tests whether it differs; if so, I increment a difference counter and move on. At the end of each loop iteration, I compare the counter to a threshold to decide whether to keep the row as "similar" or not.
Such a PL/pgSQL-heavy approach seems slow compared to a pure SQL query, or to an SQL query with some PL/pgSQL functions involved. It would be easy to test for rows with all but X columns equal if I knew which columns should differ, but the difference can occur in any of some 40 columns. Is there any way to solve this with a single query? If not, is there any faster way than examining all the rows?
EDIT: I mentioned a table; in fact it is a group of six tables linked by 1:1 relationships. I don't feel like explaining what is what, that's a different question. Extrapolating from doing this over one table to my situation is easy for me, so I simplified it (but did not oversimplify it; it should demonstrate all the difficulties I have there) and made an example demonstrating what I need. NULL and anything else should count as "different". No need to write a script testing it all; I just need to find out whether this can be done in any way more efficient than the one I described.
The point is that I don't need to count rows (as usual), but columns.
EDIT2: previous fiddle - this wasn't so short, so I left it here just for archiving reasons.
EDIT3: simplified example here - just NOT NULL integers, preprocessing omitted. Current state of data:
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5
----+------+------+------+------+------
  1 |    4 |    2 |    3 |    4 |   11
  2 |    4 |    2 |    4 |    3 |   11
  3 |    6 |    3 |    3 |    5 |   13
When I run select similar_records( 1 );, I should get only row 2 (2 columns with different values, which is within the limit), not row 3 (4 different values, outside the limit of at most two differences).
To find rows that only differ on a given maximum number of columns:
WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col  -- more
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val                  -- more
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id = 1)  a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
I ignored all the other distracting details in your question.
Demonstrating with 5 columns. Add more as required.
If columns can be NULL, you may want to use IS DISTINCT FROM instead of <>.
This is using the somewhat unorthodox but handy parallel unnest(). Both arrays must have the same number of elements for it to work. Details:
Is there something like a zip() function in PostgreSQL that combines two arrays?
SQL Fiddle (building on yours).
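The unnest() approach is Postgres-specific, but the underlying idea (a conditional sum of per-column differences over a self join) is portable. Here is a minimal sketch in Python with SQLite, rebuilding the question's foo table; the base id 1 and the limit of fewer than 3 differences are the question's values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY,
                      bar1 INT, bar2 INT, bar3 INT, bar4 INT, bar5 INT);
    INSERT INTO foo VALUES (1, 4, 2, 3, 4, 11),
                           (2, 4, 2, 4, 3, 11),
                           (3, 6, 3, 3, 5, 13);
""")

# Each comparison yields 1 or 0 (the columns are NOT NULL, as in the
# question), so summing them counts the differing columns per candidate row.
rows = conn.execute("""
    SELECT b.id,
           (a.bar1 <> b.bar1) + (a.bar2 <> b.bar2) + (a.bar3 <> b.bar3)
         + (a.bar4 <> b.bar4) + (a.bar5 <> b.bar5) AS cols_different
    FROM foo a
    JOIN foo b ON a.id = 1 AND b.id <> 1
    WHERE (a.bar1 <> b.bar1) + (a.bar2 <> b.bar2) + (a.bar3 <> b.bar3)
        + (a.bar4 <> b.bar4) + (a.bar5 <> b.bar5) < 3
    ORDER BY cols_different
""").fetchall()
print(rows)  # [(2, 2)]: row 2 differs in 2 columns; row 3 (4 diffs) is dropped
```

With 40 columns this spelled-out sum gets verbose, which is exactly the repetition the parallel unnest() avoids.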
Instead of a loop comparing each row to all the others, do a self join:
select f0.id, f1.id
from foo f0 inner join foo f1 on f0.id < f1.id
where
f0.bar1 = f1.bar1 and f0.bar2 = f1.bar2
and
#(f0.bar3 - f1.bar3) <= 1
and
f0.bar4 = f1.bar4 and f0.bar5 = f1.bar5
or
f0.bar4 = f1.bar5 and f0.bar5 = f1.bar4
and
#(f0.bar6 - f1.bar6) <= 2
and
f0.bar7 is not null and f1.bar7 is not null and #(f0.bar7 - f1.bar7) <= 5
or
f0.bar7 is null and f1.bar7 <= 3
or
f1.bar7 is null and f0.bar7 <= 3
and
f0.bar8 = f1.bar8
and
#(f0.bar11 - f1.bar11) <= 5
;
 id | id
----+----
  1 |  4
  1 |  5
  4 |  5
(3 rows)
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5 | bar6 | bar7 | bar8 | bar9 | bar10 | bar11
----+------+------+------+------+------+------+------+------+------+-------+-------
  1 | abc  |    4 |    2 |    3 |    4 |   11 |    7 | t    | t    | f     |  42.1
  2 | abc  |    5 |    1 |    6 |    2 |    8 |   39 | t    | t    | t     |  19.6
  3 | xyz  |    4 |    2 |    3 |    5 |   14 |   82 | t    | f    |       |    95
  4 | abc  |    4 |    2 |    4 |    3 |   11 |    7 | t    | t    | f     |  42.1
  5 | abc  |    4 |    2 |    3 |    4 |   13 |    6 | t    | t    |       |  37.7
Are you aware that the AND operator has priority over OR? I'm asking because it looks like the WHERE clause in your query is not what you want. I mean, in your expression it is enough for f0.bar7 is null and f1.bar7 <= 3 to be true for the pair to be included.
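The precedence point is easy to check: AND binds tighter than OR, so a AND b OR c parses as (a AND b) OR c, not a AND (b OR c). A quick demonstration in Python with SQLite (MySQL and PostgreSQL parse these expressions the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without parentheses, AND is evaluated first: 0 AND 0 OR 1 == (0 AND 0) OR 1
implicit = conn.execute("SELECT 0 AND 0 OR 1").fetchone()[0]
grouped  = conn.execute("SELECT 0 AND (0 OR 1)").fetchone()[0]
print(implicit, grouped)  # 1 0
```

So any WHERE clause mixing AND and OR needs explicit parentheses to express the intended grouping.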
Related
Say I have two queries returning two tables with the same number of rows.
For example, if query 1 returns
| a | b | c |
| 1 | 2 | 3 |
| 4 | 5 | 6 |
and query 2 returns
| d | e | f |
| 7 | 8 | 9 |
| 10 | 11 | 12 |
How can I obtain the following, assuming both queries are opaque?
| a | b | c | d | e | f |
| 1 | 2 | 3 | 7 | 8 | 9 |
| 4 | 5 | 6 | 10 | 11 | 12 |
My current solution is to add a row number column to each query and inner join them on this column.
SELECT
q1_with_rownum.*,
q2_with_rownum.*
FROM (
SELECT ROW_NUMBER() OVER () AS q1_rownum, q1.*
FROM (.......) q1
) q1_with_rownum
INNER JOIN (
SELECT ROW_NUMBER() OVER () AS q2_rownum, q2.*
FROM (.......) q2
) q2_with_rownum
ON q1_rownum = q2_rownum
However, if there is a column named q1_rownum in either of the queries, the above will break. It is not possible for me to look inside q1 or q2; the only information available is that they are both valid SQL queries and do not contain columns with the same names. Is there any SQL construct similar to UNION, but for columns instead of rows?
There is no such construct. A row in a table is an entity.
If you are constructing generic code to run on arbitrary queries, you can try a less common column name, such as "an_unusual_query_rownum", or something more esoteric than that. I would suggest using the same name in both subqueries and then joining with the USING clause.
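As a sketch of that suggestion: give both subqueries the same unlikely row-number alias and join with USING. Shown in Python with SQLite (3.25+ for window functions); the q1/q2 tables stand in for the opaque queries, and _rn_ is an arbitrary alias chosen here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q1 (a INT, b INT, c INT);
    CREATE TABLE q2 (d INT, e INT, f INT);
    INSERT INTO q1 VALUES (1, 2, 3), (4, 5, 6);
    INSERT INTO q2 VALUES (7, 8, 9), (10, 11, 12);
""")

# Same helper alias on both sides; USING (_rn_) pairs the rows and keeps
# the helper column out of a SELECT * style result.
rows = conn.execute("""
    SELECT a, b, c, d, e, f
    FROM (SELECT ROW_NUMBER() OVER () AS _rn_, * FROM q1)
    JOIN (SELECT ROW_NUMBER() OVER () AS _rn_, * FROM q2) USING (_rn_)
    ORDER BY a
""").fetchall()
print(rows)  # [(1, 2, 3, 7, 8, 9), (4, 5, 6, 10, 11, 12)]
```

Note that ROW_NUMBER() OVER () with no ORDER BY numbers rows in whatever order the engine scans them; that pairing caveat applies equally to the original query in the question.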
I'm not sure I understood your exact problem, but I think you mean both q1 and q2 are joined on a column with the same name?
You should prefix each column with its table name to distinguish which column is referenced:
"table1"."similarColumnName" = "table2"."similarColumnName"
EDIT:
So the problem is that if there is already a column with the same alias as your ROW_NUMBER(), the JOIN cannot be made because the column name is ambiguous.
The easiest solution, if you cannot know the incoming queries' columns, is to pick a robust alias, for example _query_join_row_number.
EDIT2:
You could look into prefixing all columns with their original table's name, thus removing any conflict (you get q1_with_rows.rows, and the conflicting column becomes q1_with_rows.q1.rows).
An example on Stack Overflow: In a join, how to prefix all column names with the table it came from
I am a newbie to SQL and I would like to ask for help. I have 2 tables which I want to join, and I would like the result to have the same number of rows that Table 1 has.
Here are the tables:
Table 1
+----------+------------+---------+-------+
| ENTRY_ID | ROUTE_NAME | STATION | BOUND |
+----------+------------+---------+-------+
| 1 | 1A | ABCC | 1 |
| 2 | 2C | CBDD | 1 |
| 3 | 5 | AAAA | 2 |
| 4 | 1A | EEEE | 1 |
| 5 | 2B | ASFA | 2 |
| 6 | 5 | DSAS | 1 |
| 7 | 3 | QWEA | 2 |
| 8 | 4 | ASDA | 1 |
+----------+------------+---------+-------+
Table 2
+------------+-------+---------+---------------+
| ROUTE_NAME | BOUND | STATION | STOP_SEQUENCE |
+------------+-------+---------+---------------+
| 1A | 1 | AAA | 1 |
| 1A | 1 | ABC | 2 |
| 1A | 1 | CDA | 3 |
| 1A | 2 | ABC | 1 |
| 1A | 2 | ADC | 2 |
| 1A | 2 | ACA | 3 |
Repeated for other Routes
Short description for the Table:
Table 1 contains certain transit trips, with the transit route taken as ROUTE_NAME, the departure stop as STATION and the transit bound as BOUND (only 1/2).
Table 2 contains a set of transit route data, with fields similar to Table 1, plus the sequence of the stop as STOP_SEQUENCE.
What I would like to do is use STATION, BOUND and ROUTE_NAME in Table 1 to look up STOP_SEQUENCE in Table 2. The code that I have used is:
SELECT t1.ENTRY_ID, t1.ROUTE_NAME, t1.STATION, t1.BOUND, t2.STOP_SEQUENCE
FROM T1
LEFT JOIN t2 ON
(t1.STATION LIKE '*' & t2.STATION & '*') AND
(t1.BOUND = t2.BOUND) AND
(t1.ROUTE_NAME = t2.ROUTE_NAME);
The LIKE is a must, as there is some mismatch between the STATION strings of the 2 tables that it can handle.
The first question is: why does the LEFT JOIN not return all rows from Table 1? I have similar code that works on other, similar tables. For the data that didn't match up (with the LIKE condition), NULL should be returned for that particular row; however, this query returns fewer rows.
The second question is: with the LIKE condition I am returning one or more rows from Table 2 for each row of Table 1 that matches my criteria (it has happened in my code that 2+ rows with the same ENTRY_ID were returned). How can I keep only the minimum of the returned rows? I.e., if two STOP_SEQUENCE values are found, return the lower one.
Have struggled for this for a long time so many thanks for your help!
UPDATE
I have found that the condition t1.STATION LIKE '*' & t2.STATION & '*' is causing the missing rows from the first question. I replaced it with = and all rows came back. However, I still need this LIKE clause; what can I do?
Why does the LEFT JOIN not return all rows from TABLE 1?
I can only suspect this is an issue with your testing, since the SQL code posted in your question will return the values of t1.ENTRY_ID, t1.ROUTE_NAME, t1.STATION and t1.BOUND for every record in table t1, plus the value of t2.STOP_SEQUENCE for every record in table t2 which fulfils the join criteria for that t1 record.
Note that this will return multiple records from table t2 if more than one record fulfils the join criteria for a given record in table t1. Which leads to your next question:
With the LIKE condition I am returning one or more rows from Table 2 for each row of Table 1 that matches my criteria (it has happened in my code that 2+ rows with the same ENTRY_ID were returned). How can I keep only the minimum of the returned rows? I.e., if two STOP_SEQUENCE values are found, return the lower one.
You can achieve this with simple aggregation using the min function:
select
t1.entry_id,
t1.route_name,
t1.station,
t1.bound,
min(t2.stop_sequence) as stopseq
from
t1 left join t2 on
t1.station like '*' & t2.station & '*' and
t1.bound = t2.route_bound and
t1.route_name = t2.route_name
group by
t1.entry_id,
t1.route_name,
t1.station,
t1.bound
This will return the minimum value held by the field t2.stop_sequence within the group defined by each combination of values held by t1.entry_id, t1.route_name, t1.station, & t1.bound.
Aside, note that the sample table t2 in your question does not contain the field route_bound as referenced by your posted code.
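The grouped-min pattern is easy to verify outside Access. A sketch in Python with SQLite, with Access's '*' & x & '*' wildcard concatenation swapped for the standard '%' || x || '%' form; the tables and values here are trimmed-down stand-ins for the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (entry_id INT, route_name TEXT, station TEXT, bound INT);
    CREATE TABLE t2 (route_name TEXT, bound INT, station TEXT, stop_sequence INT);
    INSERT INTO t1 VALUES (1, '1A', 'ABCC', 1), (2, '1A', 'ZZZZ', 1);
    INSERT INTO t2 VALUES ('1A', 1, 'AAA', 1), ('1A', 1, 'ABC', 2), ('1A', 1, 'CDA', 3);
""")

# LEFT JOIN keeps every t1 row (unmatched rows get NULL); MIN() collapses
# multiple fuzzy matches down to the lowest stop_sequence.
rows = conn.execute("""
    SELECT t1.entry_id, MIN(t2.stop_sequence) AS stopseq
    FROM t1
    LEFT JOIN t2 ON t1.station LIKE '%' || t2.station || '%'
                AND t1.bound = t2.bound
                AND t1.route_name = t2.route_name
    GROUP BY t1.entry_id
    ORDER BY t1.entry_id
""").fetchall()
print(rows)  # [(1, 2), (2, None)]: 'ABCC' fuzzily matches 'ABC'; 'ZZZZ' matches nothing
```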
Problem: a SQL query that looks at the values in the "many" side of the relationship and doesn't return values from the "1" side.
Example tables (this shows two different tables):
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1 | | A |
| 2 | | B |
| 3 | | C |
| 4 | | D |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
+---------------+----------------------------+-------+
When I run my query, I get multiple, unique numbers that show all of the roles associated to each number like so.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 4 | C |
| 4 | A |
| 5 | B |
| 5 | C |
| 5 | D |
| 6 | D |
| 6 | A |
+---------------+-------+
I would like to be able to run my query and be able to say, "When the role of A is present, don't even show me the unique numbers that have the role of A".
Maybe SQL could look at the roles and say: when role A comes up, take that unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotation marks as this might not even be possible), the following is what I would expect my query to return:
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 5 | B |
| 5 | C |
| 5 | D |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
c.UniqueNumber,
cp.pType,
p.pRole,
a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
--I do some basic filtering to get to the relevant clients data but nothing more than that.
ORDER BY
c.uniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+
Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following, if the subselect materializes more than once (pick whichever is faster, i.e. the one with the smaller subtable):
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') = 0) AS s USING (numb);
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') > 0) AS s USING (numb)
WHERE s.numb IS NULL;
But the first one is probably faster and better.[1] Any of these methods can be folded into a larger query with multiple additional tables joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects, and the exceptions to this are exceptionally rare for self-referential joins, as they require a large mismatch in table sizes. You might hit those exceptions, though, if the number of rows referencing the 'A' alpha value is exceptionally small and it is indexed properly.
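For a concrete check of the anti-join, here is a sketch in Python with SQLite, loading the question's unique-number/role pairs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (numb INT, alpha TEXT);
    INSERT INTO t VALUES (1,'C'),(1,'D'),(2,'A'),(2,'B'),(3,'A'),(3,'B'),
                         (4,'C'),(4,'A'),(5,'B'),(5,'C'),(5,'D'),(6,'D'),(6,'A');
""")

# Anti-join: try to pair each row with an 'A' row for the same numb;
# keep only the rows where no such partner exists (s.numb IS NULL).
rows = conn.execute("""
    SELECT t.numb, t.alpha
    FROM t
    LEFT JOIN t AS s ON t.numb = s.numb AND s.alpha = 'A'
    WHERE s.numb IS NULL
    ORDER BY t.numb, t.alpha
""").fetchall()
print(rows)  # [(1, 'C'), (1, 'D'), (5, 'B'), (5, 'C'), (5, 'D')]
```

This reproduces exactly the expected result table from the question: numbers 2, 3, 4 and 6 disappear entirely because they each have a role-A row.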
There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.
I'm trying to write a SQL query against a single MySQL table (exp_playa_relationships), which stores relationship data for a CMS's posts and whose structure looks like this:
rel_id | parent_entry_id | parent_field_id | child_entry_id
-------+-----------------+-----------------+---------------
    55 |               3 |               2 |              1
    56 |               3 |               2 |              4
    58 |               1 |               2 |              4
    59 |               8 |               4 |              2
    60 |               8 |               5 |              1
    63 |               4 |               2 |              3
    64 |               9 |               4 |              6
    65 |               9 |               5 |              3
rel_id is unique, other columns are not.
I would like to generate the following out of the data above:
event_data_id | user_id | event_id
--------------+---------+---------
            8 |       1 |        2
            9 |       3 |        6
The parent_field_id value itself is discarded in the final output but is needed to figure out if the row's child_entry_id signifies a user_id or event_id.
parent_entry_id is the event_data_id.
So in plain english I would like to:
Filter rows that have a parent_field_id value of either 4 or 5
Out of those rows, I want to join all those that share the same parent_entry_id.
Return the parent_entry_id as event_data_id.
Return the child_entry_id as a user_id if the parent_field_id of the same row is 5.
Return the child_entry_id as an event_id if the parent_field_id of the same row is 4.
My current SQL query (not working) is this:
SELECT
t1.`parent_entry_id` AS event_data_id,
t1.`child_entry_id` AS user_id,
t1.`child_entry_id` AS event_id
FROM `exp_playa_relationships` AS t1
INNER JOIN `exp_playa_relationships` AS t2
ON t1.`parent_entry_id` = t2.`parent_entry_id`
WHERE t1.`parent_field_id` = 4 OR t1.`parent_field_id` = 5
What I cannot figure out specifically is how to avoid creating duplicates of parent_entry_id (the query produces 2 new rows per row) and how to return child_entry_id as either user_id or event_id depending on the parent_field_id value.
Any ideas would be much appreciated.
You're sooo close:
SELECT t1.`parent_entry_id` AS event_data_id,
t1.`child_entry_id` AS user_id,
t2.`child_entry_id` AS event_id
FROM `exp_playa_relationships` AS t1
INNER JOIN `exp_playa_relationships` AS t2
ON t2.`parent_entry_id` = t1.`parent_entry_id`
AND t2.`parent_field_id` = 4
WHERE t1.`parent_field_id` = 5
Specifically, you're having to tell it which row-set to pull the relevant data from.
By the way, your current database design will cause you more of these types of headaches... I'd recommend pulling the information out into 'result' tables (unless that's what this is for?).
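For a quick check of the corrected query, here is a sketch in Python with SQLite, loading the rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exp_playa_relationships
        (rel_id INT, parent_entry_id INT, parent_field_id INT, child_entry_id INT);
    INSERT INTO exp_playa_relationships VALUES
        (55,3,2,1),(56,3,2,4),(58,1,2,4),(59,8,4,2),
        (60,8,5,1),(63,4,2,3),(64,9,4,6),(65,9,5,3);
""")

# t1 carries the field-5 (user) rows; t2 supplies the field-4 (event) row
# for the same parent_entry_id, so each parent appears exactly once.
rows = conn.execute("""
    SELECT t1.parent_entry_id AS event_data_id,
           t1.child_entry_id  AS user_id,
           t2.child_entry_id  AS event_id
    FROM exp_playa_relationships AS t1
    JOIN exp_playa_relationships AS t2
      ON t2.parent_entry_id = t1.parent_entry_id
     AND t2.parent_field_id = 4
    WHERE t1.parent_field_id = 5
    ORDER BY event_data_id
""").fetchall()
print(rows)  # [(8, 1, 2), (9, 3, 6)]
```

This matches the desired output table in the question, with no duplicated parent_entry_id rows.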
In MySQL I have a table called "meanings" with three columns:
"person" (int),
"word" (byte, 16 possible values)
"meaning" (byte, 26 possible values).
A person assigns one or more meanings to each word:
person word meaning
-------------------
1 1 4
1 2 19
1 2 7 <-- Note: second meaning for word 2
1 3 5
...
1 16 2
Then another person, and so on. There will be thousands of persons.
I need to find for each of the 16 words the top three meanings (with their frequencies). Something like:
+--------+-----------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+-----------------+------------------+-----------------+
| 1 | meaning 5 (35%) | meaning 19 (22%) | meaning 2 (13%) |
| 2 | meaning 8 (57%) | meaning 1 (18%) | meaning 22 (7%) |
+--------+-----------------+------------------+-----------------+
...
Is it possible to solve this with a single MySQL query?
Well, if you group by word and meaning, you can easily get the % of people who use each word/meaning combination out of the dataset.
In order to limit the number of meanings returned for each word, you will need to create some sort of filter per word/meaning combination.
Seems like you just want the answer to your homework, so I won't post more than this, but it should be enough to get you on the right track.
Of course you can do
SELECT * FROM words WHERE word = 2 ORDER BY meaning DESC LIMIT 3
But this is cheating, since you would need to run it in a loop.
I'm working on a better solution.
I believe the problem I had a while ago looks similar. I ended up with the #counter thing.
A note about the problem:
Let's suppose there is only one person, who says:
+--------+------+---------+
| Person | Word | Meaning |
+--------+------+---------+
|      1 |    1 |       7 |
|      1 |    1 |       3 |
|      1 |    2 |       8 |
+--------+------+---------+
The report should read:
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (100%) | meaning 3 (100%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The following is not OK (a 50% frequency is absurd in a population of one person):
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (50%) | meaning 3 (50%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The intended meaning of the frequencies is: "How many people think this meaning corresponds to that word?"
So it's not merely about counting "cases", but about counting persons in the table.
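Following that definition (count persons, not rows), one way to get the top three meanings per word in a single query is to count DISTINCT persons per word/meaning and per word, then rank with a window function (available in MySQL 8.0+). A sketch in Python with SQLite, using the note's one-person data; the tie between meanings 3 and 7 is broken arbitrarily by meaning number here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meanings (person INT, word INT, meaning INT);
    INSERT INTO meanings VALUES (1, 1, 7), (1, 1, 3), (1, 2, 8);
""")

# Frequency = persons who gave this meaning / persons who rated the word.
# ROW_NUMBER() ranks meanings within each word; keep the top three.
rows = conn.execute("""
    WITH per_meaning AS (
        SELECT word, meaning, COUNT(DISTINCT person) AS voters
        FROM meanings GROUP BY word, meaning
    ), per_word AS (
        SELECT word, COUNT(DISTINCT person) AS raters
        FROM meanings GROUP BY word
    ), ranked AS (
        SELECT pm.word, pm.meaning,
               100.0 * pm.voters / pw.raters AS pct,
               ROW_NUMBER() OVER (PARTITION BY pm.word
                                  ORDER BY pm.voters DESC, pm.meaning) AS rnk
        FROM per_meaning pm JOIN per_word pw USING (word)
    )
    SELECT word, meaning, pct FROM ranked WHERE rnk <= 3
    ORDER BY word, rnk
""").fetchall()
print(rows)  # [(1, 3, 100.0), (1, 7, 100.0), (2, 8, 100.0)]
```

Both meanings of word 1 come out at 100%, as the note requires: the denominator is the number of distinct persons who rated the word, not the number of meaning rows.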