SQL statement to split a table based on a join - sql

I have a primary table for Articles that is linked by a join table Info to a table Tags that has only a small number of entries. I want to split the Articles table, by either deleting rows or creating a new table with only the entries I want, based on the absence of a link to a certain tag. There are a few million articles. How can I do this?
Not all of the articles have any tag at all, and some have many tags.
Example:
table Articles
primary_key id
table Info
foreign_key article_id
foreign_key tag_id
table Tags
primary_key id
It was easy for me to segregate the articles that do have the match right off the bat, so I thought maybe I could do that and then use a NOT IN statement but that is so slow running it's unclear if it's ever going to finish. I did that with these commands:
INSERT INTO matched_articles SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5;
INSERT INTO unmatched_articles SELECT * FROM articles a WHERE a.id NOT IN (SELECT m.id FROM matched_articles m);
If it makes a difference, I'm on Postgres.

INSERT INTO matched_articles
SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5;
INSERT INTO unmatched_articles
SELECT * FROM articles a WHERE a.id NOT IN (SELECT m.id FROM matched_articles m);
There's so much wrong here, I'm not sure where to start. OK in your first insert you do not need a left join in fact you don't actually have one. It should be
INSERT INTO matched_articles
SELECT * FROM articles a INNER JOIN info i ON a.id = i.article_id WHERE i.tag_id = 5;
Had you needed a left join you would have had
INSERT INTO matched_articles
SELECT * FROM articles a LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5;
When you put something from the right side of a left join into the where clause (other than searching for the null values), then you convert it to an inner join becasue it must meet that condition, therefore the records that don't have a match inthe right table are elimiated.
Now the second statement can be done with a special case of the left join, although what you have will work.
INSERT INTO matched_articles
SELECT * FROM articles a
LEFT JOIN info i ON a.id = i.article_id AND i.tag_id = 5
WHERE i.tag_id is null
This will give you all the records that are in the info table except those that matched the articles table.
Now the next thing, you should not write insert staments without specifying the fields you want to insert. Nor should you ever write a select statement using select * especially if you have a join. This is generally sloppy, lazy coding and should be fixed. What if someone changed the structure of one of the tables but not the other? This kind of thing is bad for maintenance and in the case of a select statment with a join, it is returning a collumn twice (the join column) and that is a waste of server and network resources. It is just poor coding to be too lazy specify what you need and only what you need. So get out of the habit and don't do it again for any production code.
If you current stament is too slow, you may also be able to fix it with the right indexes. Are the id fields indexed on both tables? Onthe other hand if there are millionas of articles, it is going to take time to insert them. It is often better to do this in batches maybe 50000 at a time (fewer still if this takes too long). Just do the insert ina loop that selects the top XXX records and then loops until the row count affected is none.

Your queries look ok except the first one should be an inner join, not a left join. If you want to try something else, consider this:
INSERT INTO matched_articles
SELECT *
FROM articles a
INNER JOIN info i ON a.id = i.article_id
WHERE i.tag_id = 5;
INSERT INTO unmatched_articles
SELECT *
FROM articles a
LEFT JOIN info i ON a.id = i.article_id AND a.id <> 5
WHERE a.id IS NULL
That might be faster but really, what you have is probably ok if you only have to do it once.

Not sure, if Postgres has a concept of a temporary table.
Here is how this can be done, as well.
CREATE Table #temp
AS SELECT A.ID, COUNT(i.*) AS Total
FROM Articles A
LEFT JOIN info i
ON A.id = i.Article_ID AND i.Tag_ID = 5
GROUP BY A.ID
INSERT INTO Matched_Articles
SELECT A.*
FROM Articles A INNER JOIN #temp t
ON A.ID = t.Article_ID AND T.Total = 0
DELETE FROM #Temp
WHERE Total = 0
INSERT INTO UnMatched_Articles
SELECT A.*
FROM Articles AINNER JOIN #temp t
ON A.ID = t.Article_ID
Note that I am not using any editor to try this out.
I hope this gives you hint on how I would approach this.

Related

The SQL query succeeds, but does not make any changes to the table

I have the following problem. I run this request, which passes successfully but it says "0 (rows affected)". In short, what I want to do. I'm trying to write a query that adds elements to the main table because I've linked Child tables I'm trying to retrieve the IDs and put them in the main table
INSERT INTO Articles
([ID],
[ArtName],
[ArtType],
[SerNo],
[MACNo],
[UserID],
[Available],
[CityID],
[StoreID],
[WorkplaceID],
[ItemPrice],
[IP_01],
[IP_02],
[Note])
SELECT
(SELECT max([ID])+1 FROM Articles),
'HP',
art.ID,
'123',
'А18Н31',
u.ID,
av.ID,
c.ID ,
s.ID,
w.ID,
'14.23',
'192.168.11.3',
'192.168.11.3',
GetDate()
FROM Articles a
INNER JOIN Workplace w ON a.WorkplaceID = w.ID
INNER JOIN Store s ON a.StoreID = s.ID
INNER JOIN City c ON a.CityID = c.ID
INNER JOIN Avaiable av ON a.Available = av.ID
INNER JOIN Users u ON a.UserID = u.ID
INNER JOIN ArtType art ON a.ArtType = art.ID
WHERE c.CityName LIKE '%Sofia%' AND art.ArtTypeName LIKE '%FirstType%' AND s.StoreName LIKE '%First%' AND av.AvaiableName LIKE '%yes%' AND u.UserName LIKE '%Valq%' AND w.WorkplaceName LIKE '%FWorkplace%'
This says "0 rows affected" because the SELECT returns no rows. This could be because nothing matches the JOINs. This could be because the WHERE clause filters out all rows. Without sample data, there is no way to tell. You have to investigate yourself.
That said, this is highly suspicious:
(SELECT max([ID])+1 FROM Articles),
This is not the right way to have an incremental id in a table. You should be using an identity column. Or perhaps default to a sequence. In either case, the value of id would be set automatically when rows are inserted.
Also note that if this inserts multiple rows, all would get the same id, which is presumably not what you want.

How to join three tables having relation parent-child-child's child. And I want to access all records related to parent

I have three tables:
articles(id,title,message)
comments(id,article_id,commentedUser_id,comment)
comment_likes(id, likedUser_id, comment_id, action_like, action_dislike)
I want to show comments.id, comments.commentedUser_id, comments.comment, ( Select count(action_like) where action_like="like") as likes and comment_id=comments.id where comments.article_id=article.id
Actually I want to count all action_likes that related to any comment. And show all all comments of articles.
action_likes having only two values null or like
SELECT c.id , c.CommentedUser_id , c.comment , (cl.COUNT(action_like) WHERE action_like='like' AND comment_id='c.id') as likes
FROM comment_likes as cl
LEFT JOIN comments as c ON c.id=cl.comment_id
WHERE c.article_id=article.id
It shows nothing, I know I'm doing wrong way, that was just that I want say
I guess you are looking for something like below. This will return Article/Comment wise LIKE count.
SELECT
a.id article_id,
c.id comment_id,
c.CommentedUser_id ,
c.comment ,
COUNT (CASE WHEN action_like='like' THEN 1 ELSE NULL END) as likes
FROM article a
INNER JOIN comments C ON a.id = c.article_id
LEFT JOIN comment_likes as cl ON c.id=cl.comment_id
GROUP BY a.id,c.id , c.CommentedUser_id , c.comment
IF you need results for specific Article, you can add WHERE clause before the GROUP BY section like - WHERE a.id = N
I would recommend a correlated subquery for this:
SELECT a.id as article_id, c.id as comment_id,
c.CommentedUser_id, c.comment,
(SELECT COUNT(*)
FROM comment_likes cl
WHERE cl.comment_id = c.id AND
cl.action_like = 'like'
) as num_likes
FROM article a INNER JOIN
comments c
ON a.id = c.article_id;
This is a case where a correlated subquery often has noticeably better performance than an outer aggregation, particularly with the right index. The index you want is on comment_likes(comment_id, action_like).
Why is the performance better? Most databases will implement the group by by sorting the data. Sorting is an expensive operation that grows super-linearly -- that is, twice as much data takes more than twice as long to sort.
The correlated subquery breaks the problem down into smaller pieces. In fact, no sorting should be necessary -- just scanning the index and counting the matching rows.

SQL query not returning Null-value records

I am using SQL Server 2014 on a Windows 10 PC. I am sending SQL queries directly into Swiftpage’s Act! CRM system (via Topline Dash). I am trying to figure out how to get the query to give me records even when some of the records have certain Null values in the Opportunity_Name field.
I am using a series of Join statements in the query to connect 4 tables: History, Contacts, Opportunity, and Groups. History is positioned at the “center” of it all. They all have many-to-many relationships with each other, and are thus each linked by an intermediate table that sits “between” the main tables, like so:
History – Group_History – Group
History – Contact_History – Contact
History – Opportunity_History – Opportunity
The intermediate tables consist only of the PKs in each of the main tables. E.g. History_Group is only a listing of HistoryIDs and GroupIDs. Thus, any given History entry can have multiple Groups, and each Group has many Histories associated with it.
Here’s what the whole SQL statement looks like:
SELECT Group_Name, Opportunity_Name, Start_Date_Time, Contact.Contact, Contact.Company, History_Type, (SQRT(SQUARE(Duration))/60) AS Hours, Regarding, HistoryID
FROM HISTORY
JOIN Group_History
ON Group_History.HistoryID = History.HistoryID
JOIN "Group"
ON Group_History.GroupID = "Group".GroupID
JOIN Contact_History
ON Contact_History.HistoryID = History.HistoryID
JOIN Contact
ON Contact_History.ContactID = Contact.ContactID
JOIN Opportunity_History
ON Opportunity_History.HistoryID = History.HistoryID
JOIN Opportunity
ON Opportunity_History.OpportunityID = Opportunity.OpportunityID
WHERE
( Start_Date_Time >= ('2018/02/02') AND
Start_Date_Time <= ('2018/02/16') )
ORDER BY Group_NAME, START_DATE_TIME;
The problem is that when the Opportunity table is linked in, any record that has no Opportunity (i.e. a Null value) won’t show up. If you remove the Opportunity references in the Join statement, the listing will show all history events in the Date range just fine, the way I want it, whether or not they have an Opportunity associated with them.
I tried adding the following to the WHERE part of the statement, and it did not work.
AND ( ISNULL(Opportunity_Name, 'x') = 'x' OR
ISNULL(Opportunity_Name, 'x') <> 'x' )
I also tried changing the Opportunity_Name reference up in the SELECT part of the statement to read: ISNULL(Opportunity_Name, 'x') – this didn’t work either.
Can anyone suggest a way to get the listing to contain all records regardless of whether they have a Null value in the Opportunity Name or not? Many thanks!!!
I believe this is because a default JOIN statement discards unmatched rows from both tables. You can fix this by using LEFT JOIN.
Example:
CREATE TABLE dataframe (
A int,
B int
);
insert into dataframe (A,B) values
(1, null),
(null, 1)
select a.A from dataframe a
join dataframe b ON a.A = b.A
select a.A from dataframe a
left join dataframe b ON a.A = b.A
You can see that the first query returns only 1 record, while the second returns both.
SELECT Group_Name, Opportunity_Name, Start_Date_Time, Contact.Contact, Contact.Company, History_Type, (SQRT(SQUARE(Duration))/60) AS Hours, Regarding, HistoryID
FROM HISTORY
LEFT JOIN Group_History
ON Group_History.HistoryID = History.HistoryID
LEFT JOIN "Group"
ON Group_History.GroupID = "Group".GroupID
LEFT JOIN Contact_History
ON Contact_History.HistoryID = History.HistoryID
LEFT JOIN Contact
ON Contact_History.ContactID = Contact.ContactID
LEFT JOIN Opportunity_History
ON Opportunity_History.HistoryID = History.HistoryID
LEFT JOIN Opportunity
ON Opportunity_History.OpportunityID = Opportunity.OpportunityID
WHERE
( Start_Date_Time >= ('2018/02/02') AND
Start_Date_Time <= ('2018/02/16') )
ORDER BY Group_NAME, START_DATE_TIME;
You will want to make sure you are using a LEFT JOIN with the table Opportunity. This will keep records that do not relate to records in the Opportunity table.
Also, BE CAREFUL you do not filter records using the WHERE clause for the Opportunity table being LEFT JOINED. Include those filter conditions relating to Opportunity instead in the LEFT JOIN ... ON clause.

How to optimize the query? t-sql

This query works about 3 minutes and returns 7279 rows:
SELECT identity(int,1,1) as id, c.client_code, a.account_num,
c.client_short_name, u.uso, us.fio, null as new, null as txt
INTO #ttable
FROM accounts a INNER JOIN Clients c ON
c.id = a.client_id INNER JOIN Uso u ON c.uso_id = u.uso_id INNER JOIN
Magazin m ON a.account_id = m.account_id LEFT JOIN Users us ON
m.user_id = us.user_id
WHERE m.status_id IN ('1','5','9') AND m.account_new_num is null
AND u.branch_id = #branch_id
ORDER BY c.client_code;
The type of 'client_code' field is VARCHAR(6).
Is it possible to somehow optimize this query?
Insert the records in the Temporary table without using Order by Clause and then Sort them using the c.client_code. Hope it should help you.
Create table #temp
(
your columns...
)
and Insert the records in this table Without Using the Order by Clause. Now run the select with Order by Clause
Do you have indexes set up for your tables? An index on foreign key columns as well as Magazin.status might help.
Make sure there is an index on every field used in the JOINs and in the WHERE clause
If one or the tables you select from are actually views, the problem may be in the performance of these views.
Always try to list tables earlier if they are referenced in the where clause - it cuts off row combinations as early as possible. In this case, the Magazin table has some predicates in the where clause, but is listed way down in the tables list. This means that all the other joins have to be made before the Magazin rows can be filtered - possibly millions of extra rows.
Try this (and let us know how it went):
SELECT ...
INTO #ttable
FROM accounts a
INNER JOIN Magazin m ON a.account_id = m.account_id
INNER JOIN Clients c ON c.id = a.client_id
INNER JOIN Uso u ON c.uso_id = u.uso_id
LEFT JOIN Users us ON m.user_id = us.user_id
WHERE m.status_id IN ('1','5','9')
AND m.account_new_num is null
AND u.branch_id = #branch_id
ORDER BY c.client_code;
This kind of optimization can greatly improve query performance.

SQL: Speed Improvement - Left Join on cond1 or cond2

SELECT DISTINCT a.*, b.*
FROM current_tbl a
LEFT JOIN import_tbl b
ON ( a.user_id = b.user_id
OR ( a.f_name||' '||a.l_name = b.f_name||' '||b.l_name)
)
Two tables that are basically the same
I don't have access to the table structure or data input (thus no cleaning up primary keys)
Sometimes the user_id is populated in one and not the other
Sometimes names are equal, sometimes they are not
I've found that I can get the most of the data by matching on user_id or the first/last names. I'm using the ' ' between the names to avoid cases where one user has the same first name as another's last name and both are missing the other field (unlikely, but plausible).
This query runs in 33000ms, whereas individualized they are each about 200ms.
I've been up late and can't think straight right now
I'm thinking that I could do a UNION and only query by name where a user_id does not exist (the default join is the user_id, if a user_id doesn't exist then I want to join by the name)
Here is some free points to anyone that wants to help
Please don't ask for the execution plan.
Looks like you can easily avoid the string concatenation:
OR ( a.f_name||' '||a.l_name = b.f_name||' '||b.l_name)
Change it to:
OR ( a.f_name = b.f_name AND a.l_name = b.l_name)
Rather than concatenating first and last name and comparing them, try comparing them individually instead. Assuming you have them (and you should create them if you don't), this should improve your chances of using indexes on the first name and last name columns.
SELECT DISTINCT a.*, b.*
FROM current_tbl a
LEFT JOIN import_tbl b
ON ( a.user_id = b.user_id
OR (a.f_name = b.f_name and a.l_name = b.l_name)
)
If people's suggestions don't provide a major speed increase, there is a possibility that your real problem is that the best query plan for your two possible join conditions is different. For that situation you would want to do two queries and merge results in some way. This is likely to make your query much, much uglier.
One obscure trick that I have used for that kind of situation is to do a GROUP BY off of a UNION ALL query. The idea looks like this:
SELECT a_field1, a_field2, ...
MAX(b_field1) as b_field1, MAX(b_field2) as b_field2, ...
FROM (
SELECT a.field_1 as a_field1, ..., b.field1 as b_field1, ...
FROM current_tbl a
LEFT JOIN import_tbl b
ON a.user_id = b.user_id
UNION ALL
SELECT a.field_1 as a_field1, ..., b.field1 as b_field1, ...
FROM current_tbl a
LEFT JOIN import_tbl b
ON a.f_name = b.f_name AND a.l_name = b.l_name
)
GROUP BY a_field1, a_field2, ...
And now the database can do each of the two joins using the most efficient plan.
(Warning of a drawback in this approach. If a row in current_tbl joins to multiple rows in import_tbl, then you'll wind up merging data in a very odd way.)
Incidental random performance tip. Unless you have reason to believe that there are potential duplicate rows, avoid DISTINCT. It forces an implicit GROUP BY, which can be expensive.
I don't really understand why you're concatenating those strings. Seems like that's where your slowdown would be. Does this work instead?
SELECT DISTINCT a.*, b.*
FROM current_tbl a
LEFT JOIN import_tbl b
ON ( a.user_id = b.user_id
OR ( a.f_name = b.f_name AND a.l_name = b.l_name)
)
Here is Yet Another Ugly Way To Do It.
SELECT a.*
, CASE WHEN b.user_id IS NULL THEN c.field1 ELSE b.field1 END as b_field1
, CASE WHEN b.user_id IS NULL THEN c.field2 ELSE b.field2 END as b_field2
...
FROM current_tbl a
LEFT JOIN import_tbl b
ON a.user_id = b.user_id
LEFT JOIN import_tbl c
ON a.f_name = c.f_name AND a.l_name = c.l_name;
This avoids any GROUP BY, and also handles conflicting matches in a somewhat reasonable way.
Try using JOIN hints:
http://msdn.microsoft.com/en-us/library/ms173815.aspx
We were encountering the same type of behavior with one of our queries. As a last resort we added the LOOP hint, and the query ran much much faster.
It's important to note that Microsoft says this about JOIN hints:
Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend that hints, including , be used only as a last resort by experienced developers and database administrators.
my boss at my last job.. I swear.. he thought that using UNIONS was ALWAYS FASTER THAN OR.
For example.. instead of writing
Select * from employees Where Employee_id = 12 or employee_id = 47
he would write (and have me write)
Select * from employees Where employee_id = 12
UNION
Select * from employees Where employee_id = 47
SQL Sever optimizer said that this was the right thing to do in SOME situations.. I have a friend who works on the SQL Server team at Microsoft, I emailed him about this and he told me that my stats were out of date or something along those lines.
I never really got a good answer on WHY the unions are faster, it seems REALLY counter-intuitive.
I'm not recommending you DO this, but in some situations it can help.
Also two more things-- GET RID OF THE DISTINCT CLAUSE unless you absolutely need it.. n
and more importantly, you can easily get rid of the concatenation in your join, like this for example (pardon my lack of mySQL knowledge)
SELECT DISTINCT a., b.
FROM current_tbl a
LEFT JOIN import_tbl b
ON ( a.user_id = b.user_id
OR ( a.f_name = b.f_name and a.l_name = b.l_name)
)
I've had some tests at work in a similiar situation that show 10x performance improvement by getting rid of the simple concatenation in your join