Splunk - Match different fields in different events from same data source - splunk

I have a data source in which I need to return all pairs of events (event1, event2) from a single data source, where field1 from event1 matches field2 from event2.
For example, let's say I have the following data.
I need to return a pair of events where field id from event1, matches field referrer_id from event2. Let's say, to get the following report.
Adam Anderson referred Betty Burger on 2016-01-02 08:00:00.000
Adam Anderson referred Carol Camp on 2016-01-03 08:00:00.000
Betty Burger referred Darren Dougan on 2016-01-04 08:00:00.000
In sql I can do this quite easily with the following command.
select a.first_name as first1, a.last_name as last1, b.first_name as first2,
b.last_name as last2, b.date as date
from myTable a
inner join myTable b on a.id = b.referrer_id;
Which returns the following table,
which gives exactly the data I need.
Now, I've been attempting to replicate this in a splunk query and have run into quite a few issues. First I attempted to use the transaction command, but that aggregated all of the related events together as opposed to matching them a pair at a time.
Next, I attempted to use a subsearch, first finding the id and then searching in the subsearch, first for the first event by id and the appending the second event by referral_id. Then, since append creates a new row instead of appending to the same row, using a stats to aggregate the resulting rows by the matching id field. I did attempt to use appendcols but that didn't return anything for me.
...
| table id
| map search="search id=$id$
| fields first_name, last_name, id
| rename first_name as first1
| rename last_name as last1
| rename id as match_id
| append [search $id$
| search referral_id=$id$
| fields first_name, last_name, referral_id, date
| rename first_name as first2
| rename last_name as span2
| rename referral_id as match_id]"
| fields first1, last1, first2, last2, match_id, time
| stats values(first1) as first1, values(last1) as last1, values(first2) as first2,
values(last2) as last2, values(time) as time by id
The above query works for me and gives me the table I need, but it is incredibly slow due to the repeated searches over the entire time frame, and also limited by the map maxsearches which, for whatever reason, cannot be set to unlimited.
This seems like an overly complicated solution, especially in comparison to the sql query. Surely there must exist a simpler, faster way that this can be done, which isn't limited by the arbitrary limited settings or the multiple repeating search queries. I would greatly appreciate any help.

I ended up using append. Using join gave me faster results, but didn't result in every matching pair, for my example it would return 2 rows instead of three, returning Adam with Betty, but not returning Adam with Carol.
Using append returned a full list, and using stats by id gave me the result I was looking for, a full list of each matching pair. It also gave extra empty fields, so I had to remove those, and then manipulate the resulting mv's into their own individual rows. Splunk doesn't offer a multifield mv expand, so I used a workaround.
...
| rename id as matchId, first_name as first1, last_name as last1
| table matchId, first1, last1
| append [
search ...
| rename referrer_id as matchId, first_name as first2, last_name as last2
| table matchId, first2, last2, date]
| stats list(first1) as first1, list(last1) as last1, list(first2) as first2, list(last2) as last2, list(date) as date by matchId
| search first1!=null last1!=null first2!=null last2!=null
| eval zipped=mvzip(mvzip(first2, last2, ","), date, ",")
| mvexpand zipped
| makemv zipped delim=","
| eval first2=mvindex(zipped, 0)
| eval last2=mvindex(zipped, 1)
| eval date=mvindex(zipped, 2)
| fields - zipped
This is faster than using a map with multiple subsearches, and gives all the of results. It is still limited by the maximum size of the subsearch, but at least provides the necessary data.

Related

How to match phone number prefix to country from phonenumber in SQL

I am trying to extract the country code prefix from a list of numbers, and match them to the region that they belong to. The data might look something like this:
| id | phone_number |
|----|----------------|
| 1 | +27000000000 |
| 2 | +16840000000 |
| 3 | +10000000000 |
| 4 | +27000000000 |
The country codes here are:
American Samoa: +1684
United States and Caribbean: +1
South Africa: +27
And the desired result would be something this:
| country | count |
|-----------------------------|-------|
| South Africa | 2 |
| American Samoa | 1 |
| United States and Caribbean | 1 |
There are some difficulties because
country prefix codes vary from 1 to 4 numbers and even without the country prefix,
phone number length varies from place to place.
I do not have write access to this DB, so adding another column, while probably the best solution, will not work in this use case
This is my current solution:
SELECT
CASE
WHEN SUBSTRING(phone_number,1,5) = '+1684' THEN 'American Samoa'
WHEN SUBSTRING(phone_number,1,5) = '+1264' THEN 'Anguilla'
...
WHEN SUBSTRING(phone_number,1,5) = '+1599' THEN 'Saint Martin'
WHEN SUBSTRING(phone_number,1,4) = '+355' THEN 'Albania'
WHEN SUBSTRING(phone_number,1,4) = '+213' THEN 'Algeria'
...
WHEN SUBSTRING(phone_number,1,4) = '+263' THEN 'Zimbabwe'
WHEN SUBSTRING(phone_number,1,3) = '+93' THEN 'Afghanistan'
WHEN SUBSTRING(phone_number,1,3) = '+54' THEN 'Argentina'
...
WHEN SUBSTRING(phone_number,1,3) = '+58' THEN 'Venezuela'
WHEN SUBSTRING(phone_number,1,3) = '+84' THEN 'Vietnam'
WHEN SUBSTRING(phone_number,1,2) = '+1' THEN 'United States and Caribbean'
WHEN SUBSTRING(phone_number,1,2) = '+7' THEN 'Kazakhstan, Russia'
ELSE 'unknown'
END as country_name,
count(*)
FROM users
GROUP BY country_name
order by count desc
There are ~205 WHEN ... THEN cases. It seems to be very inefficient and times out. I assume this is because it runs the pattern matching on every row. This would need to scale to roughly 10s of millions of rows
Is there a more efficient way to do this?
I am using postgreSQL 9.6.16
In spite of reading the whole table, an index could help here. In order to aggregate the data per country code, the DBMS must sort all rows by country code. Sorting is an expensive operation, especially on large data sets. If you had an index on the country codes, the DBMS would find the codes already pre-sorted in the index and could avoid the work of sorting the data.
You don't have the separate country code in a column, but each phone number starts with the code, so you could index the complete phone number:
create index idx on users (phone_number);
Then you must make it obvious to the DBMS that you are interested in the beginnings of the string, so it will consider using the index. Invoking a function like SUBSTRING on the phone number is likely to make the the DBMS blind to this. Use LIKE instead. According to the docs (https://www.postgresql.org/docs/9.3/indexes-types.html), indexes on strings can be used with LIKE 'something%':
WHEN phone_number LIKE '+1684%' THEN 'American Samoa'
There is no guarantee this will help, but it's worth a try I think. It depends on whether the optimizer sees the advantage of using the pre-sorted phone numbers from the index.

0 results in MS Access totals query (w. COUNT) after applying criteria

A query I am working on is showing a rather interesting behaviour that I couldn't debug so far.
This is the query before it gets buggy:
QryCount
SELECT EmpId, [Open/Close], Count([Open/Close]) AS Occurences, Attribute1, Market, Tier, Attribute2, MtSWeek
FROM qrySource
WHERE (Venue="NewYork") AND (Type="TypeA")
GROUP BY EmpId, [Open/Close], Attribute1, Market, Tier, Attribute2, MtSWeek;
The query gives precisely the results that I would expect it to:
#01542 | Open | 5 | Call | English | Tier1 | Complain | 01/01/2017
#01542 | Closed | 2 | Call | English | Tier2 | ProdInfo | 01/01/2017
#01542 | Open | 7 | Mail | English | Tier1 | ProdInfo | 08/01/2017
etc...
But as a matter of fact in doing so it provides more records than needed at a subsequent step thereby creating cartesians.
qrySource.[Open/Close] is a string type field with possible attributes (you guessed) "open", "Closed" and null and it is actually provided by a mapping table at the creation stage of qrySource (not sure, but maybe this helps).
Now, the error comes in when I try to limit qryCount only to records where Open/Close = "Open".
I tried both using WHERE and HAVING to no avail. The query would result in 0 records, which is not what I would like to see.
I thought that maybe it is because "open" is a reserved term, but even by changing it to "D_open" in the source table didn't fix the issue.
Also tried to filter for the desired records in a subsequent query
SELECT *
FROM QryCount
WHERE [Open/Close] ="D_Open"
But nothing, still 0 records found.
I am suspicious it might be somehow related to some inherent proprieties of the COUNT function but not sure. Any help would be appreciated.
Everyone who participated, thank you and apologies for providing you with insufficient/confusing information. I recon the question could have been drafted better.
Anyhow, I found that the problem was apparently caused by the "/" in the Open/Closed field name. As soon as I removed it from the field name in the original mapping table the query performed as expected.

Items getting double-counted in SQL Server, dependent counting logic not working right

I am counting the number of RFIs (requests for info) from various agencies. Some of these agencies are also part of a task force (committee). Currently this SQL combines the agencies and task forces into one list and counts the RFIs for each. The problem is, if the RFI belongs to a task force (which is also assigned to an agency), I only want it to count for the task force and not for the agency. However, if the agency does not have a task force assigned to the RFI, I want it to still count for the agency. The RFIs are linked to various agencies through a _LinkEnd table, but that logic works just fine. Here is the logic thus far:
SELECT t.Submitting_Agency, COUNT(DISTINCT t.Count) AS RFICount
FROM (
SELECT RFI_.Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
UNION ALL
SELECT RFI_.Task_Force__Initiative AS Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630' AND RFI_.Task_Force__Initiative IS NOT NULL) t
GROUP BY t.Submitting_Agency
How can I get it to only count an RFI one time, even though the two fields are combined? For instance, here are sample records from the RFI_ table:
---------------------------------------------------------------------------
| Unique_ID | Submitting_Agency | Task_Force__Initiative | Date_Submitted |
---------------------------------------------------------------------------
| 1 | Social Service | Flood Relief TF | 2011-05-08 |
---------------------------------------------------------------------------
| 2 | Faith-Based Init. | Homeless Shelter Min. | 2011-06-08 |
---------------------------------------------------------------------------
| 3 | Psychology Group | | 2011-05-04 |
---------------------------------------------------------------------------
| 4 | Attorneys at Law | | 2011-05-05 |
---------------------------------------------------------------------------
| 5 | Social Service | | 2011-05-10 |
---------------------------------------------------------------------------
So assuming only one link existed to one RFI for each of these, the count should be as follows:
Social Service 1
Faith-Based Unit. 0
Psychology Group 1
Attorneys at Law 1
Flood Relief TF 1
Homeless Shelter Min. 1
Note that if both an agency and a task force are in one record, then the task force gets the count, not the agency. But it is possible for the agency to have a record without a task force, in which case the agency gets the count. How could I get this to work in this fashion so that RFIs are not double-counted? As it stands both the agency and the task force get counted, which I do not want to happen. The task force always gets the count, unless that field is blank, then the agency gets it.
I guess a simple COLESCE() would do the trick?
SELECT COLAESCE(Task_Force__Initiative, Submitting_Agency), COUNT(DISTINCT _LinkEnd.Link_ID) AS RFICount
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
GROUP BY COLAESCE(Task_Force__Initiative, Submitting_Agency);
Rather than:
SELECT t.Submitting_Agency ...
Try
SELECT
CASE t.[Task_Force__Initiative]
WHEN NULL THEN -- Or whatever value constitutes "empty"
t.[Submitting_Agency]
ELSE
t.[Task_Force__Initiative]
END ...
and then GROUP BY the same.
http://msdn.microsoft.com/en-us/library/ms181765.aspx
The result will be that your count will aggregate from the proper specified grouping point, rather than from the single agency column.
EDIT: From your example it looks like you don't use NULL for the empty field but maybe a blank string? In that case you'll want to replace the NULL in the CASE above with the proper "blank" value. If it is NULL then you can COALESCE as suggested in the other answer.
EDIT: Based on what I think your schema is... and your WHERE criteria
SELECT
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency]),
COUNT(*)
FROM
RFI_
JOIN _LinkEnd
ON RFI_.[Unique_ID]=_LinkEnd.[Entity_ID1]
WHERE
_LinkEnd.[Link_ID] LIKE 'CAS%'
AND RFI_.[Date_Submitted] BETWEEN '20110430' AND '20110630'
GROUP BY
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency])

Need lowest price in each region in a mysql query

I am trying to write up a query for wordpress which will give me all the post_id's with the lowest fromprice field for each region. Now the trick is these are custom fields in wordpress, and due to such, the information is stored row based, so there is no region and fromprice columns.
So the data I have is (but of course containing a lot more rows):
Post_ID | Meta_Key | Meta_Value
1 | Region | Location1
1 | FromPrice | 150
2 | Region | Location1
2 | FromPrice | 160
3 | Region | Location2
3 | FromPrice | 145
The query I am endeavoring to build should return the post_id of the "lowest priced" matching post grouped by each region with results like:
Post_ID | Region | From Price
1 | Location1 | 150
3 | Location2 | 145
This will allow me to easily iterate the post_id's and print the required information, in fact, I would be just happy with returning post_id's if the rest is harder, I can then fetch the information independently if need be.
Thanks a lot, tearing my hair out over this one; don't often have to think about shifting results on their side from row based to column based that often, but this time I need it!
So you get an idea of the table structure I have, you can use the below as a guide. I thought I had this, but it turned out yes, this query prints out each distinct region WITH the lowest from price found attached to that post in the region, but the post_id is completely incorrect. I don't know why, it seems to be just getting the first result of the post_id and using that.
SELECT pm.post_id,
pm2.meta_value as region,
MIN(pm.meta_value) as price
FROM `wp_postmeta` pm
inner join `wp_postmeta` pm2
on pm2.post_id = pm.post_id
AND pm2.meta_key = 'region'
AND pm.meta_key = 'fromprice'
group by region
I suggest changing MIN(pm.meta_value) in your query to be MIN(CAST(pm.meta_value AS DECIMAL)). Meta_value is a character field, so your existing query will be returning the minimum string value, not the minimum numeric value; for example, "100" will be deemed to be lower than "21".
EDIT - amended CAST syntax.
It's hard to figure out without being able to execute the query, but would it help to just change your group by to:
group by pm.post_id, region

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
What's the difference from a mere duplicate removal functionality point of view
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUPY BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
I read all the above comments but didn't see anyone pointed to the main difference between Group By and Distinct apart from the aggregation bit.
Distinct returns all the rows then de-duplicates them whereas Group By de-deduplicate the rows as they're read by the algorithm one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table where 1 of which is a duplicate of another then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you are using a GROUP BY without any aggregate function then internally it will treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when you are provided with DISTINCT clause better to use it for finding your unique records because the objective of GROUP BY is to achieve aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
In Teradata perspective :
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don’t think now that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient. Teradata has to sort the data to remove duplicates. In this case, it may be better to the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values, the GROUP BY statement is probably the better choice as only once the deduplication step takes place, after redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or a few duplicates only .
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you have probably a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
I know it's an old post. But it happens that I had a query that used group by just to return distinct values when using that query in toad and oracle reports everything worked fine, I mean a good response time. When we migrated from Oracle 9i to 11g the response time in Toad was excellent but in the reporte it took about 35 minutes to finish the report when using previous version it took about 5 minutes.
The solution was to change the group by and use DISTINCT and now the report runs in about 30 secs.
I hope this is useful for someone with the same situation.
Sometimes they may give you the same results but they are meant to be used in different sense/case. The main difference is in syntax.
Minutely notice the example below. DISTINCT is used to filter out the duplicate set of values. (6, cs, 9.1) and (1, cs, 5.5) are two different sets. So DISTINCT is going to display both the rows while GROUP BY Branch is going to display only one set.
SELECT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 2 | mech | 6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved by GROUP BY clause is not possible to achieved by DISTINCT without using some extra clause or conditions. E.g in above case.
To get the same result as DISTINCT you have to pass all the column names in GROUP BY clause like below. So see the syntactical difference. You must have knowledge about all the column names to use GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 1 | cs | 5.5 |
| 2 | mech | 6.3 |
| 3 | civil | 7.2 |
| 4 | eee | 8.2 |
| 6 | cs | 9.1 |
+------+--------+------+
Also I have noticed GROUP BY displays the results in ascending order by default which DISTINCT does not. But I am not sure about this. It may be differ vendor wise.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping those rows you want to calculate. DISTINCT will not do any calculation. It will show no duplicate rows.
I always used DISTINCT if I want to present data without duplicates.
If I want to do calculations like summing up the total quantity of mangoes, I will use GROUP BY
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
Funtional efficiency is totally different.
If you would like to select only "return value" except duplicate one, use distinct is better than group by. Because "group by" include ( sorting + removing ) , "distinct" include ( removing )
Generally we can use DISTINCT for eliminate the duplicates on Specific Column in the table.
In Case of 'GROUP BY' we can Apply the Aggregation Functions like
AVG, MAX, MIN, SUM, and COUNT on Specific column and fetch
the column name and it aggregation function result on the same column.
Example :
select specialColumn,sum(specialColumn) from yourTableName group by specialColumn;
There is no significantly difference between group by and distinct clause except the usage of aggregate functions.
Both can be used to distinguish the values but if in performance point of view group by is better.
When distinct keyword is used , internally it used sort operation which can be view in execution plan.
Try simple example
Declare #tmpresult table
(
Id tinyint
)
Insert into #tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct
Id
From #tmpresult