"Subquery has multiple rows for like comparison - sql

I have one table, given below.
In the following query, the outer query joins on a like comparison on the tag column with the subquery.
SELECT top 6 *
FROM [piarchive].[picomp2]
WHERE tag like
(
    Select distinct left(tag, 19) + '%'
    from (SELECT *
          FROM [piarchive].[picomp2]
          WHERE tag like '%CPU_Active'
            and time between '2014/10/02 15:13:08' and '2014/10/02 15:18:37'
            and value = -524289
          order by time desc) as t1
)
and tag not like '%CPU_Active'
and tag not like '%Program%'
and time between '2014/10/02 15:13:08' and '2014/10/02 15:18:37'
order by time desc
But this subquery returns multiple rows, causing the following error:
Error : "When used as an expression, subquery can return at most one row."

Replace the where tag like (...) part (where (...) is the subquery, omitted here for brevity) with where exists (...), and move the like comparison into the subquery.
select top 6 *
from [piarchive].[picomp2] t0
where exists
(
    select *
    from
    (
        select *
        from [piarchive].[picomp2]
        where tag like '%cpu_active'
          and time between '2014/10/02 15:13:08' and '2014/10/02 15:18:37'
          and value = -524289
    ) as t1
    where t0.tag like left(t1.tag, 19) + '%'
)
and tag not like '%cpu_active'
and tag not like '%program%'
and time between '2014/10/02 15:13:08' and '2014/10/02 15:18:37'
order by time desc;
I've added a table alias to the outer query to disambiguate the tag columns, but you can see the like comparison is shifted to within the subquery.
I can't vouch for how this will perform on large data sets, but that's a different topic. Personally, I would be looking for a way to get rid of the subquery altogether, since it's all querying the same table.
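One possible shape for that is a single pass with a window function, sketched below. This is only a sketch: it assumes tags are at least 19 characters long (so grouping on left(tag, 19) is equivalent to the prefix like join above), and it assumes your SQL layer over the PI archive supports windowed aggregates, which you would need to verify.
-- Sketch: in one scan of the time window, flag every 19-character tag-prefix
-- group that contains a CPU_Active row with value = -524289, then keep only
-- the non-CPU_Active, non-Program rows from the flagged groups.
select top 6 *
from
(
    select *,
           max(case when tag like '%CPU_Active' and value = -524289
                    then 1 else 0 end)
               over (partition by left(tag, 19)) as grp_has_cpu_active
    from [piarchive].[picomp2]
    where [time] between '2014/10/02 15:13:08' and '2014/10/02 15:18:37'
) x
where grp_has_cpu_active = 1
  and tag not like '%CPU_Active'
  and tag not like '%Program%'
order by [time] desc;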
More on optimisation
It's not going to be easy to optimise, and indexes will be of little use here, for the following reasons:
The join criteria (t0.tag like left(t1.tag, 19) + '%') is not simple, and the query optimiser may have a hard time producing anything better than nested loops (i.e., executing the subquery for every row of the outer query). This is probably your biggest performance killer right here.
None of the like comparisons can utilise table indexes, because they are checking the end of the value, not the start.
Your only hope might be if the date-range check is highly selective (eliminates a lot of records). Since the same check on the time field is performed in both outer and inner queries, you could select that into a temp table:
select left(tag, 19) as [key], *
into #working
from [piarchive].[picomp2]
where [time] between '2014/10/02 15:13:08' and '2014/10/02 15:18:37';
#working now has only the records in the specified time period. Since your example range is quite narrow (only 5 1/2 minutes), I'd wager this might knock out ~99% of records. An index on time will speed this up significantly. After you do this, you're only dealing with a tiny fraction of the data.
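If [piarchive].[picomp2] is an ordinary, indexable SQL Server table in your setup (it may not be, depending on how the PI archive is exposed), something along these lines would support that filter; the index name is only illustrative:
-- Illustrative: supports the date-range filter used to populate #working.
create nonclustered index ix_picomp2_time
    on [piarchive].[picomp2] ([time]);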
Then, possibly (see later), index [key]:
create clustered index cx_key on #working ([key]);
Then complete the rest of the query as:
select a.*
from #working a
where exists
(
select *
from #working b
where a.[key] = b.[key] and b.tag like '%cpu_active'
)
and
a.tag not like '%program%'
and
a.tag not like '%cpu_active'
What I've done is create a clustered index on the joining criteria (the first 19 chars of tag) to optimise the subquery. You'll have to test this out, as it may make no difference or even slow things down if the gains are outweighed by the cost in creating the index in the first place. This will depend on how much data you have, and other factors. I only got minimal gains by doing this (about a 5% speed increase), though I'm only running this against a few hundred rows of test data I knocked up. The more data you have, the more effective it should be.

Oracle query with multiple joins taking far too long to run [closed]

I'm trying to develop a query which involves three tables. TBL3 has 500k records, TBL6 has 10 million records, and TBL2, TBL4 and TBL5 are the same table with 20 million records. TBL1 is a subquery with a number of joins. I cannot change the database structure, but I have checked the indexes available and that hasn't helped. I've used the OUTER APPLY join at the end as I thought that may speed up my performance, but after much experimenting I still end up killing the query after 15-20 minutes.
SELECT TBL1.START_TIME,
TBL1.DEST_TIME,
TBL1.SRCADDRESS,
TBL1.POS,
TBL2.ADDRESSID AS TBL2 FROM (
SELECT TBL3.EVENTTIME,
TBL3.SOURCEADDRESS,
TBL6.FROM_POS,
TBL3.LC_NAME
FROM CUSTOMER_OVERVIEW_V TBL3
INNER JOIN CUSTOMER_SALE_RELATED TBL6 ON TBL6.LC_NAME = TBL3.LC_NAME
AND TBL6.FROM_LOC = TBL3.SOURCEADDRESS
INNER JOIN CUSTOMER TBL4 ON TBL4.CUSTID = TBL3.LC_NAME
AND TBL4.AREATYPE = 'SW'
AND TBL4.EVENTTIME <= TBL3.EVENTTIME + interval '1' second
AND TBL4.EVENTTIME >= TBL3.EVENTTIME - interval '1' second
INNER JOIN CUSTOMER TBL5 ON TBL5.CUSTID = TBL3.LC_NAME
AND TBL5.AREATYPE = 'SE'
AND TBL5.EVENTTIME <= TBL3.EVENTTIME + interval '1' second
AND TBL5.EVENTTIME >= TBL3.EVENTTIME - interval '1' second
WHERE TBL3.SOURCEADDRESS IS NOT NULL
AND extract(second from TBL5.EVENTTIME - TBL4.EventTime) * 1000 > 250
ORDER BY TBL3.EVENTTIME DESC
FETCH FIRST 500 ROWS ONLY) TBL1
OUTER APPLY (SELECT ADDRESSID
FROM CUSTOMER
WHERE AREATYPE = 'STH'
AND EVENTTIME > TBL1.DEST_TIME
ORDER BY EVENTTIME ASC
FETCH FIRST 1 ROW ONLY) TBL2;
There must be a way to structure this query better to improve the performance so any suggestions would be appreciated.
You are asking for the first 500 rows, so you only want a tiny fraction of your overall data. Therefore you want to use nested loops joins with appropriate indexes in order to get those 500 rows and be done, rather than have to process millions of rows and only then take off the top 500.
That however is complicated by the fact that you want the first 500 after ordering results. That ORDER BY will require a sort, and a sort operation will have to process 100% of the incoming rows in order to produce even its first row of output, which means you have to crunch 100% of your data and that kills your performance.
If your inner joins to TBL6, TBL4 and TBL5 are not meant to drop rows that don't have a matching key (i.e. if making those outer joins would return the same number of rows from TBL3), if you don't really need the filter on extract(second from TBL5.EVENTTIME - TBL4.EventTime) * 1000 > 250 (which requires joining TBL5 and TBL4 to evaluate), and if the CUSTOMER_OVERVIEW_V view is a simple single-table view that applies no predicates (not likely), or you can bypass the view and hit the table directly, then you can do this:
Create a 2-column function-based index (e.g. customer_eventtime_idx) on (DECODE(sourceaddress,NULL,'Y','N'), eventtime) on the table you need from the customer_overview_v view, in that exact column order.
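As a sketch (TABLE_UNDER_CUSTOMER_OVERVIEW_V is the same placeholder name used in the query below; substitute your real base table):
-- Function-based index matching the DECODE expression used in the query.
CREATE INDEX customer_eventtime_idx
    ON TABLE_UNDER_CUSTOMER_OVERVIEW_V (DECODE(SOURCEADDRESS, NULL, 'Y', 'N'), EVENTTIME);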
Rewrite the query to get the 500 rows as early as possible, preferably before any joins, using a hint to force a descending index scan on this index. You will also need to change your IS NOT NULL predicate to the same DECODE function used in the index definition:
SELECT /*+ LEADING(c) USE_NL(csr c1 c2 adr) */
[columns needed]
FROM (SELECT /*+ NO_MERGE INDEX_DESC(x customer_eventtime_idx) */
[columns you need]
FROM TABLE_UNDER_CUSTOMER_OVERVIEW_V x
WHERE DECODE(SOURCEADDRESS,NULL,'Y','N') = 'N'
ORDER BY DECODE(SOURCEADDRESS,NULL,'Y','N') DESC,EVENTTIME DESC
FETCH FIRST 500 ROWS ONLY) c
INNER JOIN CUSTOMER_SALE_RELATED ... csr
INNER JOIN CUSTOMER ... c1
INNER JOIN CUSTOMER ... c2
OUTER APPLY ([address subquery like you have it]) adr
WHERE extract(second from c.EVENTTIME - c1.EventTime) * 1000 > 250
Generate your explain plan and make sure you see the index scan descending operation on the index you created - it must say descending - and also ensure that you see an ORDER BY NOSORT step afterwards... it must say NOSORT. Creating the index as we have and ordering our query as we have was all about getting these plan operations to be selected. This is not easy to get to work right. Every detail around the inner c query must be crafted precisely to achieve the trick.
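One way to check, assuming the placeholder names above and access to DBMS_XPLAN, is to run the inner c query through EXPLAIN PLAN on its own first:
EXPLAIN PLAN FOR
SELECT /*+ NO_MERGE INDEX_DESC(x customer_eventtime_idx) */ x.EVENTTIME
  FROM TABLE_UNDER_CUSTOMER_OVERVIEW_V x
 WHERE DECODE(SOURCEADDRESS, NULL, 'Y', 'N') = 'N'
 ORDER BY DECODE(SOURCEADDRESS, NULL, 'Y', 'N') DESC, EVENTTIME DESC
 FETCH FIRST 500 ROWS ONLY;

-- Display the plan and look for the descending index scan / NOSORT operations.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);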
Explanation:
Oracle should seek/access the index on the DECODE Y/N result, so it finds the N records (those that have a non-null sourceaddress) in the first column of the index starting at the leading edge of the N values, gets the corresponding row from the table, then steps backward to the previous N value, gets that table row, then the previous, and so on, emitting rows as they are found. Because the ORDER BY matches the index exactly in the same direction (descending), Oracle will skip the SORT operation, as it knows that the data coming out of the index step will already be in the correct order.
These rows therefore stream out of the c inline view as they are found, which allows the FETCH FIRST to stop processing when it gets to 500 rows of output, without having to wait for any buffering operation (like a SORT) to complete. You only ever hit those 500 records - it never visits the rest of the rows.
You then join via nested loops to the remaining tables, but you only have 500 driving rows. Obviously you must ensure appropriate indexing on those tables for your join clauses.
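For example, indexes roughly along these lines might support those lookups (the column choices are guesses based only on the join clauses shown above, so treat them as a starting point):
-- Illustrative only: support the nested-loops lookups from the 500 driving rows.
CREATE INDEX customer_custid_area_time_idx
    ON CUSTOMER (CUSTID, AREATYPE, EVENTTIME);

CREATE INDEX customer_sale_related_name_loc_idx
    ON CUSTOMER_SALE_RELATED (LC_NAME, FROM_LOC);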
If your CUSTOMER_OVERVIEW_V view, however, does joins and applies more predicates, you simply cannot do this with the view. You will have to use the base tables and apply this trick on whatever base table has that eventtime column, then join in whatever you need that the view was joining to and reproduce any remaining needed view logic (though you might find it does more than you need, and you can omit much of it). In general, don't use views whenever you can help it. You always have more control against base tables.
Lastly, note that I did not follow your TBL1, TBL2, TBL3, etc. alias convention. That is hard to read because you have to constantly look elsewhere to see what "TBL3" means. Far better to use aliases that communicate immediately what table they are, such as the initial letter or two, or an acronym built from the first letter of each word.

TSQL Improving performance of Update cross apply like statement

I have a client with a stored procedure that currently takes 25 minutes to run. I have narrowed the cause of this to the following statement (column and table names changed):
UPDATE #customer_emails_tmp
SET #customer_emails_tmp.Possible_Project_Ref = cp.order_project_no,
#customer_emails_tmp.Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] p
WHERE e.Subject LIKE '%' + p.order_title + '%'
AND p.order_date < e.timestamp
ORDER BY p.order_date DESC
) as cp
WHERE e.Possible_Project_Ref IS NULL;
There are 3 slightly different versions of the above, each joining to one of three tables. The issue is the CROSS APPLY with LIKE '%' + p.order_title + '%'. I have tried looking into CONTAINS() and FREETEXT(), but as far as my testing and investigations go, you cannot do CONTAINS(e.Subject, p.order_title) or FREETEXT(e.Subject, p.order_title).
Have I misread something, or is there a better way to write the above query?
Any help on this is much appreciated.
EDIT
Updated query to actual query used. Execution plan:
https://www.brentozar.com/pastetheplan/?id=B1YPbJiX5
Tmp table has the following indexes:
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_first_recipient ON #customer_emails_tmp (First_Recipient);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_first_recipient_domain_name ON #customer_emails_tmp (First_Recipient_Domain_Name);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_client_id ON #customer_emails_tmp (customer_emails_client_id);
CREATE NONCLUSTERED INDEX ix_tmp_customer_emails_subject ON #customer_emails_tmp ([subject]);
There is no index on the [order] table for column order_title
Edit 2
The purpose of this SP is to link orders (amongst others) to sent emails. This is done via multiple UPDATE statements; all other update statements take less than a second; however, this one (and 2 others exactly the same but looking at 2 other tables) takes an extraordinary amount of time.
I cannot remove the filter on Possible_Project_Ref IS NULL as we only want to update the ones that are null.
Also, I cannot change WHERE e.Subject LIKE '%' + p.order_title + '%' to WHERE e.Subject LIKE p.order_title + '%' because the subject line may not start with the p.order_title, for example it could start with FW: or RE:
Reviewing your execution plan, I think the main issue is you're reading a lot of data from the order table. You are reading 27,447,044 rows just to match up to find 783 rows. Your 20k row temp table is probably nothing by comparison.
Without knowing your data or desired business logic, here's a couple things I'd consider:
Updating First Round of Exact Matches
I know you need to keep your %SearchTerm% parameters, but some data might have exact matches. So if you run an initial update for exact matches, it will reduce the ones you have to search with %SearchTerm%
Run something like this before your current update
/*Recommended index for this update*/
CREATE INDEX ix_test ON [order](order_title,order_date) INCLUDE (order_project_no, order_uid)
UPDATE #customer_emails_tmp
SET Possible_Project_Ref = cp.order_project_no
,Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] p
WHERE e.Subject = p.order_title
AND p.order_date < e.timestamp
ORDER BY p.order_date DESC
) as cp
WHERE e.Possible_Project_Ref IS NULL;
Narrowing Search Range
This will technically change your matching criteria, but there are probably certain logical assumptions you can make that won't impact the final results. Here are a couple of ideas for you to consider, to get you thinking this way, but only you know your business. The end goal should be to narrow the data read from the order table
Is there a customer id you can match on? Something like e.customerID = p.customerID? Do you really match any email to any order?
Can you narrow your search date range to something like x days before timestamp? Do you really need to search all historical orders for all of time? Would you even want a match if an email matches to an order from 5 years ago? For this, try updating your APPLY date filter to something like p.order_date BETWEEN DATEADD(dd,-30,e.[timestamp]) AND e.[timestamp]
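Putting that last idea together with your existing statement, the update might look something like this (the 30-day window is only an example; use whatever range the business can live with):
UPDATE e
SET e.Possible_Project_Ref = cp.order_project_no,
    e.Possible_Project_id  = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
    SELECT TOP 1 p.order_project_no, p.order_uid
    FROM [order] p
    WHERE e.Subject LIKE '%' + p.order_title + '%'
      -- Only consider orders from the 30 days before the email was sent.
      AND p.order_date BETWEEN DATEADD(dd, -30, e.[timestamp]) AND e.[timestamp]
    ORDER BY p.order_date DESC
) AS cp
WHERE e.Possible_Project_Ref IS NULL;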
Other Miscellaneous Notes
If I'm understanding this correctly, you are trying to link emails to some sort of project #. Ideally, when the emails are generated, they would be linked to a project immediately. I know this is not always possible resource/time wise, but the clean solution is to calculate this at the beginning of the process, not afterwards. Generally, any time you have to use fuzzy string matching, you will have data issues. I know business always wants results "yesterday" and always pushes for the shortcut, and nobody ever wants to update legacy processes, but sometimes you need to if you want clean data.
I'd review your indexes on the temp table. Generally I find the cost to create the indexes and for SQL Server to maintain them as I update the temp table is not worth it. So 9 times out of 10, I leave the temp table as a plain heap with 0 indexes
First, filter the NULLs when you create #customer_emails_tmp, not after. Then you can lose:
WHERE e.Possible_Project_Ref IS NULL. This way you are only bringing in rows you need instead of retrieving rows you don't need, then filtering them.
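As a rough sketch only - the real source of #customer_emails_tmp isn't shown in the question, so dbo.customer_emails below is a hypothetical stand-in for however the temp table is actually populated:
-- Hypothetical: push the NULL filter into the step that builds the temp table,
-- so rows that already have a project reference never land in it.
SELECT ce.*
INTO #customer_emails_tmp
FROM dbo.customer_emails ce      -- hypothetical source table
WHERE ce.Possible_Project_Ref IS NULL;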
Next, use this for your WHERE clause:
WHERE EXISTS (SELECT 1 FROM [order] AS p WHERE p.order_date < e.timestamp)
If an order date doesn't have any later timestamps in e, none of the rows in e will be considered.
Next remove the timestamp filter from your APPLY subquery. Now your subquery looks like this:
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] AS p
WHERE e.Subject LIKE '%' + p.order_title + '%'
ORDER BY p.order_date DESC
This way you are applying your "Subject Like" filter to a much smaller set of rows. The final query would look like this:
UPDATE #customer_emails_tmp
SET #customer_emails_tmp.Possible_Project_Ref = cp.order_project_no,
#customer_emails_tmp.Possible_Project_id = cp.order_uid
FROM #customer_emails_tmp e
CROSS APPLY (
SELECT TOP 1 p.order_project_no, p.order_uid
FROM [order] p
WHERE e.Subject LIKE '%' + p.order_title + '%'
ORDER BY p.order_date DESC
) as cp
WHERE EXISTS (SELECT 1 FROM [order] AS p WHERE p.order_date < e.timestamp);

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected, and returns in this particular example for each row a count of positive and negative excesses (range + / - 10) given the defined time window (+ 1000 [milliseconds]).
However, I will have to run such queries for tables with several million (perhaps even 100+ million) entries, and even with about 500k rows, the query takes a looooooong time to complete. Furthermore, whereas the time frame remains always the same within a given query [but the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive / negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
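For example (the index name is illustrative):
-- Multicolumn index so the correlated range scan can become an index-only scan.
CREATE INDEX tablea_sys_timestamp_event_count_idx
    ON tableA (sys_timestamp, event_count);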
Depending on exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step includes creating parent-child detail rows - table multiplied by itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary key, or that you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount >= bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount <= bq.nEndEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
With statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed, and perhaps how to minimize the rows the query needs to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
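In PostgreSQL 13 you can check those plans with EXPLAIN; note that ANALYZE actually executes the statement, so try it on a cut-down copy of the table first. For instance, to look at baseQuery on its own:
-- Actual plan, timings and buffer usage for the pair-building step.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.event_count AS startEventCount,
       t2.event_count AS t2EventCount
FROM tableA t1, tableA t2
WHERE t2.sys_timestamp BETWEEN t1.sys_timestamp AND t1.sys_timestamp + 1000;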

LEFT JOIN with ROW NUM DB2

I am using two tables having a one-to-many mapping (in DB2).
I need to fetch 20 records at a time using ROW_NUMBER from the two tables using a LEFT JOIN. But due to the one-to-many mapping, the result is not consistent. I might be getting 20 records, but those records do not contain 20 unique records of the first table.
SELECT
A.*,
B.*,
ROW_NUMBER() OVER (ORDER BY A.COLUMN_1 DESC) as rn
from
table1 A
LEFT JOIN
table2 B ON A.COLUMN_3 = B.COLUMN3
where
rn between 1 and 20
Please suggest some solution.
Sure, this is easy... once you know that you can use subqueries as a table reference:
SELECT <relevant columns from Table1 and Table2>, rn
FROM (SELECT <relevant columns from Table1>,
ROW_NUMBER() OVER (ORDER BY <relevant columns> DESC) AS rn
FROM table1) Table1
LEFT JOIN Table2
ON <relevant equivalent columns>
WHERE rn >= :startOfRange
AND rn < :startOfRange + :numberOfElements
For production code, never do SELECT * - always explicitly list the columns you want (there are several reasons for this).
Prefer inclusive lower-bound (>=), exclusive upper-bound (<) for (positive) ranges. For everything except integral types, this is required to sanely/cleanly query the values. Do this with integral types both to be consistent, as well as for ease of querying (note that you don't actually need to know which value you "stop" on). Further, the pattern shown is considered the standard when dealing with iterated value constructs.
Note that this query currently has two problems:
You need to list sufficient columns for the ORDER BY to return consistent results. This is best done by using a unique value - you probably want something in an index that the optimizer can use (see the sketch after this list).
Every time you run this query, you (usually) have to order the ENTIRE set of results before you can get whatever slice of them you want (especially for anything after the first page). If your dataset is large, look at the answers to this question for some ideas for performance improvements. The optimizer may be able to cache the results for you, but it's not guaranteed (especially on tables that receive many updates).
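On the first point, a sketch of what that could look like; ID here is an assumed unique column of table1, used purely as a tie-breaker:
SELECT t1.COLUMN_1, t1.COLUMN_3, t2.COLUMN3, t1.rn
FROM (SELECT COLUMN_1, COLUMN_3, ID,
             -- ID breaks ties so the row numbering is deterministic
             ROW_NUMBER() OVER (ORDER BY COLUMN_1 DESC, ID DESC) AS rn
      FROM table1) t1
LEFT JOIN table2 t2
       ON t1.COLUMN_3 = t2.COLUMN3
WHERE t1.rn >= 1
  AND t1.rn < 21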

How to make this SQL query using IN (with many numeric IDs) more efficient?

I've been waiting over an hour already for this query, so I know I'm probably doing something wrong. Is there an efficient way to tailor this query?
select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID in (
1418283,
1419863,
1421188,
1422101,
1431384,
1435526,
1437284,
1441394,
/* etc etc THOUSANDS */
1579244 )
and EntryDate between
'07-11-2011' and '07-31-2012'
GROUP BY RespondentID
I know that my date range is pretty big, but I can't change that part (the dates are spread all over).
Also, the reason for MIN(SessionID) is because otherwise we get many SessionIDs for each Respondent, and one suffices (it's taking MIN on an alphanumeric ID like ach2a23a-adhsdx123... and getting the first alphabetically).
Thanks
Put your thousands of numbers in a temporary table.
Index the number field in that table.
Index the RespondentID field in BIG_SESSIONS
Join the two tables
eg:
select BIG_Sessions.RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
inner join RespondentsFilterTable
on BIG_SESSIONS.RespondentID = RespondentsFilterTable.RespondentID
where EntryDate between '07-11-2011' and '07-31-2012'
GROUP BY BIG_Sessions.RespondentID
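For completeness, building and indexing the filter table might look something like this (shown as a local temp table with illustrative names; the join above would then reference #RespondentsFilterTable):
-- Load the thousands of IDs once, then index them for the join.
CREATE TABLE #RespondentsFilterTable (RespondentID int NOT NULL);

INSERT INTO #RespondentsFilterTable (RespondentID)
VALUES (1418283), (1419863), (1421188) /* ... etc etc THOUSANDS ... */;

CREATE CLUSTERED INDEX ix_respondents_filter
    ON #RespondentsFilterTable (RespondentID);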
You could add indexes to EntryDate and SessionID as well, but if you're adding to BIG_Sessions frequently, this could be counterproductive elsewhere.
In general, you can get hints about how the performance of a query can be improved by studying the estimated (or, if possible, actual) execution plans.
If the smallest and largest ids in the IN statement are known beforehand, and depending on how many ids are in the table, then adding RespondentID > [smallest_known_id - 1] AND RespondentID < [largest_known_id + 1] before the IN statement would help limit the problem.
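Concretely, that just means wrapping a range check around the IN list; the bounds below come from the first and last ids shown in the question:
-- The range predicate lets the optimizer seek a narrow slice of a RespondentID
-- index before the long IN list is evaluated.
select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID > 1418282
  and RespondentID < 1579245
  and RespondentID in (1418283, 1419863, /* etc etc THOUSANDS */ 1579244)
  and EntryDate between '07-11-2011' and '07-31-2012'
group by RespondentID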