I've got a query which I want to depend on certain parameters from an Excel sheet, but I get the error 'Parameters are not allowed in queries that can't be displayed graphically'. The only way around it that I can see is to use a view, but how much extra overhead would that add? It would mean joining two tables (one with nearly 70,000 and one with over 200,000 records, both having around 40 fields) into that view, while probably using only 5 or so of the total 80 fields. We do not have a test server. Alternatively, is there a way to change the following query into one for which Microsoft Query does allow parameters?
select count(distinct a) from table1 where b=0 and c < '2010-01-01' and a in
(select a from table2 where d between '2010-01-01' and '2010-12-31')
or as a join:
select count(distinct table1.a) from table1 inner join table2 on (table1.a=table2.a
and table2.d between '2010-01-01' and '2010-12-31') where table1.c < '2010-01-01'
and table1.b=0
I want to replace the dates (for c and d) with cell values.
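For instance, something along these lines, with ? placeholders where the cell values should go (I don't know whether Microsoft Query will accept this form; the date condition is simply moved from the join condition into the WHERE clause):
select count(distinct table1.a)
from table1 inner join table2 on table1.a = table2.a
where table1.b = 0 and table1.c < ? and table2.d between ? and ?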
Thanks,
Ernst
Have you considered migrating the data from Excel to SQL Server tables and then executing the query there? Most DBMSs have tools for data migration.
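If you go that route, one way to pull the Excel data into SQL Server is shown below (only a sketch; the provider name, file path, and sheet name are assumptions about your setup, and ad hoc distributed queries must be enabled):
-- requires the ACE OLE DB provider and the 'Ad Hoc Distributed Queries' option to be enabled
SELECT *
INTO dbo.ImportedSheet
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 12.0;Database=C:\path\to\workbook.xlsx;HDR=YES',
                'SELECT * FROM [Sheet1$]');
After that, the parameter values could be read from the imported table instead of being passed in from Excel.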
I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count    sys_timestamp
100            167877672772
110            167877672769
121            167877672987
111            167877673877
...            ...
With both fields defined as numeric.
With the help of answers from Stack Overflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+1000 [milliseconds]).
However, I will have to run such queries on tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a very long time to complete. Furthermore, whereas the time frame always remains the same within a given query (though the window size can change from query to query), in some instances I will have to use maybe 10 additional conditions similar to the positive/negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
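For example, if sys_timestamp actually holds Unix epoch milliseconds (that's an assumption on my part; adjust the divisor if it doesn't), the conversion could look like this:
-- assumes sys_timestamp stores Unix epoch milliseconds
ALTER TABLE tableA
  ALTER COLUMN sys_timestamp TYPE timestamptz
  USING to_timestamp((sys_timestamp / 1000.0)::double precision);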
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
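A minimal sketch of that index:
CREATE INDEX ON tableA (sys_timestamp, event_count);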
Depending on the exact data distribution (most importantly how much the time frames overlap) and other DB characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step includes creating parent-child detail rows - table multiplied by itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary index or you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
  SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
       , t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
  FROM tableA t1, tableA t2
  where t2.sys_timestamp > t1.sys_timestamp
    AND t2.sys_timestamp <= t1.sys_timestamp + 1000
), posSummary as
(
  select bq.startEventCount, count(*) as positive
  from baseQuery bq
  where bq.t2EventCount >= bq.pEndEventCount
  group by bq.startEventCount
), negSummary as
(
  select bq.startEventCount, count(*) as negative
  from baseQuery bq
  where bq.t2EventCount <= bq.nEndEventCount
  group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count = ps.startEventCount
inner join negSummary ns on t1.event_count = ns.startEventCount;
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
With statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and perhaps how to minimize the rows the query needs to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
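For example, to see how the baseQuery part alone behaves (note that EXPLAIN ANALYZE actually executes the statement, so try it on a sample first):
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM tableA t1, tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
  AND t2.sys_timestamp <= t1.sys_timestamp + 1000;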
I have a query that requires me to join/refer to the same table; however, I am unable to get a result using the query.
Below is a sample of my query
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
where a."x" = b."x"
AND NOT a."y" = b."y"
This query takes forever to load. However, if I just run:
SELECT a."column1"
FROM table1 AS A
it only takes 14sec.
I'm currently using PostgreSQL with pgAdmin. table1 currently has 1.4 million rows.
Is it because there is a lock on table1 when it is first referred to as a?
EDIT: Each row contains a record of "author" and "book published", and in this case there might be many authors for a book, hence collaborators. What I am trying to achieve is to find out the number of collaborators for each author.
What I am trying to achieve is to find out the number of collaborators for each author
Something like this would count the number of authors, and I guess where that number is greater than 1, the number of collaborators is that number - 1
select b.name, count(a.*)-1 as num_collaborators
from books b
inner join authors a on b.id = a.book_id
group by b.name
having count(a.*) > 1
--original
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
;
--amended
SELECT a."column1", b."column1" as anotherColumn
FROM table1 AS a, table2 AS b
where a."x" = b."x"
AND NOT a."y" = b."y"
Over 25 years ago, the ANSI standard for SQL introduced a more "explicit" syntax for joins, and using it is now well established as best practice.
One of the greatest benefits of this explicit join syntax is that accidentally forgetting to join properly becomes impossible, unlike the original query, which did forget the joining predicate. (When that happens, an unexpected Cartesian product is produced.)
So, I encourage you to stop using commas between table names. Taking that simple step will help you use better join syntax.
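For example, the amended query from above rewritten with explicit join syntax would look like this:
SELECT a."column1", b."column1" AS anotherColumn
FROM table1 AS a
INNER JOIN table2 AS b
        ON a."x" = b."x"
       AND a."y" <> b."y";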
I have an issue with a query I have written for SAP HANA.
There are basically two tables.
The first table is a dates table that contains a row for every single day in a calendar. The second table is a results table containing a customer reference number and, for each customer reference number, a start date and an end date. In this customer ref table I have approximately 4 million records. So essentially, in the inner part of the query, I would be getting 4 million records for each day since 2011-01-01. There must be a simple way of aggregating the results. I have tried an inner select query; however, it seems like HANA is having performance issues.
I have written the code like this, however this is not optimal.
select date_sql, count(*) as count
from (
select date_sql
from tbl_ref_cal_link tbl_date
where date_sql between '2011-01-01' and add_days (to_date(current_date, 'YYYY-MM-DD'), -1)
)tbl_date
Left join #cust_ref_table M1
On tbl_date.date_sql between m1.startdate and m2.enddate)z
I would appreciate anyone's help or suggestions.
You could use GROUP BY here.
And you need to change the m2 in the join condition to m1, as in the following SQLScript code:
select
date_sql, count(m1.CustomerId) as count
from (
-- dates table here
) tbl_date
Left join cust_ref_table m1 On tbl_date.date_sql between m1.startdate and m1.enddate
group by date_sql
I am trying to perform a cumulative sum of values in SQLite. I initially only needed to sum a single column and had the code
SELECT
t.MyColumn,
(SELECT Sum(r.KeyColumn1) FROM MyTable as r WHERE r.Date < t.Date)
FROM MyTable as t
Group By t.Date;
which worked fine.
Now I wanted to extend this to more columns KeyColumn2 and KeyColumn3 say. Instead of adding more SELECT statements I thought it would be better to use a join and wrote the following
SELECT
t.MyColumn,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM MyTable as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
However this does not give me the correct answer (instead it gives values that are much larger than expected). Why is this and how could I correct the JOIN to give me the correct answer?
You are likely getting what I would call mini-Cartesian products: your Date values are probably not unique and, as a result of the self-join, you are getting matches for each of the non-unique values. After grouping by Date the results are just multiplied accordingly.
To solve this, the left side of the join must be rid of duplicate dates. One way is to derive a table of unique dates from your table:
SELECT DISTINCT Date
FROM MyTable
and use it as the left side of the join:
SELECT
t.Date,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM (SELECT DISTINCT Date FROM MyTable) as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
I noticed that you used t.MyColumn in the SELECT clause, while your grouping was by t.Date. If that was intentional, you may be relying on undefined behaviour there, because the t.MyColumn value would probably be chosen arbitrarily among the (potentially) many in the same t.Date group.
For the purpose of this example, I assumed that you actually meant t.Date, so, I replaced the column accordingly, as you can see above. If my assumption was incorrect, please clarify.
Your join is not working because it finds many more possibilities to join than your subselect does.
The join is exploding your table.
The subselect takes the sum of all records whose date is lower than that of the current record.
The join, however, joins every row multiple times, as long as its date is lower than the current record's. This means a single record can produce as many joined rows as there are records with a lower date. That results in multiple rows and, in the end, a higher SUM.
If you want the sum of multiple columns, you will have to use three subqueries or define a unique join.
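For example, the three-subquery version (a sketch following the pattern of your original single-column query) would look like this:
SELECT
  t.Date,
  (SELECT Sum(r.KeyColumn1) FROM MyTable AS r WHERE r.Date < t.Date) AS RunningSum1,
  (SELECT Sum(r.KeyColumn2) FROM MyTable AS r WHERE r.Date < t.Date) AS RunningSum2,
  (SELECT Sum(r.KeyColumn3) FROM MyTable AS r WHERE r.Date < t.Date) AS RunningSum3
FROM MyTable AS t
GROUP BY t.Date;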
I've got a SQL query that joins a pricing table to a table containing user-provided answers. My query is used to get the price based on the entered quantity. Below is my SQL statement:
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK=JobQuestion.QuestionFK
AND JobQuestion.JobFK=1
WHERE Price.Min <= JobQuestion.Value
AND Price.Max >= JobQuestion.Value
The problem is that SQL Server is running the WHERE clause before the JOIN, and it is throwing the error:
Conversion failed when converting the
varchar value 'TEST' to data type int.
because it is doing the min and max comparisons before the join ('TEST' is a valid user-entered value in the JobQuestion table, but should not be returned when JobQuestion is joined to Price). I believe SQL Server is choosing to run the WHERE first because for some reason the optimizer thinks that would be a more efficient query. If I just run
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK=JobQuestion.QuestionFK
AND JobQuestion.JobFK=1
I get these results back:
Value   Min    Max    Amount
500     1      500    272.00
500     501    1000   442.00
500     1001   2000   782.00
So, adding the WHERE should filter out the last two and just return the first record. How do I force SQL to run the JOIN first or use another technique to filter out just the records I need?
Try "re-phrasing" the query as follows:
SELECT *
FROM (
SELECT JobQuestion.Value,
Price.Min,
Price.Max,
Price.Amount
FROM Price
INNER
JOIN JobQuestion
ON Price.QuestionFK = JobQuestion.QuestionFK
AND JobQuestion.JobFK = 1
) SQ
WHERE SQ.Min <= SQ.Value
AND SQ.Max >= SQ.Value
As per the answer from Christian Hayter, if you have the choice, change the table design =)
You shouldn't be comparing strings to ints. If you have any influence at all over your table design, then split the two different uses of the JobQuestion.Value column into two different columns.
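If changing the design is possible, one possible shape is sketched below (the new column name is just an illustration, not something from your schema):
-- add a typed column and backfill it from rows whose Value is purely digits
ALTER TABLE JobQuestion ADD NumericValue int NULL;
GO  -- separate batch so the new column is visible to the UPDATE
UPDATE JobQuestion
SET NumericValue = CAST(Value AS int)
WHERE Value NOT LIKE '%[^0-9]%' AND Value <> '';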
First, this is very likely a sign of poor design.
If you cannot change the schema, then maybe you could force this behavior using hints. Quote:
Hints are options or strategies specified for enforcement by the SQL Server query processor on SELECT, INSERT, UPDATE, or DELETE statements. The hints override any execution plan the query optimizer might select for a query.
And some more:
Caution:
Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend that <join_hint>, <query_hint>, and <table_hint> be used only as a last resort by experienced developers and database administrators.
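For example, one hint that could be tried here is FORCE ORDER, which tells the optimizer to keep the join order as written (it is not guaranteed to change when the predicates are evaluated, so treat this purely as an illustration of the hint syntax):
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount
FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK = JobQuestion.QuestionFK
AND JobQuestion.JobFK = 1
WHERE Price.Min <= JobQuestion.Value
AND Price.Max >= JobQuestion.Value
OPTION (FORCE ORDER);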
In case you have no influence over your table design, could you try to filter out the records with non-numeric values using ISNUMERIC()? I would guess adding this to your WHERE clause could help.
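For example (a sketch; note that SQL Server is free to reorder plain WHERE predicates, so wrapping the cast in a CASE expression is the more reliable way to let ISNUMERIC() guard it):
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount
FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK = JobQuestion.QuestionFK
AND JobQuestion.JobFK = 1
WHERE Price.Min <= CASE WHEN ISNUMERIC(JobQuestion.Value) = 1
                        THEN CAST(JobQuestion.Value AS int) END
AND Price.Max >= CASE WHEN ISNUMERIC(JobQuestion.Value) = 1
                      THEN CAST(JobQuestion.Value AS int) END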
You can likely remove the WHERE and just add those conditions as predicates to your join. Since it's an inner join, this should work:
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount
FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK=JobQuestion.QuestionFK
AND JobQuestion.JobFK=1
AND Price.Min <= JobQuestion.Value
AND Price.Max >= JobQuestion.Value
You can use TRY_PARSE over those string columns to convert to numeric; if SQL Server cannot convert a value, it will give you NULL instead of an error message.
P.S. TRY_PARSE was first introduced in SQL Server 2012, so this might be helpful.
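A sketch of how it could be applied to the query above (rows whose Value doesn't parse become NULL and simply fail the comparison):
SELECT JobQuestion.Value, Price.Min, Price.Max, Price.Amount
FROM Price
INNER JOIN JobQuestion
ON Price.QuestionFK = JobQuestion.QuestionFK
AND JobQuestion.JobFK = 1
WHERE Price.Min <= TRY_PARSE(JobQuestion.Value AS int)
AND Price.Max >= TRY_PARSE(JobQuestion.Value AS int)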