SQL - Insert using Column based on SELECT result - sql

I currently have a table called tempHouses that looks like:
avgprice | dates | city
dates are stored as yyyy-mm-dd
However I need to move the records from that table into a table called houses that looks like:
city | year2002 | year2003 | year2004 | year2005 | year2006
The information in tempHouses contains average house prices from 1995 - 2014.
I know I can use SUBSTRING to get the year from the dates:
SUBSTRING(dates, 1, 4)
So basically, for each city in tempHouses.city, I need to get the average house price from the above years into one record.
Any ideas on how I would go about doing this?

This is an SQL Server approach, and a PIVOT may be better, but here's one way:
SELECT City,
AVG(year2002) AS year2002,
AVG(year2003) AS year2003,
AVG(year2004) AS year2004
FROM (
SELECT City,
-- No ELSE branch: rows from other years become NULL, which AVG ignores
-- (ELSE 0 would drag every average towards zero)
CASE WHEN Dates BETWEEN '2002-01-01T00:00:00' AND '2002-12-31T23:59:59' THEN avgprice
END AS year2002,
CASE WHEN Dates BETWEEN '2003-01-01T00:00:00' AND '2003-12-31T23:59:59' THEN avgprice
END AS year2003,
CASE WHEN Dates BETWEEN '2004-01-01T00:00:00' AND '2004-12-31T23:59:59' THEN avgprice
END AS year2004
-- Repeat for each year
FROM tempHouses
) AS t -- SQL Server requires an alias on a derived table
GROUP BY City
The inner query gets the data into the correct format for each record (City, year2002, year2003, year2004), whilst the outer query gets the average for each City.
There may be many ways to do this, and performance may be the deciding factor in which one to choose.
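As a rough, runnable sketch of the same conditional-aggregation idea, the following uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the syntax differs slightly (substr instead of a date range). The sample rows, the two-column houses table, and the city names are invented for illustration. It leaves out the ELSE branch so that rows from other years become NULL and are skipped by AVG, and it writes the result straight into houses with INSERT ... SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tempHouses (avgprice REAL, dates TEXT, city TEXT);
    CREATE TABLE houses (city TEXT PRIMARY KEY, year2002 REAL, year2003 REAL);
    -- Hypothetical sample rows, invented for this sketch
    INSERT INTO tempHouses VALUES
        (100.0, '2002-01-01', 'Leeds'),
        (200.0, '2002-07-01', 'Leeds'),
        (300.0, '2003-01-01', 'Leeds'),
        (500.0, '2003-01-01', 'York');
""")

# One CASE per target column. A CASE without ELSE yields NULL for
# non-matching rows, and AVG skips NULLs.
conn.execute("""
    INSERT INTO houses (city, year2002, year2003)
    SELECT city,
           AVG(CASE WHEN substr(dates, 1, 4) = '2002' THEN avgprice END),
           AVG(CASE WHEN substr(dates, 1, 4) = '2003' THEN avgprice END)
    FROM tempHouses
    GROUP BY city
""")

rows = conn.execute(
    "SELECT city, year2002, year2003 FROM houses ORDER BY city"
).fetchall()
print(rows)
```

A city with no rows for a given year ends up with NULL in that column rather than 0, which is usually what you want for a missing average.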

The best way would be to use a script to run the queries for you, because you will need to run them multiple times, extracting the data one year at a time. Make sure that the only required columns in the target table are city and the row id:
http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
INSERT INTO <table> (city) SELECT DISTINCT `city` FROM <old_table>;
Then for each city extract the average values, insert them into a temporary table and then insert into the main table.
SELECT AVG(avgprice), SUBSTRING(dates, 1, 4) AS yr FROM <old_table> WHERE city = <city> GROUP BY SUBSTRING(dates, 1, 4);
Otherwise you're looking at a combination query using joins and potentially unions to extrapolate the data. Because you're flattening the table into a single row per city it's going to be a little tough to do. You should create indexes first on the date column if you don't want the database query to fail with memory limits or just take a very long time to execute.

Related

SQL Server calculate average scores from 6 possible columns with Null and Not Null values

I have a table and want to get the average score for each student.
To be more specific, scoremonth1 takes priority over scoremonth2, which takes priority over scoremonth3, and so on (1>2>3>4>5>6), and no more than 3 monthly scores from the table should be included.
For instance, the average score for Tom should be (80+90)/2 since there are only 2 scores available. As for Marry, the average score should be (90+70+80)/3 since those are the three monthly scores with more weight. And again, for Anna, the average score should be (90+100+70)/3
In my case, there would be over 100 students. Apart from listing all the possible cases (CASE WHEN scoremonth1 IS NOT NULL, scoremonth2 IS NULL, etc.) to calculate the average, what other method could do the calculation dynamically?
I know there is a SQL function coalesce to return the first not null value, but how could I get the second and third not null values? And is there a way to track which monthlyscores are added up? I really appreciate your help!
Stu mentioned your underlying issue. To normalize your data without changing table design you can use cross apply...
select student, sum(score)
from table
cross apply (
values(1,scoremonth1),(2,scoremonth2),(3,scoremonth3)) as scores(month,score)
group by student
I strongly suggest you redesign so you don't have to manage this query when adding months by creating a new table called studentScores.
create table studentscores
(
student varchar(200)
,scoremonth int
,score decimal(5,2)
)
And then populate it like this...
insert into studentScores(student,scoremonth,score)
select ca.ca1, ca.ca2, ca.ca3 -- select * would also return every column of the source table
from table
cross apply
(values
(student,1,scoremonth1)
,(student,2,scoremonth2)
,(student,3,scoremonth3)
,(student,4,scoremonth4)
,(student,5,scoremonth5)
,(student,6,scoremonth6)
) ca(ca1,ca2,ca3)
where ca3 is not null
And finally, use it like this...
select ss.student, sum(score), count(*) NumOfScores, sum(score)/Count(*) avg
from table
join studentscores ss on ss.student=table.student
where ss.scoremonth between 1 and 3
group by ss.student
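The query above averages months 1 to 3 only. To pick each student's first three non-NULL months instead, as the question asks, one option is to rank the normalized rows per student. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (so a loop of INSERT ... SELECT statements replaces CROSS APPLY). The table name scores_wide and the positions of the NULLs are assumptions, since the original table is not shown; the scores are chosen to match the question's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores_wide (
        student TEXT,
        scoremonth1 REAL, scoremonth2 REAL, scoremonth3 REAL,
        scoremonth4 REAL, scoremonth5 REAL, scoremonth6 REAL
    );
    -- Hypothetical rows: the NULL positions are invented for this sketch
    INSERT INTO scores_wide VALUES
        ('Tom',   80,   NULL, 90,   NULL, NULL, NULL),
        ('Marry', NULL, 90,   70,   80,   60,   NULL),
        ('Anna',  90,   100,  NULL, 70,   50,   40);
    CREATE TABLE studentscores (student TEXT, scoremonth INTEGER, score REAL);
""")

# Unpivot: one INSERT ... SELECT per month column, keeping non-NULL scores only.
# The column names come from range(), not from user input.
for month in range(1, 7):
    conn.execute(
        f"INSERT INTO studentscores "
        f"SELECT student, {month}, scoremonth{month} FROM scores_wide "
        f"WHERE scoremonth{month} IS NOT NULL"
    )

# Keep a score only if fewer than three earlier months have a score for
# the same student, i.e. it is among that student's first three.
averages = conn.execute("""
    SELECT s.student, AVG(s.score) AS avg_score, COUNT(*) AS num_scores
    FROM studentscores s
    WHERE (SELECT COUNT(*) FROM studentscores p
           WHERE p.student = s.student
             AND p.scoremonth < s.scoremonth) < 3
    GROUP BY s.student
    ORDER BY s.student
""").fetchall()
print(averages)
```

In SQL Server the same ranking could be written with ROW_NUMBER() OVER (PARTITION BY student ORDER BY scoremonth) and a filter on the first three rows; the correlated count is used here only to keep the sketch portable.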

How to aggregate data stored column-wise in a matrix table

I have a table; ellipses (...) represent multiple columns of a similar type.
TABLE: diagnosis_info
COLUMNS: visit_id,
patient_diagnosis_code_1 ...
patient_diagnosis_code_100 -- char(100) with a value of '0' or '1'
How do I find the most common diagnosis_code? There are 101 columns including the visit_id. The table is like a matrix table of 0s and 1s. How do I write something that can dynamically account for all the columns and count all the rows where the value is 1?
What I would normally do is not feasible as there are too many columns:
SELECT COUNT(patient_diagnosis_code_1), COUNT(patient_diagnosis_code_2), ... FROM diagnosis_info WHERE patient_diagnosis_code_1 = '1' AND patient_diagnosis_code_2 = '1' AND ...
Then even if I typed all that out how would I select which column had the highest count of values = 1. The table is more column oriented instead of row oriented.
Unfortunately your data design is bad from the start. Instead it could be as simple as:
patient_id, visit_id, diagnosis_code
where a patient with 1 diagnostic code would have 1 row and a patient with 100 diagnostic codes would have 100 rows. At any given time you could transpose this into the format you presented (what is called a pivot or cross tab). Also, in some databases, for example PostgreSQL, you could put all those diagnostic codes into an array field; then it would look like:
patient_id, visit_id, diagnosis_code (data type -bool or int- array)
Now you need the reverse of it, which is called unpivot. Some databases have direct support for this; SQL Server, for example, has UNPIVOT.
Without knowing what your backend is, you could do that with an ugly SQL like:
select code, pdc
from
(
select 1 as code, count(*) as pdc
from myTable where patient_diagnosis_code_1=1
union
select 2 as code, count(*) as pdc
from myTable where patient_diagnosis_code_2=1
union
...
select 100 as code, count(*) as pdc
from myTable where patient_diagnosis_code_100=1
) tmp
order by pdc desc, code;
PS: This returns every code with its frequency, ordered from most to least frequent. You could limit the result to 1 row to get the max (with ties, in case more than one code matches the max).
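Typing out 100 branches by hand is error-prone, so the UNION query can be generated instead. The following is a minimal sketch using Python's built-in sqlite3 module as an assumed backend; it uses 5 columns rather than 100, and the visit rows are invented for illustration. It uses UNION ALL, which skips the duplicate-elimination step that plain UNION performs (the branches never collide because each has a distinct code).

```python
import sqlite3

NUM_CODES = 5  # the question has 100; kept small for this sketch

conn = sqlite3.connect(":memory:")
cols = [f"patient_diagnosis_code_{i}" for i in range(1, NUM_CODES + 1)]
conn.execute(
    "CREATE TABLE diagnosis_info (visit_id INTEGER PRIMARY KEY, "
    + ", ".join(f"{c} CHAR(1)" for c in cols)
    + ")"
)

# Hypothetical visits (visit_id followed by one '0'/'1' flag per code)
placeholders = ", ".join("?" * (NUM_CODES + 1))
conn.executemany(
    f"INSERT INTO diagnosis_info VALUES ({placeholders})",
    [
        (1, "1", "0", "1", "0", "0"),
        (2, "0", "0", "1", "0", "1"),
        (3, "1", "0", "1", "0", "0"),
    ],
)

# Build one "SELECT <n> AS code, COUNT(*) ..." branch per column.
# Column names are generated here, never taken from user input.
branches = [
    f"SELECT {i} AS code, COUNT(*) AS pdc FROM diagnosis_info WHERE {c} = '1'"
    for i, c in enumerate(cols, start=1)
]
query = " UNION ALL ".join(branches) + " ORDER BY pdc DESC, code"

counts = conn.execute(query).fetchall()
print(counts)  # most common code first
```

The first row of counts is the most common diagnosis code; codes that never occur still appear, with a count of 0.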

Count number of rows returned in a SQL statement

Are there any DB engines that allow you to run an EXPLAIN (or other function) where it will give you an approximate count of values that may be returned before an aggregation is run (not rows scanned but that actually would be returned)? For example, in the following query:
SELECT gender, COUNT(1) FROM sales JOIN (
SELECT id, person FROM sales2 WHERE country='US'
GROUP BY person_id
) USING (id)
WHERE sales.age > 20
GROUP BY gender
Let's say this query returns 3 rows after being aggregated, but would return 170M rows if unaggregated.
Are there any tools where you can run the query to get this '170M' number or does this have to do with complexity theory (or something similar) where it's almost just as expensive to run the query (without the final aggregation/having/sort/limit/etc) to get the count? In other words, doing a rewrite to:
SELECT COUNT(1) FROM sales JOIN (
SELECT id, person FROM sales2 WHERE country='US'
GROUP BY person_id
) USING (id)
WHERE sales.age > 20
But having to execute the query nonetheless.
As an example of using the current (mysql) explain to show how 'off' it is to get what I'm looking for:
explain select * from movies where title>'a';
# rows=147900
select count(1) from movies where title>'a';
# 144647 --> OK, pretty close
explain select * from movies where title>'u';
# rows=147900
select count(1) from movies where title>'u';
# 11816 --> Not close at all
Assuming you can use MS SQL Server, you could tap into the same data the Optimiser is using for cardinality estimation: DBCC SHOW_STATISTICS (table, index) WITH HISTOGRAM
One of the data sets you get back is a per-column histogram, which is essentially the number of rows for each value range found in the table.
You probably want to query the data programmatically, one way to achieve this would be to insert it into a temp table:
CREATE TABLE #histogram (
RANGE_HI_KEY datetime PRIMARY KEY,
RANGE_ROWS INT,
EQ_ROWS INT,
DISTINCT_RANGE_ROWS INT,
AVG_RANGE_ROWS FLOAT
)
INSERT INTO #histogram
EXEC ('DBCC SHOW_STATISTICS (Users, CreationDate) WITH HISTOGRAM')
SELECT 'Estimate', SUM(RANGE_ROWS+EQ_ROWS) FROM #histogram WHERE RANGE_HI_KEY BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
UNION ALL
select 'Actual', COUNT(1) from Users u WHERE u.CreationDate BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
For example, here is what this same query returns when run against the Stack Overflow database.
| -------- | ----- |
| Estimate | 98092 |
| Actual | 11715 |
The difference seems like a lot, but keep in mind that the whole table has almost 15 million records.
A note on precision and other gotchas
The maximum number of histogram steps is capped at 200, which is not a lot, so you are not guaranteed a 10% margin of error, but neither is SQL Server.
As you insert data into the table, histograms may get stale, so your results would get skewed even more.
There are different ways to update this data; some are reasonably quick, while others effectively require a full table scan.
Not all columns will have statistics. You can either create them manually or (I believe) they get created automatically if you run a search with the column as a predicate.
MS SQL Server offers "execution plans". In the picture below I have two queries, and I press Ctrl-L to see the plans.
In my queries I return all records in the first and just the count in the other, using the same table.
Look at the metric marked by the red arrows: the estimated number of rows that will be scanned when the queries are run. In this case, that number is the same regardless of whether the query uses count(*) or *, which is exactly your point!

Filling one table using another table in SQL Server

I have two SQL tables as follows:
As you may note, the first table has a monthly frequency (date column), while the second table has a quarterly frequency. Here is what I would like to do:
For each issueid from table 1, I would like to look at the date, determine what is the previous end of quarter, and go fetch data from table 2 corresponding to that issue for that end of quarter, and insert it in the first table in the last two columns.
For example: take issueid 123456 and date 1/31/2014. The previous end of quarter is 12/31/2013. I would like to go to table 2, copy the q_exp and q_act that correspond to that issueid and 12/31/2013, and paste them into the first table.
Of course, I would like to fill the entire first table and minimize manual inserts.
Any help would be appreciated! Thanks!
Try the following query
UPDATE issues
SET q_exp=(SELECT TOP 1 q.q_exp
FROM quarterlyTable q
WHERE q.issueid=i.issueid
AND q.[date]<=i.[date]
ORDER BY q.[date] DESC)
,q_act= (SELECT TOP 1 q.q_act
FROM quarterlyTable q
WHERE q.issueid=i.issueid
AND q.[date]<=i.[date]
ORDER BY q.[date] DESC)
FROM issues i
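To see how the correlated subqueries behave, here is a small sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (LIMIT 1 replaces TOP 1, and the UPDATE ... FROM alias is dropped). The original tables are not shown, so the column layout and all values below are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (issueid INTEGER, date TEXT, q_exp REAL, q_act REAL);
    CREATE TABLE quarterlyTable (issueid INTEGER, date TEXT, q_exp REAL, q_act REAL);
    -- Hypothetical values; ISO date strings sort in calendar order
    INSERT INTO issues (issueid, date) VALUES
        (123456, '2014-01-31'),
        (123456, '2014-02-28'),
        (123456, '2014-04-30');
    INSERT INTO quarterlyTable VALUES
        (123456, '2013-12-31', 1.0, 1.5),
        (123456, '2014-03-31', 2.0, 2.5);
""")

# For every monthly row, copy the values from the latest quarterly row
# for the same issue that is dated on or before the monthly date.
conn.execute("""
    UPDATE issues
    SET q_exp = (SELECT q.q_exp FROM quarterlyTable q
                 WHERE q.issueid = issues.issueid AND q.date <= issues.date
                 ORDER BY q.date DESC LIMIT 1),
        q_act = (SELECT q.q_act FROM quarterlyTable q
                 WHERE q.issueid = issues.issueid AND q.date <= issues.date
                 ORDER BY q.date DESC LIMIT 1)
""")

filled = conn.execute(
    "SELECT date, q_exp, q_act FROM issues ORDER BY date"
).fetchall()
print(filled)
```

Note that the <= comparison means a monthly row dated exactly on a quarter end picks up that same quarter; use < instead if such a row should take the preceding quarter.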

How can I select if Date column is as same as current year

This is my Student table
Id(int) | Name(varchar) | registerDate(Date)
1 John 2012-01-01
How can I write the appropriate query to check if the person's registerDate value falls in the current year (2012)?
SELECT *
FROM Student
WHERE YEAR(registerDate) = YEAR(getdate())
The most direct solution would be to use the YEAR or DATEPART function in whatever flavor of SQL you're using. This will probably meet your needs but keep in mind that this approach does not allow you to use an index if you're searching the table for matches. In this case, it would be more efficient to use the BETWEEN operator.
e.g.
SELECT id, name, registerDate
FROM Student
WHERE registerDate BETWEEN '2012-01-01' AND '2012-12-31'
How you would generate the first and last day of the current year will vary by SQL flavor.
Because you're using a range, an index can be utilized. If you were using a function to calculate the year for each row, it would need to be computed for each row in the table instead of seeking directly to the relevant rows.
If by chance your flavor of SQL is Microsoft T-SQL, then this works:
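As one illustration of generating the year boundaries, here is a sketch using Python's built-in sqlite3 module, where the date() function with the 'start of year' modifier does the work; other flavors spell this differently. The helper name registered_in_year_of, the index name, and the extra sample rows are hypothetical. It uses a half-open range (>= first day of this year, < first day of next year), which also behaves correctly if the column ever stores a time component.

```python
import sqlite3


def registered_in_year_of(conn, today):
    """Return students whose registerDate falls in the calendar year of `today`."""
    return conn.execute(
        """
        SELECT Id, Name, registerDate
        FROM Student
        WHERE registerDate >= date(?, 'start of year')
          AND registerDate <  date(?, 'start of year', '+1 year')
        ORDER BY Id
        """,
        (today, today),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (Id INTEGER PRIMARY KEY, Name TEXT, registerDate TEXT);
    -- The bare column in the WHERE clause lets this index be used
    CREATE INDEX idx_student_registerDate ON Student (registerDate);
    INSERT INTO Student VALUES
        (1, 'John', '2012-01-01'),
        (2, 'Jane', '2011-12-31'),  -- hypothetical extra rows
        (3, 'Ravi', '2012-12-31'),
        (4, 'Mei',  '2013-01-01');
""")

# A fixed date keeps the sketch deterministic; pass 'now' for the real current year.
matches = registered_in_year_of(conn, "2012-06-15")
print(matches)
```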
SELECT * FROM Student Where datepart(yy,registerDate) = datepart(yy,GetDate())
This should work in MySQL:
SELECT * FROM myTable
WHERE YEAR(registerDate)=YEAR(CURDATE())