SQL expression to remove duplicates from a calculation - sql

I am trying to run a query that will give time averages, but when I do, some duplicate records get included in the calculation. How can I remove the duplicates?
ex.
Column 1 | Column 2
07-5794  | 00:59:59
07-5794  | 00:48:22
07-5766  | 00:42:48
07-8423  | 00:51:47
07-4259  | 00:52:12
I can get the average of Column 2, but I don't want identical values in Column 1 (e.g. 07-5794) to be counted twice. How can I do that?

To get the average of the minimum values for each incident number (assuming a table inc with an incident-number column incnum), you could write this SQL:
select avg(min_time) as avg_time
from (select incnum, min(col2) as min_time
      from inc
      group by incnum) as t
using the correct average function for your brand of SQL.
If you're doing this in Access, you'll want to paste this into the SQL view; when you use a subquery, you can't do that directly in design view.

SELECT DISTINCT ...
or
GROUP BY ...
or
SELECT UNIQUE ...
... but usually averages have duplicates included. Just tell me this isn't for financial software; I won't stand for another abuse of statistics!

I believe you're looking for DISTINCT.
http://www.sql-tutorial.com/sql-distinct-sql-tutorial and http://www.w3schools.com/SQL/sql_distinct.asp
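As a minimal sketch on the sample data (assuming the same hypothetical inc table as above): note that DISTINCT only removes rows that are identical across all selected columns, so it would not merge the two 07-5794 rows here, because their times differ.
select avg(col2) as avg_time
from (select distinct col1, col2 from inc) as t;
-- Both 07-5794 rows survive, since their col2 values differ;
-- use a GROUP BY approach if you need one row per col1.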

Do you want to eliminate all entries that are duplicates? That is, should neither of the rows with "07-5794" be included in the calculation? If that is the case, I think this would work (in Oracle at least):
SELECT AVG(col2)
FROM (
    SELECT col1, MAX(col2) AS col2
    FROM table
    GROUP BY col1
    HAVING COUNT(*) = 1
)
However, if you want to retain one of the duplicates, you need to specify how to pick which one to keep.
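If you do decide to retain one of the duplicates, a hedged sketch of one common choice (keeping the smallest col2 per col1, in the spirit of the first answer above) would be:
SELECT AVG(col2)
FROM (
    SELECT col1, MIN(col2) AS col2   -- keep the smallest time per col1
    FROM table
    GROUP BY col1
)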

Anastasia,
Assuming you have the calculation for the average and a unique key in the table, you could do something like this to get just the latest occurrence of the timing for each unique 07-xxx result:
select Column1, Avg(Convert(decimal, Column2))
from Table1
where TableId in
(
    select Max(TableId)
    from Table1
    group by Column1
)
group by Column1
This assumes the following table structure in MS SQL:
CREATE TABLE [dbo].[Table1] (
    [TableId] [int] IDENTITY (1, 1) NOT NULL,
    [Column1] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Column2] [int] NULL
) ON [PRIMARY]
Good Luck!

Related

How to find duplicate rows in Hive?

I want to find duplicate rows in one of my Hive tables, for which I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this will give the total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my tables the total row count from the first query is 3500 and the second query gives a row count of 2700, which suggests that 3500 - 2700 = 800 rows are duplicates. But this approach doesn't tell me which rows are duplicated.
My second approach to finding duplicates is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows which are duplicated and how many times each is duplicated. But this query returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
Is my first approach correct? If yes, how do I find which rows are duplicated?
Why is the second approach not providing the list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints. Since these constraints are not validated, an upstream system needs to ensure data integrity before the data is loaded into Hive.
That means that Hive allows duplicates in primary keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get the list of rows that are duplicated across every column.
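For instance, with hypothetical columns c1, c2, c3 standing in for [every column]:
select c1, c2, c3, count(*) as dup_count
from mytable
group by c1, c2, c3
having count(*) > 1;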
The analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simple in-line view and an exists clause will then pinpoint which corresponding sets of records contain these duplicates in the original table. (In some databases, like Teradata, you can forgo the inline view using a QUALIFY clause; see the sketch after the query below.)
SQL1 and SQL2 above can be combined. For SQL2: if you want to deal with NULLs in the key columns rather than simply dismiss those rows (count(distinct ...) ignores rows whose key parts are NULL), a coalesce and concatenation might be better in the distinct count:
SELECT count(1), count(distinct concat(coalesce(keypart1, ''), coalesce(keypart2, '')))
FROM srcTable s
3) The query below finds all the affected records, not just the groups with count > 1. It provides all the context data as well as the keys, so it can be useful when analyzing why you have dups, and not just which keys have them.
select * from srcTable s
where exists
( select 1 from (
    SELECT
        keypart1,
        keypart2,
        row_number() over( partition by keypart1, keypart2 ) seq
    FROM srcTable t
    -- WHERE (whatever additional filtering you want)
  ) t
  where seq > 1
    AND t.keypart1 = s.keypart1
    AND t.keypart2 = s.keypart2
)
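In Teradata, a hedged sketch of the QUALIFY shortcut mentioned above would return every row belonging to a duplicated key group, without the inline view:
SELECT *
FROM srcTable
QUALIFY count(*) OVER (PARTITION BY keypart1, keypart2) > 1;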
Suppose you want to get duplicate rows based on a particular column, id here. The query below will give you all the ids which are duplicated in the Hive table.
SELECT "ID"
FROM TABLE
GROUP BY "ID"
HAVING count(ID) > 1
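To pull back the full duplicated rows rather than just the ids, a hedged follow-up sketch (Hive supports uncorrelated IN subqueries in the WHERE clause):
SELECT *
FROM mytable
WHERE id IN (SELECT id
             FROM mytable
             GROUP BY id
             HAVING count(id) > 1);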

How to use indexes when trying to make null values come last?

I am using sqlite3 and I am trying to retrieve all rows ordered by some column row1, with null values coming last. As of now I am using this kind of query:
select * from table order by row1 is null, row1 asc
As there are many rows in my table, the query ran quite slowly, so I decided to create an index on table(row1).
Creating the index dramatically improved the speed of queries like:
select * from table order by row1 asc
However, SQLite doesn't seem to use that index with "order by row1 is null" types of queries.
Why can't SQLite, based on that index, just move the rows with null values to the end?
Is there any way I can make null values come last without every row being re-evaluated every time?
SQLite 3.9.0 (numbered 3.8.12 during development) supports expressions in indexes:
> CREATE TABLE t(x);
> CREATE INDEX tnx ON T(x IS NULL, x);
> EXPLAIN QUERY PLAN SELECT * FROM t ORDER BY x IS NULL, x;
0|0|0|SCAN TABLE t USING COVERING INDEX tnx
In earlier versions, you can split the query into two subqueries, each of which can use an index:
SELECT *
FROM (SELECT *
      FROM MyTable
      WHERE row1 IS NOT NULL
      ORDER BY row1)
UNION ALL
SELECT *
FROM MyTable
WHERE row1 IS NULL;
You can try a conditional expression in the ORDER BY:
select * from table
order by case when row1 is null then 1 else 0 end, row1
The default ordering is ascending, hence asc has been omitted from the query above.
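As a side note, and assuming a reasonably recent SQLite (NULLS FIRST/LAST was added in version 3.30.0), the standard syntax now expresses this directly:
SELECT * FROM MyTable ORDER BY row1 ASC NULLS LAST;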

Average when meeting where clause

I want to find the average of a field over just the rows that meet a criterion. This is embedded in a big query, but I would like the average as a field in that query instead of computing it in a separate one.
This is what I have so far:
Select....
Avg( (currbal) where (select * from table
where ament2 in ('r1','r2'))
From table
If you want to AVG only a subset of the rows in a query, use CASE WHEN ... THEN to replace the value in non-matching rows with NULL, since NULLs are ignored by avg().
Select id,
       sum(something) SomethingSummed,
       avg(case when ament2 in ('r1','r2') then currbal end) CurrbalAveragedForR1R2
From [table]
group by id
You can compute the average over a filtered derived table in the FROM clause, and embed whatever other sums you want alongside it. Something like:
SELECT AVG(currbal)
FROM
(
SELECT * -- other sums
FROM table
WHERE ament2 IN ('r1','r2')
) t
You can write a full sub-select into the select list:
SELECT ...,
(SELECT AVG(Currbal) FROM Table WHERE ament2 IN ('r1', 'r2')) AS avg_currbal,
...
FROM ...
Whether that will do exactly what you want depends on a number of things. You might need to make it into a correlated subquery; assuming ament2 is a column of Table, it is not a correlated subquery at the moment.
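A hedged sketch of the correlated variant, using a hypothetical grouping column grp shared between the outer query's table and Table:
SELECT o.grp,
       (SELECT AVG(t.Currbal)
        FROM Table t
        WHERE t.ament2 IN ('r1', 'r2')
          AND t.grp = o.grp   -- correlate on the outer row's grp
       ) AS avg_currbal
FROM OuterTable o;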

Fast way to eyeball possible duplicate rows in a table?

Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique; the logic behind this was that occasionally other fields in the row change, despite the Episode being repeated. However, there is a column that actually is unique, Sequence.
I want to try to identify rows that have the same episode number but something different between them (aside from Sequence), so I can pick out how often this occurs, and whether it's worth allowing for, or whether I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT a.Episode,
       CASE WHEN a.Value1 <> b.Value1
            THEN a.Value1 + ',' + b.Value1
            ELSE '' END AS Value1,
       CASE WHEN a.Value2 <> b.Value2
            THEN a.Value2 + ',' + b.Value2
            ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1 <> b.Value1
   OR a.Value2 <> b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasted code there is, the more likely something will be missed. As far as I know, I can't just do a SELECT DISTINCT, since Sequence is distinct, so the same row will always pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.
Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the selected columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values, so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the Sequence column would be a unique value; in that case you'd have to provide the list of all the columns you want to see, i.e.:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you select t.* minus one column. Once the Sequence column is omitted from the output, the duplicates will become apparent.
Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you a list of all the combinations of values that occur on more than one row (but without the sequence or episode numbers themselves). Here's the rub: you will need to join this result set back to YourTable on ALL the columns except Sequence and Episode, since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
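As a refinement (a hedged sketch), the generator above can emit ready-to-paste join conditions by appending the AND separators and excluding the two special columns:
select 't1.' + column_name + ' = t2.' + column_name + ' AND '
from information_schema.columns
where table_name = 'YourTable'
  and column_name not in ('Sequence', 'Episode')
Just delete the trailing AND from the last generated line before pasting.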
select count(distinct ....)
should show you without having to guess. You can get your columns by viewing your table definition, so you can copy/paste your non-Sequence columns.
I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straight-forward. (i.e. describe t and iterate over all the fields).
Also, your query should impose some sort of ordering on the pair, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence; otherwise each mismatched pair will show up twice, once in each order.
A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row.
Then query for duplicates of the hash key; those are your "very probably" identical rows.
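A hedged sketch of this idea in SQL Server syntax, with a hypothetical column list that should cover everything except Sequence:
-- Persist a hash over every column that counts toward "sameness".
ALTER TABLE Table1
    ADD RowHash AS HASHBYTES('SHA2_256',
        CONCAT(Episode, '|', Value1, '|', Value2)) PERSISTED;
-- Probable duplicates are hash values that occur more than once.
SELECT RowHash, COUNT(*) AS cnt
FROM Table1
GROUP BY RowHash
HAVING COUNT(*) > 1;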

Most efficient way to select 1st and last element, SQLite?

What is the most efficient way to select the first and last element only, from a column in SQLite?
The first and last element from a row?
SELECT column1, columnN
FROM mytable;
I think you must mean the first and last element from a column:
SELECT MIN(column1) AS First,
MAX(column1) AS Last
FROM mytable;
See http://www.sqlite.org/lang_aggfunc.html for MIN() and MAX().
I'm using First and Last as column aliases.
if it's just one column:
SELECT min(column) as first, max(column) as last FROM table
if you want to select whole rows, note that SQLite only allows ORDER BY and LIMIT on the final SELECT of a compound query, so each arm needs its own subquery (and 'first' pairs with ASC, 'last' with DESC):
SELECT 'first', * FROM (SELECT * FROM table ORDER BY column ASC LIMIT 1)
UNION
SELECT 'last', * FROM (SELECT * FROM table ORDER BY column DESC LIMIT 1)
The most efficient way would be to know what those fields were called and simply select them.
SELECT `first_field`, `last_field` FROM `table`;
Probably like this:
SELECT dbo.Table.FirstCol, dbo.Table.LastCol FROM Table
You get minor efficiency enhancements from specifying the table name and schema.
First: MIN() and MAX() on a text column give AAAA and TTTT results, which are not the first and last entries in my test table. They are the minimum and maximum values, as mentioned.
I tried this (with .stats on) on my table which has over 94 million records:
select * from
(select col1 from mitable limit 1)
union
select * from
(select col1 from mitable limit 1 offset
(select count(0) from mitable) -1);
But it uses up a lot of virtual machine steps (281,624,718).
Then I tried this, which is much more straightforward (it works as long as the table was not created as a WITHOUT ROWID table) [SQL keywords are in capitals]:
SELECT col1 FROM mitable
WHERE ROWID = (SELECT MIN(ROWID) FROM mitable)
OR ROWID = (SELECT MAX(ROWID) FROM mitable);
That ran with 55 virtual machine steps on the same table and produced the same answer.
The min()/max() approach is wrong: it is only correct if the values are ascending. I needed something like this for currency rates, which rise and fall randomly.
This is my solution:
select st.*
from stats_ticker st,
(
select min(rowid) as first, max(rowid) as last --here is magic part 1
from stats_ticker
-- next line is just a filter I need in my case.
-- if you want first/last of the whole table leave it out.
where timeutc between datetime('now', '-1 days') and datetime('now')
) firstlast
WHERE
st.rowid = firstlast.first --and these two rows do magic part 2
OR st.rowid = firstlast.last
ORDER BY st.rowid;
Magic part 1: the subselect results in a single row with the columns first and last containing rowids.
Magic part 2: it is then easy to filter on those two rowids.
This is the best solution I've come up with so far. Hope you like it.
We can do that with the help of the SQL aggregate functions MAX and MIN. These are the two aggregate functions which help you get the last and first element from a data table:
Select max(column_name), min(column_name) from table_name
MAX will give you the maximum value, meaning the last value, and MIN will give you the minimum value, meaning the first value, from the specific table (with the caveat above: this only matches first/last if the column's values are ascending).