Querying duplicate rows within Oracle Database - sql

I have a table that looks like this.
| File_ID | MD5 Sum | File Size |
| --------- | ----------- | ----------- |
| 140532 | 10000000 | 3000 |
| 192348 | 11111111 | 4000 |
| 223292 | 22222222 | 4000 |
| 272364 | 11111111 | 4000 |
| 223045 | 10000000 | 3000 |
I'd like to see how much space is wasted by duplicate files. The problem is that these duplicate files have unique primary keys (file_id). We know we have duplicates because count(distinct MD5 sum) != count(*).
I'd like to write a query that returns the total space used by duplicate files. In this example, the query would return 7000, because the rows with file IDs 272364 and 223045 are duplicates.
If anyone can help me with this, it would be much appreciated.

You can number the rows within each MD5 group; any duplicate will then show up with a row number above 1.
For example:
select sum(file_size)
from (
select t.*, row_number() over(partition by md5_sum order by file_id) as rn
from t
) x
where rn > 1
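To try the query above against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory SQLite database is only a stand-in for Oracle (an assumption for the demo; it needs SQLite 3.25 or newer for window functions), and the table name t and column names follow the query above.

```python
import sqlite3

# In-memory SQLite stands in for Oracle here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (file_id INTEGER PRIMARY KEY, md5_sum TEXT, file_size INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (140532, "10000000", 3000),
        (192348, "11111111", 4000),
        (223292, "22222222", 4000),
        (272364, "11111111", 4000),
        (223045, "10000000", 3000),
    ],
)

# Number the rows inside each MD5 group; every row after the first
# (rn > 1) is a duplicate whose size counts as wasted space.
wasted = conn.execute(
    """
    SELECT SUM(file_size)
    FROM (
        SELECT t.*, ROW_NUMBER() OVER (PARTITION BY md5_sum ORDER BY file_id) AS rn
        FROM t
    ) x
    WHERE rn > 1
    """
).fetchone()[0]

print(wasted)  # 7000 for the sample rows
```

Ordering by file_id inside each partition means the lowest file_id is treated as the original and the later ones as the duplicates.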

An alternative to The Impaler's suggestion. But I admit I like their approach better :-)
Group by MD5 sum and keep the groups that have more than one entry. For each group, subtract one file size from the sum of file sizes to get the excess. Finally, add up the excess of all groups.
select sum(excess) as total
from
(
select md5, sum(filesize) - min(filesize) as excess
from mytable
group by md5
having count(*) > 1
) excess_per_file;
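A quick way to see what the grouping does is to look at the per-hash excess before summing it. This is a rough sketch with Python's sqlite3 module; SQLite is just a stand-in database here, and the names mytable, md5 and filesize are taken from the query above.

```python
import sqlite3

# In-memory SQLite as a stand-in database (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (file_id INTEGER PRIMARY KEY, md5 TEXT, filesize INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (140532, "10000000", 3000),
        (192348, "11111111", 4000),
        (223292, "22222222", 4000),
        (272364, "11111111", 4000),
        (223045, "10000000", 3000),
    ],
)

# Inner query of the answer: one row per hash that occurs more than once.
excess_per_file = conn.execute(
    """
    SELECT md5, SUM(filesize) - MIN(filesize) AS excess
    FROM mytable
    GROUP BY md5
    HAVING COUNT(*) > 1
    ORDER BY md5
    """
).fetchall()

# Outer query of the answer: add up the per-hash excess.
total = conn.execute(
    """
    SELECT SUM(excess)
    FROM (
        SELECT md5, SUM(filesize) - MIN(filesize) AS excess
        FROM mytable
        GROUP BY md5
        HAVING COUNT(*) > 1
    ) excess_per_file
    """
).fetchone()[0]

print(excess_per_file, total)
```

Subtracting MIN(filesize) works because files with the same MD5 sum are expected to have the same size, so any one of them can be treated as the copy to keep.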

Related

Dividing sum results

I'm really sorry as this was probably answered before, but I couldn't find something that solved the problem.
In this case, I'm trying to get the result of dividing two sums in the same column.
| Id | month | budget | sales |
| -- | ----- | ------ | ----- |
| 1 | jan | 1000 | 800 |
| 2 | jan | 1000 | 850 |
| 1 | feb | 1200 | 800 |
| 2 | feb | 1100 | 850 |
What I want is to get the % of completion for each id and month (for example, 0.8 or 80% in a fifth column for id 1 in jan).
I have something like
sel
id,
month,
sum (daily_budget) as budget,
sum (daily_sales) as sales,
budget/sales over (partition by 1,2) as efectivenes
from sales
group by 1,2
I know I'm doing this wrong, but I'm kind of new to SQL and can't find the way :|
Thanks!
This should do it
CAST(ROUND(SUM(daily_sales) * 100.00 / SUM(daily_budget), 1) AS DECIMAL(5,2)) AS Effectiveness
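Here is a small sketch of that expression inside a full grouped query, run with Python's sqlite3 module. The in-memory SQLite database is a stand-in for whatever engine the question uses, the sample rows are loaded into hypothetical daily_budget and daily_sales columns, and the CAST to DECIMAL is left out because SQLite has no fixed-point type.

```python
import sqlite3

# In-memory SQLite as a stand-in database (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, month TEXT, daily_budget INTEGER, daily_sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "jan", 1000, 800),
        (2, "jan", 1000, 850),
        (1, "feb", 1200, 800),
        (2, "feb", 1100, 850),
    ],
)

# Both sums are computed per (id, month) group, so the ratio needs no OVER clause.
# Multiplying by 100.0 forces floating-point division.
rows = conn.execute(
    """
    SELECT id,
           month,
           SUM(daily_budget) AS budget,
           SUM(daily_sales) AS sales,
           ROUND(SUM(daily_sales) * 100.0 / SUM(daily_budget), 1) AS effectiveness
    FROM sales
    GROUP BY id, month
    """
).fetchall()

# Map (id, month) to the percentage for easy lookup.
effectiveness = {(r[0], r[1]): r[4] for r in rows}
print(effectiveness[(1, "jan")])  # 80.0
```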
I'm new at SQL too but maybe I can help. Try this?
select
id,
month,
sum(daily_budget) as budget,
sum(daily_sales) as sales,
sum(daily_sales) * 1.0 / sum(daily_budget) as effectiveness
from sales
group by id, month
If you want to ALTER your table so that it contains a fifth column where the result of budget/sales is calculated automatically, all you need to do is add the formula to a generated column. The example I am about to show is based on MySQL.
Open MySQL Workbench
Find the table you wish to modify in the Navigator Pane, right-click on it and select "Alter Table"
Add a new column to your table. Make sure you select the NN (Not Null) and G (Generated Column) check boxes
In the Default/Expression column, simply enter the expression budget / sales.
Once you run your next query, you should see your column generated and populated with the calculated results. If you simply want the SQL statement to do the same from the console, it will be something like this: ALTER table YOUR_TABLE_NAME add result FLOAT as (budget / sales);

How to query to capture recency and scale in one query?

I've built a query that calculates the number of ids from a table, per url_count.
with cte as (
select id, count(distinct url) url_count
from table
group by id
)
select sum(if(url_count >= 1,1,0)) scale
from cte
union all
select sum(if(url_count >= 2,1,0)) scale
from cte
union all
select sum(if(url_count >= 3,1,0)) scale
from cte
union all
select sum(if(url_count >= 4,1,0)) scale
from cte
union all
select sum(if(url_count >= 5,1,0)) scale
from cte
The query above says: "Give me the list of ids and the number of urls they each go to, then accumulate the number of ids who have gone to [1-5] or more urls."
It's of course a tedious method, but it works and outputs something like:
---------
| scale |
---------
|1213432|
|867554 |
|523523 |
|342232 |
|145889 |
---------
The table also has a date field covering the last 5 days, which I'm working on adding to this query. Therein lies the challenge: adding a second layer of information, recency, to the query. I've been trying multiple approaches to build a query that outputs all the combinations of the different scales per date.
The sort of output I've imagined is a pivot table that looks something like:
-------------------------------------------------------------
| date | url_co1 | url_co2 | url_co3 | url_co4 | url_co5|
-------------------------------------------------------------
|2020-01-05| 1213432 | 1112321 | 984332 | 632131 | 234124 |
|2020-01-04| 1012131 | 934242 | 867554 | 533242 | 134234 |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
-------------------------------------------------------------
Where url_co[1-5] represents the number of ids that visited [1-5] or more urls, and date gives the date that volume was captured. I have no idea how to write that, because once I query:
with cte as (
select id, date, count(distinct url) url_count
from table
group by id, date
)
I've aggregated per id, per date, and from there something goes wrong. =/
Hope that all made sense!
Please, please help! I would appreciate some guidance.
There must be a methodology for getting the combination of volumes per recency that I've missed!
I don't really follow the full question, but the first query can be simplified to:
select url_count, count(*) as this_count,
sum(count(*)) over (order by url_count desc) as descending_count
from (select id, count(distinct url) as url_count
from table
group by id
) t
group by url_count
order by url_count;
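To make the idea concrete, here is a rough sketch run with Python's sqlite3 module on a few made-up rows. The table name visits and its data are hypothetical (the question only calls it "table"), SQLite 3.25 or newer is assumed for window functions, and the running total is computed in an extra subquery purely to keep the SQL portable. The running total over descending url_count gives the number of ids with at least that many distinct urls.

```python
import sqlite3

# In-memory SQLite as a stand-in database; "visits" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("a", "u1"), ("a", "u2"), ("a", "u3"), ("a", "u1"),  # 3 distinct urls
        ("b", "u1"),                                          # 1 distinct url
        ("c", "u1"), ("c", "u2"),                             # 2 distinct urls
        ("d", "u2"),                                          # 1 distinct url
    ],
)

scale = conn.execute(
    """
    SELECT url_count,
           this_count,
           -- running total from the highest url_count downwards:
           -- ids that visited url_count or more urls
           SUM(this_count) OVER (ORDER BY url_count DESC) AS descending_count
    FROM (
        SELECT url_count, COUNT(*) AS this_count
        FROM (
            SELECT id, COUNT(DISTINCT url) AS url_count
            FROM visits
            GROUP BY id
        ) per_id
        GROUP BY url_count
    ) per_count
    ORDER BY url_count
    """
).fetchall()

print(scale)  # rows of (url_count, ids with exactly that count, ids with that count or more)
```

This returns the scale figures as rows rather than as the five separate UNION ALL queries from the question.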

Select the difference of two consecutive columns

I have a table car that looks like this:
| mileage | carid |
------------------
| 30 | 1 |
| 50 | 1 |
| 100 | 1 |
| 0 | 2 |
| 70 | 2 |
I would like to get the average difference for each car. So for example for car 1 I would like to get ((50-30)+(100-50))/2 = 35. So I created the following query
SELECT AVG(diff),carid FROM (
SELECT (mileage-
(SELECT Max(mileage) FROM car Where mileage<mileage AND carid=carid GROUP BY carid))
AS diff,carid
FROM car GROUP BY carid)
But this doesn't work as I'm not able to use current row for the other column. And I'm quite clueless on how to actually solve this in a different way.
So how would I be able to obtain the value of the next row somehow?
The average difference is the maximum minus the minimum, divided by one less than the count. The consecutive differences telescope: (50-30)+(100-50) = 100-30, so only the first and last values survive in the sum.
Hence:
select carid,
( (max(mileage) - min(mileage)) / nullif(count(*) - 1, 0)) as avg_diff
from car
group by carid;
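If you'd rather check the shortcut than take it on faith, this sketch compares it with an explicit row-by-row difference on the sample data. It uses Python's sqlite3 module with an in-memory SQLite database as a stand-in (SQLite 3.25 or newer assumed for LAG); the LAG-based query is my own addition for comparison, not part of the answer above.

```python
import sqlite3

# In-memory SQLite as a stand-in database (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car (mileage INTEGER, carid INTEGER)")
conn.executemany(
    "INSERT INTO car VALUES (?, ?)",
    [(30, 1), (50, 1), (100, 1), (0, 2), (70, 2)],
)

# Explicit version: difference to the previous reading of the same car.
# The first reading of each car has no predecessor (NULL), and AVG skips NULLs.
by_lag = conn.execute(
    """
    SELECT carid, AVG(diff)
    FROM (
        SELECT carid,
               mileage - LAG(mileage) OVER (PARTITION BY carid ORDER BY mileage) AS diff
        FROM car
    ) d
    GROUP BY carid
    ORDER BY carid
    """
).fetchall()

# Shortcut: (max - min) / (count - 1); 1.0 forces floating-point division,
# and NULLIF avoids dividing by zero for a car with a single reading.
by_formula = conn.execute(
    """
    SELECT carid,
           (MAX(mileage) - MIN(mileage)) * 1.0 / NULLIF(COUNT(*) - 1, 0) AS avg_diff
    FROM car
    GROUP BY carid
    ORDER BY carid
    """
).fetchall()

print(by_lag, by_formula)
```

Both queries give 35 for car 1 and 70 for car 2. The shortcut relies on mileage never decreasing within a car, which is a reasonable assumption for an odometer.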

MySQL Range and Average

I'm wondering if in MySQL you are able to find a range within values along with the average in a query. Assume the table below please:
-----------------------------------------
| ID | VALUE |
-----------------------------------------
| 1 | 30 |
-----------------------------------------
| 2 | 50 |
-----------------------------------------
| 3 | 10 |
-----------------------------------------
Range Low would be 10, range High would be 50, average would be 30.
Is there a query that would allow me to grab these values without pulling them down into PHP, sorting the array, and finding the average that way?
Cheers
SELECT Avg(Value), Max(Value), Min(Value) FROM tableName
See also MySQL Aggregate Functions
Is this what you want?
select min(value) as low, max(value) as high, avg(value) from table_name
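For a quick check on the three sample rows, here is a minimal sketch with Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and readings is a hypothetical table name; MIN, MAX and AVG behave the same way in both engines for this case.

```python
import sqlite3

# In-memory SQLite as a stand-in for MySQL; "readings" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(1, 30), (2, 50), (3, 10)])

# One pass over the table returns all three aggregates in a single row.
low, high, average = conn.execute(
    "SELECT MIN(value) AS low, MAX(value) AS high, AVG(value) AS average FROM readings"
).fetchone()

print(low, high, average)  # 10 50 30.0
```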

Row Rank in a MySQL View

I need to create a view that automatically adds a virtual row number to the result. The table here is totally random; all I want to achieve is for the last column to be created dynamically.
> +--------+------------+-----+
> | id | variety | num |
> +--------+------------+-----+
> | 234 | fuji | 1 |
> | 4356 | gala | 2 |
> | 343245 | limbertwig | 3 |
> | 224 | bing | 4 |
> | 4545 | chelan | 5 |
> | 3455 | navel | 6 |
> | 4534345| valencia | 7 |
> | 3451 | bartlett | 8 |
> | 3452 | bradford | 9 |
> +--------+------------+-----+
Query:
SELECT id,
variety,
SOMEFUNCTIONTHATWOULDGENERATETHIS() AS num
FROM mytable
Use:
SELECT t.id,
t.variety,
(SELECT COUNT(*) FROM TABLE WHERE id < t.id) +1 AS NUM
FROM TABLE t
It's not an ideal manner of doing this, because the query for the num value will execute for every row returned. A better idea would be to create a NUMBERS table, with a single column containing a number starting at one that increments to an outrageously large number, and then join & reference the NUMBERS table in a manner similar to the variable example that follows.
MySQL Ranking, or Lack Thereof
You can define a variable in order to get pseudo row number functionality, because MySQL versions before 8.0 don't have any ranking functions:
SELECT t.id,
t.variety,
@rownum := @rownum + 1 AS num
FROM TABLE t,
(SELECT @rownum := 0) r
The SELECT @rownum := 0 defines the variable, and sets it to zero.
The r is a subquery/table alias, because you'll get an error in MySQL if you don't define an alias for a subquery, even if you don't use it.
Can't Use A Variable in a MySQL View
If you do, you'll get the 1351 error, because by design you can't use a variable in a view. The bug/feature behavior is documented here.
Oracle has a rownum pseudo-column. In MySQL, you might have to go ugly:
SELECT t.id,
t.variety,
1 + (SELECT COUNT(*) FROM tbl WHERE id < t.id) as num
FROM tbl t
This query is off the top of my head and untested, so take it with a grain of salt. Also, it assumes that you want to number the rows according to some sort criteria (id in this case), rather than the arbitrary numbering shown in the question.
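Since the counting subquery uses no variables, it can live inside a view. Here is a small sketch of that with Python's sqlite3 module; SQLite stands in for MySQL (an assumption for the demo), the view name numbered is hypothetical, and only three of the sample rows are loaded to keep it short.

```python
import sqlite3

# In-memory SQLite as a stand-in for MySQL; the view name "numbered" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, variety TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(234, "fuji"), (4356, "gala"), (224, "bing")],
)

# num = 1 + number of rows with a smaller id, i.e. the row's rank by id.
conn.execute(
    """
    CREATE VIEW numbered AS
    SELECT t.id,
           t.variety,
           1 + (SELECT COUNT(*) FROM mytable t2 WHERE t2.id < t.id) AS num
    FROM mytable t
    """
)

numbered = conn.execute("SELECT id, variety, num FROM numbered ORDER BY num").fetchall()
print(numbered)  # [(224, 'bing', 1), (234, 'fuji', 2), (4356, 'gala', 3)]
```

The numbering follows id order rather than the arbitrary order shown in the question, and the subquery runs once per row, so this is better suited to small tables.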