BigQuery: How to delete rows that have 2 columns with identical data? - sql

I haven't been able to find any similar question but I am looking for a way to delete all but 1 of similar rows that have 2 specific columns that contain identical data. For example:
price symbol date
13 RT 2020-10-1
80.9 DX 2020-10-2
81 DX 2020-10-2
90 AP 2020-10-3
89.9 AP 2020-10-3
90 AP 2020-10-3
85 DX 2020-10-4
In this example, I'd like to be able to run a query in the BQ console to find any rows that have both the date AND the symbol identical and delete all but one of them (which one is kept doesn't matter much). The query would delete 1 of the DX rows on 2020-10-2 and 2 of the AP rows on 2020-10-3.
I appreciate the help!!

As you are using BigQuery, I would suggest using CREATE OR REPLACE TABLE as follows:
CREATE OR REPLACE TABLE your_table
AS SELECT DISTINCT price, symbol, date
FROM your_table;

You can use this example code.
DELETE FROM [SampleDB].[dbo].[Employee]
WHERE ID NOT IN
(
SELECT MAX(ID) AS MaxRecordID
FROM [SampleDB].[dbo].[Employee]
GROUP BY [FirstName],
[LastName],
[Country]
);
Check this link for more info: https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/

You specifically say that you want to delete based on two columns, not all three. In your example data, the duplicate rows do not all share the same price (80.9 vs. 81 for DX on 2020-10-2), so SELECT DISTINCT over all three columns would not remove them.
You can use create or replace table, but I would recommend:
CREATE OR REPLACE TABLE t AS
SELECT ARRAY_AGG(t LIMIT 1)[ORDINAL(1)].*
FROM `t` t
GROUP BY symbol, date;
You can also express this using window functions:
CREATE OR REPLACE TABLE t AS
SELECT t.* EXCEPT (seqnum)
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY symbol, date ORDER BY price) as seqnum
FROM `t` t
) t
WHERE seqnum = 1;
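As a rough, runnable illustration of the window-function approach, here is a sketch that uses Python's built-in sqlite3 module in place of BigQuery (an assumption made only so the example can run locally; it needs SQLite 3.25 or newer for window functions). The table name t and the helper dedupe are hypothetical, and the rows are the sample data from the question.

```python
import sqlite3

# Sample rows from the question: (price, symbol, date).
ROWS = [
    (13, "RT", "2020-10-1"),
    (80.9, "DX", "2020-10-2"),
    (81, "DX", "2020-10-2"),
    (90, "AP", "2020-10-3"),
    (89.9, "AP", "2020-10-3"),
    (90, "AP", "2020-10-3"),
    (85, "DX", "2020-10-4"),
]


def dedupe(rows):
    """Keep the lowest-priced row for each (symbol, date) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (price REAL, symbol TEXT, date TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    kept = conn.execute(
        """
        SELECT price, symbol, date
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY symbol, date
                                        ORDER BY price) AS seqnum
              FROM t)
        WHERE seqnum = 1
        ORDER BY date
        """
    ).fetchall()
    conn.close()
    return kept


deduped = dedupe(ROWS)
for row in deduped:
    print(row)
```

Four rows survive, one per (symbol, date) pair. In BigQuery the same SELECT would sit inside CREATE OR REPLACE TABLE, as in the answer above.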

Below is for BigQuery Standard SQL
create or replace table your_table as
with temp as (
select as value array_agg(t order by price limit 1) [offset(0)]
from your_table t
group by symbol, date
)
select * from temp;
Note: you can remove the order by price part if you don't care exactly which row survives among those with a duplicate symbol and date.
If applied to the sample data from your question, the resulting table is:
price symbol date
13 RT 2020-10-1
80.9 DX 2020-10-2
89.9 AP 2020-10-3
85 DX 2020-10-4

Related

How to get the latest 3 records of each group from dolphindb database?

My table name is trades, and its columns are permno, symbol, date, prc, shrout, ret, vol. I want to get the latest 3 records of each stock each date group. Does DolphinDB support such querying methods?
declare @trades as table
(
permno int,
symbol int,
groupdate date
)
insert into @trades(permno,symbol,groupdate)
values
(1,1,'2019-01-01'),
(2,2,'2019-01-01'),
(3,3,'2019-01-01'),
(4,4,'2019-01-01'),
(1,11,'2019-01-02'),
(2,22,'2019-01-02'),
(3,33,'2019-01-02'),
(4,44,'2019-01-02')
select * from(
select ROW_NUMBER() over(partition by groupdate order by groupdate)as rn,* from @trades)x
where rn <=3
In DolphinDB, one can use context-by clause to solve similar problems. For your question, use the code below:
select * from trades context by symbol, date limit -3
A negative value -3 for limit clause tells the system to get last 3 records for each symbol and date combination.

Retrieving most recent data in SQL

Total disclosure: I'm a SQL beginner.
I have a data set of certain accounting and governance metrics for US companies. It has about 15 columns and roughly 18 million rows. Each row is a unique combination of company, date and metric being measured. The columns include certain identifiers like isin number, ticker symbol, etc, the date the metric was released, the metric description, and the metric itself.
What I'm trying to do is write a query that will yield the NEWEST values for a certain metric for all companies. In my hopeless search over the past few days I've come to think that the GROUP BY clause may be what I'm looking for. However, it doesn't seem to do exactly what I need. I've got it working with just 2 columns: isin number (company identifier), and date. In other words, I can spit out a list that shows the most recent date for each company, but I'm not sure how to add more columns to this, how to specify what metric to look at.
Any guidance would be appreciated, even if it's just pointing me in the right direction towards what kind of commands I should be looking into.
Thanks!
EDIT: Wow. Thanks for the quick and thorough replies. And point taken on the clarity and example data sets/starting query. Update: I think I have it working. Here's what I used:
SELECT a1.["id_isin_number"], a1.["metric_description"], a1.["date_period_ends"], a1.["company_metric_value"], a2.maxdate
FROM [AGR Metrics].[dbo].[Audit_Integrity_Metric_Data_File_NA Original_0] a1
INNER JOIN (
SELECT a2.["id_isin_number"], MAX(a2.["date_period_ends"]) AS maxdate
FROM [AGR Metrics].[dbo].[Audit_Integrity_Metric_Data_File_NA Original_0] a2
GROUP BY a2.["id_isin_number"]
) a2
ON a1.["date_period_ends"] = a2.maxdate
AND a1.["id_isin_number"] = a2.["id_isin_number"]
WHERE a1.["metric_description"] = '"Litigation: Class Action"'
I'm looking over the responses now to make sure I'm doing this as efficiently as possible.
You can use the ROW_NUMBER() function for this (if using SQL Server 2005 or newer):
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY isin ORDER BY [date] DESC) AS RowRank
FROM YourTable
)sub
WHERE RowRank = 1
Just list out the fields you want in place of * if you don't want them all returned.
The ROW_NUMBER() function adds a number to each row. PARTITION BY is optional and defines the groups within which numbering starts over at 1; in this case you want the most recent row for each value of isin, so we PARTITION BY that. ORDER BY is required and defines the order of the numbering, in this case by date, descending.
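To make the numbering concrete, here is a small sketch that runs the same pattern against an in-memory SQLite database through Python's sqlite3 module (an assumption for the sake of a runnable example; the question is about SQL Server, and window functions need SQLite 3.25 or newer). The table metrics, its three columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (isin, date, value).
SAMPLE = [
    ("US0001", "2020-03-31", 1.0),
    ("US0001", "2021-03-31", 2.0),
    ("US0002", "2019-12-31", 5.0),
    ("US0002", "2020-12-31", 7.0),
    ("US0002", "2018-12-31", 3.0),
]


def latest_per_isin(rows):
    """Return the most recent row for each isin."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics (isin TEXT, date TEXT, value REAL)")
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT isin, date, value
        FROM (SELECT *,
                     ROW_NUMBER() OVER (PARTITION BY isin
                                        ORDER BY date DESC) AS row_rank
              FROM metrics) sub
        WHERE row_rank = 1
        ORDER BY isin
        """
    ).fetchall()
    conn.close()
    return result


latest = latest_per_isin(SAMPLE)
print(latest)
```

Within each isin the newest date gets row_rank 1, so the outer filter keeps exactly one row per company. ISO-formatted date strings are used so that text ordering matches chronological ordering.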
Your current query can also be used, but the ROW_NUMBER() method is simpler and more efficient:
SELECT a.*
FROM YourTable a
JOIN (SELECT isin, MAX([date]) AS [date]
FROM YourTable
GROUP BY isin
)b
ON a.isin = b.isin
AND a.[date] = b.[date]
Since you mention the date the metric was released, you can use it to sort your table with ORDER BY.
This is a very basic example that simply sorts the data and selects the top 1 value:
CREATE TABLE trialOne (
Id INT NULL,
NAME VARCHAR(50) NULL,
[Date] DATETIME NULL
)
INSERT INTO trialone VALUES(1,'john','2009-01-06 11:39:51.827')
INSERT INTO trialone VALUES(2,'joseph','2010-01-06' )
INSERT INTO trialone VALUES(3,'Ajay','2009-05-06' )
INSERT INTO trialone VALUES(4,'Dave','2009-11-06' )
INSERT INTO trialone VALUES(5,'jonny','2004-01-06')
INSERT INTO trialone VALUES(6,'sunny','2005-01-06')
INSERT INTO trialone VALUES(7,'elle','2013-01-06' )
INSERT INTO trialone VALUES(8,'mac','2012-01-06' )
INSERT INTO trialone VALUES(8,'Sam','2008-01-06' )
INSERT INTO trialone VALUES(10,'xxxxx','2013-08-06')
SELECT TOP(1)name FROM trialone ORDER BY Date DESC

SQL. Is there any efficient way to find second lowest value?

I have the following table:
ItemID Price
1 10
2 20
3 12
4 10
5 11
I need to find the second lowest price. So far, I have a query that works, but i am not sure it is the most efficient query:
select min(price)
from table
where itemid not in
(select itemid
from table
where price=
(select min(price)
from table));
What if I have to find third OR fourth minimum price? I am not even mentioning other attributes and conditions... Is there any more efficient way to do this?
PS: note that minimum is not a unique value. For example, items 1 and 4 are both minimums. Simple ordering won't do.
SELECT MIN( price )
FROM table
WHERE price > ( SELECT MIN( price )
FROM table )
select price
from (select d.price, row_number() over (order by d.price) as rownum
from (select distinct price from table) d) as x
where x.rownum = 2 --or 3, 4, 5, etc
Not sure if this would be the fastest, but it would make it easier to select the second, third, etc... Just change the TOP value.
UPDATED
SELECT MIN(price)
FROM table
WHERE price NOT IN (SELECT DISTINCT TOP 1 price FROM table ORDER BY price)
To find out second minimum salary of an employee, you can use following:
select min(salary)
from table
where salary > (select min(salary) from table);
This is a good answer:
SELECT MIN( price )
FROM table
WHERE price > ( SELECT MIN( price )
FROM table )
Make sure when you do this that there is only 1 row in the subquery! (the part in brackets at the end).
For example if you want to use GROUP BY you will have to define even further using:
SELECT MIN( price )
FROM table te1
WHERE price > ( SELECT MIN( price )
FROM table te2 WHERE te1.brand = te2.brand)
GROUP BY brand
Because GROUP BY will give you multiple rows, otherwise you will get the error:
SQL Error [21000]: ERROR: more than one row returned by a subquery used as an expression
I guess the simplest way is to use the offset-fetch filter from standard SQL. DISTINCT is not necessary if you don't have repeated values in your column.
select distinct(price) from table
order by price
offset 1 row fetch first 1 row only;
There is no need to write complex subqueries.
In Amazon Redshift, use limit-offset instead, for example:
Select distinct(price) from table
order by price
limit 1
offset 1;
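The limit/offset form generalizes to the third or fourth lowest price by changing the offset. Below is a minimal sketch using Python's sqlite3 module (chosen only so the example runs locally); the table items and the helper nth_lowest_price are hypothetical names, and the prices are the ones from the question.

```python
import sqlite3


def nth_lowest_price(prices, n):
    """Return the n-th lowest distinct price, or None if there are fewer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (itemid INTEGER PRIMARY KEY, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (price) VALUES (?)", [(p,) for p in prices]
    )
    # DISTINCT collapses tied prices before the offset skips n - 1 of them.
    row = conn.execute(
        "SELECT DISTINCT price FROM items ORDER BY price LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    conn.close()
    return row[0] if row else None


PRICES = [10, 20, 12, 10, 11]  # items 1 and 4 tie for the minimum
second_lowest = nth_lowest_price(PRICES, 2)
print(second_lowest)
```

Because DISTINCT is applied before the offset, the two items priced 10 count as a single minimum, which is the behaviour the question asks for.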
You can use either of the following:
select min(your_field) from your_table where your_field NOT IN (select distinct TOP 1 your_field from your_table ORDER BY your_field ASC)
OR
select top 1 ColumnName from TableName where ColumnName not in (select top 1 ColumnName from TableName order by ColumnName asc) order by ColumnName asc
I think you can find the second minimum using LIMIT and ORDER BY:
select max(price) as minimum from (select distinct(price) from tableName order by price asc limit 2) as t --or 3, 4, 5, etc
To find the third or fourth minimum and so on, change the number in the limit clause.
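You can use a ranking function. The query may look complex, but it gives the same result as the other answers. Use DENSE_RANK rather than RANK so that tied prices share a rank without leaving a gap (with RANK, the two items priced 10 would both get rank 1 and the next price would get rank 3, so Rnk=2 would match nothing):
WITH Temp_table AS (SELECT ITEM_ID,PRICE,DENSE_RANK() OVER (ORDER BY PRICE) AS Rnk
FROM YOUR_TABLE_NAME)
SELECT ITEM_ID FROM Temp_table
WHERE Rnk=2;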
Maybe you can get the min value first and then filter with a greater-than operator. This avoids the subquery but requires a two-step process:
select min(price)
from table
where price > 10 -- replace 10 with the min price you previously got

How to select multiple rows in SQL Server while filling one column with the first value

Each of my rows have a date. I want the database to keep the good date. But I am in a situation where I want only the first date. But I still want all the other rows. So I would like to fill the date column with all the same date in my result.
For an example (Because I don't think I expressed myself well)
I have this:
name value date
a 10 5/13
b 14 2/13
c 20 1/13
a 11 7/13
a 5 8/13
b 8 9/13
I want it to become like this in the result:
name value date
a 26 5/13
b 22 5/13
c 20 5/13
I searched for this information but I only find the way to select the first row.
for now I'm doing
SELECT name, SUM(value), date FROM table
ORDER BY name
And I'm kind of clueless for what to do next.
Thanks :)
Databases don't have a concept of "first". Here is an attempt, but no guarantees unless you have a way of ordering to determine first:
select name, sum(value), const.date
from table cross join
(select top 1 date from table) const
group by name, const.date
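The point about ordering can be sketched as follows: "first" only becomes well defined once some column fixes the row order. This example uses Python's sqlite3 module rather than SQL Server, and it assumes a hypothetical auto-incrementing id column that records insertion order; the table name yt and the helper name are also illustrative.

```python
import sqlite3

# Rows from the question, in the order they appear: (name, value, date).
ROWS = [
    ("a", 10, "5/13"),
    ("b", 14, "2/13"),
    ("c", 20, "1/13"),
    ("a", 11, "7/13"),
    ("a", 5, "8/13"),
    ("b", 8, "9/13"),
]


def totals_with_first_date(rows):
    """Sum value per name and attach the date of the first inserted row."""
    conn = sqlite3.connect(":memory:")
    # id is a hypothetical column that makes "first" well defined.
    conn.execute(
        "CREATE TABLE yt (id INTEGER PRIMARY KEY, name TEXT, "
        "value INTEGER, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO yt (name, value, date) VALUES (?, ?, ?)", rows
    )
    result = conn.execute(
        """
        SELECT yt.name, SUM(yt.value), const.date
        FROM yt
        CROSS JOIN (SELECT date FROM yt ORDER BY id LIMIT 1) AS const
        GROUP BY yt.name, const.date
        ORDER BY yt.name
        """
    ).fetchall()
    conn.close()
    return result


totals = totals_with_first_date(ROWS)
print(totals)
```

The single-row subquery is cross joined to every row, so each group carries the same date. Without the ORDER BY inside the subquery, the database would be free to pick any row's date.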
If you only want to do this for a query, to provide this aggregated data for some specific client requirement, then @freshPrince's answer is appropriate. But if you want to actually modify the data in the table itself, and prevent the issue from arising again, then you need to change the schema.
Create Table newTable(
name varChar(30) not null,
date datetime not null,
value decimal(10,2) not null default(0),
primary key (name, date) )
Insert newTable (name, date, value)
Select name, Min(date), SUM(value)
FROM currentTable
Group By Name
Then delete the old table and rename the new table to the old name.
You will also have to modify the process used to insert new rows so that instead of always inserting a new row, it updates the existing row for a specified name and date if it already exists.
Your question is slightly confusing, since your desired result shows a date that does not exist for either b or c, but if that is the result that you want, you could use something similar to the following:
select name, sum(value) value, d.date
from yt
cross join
(
select min(date) date
from yt
where name = (select min(name)
from yt)
) d
group by name, d.date;
But it seems like you actually would want the min(date) for each name:
select name, sum(value) value, min(date)
from yt
group by name;
If the order of the date should be determined by the name, then you could use:
select t.name, sum(value) value, d.date
from yt t
cross join
(
select top 1 name, date
from yt
order by name, date
) d
group by t.name, d.date;

Fastest way to identify differences between two tables?

I have a need to check a live table against a transactional archive table and I'm unsure of the fastest way to do this...
For instance, let's say my live table is made up of these columns:
Term
CRN
Fee
Level Code
My archive table would have the same columns, but also have an archive date so I can see what values the live table had at a given date.
Now... How would I write a query to ensure that the values for the live table are the same as the most recent entries in the archive table?
PS I'd prefer to handle this in SQL, but PL/SQL is also an option if it's faster.
SELECT term, crn, fee, level_code
FROM live_data
MINUS
SELECT term, crn, fee, level_code
FROM historical_data
This returns what's in live but not in historical. You can then union it with the reverse query to get what's in historical but not in live.
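Simply (the parentheses matter, because Oracle evaluates set operators of equal precedence left to right):
(SELECT collist
FROM TABLE A
minus
SELECT collist
FROM TABLE B)
UNION ALL
(SELECT collist
FROM TABLE B
minus
SELECT collist
FROM TABLE A);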
You didn't mention how rows are uniquely identified, so I've assumed you also have an "id" column:
SELECT *
FROM livetable
WHERE (term, crn, fee, levelcode) NOT IN (
SELECT FIRST_VALUE(term) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(crn) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(fee) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(levelcode) OVER (ORDER BY archivedate DESC)
FROM archivetable
WHERE livetable.id = archivetable.id
);
Note: This query doesn't take NULLS into account - if any of the columns are nullable you can add suitable logic (e.g. NVL each column to some "impossible" value).
unload to table1.unl
select * from table1
order by 1,2,3,4
unload to table2.unl
select * from table2
order by 1,2,3,4
diff table1.unl table2.unl > diff.unl
Could you use a query of the form:
SELECT your columns FROM your live table
EXCEPT
SELECT your columns FROM your archive table WHERE archive date is most recent;
Any results will be rows in your live table that are not in your most recent archive.
If you also need rows in your most recent archive that are not in your live table, simply reverse the order of the selects and repeat, or get them all in the same query by performing a (live UNION archive) EXCEPT (live INTERSECT archive).
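Here is a hedged sketch of the EXCEPT approach in both directions, run against in-memory SQLite via Python's sqlite3 module rather than Oracle (where the operator is spelled MINUS). It assumes that a row is identified by (term, crn) and that the "most recent" archive entry is the one with the latest archive_date per key; those assumptions, the table names, and the sample rows are all hypothetical.

```python
import sqlite3

# Most recent archive row per (term, crn); the key choice is an assumption.
LATEST_ARCHIVE = """
    SELECT a.term, a.crn, a.fee, a.level_code
    FROM archive a
    JOIN (SELECT term, crn, MAX(archive_date) AS max_date
          FROM archive
          GROUP BY term, crn) m
      ON a.term = m.term AND a.crn = m.crn AND a.archive_date = m.max_date
"""
LIVE = "SELECT term, crn, fee, level_code FROM live"


def diff_live_vs_archive(live_rows, archive_rows):
    """Return (rows only in live, rows only in the latest archive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE live (term TEXT, crn TEXT, fee REAL, level_code TEXT)"
    )
    conn.execute(
        "CREATE TABLE archive (term TEXT, crn TEXT, fee REAL, "
        "level_code TEXT, archive_date TEXT)"
    )
    conn.executemany("INSERT INTO live VALUES (?, ?, ?, ?)", live_rows)
    conn.executemany("INSERT INTO archive VALUES (?, ?, ?, ?, ?)", archive_rows)
    only_live = conn.execute(
        f"{LIVE} EXCEPT {LATEST_ARCHIVE} ORDER BY 1, 2"
    ).fetchall()
    only_archive = conn.execute(
        f"{LATEST_ARCHIVE} EXCEPT {LIVE} ORDER BY 1, 2"
    ).fetchall()
    conn.close()
    return only_live, only_archive


live_rows = [
    ("202010", "10001", 100.0, "UG"),
    ("202010", "10002", 250.0, "GR"),
]
archive_rows = [
    ("202010", "10001", 90.0, "UG", "2020-01-01"),
    ("202010", "10001", 100.0, "UG", "2020-02-01"),
    ("202010", "10002", 200.0, "GR", "2020-02-01"),
]
only_live, only_archive = diff_live_vs_archive(live_rows, archive_rows)
print(only_live, only_archive)
```

Both result lists are empty when the live table matches the latest archive entries. Here the fee for CRN 10002 differs, so that row shows up on each side with its respective value.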