Get latest data for all people in a table and then filter based on some criteria - sql

I am attempting to return the row of the highest value for timestamp (an integer) for each person (that has multiple entries) in a table. Additionally, I am only interested in rows with the field containing ABCD, but this should be done after filtering to return the latest (max timestamp) entry for each person.
SELECT table."person", max(table."timestamp")
FROM table
WHERE table."type" = 1
HAVING table."field" LIKE '%ABCD%'
GROUP BY table."person"
For some reason, I am not receiving the data I expect. The returned table is nearly twice the expected size. Is there some step here that I am getting wrong?

You can first compute max(timestamp) per person in a subquery, then join that result back to the table and filter the matching rows. The query follows:
SELECT t."person", t."timestamp"
FROM table t
JOIN (SELECT "person", MAX("timestamp") AS max_ts FROM table GROUP BY "person") latest
ON latest."person" = t."person" AND latest.max_ts = t."timestamp"
WHERE t."type" = 1 AND t."field" LIKE '%ABCD%'

Direct answer: as I understand your end goal, just move the HAVING clause to the WHERE section:
SELECT
table."person", MAX(table."timestamp")
FROM table
WHERE
table."type" = 1
AND table."field" LIKE '%ABCD%'
GROUP BY table."person";
This should return no more than 1 row per table."person", with their associated maximum timestamp.
As an aside, I am surprised your query worked at all. Your HAVING clause referenced a column that is neither grouped nor aggregated in your query. From the documentation (and my experience):
The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed.
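To make the ordering issue concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the asker's database. The table name entries, the column ts, and the sample rows are all invented for illustration. It contrasts filtering before aggregation (the WHERE version above) with taking the latest row per person first and filtering afterwards.

```python
import sqlite3

# Hypothetical sample data; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (person TEXT, ts INTEGER, type INTEGER, field TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        ("ann", 1, 1, "xxABCDxx"),  # older row matches ABCD
        ("ann", 2, 1, "other"),     # latest row does not match
        ("bob", 1, 1, "other"),
        ("bob", 3, 1, "ABCD"),      # latest row matches
    ],
)

# Filter first, then aggregate: latest ABCD row per person.
filter_first = conn.execute(
    """
    SELECT person, MAX(ts)
    FROM entries
    WHERE type = 1 AND field LIKE '%ABCD%'
    GROUP BY person
    ORDER BY person
    """
).fetchall()

# Aggregate first, then filter: keep a person only if their latest row matches.
latest_first = conn.execute(
    """
    SELECT e.person, e.ts
    FROM entries AS e
    JOIN (
        SELECT person, MAX(ts) AS max_ts
        FROM entries
        WHERE type = 1
        GROUP BY person
    ) AS latest
      ON latest.person = e.person AND latest.max_ts = e.ts
    WHERE e.type = 1 AND e.field LIKE '%ABCD%'
    ORDER BY e.person
    """
).fetchall()

print(filter_first)  # [('ann', 1), ('bob', 3)]
print(latest_first)  # [('bob', 3)]
```

The two results differ for ann, whose newest row does not contain ABCD. Which one is wanted depends on whether "latest" is meant before or after the ABCD filter.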

Related

What does the SELECT function in SQL actually produce? Does it produce a new table by default?

I am struggling to understand what the output of SELECT is meant to be in SQL (I am using MS ACCESS), and what sort of criteria this output needs to specify, if any. As a result, I don't understand why some queries work and others don't. So I know it retrieves data from a table, does calculations with it and displays it. But I don't understand the "inner" working of SELECT function. For instance, what is the name of data structure / entity it displays? Is it a "new" table?
For example, suppose I have a table called "table_name" with 5 columns. One of the columns is called "column_3", and there are 20 records.
SELECT column_3, COUNT(*) AS Count
FROM table_name;
Why does this query fail to run? By logic, I would expect it to display two columns: first column will be "column_3", containing 20 rows with relevant data, and second column will be "Count", containing just one non-empty row (displaying 20), and other 19 rows will be empty (or NULL maybe)?
Is it because SELECT is meant to produce equal number of rows for each column?
Your questions involve a basic understanding of SQL. SELECT statements do not create tables, but instead return virtual result sets. Nothing is persisted unless you change it to an INSERT.
In your example question, you will need to "tell" the SQL engine what you want a count "of". Because you added column_3, you need to write:
SELECT column_3, COUNT(*) AS Count
FROM table_name
GROUP BY column_3
If you wanted a count of all the rows, simply:
SELECT COUNT(*) FROM table_name
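As a rough illustration of the two queries above, the following sketch runs them against an in-memory SQLite database through Python's sqlite3 module instead of MS Access. The five sample values are invented; only table_name and column_3 come from the question.

```python
import sqlite3

# In-memory stand-in for the asker's table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column_3 TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?)",
    [("a",), ("a",), ("b",), ("a",), ("c",)],
)

# One result row per distinct column_3 value, each with its own count.
per_value = conn.execute(
    "SELECT column_3, COUNT(*) AS cnt FROM table_name "
    "GROUP BY column_3 ORDER BY column_3"
).fetchall()

# No GROUP BY: the whole table is one group, so a single row comes back.
total = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]

print(per_value)  # [('a', 3), ('b', 1), ('c', 1)]
print(total)      # 5
```

Both results are ordinary result sets: every row has a value in every selected column, which is why a 20-row column cannot sit next to a 1-row count.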

Querying a table from a parameter in a BigQuery UDF

I am trying to create a UDF that will find the maximum value of a field called 'DatePartition' for each table that is passed through to the UDF as a parameter. The UDF I have created looks like this:
CREATE TEMP FUNCTION maxDatePartition(x STRING) AS ((
SELECT MAX(DatePartition) FROM x WHERE DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(),INTERVAL 7 DAY)
));
but I am getting the following error: "Table name "x" missing dataset while no default dataset is set in the request."
The table names will get passed to the UDF in the format:
my-project.my-dataset.my-table
EDIT: Adding more context: I have multiple tables that are meant to update every morning with yesterday's data. Sometimes the tables are updated later than expected so I am creating a view which will allow users to quickly see the most recent data in each table. To do this I need to calculate MAX(DatePartition) for all of these tables in one statement. The list of tables will be stored in another table but it will change from time to time so I can't hardcode them in.
I have tried to do it in a single statement, but found that I needed a common table expression as a sorting mechanism. I haven't had success using the MAX() function on TIMESTAMPs. Here is the most concise method that has worked for me. No UDF needed. Try something like this:
WITH
DATA AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY your_group_by_fields ORDER BY DatePartition DESC) AS _row,
*
FROM
`my-project.my-dataset.my-table`
WHERE
DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 7 DAY)
)
SELECT
* EXCEPT(_row)
FROM
DATA
WHERE
_row = 1;
This creates a new field holding a row number within each partition, that is, within each group that has multiple records with different timestamps. For the records of a given group, it orders them by most recent DatePartition and assigns a row number, with "1" being the most recent because we sorted by DatePartition DESC.
The outer query then takes the common table expression of numbered rows, returns every column of your table (EXCEPT the row number "_row" you assigned), and filters on "_row = 1", which leaves only your most recent records.
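The same ROW_NUMBER pattern can be tried locally. This is only a sketch under assumptions: it uses SQLite (3.25 or newer, for window functions) through Python's sqlite3 module rather than BigQuery, the table readings and its columns are invented, and the columns are listed explicitly because `* EXCEPT(...)` is BigQuery syntax that SQLite does not have.

```python
import sqlite3

# Hypothetical table: two groups, each with an older and a newer row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (grp TEXT, date_partition TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", "2024-01-01", 10),
        ("a", "2024-01-03", 30),
        ("b", "2024-01-02", 20),
        ("b", "2024-01-01", 5),
    ],
)

latest = conn.execute(
    """
    WITH ranked AS (
        SELECT
            grp, date_partition, value,
            -- Number rows inside each group, newest date first.
            ROW_NUMBER() OVER (
                PARTITION BY grp ORDER BY date_partition DESC
            ) AS _row
        FROM readings
    )
    SELECT grp, date_partition, value
    FROM ranked
    WHERE _row = 1          -- keep only the newest row per group
    ORDER BY grp
    """
).fetchall()

print(latest)  # [('a', '2024-01-03', 30), ('b', '2024-01-02', 20)]
```

Unlike a plain MAX(), this keeps the whole row that carries the newest date, not just the date itself.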

Column value divided by row count in SQL Server

What happens when each column value in a table is divided by the total table row count? What function does SQL Server effectively perform? Can anyone help?
More specifically: what is the difference between sum(column value) / row count and column value / row count? For example:
select cast(officetotal as float) /count(officeid) as value,
sum(officetotal)/ count(officeid) as average from check1
where officeid ='50009' group by officeid,officetotal
What is the operation performed on both select?
In your example both expressions return the same value as long as each officetotal value occurs only once for that office: officeid is fixed by the WHERE clause and officetotal is also in the GROUP BY clause, so every group holds a single row and count(officeid) is always 1. In effect no real grouping is applied, so the example does not compute an average.
When you remove officetotal from the GROUP BY, you will get following message:
Column 'officetotal' is invalid in the select list because it is not
contained in either an aggregate function or the GROUP BY clause.
It means that you cannot use a bare officetotal and SUM(officetotal) in the same select: SUM works on the set of values in a group, and SQL Server cannot tell which single officetotal value from that group to return next to it.
It is simply not possible to write it this way in SQL using GROUP BY. If you are looking for something like the first or last value from a group, you will have to use MIN(officetotal) or MAX(officetotal) or some other approach.
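A small sketch of the difference, using SQLite via Python's sqlite3 module as a stand-in for SQL Server. The two sample rows are invented, and SQLite will not reproduce the SQL Server error quoted above because it tolerates bare columns in grouped queries, so only the two valid groupings are shown.

```python
import sqlite3

# Hypothetical data: one office with two different totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE check1 (officeid TEXT, officetotal INTEGER)")
conn.executemany(
    "INSERT INTO check1 VALUES (?, ?)",
    [("50009", 10), ("50009", 30)],
)

# Grouping by officeid AND officetotal: each group is a single row,
# so COUNT(officeid) is 1 and both expressions just echo the value.
per_value = conn.execute(
    """
    SELECT CAST(officetotal AS REAL) / COUNT(officeid) AS value,
           SUM(officetotal) / COUNT(officeid) AS average
    FROM check1
    WHERE officeid = '50009'
    GROUP BY officeid, officetotal
    ORDER BY officetotal
    """
).fetchall()

# Grouping by officeid only: SUM / COUNT is now a real average.
average = conn.execute(
    """
    SELECT officeid, CAST(SUM(officetotal) AS REAL) / COUNT(officeid)
    FROM check1
    WHERE officeid = '50009'
    GROUP BY officeid
    """
).fetchall()

print(per_value)  # [(10.0, 10), (30.0, 30)]
print(average)    # [('50009', 20.0)]
```

In the first query the only visible difference between the two columns is the float cast; the grouping itself contributes nothing.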

Assign an ID Value for Every Set of Duplicates

How can I generate an ID value for every set of duplicate records, as seen in the ID column of the second table? In other words, how can I make the first table look like the second table using an SQL query?
Assume that first name and last name in the first table can appear in duplicates.
Each first name and last name can have one or many purchase yr and cost.
The given image is just a sample. Total records in table 1 can reach thousands.
I'm using Oracle SQL.
Note: I'm working with one table only that is the first one. The second table is what I want.
You can use the DENSE_RANK analytic function to assign IDs as below:
EDIT:
Simplified query to generate IDs.
SELECT
DENSE_RANK() OVER (ORDER BY First_Name, Last_Name) ID,
t.*
FROM Table1 t;
Reference:
DENSE_RANK on Oracle Database SQL Reference
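For readers without an Oracle instance, the behaviour of DENSE_RANK can be checked with a sketch like the one below. It assumes SQLite 3.25 or newer through Python's sqlite3 module, and the table purchases with its three rows is invented, since the original image is not available.

```python
import sqlite3

# Hypothetical purchases; "Ann Lee" appears twice and should share one ID.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (first_name TEXT, last_name TEXT, purchase_yr INTEGER, cost INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Lee", 2019, 100),
        ("Bob", "Ray", 2020, 50),
        ("Ann", "Lee", 2021, 75),
    ],
)

rows = conn.execute(
    """
    SELECT
        -- Rows with the same name get the same rank; ranks have no gaps.
        DENSE_RANK() OVER (ORDER BY first_name, last_name) AS id,
        first_name, last_name, purchase_yr
    FROM purchases
    ORDER BY id, purchase_yr
    """
).fetchall()

print(rows)
# [(1, 'Ann', 'Lee', 2019), (1, 'Ann', 'Lee', 2021), (2, 'Bob', 'Ray', 2020)]
```

DENSE_RANK rather than RANK is what keeps the IDs consecutive: with RANK, Bob Ray would get 3 instead of 2.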

In SQL, why does group by make a difference when using having count()

I have a table that stores zone_id. Sometimes a zone id appears twice in the database. I wrote a query to show only the zone_ids that have two or more entries in the table.
The following query returns the correct result:
select *, count(zone_id)
from proxies.storage_used
group by zone_id desc
having count(zone_id) > 1;
However, if I group by last_updated or company_id, it returns random values. If I don't add a group by clause, it only displays one value as per the screenshot below. First output shows above query string, second output shows same query string without the 'group by' line and returns only one value:
correction: I'm a new member and thus can't post pictures directly, so I added it on minus: http://min.us/m3yrlkSMu#1o
While my query works, I don't understand why. Can somebody help me understand why group by is altering the actual output, instead of only the grouping of the output? I am using MySQL.
A group by divides the resulting rows into groups and performs the aggregate function on the records in each group. If you do a count(*) without a group by, you will get a single count of all rows in the table: since you didn't specify a group by, there is only one group, containing all records in the table. If you do a count(*) with a group by of zone id, you will get a count of how many records there are for each zone id. If you group by zone id and last updated date, you will get a count of how many rows were updated on each date in each zone.
Without a group by clause, everything is put in the same group, so you get a single result. If there is more than one row in your table, then the having will succeed. So, you'll end up counting all the rows in your table...
source
From what I understand, a query with having but without group by only makes sense in two situations:
You have a where clause, and you want to test a condition on an aggregation of all rows that satisfy that clause.
Same as above, but for all rows in your table (in practice, it doesn't make sense, though).
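The effect of the grouping column on the counts can be reproduced with a short sketch. It uses SQLite through Python's sqlite3 module rather than MySQL, and the five rows are invented; only the names storage_used, zone_id and company_id come from the question.

```python
import sqlite3

# Hypothetical rows: zones 1 and 3 appear twice, company 10 appears twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE storage_used (zone_id INTEGER, company_id INTEGER)")
conn.executemany(
    "INSERT INTO storage_used VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10), (3, 12), (3, 13)],
)

# Groups are zones: the count is "rows per zone".
by_zone = conn.execute(
    """
    SELECT zone_id, COUNT(zone_id)
    FROM storage_used
    GROUP BY zone_id
    HAVING COUNT(zone_id) > 1
    ORDER BY zone_id
    """
).fetchall()

# Groups are companies: the same HAVING now tests "rows per company".
by_company = conn.execute(
    """
    SELECT company_id, COUNT(zone_id)
    FROM storage_used
    GROUP BY company_id
    HAVING COUNT(zone_id) > 1
    ORDER BY company_id
    """
).fetchall()

# No GROUP BY: the whole table is a single group.
whole_table = conn.execute("SELECT COUNT(zone_id) FROM storage_used").fetchone()[0]

print(by_zone)      # [(1, 2), (3, 2)]
print(by_company)   # [(10, 2)]
print(whole_table)  # 5
```

The aggregate and the having condition are identical in the first two queries; only the definition of a group changes, and with it what is being counted.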