Getting Hourly statistics using SQL

We have a table named 'employeeReg' with the fields
employeeNo | employeeName | Registered_on
Here Registered_on is a timestamp.
We require an hourly pattern of registrations over a period of days, e.g.
01 Jan 08 : 12 - 01 PM : 1592 registrations
01 Jan 08 : 01 - 02 PM : 1020 registrations
Can someone please suggest a query for this?
We are using Oracle 10gR2 as our DB server.

This is closely related to, but slightly different from, this question about how to get the latest record for each day when there are multiple entries per day. (One point in common with many, many SQL questions: the table name was not given originally!)
The basic technique will be to find a function that formats the varied Registered_on values so that all the entries in a particular hour are grouped together. Since we're dealing with Oracle, this can presumably be done with TO_CHAR() (which MySQL, for one, does not support).
SELECT TO_CHAR(Registered_on, 'YYYY-MM-DD HH24') AS TimeSlot,
       COUNT(*) AS Registrations
FROM EmployeeReg
GROUP BY TO_CHAR(Registered_on, 'YYYY-MM-DD HH24')
ORDER BY 1;
In the ORDER BY you might be able to replace the '1' with TimeSlot or with the TO_CHAR() expression. Note, however, that the GROUP BY must repeat the full TO_CHAR() expression: Oracle does not treat a number in GROUP BY as a column position (it groups by the literal constant instead), and it does not allow grouping by a select-list alias. (An equivalent works OK on IBM Informix Dynamic Server, using EXTEND(Registered_on, YEAR TO HOUR) in place of TO_CHAR().)
If you then decide you want zeroes to appear for hours when there are no entries, then you will need to create a list of all the hours you do want reported, and you will need to do a LEFT OUTER JOIN of that list with the result from this query. The hard part is generating the correct list - different DBMS have different ways of doing it.
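On Oracle, one common way to generate that list is the CONNECT BY LEVEL trick against dual. A minimal sketch, not verified against this schema; the January 2008 range is hard-coded purely for illustration:
WITH hours AS (
    SELECT TO_CHAR(DATE '2008-01-01' + (LEVEL - 1) / 24, 'YYYY-MM-DD HH24') AS TimeSlot
    FROM dual
    CONNECT BY LEVEL <= 24 * 31  -- one row per hour of January 2008
)
SELECT h.TimeSlot,
       COUNT(e.Registered_on) AS Registrations  -- 0 for hours with no matching rows
FROM hours h
LEFT OUTER JOIN EmployeeReg e
    ON TO_CHAR(e.Registered_on, 'YYYY-MM-DD HH24') = h.TimeSlot
GROUP BY h.TimeSlot
ORDER BY h.TimeSlot;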

Achieved what I wanted, with :)
SELECT TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24') AS TimeSlot,
       COUNT(*) AS Registrations
FROM EmployeeReg a
GROUP BY TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24');

Related

How to convert MySQL to PostgreSQL and add timezone conversion

I want to display all courses that have been accessed in the last 2 years, who accessed each last, and when.
This MySQL query lists when each course was last accessed and by who. I'm converting this query to PostgreSQL 9.3.22. I haven't had much exposure to Postgres, which is proving very difficult. I also need to convert the epoch date to a different time zone, as the PostgreSQL database location is not in my timezone. Edit: timecreated in both databases is stored as epoch (e.g. 1612399773)
select
    userid 'lastaccesseduserid',
    courseid,
    contextid,
    from_unixtime(max(timecreated), '%D %M %Y') 'lastaccesseddate'
from mdl_logstore_standard_log
where timecreated >= unix_timestamp(date_sub(now(), interval 2 year))
group by courseid
This lists the output as such:
| lastaccesseduserid | courseid | contextid | lastaccesseddate |
|--------------------|----------|-----------|-------------------|
| 45 | 6581 | 68435 | 22nd January 2021 |
| 256676 | 32 | 4664 | 19th August 2019 |
etc.
My efforts at converting to PostgreSQL:
select
    distinct on (courseid) courseid,
    contextid,
    to_timestamp(max(timecreated))::timestamptz::date at time zone 'utc' at time zone 'Australia/Sydney' "last accessed date",
    userid
from mdl_logstore_standard_log
where timecreated >= extract(epoch from now() - interval '2 year')
group by courseid
-- error: column userid, contextid must appear in the GROUP BY clause or be used in an aggregate function
None of these columns is the Primary Key (id is, as per here). Grouping by id is bad, as it will list every entry in the log table instead. Any help is appreciated!
Postgres is correct, that query is not valid SQL.
SQL-92 and earlier does not permit queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are not named in the GROUP BY clause.
You can't group by courseid and select courseid, contextid, userid because each courseid might have many rows with different contextids and userids. You either need to group by courseid, contextid, userid or you need to tell the database how you want those columns aggregated like with sum or string_agg.
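For illustration only, the aggregation route might look like the sketch below; string_agg here is an arbitrary choice of aggregate, not necessarily what your report needs:
-- one row per courseid, with every other selected column explicitly aggregated
select courseid,
       max(timecreated) as last_access_epoch,
       string_agg(distinct userid::text, ',') as userids_seen
from mdl_logstore_standard_log
group by courseid;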
I can't tell you which is correct, but the original never really worked. MySQL is just choosing one value at random for you.
In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want
MySQL allowed some unwise SQL "extensions" which later versions turned off by default. This particular one is controlled by ONLY_FULL_GROUP_BY, which MySQL 5.7 and up wisely turns on by default. Your database either turned it off, or is so old that it was not the default.
See MySQL Handling of GROUP BY for more.
I would suggest first enabling ONLY_FULL_GROUP_BY and fixing the queries in MySQL. Then port to Postgres.
MySQL has many such non-standard features. PostgreSQL is much more standards compliant, so it will be a struggle to convert to standard SQL and to PostgreSQL at the same time. I would suggest doing them one at a time: first, convert to standard SQL by turning on the ANSI and TRADITIONAL SQL modes and fixing the resulting issues in MySQL; then try converting the now more standard SQL to PostgreSQL. These SQL modes are collections of MySQL server configs, like ONLY_FULL_GROUP_BY, and can be turned on and fixed one at a time.
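Enabling the mode for your session is a one-liner (a sketch; use SET GLOBAL or the sql_mode setting in my.cnf if you want it to stick):
-- append ONLY_FULL_GROUP_BY to whatever modes are already active
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');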
Note that PostgreSQL 9.3 went end-of-life two years ago. It would be silly to do all that work to change databases only to land on an obsolete version. Consider an upgrade.
Storing times as Unix epochs is awkward and unnecessary. If at all possible, consider converting to timestamp while migrating your data. If time zones matter to you, use timestamp with time zone (timestamptz).
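In Postgres that conversion can be done in place. A sketch, assuming the epoch values represent UTC seconds:
-- rewrite the epoch column as a proper timestamptz
ALTER TABLE mdl_logstore_standard_log
    ALTER COLUMN timecreated TYPE timestamptz
    USING to_timestamp(timecreated);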
You didn't specify what your intention is, but it seems you want to get the latest timecreated for each courseid.
This does not need a GROUP BY in Postgres, only distinct on (). Which has the added benefit that you can include any column you want without being limited by the GROUP BY rules. This only works however if you want one row per courseid (and that should be the "earliest" or "latest"). For other requirements (e.g. "the three latest") window functions are a better fit.
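For example, a hypothetical "three latest accesses per course" could be sketched with row_number() like this (same columns as your query, otherwise unverified):
select courseid, contextid, userid, to_timestamp(timecreated) as accessed_at
from (
    select l.*,
           row_number() over (partition by courseid order by timecreated desc) as rn
    from mdl_logstore_standard_log l
) t
where rn <= 3;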
to_timestamp() already returns a timestamptz so the cast is unnecessary. If you want to remove the time part (which is what the ::date cast will do) I think that should be done after you adjusted the time zone. But then adjusting the time zone seems rather futile if you don't care about the time.
select distinct on (courseid)
    courseid,
    contextid,
    to_timestamp(timecreated) at time zone 'utc' at time zone 'Australia/Sydney' "last accessed date",
    userid
from mdl_logstore_standard_log
where to_timestamp(timecreated) >= current_timestamp - interval '2 year'
order by courseid, 3 DESC
You should also use a real timestamp value in the WHERE clause because the duration of "2 years" might be different depending on the actual year. Comparing epochs won't take that into account.
You might want to think of changing the column to a proper timestamptz column completely in the long run.
Instead of referencing the column index in (3) in the order by, you can also repeat the whole expression: order by courseid, to_timestamp(timecreated) at time zone 'utc' at time zone 'Australia/Sydney' DESC
And you really shouldn't be using Postgres 9.3 - especially not for a new installation. There is no reason not to use the latest version (which is 13 as of 2021-02-04). If this is an existing (old) installation, upgrade as soon as possible. Upgrading from 9.3.22 to 13.1 gives you 2.7 years' worth of fixes (2278 of them).

Using the TABLE_DATE_RANGE function in BigQuery

I'm using BigQuery for the first time in quite a while, so I'm a bit rusty.
I'm using a public dataset that can be found here for Reddit data.
What I'm trying to do is create a query that extracts all data from 2017.
Basically, I want the legacy SQL equivalent of this, which is written using Standard SQL:
fh-bigquery.reddit_posts.2017*
I know that would involve using the TABLE_DATE_RANGE function, but I'm stumped on the specific wording of it.
If I was using just one of the tables, it would look like this:
SELECT
  FORMAT_UTC_USEC(SEC_TO_TIMESTAMP(created_utc)) AS created_date
FROM
  [fh-bigquery:reddit_posts.2017_06]
LIMIT
  10
But I'm obviously trying to span this over multiple months.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
TIMESTAMP_SECONDS(created_utc) AS created_date
FROM `fh-bigquery.reddit_posts.2017_*`
LIMIT 10
It does what your query for one table does - but for all the 2017 tables. (I'm not sure what logic you are actually after in your query, but I assume you left it out of the question for simplicity's sake.)
Note: you can use _TABLE_SUFFIX in your query to identify exactly which table a specific row comes from - for example:
#standardSQL
SELECT
_TABLE_SUFFIX AS month,
COUNT(1) AS records
FROM `fh-bigquery.reddit_posts.2017_*`
GROUP BY month
ORDER BY month
with output as below
month  records
-----  ----------
01     9,218,513
02     8,588,120
03     9,616,340
04     9,211,051
05     9,498,553
06     9,597,725
07     9,989,122
08     10,424,133
09     9,787,604
10     10,281,718
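_TABLE_SUFFIX also works in the WHERE clause, so if you only need part of the year you can restrict the wildcard. A sketch for the first half of 2017:
#standardSQL
SELECT
  TIMESTAMP_SECONDS(created_utc) AS created_date
FROM `fh-bigquery.reddit_posts.2017_*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '06'
LIMIT 10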
In case, for whatever reason, you are still bound to BigQuery Legacy SQL, you can use the below:
#legacySQL
SELECT
FORMAT_UTC_USEC(SEC_TO_TIMESTAMP(created_utc)) AS created_date
FROM TABLE_QUERY([fh-bigquery:reddit_posts], "LEFT(table_id, 5) = '2017_'")
LIMIT 10
But it is highly recommended to migrate to Standard SQL

where-clause based on current and record date

I'm struggling to add a where-clause based on the difference of the current date and the date of the record entry. There are some other simplistic clauses as well, but I have no problem with those, so I'm making the demo data simple to highlight the issue I'm having.
Demo dataset:
Rec_no  Rec_date
77      20170606
69      20170605
55      20170601
33      20170520
29      20170501
The date is recorded in the format yyyymmdd, and I'd like to build a where clause to show only records that were created more than X days before the current date - let's say 10.
So, in this case, only records no 33 and 29 should be shown.
Unfortunately I'm not sure what the actual DB engine is, but it should be something from IBM.
How could this be done?
As suggested in the comments, updating the schema to store the date as the correct type is the best option before you start. However, if that is not possible for whatever reason, you would first need to convert the stored date to the correct type at runtime.
I'll write an example in T-SQL, as that's what I know. Once you have worked out your DBMS, I can edit in the relevant functions/syntax.
SELECT *
FROM Demodataset
WHERE CAST(Rec_date AS datetime) <= DATEADD(day, -10, GETDATE())
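Since the engine "should be something from IBM", here is a hedged Db2 sketch as well. It assumes Rec_date is a character column holding yyyymmdd (wrap it in VARCHAR() first if it is numeric); TO_DATE is Db2's synonym for TIMESTAMP_FORMAT:
SELECT *
FROM Demodataset
-- keep only records created more than 10 days before today
WHERE DATE(TO_DATE(Rec_date, 'YYYYMMDD')) <= CURRENT DATE - 10 DAYS;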

applying knowledge of SQL for everyday workplace activities

My question is how to properly write a SQL query for the quoted question below.
There is a table in an HMO database which stores doctors' working hours. The table has the following fields: "FirstName", "LastName", "Date", "HoursWorked". Write a SQL statement which retrieves the average working hours for the period January-March for a doctor named Joe Doe.
So far I have:
SELECT HoursWorked
FROM Table
WHERE DATE = (January - March) AND
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"
A few pointers as this sounds like a homework question (which we don't answer for you here, but we can try to give you some guidance).
You want to put all the things you want to return from your select first and you want to have all your search conditions at the end.
So the general format would be :
SELECT Column1,
       Column2,
       Column3
FROM YourTable
WHERE Column4 = Restriction1
  AND Column5 = Restriction2
The next thing you need to think about is how the dates are formatted in your database table. Hopefully they're kept in a column of type datetime or date (the options will depend on the database engine you're using, e.g. Microsoft SQL Server, Oracle or MySQL). In reality, some older databases store dates in all sorts of formats, which makes this much harder; but since I'm assuming it's a homework-type question, let's assume it's a datetime format.
You specify restrictions by comparing columns to a value, so if you wanted all rows where the date was after midnight on the 2nd of March 2012, you would have the WHERE clause :
WHERE MyDateColumn >= '2012-03-02 00:00:00'
Note that to avoid confusion, we usually try to format dates as "Year-Month-Day Hour:Minute:Second". This is because in different countries, dates are often written in different formats and this is considered a Universal format which is understood (by computers at least) everywhere.
So you would want to combine a couple of these comparisons in your WHERE, one for dates AFTER a certain date in time AND one for dates before another point in time.
If you give this a go and see where you get to, update your question with your progress and someone will be able to help get it finished if you have problems.
If you don't have access to an actual database and need to experiment with syntax, try this site : http://sqlfiddle.com/
You already have the answer written:
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"
You only need to fix the query:
SELECT AVG(HoursWorked) as AVGWORKED FROM Table WHERE FirstName='Joe' AND LastName='Doe'
That query will give you the average hours worked for Joe Doe; however, you still need to restrict it to the time period, which you do by adding another "AND". If you are using SQL Server you can use the built-in function DATEFROMPARTS(year, month, day) to create a new date, or you can convert a string to a date with CONVERT(Date, 'MM/dd/yyyy') (also SQL Server syntax; other engines have their own conversion functions, such as TO_DATE).
Example
SELECT AVG(HoursWorked) AS AVGWORKED
FROM Table
WHERE FirstName = 'Joe'
  AND LastName = 'Doe'
  AND DateColumn BETWEEN DateFromParts(year, month, day) AND Convert(Date, 'MM/dd/yyyy')
In the example I showed both approaches (DATEFROMPARTS for the initial date, and CONVERT(Date, ...) for the ending date).
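Putting the pieces together, a concrete sketch might look like this (SQL Server syntax; the year 2012 is hypothetical, since the question doesn't say which year):
SELECT AVG(HoursWorked) AS AVGWORKED
FROM [Table]
WHERE FirstName = 'Joe'
  AND LastName = 'Doe'
  AND [Date] >= DATEFROMPARTS(2012, 1, 1)  -- start of January
  AND [Date] < DATEFROMPARTS(2012, 4, 1);  -- up to, not including, April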

Detecting Invalid Dates in Oracle 11g database (ORA-01847 )

I am querying an Oracle 11.2 instance to build a small data mart that includes extracting the date of birth and date of death of people.
Unfortunately the INSERT query (which takes its data from a SELECT) fails due to ORA-01847 (day of month must be between 1 and last day of month).
To find my bad dates I first did:
SELECT extract(day FROM SOME_DT_TM),
       extract(month FROM SOME_DT_TM),
       COUNT(*)
FROM PERSON
GROUP BY extract(day FROM SOME_DT_TM), extract(month FROM SOME_DT_TM)
ORDER BY COUNT(*) DESC;
It gave me 367 rows, one for each day of the year including NULL and February-29th (leap year). True for the other date column as well, so it looks like the data is fine from a SELECT perspective.
However, if I set up error logging for my insert:
create table registry_new_dates
(some_dob date, some_death_date date);
exec dbms_errlog.create_error_log('SOME_NEW_DATES');
And then run my long insert query and inspect the error log:
SELECT some_dob,some_death_date,ora_err_mesg$ FROM ERR$_SOME_NEW_DATES;
I get the following weird results (first 3 rows shown), which make me think that zip codes have somehow been inserted instead of dates in the 2nd column:
31-DEC-25   35244        "ORA-01847: day of month must be between 1 and last day of month"
13-DEC-33   35244-3402   "ORA-01847: day of month must be between 1 and last day of month"
23-JUN-58   35235        "ORA-01847: day of month must be between 1 and last day of month"
My question is: how do I detect these bad rows (there are 11, apparently) with an SQL statement so I can fix or remove them? Fixing them in the originating table is not an option (no write privileges). I tried using queries like this:
SELECT DECEASED_DT_TM
FROM WH_CLN_PERSON
WHERE DECEASED_DT_TM LIKE '35%'
AND rownum<3;
But it did not find the offending rows.
Not sure if you are still actively researching this (or if you got an answer already).
To find the rows with the bad data, can't you instead select the DOB and the date of death, and express the WHERE clause in terms of DOB - like so:
...WHERE some_dob = to_date('31-DEC-25')
After you find those rows, you may want to do another query on just one or two of them, including a calculated column: dump(date of death). Then post that. We can learn a lot from the dump - the internal representation of the so-called "date" (which may very well be a ZIP code instead). With that in hand we may be able to figure out what's stored, and how to hunt for it.
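A concrete version of that suggestion might look like the sketch below; the DOB column name (BIRTH_DT_TM) and the explicit format mask are assumptions, so adjust to your schema:
SELECT BIRTH_DT_TM,                            -- hypothetical DOB column name
       DECEASED_DT_TM,
       DUMP(DECEASED_DT_TM) AS internal_bytes  -- raw internal representation of the stored value
FROM WH_CLN_PERSON
WHERE BIRTH_DT_TM = TO_DATE('31-DEC-25', 'DD-MON-RR')
  AND ROWNUM <= 2;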