How to convert MySQL to PostgreSQL and add timezone conversion - sql

I want to display all courses that have been accessed in the last 2 years, who accessed it last and when.
This MySQL query lists when each course was last accessed and by who. I'm converting this query to PostgreSQL 9.3.22. I haven't had much exposure to Postgres, which is proving very difficult. I also need to convert the epoch date to a different time zone, as the PostgreSQL database location is not in my timezone. Edit: timecreated in both databases is stored as epoch (e.g. 1612399773)
select
userid 'lastaccesseduserid',
courseid,
contextid,
from_unixtime(max(timecreated), '%D %M %Y') 'lastaccesseddate'
from mdl_logstore_standard_log
where timecreated >= unix_timestamp(date_sub(now(), interval 2 year))
group by courseid
This lists the output as such:
| lastaccesseduserid | courseid | contextid | lastaccesseddate |
|--------------------|----------|-----------|-------------------|
| 45 | 6581 | 68435 | 22nd January 2021 |
| 256676 | 32 | 4664 | 19th August 2019 |
etc.
My efforts at converting to PostgreSQL:
select
distinct ON (courseid) courseid,
contextid,
to_timestamp(max(timecreated))::timestamptz::date at time zone 'utc' at time zone 'Australia/Sydney' "last accessed date",
userid
from mdl_logstore_standard_log
where timecreated >= extract(epoch from now()- interval '2 year')
group by courseid
-- error: column userid, contextid must appear in the GROUP BY clause or be used in an aggregate function
None of these columns is the Primary Key (id is, as per here). Grouping by id is bad, as it will list every entry in the log table instead. Any help is appreciated!

Postgres is correct, that query is not valid SQL.
SQL-92 and earlier does not permit queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are not named in the GROUP BY clause.
You can't group by courseid and select courseid, contextid, userid because each courseid might have many rows with different contextids and userids. You either need to group by courseid, contextid, userid or you need to tell the database how you want those columns aggregated like with sum or string_agg.
I can't tell you which is correct, but the original never really worked. MySQL is just choosing one value at random for you.
In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want
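For illustration, a hedged sketch of the two standard-SQL fixes (column names are taken from the query; whether max and string_agg are the right aggregates for this data is an assumption):
-- Option 1: group by every non-aggregated column
select courseid, contextid, userid, max(timecreated) as lastaccessed
from mdl_logstore_standard_log
group by courseid, contextid, userid;
-- Option 2: group by courseid alone and aggregate the other columns explicitly
select courseid,
       max(contextid) as contextid,                      -- assumption: any context will do
       string_agg(distinct userid::text, ',') as userids,
       max(timecreated) as lastaccessed
from mdl_logstore_standard_log
group by courseid;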
MySQL allowed some unwise SQL "extensions", which later versions turn off by default. This particular one is controlled by ONLY_FULL_GROUP_BY, which MySQL 5.7 and up wisely turns on by default. Your database either turned it off, or is so old that this was not the default.
See MySQL Handling of GROUP BY for more.
I would suggest first enabling ONLY_FULL_GROUP_BY and fixing the queries in MySQL. Then port to Postgres.
MySQL has many such non-standard features. PostgreSQL is much more standards-compliant, so it will be a struggle to convert to standard SQL and to PostgreSQL at the same time. I would suggest doing them one at a time: first convert to standard SQL by turning on the ANSI and TRADITIONAL SQL modes and fixing the resulting issues in MySQL, then try converting the now more standard SQL to PostgreSQL. These SQL modes are collections of MySQL server configs, like ONLY_FULL_GROUP_BY, and can be turned on and fixed one at a time.
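For example, a hedged sketch of turning these on per session in MySQL (the mode names are from the MySQL docs; use SET GLOBAL or the option file for a broader scope):
-- Turn on just the one flag for the current session:
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
-- Or turn on a whole collection of stricter behaviours at once:
SET SESSION sql_mode = 'TRADITIONAL';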
Note that PostgreSQL 9.3.22 reached end of life two years ago. It would be silly to do all the work of changing databases only to end up on an obsolete version. Consider an upgrade.
Storing times as Unix epochs is awkward and unnecessary. If at all possible, consider converting to timestamp while migrating your data. If you intend to also store the time zone, use timestamp with time zone (timestamptz).
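If you do migrate, a hedged sketch of converting the epoch column in place on the Postgres side (table and column names are taken from the question; test against a copy first):
alter table mdl_logstore_standard_log
    alter column timecreated type timestamptz
    using to_timestamp(timecreated);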

You didn't specify what your intention is, but it seems you want to get the latest timecreated for each courseid.
This does not need a GROUP BY in Postgres, only distinct on (), which has the added benefit that you can include any column you want without being limited by the GROUP BY rules. This only works, however, if you want exactly one row per courseid (and that should be the "earliest" or "latest"). For other requirements (e.g. "the three latest"), window functions are a better fit, as sketched below.
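A hedged sketch of that window-function variant, e.g. for the three latest accesses per course (the 2-year filter is carried over from the query below):
select courseid, contextid, userid, to_timestamp(timecreated) as accessed_at
from (
    select l.*,
           row_number() over (partition by courseid order by timecreated desc) as rn
    from mdl_logstore_standard_log l
    where to_timestamp(timecreated) >= current_timestamp - interval '2 year'
) ranked
where rn <= 3;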
to_timestamp() already returns a timestamptz, so the cast is unnecessary. If you want to remove the time part (which is what the ::date cast will do), I think that should be done after you have adjusted the time zone. But then adjusting the time zone seems rather futile if you don't care about the time.
select distinct ON (courseid)
courseid,
contextid,
to_timestamp(timecreated) at time zone 'utc' at time zone 'Australia/Sydney' "last accessed date",
userid
from mdl_logstore_standard_log
where to_timestamp(timecreated) >= current_timestamp - interval '2 year'
order by courseid, 3 DESC
You should also use a real timestamp value in the WHERE clause because the duration of "2 years" might be different depending on the actual year. Comparing epochs won't take that into account.
You might want to think of changing the column to a proper timestamptz column completely in the long run.
Instead of referencing the column index (3) in the order by, you can also repeat the whole expression: order by courseid, to_timestamp(timecreated) at time zone 'utc' at time zone 'Australia/Sydney' DESC
And you really shouldn't be using Postgres 9.3 - especially not for a new installation. There is no reason not to use the latest version (which is 13 as of 2021-02-04). If this is an existing (old) installation, upgrade as soon as possible. Upgrading from 9.3.22 to 13.1 gives you 2.7 years' worth of fixes (2278 of them).

Related

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report, and I want to ensure each day is accounted for and has at least one row of information. We had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data), so I want to make sure no date was left out; this data goes from last summer through now. I could use a COUNT DISTINCT to check that the number of distinct dates matches the number of days between the first data point and yesterday (the latest data point), but I want to verify each individual day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
I know FIRST is probably not the best clause for this case, and it's currently acting to grab the very first data point in created_at, but it's just a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
The BigQuery equivalent of the query in your question:
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
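To directly list any days that have no rows at all, a hedged sketch using GENERATE_DATE_ARRAY (the start date is an assumption based on "last summer"; substitute your real table path for large_table):
WITH all_days AS (
  SELECT dt
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-06-01', CURRENT_DATE())) AS dt
)
SELECT a.dt AS missing_day
FROM all_days a
LEFT JOIN `large_table` lt
  ON DATE(lt.created_at) = a.dt
WHERE lt.created_at IS NULL
ORDER BY a.dt;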

applying knowledge of SQL for everyday workplace activities

My question is how to properly write a SQL query for the below highlighted/bold question.
There is a table in an HMO database which stores doctors' working hours. The table has the following fields: "FirstName", "LastName", "Date", "HoursWorked". Write a SQL statement which retrieves the average working hours for the period January-March for a doctor with the name Joe Doe.
So far I have:
SELECT HoursWorked
FROM Table
WHERE DATE = (January - March) AND
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"
A few pointers as this sounds like a homework question (which we don't answer for you here, but we can try to give you some guidance).
You want to put all the things you want to return from your select first and you want to have all your search conditions at the end.
So the general format would be :
SELECT Column1,
Column2,
Column3
FROM YourTable
WHERE Column4 = Restriction1
AND Column5 = Restriction2
The next thing you need to think about is how the dates are formatted in your database table. Hopefully they're kept in a column of type datetime or date (options will depend on the database engine you're using, e.g. Microsoft SQL Server, Oracle or MySQL). In reality some older databases can store dates in all sorts of formats, which makes this much harder, but since I'm assuming it's a homework-type question, let's assume it's a datetime format.
You specify restrictions by comparing columns to a value, so if you wanted all rows where the date was after midnight on the 2nd of March 2012, you would have the WHERE clause :
WHERE MyDateColumn >= '2012-03-02 00:00:00'
Note that to avoid confusion, we usually try to format dates as "Year-Month-Day Hour:Minute:Second". This is because in different countries, dates are often written in different formats and this is considered a Universal format which is understood (by computers at least) everywhere.
So you would want to combine a couple of these comparisons in your WHERE, one for dates AFTER a certain date in time AND one for dates before another point in time.
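For example, a hedged sketch of such a range for January through March (the year and column name are placeholders):
WHERE MyDateColumn >= '2012-01-01 00:00:00'
  AND MyDateColumn <  '2012-04-01 00:00:00'
Using < the first moment of April, rather than <= the last day of March, avoids losing rows with times late on 31 March.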
If you give this a go and see where you get to, update your question with your progress and someone will be able to help get it finished if you have problems.
If you don't have access to an actual database and need to experiment with syntax, try this site : http://sqlfiddle.com/
You already have the answer written:
SELECT AVG(HoursWorked) FROM Table WHERE FirstName="Joe",LastName="Doe"
You only need to fix the syntax:
SELECT AVG(HoursWorked) as AVGWORKED FROM Table WHERE FirstName='Joe' AND LastName='Doe'
That query will give you the average hours worked for Joe Doe over all time. To restrict it to a period, you only need to add another "AND" condition. If you are using SQL Server, you can use the built-in function DATEFROMPARTS(year, month, day) to create a date, or you can convert a string to a date with CONVERT(DATE, 'MM/dd/yyyy').
Example
SELECT AVG(HoursWorked) as AVGWORKED FROM Table WHERE FirstName='Joe' AND LastName='Doe' AND DateColumn between DateFromParts(year,month,day) and Convert(Date,'MM/dd/yyyy')
In the example I showed both approaches (DATEFROMPARTS for the initial date, and CONVERT(DATE, ...) for the ending date).
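For instance, a hedged concrete version of that example, assuming the period is January-March 2012 (Table and Date are kept from the question but escaped, since both are reserved words in SQL Server):
SELECT AVG(HoursWorked) AS AVGWORKED
FROM [Table]
WHERE FirstName = 'Joe'
  AND LastName = 'Doe'
  AND [Date] BETWEEN DATEFROMPARTS(2012, 1, 1) AND CONVERT(DATE, '03/31/2012', 101);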

Detecting Invalid Dates in Oracle 11g database (ORA-01847 )

I am querying an Oracle 11.2 instance to build a small data mart that includes extracting the date of birth and date of death of people.
Unfortunately the INSERT query (which takes its data from a SELECT) fails due to ORA-01847 (day of month must be between 1 and last day of month).
To find my bad dates I first did:
SELECT extract(day FROM SOME_DT_TM),
extract(month FROM SOME_DT_TM),
COUNT(*)
FROM PERSON
GROUP BY extract(day FROM SOME_DT_TM), extract(month FROM SOME_DT_TM)
ORDER BY COUNT(*) DESC;
It gave me 367 rows, one for each day of the year including NULL and February-29th (leap year). True for the other date column as well, so it looks like the data is fine from a SELECT perspective.
However if I set logging up on my insert
create table registry_new_dates
(some_dob date, some_death_date date);
exec dbms_errlog.create_error_log('SOME_NEW_DATES');
And then, after running my long insert query, inspect the logged errors:
SELECT some_dob,some_death_date,ora_err_mesg$ FROM ERR$_SOME_NEW_DATES;
I get the following weird results (first 3 rows shown) which makes me think that zip codes have been somehow inserted instead of dates for the 2nd column.
31-DEC-25 35244 "ORA-01847: day of month must be between 1 and last day of month"
13-DEC-33 35244-3402 "ORA-01847: day of month must be between 1 and last day of month"
23-JUN-58 35235 "ORA-01847: day of month must be between 1 and last day of month"
My question is - how do I detect these bad rows (apparently there are 11) with a SQL statement so I can fix or remove them? Fixing them in the originating table is not an option (no write privileges). I tried using queries like this:
SELECT DECEASED_DT_TM
FROM WH_CLN_PERSON
WHERE DECEASED_DT_TM LIKE '35%'
AND rownum<3;
But it did not find the offending rows.
Not sure if you are still actively researching this (or if you got an answer already).
To find the rows with the bad data, can't you instead select the DOB and the date of death, and express the WHERE clause in terms of DOB, like so?
...WHERE some_dob = to_date('31-DEC-25')
After you find those rows, you may want to do another query on just one or two of them, including a calculated column: dump(date of death). Then post that. We can learn a lot from the dump - the internal representation of the so-called "date" (which may very well be a ZIP code instead). With that in hand we may be able to figure out what's stored, and how to hunt for it.
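A hedged sketch of that follow-up query (the table name is taken from the question; the column names and date format are assumptions for illustration):
SELECT some_dob,
       some_death_date,
       DUMP(some_death_date) AS death_dump
  FROM WH_CLN_PERSON
 WHERE some_dob = TO_DATE('31-DEC-25', 'DD-MON-RR')
   AND ROWNUM <= 2;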

How to design SQL tables when column data arrives in multiple types/margins of error?

I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.
Depending on how you look at it, this is either a variance in type or margin of error.
This is a subprime situation, but I can't afford to recover or discard any data.
What's the optimal (e.g. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?
*i.e. Entity-Attribute-Value is out.
You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.
If it comes in as just a year, make it default to 01 for month and day: YYYY-01-01.
This way you can still use a date/datetime datatype and don't have to worry about invalid dates.
Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.
I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.
It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.
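A hedged sketch of that layout (names are illustrative, and the precision encoding is just one possible convention):
CREATE TABLE event_dates (
    id             INTEGER PRIMARY KEY,
    event_date     DATETIME NOT NULL,  -- year-only values padded to e.g. YYYY-01-01
    date_precision SMALLINT NOT NULL   -- 1 = year known, 2 = year+month, 3 = full date
);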
Either fix it, then store it (OK, not an option)
Or store it broken, with a Fixed computed column
Something like this
CREATE TABLE ...
...
Broken varchar(20),
Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)
This also allows you to distinguish good from bad source data
If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.
I'd put three columns in the table:
The provided value (YYYY-MM-DD or YYYY)
A date column, Date or DateTime data type, which is nullable
A year column, as an integer or char(4) depending upon your needs.
I'd always populate the year column, populate the date column only when the provided value is a date.
And, because you've kept the provided value, you can always re-process down the road if needs change.
An alternative solution would be that of a date mask (like in IP). Store the date in a regular datetime field, and add an additional field of type smallint or something, in which you indicate which parts are present (you could even go binary here):
If you have YYYY-MM-DD, you would have 3 bits of data, each with the value 1 if that part is present and 0 if not.
Example:
| Date       | Mask    |
|------------|---------|
| 2009-12-05 | 7 (111) |
| 2009-12-01 | 6 (110: only year and month are known; day is set to the default 1) |
| 2009-01-20 | 5 (101: for some strange reason, only the year and the day are known; January has 31 days, so it will never generate an error) |
Which solution is better depends on what you will do with it.
This is better when you want to select rows with full dates that fall within a certain period (less to write). It also makes it easier to compare dates that have masks like 7, 6 or 4. It may take up less memory as well (date + smallint may be smaller than int + int + int; it would only be the same if datetime uses 64 bits and smallint takes up as much space as int).
I was going to suggest the same solution as #ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.
You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.
+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.
My solution:
a date column used only for complete dates
an int column used for years
a constraint to ensure integrity between the two
a trigger to populate the year if only date is supplied
Advantages:
1. can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
2. can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
3. insert either year or date or both (provided they agree)
4. no fear of disagreement between columns
5. self explanatory, intuitive
I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.
Example code for Sqlite 3:
create table events
(
rowid integer primary key,
event_year integer,
event_date date,
check (event_year = cast(strftime('%Y', event_date) as integer))
);
create trigger year_trigger after insert on events
begin
update events set event_year = cast(strftime('%Y', event_date) as integer)
where rowid = new.rowid and event_date is not null;
end;
-- various methods to insert
insert into events (event_year, event_date) values (2008, '2008-02-23');
insert into events (event_year) values (2009);
insert into events (event_date) values ('2010-01-19');
-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime('%m', event_date) = '01';

Getting Hourly statistics using SQL

We have a table named 'employeeReg' with fields
employeeNo | employeeName | Registered_on
Here Registered_on is a timestamp.
We require an hourly pattern of registrations over a period of days, e.g.
01 Jan 08 : 12 - 01 PM : 1592 registrations
01 Jan 08 : 01 - 02 PM : 1020 registrations
Can someone please suggest a query for this.
We are using Oracle 10gR2 as our DB server.
This is closely related to, but slightly different from, this question about How to get the latest record for each day when there are multiple entries per day. (One point in common with many, many SQL questions - the table name was not given originally!)
The basic technique will be to find a function that will format the varied Registered_on values such that all the entries in a particular hour are grouped together. This presumably can be done with TO_CHAR() since we're dealing with Oracle (MySQL does not support this).
SELECT TO_CHAR(Registered_on, 'YYYY-MM-DD HH24') AS TimeSlot,
COUNT(*) AS Registrations
FROM EmployeeReg
GROUP BY 1
ORDER BY 1;
You may need to replace the '1' in the GROUP BY with the full TO_CHAR() expression: Oracle accepts a positional reference in ORDER BY but not in GROUP BY, and it does not allow select-list aliases like TimeSlot in GROUP BY either (an equivalent works OK on IBM Informix Dynamic Server using EXTEND(Registered_on, YEAR TO HOUR) in place of TO_CHAR()).
If you then decide you want zeroes to appear for hours when there are no entries, then you will need to create a list of all the hours you do want reported, and you will need to do a LEFT OUTER JOIN of that list with the result from this query. The hard part is generating the correct list - different DBMS have different ways of doing it.
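For Oracle specifically, a hedged sketch using CONNECT BY LEVEL to generate the hour list (the start date and one-week span are placeholders):
SELECT ts.slot,
       NVL(r.registrations, 0) AS registrations
FROM (
    SELECT TO_CHAR(DATE '2008-01-01' + (LEVEL - 1) / 24, 'YYYY-MM-DD HH24') AS slot
    FROM dual
    CONNECT BY LEVEL <= 24 * 7  -- one week of hourly slots
) ts
LEFT OUTER JOIN (
    SELECT TO_CHAR(Registered_on, 'YYYY-MM-DD HH24') AS slot,
           COUNT(*) AS registrations
    FROM EmployeeReg
    GROUP BY TO_CHAR(Registered_on, 'YYYY-MM-DD HH24')
) r ON r.slot = ts.slot
ORDER BY ts.slot;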
Achieved what I wanted, with :)
SELECT TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24') AS TimeSlot,
COUNT(*) AS Registrations
FROM EmployeeReg a
Group By TO_CHAR(a.registered_on, 'DD-MON-YYYY HH24');