SQL query to count number of checkins per month - sql

To put a long story short, I am working on a database using PostgreSQL that is managing Yelp checkins. The checkin table has the attributes business_id (string), date (string in the form yyyy-mm-dd), and time (string in the form 00:00:00).
What I simply need to do is, given a business_id, I need to return a list of the total number of checkins based on just the mm (month) value.
So for instance, I need to retrieve the total checkins that were in Jan, Feb, March, April, etc, not based upon the year.
Any help is greatly appreciated. I've already considered group by clauses but I didn't know how to factor in '%mm%'.

Reiterating Gordon, class or not, storing dates and times as strings makes things harder, slower, and more likely to break. It's harder to take advantage of Postgres's powerful date math functions. Storing dates and times separately makes things even harder; you have to concatenate them together to get the full timestamp which means it will not be indexed. Determining the time between two events becomes unnecessarily difficult.
It should be a single timestamp column. Hopefully your class will introduce that shortly.
What I simply need to do is, given a business_id, I need to return a list of the total number of checkins based on just the mm (month) value.
This is deceptively straightforward. Cast your strings to dates; fortunately they're in ISO 8601 format, so no reformatting is required. Then use extract to pull out just the month part.
select
extract('month' from checkin_date::date) as month,
count(*)
from yelp_checkins
where business_id = ?
group by month
order by month
But there's a catch. What if there are no checkins for a business in a given month? We'll get no entry for that month. This is a pretty common problem.
If we want a row for every month, we need to generate a table with our desired months with generate_series, then left join with our checkin table. A left join ensures all the months (the "left" table) will be there even if there is no corresponding month in the join table (the "right" table).
select
months.month,
count(business_id)
from generate_series(1,12) as months(month)
left join yelp_checkins
on months.month = extract('month' from checkin_date::date)
and business_id = ?
group by months.month
order by months.month
Now that we have a table of months, we can group by that. We can't use a where business_id = ? clause or that will filter out empty months after the left join has happened. Instead we must put that as part of the left join.
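A minimal sketch of the difference between filtering in the join condition and filtering in a where clause. It runs against an in-memory SQLite database from Python rather than PostgreSQL (an assumption made so the example is self-contained), so a recursive CTE stands in for generate_series and strftime('%m', ...) stands in for extract. The sample rows and the function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table yelp_checkins (business_id text, checkin_date text, checkin_time text)"
)
conn.executemany(
    "insert into yelp_checkins values (?, ?, ?)",
    [
        ("b1", "2016-01-05", "10:00:00"),
        ("b1", "2017-01-20", "11:30:00"),
        ("b1", "2017-03-02", "09:15:00"),
        ("b2", "2017-02-14", "18:45:00"),
    ],
)

# SQLite stand-in for generate_series(1, 12): a recursive CTE yielding 1..12.
MONTHS = """
with recursive months(month) as (
    select 1 union all select month + 1 from months where month < 12
)
"""


def counts_filter_in_join(business_id):
    """Business filter inside the left join: every month keeps a row."""
    sql = MONTHS + """
    select months.month, count(c.business_id)
    from months
    left join yelp_checkins c
      on months.month = cast(strftime('%m', c.checkin_date) as integer)
     and c.business_id = ?
    group by months.month
    order by months.month
    """
    return conn.execute(sql, (business_id,)).fetchall()


def counts_filter_in_where(business_id):
    """Business filter in where: months without checkins are dropped."""
    sql = MONTHS + """
    select months.month, count(c.business_id)
    from months
    left join yelp_checkins c
      on months.month = cast(strftime('%m', c.checkin_date) as integer)
    where c.business_id = ?
    group by months.month
    order by months.month
    """
    return conn.execute(sql, (business_id,)).fetchall()


in_join = counts_filter_in_join("b1")
in_where = counts_filter_in_where("b1")
print(in_join)
print(in_where)
```

With this sample data the first function returns twelve rows (zeros for the empty months), while the second returns rows only for January and March.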

Why would you store the date as a string? That is a broken data model. You should fix the data.
That said, I recommend converting to a date and truncating to the first day of the month:
select date_trunc('month', datestr::date) as yyyymm, count(*)
from t
group by yyyymm
order by yyyymm;
If you don't want these based on the year, then use extract():
select extract(month from datestr::date) as mm, count(*)
from t
group by mm
order by mm;
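To make the difference between the two groupings concrete, here is a small Python sketch (not PostgreSQL; the sample date strings are invented). Replacing the day with 1 mimics truncating to the month, which keeps the year; taking only the month number mimics extract(month ...), which merges the same month across years.

```python
from collections import Counter
from datetime import date

# Hypothetical ISO 8601 date strings, as they would be stored in the table.
datestrs = ["2016-01-05", "2017-01-20", "2017-01-25", "2017-03-02"]
dates = [date.fromisoformat(s) for s in datestrs]

# Truncate to the first day of the month: one bucket per (year, month).
per_year_month = Counter(d.replace(day=1) for d in dates)

# Keep only the month number: one bucket per calendar month, any year.
per_month = Counter(d.month for d in dates)

print(sorted(per_year_month.items()))
print(sorted(per_month.items()))
```

Here January 2016 and January 2017 are separate buckets in the first count and a single bucket of three in the second.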

Related

Criteria in SQL/ms-access is only considering day of the month, not full date

NB: this is a follow up question from Syntax of MS Access/SQL sub-query including aggregate functions.
I am trying to produce a database to manage maintenance of equipment. I have two tables:
One (Inventory) containing details of each piece of equipment, including Purchase Date and Service Period,
One containing details of work done (WorkDone), including the date the work was carried out (Work Date).
I would like a query that displays the date that it should be next serviced. So far I have:
SELECT Max(NZ(DateAdd('m', i.[Service Period], w.[Work Date]),
DateAdd('m', i.[Service Period], i.[Purchase Date]))
) AS NextServiceDate, i.Equipement
FROM Inventory i LEFT JOIN WorkDone w ON i.ID = w.Equipment
GROUP BY i.Equipement
I would now like to order by NextServiceDate and only show entries where NextServiceDate is in the next week. However adding
HAVING (((Max(Nz(DateAdd('m',i.[Service Period],w.[Work Date]),DateAdd('m',i.[Service Period],i.[Purchase Date]))))<DateAdd('ww',1,Date())))
ORDER BY Max(Nz(DateAdd('m',i.[Service Period],w.[Work Date]),DateAdd('m',i.[Service Period],i.[Purchase Date])));
only shows when the day of the month is less than one week from now (e.g. if it is the 1st today it will show all entries where NextServiceDate occurs in the first 7 days of any month of any year, past or future). For some reason it is only considering the day of the month and not the full date. I don't understand why...
NB: I have a British system so date format is dd-mm-yyyy.
After a cursory review of your code, I'm unsure whether the instances of i.Equipement are a typo (since your JOIN clause refers to a similar field w.Equipment), or whether these two fields are intentionally named differently.
I can't see anything else immediately wrong with your code, but it may be clearer and easier to debug if you were to restructure the code to the following:
SELECT
Max(sub.dat) as NextServiceDate, sub.eqp
FROM
(
SELECT
DateAdd('m',i.[Service Period],Nz(w.[Work Date],i.[Purchase Date])) as dat, i.Equipement as eqp
FROM
Inventory i LEFT JOIN WorkDone w ON i.ID = w.Equipment
) AS sub
GROUP BY
sub.eqp
HAVING
Max(sub.dat) < DateAdd('ww',1,Date())
ORDER BY
Max(sub.dat)
Note that the difference in regional date formats will only have an effect when you are specifying literal dates (for example, as criteria), in which case you would need to adhere to the format #mm/dd/yyyy#.
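The thread does not pin down the cause, but the symptom described (only the day of the month seems to matter) is what you would see if the values ended up being compared as dd-mm-yyyy text rather than as dates. The Python sketch below only illustrates that hypothesis; the sample dates, the cutoff, and the helper parse_uk are invented.

```python
from datetime import datetime


def parse_uk(s):
    """Parse a dd-mm-yyyy string into a date (hypothetical helper)."""
    return datetime.strptime(s, "%d-%m-%Y").date()


next_service = ["03-11-2031", "25-02-2020", "05-06-2024"]
cutoff = "08-06-2024"  # pretend this is "one week from today"

# Text comparison: ordered character by character, so the day comes first.
as_text = [s for s in next_service if s < cutoff]

# Date comparison: ordered chronologically.
as_dates = [s for s in next_service if parse_uk(s) < parse_uk(cutoff)]

print(as_text)   # includes a date in 2031 because "03" sorts before "08"
print(as_dates)  # only dates that really fall before the cutoff
```

If that is what is happening, keeping the values as real dates all the way through (as in the restructured query, where Nz is applied before DateAdd) would avoid the problem.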

Comparing Dates in PL/SQL (I'm a beginner, be gentle)

Relatively new to SQL, so still trying to get my head around a couple of concepts.
I have two tables, the first has a bunch of VARCHAR attributes, as well as a 'DAY' column which is in date format, and 'USAGE', a number.
The second table only has one column, 'HOLIDAY_DATE', also datatype date. As the name suggests, this is a bunch of dates corresponding to past and future holidays.
I'm trying to find 'USAGE' on days that are not holidays, by comparing 'DAY' in the first table to 'HOLIDAY_DATE' in the second. My select statement so far is:
SELECT DAY, USAGE
FROM FIRST_TABLE, HOLIDAYS
WHERE FIRST_TABLE.DAY, NOT LIKE HOLIDAYS.HOLIDAY_DATE
GROUP BY DAY, USAGE
ORDER BY DAY;
But this still seems to bring up ALL the days, including holidays. Not quite sure where to go from here.
SELECT DAY, USAGE
FROM FIRST_TABLE
WHERE FIRST_TABLE.DAY NOT in (select HOLIDAY_DATE from HOLIDAYS)
GROUP BY DAY, USAGE
ORDER BY DAY
There may be other ways, but you can use a subquery.
SELECT DAY, USAGE
FROM FIRST_TABLE
WHERE DAY NOT IN(Select HOLIDAY_DATE FROM HOLIDAYS)
GROUP BY DAY, USAGE
ORDER BY DAY;
One more thing I want to add here: besides the logical issue, your query also has a syntax error (the stray comma):
WHERE FIRST_TABLE.DAY, NOT LIKE HOLIDAYS.HOLIDAY_DATE
                     ^
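A sketch of why the original join-based query returns every day, run on an in-memory SQLite database from Python rather than Oracle (an assumption for the sake of a runnable example; dates are stored as text here, and the sample rows are invented). Joining each day to every holiday and keeping the pairs that differ retains a holiday as long as it differs from some other holiday. The last part shows a commonly cited caveat of NOT IN: if the subquery returns a NULL, no rows come back, whereas NOT EXISTS is unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table FIRST_TABLE (DAY text, USAGE integer);
create table HOLIDAYS (HOLIDAY_DATE text);
insert into FIRST_TABLE values ('2024-12-24', 10), ('2024-12-25', 3), ('2024-12-26', 4);
insert into HOLIDAYS values ('2024-12-25'), ('2024-12-26');
""")

# Join-and-compare: each holiday still differs from the *other* holiday.
cross_join = conn.execute("""
    select DAY, USAGE
    from FIRST_TABLE, HOLIDAYS
    where FIRST_TABLE.DAY <> HOLIDAYS.HOLIDAY_DATE
    group by DAY, USAGE
    order by DAY
""").fetchall()

NOT_IN_SQL = """
    select DAY, USAGE
    from FIRST_TABLE
    where DAY not in (select HOLIDAY_DATE from HOLIDAYS)
    order by DAY
"""
NOT_EXISTS_SQL = """
    select f.DAY, f.USAGE
    from FIRST_TABLE f
    where not exists (select 1 from HOLIDAYS h where h.HOLIDAY_DATE = f.DAY)
    order by f.DAY
"""

not_in = conn.execute(NOT_IN_SQL).fetchall()

# Caveat: a NULL in the subquery makes NOT IN return nothing.
conn.execute("insert into HOLIDAYS values (null)")
not_in_with_null = conn.execute(NOT_IN_SQL).fetchall()
not_exists_with_null = conn.execute(NOT_EXISTS_SQL).fetchall()

print(cross_join)
print(not_in, not_in_with_null, not_exists_with_null)
```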

Querying SQLITE DB for Data from One Column Based On Another Column

I hope the title of this post makes sense.
The db in question has two columns that are related to my issue: a date column that follows the format xx/xx/xxxx and a price column. What I want to do is get a sum of the prices in the price column based on the month and year in which they occurred, but that data is in the other aforementioned column. Doing so will allow me to determine the total for a given month of a given year. The problem is I have no idea how to construct a query that would do what I need. I have done some reading on the web, but I'm not really sure how to go about this. Can anyone provide some advice/tips?
Thanks for your time!
Mike
I was able to find a solution using a LIKE clause:
SELECT sum(price) FROM purchases WHERE date LIKE '11%1234%'
The "11" could be any 2-digit month and the "1234" is any 4 digit year. The % sign acts as a wildcard. This query, for example, returns the sum of any prices that were from month 11 of year 1234 in the db.
Thanks for your input!
You cannot use the built-in date functions on these date values because you have stored them formatted for displaying instead of in one of the supported date formats.
If the month and day fields always have two digits, you can use substr:
SELECT substr(MyDate, 7, 4) AS Year,
substr(MyDate, 1, 2) AS Month,
sum(Price)
FROM Purchases
GROUP BY Year,
Month
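The substr approach can be tried out from Python with the built-in sqlite3 module. This is a sketch under the assumption that the stored format is mm/dd/yyyy with two-digit month and day; the sample rows are invented. It also shows strftime returning NULL for such strings, which is why the built-in date functions do not help here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Purchases (MyDate text, Price real)")
conn.executemany(
    "insert into Purchases values (?, ?)",
    [
        ("11/03/2012", 10.0),
        ("11/20/2012", 5.5),
        ("12/01/2012", 2.0),
        ("11/07/2013", 1.0),
    ],
)

# Year is characters 7-10, month is characters 1-2 (substr is 1-based).
totals = conn.execute("""
    SELECT substr(MyDate, 7, 4) AS Year,
           substr(MyDate, 1, 2) AS Month,
           sum(Price)
    FROM Purchases
    GROUP BY Year, Month
    ORDER BY Year, Month
""").fetchall()

# strftime cannot parse mm/dd/yyyy, so it yields NULL (None in Python).
unparsed = conn.execute("SELECT strftime('%m', '11/03/2012')").fetchone()[0]

print(totals)
print(unparsed)
```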
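So, the goal is to get an aggregate grouped by month and year? If the dates are stored in a format that SQLite's date functions understand (such as yyyy-mm-dd), then:
select strftime('%Y-%m', mydate), sum(price)
from mytable
group by strftime('%Y-%m', mydate)
Look into group by.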

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload to it with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then, I could issue a select statement and join on the max effective date
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
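For readers unfamiliar with the effective-date pattern, here is a minimal in-memory Python sketch of what the query above is meant to compute: keep only the row with the highest effdt for each id, then total the amounts per month. The field names mirror the query, but the sample records and the function name are invented, and this says nothing about how BigQuery would execute it.

```python
from collections import defaultdict

# Append-only records: a correction is a new row with a later effdt.
orders = [
    {"id": 1, "effdt": 1, "month": "2012-01", "amt": 1000},
    {"id": 1, "effdt": 2, "month": "2012-01", "amt": 1200},  # supersedes the row above
    {"id": 2, "effdt": 1, "month": "2012-01", "amt": 500},
    {"id": 3, "effdt": 1, "month": "2012-02", "amt": 700},
]


def monthly_sales(rows):
    """Sum amt (in cents) per month using only the latest row per id."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["effdt"] > current["effdt"]:
            latest[row["id"]] = row

    totals = defaultdict(int)
    for row in latest.values():
        totals[row["month"]] += row["amt"]

    # Convert cents to currency units, as sum(amt)/100 does in the query.
    return {month: cents / 100 for month, cents in sorted(totals.items())}


sales = monthly_sales(orders)
print(sales)
```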
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what the exact query you are trying to run is. It looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used FROM
[publicdata:samples.wikipedia] WHERE (contributor_username != '' AND
contributor_username IS NOT NULL) AND timestamp > 1133395200
AND timestamp < 1157068800 GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL AND timestamp > 1265073722 GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you need to run essentially an accurate COUNT DISTINCT), another option is to break this query up into two queries. The first query could provide the max effective date by month along with a count and save this result as a new table. Then you could run a sum query on the resulting table.
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.

PostgreSQL - GROUP BY timestamp values?

I've got a table with purchase orders stored in it. Each row has a timestamp indicating when the order was placed. I'd like to be able to create a report indicating the number of purchases each day, month, or year. I figured I would do a simple SELECT COUNT(xxx) FROM tbl_orders GROUP BY tbl_orders.purchase_time and get the value, but it turns out I can't GROUP BY a timestamp column.
Is there another way to accomplish this? I'd ideally like a flexible solution so I could use whatever timeframe I needed (hourly, monthly, weekly, etc.) Thanks for any suggestions you can give!
This does the trick without the date_trunc function (easier to read).
-- 2014
select created_on::DATE from users group by created_on::DATE
-- updated September 2018 (thanks to @wegry)
select created_on::DATE as co from users group by co
What we're doing here is casting the original value into a DATE, which renders the time portion of the value inconsequential.
Grouping by a timestamp column works fine for me here, keeping in mind that even a 1-microsecond difference will prevent two rows from being grouped together.
To group by larger time periods, group by an expression on the timestamp column that returns an appropriately truncated value. date_trunc can be useful here, as can to_char.
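The idea of grouping by a truncated value can be sketched outside the database as well. The Python below is only an illustration of what truncation does to each timestamp before grouping; the truncate helper is hypothetical (it is not date_trunc itself) and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime


def truncate(ts, unit):
    """Zero out every field of ts below the given unit (hypothetical helper)."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


purchase_times = [
    datetime(2018, 9, 1, 10, 15, 30),
    datetime(2018, 9, 1, 10, 45, 0),
    datetime(2018, 9, 1, 23, 0, 0),
    datetime(2018, 10, 2, 8, 0, 0),
]

# Same data, different bucket sizes: only the truncation unit changes.
per_hour = Counter(truncate(t, "hour") for t in purchase_times)
per_day = Counter(truncate(t, "day") for t in purchase_times)
per_month = Counter(truncate(t, "month") for t in purchase_times)

print(sorted(per_hour.items()))
print(sorted(per_day.items()))
print(sorted(per_month.items()))
```

Grouping on the raw timestamps would give four groups of one here, since no two values are identical.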