Creating custom event schedules. Should I use "LIKE"? - sql

I'm creating a campaign event scheduler that allows for frequencies such as "Every Monday", "May 6th through 10th", "Every day except Sunday", etc.
I've come up with a solution that I believe will work fine (not yet implemented); however, it uses "LIKE" in the queries, which I've never been too fond of. If anyone has a suggestion that achieves the same result with a cleaner method, please share it!
+----------------------+
| Campaign Table |
+----------------------+
| id:int |
| event_id:foreign_key |
| start_at:datetime |
| end_at:datetime |
+----------------------+
+-----------------------------+
| Event Table |
+-----------------------------+
| id:int |
| valid_days_of_week:string | < * = ALL. 345 = Tue, Wed, Thur. etc.
| valid_weeks_of_month:string | < * = ALL. 25 = 2nd and 5th weeks of a month.
| valid_day_numbers:string | < * = ALL. L = last. 2,7,17,29 = 2nd day, 7th, 17th, 29th,. etc.
+-----------------------------+
A sample event schedule would look like this:
valid_days_of_week = '1357' (Sun, Tue, Thu, Sat)
valid_weeks_of_month = '*' (All weeks)
valid_day_numbers = ',1,2,5,6,8,9,25,30,'
Using today's date (6/25/15) as an example, we have the following information to query with:
Day of week: 5 (Thursday)
Week of month: 4 (4th week in June)
Day number: 25
Therefore, to fetch all of the events for today, the query would look something like this:
SELECT c.*
FROM campaigns AS c
LEFT JOIN events AS e
ON c.event_id = e.id
WHERE
( e.valid_days_of_week = '*' OR e.valid_days_of_week LIKE '%5%' )
AND ( e.valid_weeks_of_month = '*' OR e.valid_weeks_of_month LIKE '%4%' )
AND ( e.valid_day_numbers = '*' OR e.valid_day_numbers LIKE '%,25,%' )
That (untested) query would ideally return the example event above. The "LIKE" queries are what have me worried. I want these queries to be fast.
By the way, I'm using PostgreSQL.
Looking forward to excellent replies!

Use arrays:
CREATE TABLE events (id INT NOT NULL, dow INT[], wom INT[], dn INT[]);
CREATE INDEX ix_events_dow ON events USING GIN(dow);
CREATE INDEX ix_events_wom ON events USING GIN(wom);
CREATE INDEX ix_events_dn ON events USING GIN(dn);
(GIN has a built-in operator class for integer arrays; GiST would require the intarray extension.)
INSERT INTO events
VALUES (1, '{1,3,5,7}', '{0}', '{1,2,5,6,8,9,25,30}'); -- 0 means any
Then query:
SELECT *
FROM events
WHERE dow && '{0, 5}'::INT[]
AND wom && '{0, 4}'::INT[]
AND dn && '{0, 25}'::INT[]
This will allow using the indexes to filter the data.
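The `&&` (overlap) semantics with `0` as the wildcard can be sanity-checked in plain Python before committing to the schema (the `matches` helper and the dict layout below are just for illustration, not part of the answer):

```python
def matches(event, dow, wom, dn):
    # Mirrors the SQL && (overlap) tests: each event field is a set of
    # ints, with 0 standing in for the '*' wildcard ("any value").
    return bool(event["dow"] & {0, dow}) \
        and bool(event["wom"] & {0, wom}) \
        and bool(event["dn"] & {0, dn})

# The sample event: Sun/Tue/Thu/Sat, any week, specific day numbers.
event = {"dow": {1, 3, 5, 7}, "wom": {0}, "dn": {1, 2, 5, 6, 8, 9, 25, 30}}

print(matches(event, 5, 4, 25))  # Thursday, 4th week, the 25th -> True
print(matches(event, 2, 4, 25))  # Monday does not match -> False
```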

Related

how to generate unique weekid using weekofyear in hive

I have a table where I'm just iterating dates over 50 years, using the values of weekofyear("date") -> week_no_in_this_year.
I would like to create a column from week_no_in_this_year that is unique for a week; name it week_id.
It should be a concatenation of Year + two_digit_week_no_in_this_year + Some_number (to make this id unique for one week). I tried like below:
concat(concat(YEAR,IF(week_no_in_this_year<10,
concat(0,week_no_in_this_year),week_no_in_this_year)),'2') AS week_id.
But I'm facing issue for few dates for below scenario's:
SELECT weekofyear("2019-01-01") ;
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2019012
SELECT weekofyear("2019-12-31");
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2020012
One way to do it is with a UDF. Create a Python script and push it to HDFS:
mypy.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    (y, m, d) = line.split("-")
    iso = datetime.date(int(y), int(m), int(d)).isocalendar()
    print str(iso[0]) + str(iso[1])
In Hive
add file hdfs:/user/cloudera/mypy.py;
select transform("2019-1-1") using "python mypy.py" as (week_id);
INFO : OK
+----------+--+
| week_id |
+----------+--+
| 20191 |
+----------+--+
select transform("2019-12-30") using "python mypy.py" as (week_id)
+----------+--+
| week_id |
+----------+--+
| 20201 |
+----------+--+
1 row selected (33.413 seconds)
This scenario only happens at the boundary between years (that is, around Dec 31), when the week number rolls over to the next year. If we add a condition for this case, we get what you expect.
RIGHT(str, n) is the same as SUBSTR(str, -n).
SELECT DTE as Date,
CONCAT(IF(MONTH(DTE)=12 and WEEKOFYEAR(DTE)=1, year(DTE)+1, year(DTE)),
SUBSTR(CONCAT('0', WEEKOFYEAR(DTE)), -2), '2') as weekid
FROM tbl;
Result:
Date WeekId
2019-01-01 2019012
2019-11-01 2019442
2019-12-31 2020012

SQL dynamic column name

How do I declare a column name that changes?
I take some data from the DB and I am interested in the last 12 months, so I only take events that happened, let's say, in '2016-07', '2016-06' and so on...
Then, I want my table to look like this:
event type | 2016-07 | 2016-06
-------------------------------
A | 12 | 13
B | 21 | 44
C | 98 | 12
How can I achieve this effect, where the columns are named using the previous months' YYYY-MM pattern, keeping in mind that the report can be executed at any time, so the columns would change?
Simplified query only for prev month:
select distinct
count(event),
date_year_month,
event_name
from
data_base
where date_year_month = TO_CHAR(add_months(current_date, -1),'YYYY-MM')
group by event_name, date_year_month
I don't think there is an automated way of pivoting the year-month columns and changing the number of columns in the result dynamically based on the data.
However, if you are looking for a pivoting solution, you can accomplish this using table functions in Netezza:
select event_name, year_month, event_count
from event_counts_groupby_year_month, table(inza.inza.nzlua('
  local rows={}
  function processRow(y2016m06, y2016m07)
    rows[1] = { 201606, y2016m06 }
    rows[2] = { 201607, y2016m07 }
    return rows
  end
  function getShape()
    columns={}
    columns[1] = { "year_month", integer }
    columns[2] = { "event_count", double }
    return columns
  end',
  y2016m06, y2016m07));
You could probably build a wrapper on this to dynamically generate the query based on the year-month values present in the table, using a shell script.
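As a sketch of that wrapper idea (in Python rather than shell; `build_pivot_sql`, the column names, and the table name are hypothetical, taken from the question's simplified query):

```python
def build_pivot_sql(months, table="data_base"):
    # Build one conditional-count column per YYYY-MM value, then
    # group by event_name -- a manual pivot generated at runtime.
    cols = ",\n       ".join(
        'COUNT(CASE WHEN date_year_month = \'%s\' THEN event END) AS "%s"' % (m, m)
        for m in months
    )
    return "SELECT event_name,\n       %s\nFROM %s\nGROUP BY event_name;" % (cols, table)

print(build_pivot_sql(["2016-06", "2016-07"]))
```

The caller would first run a `SELECT DISTINCT date_year_month ...` to obtain the month list, so the generated column set tracks the data.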

SQL: SUM of MAX values WHERE date1 <= date2 returns "wrong" results

Hi stackoverflow users
I'm having a bit of a problem trying to combine SUM, MAX and WHERE in one query, and after an intense Google search (my search-engine skills usually don't fail me) you are my last hope to understand and fix the following issue.
My goal is to count people in a certain period of time, and because a person can visit more than once in said period, I'm using MAX. Since I'm recording people as male (m) or female (f) in a string (for statistical purposes), CHAR_LENGTH returns the counts I need.
SELECT SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id")
So far, so good. But now, as stated before, I'd like to only count the guests which visited in a certain time interval (for statistic purposes as well).
SELECT "statistic"."id", SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id"),
"statistic", "guests"
WHERE ( "guests"."arrival" <= "statistic"."from" AND "guests"."departure" >= "statistic"."to")
GROUP BY "statistic"."id"
This query returns the following, x = desired result:
x * (x+1)
So if the result should be 3, it's 12. If it should be 5, it's 30 etc.
I probably could solve this algebraic but I'd rather understand what I'm doing wrong and learn from it.
Thanks in advance and I'm certainly going to answer all further questions.
PS: I'm using LibreOffice Base.
EDIT: An example
guests table:
ID | arrival | departure | gender |
10 | 1.1.14 | 10.1.14 | mf |
10 | 15.1.14 | 17.1.14 | m |
11 | 5.1.14 | 6.1.14 | m |
12 | 10.2.14 | 24.2.14 | f |
13 | 27.2.14 | 28.2.14 | mmmmmf |
statistic table:
ID | from | to | name |
1 | 1.1.14 | 31.1.14 |January | expected result: 3
2 | 1.2.14 | 28.2.14 |February| expected result: 7
MAX(...) is the wrong function: You want COUNT(DISTINCT ...).
Add proper join syntax, simplify (and remove unnecessary quotes) and this should work:
SELECT s.id, COUNT(DISTINCT g.id) AS People
FROM statistic s
LEFT JOIN guests g ON g.arrival <= s."from" AND g.departure >= s."to"
GROUP BY s.id
Note: Using LEFT join means you'll get a result of zero for statistics ids that have no guests. If you would rather no row at all, remove the LEFT keyword.
You have a very strange data structure. In any case, I think you want:
SELECT s_id, SUM(numpersons) AS People
FROM (select s.id as s_id, g.id as g_id,
             max(char_length(g.gender)) as numpersons
      from guests g join
           statistic s
           on g.arrival <= s."from" AND g.departure >= s."to"
      group by s.id, g.id
     ) gs
GROUP BY s_id;
Thanks for all your inputs. I wasn't familiar with JOIN but it was necessary to solve my problem.
Since my database is designed in German, I made quite a big mistake while translating it, and I'm sorry if this caused confusion.
Selecting guests.id and later on grouping by guests.id wouldn't make any sense, since the id is unique. What I actually wanted to do is select and group by guests.adr_id, which links a visiting guest to an address table.
The correct solution to my problem is the following code:
SELECT statname, SUM(numpers) FROM (
    SELECT statistic.name AS statname, guests.adr_id, MAX(CHAR_LENGTH(guests.gender)) AS numpers
    FROM guests
    JOIN statistic ON (guests.arrival <= statistic."to" AND guests.departure >= statistic."from")
    GROUP BY guests.adr_id, statistic.name )
GROUP BY statname
I also noted that my database structure is a mess but I created it learning by doing and haven't found any time to rewrite it yet. Next time posting, I'll try better.
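The accepted approach can be reproduced end to end with SQLite's in-memory driver, using the sample data from the edit (here plain `id` stands in for `adr_id`, `LENGTH` for `CHAR_LENGTH`, and the join uses the overlap condition from the accepted solution):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE guests (id INT, arrival TEXT, departure TEXT, gender TEXT);
CREATE TABLE statistic (id INT, "from" TEXT, "to" TEXT, name TEXT);
INSERT INTO guests VALUES
  (10, '2014-01-01', '2014-01-10', 'mf'),
  (10, '2014-01-15', '2014-01-17', 'm'),
  (11, '2014-01-05', '2014-01-06', 'm'),
  (12, '2014-02-10', '2014-02-24', 'f'),
  (13, '2014-02-27', '2014-02-28', 'mmmmmf');
INSERT INTO statistic VALUES
  (1, '2014-01-01', '2014-01-31', 'January'),
  (2, '2014-02-01', '2014-02-28', 'February');
""")

rows = con.execute("""
    SELECT statname, SUM(numpers)
    FROM (SELECT s.name AS statname, g.id AS gid,
                 MAX(LENGTH(g.gender)) AS numpers
          FROM guests g
          JOIN statistic s
            ON g.arrival <= s."to" AND g.departure >= s."from"
          GROUP BY g.id, s.name)
    GROUP BY statname
""").fetchall()
# January -> 3, February -> 7, matching the expected results above.
print(dict(rows))
```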

SQL Views - Modify Returned Result

I'm a little stuck here. I'm trying to modify a returned view based on a condition. I'm fairly green at SQL and am having a bit of difficulty with the returned result. Here's a partial component of the view I wrote:
WITH A AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY fkidContract,fkidTemplateItem ORDER BY bStdActive DESC, dtdateplanned ASC) AS RANK,
tblWorkItems.fkidContract AS ContractNo,
....
FROM tblWorkItems
WHERE fkidTemplateItem IN
(2895,2905,2915,2907,2908,
2909,3047,2930,2923,2969,
2968,2919,2935,2936,2927,
2970,2979)
AND ...
)
SELECT * FROM A WHERE RANK = 1
The return result is similar to the following:
ContractNo| ItemNumber | Planned | Complete
001 | 100 | 01/01/1900 | 02/01/1900
001 | 101 | 03/04/1900 | 02/01/1901
001 | 102 | 03/06/1901 | 02/08/1900
002 | 100 | 01/03/1911 | 02/08/1913
This gives me the results I expect, but due to a nightmare Crystal report I need to alter this view slightly. I want to take this returned result set and modify an existing column with a value pulled from the same table and the same Contract relationship, something like the following:
UPDATE A
SET A.Completed = ( SELECT R.Completed
FROM myTable R
INNER JOIN A
ON A.ContractNo = R.ContractNo
WHERE A.ItemNumber = 100 AND R.ItemNumber = 101
)
What I'm trying to do is modify the "Completed Date" of one task and make it the complete date of another task if they both share the same ContractNo field value.
I'm not sure about the ItemNumber relationships between A and R (perhaps it was just for testing...), but it seems like you don't really want to UPDATE anything, but you want to use a different value under some circumstances. So, maybe you just want to change the non-cte part of your query to something like:
SELECT A.ContractNo, A.ItemNumber, A.Planned,
COALESCE(R.Completed,A.Completed) as Completed
FROM A
LEFT OUTER JOIN myTable R
ON A.ContractNo = R.ContractNo
AND A.ItemNumber = 100 AND R.ItemNumber = 101 -- I'm not sure about this part
WHERE A.Rank = 1
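A quick way to convince yourself of this COALESCE pattern is an in-memory SQLite session (table and column names follow the question; the data and dates are made up, and `A` is a plain table here rather than the CTE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (ContractNo TEXT, ItemNumber INT, Completed TEXT);
CREATE TABLE myTable (ContractNo TEXT, ItemNumber INT, Completed TEXT);
INSERT INTO A VALUES ('001', 100, '1900-01-02'), ('001', 101, '1901-02-01');
INSERT INTO myTable VALUES ('001', 101, '1901-02-01');
""")

rows = con.execute("""
    SELECT A.ContractNo, A.ItemNumber,
           COALESCE(R.Completed, A.Completed) AS Completed
    FROM A
    LEFT OUTER JOIN myTable R
      ON A.ContractNo = R.ContractNo
     AND A.ItemNumber = 100 AND R.ItemNumber = 101
    ORDER BY A.ItemNumber
""").fetchall()
# Item 100 picks up item 101's completion date via the join;
# item 101 falls back to its own Completed value.
print(rows)
```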
So it turns out that actually reading the vendor documentation helps :)
SELECT
    column1,
    column2 =
        case
            when date > 1999 then 'some value'
            when date < 1999 then 'other value'
            else 'back to the future'
        end
FROM ....
For reference, the total query did a triple inner join over ~5 million records and this case statement was surprisingly performant.
I suggest that this gets closed as a duplicate.

Using SQL to get the last item before n

I am not quite sure how to ask this so I will start off with an example. Let's say I have a table in my database that looks like this:
id | time | event | pnumber
---------------------------
1 | 1200 | foo | 23
2 | 1130 | bar | 52
3 | 1045 | bat | 13
...
n | 0 | baz | 7
Now say I wanted to get the last known pnumber after a certain time. For example at time = 1135, it would have to go back and find the last known time in the table (1130) and then return that pnumber. So for t = 1130, it would return pnumber = 52. But as soon as the t = 1045 it would return pnumber = 13. (Time counts down in this context from 1200 to 0).
Here's what I have so far.
SELECT pnumber FROM table WHERE time = (SELECT time FROM table WHERE time <= '1135' ORDER BY time LIMIT 1)
Is there an easier way to do this, without using multiple statements? I am using sqlite3.
Sure. You can condense that query by doing:
SELECT pnumber FROM table WHERE time >= 1135 ORDER BY time DESC LIMIT 1;
No need to nest the select to get a specific time first, this should work.
EDIT: Got the inequality sign mixed around -- if you're looking for the first record AFTER a specific time, you'll want time >= 1135 and order by time descending with a limit of one.
Why do you need the second query? Could you do something like this:
SELECT TOP 1 pnumber FROM table WHERE time >= '1135' ORDER BY TIME DESC
I'm a bit confused. You are asking that 1135 would return the value for 1130, yet you are using greater than or equal to instead of less than. If your example is what you are looking for, try this.
SELECT PNUMBER FROM TABLE WHERE TIME<=1135 ORDER BY TIME DESC LIMIT 1
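That last query matches the behavior the question describes; a minimal SQLite check (table renamed to `events`, since `table` is a reserved word):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INT, time INT, event TEXT, pnumber INT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
                [(1, 1200, 'foo', 23), (2, 1130, 'bar', 52), (3, 1045, 'bat', 13)])

def last_pnumber(t):
    # Largest time at or below t wins -- the "last known" row.
    row = con.execute(
        "SELECT pnumber FROM events WHERE time <= ? ORDER BY time DESC LIMIT 1",
        (t,)).fetchone()
    return row[0] if row else None

print(last_pnumber(1135))  # 52 (from the 1130 row)
print(last_pnumber(1045))  # 13
```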