This is in relation to my survey application for our team. I have 3 tables in my database related to this problem.
I apologize if the database is not fully normalized.
TBL_CHURCH columns:
1 FAM_CHURCH_SACRMNT_NUM (Primary Key) Int(15)
2 RSPONDNT_NUM
3 SURVYR_NUM
4 QN_NUMBER
5 CHRCHFAMLY_NAME
6 CHRCHFAMLY_ISBAPTIZED
Sample row based on order of columns above:
1 2 3 4 5 6
6422164 76826499 5712 362 Serio Tecson Jr. Yes
TBL_INTRVW columns:
1 QN_NUMBR (Primary Key)
2 SURVYR_NUM
3 ZONE_NUM
4 RSPONDNT_NUM
Sample row based on order of columns above:
1 2 3 4
362 5712 11 76826499
TBL_AREA columns:
1 BRGY_ZONE_NUM (Primary Key)
2 BRGY_CODE
Sample row based on order of columns above:
1 2
11 2A
21 2A
31 2A
The field CHRCHFAMLY_ISBAPTIZED has only two values: "Yes" or "No". Each row in TBL_CHURCH has a QN_NUMBER value that references QN_NUMBR in TBL_INTRVW. Each QN_NUMBR in TBL_INTRVW has a single ZONE_NUM that references TBL_AREA, and that ZONE_NUM has a corresponding BRGY_CODE. Each BRGY_CODE has at least 2 ZONE_NUM values.
My problem is that I want to count the number of people baptized in a given area.
The output more or less should look like this:
(The output is collected from the 3 different ZONE_NUM)
Zone Name Num of People Baptized
2A 20
I'm having trouble deciding what to use in my SQL statements. Should I use a WHERE within an INNER JOIN? And how do I go about writing my SELECT statements?
SELECT c.BRGY_ZONE_NUM, COUNT(a.CHRCHFAMLY_ISBAPTIZED) AS [Num of People Baptized]
FROM TBL_CHURCH a
LEFT JOIN TBL_INTRVW b
    ON a.QN_NUMBER = b.QN_NUMBR
LEFT JOIN TBL_AREA c
    ON b.ZONE_NUM = c.BRGY_ZONE_NUM
WHERE a.CHRCHFAMLY_ISBAPTIZED = 'Yes'
GROUP BY c.BRGY_ZONE_NUM
I don't see a Zone Name column in any of the three tables, so I used BRGY_ZONE_NUM.
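Since the question's expected output is keyed by BRGY_CODE (for example 2A) rather than by zone, the same joins can be grouped on BRGY_CODE instead. The following is a minimal sketch using Python's built-in sqlite3 module, not the asker's actual database: the column names come from the question, but every row other than the first interview, the person names, and the second barangay code 3B are made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBL_AREA (BRGY_ZONE_NUM INTEGER PRIMARY KEY, BRGY_CODE TEXT);
CREATE TABLE TBL_INTRVW (QN_NUMBR INTEGER PRIMARY KEY, SURVYR_NUM INTEGER,
                         ZONE_NUM INTEGER, RSPONDNT_NUM INTEGER);
CREATE TABLE TBL_CHURCH (FAM_CHURCH_SACRMNT_NUM INTEGER PRIMARY KEY,
                         RSPONDNT_NUM INTEGER, SURVYR_NUM INTEGER,
                         QN_NUMBER INTEGER, CHRCHFAMLY_NAME TEXT,
                         CHRCHFAMLY_ISBAPTIZED TEXT);
-- Hypothetical rows: zones 11, 21 and 31 belong to 2A, zone 41 to 3B.
INSERT INTO TBL_AREA VALUES (11, '2A'), (21, '2A'), (31, '2A'), (41, '3B');
INSERT INTO TBL_INTRVW VALUES (362, 5712, 11, 76826499),
                              (363, 5712, 21, 1),
                              (364, 5712, 41, 2);
INSERT INTO TBL_CHURCH VALUES
    (1, 76826499, 5712, 362, 'Person A', 'Yes'),
    (2, 76826499, 5712, 362, 'Person B', 'No'),
    (3, 1, 5712, 363, 'Person C', 'Yes'),
    (4, 2, 5712, 364, 'Person D', 'Yes');
""")

# Count baptized family members per barangay code, across all of its zones.
rows = conn.execute("""
    SELECT a.BRGY_CODE, COUNT(*) AS num_baptized
    FROM TBL_CHURCH c
    JOIN TBL_INTRVW i ON c.QN_NUMBER = i.QN_NUMBR
    JOIN TBL_AREA a ON i.ZONE_NUM = a.BRGY_ZONE_NUM
    WHERE c.CHRCHFAMLY_ISBAPTIZED = 'Yes'
    GROUP BY a.BRGY_CODE
    ORDER BY a.BRGY_CODE
""").fetchall()

print(rows)  # [('2A', 2), ('3B', 1)]
```

An inner join is used here because a church row without a matching interview or zone cannot be assigned to any area; with the LEFT JOIN version such rows would be counted under a NULL group.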
I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.
I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.
Please disregard syntax for the cur_count and prev_count.
cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);
left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;
Below is some sample data:
key value
1 12
1 34
1 11
1 45
2 4
3 34
3 34
3 23
4 15
4 19
What my script does so far: it counts the number of values in each group of keys
key count
1 4
2 1
3 3
4 2
I would like to find the difference in number of values between a key and the previous key
key difference
2 -3
3 2
4 -1
cur_count and prev_count are relations and cannot be used as scalars the way you are using them. You can achieve the desired output with the script below. After joining the relations on (key-1), use the columns of the joined relation to compute the difference.
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;
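To sanity-check what the script above should produce, here is the same logic as a small Python sketch (not Pig): count the values per key, then subtract the count of the previous key. The function name count_differences is only illustrative, and the rows are the sample data from the question.

```python
from collections import Counter

# Sample (key, value) rows from the question.
rows = [(1, 12), (1, 34), (1, 11), (1, 45), (2, 4),
        (3, 34), (3, 34), (3, 23), (4, 15), (4, 19)]


def count_differences(rows):
    """Return {key: count(key) - count(key - 1)} for keys whose predecessor exists."""
    counts = Counter(key for key, _ in rows)
    return {key: counts[key] - counts[key - 1]
            for key in sorted(counts) if key - 1 in counts}


differences = count_differences(rows)
print(differences)  # {2: -3, 3: 2, 4: -1}
```

Like the join on (key-1), this only reports keys whose immediate predecessor is present, so key 1 has no row in the result.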
Presume you have two groups grp1 and grp2 with the content you described earlier
key count
1 4
2 1
3 3
4 2
Note: I have not executed the Pig statements below.
-- Generate the Ranks for two relations
grp1 = rank grp1;
grp2 = rank grp2;
-- Increment rank by 1 for each record in grp2
grp2 = foreach grp2 generate ($0+1) as rank,key,count;
After these the two relations would look like below. Arranged them side by side for comparison.
Group 1 Group 2
Rank key count Rank key count
1 1 4 2 1 4
2 2 1 3 2 1
3 3 3 4 3 3
4 4 2 5 4 2
Join the two groups by RANK which would yield below output
Rank key count Rank key count
2 2 1 2 1 4
3 3 3 3 2 1
4 4 2 4 3 3
5 4 2
Now you can run another "foreach" statement that finds the difference in two count columns above.
result = FOREACH <<joined relation>> GENERATE $1 as key,($2-$5) as difference
My company sends folks to training. Based on projected new hires/transfers, I was asked to generate a report that estimates the number of seats we need in each course broken out by quarter.
Question: My question is two-fold:
1. What is the best way to represent a sequence of courses (i.e. prerequisites) in a relational DB?
2. How do I create the query(-ies) necessary to produce the following desired output:
Desired Output:
ID PersonnelID CourseID ProjectedStartDate ProjectedEndDate
1 1 1 1/14/2017 1/14/2017
2 2 1 2/17/2017 2/17/2017
3 2 2 2/18/2017 2/19/2017
4 2 3 2/20/2017 2/20/2017
5 3 49 1/18/2017 2/03/2017
6 …
Background Info: The courses are taken in-sequence: the first few courses are orientation courses for the company, and later courses are more specific to the employee's workrole. There are over 50 different courses, 40 different workroles and we're projecting ~1k new hires/transfers. Each work role must take a sequence of courses in a prescribed order, but I'm having trouble representing this ordering and subsequently writing the necessary query.
Existing Tables:
I have several tables that I've used to store the data: Personnel, LnkPersonnelToWorkroles, Workroles, LnkWorkrolesToCourses, and Courses (there are many others as well, but I omit them to keep this question narrowly scoped). Here's some notional data from these tables:
Personnel (These are the projected new hires and their estimated arrival date.)
ID DisplayName RequiredCompletionDate
1 Kristel Bump 10/1/2016
2 Shelton Franke 3/11/2017
3 Shaunda Launer 4/16/2017
4 Clarinda Kestler 3/13/2017
5 My Wimsatt 6/6/2017
6 Gillian Bramer 10/25/2016
7 ...
Workroles (These are the positions in the company)
ID Workrole
1 Manager
2 Secretary
3 Admin Asst.
4 ...
LnkPersonnelToWorkroles (Links projected new hires to their projected workrole)
ID PersonnelID WorkroleID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 ...
Courses (All courses available)
ID CourseName LengthInDays
1 Orientation 1
2 Email Etiquette 2
3 Workplace Safety 1
4 ...
LnkWorkrolesToCourses
(Links workroles to their required courses in a Many-to-Many relationship)
ID WorkroleID CourseID
1 1 1
2 2 1
3 2 2
4 2 3
5 3 49
6 ...
Thoughts: My approach is to first develop a person-by-person schedule based upon the new hire's target completion date and workrole. Then for each class, I could sum the number of new hires starting in that quarter.
I've considered trying to represent the courses in the most general way I could think of (i.e. using a directed acyclic graph), but since most of the courses have only a single prerequisite course, I think it's much easier to represent the prerequisites using the Prerequisites table below; however, I don't know how I would use this in a query.
Prerequisites (Is this a good idea?)
ID CourseID PrereqCourseID
1 2 1
2 3 1
3 4 1
4 5 4
5 ...
Note: I am not currently concerned with whether or not the courses are actually offered on those days; we will figure out the course schedules once we know approximately how many we need each quarter. Right now, we're trying to estimate the demand for each course.
Edit 1: To clarify the Desired Output table: if the person begins course 1 on day D, then they can't start course 2 until after they finish course 1, i.e. until the next day. For a course with a length of L > 1 days, the start date of the subsequent course is delayed by L days. Notice this effect playing out for workrole ID 2 in the Desired Output table: he is expected to arrive on 2/17, start and complete course 1 the same day, begin course 2 the next day (on 2/18), and finish course 2 the day after that (on 2/19).
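The date rule described in Edit 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a query against the tables above; the function name schedule and its arguments are hypothetical, and the course lengths are the ones listed in the Courses table.

```python
from datetime import date, timedelta


def schedule(arrival, course_lengths):
    """Lay out courses back to back, starting on the arrival date.

    course_lengths: list of (course_id, length_in_days) in the order taken.
    Returns a list of (course_id, start_date, end_date).
    """
    plan = []
    start = arrival
    for course_id, length in course_lengths:
        # A 1-day course ends on the day it starts.
        end = start + timedelta(days=length - 1)
        plan.append((course_id, start, end))
        # The next course starts the day after this one ends.
        start = end + timedelta(days=1)
    return plan


# Courses 1, 2, 3 with lengths 1, 2, 1 days, arrival on 2/17/2017.
plan = schedule(date(2017, 2, 17), [(1, 1), (2, 2), (3, 1)])
for course_id, start, end in plan:
    print(course_id, start.isoformat(), end.isoformat())
```

With an arrival date of 2/17/2017 this yields the three rows shown for PersonnelID 2 in the Desired Output table.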
I'm posting this answer because it gives me an approximate solution; other answers are still welcome.
I avoided a prerequisite table altogether and opted for a simpler approach: a partial ordering of the courses.
First, I drew the course prerequisite tree; it looked similar to this image:
I defined a partial ordering of the courses based on their depth in the prerequisite tree. In the picture above, CHM124 and High School Chem w/ Lab are priority 1, CHM152 is priority 2, CHM 153 is priority 3, CHM260 and CHM 270 are priority 4, and so on... This partial ordering was stored in the CoursePriority table:
CoursePriority:
ID CourseID Priority
1 1 1
2 2 2
3 3 3
4 4 3
5 5 4
6 6 3
7 ...
So that no two courses would ever be taken at the same time, I perturbed each course's priority by a small random number using the following Update query:
UPDATE CoursePriority SET CoursePriority.Priority = [Priority]+Rnd([ID])/1000;
(I used [ID] as input to the Rnd method to ensure each course was perturbed by a different random number.) I ended up with this:
ID CourseID Priority
1 1 1.000005623
2 2 2.000094955
3 3 3.000036401
4 4 3.000052486
5 5 4.000076711
6 6 3.00000535
7 ...
The approach above answers my first question "What is the best [sensible] way to represent a sequence of courses (i.e. prerequisites) in a relational DB?" Now as for generating the course schedule...
First, I created a query qryLnkCoursePriorities to link Courses to the CoursePriority table:
SELECT Courses.ID AS CourseID, Courses.DurationInDays, CoursePriority.Priority
FROM Courses INNER JOIN CoursePriority ON Courses.ID = CoursePriority.CourseID;
Result:
CourseID DurationInDays Priority
1 35 1.000076177
2 21 2.000148297
3 28 3.000094352
4 14 3.000081442
5...
Second, I created the qryWorkrolePriorityDelay query:
SELECT LnkWorkrolesToCourses.WorkroleID, qryLnkCoursePriorities.CourseID AS CourseID, qryLnkCoursePriorities.Priority, qryLnkCoursePriorities.DurationInDays, ([DurationInDays]+Nz(DSum("DurationInDays","qryLnkCoursePriorities","[Priority]>" & [Priority] & ""))) AS LeadTimeInDays
FROM LnkWorkrolesToCourses INNER JOIN qryLnkCoursePriorities ON LnkWorkrolesToCourses.CourseID = qryLnkCoursePriorities.CourseID
ORDER BY LnkWorkrolesToCourses.WorkroleID, qryLnkCoursePriorities.Priority;
Simply put: The qryWorkrolePriorityDelay query tells me how many days in advance each course should be taken to ensure the new hire can complete all subsequent courses prior to their required training completion deadline. It looks like this:
WorkroleID CourseID Priority DurationInDays LeadTimeInDays
1 7 1.000060646 7 147
1 1 1.000076177 35 140
1 2 2.000148297 21 105
1 4 3.000081442 14 84
1 6 3.000082824 14 70
1 3 3.000094352 28 56
1 5 4.000106905 28 28
2...
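The LeadTimeInDays column can be cross-checked outside Access. The sketch below is a Python restatement of the rule, under the assumption that only the courses of a single workrole are summed: a course's lead time is its own duration plus the durations of all courses with a higher priority value. The function name lead_times is hypothetical; the data is the workrole 1 block from the table above.

```python
def lead_times(courses):
    """courses: list of (course_id, priority, duration_in_days) for one workrole.

    Returns {course_id: lead_time_in_days}, where the lead time is the course's
    own duration plus the durations of all courses with a strictly higher
    priority value (i.e. the courses taken after it).
    """
    return {
        course_id: duration + sum(d for _, p, d in courses if p > priority)
        for course_id, priority, duration in courses
    }


workrole_1 = [
    (7, 1.000060646, 7), (1, 1.000076177, 35), (2, 2.000148297, 21),
    (4, 3.000081442, 14), (6, 3.000082824, 14), (3, 3.000094352, 28),
    (5, 4.000106905, 28),
]

result = lead_times(workrole_1)
print(result)
```

For workrole 1 this reproduces the values 147, 140, 105, 84, 70, 56 and 28 from the table. It also shows why the random perturbation matters: two courses with exactly equal priorities would get overlapping schedules, because neither would count the other's duration.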
Finally, I was able to bring this all together to create the qryCourseSchedule query:
SELECT Personnel.ID AS PersonnelID, LnkWorkrolesToCourses.CourseID, [ProjectedHireDate]-[leadTimeInDays] AS ProjectedStartDate, [ProjectedHireDate]-[leadTimeInDays]+[Courses].[DurationInDays] AS ProjectedEndDate
FROM Personnel INNER JOIN (((LnkWorkrolesToCourses INNER JOIN (Courses INNER JOIN qryWorkrolePriorityDelay ON Courses.ID = qryWorkrolePriorityDelay.CourseID) ON (Courses.ID = LnkWorkrolesToCourses.CourseID) AND (LnkWorkrolesToCourses.WorkroleID = qryWorkrolePriorityDelay.WorkroleID)) INNER JOIN LnkPersonnelToWorkroles ON LnkWorkrolesToCourses.WorkroleID = LnkPersonnelToWorkroles.WorkroleID) INNER JOIN CoursePriority ON Courses.ID = CoursePriority.CourseID) ON Personnel.ID = LnkPersonnelToWorkroles.PersonnelID
ORDER BY Personnel.ID, [ProjectedHireDate]-[leadTimeInDays]+[Courses].[DurationInDays];
This query gives me the following output:
PersonnelID CourseID ProjectedStartDate ProjectedEndDate
1 7 5/7/2016 5/14/2016
1 1 5/14/2016 6/18/2016
1 2 6/18/2016 7/9/2016
1 4 7/9/2016 7/23/2016
1 6 7/23/2016 8/6/2016
1 3 8/6/2016 9/3/2016
1 5 9/3/2016 10/1/2016
2...
With this output, I created a pivot table, where course start dates were grouped by quarter and counted. This gave me exactly what I needed.
I have a Benchmarking table like this:
BMID TestID BMTitle ConnectedTestID
---------------------------------------------------
1 5 My BM1 0
2 6 My BM2 5
3 7 My BM3 5,6
4 8 My BM4 10,12,8
5 9 My BM5 0
6 10 My BM6 3,6
7 5 My BM7 8,3,12,9
8 3 My BM8 7,10
9 8 My BM9 0
10 12 My BM10 9
---------------------------------------------
To explain the table a little:
The TestID and ConnectedTestID columns are the ones that matter here. If the user wants all the benchmarks for TestID 3,
the query should return the rows where TestID = 3 and also any rows whose ConnectedTestID column contains that TestID among its comma-separated values.
That means if the user specifies the value 3 as the TestID, it should return:
---------------------------------------------
8 3 My BM8 7,10
7 5 My BM7 8,3,12,9
6 10 My BM6 3,6
--------------------------------------------
I hope it's clear why those 3 rows are returned: the first row is there because its TestID is 3, and the other two rows because 3 is in their ConnectedTestID cell.
You should fix the data structure. Storing numeric ids in a comma-delimited list is a bad, bad, bad idea:
- SQL Server doesn't have the best string manipulation functions.
- Storing numbers as character strings is a bad idea.
- Having undeclared foreign key relationships is a bad idea.
- The resulting queries cannot make use of indexes.
While you are exploring what a junction table is so you can fix the problem with the data structure, you can use a query such as this (assuming the table is named Benchmarking):
select *
from Benchmarking
where testid = 3 or
      ',' + ConnectedTestID + ',' like '%,3,%'
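The following sketch tries out both ideas with Python's built-in sqlite3 module, so it is an approximation rather than SQL Server code: SQLite concatenates strings with || where SQL Server uses +. The table name Benchmarking is taken from the question's wording, the junction table BenchmarkConnectedTest is a hypothetical name, and treating the value 0 as "no connected test" is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Benchmarking "
             "(BMID INTEGER, TestID INTEGER, BMTitle TEXT, ConnectedTestID TEXT)")
conn.executemany("INSERT INTO Benchmarking VALUES (?, ?, ?, ?)", [
    (1, 5, "My BM1", "0"), (2, 6, "My BM2", "5"), (3, 7, "My BM3", "5,6"),
    (4, 8, "My BM4", "10,12,8"), (5, 9, "My BM5", "0"), (6, 10, "My BM6", "3,6"),
    (7, 5, "My BM7", "8,3,12,9"), (8, 3, "My BM8", "7,10"),
    (9, 8, "My BM9", "0"), (10, 12, "My BM10", "9"),
])


def benchmarks_like(test_id):
    """Stop-gap: wrap the list in commas so ',3,' cannot match inside '13' or '30'."""
    pattern = f"%,{int(test_id)},%"
    rows = conn.execute(
        "SELECT BMID FROM Benchmarking "
        "WHERE TestID = ? OR ',' || ConnectedTestID || ',' LIKE ? "
        "ORDER BY BMID", (test_id, pattern))
    return [bmid for (bmid,) in rows]


# Proper fix: one row per (benchmark, connected test) in a junction table.
conn.execute("CREATE TABLE BenchmarkConnectedTest (BMID INTEGER, ConnectedTestID INTEGER)")
for bmid, csv in conn.execute("SELECT BMID, ConnectedTestID FROM Benchmarking").fetchall():
    for part in csv.split(","):
        if part != "0":  # assumption: 0 means "no connected test"
            conn.execute("INSERT INTO BenchmarkConnectedTest VALUES (?, ?)",
                         (bmid, int(part)))


def benchmarks_junction(test_id):
    """Same result via the junction table, using plain integer comparisons."""
    rows = conn.execute(
        "SELECT DISTINCT b.BMID FROM Benchmarking b "
        "LEFT JOIN BenchmarkConnectedTest c ON c.BMID = b.BMID "
        "WHERE b.TestID = ? OR c.ConnectedTestID = ? "
        "ORDER BY b.BMID", (test_id, test_id))
    return [bmid for (bmid,) in rows]


print(benchmarks_like(3), benchmarks_junction(3))  # [6, 7, 8] [6, 7, 8]
```

Both functions return BMIDs 6, 7 and 8 for TestID 3, matching the rows the question expects; only the junction-table version can be backed by an index and a declared foreign key.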
I am new to SQL Server and need help with one of my SQL queries.
I have 2 tables (Rating and LikeDislike).
I am trying to get data from both of these tables using a LEFT JOIN like this:
SELECT distinct LD.TopicID, R.ID, R.Topic, R.CountLikes, R.CountDisLikes, LD.UserName, LD.Clikes
FROM Rating As R
LEFT JOIN LikeDislike AS LD on LD.TopicID = R.ID
The above SELECT statement displays results fine but also includes duplicates. I want to remove duplicates when the data is displayed. I tried using DISTINCT and GROUP BY, but with no luck, maybe because I am not using them correctly.
To be more clear and less confusing let me tell you what exactly each table does and what I am trying to achieve.
The Rating table has the following columns (ID, Topic, CountLikes, CountDisLikes, Extra, CreatedByUser). It stores topic information, the number of likes and dislikes for each topic, and the UserID of the user who created that topic.
Rating table with sample data
ID Topic CountLikes CountDisLikes Extra CreatedByUser
1 Do You Like This 211 58 YesId 2
2 Or This 17 25 This also 3
79 Testing at home 1 0 Testing at home 2
80 Testing at home again 1 0 Testing 2
82 testing dislikes 0 1 Testing 2
76 Testing part 3 7 5 Testing 3 4
77 Testing part 4 16 6 Testing 4 5
The LikeDisLike table has following columns (ID, TopicID, UserName, Clikes). TopicID is a FK to the ID column in Rating table.
LikeDislike table with sample data
ID TopicID UserName Clikes
213 77 2 TRUE
214 76 2 FALSE
215 77 5 TRUE
194 77 3 TRUE
195 76 3 FALSE
196 2 3 TRUE
197 1 3 FALSE
Now what I am trying to do is get information from both of these tables without duplicate rows. I need all the columns from the Rating table plus the UserName and Clikes columns from the LikeDislike table, without any duplicate rows.
Below are the results with duplicates
TopicID ID Topic CountLikes CountDislikes UserName Clikes
NULL 79 Testing at home 1 0 NULL NULL
NULL 80 Testing at home2 1 0 NULL NULL
NULL 82 testing dislikes 0 1 NULL NULL
1 1 Do You Like This 211 58 3 FALSE
2 2 Or This 17 25 3 TRUE
76 76 Testing part 3 7 5 2 FALSE
76 76 Testing part 3 7 5 3 FALSE
77 77 Testing part 4 16 6 2 TRUE
77 77 Testing part 4 16 6 3 TRUE
77 77 Testing part 4 16 6 5 TRUE
Just like in yesterday's post, I don't think you understand what DISTINCT is supposed to return. Because the matching rows in your LikeDislike table have different values, every row you are returning is already DISTINCT.
Let's take TopicId 77, for instance. It returns 3 DISTINCT rows because you have 3 matching records in your LikeDislike table. If your desired output is a single row where the UserName and Clikes values are comma-delimited, that is possible: look into using FOR XML and perhaps STUFF (here is a recent answer on the subject). Or, if you want to return only the first row that matches the TopicId, that is possible as well: look into using a subquery with ROW_NUMBER.
Please let us know your desired output and we can help provide a solution.
Good luck.
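As an illustration of the ROW_NUMBER option, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer); the SQL itself should carry over to SQL Server largely unchanged. Which row counts as "first" is an assumption: this example keeps the LikeDislike row with the lowest ID per topic. Only a subset of the sample columns and rows is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Rating (ID INTEGER PRIMARY KEY, Topic TEXT);
CREATE TABLE LikeDislike (ID INTEGER PRIMARY KEY, TopicID INTEGER,
                          UserName INTEGER, Clikes TEXT);
INSERT INTO Rating VALUES (1, 'Do You Like This'), (2, 'Or This'),
                          (76, 'Testing part 3'), (77, 'Testing part 4'),
                          (79, 'Testing at home');
INSERT INTO LikeDislike VALUES (213, 77, 2, 'TRUE'), (214, 76, 2, 'FALSE'),
                               (215, 77, 5, 'TRUE'), (194, 77, 3, 'TRUE'),
                               (195, 76, 3, 'FALSE'), (196, 2, 3, 'TRUE'),
                               (197, 1, 3, 'FALSE');
""")

# Number the LikeDislike rows within each topic, then join only row 1,
# so every Rating row appears exactly once.
rows = conn.execute("""
    SELECT R.ID, R.Topic, LD.UserName, LD.Clikes
    FROM Rating AS R
    LEFT JOIN (
        SELECT TopicID, UserName, Clikes,
               ROW_NUMBER() OVER (PARTITION BY TopicID ORDER BY ID) AS rn
        FROM LikeDislike
    ) AS LD ON LD.TopicID = R.ID AND LD.rn = 1
    ORDER BY R.ID
""").fetchall()

for row in rows:
    print(row)
```

Topics without any LikeDislike rows (such as 79) still appear once, with NULL in the UserName and Clikes columns, because the LEFT JOIN is kept.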
Given the following SQL tables:
Administrators:
id Name rating
1 Jeff 48
2 Albert 55
3 Ken 35
4 France 56
5 Samantha 52
6 Jeff 50
Meetings:
id originatorid Assistantid
1 3 5
2 6 3
3 1 2
4 6 4
I would like to generate a table from Ken's point of view (id=3); his id could be present in either of two different columns of the Meetings table. (The IN statement does not work here, since I need to check two different columns.)
Thus the output would be:
id originatorid Assistantid
1 3 5
2 6 3
If you really just need to see which column Ken's id is in, you only need an OR. The following will produce your example output exactly.
SELECT * FROM Meetings WHERE originatorid = 3 OR Assistantid = 3;
If you need to take the complex route and list names along with meetings, an OR in your join's ON clause should work here:
SELECT
Administrators.name,
Administrators.id,
Meetings.originatorid,
Meetings.Assistantid
FROM Administrators
JOIN Meetings
ON Administrators.id = Meetings.originatorid
OR Administrators.id = Meetings.Assistantid
WHERE Administrators.name = 'Ken';
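Both ideas can be tried out with Python's built-in sqlite3 module. This is a sketch using the sample data from the question, with the column spelled Assistantid as in the queries above; the function names and the second query, which joins Administrators twice to show the names of both participants, are additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Administrators (id INTEGER PRIMARY KEY, name TEXT, rating INTEGER);
CREATE TABLE Meetings (id INTEGER PRIMARY KEY, originatorid INTEGER,
                       Assistantid INTEGER);
INSERT INTO Administrators VALUES (1, 'Jeff', 48), (2, 'Albert', 55),
    (3, 'Ken', 35), (4, 'France', 56), (5, 'Samantha', 52), (6, 'Jeff', 50);
INSERT INTO Meetings VALUES (1, 3, 5), (2, 6, 3), (3, 1, 2), (4, 6, 4);
""")


def meetings_for(admin_id):
    """Meetings where the administrator is either originator or assistant."""
    return conn.execute(
        "SELECT id, originatorid, Assistantid FROM Meetings "
        "WHERE originatorid = :id OR Assistantid = :id ORDER BY id",
        {"id": admin_id}).fetchall()


def meetings_with_names(admin_id):
    """Same filter, with one join per role to resolve both names."""
    return conn.execute(
        "SELECT m.id, o.name, a.name FROM Meetings m "
        "JOIN Administrators o ON o.id = m.originatorid "
        "JOIN Administrators a ON a.id = m.Assistantid "
        "WHERE m.originatorid = :id OR m.Assistantid = :id ORDER BY m.id",
        {"id": admin_id}).fetchall()


print(meetings_for(3))         # [(1, 3, 5), (2, 6, 3)]
print(meetings_with_names(3))  # [(1, 'Ken', 'Samantha'), (2, 'Jeff', 'Ken')]
```

Filtering by id rather than by name avoids ambiguity when two administrators share a name, as the two rows named Jeff do in the sample data.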