I am trying to write a PowerShell (5.1) script (I am also open to SQL suggestions, using SSMS with SQL Server 2014) to eliminate the manual effort of updating a very large data file exported in CSV format.
Here is what the raw data set that needs to be updated looks like:
Parent ID | Parent Owner | Sub ID | Sub Owner | Sub Hours
A | | aA | Rob Green | 0
A | | aB | Rob Green | 6
B | | aA | Jane Doe | 4
B | | aB | Jane Doe | 10
B | | aC | Bob Smith | 18
C | | cA | Jane Doe | 0
C | | cB | Jane Doe | 6
D | | dA | Bob Smith | 0
D | | dB | Bob Smith | 6
E | | dE | Joe Brown | 0
As you can see, Parent IDs can have one or many Sub Owners and Sub IDs.
My goal is to populate the Parent Owner field based on the following criteria:
For every Parent ID set:
1. If there is only one distinct Sub Owner, then that Sub Owner should be the Parent Owner for all corresponding Parent IDs.
2. If there is only one occurrence of a Parent ID, then that Sub Owner should be the Parent Owner for that Parent ID.
3. If there are multiple Sub Owners for the Parent ID, the Sub Owner with the highest summed Sub Hours should be the Parent Owner for every occurrence of that Parent ID.
To clarify, the criteria apply to the raw data above as follows:
Parent ID "A" falls under criterion 1
Parent ID "B" falls under criterion 3
Parent ID "C" falls under criterion 1
Parent ID "D" falls under criterion 1
Parent ID "E" falls under criterion 2
This is what I expect the data above to look like once completed:
Parent ID | Parent Owner | Sub ID | Sub Owner | Sub Hours
A | Rob Green | aA | Rob Green | 0
A | Rob Green | aB | Rob Green | 6
B | Bob Smith | aA | Jane Doe | 4
B | Bob Smith | aB | Jane Doe | 10
B | Bob Smith | aC | Bob Smith | 18
C | Jane Doe | cA | Jane Doe | 0
C | Jane Doe | cB | Jane Doe | 6
D | Bob Smith | dA | Bob Smith | 0
D | Bob Smith | dB | Bob Smith | 6
E | Joe Brown | dE | Joe Brown | 0
My biggest struggle is criterion 3. I cannot wrap my head around how to do this. Can anyone give me an idea of how I can get my expected output using PS or SQL?
Any help would be GREATLY appreciated!
I have vowed that SQL is strictly taboo for me. However, below is an example of a pure PowerShell solution (and I'm fairly sure it could be converted to SQL easily):
# mimic SQL output
$SqlOutput = @"
Parent ID|Parent Owner|Sub ID|Sub Owner|Sub Hours
A||aA|Rob Green|0
A||aB|Rob Green|6
B||aA|Jane Doe|4
B||aB|Jane Doe|10
B||aC|Bob Smith|18
C||cA|Jane Doe|0
C||cB|Jane Doe|6
D||dA|Bob Smith|0
D||dB|Bob Smith|6
E||dE|Joe Brown|0
"@ | ConvertFrom-Csv -Delimiter '|'
# compute an auxiliary variable
$SqlOutputGroups = $SqlOutput |
    Group-Object -Property 'Parent ID', 'Sub Owner' |
    ForEach-Object {
        New-Object psobject -Property @{
            'Parent ID' = ( $_.Name -split ', ')[0]
            'Sub Owner' = ( $_.Name -split ', ')[1]
            Hours       = ( $_.Group |
                    Measure-Object -Property 'Sub Hours' -Sum).Sum
        }
    }
# compute Criterium3 as a hashtable
$SqlOutputCriterium3 = @{}
$SqlOutputGroups | Group-Object -Property 'Parent ID' |
    ForEach-Object {
        $SqlOutputCriterium3[$_.Name] = ($_.Group |
                Sort-Object -Property Hours |
                Select-Object -Last 1).'Sub Owner'
    }
# apply Criterium3
$SqlOutput | ForEach-Object {
    $_.'Parent Owner' = $SqlOutputCriterium3.$($_.'Parent ID')
}
# show result in a table format
$SqlOutput | Format-Table -AutoSize
Output: D:\PShell\SO\45963820.ps1
Parent ID Parent Owner Sub ID Sub Owner Sub Hours
--------- ------------ ------ --------- ---------
A         Rob Green    aA     Rob Green 0
A         Rob Green    aB     Rob Green 6
B         Bob Smith    aA     Jane Doe  4
B         Bob Smith    aB     Jane Doe  10
B         Bob Smith    aC     Bob Smith 18
C         Jane Doe     cA     Jane Doe  0
C         Jane Doe     cB     Jane Doe  6
D         Bob Smith    dA     Bob Smith 0
D         Bob Smith    dB     Bob Smith 6
E         Joe Brown    dE     Joe Brown 0
Note that Criterium 3 also covers criteria 1 and 2. However, it does not suffice when two or more Sub Owners share the highest sum of Sub Hours for a particular Parent ID. For example, with B||aA|Jane Doe|8 instead of B||aA|Jane Doe|4 in the data above, Jane Doe and Bob Smith would both have a Sub Hours sum of 18 for Parent ID B.
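For comparison, the same selection rule, including the tie case, can be sketched in Python. This is only an illustration under assumptions: the function name pick_parent_owners, the tuple layout of the rows, and the alphabetical tie-break are hypothetical choices, not part of the PowerShell script.

```python
from collections import defaultdict

# Sample rows from the question: (parent_id, sub_id, sub_owner, sub_hours).
rows = [
    ("A", "aA", "Rob Green", 0),
    ("A", "aB", "Rob Green", 6),
    ("B", "aA", "Jane Doe", 4),
    ("B", "aB", "Jane Doe", 10),
    ("B", "aC", "Bob Smith", 18),
    ("C", "cA", "Jane Doe", 0),
    ("C", "cB", "Jane Doe", 6),
    ("D", "dA", "Bob Smith", 0),
    ("D", "dB", "Bob Smith", 6),
    ("E", "dE", "Joe Brown", 0),
]


def pick_parent_owners(rows):
    """Map each Parent ID to (owner, tied).

    tied is True when several Sub Owners share the highest summed hours.
    """
    # Sum Sub Hours per (Parent ID, Sub Owner).
    totals = defaultdict(lambda: defaultdict(int))
    for parent_id, _sub_id, sub_owner, hours in rows:
        totals[parent_id][sub_owner] += hours

    result = {}
    for parent_id, per_owner in totals.items():
        best = max(per_owner.values())
        leaders = sorted(o for o, h in per_owner.items() if h == best)
        # Assumed tie-break (not from the original): alphabetically first owner.
        result[parent_id] = (leaders[0], len(leaders) > 1)
    return result


owners = pick_parent_owners(rows)
for parent_id in sorted(owners):
    owner, tied = owners[parent_id]
    print(parent_id, owner, "(tie)" if tied else "")
```

Returning the tied flag alongside the owner leaves the decision about ties to the caller, who can review those Parent IDs manually instead of trusting an arbitrary pick.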
I have a column in a data frame that holds a full name as first name, middle name, and last name. Some records have no middle name, and I want to populate the middle name conditionally based on the available pattern, but I am not sure how to achieve this.
import pandas as pd
name_df = pd.read_csv(r"NameData1.txt",delimiter=",")
splitted_name=name_df.name.str.split(' ',expand=True).fillna('No Value')
##splited_name['middle_name']= splited_name.apply(lambda x : x[1] if x[2] != 'No Value' else '' )
name_df['Middle_name']=name_df.apply(lambda splited_name : splited_name[1] if splited_name[2] != 'No Value' else '')
name_df
I want to display the middle name only when it's there else the last name should be populated.
Sample records:
Id,name
1,TOM M SMITH
2,Gary SMITH
3,John C Doe
4,Hary Knox
5,Rakesh Vaidya
6,John Doe Doe
Use numpy.where to set the new columns by condition; missing values in the third part are detected with Series.notna:
import numpy as np
splitted_name = name_df.name.str.split(expand=True)
name_df['First_name'] = splitted_name[0]
name_df['Middle_name'] = np.where(splitted_name[2].notna(), splitted_name[1], '')
name_df['Last_name'] = np.where(splitted_name[2].notna(), splitted_name[2], splitted_name[1])
print(name_df)
   Id           name  First_name  Middle_name  Last_name
0   1    TOM M SMITH         TOM            M      SMITH
1   2     Gary SMITH        Gary                   SMITH
2   3     John C Doe        John            C        Doe
3   4      Hary Knox        Hary                    Knox
4   5  Rakesh Vaidya      Rakesh                  Vaidya
5   6   John Doe Doe        John          Doe        Doe
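The condition behind those columns can also be written as a plain Python function, which may make the rule easier to follow. This is a sketch under assumptions: the helper name split_name is hypothetical, and every name is assumed to have at least two whitespace-separated parts (tokens beyond the third are ignored, mirroring the column-based version).

```python
def split_name(full_name):
    """Return (first, middle, last) for a full name.

    With three or more parts, the second part is the middle name and the
    third is the last name. With two parts, the middle name is empty and
    the second part is the last name.
    """
    parts = full_name.split()
    first = parts[0]
    if len(parts) >= 3:
        middle, last = parts[1], parts[2]
    else:
        middle, last = "", parts[1]
    return first, middle, last


# Sample names from the question.
names = ["TOM M SMITH", "Gary SMITH", "John C Doe",
         "Hary Knox", "Rakesh Vaidya", "John Doe Doe"]
for name in names:
    print(split_name(name))
```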
"I want to display the middle name only when it's there, else the last name should be populated."
In that case, the value you want is always the second token of the name, so you can take it directly after str.split():
df['middle_or_last'] = df['name'].str.split().str[1]
print(df)
   Id           name  middle_or_last
0   1    TOM M SMITH               M
1   2     Gary SMITH           SMITH
2   3     John C Doe               C
3   4      Hary Knox            Knox
4   5  Rakesh Vaidya          Vaidya
5   6   John Doe Doe             Doe
Let's say there is a table of medical records. Each visit has a unique ID but is made up of several rows corresponding to various codes/services rendered for the visit.
For example, there could be 3 rows with claimID "John" for each unique procedure code "123", "456", and "789"; 15 rows for "Jane" with codes; 6 rows for "David"...
ID Code
John 123
John 456
John 789
Jane 123
Jane 456
Jane 789
Jane 321
Jane 654
David 123
David 456
David 789
David 987
I have a list of 50 unique procedure codes and want to return the entire set of claim lines (i.e. all rows of "John") where any of these 50 codes has been billed together with a different code from the list, but not only with itself ("123" with "321", but not "123" with "123"). If "123" is in my list of 50 but "456" and "789" are not, the query should not return the set of "John" claims, since only one code of my 50 is present. I hope this makes sense.
Positive Result Codes
123
321
987
The query should return all 5 Jane rows (123 and 321) and all 4 David rows (123 & 987).
ID Code
Jane 123
Jane 456
Jane 789
Jane 321
Jane 654
David 123
David 456
David 789
David 987
Try this code:
;WITH Visits AS (
    SELECT claimID, COUNT(DISTINCT Code) AS CNT
    FROM tbl_Visits
    WHERE Code IN (123, 321, 987)
    GROUP BY claimID
    HAVING COUNT(DISTINCT Code) > 1
)
SELECT *
FROM tbl_Visits
WHERE claimID IN (SELECT claimID FROM Visits);
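To check the approach against the sample data, the same pattern can be run in an in-memory SQLite database from Python. This is a sketch under assumptions: it uses SQLite rather than the asker's database, the table and column names tbl_Visits, claimID, and Code are taken from the query above, and the ORDER BY rowid is added only to keep the insertion order.

```python
import sqlite3

# Sample claim lines from the question.
visits = [
    ("John", 123), ("John", 456), ("John", 789),
    ("Jane", 123), ("Jane", 456), ("Jane", 789), ("Jane", 321), ("Jane", 654),
    ("David", 123), ("David", 456), ("David", 789), ("David", 987),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_Visits (claimID TEXT, Code INTEGER)")
conn.executemany("INSERT INTO tbl_Visits VALUES (?, ?)", visits)

# Keep claims that contain at least two DISTINCT codes from the list,
# then return every row of those claims.
query = """
WITH Visits AS (
    SELECT claimID
    FROM tbl_Visits
    WHERE Code IN (123, 321, 987)
    GROUP BY claimID
    HAVING COUNT(DISTINCT Code) > 1
)
SELECT claimID, Code
FROM tbl_Visits
WHERE claimID IN (SELECT claimID FROM Visits)
ORDER BY rowid
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

COUNT(DISTINCT Code) is what rules out a code being matched only with itself: a claim that bills "123" twice still counts as one distinct code.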
I want to sort a table with 3 columns (time, source, recipient) so that messages are grouped by the pair of people communicating. Within each pair of source and recipient, regardless of direction, rows should be listed by time. The goal is to see the communication between the same people ordered by time. An example:
time|source|recipient
1 paul amy
2 amy paul
3 amy paul
5 paul jane
8 amy paul
9 jane paul
10 paul amy
11 paul jane
The end result would look like this:
1 paul amy
2 amy paul
3 amy paul
8 amy paul
10 paul amy
5 paul jane
9 jane paul
11 paul jane
Your question is a bit vague. My educated guess is you want this:
SELECT *
FROM tbl
ORDER BY GREATEST(source, recipient), LEAST(source, recipient), "time";
The manual about GREATEST and LEAST.
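The same ordering idea can be sketched in Python, where max and min on the two names play the role of GREATEST and LEAST. This is an illustration only; the variable names messages and ordered are assumptions, and Python's string comparison stands in for the database collation.

```python
# Sample rows from the question: (time, source, recipient).
messages = [
    (1, "paul", "amy"),
    (2, "amy", "paul"),
    (3, "amy", "paul"),
    (5, "paul", "jane"),
    (8, "amy", "paul"),
    (9, "jane", "paul"),
    (10, "paul", "amy"),
    (11, "paul", "jane"),
]

# Normalize each pair so direction does not matter, then sort by time
# within the pair.
ordered = sorted(
    messages,
    key=lambda m: (max(m[1], m[2]), min(m[1], m[2]), m[0]),
)

for time, source, recipient in ordered:
    print(time, source, recipient)
```

Sorting on the (larger name, smaller name) pair puts "paul to amy" and "amy to paul" under the same key, which is what groups a conversation together.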
I am trying to create a view that records the selected attributes for all Computer Science majors.
This is my query to create a view:
DROP VIEW CS_grade_report;
CREATE VIEW CS_grade_report AS
SELECT Student.student_id AS "ID",
student_name AS "Name",
course_number AS "Course #",
credit AS "Credit",
grade AS Grade
FROM Student, Class, Enrolls
WHERE major = 'CSCI'
AND Student.student_id = Enrolls.student_id
AND Class.schedule_num = Enrolls.schedule_num;
SELECT *
FROM CS_grade_report;
And this is what is generated:
ID     Name                      Course #     Credit GR
------ ------------------------- -------- ---------- --
600000 John Smith                CSCI3200          4 B+
600000 John Smith                CSCI3700          3 C
600000 John Smith                SPAN1004          3 A-
600000 John Smith                CSCI4300          3 A+
600001 Andrew Tram               MUSC2406          2 A+
600001 Andrew Tram               SPAN1004          3 A
600001 Andrew Tram               CSCI3700          3 B-
600002 Jane Doe                  CSCI4200          3 D+
600003 Michael Jordan            CSCI4300          3 A+
600004 Tiger Woods               MUSC1000          1 A
600007 Dominique Davis           CSCI4300          3 F
ID     Name                      Course #     Credit GR
------ ------------------------- -------- ---------- --
600009 Will Smith                CSCI3200          4 A
600010 Papa Johns                CSCI3200          4 B
600011 John Doe                  CSCI3200          4 C
600012 Jackie Chan               CSCI3200          4 D
600013 Some Guy                  CSCI3200          4 E
16 rows selected.
I am assuming this is output from sqlplus. There is a "pagesize" option that sets how many lines make up a page, and the column headings are repeated at the start of each page. If you only want to see one heading, set the size to a large enough value before running your SELECT statement, like this:
set pagesize 500
(or whatever size you want)
There are many command options for sqlplus. This link is a good cheat-sheet.
Can someone please lend a hand with this query? I've been fooling with LIMIT or TOP, but I think I'm off track. I want to return all fields from a table, but with a maximum of 3 rows per duplicated id in the result.
Table
id first last
===================
1 John Doe
1 John Doe
1 John Doe
1 John Doe
2 Mary Green
2 Mary Green
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
Desired Results (up to 3 ids)
id first last
====================
1 John Doe
1 John Doe
1 John Doe
2 Mary Green
2 Mary Green
3 Stacy Kirk
3 Stacy Kirk
3 Stacy Kirk
Thanks!
Since you mentioned TOP, this is for SQL Server:
SELECT id, first, last
FROM
(
    SELECT id, first, last,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY last) rn
    FROM TABLE1
) s
WHERE s.rn <= 3
SQLFiddle Demo (SQL Server)
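The idea behind ROW_NUMBER() with PARTITION BY, numbering rows within each id and keeping the first three, can be sketched in Python with a running counter. This is an illustration under assumptions: the function name cap_per_id and the tuple layout (id, first, last) are hypothetical, and rows are kept in input order rather than ordered by last name.

```python
from collections import Counter

# Sample rows from the question: (id, first, last).
table = (
    [(1, "John", "Doe")] * 4
    + [(2, "Mary", "Green")] * 2
    + [(3, "Stacy", "Kirk")] * 5
)


def cap_per_id(rows, limit=3):
    """Keep at most `limit` rows for each id, preserving input order."""
    seen = Counter()
    kept = []
    for row in rows:
        row_id = row[0]
        if seen[row_id] < limit:  # plays the role of rn <= limit
            kept.append(row)
            seen[row_id] += 1
    return kept


capped = cap_per_id(table)
for row in capped:
    print(row)
```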