How do I join or concat 2 dataframes where I get a new column for each row where the left_on/right_on key is the same? - pandas

Given 2 dataframes:
DF1
ID  | Name
123 | Jim
456 | Bob
DF2
record_id | model_year | make_desc | model_desc | vin
123       | 2008       | Chevy     | Tahoe      | cvin
456       | 2020       | Hyundai   | Elantra    | hvin
456       | 2018       | Ford      | F-150      | fvin
I want to merge/join/groupby (not sure which, really) such that the result is:
ID  | Name | model_year1 | make_desc1 | model_desc1 | vin1 | model_year2 | make_desc2 | model_desc2 | vin2
123 | Jim  | 2008        | Chevy      | Tahoe       | cvin | 2008        | Chevy      | Tahoe       | cvin
456 | Bob  | 2020        | Hyundai    | Elantra     | hvin | 2018        | Ford       | F-150       | fvin
So it's kind of like a join: I need to be able to join data on a value,
but when there are multiple matches I want to add columns instead of adding rows.
The number of matches can't be known up front, so it might need to add 10 columns.
I tried a horizontal concat, but it doesn't seem to match on value.
I have also read up a bunch on groupby, but I can't get it to work.
Any help would be appreciated.

Didn't find a straightforward way. Please try as explained and coded below:
import pandas as pd
df3 = pd.merge(df1, df2, how='left', on='ID')  # Merge the two dfs
df3 = df3.groupby(['ID', 'Name'])['JobCode'].unique().reset_index()  # JobCode to list
df3[['JobCode', 'JobCode_x']] = pd.DataFrame(df3['JobCode'].tolist(), index=df3.index)  # Create required columns
    ID Name JobCode JobCode_x
0  123  Jim     H1B      None
1  456  Bob     H1B       H2B
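The answer above refers to a JobCode column that does not appear in the question as shown. As a rough sketch for the vehicle columns in the question, one option is to number the matches per record_id with cumcount and then pivot them into columns. The names numbered, wide, fields, and max_n are made up for illustration, and the sample frames are retyped from the question. Note that with this approach a person with a single match (Jim) gets missing values in the second set of columns.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [123, 456], "Name": ["Jim", "Bob"]})
df2 = pd.DataFrame({
    "record_id": [123, 456, 456],
    "model_year": [2008, 2020, 2018],
    "make_desc": ["Chevy", "Hyundai", "Ford"],
    "model_desc": ["Tahoe", "Elantra", "F-150"],
    "vin": ["cvin", "hvin", "fvin"],
})
fields = ["model_year", "make_desc", "model_desc", "vin"]

# Number the matches within each record_id: 1, 2, ...
numbered = df2.assign(n=df2.groupby("record_id").cumcount() + 1)

# One row per record_id; columns become (field, n) pairs.
wide = numbered.pivot(index="record_id", columns="n", values=fields)

# Flatten (field, n) into names like "model_year1", then group columns by n.
wide.columns = [f"{field}{n}" for field, n in wide.columns]
max_n = int(numbered["n"].max())
wide = wide[[f"{field}{n}" for n in range(1, max_n + 1) for field in fields]]

# Left join so that IDs without any match are kept.
result = df1.merge(wide, how="left", left_on="ID", right_index=True)
print(result)
```

Because the column list is built from max_n, the same code should cope with an unknown number of matches per ID.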

Related

Postgres rank() without duplicates

I'm ranking race data for a series of cycling events. Racers win various amounts of points for their position in races. I want to retain the discrete event scoring, but also rank the racer in the series. For example, consider a sub-query that returns this:
License # | Rider Name | Total Points | Race Points | Race ID
123       | Joe        | 25           | 5           | 567
123       | Joe        | 25           | 12          | 234
123       | Joe        | 25           | 8           | 987
456       | Ahmed      | 20           | 12          | 567
456       | Ahmed      | 20           | 8           | 234
You can see Joe has 25 points, as he won 5, 12, and 8 points in three races. Ahmed has 20 points, as he won 12 and 8 points in two races.
Now for the ranking, what I'd like is:
Place | License # | Rider Name | Total Points | Race Points | Race ID
1     | 123       | Joe        | 25           | 5           | 567
1     | 123       | Joe        | 25           | 12          | 234
1     | 123       | Joe        | 25           | 8           | 987
2     | 456       | Ahmed      | 20           | 12          | 567
2     | 456       | Ahmed      | 20           | 8           | 234
But if I use rank() and order by "Total Points", I get:
Place | License # | Rider Name | Total Points | Race Points | Race ID
1     | 123       | Joe        | 25           | 5           | 567
1     | 123       | Joe        | 25           | 12          | 234
1     | 123       | Joe        | 25           | 8           | 987
4     | 456       | Ahmed      | 20           | 12          | 567
4     | 456       | Ahmed      | 20           | 8           | 234
Which makes sense, since there are three "ties" at 25 points.
dense_rank() solves this problem, but if there are legitimate ties across different racers, I want there to be gaps in the rank (e.g., if Joe and Ahmed both had 25 points, the next racer would be in third place, not second).
The easiest way to solve this I think would be to issue two queries, one with the "duplicate" racers eliminated, and then a second one where I can retain the individual race data, which I need for the points break down display.
I can also probably, given enough effort, think of a way to do this in a single query, but I'm wondering if I'm not just missing something really obvious that could accomplish this in a single, relatively simple query.
Any suggestions?
You have to break this into steps to get what you want, but that can be done in a single query with common table expressions:
with riders as (  -- get individual riders
    select distinct license, rider, total_points
    from racists
), places as (  -- calculate non-dense rankings
    select license, rider, rank() over (order by total_points desc) as place
    from riders
)
select p.place, r.*  -- join rankings into main table
from places p
join racists r on (r.license, r.rider) = (p.license, p.rider);
db<>fiddle here
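For readers who want to check the three steps outside the database, here is a rough pandas sketch of the same idea: deduplicate riders, rank with gaps, join back. It is not the answer's SQL; the function name rank_riders and the snake_case column names are assumptions, and the sample rows are retyped from the question.

```python
import pandas as pd

def rank_riders(races: pd.DataFrame) -> pd.DataFrame:
    # Step 1: one row per rider (mirrors the "riders" CTE).
    riders = races[["license", "rider", "total_points"]].drop_duplicates()
    # Step 2: rank with gaps; method="min" behaves like SQL rank().
    place = riders["total_points"].rank(method="min", ascending=False).astype(int)
    riders = riders.assign(place=place)
    # Step 3: join the places back onto the per-race rows.
    return races.merge(riders[["license", "rider", "place"]], on=["license", "rider"])

races = pd.DataFrame({
    "license": [123, 123, 123, 456, 456],
    "rider": ["Joe", "Joe", "Joe", "Ahmed", "Ahmed"],
    "total_points": [25, 25, 25, 20, 20],
    "race_points": [5, 12, 8, 12, 8],
    "race_id": [567, 234, 987, 567, 234],
})
ranked = rank_riders(races)
print(ranked[["place", "license", "rider", "total_points", "race_points", "race_id"]])
```

Ranking happens after deduplication, so a rider's three race rows no longer count as three ties, while genuine ties between different riders still leave gaps.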

How to turn values of a column into new individual columns in SQL

Hello everyone, I am trying to convert a categorical variable, a column named Educational Group, which has values like these:
State | Educational Group  | No of Persons
------+--------------------+--------------
A     | Below Metric       | 123
A     | metric/secondary   | 456
A     | diploma            | 789
A     | graduate and above | 101112
A     | post graduate      | 131415
B     | Below Metric       | 145
B     | metric/secondary   | 467
B     | diploma            | 564
B     | graduate and above | 987
B     | post graduate      | 875
I want this to be converted to:
State | Below Metric_No of Persons | Metric/Secondary_No of Persons | Diploma_No of Persons | ...
------+----------------------------+--------------------------------+-----------------------+
A     | 123                        | 456                            | 789                   |
B     | 145                        | 467                            | 564                   |
and so on for all states and all educational levels.
Is it possible to do this in SQL? I did the same in Python using the pivot function and it worked pretty well; now I want the same to be done in Microsoft SQL Server Management Studio.
I want to convert this
https://ibb.co/L15m2sS
into this https://ibb.co/9tLpk7V
As mentioned, PIVOT should do the trick.
SELECT *
FROM
(
    SELECT *
    FROM mytable
) AS SourceTable
PIVOT(AVG([No_of_Persons]) FOR [Educational_Group] IN ([Below Metric],
                                                        [metric/secondary],
                                                        [diploma],
                                                        [graduate and above],
                                                        [post graduate])) AS PivotTable;
Online demonstration using your table on db<>fiddle.
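For comparison, the pandas pivot that the question mentions might look roughly like the sketch below. The asker's actual Python code is not shown, so the frame, the groups list, and the output column naming are assumptions retyped from the tables in the question.

```python
import pandas as pd

groups = ["Below Metric", "metric/secondary", "diploma",
          "graduate and above", "post graduate"]
df = pd.DataFrame({
    "State": ["A"] * 5 + ["B"] * 5,
    "Educational Group": groups * 2,
    "No of Persons": [123, 456, 789, 101112, 131415, 145, 467, 564, 987, 875],
})

# One row per State, one column per Educational Group value.
wide = df.pivot(index="State", columns="Educational Group", values="No of Persons")

# pivot sorts the new columns; restore the original category order.
wide = wide[groups]

# Suffix the column names as in the desired output, then make State a column again.
wide.columns = [f"{g}_No of Persons" for g in wide.columns]
wide = wide.reset_index()
print(wide)
```

Unlike the T-SQL PIVOT, which needs the category values listed in the IN clause, pivot here discovers the categories from the data.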

PowerBI Report or SQL Query Grouping Data Spanning Columns

I'm wracking my brain trying to figure this out. I have a dataset / table that looks like this:
ID | Person1 | Person2 | Person3 | EffortPerPerson
01 | Bob | Ann | Frank | 2
02 | Frank | Bob | Joe | 3
03 | Ann | Joe | Beth | 1
I'm trying to add up "Effort" for each person. For example, Bob is 2+3, Joe is 3+1, etc. My goal is to produce a PowerBI scatter plot showing total Effort for each person.
In a perfect world, the query shouldn't care how many "Person" fields there are. It should just add up the Effort value for every row in which the individual's name appears.
I thought GROUP BY would work, but obviously that's only for one column, and I can't wrap my head around how to make nested queries work here.
Anyone have any ideas? Thanks in advance!
As Nick suggested, you should go with the Unpivot transformation. Go to Edit Queries and select the Transform tab.
Select the columns you want to transform into rows, open the dropdown menu under Unpivot Columns, and select "Unpivot Only Selected Columns".
And that's it! Power BI will aggregate the values for you.
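To see what the unpivot step does to the data, here is a small pandas sketch of the same reshaping followed by the aggregation. This is only an illustration outside Power BI; the frame is retyped from the question, and the names person_cols, long, and totals are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["01", "02", "03"],
    "Person1": ["Bob", "Frank", "Ann"],
    "Person2": ["Ann", "Bob", "Joe"],
    "Person3": ["Frank", "Joe", "Beth"],
    "EffortPerPerson": [2, 3, 1],
})

# Pick up every "Person" column, however many there are.
person_cols = [c for c in df.columns if c.startswith("Person")]

# Unpivot: one row per (ID, person) pair, carrying the row's effort along.
long = df.melt(id_vars=["ID", "EffortPerPerson"], value_vars=person_cols,
               var_name="Slot", value_name="Person")

# Total effort per person.
totals = long.groupby("Person")["EffortPerPerson"].sum()
print(totals)
```

After unpivoting, every name sits in a single Person column, which is why a plain group-by (or Power BI's default aggregation) is enough.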

SQL join tables with wildcard (MS Access)

How do I join the following tables with wildcards? I would like to get all distinct rows from the People table whose PersonName contains a SearchedName from the SearchedPeople table.
SearchedPeople:
SearchedName
--------
Andrew
John
John Smith
People:
ID PersonName Attribute Age
----------------------------------------
1 John Smith 1 23
2 John Smith Jr 3 25
3 John Smith Jr II 4 73
4 Kevin 2 21
5 Andrew Smith 1 14
6 Marco 5 90
Desired Output:
PersonName Attribute Age
----------------------------------------
John Smith 1 23
John Smith Jr 3 25
John Smith Jr II 4 73
Andrew Smith 1 14
This is the code I have so far, which doesn't work. It returns three empty rows (why is that?).
SELECT b.PersonName, b.Attribute, b.Age
FROM SearchedPeople a
LEFT JOIN People b ON "%"&a.SearchedName&"%" like b.PersonName
It returns three empty rows because you don't select any columns from table a (SearchedPeople) and the LEFT JOIN didn't produce a match.
The reason is that your criteria are in the wrong order: you are searching for PersonName in the string %SearchedName%, and you need to switch that around. Also, Access prefers the asterisk * over % as the wildcard unless you make some changes to the query or to the configuration of MS Access; see the comment from Parafait below.
I just tested this:
SELECT a.SearchedName
,b.PersonName, b.Attribute, b.Age
FROM
SearchedPeople a
LEFT JOIN People b
ON b.PersonName LIKE ("*" & a.SearchedName & "*")
Edit:
Good MS Access specific information from a comment by @Parafait, pasted into the answer in case the comment ever gets deleted:
Use ALIKE and percents work. And if OP connects to MS Access via OLEDB and not the GUI .exe program, the % wildcard is required for LIKE statements in coded SQL. OP can also change database settings to ANSI-92 mode to always use % wildcards.
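The tested query can return the same person more than once when several searched names match (for example, both "John" and "John Smith" match "John Smith"), while the question asks for distinct rows. As a rough illustration of the "contains" join plus deduplication, here is a pandas sketch rather than Access SQL. The frames are retyped from the question, and pairs, hit, and matches are made-up names. It assumes pandas 1.2 or later for how="cross".

```python
import pandas as pd

searched = pd.DataFrame({"SearchedName": ["Andrew", "John", "John Smith"]})
people = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "PersonName": ["John Smith", "John Smith Jr", "John Smith Jr II",
                   "Kevin", "Andrew Smith", "Marco"],
    "Attribute": [1, 3, 4, 2, 1, 5],
    "Age": [23, 25, 73, 21, 14, 90],
})

# Pair every person with every searched name.
pairs = people.merge(searched, how="cross")

# Keep pairs where the searched name occurs inside the person's name
# (the equivalent of PersonName LIKE "*" & SearchedName & "*").
hit = [s in p for p, s in zip(pairs["PersonName"], pairs["SearchedName"])]

# A person can match several searched names; keep each person once.
matches = pairs[hit].drop_duplicates(subset="ID")[["PersonName", "Attribute", "Age"]]
print(matches)
```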

appropriate method for text match in one column to other column in oracle

I have to write a query in Oracle. I have a table called 'Entity' with 2 columns, 'Pref_mail_name' and 'spouse_name'.
Now I want a list of all spouse_name values where the last name of the spouse_name is not populated from pref_mail_name.
For example my table has following data
Pref_mail_name         | spouse_name
Kunio Tanaka           | Lorraine
Mrs. Betty H. Williams | Chester Williams
Mr. John Baranger      | Mrs. Cathy Baranger
William kane Gallio    | Karen F. Gallio
Sangon Kim             | Jungja
I need only the 1st and 5th rows as output. I did some analysis and came up with an Oracle built-in function:
SELECT PREF_MAIL_NAME, SPOUSE_NAME,
       UTL_MATCH.JARO_WINKLER_SIMILARITY(PREF_MAIL_NAME, SPOUSE_NAME) similarity
FROM entity
ORDER BY similarity;
But the results of the above query do not look right. Even when the spouse's last name is not populated from pref_mail_name, it gives a similarity value above 80.