How to merge multiple entries in a data frame into a single entry, Python - pandas

I am trying to add up all the balances of each customer's accounts, but I am having difficulty doing so. There are two columns, 'Customer' and 'Balance'. The data frame looks like this:
Customer Balance
John Doe account1 400
John Doe account2 600
John Doe account3 200
Jane Doe account1 500
Jane Doe account2 100
John Deer account1 800
What I am trying to accomplish is this: sum all of each customer's account balances into a single account, either in a new data frame or in the same one, whichever is faster or easier.
Customer Balance
John Doe AccountX 1200
Jane Doe AccountX 600
John Deer AccountX 800
Can I please ask for some help regarding this matter? I can't seem to get around this problem. Sorry, still just a beginner programmer trying to learn. Thank you for your time, any help is greatly appreciated.

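Use groupby on the customer name with its last character replaced by 'X', and sum only the Balance column (otherwise recent pandas versions also try to sum the Customer strings):
In [181]: df.groupby(df.Customer.str[:-1].add('X'), sort=False)['Balance'].sum().reset_index()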
Out[181]:
Customer Balance
0 John Doe accountX 1200
1 Jane Doe accountX 600
2 John Deer accountX 800
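If account suffixes can have more than one digit (account10, say), slicing off a single character would not be enough. A minimal sketch, assuming the suffix is always a run of trailing digits, rebuilds the sample frame from the question and strips the digits with a regular expression:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer": ["John Doe account1", "John Doe account2", "John Doe account3",
                 "Jane Doe account1", "Jane Doe account2", "John Deer account1"],
    "Balance": [400, 600, 200, 500, 100, 800],
})

# Assumption: the account number is a run of digits at the end of the string.
# Replace it with "X" so all accounts of one customer share the same key.
key = df["Customer"].str.replace(r"\d+$", "X", regex=True)

# sort=False keeps customers in order of first appearance.
totals = df.groupby(key, sort=False)["Balance"].sum().reset_index()
print(totals)
```

Selecting the Balance column before calling sum() keeps the string column out of the aggregation.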


Keeping part of a string that overlaps based on a condition BigQuery

I have two tables that look like this
table_1
store_no  store_loc  ID
1234      CAL        ID123
6789      LAL        ID947
5678      PAA        ID456
5678      PAA        ID654
9876      LAS        ID789
table_2
ID     client_no  client_name  product
ID123  1029       John Doe     tent blue
ID947  1029       John Doe     tent red
ID456  4538       Jane Doe     skates 42
ID654  4538       Jane Doe     skates black red
ID789  9234       John Smith   bag green
I am trying to remove the parts of the 'product' that don't overlap if the 'store_no' and 'store_loc' match. So given these two tables I'm looking to get the following as a result:
ID     client_no  client_name  product
ID123  1029       John Doe     tent blue
ID947  1029       John Doe     tent red
ID456  4538       Jane Doe     skates
ID789  9234       John Smith   bag green
As in the example, I don't have a defined set of strings to remove; the string could be a number or a word. That's why I need a way to extract only the part that overlaps.
I think I need to use IF and REGEXP, but I'm not sure how to do it. I don't know how to make sure I'm only keeping the part of the string that overlaps given a condition.
Consider the simple approach below:
select t.*
from table_1
join table_2 t using (ID)
qualify row_number() over(partition by store_no, store_loc order by ID) = 1
If applied to the sample data in your question, the output is:
Row ID client_no client_name product
1 ID123 1029 John Doe tent blue
2 ID456 4538 Jane Doe skates 42
3 ID947 1029 John Doe tent red
4 ID789 9234 John Smith bag green
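Note that this output keeps 'skates 42' whole, whereas the question asks for 'skates'. If the shared part of the product names is what is needed, one possible approach outside SQL is to keep only the leading words that every product in a store group has in common. The sketch below is an illustration in Python, not part of the query above; the function name common_leading_words is hypothetical, and it assumes the overlap always sits at the start of the product string:

```python
from itertools import takewhile


def common_leading_words(products):
    """Return the leading words shared by every product string."""
    split = [p.split() for p in products]
    # zip(*split) walks the word lists position by position;
    # takewhile stops at the first position where the words differ.
    shared = takewhile(lambda words: len(set(words)) == 1, zip(*split))
    return " ".join(words[0] for words in shared)


# Products for store 5678 / PAA in the sample data.
print(common_leading_words(["skates 42", "skates black red"]))
# A group with a single product is returned unchanged.
print(common_leading_words(["tent blue"]))
```

The grouping itself (by store_no and store_loc) would still come from the join between the two tables.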

SQL Db2 - How to unify two rows in one using datetime

I've got a table with records of employees and the branches where they have worked. Each row holds the employee's start date at that branch. It looks something like this:
Employee ID  Name      Branch  Start Date
1            John Doe  234     2018-01-20
1            John Doe  300     2019-03-20
1            John Doe  250     2022-01-19
2            Jane Doe  200     2019-02-15
2            Jane Doe  234     2020-05-20
I need a query that looks at the next row for each employee and uses the start date at the next branch as the end date of the current one. E.g.:
Employee ID  Name      Branch  Start Date  End Date
1            John Doe  234     2018-01-20  2019-03-20
1            John Doe  300     2019-03-20  2022-01-19
1            John Doe  250     2022-01-19  ---
2            Jane Doe  200     2019-02-15  2020-05-20
2            Jane Doe  234     2020-05-20  ---
When there is no later record, we assume that the employee is still working at that branch, so we can leave the end date blank or use a default value such as "9999-01-01".
Is there any way we can achieve a result like this using only SQL?
Another approach to my problem would be a query that returns only the row whose date range contains a given date. For example, if I look for the branch where John Doe worked on 2020-12-01, the query should return the row for branch 300.
You can use LEAD() to peek at the next row, according to a subgroup and ordering within it.
For example:
select
t.*,
lead(start_date) over(partition by employee_id order by start_date) as end_date
from t
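For readers who want to try the same idea outside the database, here is a rough pandas equivalent. It is only a sketch: the column names (employee_id, start_date, and so on) are assumptions based on the sample table, and shift(-1) within each employee group stands in for LEAD(). It also covers the date lookup mentioned in the question.

```python
import pandas as pd

# Sample data from the question; column names are assumed.
df = pd.DataFrame({
    "employee_id": [1, 1, 1, 2, 2],
    "name": ["John Doe"] * 3 + ["Jane Doe"] * 2,
    "branch": [234, 300, 250, 200, 234],
    "start_date": pd.to_datetime(
        ["2018-01-20", "2019-03-20", "2022-01-19", "2019-02-15", "2020-05-20"]
    ),
})

df = df.sort_values(["employee_id", "start_date"])
# shift(-1) within each employee plays the role of LEAD(start_date):
# the last row of each employee gets a missing end date (NaT).
df["end_date"] = df.groupby("employee_id")["start_date"].shift(-1)

# Range lookup: which branch was employee 1 at on 2020-12-01?
day = pd.Timestamp("2020-12-01")
current = df[
    (df["employee_id"] == 1)
    & (df["start_date"] <= day)
    & (df["end_date"].isna() | (day < df["end_date"]))
]
print(current["branch"].tolist())
```

A missing end date is treated here as "still working at that branch"; filling it with a default date instead would work just as well.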

find records with date greater than matching entry in same table

I'm trying to find records from a specific record's start date forward, and I'm not sure how to do this. Example:
Name Issue Open Date Issue Close Date
John Doe 02/01/2017 02/15/2017
John Doe 02/25/2017 03/01/2017
John Doe 03/05/2017 03/15/2017
John Doe 03/20/2017 03/25/2017
Jane Doe 02/01/2017 02/20/2017
Jane Doe 02/22/2017 02/28/2017
Jane Doe 03/07/2017 03/22/2017
Jane Doe 03/25/2017 04/05/2017
Jim Jones 02/17/2017 02/25/2017
Jim Jones 02/15/2017 02/18/2017
Jim Jones 03/01/2017 03/07/2017
Jim Jones 03/19/2017 04/02/2017
I want to find each record from the first issue close date forward, but the dates are scattered. So for John Doe, I want to pull back records from 02/15/2017 onward. For Jane Doe, I want to pull back records from 02/20/2017 onward, and for Jim Jones, records from 02/25/2017 onward. I need to pull back records starting from a specific date, but I can't just say where issue close date > 02/01/2017, because I don't always know the close date and the dates are scattered. Thanks.
You will need to perform some level of aggregation and store the results in either a CTE or temp table before you can use the first issue close date as a filter. For example:
;WITH CTE AS (
SELECT Name
,min([Issue Close Date]) as FirstIssueCloseDate
FROM yourtable
GROUP BY Name)
SELECT *
FROM yourtable A
-- keep only rows opened on or after each person's first close date
JOIN CTE B on a.name = b.name and a.[Issue Open Date] >= b.FirstIssueCloseDate
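The two steps (aggregate first, then filter against the aggregate) can also be sketched in plain Python. This is only an illustration on a subset of the sample rows, assuming the goal is the one the question describes: rows opened on or after each person's earliest close date. The helper name parse is hypothetical.

```python
from datetime import datetime


def parse(text):
    """Parse the MM/DD/YYYY dates used in the sample data."""
    return datetime.strptime(text, "%m/%d/%Y")


# (name, issue open date, issue close date), a subset of the sample rows.
rows = [
    ("John Doe", "02/01/2017", "02/15/2017"),
    ("John Doe", "02/25/2017", "03/01/2017"),
    ("Jim Jones", "02/17/2017", "02/25/2017"),
    ("Jim Jones", "02/15/2017", "02/18/2017"),
    ("Jim Jones", "03/01/2017", "03/07/2017"),
]

# Step 1 (the CTE): earliest close date per name.
first_close = {}
for name, _opened, closed in rows:
    closed_at = parse(closed)
    if name not in first_close or closed_at < first_close[name]:
        first_close[name] = closed_at

# Step 2 (the join and filter): keep rows opened on or after that date.
later = [row for row in rows if parse(row[1]) >= first_close[row[0]]]
for row in later:
    print(row)
```

Because the earliest close date is computed with a minimum, the order of the rows does not matter; for Jim Jones it is 02/18/2017, even though that row is listed second.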

Hiding a field based on another field's hidden property

I'm trying to create a RDLC report. My data look like this:
Room Time Capacity Attendee
101 8:00am 100 Fred
101 8:00am 100 Bob
101 8:00am 100 Jim
101 1:00pm 100 Tom
101 1:00pm 100 Steve
101 1:00pm 100 Mike
etc.
I'd like my report to look like this:
Room  Time    Capacity  Attendee
101   8:00am  100       Fred
                        Bob
                        Jim
      1:00pm  100       Tom
                        Steve
                        Mike
I've turned on Hide Duplicates for the Room and Time columns and that works great:
Room  Time    Capacity  Attendee
101   8:00am  100       Fred
              100       Bob
              100       Jim
      1:00pm  100       Tom
              100       Steve
              100       Mike
But I don't know how to handle the capacity column. I can't set Hide Duplicates on it, because the AM and PM capacities are the same and it will hide the PM capacity.
I thought maybe I could use the Time text box's hidden property, but there doesn't seem to be a way to access it from another text box.
OK, I figured it out. On the Capacity field, I set HideDuplicates to the field I was grouping the Time on, so it only suppresses duplicates within that group, which is what I wanted.

duplicate fields with an inner join

I'm having trouble understanding how to do a multi-table join without generating lots of duplicate fields.
Let's say that I have three tables:
family: id, name
parent: id, family, name
child: id, family, name
If I do a simple select:
select family.id, family.name from family
order by family.id;
I get a simple list:
ID Name
1 Smith
2 Jones
3 Wong
If I add an inner join:
select family.id, family.name, parent.name as parent
from family
inner join parent
on parent.family = family.id
order by family.id;
I get some duplicated fields:
ID Name Parent
1 Smith Howard Smith
1 Smith Janet Smith
2 Jones Phil Jones
2 Jones Harriet Jones
3 Wong Billy Wong
3 Wong Rachel Wong
And if I add another inner join:
select family.id, family.name, parent.name as parent, child.name as child
from family
inner join parent
on parent.family = family.id
inner join child
on child.family = family.id
order by family.id;
I get even more duplicated fields:
ID Name Parent Child
1 Smith Howard Smith Peter Smith
1 Smith Howard Smith Sally Smith
1 Smith Howard Smith Fred Smith
1 Smith Janet Smith Peter Smith
1 Smith Janet Smith Sally Smith
1 Smith Janet Smith Fred Smith
2 Jones Phil Jones Mark Jones
2 Jones Phil Jones Melissa Jones
2 Jones Harriet Jones Mark Jones
2 Jones Harriet Jones Melissa Jones
3 Wong Billy Wong Mary Wong
3 Wong Billy Wong Jennifer Wong
3 Wong Rachel Wong Mary Wong
3 Wong Rachel Wong Jennifer Wong
What I would prefer, because it's more human readable, is something like this:
ID  Name   Parent         Child
1   Smith  Howard Smith   Peter Smith
           Janet Smith    Sally Smith
                          Fred Smith
2   Jones  Phil Jones     Mark Jones
           Harriet Jones  Melissa Jones
3   Wong   Billy Wong     Mary Wong
           Rachel Wong    Jennifer Wong
I know that one of the benefits of an inner join is to avoid presenting excess information through a Cartesian product. But it seems that I get something similar with a multi-table join. Is there a way to summarize each group as shown above or will this require post-processing with a scripting language like Python?
Thanks,
--Dan
This is precisely the way relational databases work: each row must contain all of the information itself, with every single field that you request. In other words, each row needs to make sense in isolation from all other rows. If you run a single query and need all three levels of information, you have to eliminate the duplicates yourself to get the desired formatting.
Alternatively, you can run three separate queries and then do in-memory joins in code. Although this may be desirable in certain rare situations, it is generally a poor use of your development time, because an RDBMS is usually much more efficient at joining relational data.
You've hit it on the head. You'll need some post processing to get the results you're looking for.
SQL query results are always simple tabular data, so to get the results you're looking for would definitely not be a pretty query. You could do it, but it would involve quite a bit of query voodoo, storing things in temporary tables or using cursors, or some other funky workaround.
I'd definitely suggest using an external application to retrieve your data and format it appropriately from there.
ORMs like Entity Framework in .NET can probably do this pretty easily, but you could definitely do this with a few nested collections or dictionaries in any language.
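As a rough illustration of the nested-dictionary idea in Python, the sketch below assumes the parents and children were fetched with two separate joins against family (so no parent/child cross product arises) and that each row arrives as a (family id, family name, person name) tuple. The row data, the column widths, and the function name report_lines are all made up for the example.

```python
from collections import defaultdict
from itertools import zip_longest

# Hypothetical results of two queries: family joined to parent, family joined to child.
parents = [(1, "Smith", "Howard Smith"), (1, "Smith", "Janet Smith"),
           (2, "Jones", "Phil Jones"), (2, "Jones", "Harriet Jones")]
children = [(1, "Smith", "Peter Smith"), (1, "Smith", "Sally Smith"),
            (1, "Smith", "Fred Smith"), (2, "Jones", "Mark Jones"),
            (2, "Jones", "Melissa Jones")]

# One nested dictionary per family, keyed by family id.
families = defaultdict(lambda: {"name": "", "parents": [], "children": []})
for fid, fname, person in parents:
    families[fid]["name"] = fname
    families[fid]["parents"].append(person)
for fid, fname, person in children:
    families[fid]["name"] = fname
    families[fid]["children"].append(person)


def report_lines(families):
    """Build report lines, printing the id and name once per family."""
    lines = []
    for fid in sorted(families):
        fam = families[fid]
        # Pair parents and children side by side, padding the shorter list.
        pairs = zip_longest(fam["parents"], fam["children"], fillvalue="")
        for i, (parent, child) in enumerate(pairs):
            id_col = str(fid) if i == 0 else ""
            name_col = fam["name"] if i == 0 else ""
            lines.append(f"{id_col:<4}{name_col:<7}{parent:<15}{child}".rstrip())
    return lines


print("\n".join(report_lines(families)))
```

Rows from a single three-table join could be folded into the same structure, but each parent and child would then need to be de-duplicated before being appended.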