Remove group rows - pandas

How do I remove group rows based on other columns for a particular ID such that:
ID | Att | Comp Att. | Inc. Att
aaa | 2 | 0 | 2
aaa | 2 | 0 | 2
bbb | 3 | 1 | 2
bbb | 3 | 1 | 2
bbb | 3 | 0 | 2
becomes:
ID | Att | Comp Att. | Inc. Att
aaa | 2 | 0 | 2
bbb | 3 | 1 | 2
I need to discard rows that are not only exact duplicates, but that also convey the same data for an ID based on the other columns.

Use drop_duplicates -- check out the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description which columns should count when checking for duplicates, but you can tell drop_duplicates which column(s) to look at.
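A minimal sketch of both readings, using the sample rows from the question. It assumes that "one row per ID, keeping the first" is what is wanted, which the question does not state outright; the variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "Att": [2, 2, 3, 3, 3],
    "Comp Att.": [0, 0, 1, 1, 0],
    "Inc. Att": [2, 2, 2, 2, 2],
})

# Reading 1: drop rows that are identical in every column.
exact = df.drop_duplicates()

# Reading 2 (assumption): keep only the first row seen for each ID.
per_id = df.drop_duplicates(subset=["ID"], keep="first")

print(exact)
print(per_id)
```

With this data, only the second reading reproduces the two-row result shown in the question, because the row `bbb 3 0 2` differs from the other `bbb` rows in one column.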

Related

Find the certain value at each row data and count the frequency pandas

I want to count, for each row, how often the values in that row's name columns occur in its text columns. For instance,
column_nameA | column_nameB | column_nameC | title | content
AAA company | AAA | Ben Simons | AAA company has new product lanuch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned.......
BBB company | BBB | Alex Wong | AAA company has new product lanuch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions…....
The result I expect is as follows.
If AAA company appears once in the title, the count is 1; if it appears twice in the title, the count is 2.
The same idea applies to the content: if AAA company appears once, the count is 1; if it appears twice in the content, the count is 2.
However, each row only considers its own names: the second row should count BBB company or BBB, not AAA company or AAA, even though AAA company appears in it.
So the result would look like:
nameA_appear_in_title | nameB_appear_in_title | nameC_appear_in_title | nameA_appear_in_content | nameB_appear_in_content | nameC_appear_in_content
1 | 1 | 0 | 2 | 1 | 1
0 | 0 | 0 | 1 | 1 | 0
All the data is stored in a DataFrame, and I hope this can be done with pandas.
One more thing to highlight: the title and content cannot be tokenized to count the frequency.
Use itertools.product to build all combinations of the two lists of column names, create a new count column for each pair, and finally remove the original columns if necessary:
from itertools import product

cols = df.columns  # original columns, dropped once the counts exist
L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']
# for each (text column, name column) pair, count the name inside the text
for a, b in product(L2, L1):
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)
df = df.drop(cols, axis=1)
print(df)
   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   1                   0
1                   0                   0                   0
   column_nameA_content  column_nameB_content  column_nameC_content
0                     2                     3                     1
1                     1                     2                     0
Finally, if necessary, subtract the column_nameA counts from the column_nameB counts:
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')
df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print (df)
   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   0                   0
1                   0                   0                   0
   column_nameA_content  column_nameB_content  column_nameC_content
0                     2                     1                     1
1                     1                     1                     0
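The subtraction step is needed because str.count matches substrings, so "AAA" is also found inside "AAA company". The sketch below shows this on a small frame; the title and content strings are shortened, hypothetical stand-ins for the question's data, and only two name columns are used.

```python
from itertools import product

import pandas as pd

# Shortened, hypothetical stand-ins for the question's sample rows.
df = pd.DataFrame({
    "column_nameA": ["AAA company", "BBB company"],
    "column_nameB": ["AAA", "BBB"],
    "title": ["AAA company has a new product.", "AAA company has a new product."],
    "content": [
        "AAA company released a product. AAA says it changed.",
        "BBB says it changed, and BBB company invested.",
    ],
})

counts = pd.DataFrame(index=df.index)
for text_col, name_col in product(["title", "content"], ["column_nameA", "column_nameB"]):
    # Row-wise substring count: each row's own name inside its own text.
    counts[f"{name_col}_{text_col}"] = [
        text.count(name) for text, name in zip(df[text_col], df[name_col])
    ]

# "AAA" also matches inside "AAA company", so the raw B counts include the A counts.
standalone_b = counts["column_nameB_content"] - counts["column_nameA_content"]

print(counts)
print(standalone_b.tolist())
```

Whether the short name should be counted on its own or together with the long name is a choice the question leaves open, so the subtraction is optional.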

MDX query to get employees under given supervisor with parent child relationship

I have an employee dimension in my cube where each employee has a supervisor who is also an employee. The sample data set is:
Employee ID | Supervisor ID | Name
1 | 0 | ABC
2 | 1 | AAA
3 | 1 | BBB
4 | 2 | CCC
5 | 2 | DDD
6 | 4 | EEE
7 | 3 | FFF
I want to get all the employees under a given supervisor. E.g. if the supervisor is 2, the result should be
CCC
DDD
EEE
Using the query below, I can get all the employees:
SELECT {AddCalculatedMembers({[Employee].[EmployeeName].Children})} ON COLUMNS FROM [MY_CUBE]
I am new to MDX, so please tell me how to write an MDX query for the above requirement.
#mmarie
I already have a cube, but I am not sure whether I implemented it correctly. My schema is as follows.
The dimension "dimEmployee" has the columns "EmployeeID, EmployeeName, Dept".
I have also used a bridge table "BridgeEmployee" with the columns "ParentEmployeeID, ChildEmployeeID, Distance".
Sample data in the bridge table:
ParentEmployeeID | ChildEmployeeID | Distance
1 | 1 | 0
2 | 2 | 0
1 | 2 | 1
3 | 3 | 0
1 | 3 | 1
4 | 4 | 0
2 | 4 | 1
1 | 4 | 2
I am using SSAS, and I have implemented the bridge table as a measure group.
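This is not an MDX answer, but the expected result can be pinned down with a small Python sketch that walks the parent-child table from the question. The function name `descendants` is hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
from collections import defaultdict, deque

# (Employee ID, Supervisor ID, Name) rows from the question.
rows = [
    (1, 0, "ABC"), (2, 1, "AAA"), (3, 1, "BBB"), (4, 2, "CCC"),
    (5, 2, "DDD"), (6, 4, "EEE"), (7, 3, "FFF"),
]

def descendants(rows, supervisor_id):
    """Return the names of all direct and indirect reports, breadth-first."""
    children = defaultdict(list)
    names = {}
    for emp_id, sup_id, name in rows:
        children[sup_id].append(emp_id)
        names[emp_id] = name

    result = []
    queue = deque(children[supervisor_id])
    while queue:
        emp_id = queue.popleft()
        result.append(names[emp_id])
        queue.extend(children[emp_id])  # visit this employee's reports later
    return result

print(descendants(rows, 2))
```

For supervisor 2 this yields CCC, DDD and EEE, which is the set a correct MDX query should return. It also corresponds to the bridge-table rows with ParentEmployeeID = 2 and Distance greater than 0.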

View to replace values with max value corresponding to a match

I am sure my question is very simple for some, but I cannot figure it out and it is one of those things difficult to search an answer for. I hope you can help.
In a table in SQL I have the following (simplified data):
UserID UserIDX Number Date
aaa bbb 1 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 3 15.02.2020
xxx bbb 99 15.02.2020
I need a view that returns the same number of records, but for every combination of UserID and UserIDX there should be only one value in the Number field, namely the highest value found for that combination. The Date field needs to remain unchanged. So the above would be transformed to:
UserID UserIDX Number Date
aaa bbb 5 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 9 15.02.2020
xxx bbb 99 15.02.2020
So, for all instances of aaa+bbb combination the unique value in Number should be 5 and for ppp+ggg the unique number is 9.
Thank you very much.
Leo
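-- "table" stands in for the real table name
select a.userid, a.useridx, b.maxnum, a.date
from table a
inner join (
    -- highest number for each userid/useridx combination
    select userid, useridx, max(number) maxnum
    from table
    group by userid, useridx) b
on a.userid = b.userid and a.useridx = b.useridx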
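The join can be tried out end to end with Python's built-in sqlite3 module. This is only a sketch: the question does not name its database, so SQLite is a stand-in, and the table name `user_numbers` and view name `user_max_numbers` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_numbers (UserID TEXT, UserIDX TEXT, Number INTEGER, Date TEXT)"
)
conn.executemany(
    "INSERT INTO user_numbers VALUES (?, ?, ?, ?)",
    [
        ("aaa", "bbb", 1, "21.01.2000"),
        ("aaa", "bbb", 5, "21.01.2010"),
        ("ppp", "ggg", 9, "21.01.2009"),
        ("ppp", "ggg", 3, "15.02.2020"),
        ("xxx", "bbb", 99, "15.02.2020"),
    ],
)

# Every original row is kept; only Number is replaced by the group maximum.
conn.execute("""
    CREATE VIEW user_max_numbers AS
    SELECT a.UserID, a.UserIDX, b.MaxNum AS Number, a.Date
    FROM user_numbers AS a
    INNER JOIN (
        SELECT UserID, UserIDX, MAX(Number) AS MaxNum
        FROM user_numbers
        GROUP BY UserID, UserIDX
    ) AS b
    ON a.UserID = b.UserID AND a.UserIDX = b.UserIDX
""")

rows = conn.execute(
    "SELECT UserID, UserIDX, Number, Date FROM user_max_numbers"
).fetchall()
for row in rows:
    print(row)
```

The row count stays at five because the subquery returns exactly one row per UserID/UserIDX combination, so the join never multiplies rows.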

SQL - querying without duplicates based on another column / improving a condition?

I have written a query which involves joins and returns the result below:
Name ID
AAA 1
BBB 1
BBB 6
CCC 1
CCC 6
DDD 6
EEE 1
But I want to filter the result further so that, where a value in the first column is duplicated, the row with the lower ID is dropped. I.e., the BBB and CCC rows with value 1 should be removed. The result should be
AAA 1
BBB 6
CCC 6
DDD 6
EEE 1
Note: I have a condition Where (ID = '6' or ID = '1'). Is there any way to improve this condition so that it says Where ID = 6, or ID = 1 if no 6 is available in that table?
You will likely want to add:
GROUP BY name
to the bottom of your query and change ID to MAX(ID) in your SELECT statement.
It is hard to give a more specific answer without seeing the query you've already written.
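A quick way to check the GROUP BY / MAX(ID) suggestion against the sample result is Python's built-in sqlite3 module. In this sketch the table `joined_result` is a hypothetical stand-in for the output of the original, unseen join query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding what the original join query returned.
conn.execute("CREATE TABLE joined_result (Name TEXT, ID INTEGER)")
conn.executemany(
    "INSERT INTO joined_result VALUES (?, ?)",
    [("AAA", 1), ("BBB", 1), ("BBB", 6), ("CCC", 1),
     ("CCC", 6), ("DDD", 6), ("EEE", 1)],
)

# One row per Name, keeping the highest ID found for that Name.
rows = conn.execute(
    "SELECT Name, MAX(ID) AS ID FROM joined_result GROUP BY Name ORDER BY Name"
).fetchall()
print(rows)
```

Because MAX(ID) picks 6 when it exists and falls back to 1 otherwise, the existing filter on ID 6 or 1 can stay as it is.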

SQL - conditional statements in crosstab queries - say what

I am working with MS Access 2007. I have 2 tables: Types of Soda, and Likeability.
Types of Soda are: Coke, Pepsi, Dr. Pepper, and Mello Yellow
Likeability is a lookup with these options: Liked, Disliked, No preference
I know how to count the number of Cokes or Mello Yellows in the table using DCount("[Types]", "[Types of Soda]", "[Types] = 'Coke'")
I also know how to count the number of Liked, Disliked, No preference:
DCount("[Perception]", "[Likeability]", "[Perception] = 'Liked'")
But what if I need to count the number of "Likes" by Type?
i.e. the table should look like this:
Perception | Coke | Pepsi | Dr. Pepper | Mello Yellow
Likes | 9 | 2 | 12 | 19
Dislikes | 2 | 45 | 1 | 0
No Preference | 0 | 12 | 14 | 15
I know that in Access I can create crosstab queries, but my tables are joined by an ID. My [Likeability] table has an ID column, which is the same as the ID column in my [Types] table. That's the relationship, and that's what connects my tables.
My problem is that I don't know how to apply the condition for counting the likes, dislikes, etc, for ONLY the Types that I specify. It seems like I first have to check the [Likeability] table for "Likes", and cross reference the ID with the ID in the [Types] table.
I am very confused, and you may be too, now. But all I want to do is count the # of Likes and Dislikes for each type of soda.
Please help.
It's not really clear (to me anyway) what your tables look like, so let's assume the following.
Tables
Soda
------
Soda_ID (Long Integer (Increment))
Soda_Name (Text(50))
Perception
------
Perception_ID (Long Integer (Increment))
Perception_Name (Text(50))
Likeability
-----------
Likeability_ID (Long Integer (Increment))
Soda_ID (Long Integer)
Perception_ID (Long Integer)
User_ID (Long Integer)
Data
Soda_Id  Soda_Name
-------  ---------
1        Coke
2        Pepsi
3        Dr. Pepper
4        Mello Yellow
Perception_ID  Perception_Name
-------------  ---------------
1              Likes
2              Dislikes
3              No Preference
Likeability_ID  Soda_ID  Perception_ID  User_ID
--------------  -------  -------------  -------
1               1        1              1
2               2        1              1
3               3        1              1
4               4        1              1
5               1        2              2
6               2        2              2
7               3        2              2
8               4        2              2
9               1        3              3
10              2        3              3
11              3        3              3
12              4        3              3
13              1        1              5
14              2        2              6
15              2        2              7
16              3        3              8
17              3        3              9
18              3        3              10
Transform query
You could write a query like this
TRANSFORM
    Count(l.Likeability_ID) AS CountOfLikeability_ID
SELECT
    p.Perception_Name
FROM
    Soda s
    INNER JOIN (Perception p
        INNER JOIN Likeability l
        ON p.Perception_ID = l.Perception_ID)
    ON s.Soda_Id = l.Soda_ID
WHERE
    p.Perception_Name<>"No Preference"
GROUP BY
    p.Perception_Name
PIVOT
    s.Soda_Name;
Query output
Perception_Name  Coke  Dr_ Pepper  Mello Yellow  Pepsi
---------------  ----  ----------  ------------  -----
Dislikes         1     1           1             3
Likes            2     1           1             1
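For readers who want to check the numbers outside Access, the same cross tabulation can be sketched with pandas. This is only an illustration built on the assumed tables and sample data above, not a replacement for the TRANSFORM query; the User_ID column is left out because the count does not use it.

```python
import pandas as pd

# The assumed tables and sample data from the answer above.
soda = pd.DataFrame({
    "Soda_ID": [1, 2, 3, 4],
    "Soda_Name": ["Coke", "Pepsi", "Dr. Pepper", "Mello Yellow"],
})
perception = pd.DataFrame({
    "Perception_ID": [1, 2, 3],
    "Perception_Name": ["Likes", "Dislikes", "No Preference"],
})
likeability = pd.DataFrame({
    "Likeability_ID": list(range(1, 19)),
    "Soda_ID": [1, 2, 3, 4] * 3 + [1, 2, 2, 3, 3, 3],
    "Perception_ID": [1] * 4 + [2] * 4 + [3] * 4 + [1, 2, 2, 3, 3, 3],
})

# Same joins and filter as the TRANSFORM query.
joined = likeability.merge(soda, on="Soda_ID").merge(perception, on="Perception_ID")
joined = joined[joined["Perception_Name"] != "No Preference"]

# Rows are perceptions, columns are sodas, cells are row counts.
table = pd.crosstab(joined["Perception_Name"], joined["Soda_Name"])
print(table)
```

The counts match the query output above: two Likes for Coke and three Dislikes for Pepsi, with every other cell equal to 1.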