Assigning string group ID in pandas - pandas

I have a data frame (data)
Col 1 Col 2 Combination
1 2 (1,2)
3 4 (3,4)
1 2 (1,2)
2 3 (2,3)
4 6 (4,6)
3 4 (3,4)
I want to assign a group ID based on Col 1 and Col 2, as a categorical variable rather than a numerical one.
Desired output:
Col 1 Col 2 Combination GroupID
1 2 (1,2) A
3 4 (3,4) C
1 2 (1,2) A
2 3 (2,3) B
4 6 (4,6) D
3 4 (3,4) C
The GroupID needs to be a categorical data type, not a numerical one, and the labels can follow any order.
I have tried this code, but the GroupID column is still treated as a numerical data type:
data['GroupID'] = data.groupby(['Col 1', 'Col 2']).ngroup()
data['GroupID'] = data['GroupID'].astype('category')
Can anyone suggest a proper way to deal with this issue?
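One possible approach, as a minimal sketch: ngroup() returns integer codes, and casting them to category changes the dtype but not the labels, so the codes have to be mapped to strings before the cast. The helper letter_label is hypothetical (it is not part of pandas), and the column names are taken from the table in the question.

```python
import pandas as pd

# Sample frame rebuilt from the question (column names as shown there).
data = pd.DataFrame({"Col 1": [1, 3, 1, 2, 4, 3],
                     "Col 2": [2, 4, 2, 3, 6, 4]})


def letter_label(n):
    """Hypothetical helper: 0 -> 'A', 25 -> 'Z', 26 -> 'AA', ..."""
    n = int(n) + 1
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


# Integer code per (Col 1, Col 2) combination, numbered in sorted key order.
codes = data.groupby(["Col 1", "Col 2"]).ngroup()

# Map the codes to string labels first, then cast to a categorical dtype.
data["GroupID"] = codes.map(letter_label).astype("category")
print(data)
```

Because ngroup() numbers the groups in sorted key order by default, the sample data ends up labelled A, C, A, B, D, C, and the column holds strings rather than numbers.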

Related

Assign row number per ID, where certain values always have fixed numbers

I want to create a query that assigns row numbers per ID in a database table, and certain specific values always get fixed row numbers. For instance, if the value in col2 is A, then the row number should be consistently set to 1. Similarly, if col2 contains the value B, then the row number should always be 2. All other values in col2 should be assigned row numbers in consecutive order starting from 3.
Desired result:
myid col1 col2 row_number
----------------------------------
1 foo A 1
1 bar B 2
1 foobar C 3
1 foobar D 4
2 foobar A 1
2 foob X 3
3 hello B 2
3 hello Z 3
3 hi Y 4
Here is an example which is not working properly.
Sounds like you want to start the row_number with a specific offset, ignoring constant values and assigning them a constant row number.
You can do something a bit ugly like this:
SELECT myid, col1, col2,
       case
           when col2 = 'A' then 1
           when col2 = 'B' then 2
           else row_number() over (partition by myid
                                   order by case when col2 = 'A' then 'ZZZ'
                                                 when col2 = 'B' then 'ZZZ1'
                                                 else col2
                                            end) + 2
       end as row_number
FROM newtable
ORDER BY myid, row_number
Result:
MYID COL1 COL2 ROW_NUMBER
1 foo A 1
1 bar B 2
1 foobar C 3
1 foobar D 4
2 foobar A 1
2 foob X 3
3 hello B 2
3 hello Y 3
3 hello Z 4
This starts the row number at +2 (the number of constant values, here A and B) and gives each constant value a sort key that is ordered last in the row_number window function, so the remaining rows are numbered first.
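To try the query without a database server, here is a small sketch that runs it against an in-memory SQLite database from Python. It assumes a SQLite build with window-function support (3.25 or later), loads the rows from the question's desired result, and renames the alias to rn purely as a precaution; none of this is part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE newtable (myid INTEGER, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO newtable VALUES (?, ?, ?)",
    [(1, "foo", "A"), (1, "bar", "B"), (1, "foobar", "C"), (1, "foobar", "D"),
     (2, "foobar", "A"), (2, "foob", "X"),
     (3, "hello", "B"), (3, "hello", "Z"), (3, "hi", "Y")],
)

# Same logic as the answer: A and B get fixed numbers, everything else is
# numbered by the window function (with A/B pushed to the end) and offset by 2.
query = """
SELECT myid, col1, col2,
       CASE
           WHEN col2 = 'A' THEN 1
           WHEN col2 = 'B' THEN 2
           ELSE ROW_NUMBER() OVER (
                    PARTITION BY myid
                    ORDER BY CASE
                                 WHEN col2 = 'A' THEN 'ZZZ'
                                 WHEN col2 = 'B' THEN 'ZZZ1'
                                 ELSE col2
                             END
                ) + 2
       END AS rn
FROM newtable
ORDER BY myid, rn
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that the 'ZZZ' sentinels only work as long as no real col2 value sorts after them; if that cannot be guaranteed, a separate sort flag would be safer.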

Hive - Group by with respect to following values

I have a table with rows:
id a b
0 1 1
1 1 2
2 2 1
3 1 1
I need to get the sum of the "b" values grouped by "a", but with a new group starting each time the value of "a" changes.
For my example I want to get:
a b
1 3
2 1
1 1
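The requirement is a grouping over consecutive runs of "a" rather than over its distinct values. As an illustration of the idea only, here is a pandas sketch (not Hive); it assumes that "id" defines the row order, and the names run and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2, 3],
                   "a": [1, 1, 2, 1],
                   "b": [1, 2, 1, 1]})

# Assumption: "id" gives the order in which changes of "a" are detected.
df = df.sort_values("id")

# A new run starts whenever "a" differs from the previous row; the running
# count of those change points gives every run its own number.
run = (df["a"] != df["a"].shift()).cumsum().rename("run")

# Sum "b" within each run and keep the run's value of "a".
result = (df.groupby(run)
            .agg(a=("a", "first"), b=("b", "sum"))
            .reset_index(drop=True))
print(result)
```

The same two steps (flag the change points, then take a running sum of the flags) are what a window-function formulation would need to express in SQL.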

Custom aliases for all fields with GROUP BY ROLLUP

I have such tables:
Group - combination of TypeId and ZoneId
ID TypeID ZoneID
-- -- --
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 3 3
Object
ID GroupId
-- --
1 1
2 1
3 2
4 3
5 3
6 3
I want to build a query that groups these tables by TypeId and ZoneId and returns the number of objects for each combination of these fields:
ResultTable
TypeId ZoneId Number of objects
-- -- --
1 1 2
1 2 1
2 1 3
2 2 1
2 3 0
3 3 0
Query for this:
SELECT
    group.TypeId,
    group.ZoneId,
    COUNT(obj.ID) as NumberOfObjects
FROM [Group] group
JOIN [Object] obj on obj.GroupID = group.ID
GROUP BY group.TypeId, group.ZoneId
ORDER BY group.TypeId
But I want to add a summary row after each group, like this:
ResultTableWithSummary
TypeId ZoneId Number of objects
-- -- --
1 1 2
1 2 1
Summary (empty field) 3
2 1 3
2 2 1
2 3 0
Summary (empty field) 4
3 3 0
Summary (empty field) 0
The problem is that I can use GROUP BY ROLLUP(group.TypeId, group.ZoneId):
TypeId ZoneId Number of objects
-- -- --
1 1 2
1 2 1
1 null 3
2 1 3
2 2 1
2 3 0
2 null 4
3 3 0
3 null 0
but I cannot, or don't know how to, replace the non-null group.TypeId in the summary rows with "Summary".
How can I do this?
The simplest method is coalesce(), but you need to be sure the types match:
SELECT COALESCE(CONVERT(VARCHAR(255), group.TypeId), 'Summary') as TypeId,
. . .
This is not the most general method, because it does not handle real NULL values in the GROUP BY keys. That doesn't seem to be an issue in this case. If it were, you could use a CASE expression with GROUPING().
EDIT:
For your particular variant (which I find strange), you can use:
SELECT (CASE WHEN group.TypeId IS NULL OR group.ZoneID IS NULL
             THEN 'Summary' ELSE CONVERT(VARCHAR(255), group.TypeId)
        END) as TypeId,
. . .
In practice, I would use something similar to the COALESCE() in both columns, so I don't lose the information on what the summary is for.
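To make the effect of the CASE variant concrete, here is a rough pandas sketch that starts from the ROLLUP output posted in the question and applies the same rule: any row with a NULL key becomes a "Summary" row. It is an illustration in Python rather than T-SQL, and the names rollup, is_summary and labeled are made up for the example.

```python
import pandas as pd

# The ROLLUP result as posted in the question (None marks the NULL keys).
rollup = pd.DataFrame({
    "TypeId": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "ZoneId": [1, 2, None, 1, 2, 3, None, 3, None],
    "NumberOfObjects": [2, 1, 3, 3, 1, 0, 4, 0, 0],
})

# Same condition as the CASE expression: either key being NULL means summary.
is_summary = rollup["TypeId"].isna() | rollup["ZoneId"].isna()

# Convert both keys to text first so they can share a column with 'Summary'
# (the equivalent of CONVERT(VARCHAR(255), ...) in the SQL).
type_label = rollup["TypeId"].map(lambda t: "" if pd.isna(t) else str(int(t)))
zone_label = rollup["ZoneId"].map(lambda z: "" if pd.isna(z) else str(int(z)))

labeled = rollup.assign(
    TypeId=type_label.where(~is_summary, "Summary"),
    ZoneId=zone_label.where(~is_summary, ""),
)
print(labeled)
```

As in the SQL, the summary rows no longer say which TypeId they belong to, which is the information loss the last paragraph of the answer warns about.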

Select field 1 Where Field 2 ='value1' And field 2 ='value2' when Field 1 is Same

I am having trouble filtering my SQL records. I need something like the following. The table is:
A B
1 1
1 3
2 1
2 2
2 3
2 4
3 1
3 2
I want to select the values of A that have B=1 and B=2 and B=3 for the same A. The result is:
A
2
Please help
You can use aggregation:
select a
from mytable
where b in (1, 2, 3)
group by a
having count(*) = 3
This assumes no duplicates in the table - else, you need to change the having clause to:
having count(distinct b) = 3
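For a quick check, this sketch runs the count(distinct b) variant against the sample table in an in-memory SQLite database from Python. The variable names (wanted, placeholders, result) are only for illustration, and the required count is derived from the list of wanted values instead of being hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 1), (1, 3), (2, 1), (2, 2), (2, 3), (2, 4), (3, 1), (3, 2)],
)

wanted = (1, 2, 3)
placeholders = ", ".join("?" for _ in wanted)

# Keep only rows with a wanted b, then require that each a has all of them.
query = f"""
SELECT a
FROM mytable
WHERE b IN ({placeholders})
GROUP BY a
HAVING COUNT(DISTINCT b) = ?
"""
result = [row[0] for row in conn.execute(query, (*wanted, len(wanted)))]
print(result)
```

Only a = 2 has rows for all three values, so it is the only value returned for the sample data.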

collapse staggered records to a single row for repeating keys

I want to collapse a table in SQL to eliminate the NULL values, but the table has repeating keys. For example, I want to collapse this:
key1 key2 v1 v2 v3
1 A a NULL NULL
1 A NULL NULL 9
1 A NULL x NULL
1 A b NULL NULL
1 A NULL NULL 8
1 A NULL x NULL
1 A a NULL NULL
1 A NULL NULL 7
1 A NULL y NULL
1 A b NULL NULL
1 A NULL NULL 6
1 A NULL y NULL
1 B a NULL NULL
1 B NULL NULL 5
1 B NULL z NULL
1 B b NULL NULL
1 B NULL NULL 4
1 B NULL z NULL
1 C a NULL NULL
1 C NULL NULL 10
1 C NULL z NULL
1 C b NULL NULL
1 C NULL NULL 11
1 C NULL z NULL
into this:
key1 key2 v1 v2 v3
1 A a x 9
1 A b x 8
1 A a y 7
1 A b y 6
1 B a z 5
1 B b z 4
1 C a z 10
1 C b z 11
Aggregate functions don't work and I haven't had success with self-join. Any idea?
You have a key/value table. This is something we usually avoid, but sometimes it cannot be avoided. Your original table looks something like this:
key1 key2 col value
1 A v1 a
1 A v1 a
1 A v1 a
1 A v1 a
1 A v1 b
1 A v1 b
1 A v1 b
1 A v1 b
1 A v2 x
1 A v2 x
1 A v2 y
1 A v2 y
...
I am showing the rows in another order than your query result, but that doesn't matter, because a table has no inherent order; it contains the data as an unordered set. We can see that for the same key 1|A|v1 the table contains different values (four times 'a', four times 'b'). This is unexpected. Usually key/value tables hold one value per key.
So it may be that there is something wrong with your data model. Or the table has more columns, e.g. a date to show history data and also enable us to select the current value for 1|A|v1. Then you'd have to change your original query to take this into account. Or that data model is correct and 1|A|v1 does have four 'a' and four 'b', but then your expected query result makes no sense, for there is nothing to relate v1='a' to v2='x' for instance.
So something is wrong: the data model, the existing query, or the desired result. Find out which.
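If it turns out that the source does have a column that fixes the row order, the collapse becomes possible. The following pandas sketch shows the idea under that assumption: seq is a hypothetical ordering column that does not exist in the question, and a new output record is assumed to start at every row where v1 is filled in. Only the first rows for keys A and B are used.

```python
import pandas as pd

rows = [
    (1, "A", "a", None, None), (1, "A", None, None, 9), (1, "A", None, "x", None),
    (1, "A", "b", None, None), (1, "A", None, None, 8), (1, "A", None, "x", None),
    (1, "B", "a", None, None), (1, "B", None, None, 5), (1, "B", None, "z", None),
    (1, "B", "b", None, None), (1, "B", None, None, 4), (1, "B", None, "z", None),
]
df = pd.DataFrame(rows, columns=["key1", "key2", "v1", "v2", "v3"])

# Hypothetical ordering column; without something like it the rows cannot be
# related to each other, which is the point made above.
df["seq"] = range(len(df))
df = df.sort_values("seq")

# Assumption: every row with a non-null v1 opens a new logical record.
record = df["v1"].notna().cumsum().rename("record")

# first() takes the first non-null value per column within each record.
collapsed = (df.groupby(["key1", "key2", record], sort=False)[["v1", "v2", "v3"]]
               .first()
               .reset_index()
               .drop(columns="record"))
print(collapsed)
```

Without such an ordering column the grouping step has nothing reliable to work with, so this is not a fix for the table exactly as posted.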
Have you tried "select distinct"?
Select distinct key1, key2, v1, v2, v3
From SomeTable