Numpy delete repeated rows

I simply need to remove the rows that are repeated in an array while keeping one occurrence of each. I can't use unique directly because I need to maintain the original order.
Example
1 a234 125
1 a123 265
1 a234 125
1 a145 167
1 a234 125
2 a189 547
2 a189 547
3 a678 567
3 a357 569
I need this output
1 a234 125
1 a123 265
1 a145 167
2 a189 547
3 a678 567
3 a357 569

I think this does what you want, using np.unique with the return_index keyword argument:
import numpy as np
a = np.array([[1, 'a234', 125],
[2, 'b189', 547],
[1, 'a234', 125],
[3, 'c678', 567],
[1, 'a234', 125],
[2, 'b189', 547]])
b = a.ravel().view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))
_, unique_idx = np.unique(b, return_index=True)
new_a = a[np.sort(unique_idx)]
>>> new_a
array([['1', 'a234', '125'],
['2', 'b189', '547'],
['3', 'c678', '567']],
dtype='|S4')
The hackiest part is the view b, which turns each row into a single element of np.void dtype so that full rows can be compared for equality by np.unique.
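For what it's worth, NumPy 1.13 and newer let np.unique compare whole rows directly through its axis argument, so the void view is not needed there. A minimal sketch of the same idea:
# NumPy >= 1.13: deduplicate whole rows, keeping the first occurrence of each in original order
_, unique_idx = np.unique(a, axis=0, return_index=True)
new_a = a[np.sort(unique_idx)]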


Pandas way to separate a DataFrame based on previous groupby() explorations without losing the non-grouped columns

I tried to translate the problem with my real data into the example data presented in my question. Maybe I just have a simple technical problem, or maybe my whole approach and workflow is not the best?
The objective
There are persons (column name) who have eaten different fruits on different days. There is also some more data (columns foo and bar) I do not want to lose.
I want to separate/split the original data without losing the additional data (in foo and bar).
The condition to separate on is the number of unique fruits eaten on the specific days.
This is the initial data:
>>> df
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Anna 1 Banana 495 924
4 Anna 1 Strawberry 236 542
5 Bob 1 Strawberry 420 894
6 Bob 2 Apple 27 192
7 Bob 2 Kiwi 671 145
The separated interim result should look like these two DataFrames:
>>> two
name day fruit foo bar
0 Anna 1 Banana 495 924
1 Anna 1 Strawberry 236 542
2 Bob 2 Apple 27 192
3 Bob 2 Kiwi 671 145
>>> non_two
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Bob 1 Strawberry 420 894
Example explanation in words: Tim ate just apples on days 1 and 2. It does not matter how many apples; it just matters that it is one unique fruit.
What I have done so far
I did some groupby() magic to find out who ate two (or fewer/more than two) unique fruits on which days.
import pandas as pd
import random as rd
data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
'day': [1, 1, 2, 1, 1, 1, 2, 2],
'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
'Strawberry', 'Apple', 'Kiwi'],
'foo': rd.sample(range(1000), 8),
'bar': rd.sample(range(1000), 8)
}
# That is the primary DataFrame
df = pd.DataFrame(data)
# Explore the data
a = df[['name', 'day', 'fruit']].groupby(['name', 'day', 'fruit']).count().reset_index()
b = a.groupby(['name', 'day']).count()
# People who ate 2 fruits on specific days
two = b[(b.fruit == 2)].reset_index()
print(two)
# People who ate fewer or more than 2 fruits on specific days
non_two = b[(b.fruit != 2)].reset_index()
print(non_two)
Here is my roadblock
With the dataframes two and non_two I have the information I want. Now I want to separate the initial dataframe based on that information. I think name and day are the columns I should use to select and separate rows in the initial dataframe.
# filter mask
mymask = (df.name == two.name) & (df.day == two.day)
df_two = df[mymask]
df_non_two = df[~mymask]
But this does not work. The first line raises ValueError: Can only compare identically-labeled Series objects.
Use DataFrameGroupBy.nunique within GroupBy.transform, which makes it possible to filter the original DataFrame directly:
mymask = df.groupby(['name', 'day'])['fruit'].transform('nunique').eq(2)
df_two = df[mymask]
df_non_two = df[~mymask]
print (df_two)
name day fruit foo bar
3 Anna 1 Banana 335 62
4 Anna 1 Strawberry 286 694
6 Bob 2 Apple 822 738
7 Bob 2 Kiwi 793 449
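An equivalent result can also be reached with DataFrameGroupBy.filter, which keeps or drops whole (name, day) groups; the transform-based mask above is usually faster, but the filter version reads very literally. A minimal sketch, assuming the same df as above:
# keep only the (name, day) groups with exactly 2 unique fruits
df_two = df.groupby(['name', 'day']).filter(lambda g: g['fruit'].nunique() == 2)
# and the complement
df_non_two = df.groupby(['name', 'day']).filter(lambda g: g['fruit'].nunique() != 2)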

Converting the numpy array to a proper dataframe

I have numpy array as data below
data = np.array([[1,2],[4,5],[7,8]])
I want to split it and convert it into a dataframe with the column names shown below, where the first value of each row goes into one column and the second value into the other:
df_main:
value_items excluded_items
1 2
4 5
7 8
from which I can later take, for example:
df:
value_items
1
4
7
df2:
excluded_items
2
5
8
I tried to convert it to a dataframe with
df = pd.DataFrame(data)
but it still resulted in an array of int32, so the splitting failed for me.
Use reshape to get a 2d array and also add the columns parameter:
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
Sample:
data = np.arange(785*2).reshape(1, 785, 2)
print (data)
[[[ 0 1]
[ 2 3]
[ 4 5]
...
[1564 1565]
[1566 1567]
[1568 1569]]]
print (data.shape)
(1, 785, 2)
df = pd.DataFrame(data.reshape(-1,2), columns=['value_items','excluded_items'])
print (df)
value_items excluded_items
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
.. ... ...
780 1560 1561
781 1562 1563
782 1564 1565
783 1566 1567
784 1568 1569
[785 rows x 2 columns]
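To get from df_main to the two single-column frames described in the question, plain column selection is enough. A minimal, self-contained sketch using the small 3x2 example array from the question:
import numpy as np
import pandas as pd
data = np.array([[1, 2], [4, 5], [7, 8]])
df_main = pd.DataFrame(data.reshape(-1, 2), columns=['value_items', 'excluded_items'])
df = df_main[['value_items']]       # keeps only the first value of each row
df2 = df_main[['excluded_items']]   # keeps only the second value of each row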

Sort column of DataFrame by similarity, first row should be fixed (Python)

I want to order the frame depending on the first row of B. So the first row of B is always fixed, and the second, third, ... rows are sorted by similarity to B's first row. It should also be flexible: B could contain 2-20 or even more rows.
I expect a result like this
Any idea how to do this?
If you sort the values by the difference from the first value in b, you can just use that index into the original DataFrame:
In [35]: df = pd.DataFrame({'a': range(6), 'b': [483, 479, 503, 479, 485, 495]})
In [36]: df
Out[36]:
a b
0 0 483
1 1 479
2 2 503
3 3 479
4 4 485
5 5 495
In [37]: idx = df['b'].sub(df.loc[0, 'b']).abs().sort_values().index
In [38]: df.loc[idx]
Out[38]:
a b
0 0 483
4 4 485
1 1 479
3 3 479
5 5 495
2 2 503
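Since the frame may contain anywhere from 2 to 20 or more rows, the same idea can be wrapped in a small helper; a sketch, where sort_by_first is a made-up name and the reference column is passed in as a parameter:
def sort_by_first(df, col='b'):
    # order rows by absolute distance from the value in the first row of `col`;
    # the first row has distance 0, so it always stays on top
    # (mergesort is stable, so ties keep their original order)
    idx = df[col].sub(df[col].iloc[0]).abs().sort_values(kind='mergesort').index
    return df.loc[idx]

df_sorted = sort_by_first(df, 'b')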

Add rows to pandas dataframe using column of dictionaries

I have a dataframe like this:
matrix = [(222, {'a': 1, 'b':3, 'c':2, 'd':1}),
(333, {'a': 1, 'b':0, 'c':0, 'd':1})]
df = pd.DataFrame(matrix, columns=['ordernum', 'dict_of item_counts'])
ordernum dict_of item_counts
0 222 {'a': 1, 'b': 3, 'c': 2, 'd': 1}
1 333 {'a': 1, 'b': 0, 'c': 0, 'd': 1}
and I would like to create a dataframe in which each ordernum is repeated for each dictionary key in dict_of_item_counts that is not 0. I would also like to create a key column that shows the corresponding dictionary key for this row as well as a value column that contains the dictionary values. Finally, I would also like an ordernum_index that counts the different rows in the dataframe for each ordernum.
The final dataframe should look like this:
ordernum ordernum_index key value
222 1 a 1
222 2 b 3
222 3 c 2
222 4 d 1
333 1 a 1
333 2 d 1
Any help would be much appreciated :)
Always try to structure your data. This can be done easily like below:
>>> matrix
[(222, {'a': 1, 'b': 3, 'c': 2, 'd': 1}), (333, {'a': 1, 'b': 0, 'c': 0, 'd': 1})]
>>> data = [[num, i + 1, k, v] for num, counts in matrix for i, (k, v) in enumerate([(k, v) for k, v in counts.items() if v != 0])]
>>> data
[[222, 1, 'a', 1], [222, 2, 'b', 3], [222, 3, 'c', 2], [222, 4, 'd', 1], [333, 1, 'a', 1], [333, 2, 'd', 1]]
>>> pd.DataFrame(data, columns=['ordernum', 'ordernum_index', 'key', 'value'])
ordernum ordernum_index key value
0 222 1 a 1
1 222 2 b 3
2 222 3 c 2
3 222 4 d 1
4 333 1 a 1
5 333 2 d 1
Expand the dictionary by using apply with pd.Series and use concat to concatenate that to your other column (ordernum). See below for your in-between result of df2.
Now, to turn every column into a row, use melt, then use query to drop all the 0-rows, and finally assign the cumcount (after sorting) to get the index, adding 1 to start counting from 1 rather than 0.
df2 = pd.concat([df[['ordernum']], df['dict_of item_counts'].apply(pd.Series)], axis=1)
(df2.melt(id_vars='ordernum', var_name='key')
.query('value != 0')
.sort_values(['ordernum', 'key'])
.assign(ordernum_index = lambda df: df.groupby('ordernum').cumcount().add(1)))
# ordernum key value ordernum_index
#0 222 a 1 1
#2 222 b 3 2
#4 222 c 2 3
#6 222 d 1 4
#1 333 a 1 1
#7 333 d 1 2
For reference, the in-between result df2 looks like:
# ordernum a b c d
#0 222 1 3 2 1
#1 333 1 0 0 1
You can do this by unpacking your dictionaries while accessing them with iterrows and creating a tuple of (ordernum, key, value) for each entry.
Finally, to create your ordernum_index, we group by ordernum and do a cumcount:
data = [(r['ordernum'], k, v) for _, r in df.iterrows() for k, v in r['dict_of item_counts'].items() ]
new = pd.DataFrame(data, columns=['ordernum', 'key', 'value']).sort_values('ordernum').reset_index(drop=True)
new['ordernum_index'] = new[new['value'].ne(0)].groupby('ordernum').cumcount().add(1)
new.dropna(inplace=True)
ordernum key value ordernum_index
0 222 a 1 1.0
1 222 b 3 2.0
2 222 c 2 3.0
3 222 d 1 4.0
4 333 a 1 1.0
7 333 d 1 2.0
Construct dataframe df1 using df['dict_of item_counts'].tolist() for the values and df.ordernum for the index. Replace 0 with np.nan and stack with dropna=True to ignore the 0 values, then reset_index to get all the columns back.
Next, create column ordernum_index by using groupby and cumcount.
Finally, change column names to appropriate names.
df1 = pd.DataFrame(df['dict_of item_counts'].tolist(), index=df.ordernum).replace(0, np.nan).stack(dropna=True).reset_index(name='value')
df1['ordernum_index'] = df1.groupby('ordernum')['value'].cumcount() + 1
df1 = df1.rename(columns={'level_1': 'key'})
Out[732]:
ordernum key value ordernum_index
0 222 a 1.0 1
1 222 b 3.0 2
2 222 c 2.0 3
3 222 d 1.0 4
4 333 a 1.0 1
5 333 d 1.0 2

pd.Categorical.from_codes with missing values

Assume I have:
df = pd.DataFrame({'gender': np.random.choice([1, 2], 10), 'height': np.random.randint(150, 210, 10)})
I'd like to make the gender column categorical. If I try:
df['gender'] = pd.Categorical.from_codes(df['gender'], ['female', 'male'])
it'll fail.
I can pad the categories
df['gender'] = pd.Categorical.from_codes(df['gender'], ['N/A', 'female', 'male'])
But then 'N/A' is returned in some methods:
In [67]: df['gender'].value_counts()
Out[67]:
female 5
male 5
N/A 0
Name: gender, dtype: int64
I thought about using None as the padding value. It works as intended in value_counts; however, I get a warning:
opt/anaconda3/bin/ipython:1: FutureWarning:
Setting NaNs in `categories` is deprecated and will be removed in a future version of pandas.
#!/opt/anaconda3/bin/python
Any better way to do this? Also, is there a way to give the mapping from code to category explicitly?
You can use the rename_categories() method:
Demo:
In [33]: df
Out[33]:
gender height
0 1 203
1 2 169
2 2 181
3 1 172
4 2 174
5 1 166
6 2 187
7 2 200
8 1 208
9 1 201
In [34]: df['gender'] = df['gender'].astype('category').cat.rename_categories(['female', 'male'])
In [35]: df
Out[35]:
gender height
0 female 203
1 male 169
2 male 181
3 female 172
4 male 174
5 female 166
6 male 187
7 male 200
8 female 208
9 female 201
In [36]: df.dtypes
Out[36]:
gender category
height int32
dtype: object
Assign the new categories directly to its .categories attribute and the codes will then be renamed to these values:
df['gender'] = df['gender'].astype('category')
df['gender'].cat.categories = ['female', 'male']
df['gender'].value_counts()
Out[23]:
female 7
male 3
Name: gender, dtype: int64
df.dtypes
Out[24]:
gender category
height int32
dtype: object
If you want a mapper dict of each code and its respective category, then:
old = df['gender'].cat.categories
new = ['female', 'male']
dict(zip(old, new))
Out[28]:
{1: 'female', 2: 'male'}
The error you get from pd.Categorical.from_codes(df['gender'], ['female', 'male']) should alert you that your codes need to be 0-indexed.
So you can simply make it so in your DataFrame declaration:
df = pd.DataFrame({'gender': np.random.choice([0, 1], 10), 'height': np.random.randint(150, 210, 10)})
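As for the last part of the question, giving the mapping from code to category explicitly, here are two short sketches that avoid the padding trick, assuming the 1 = female, 2 = male convention from the question:
import numpy as np
import pandas as pd
df = pd.DataFrame({'gender': np.random.choice([1, 2], 10), 'height': np.random.randint(150, 210, 10)})
# Option 1: spell out the mapping with a dict, then convert to a categorical dtype
df['gender'] = df['gender'].map({1: 'female', 2: 'male'}).astype('category')
# Option 2: shift the codes so they are 0-indexed and use from_codes directly
# df['gender'] = pd.Categorical.from_codes(df['gender'] - 1, ['female', 'male'])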