Sorting a Pandas dataframe by multiple conditions

I have a large dataframe (thousands of rows by hundreds of columns); a short excerpt follows:
import pandas as pd

data = {'Step': ['', '', '', 'First', 'First', 'Second', 'Third', 'Second', 'First', 'Second', 'First', 'First', 'Second', 'Second'],
        'Stuff': ['tot', 'white', 'random', 7583, 3563, 824, 521, 7658, 2045, 33, 9823, 5, 8090, 51],
        'Mark': ['marking', '', '', 1, 5, 5, 5, 1, 27, 27, 1, 6, 1, 9],
        'A': ['item_a', 100, 'st1', 142, 2, 2, 2, 100, 150, 105, 118, 118, 162, 156],
        'B': ['skill', 66, 'abc', 160, 2, 130, 140, 169, 1, 2, 130, 140, 144, 127],
        'C': ['item', 50, 'st1', 2000, 2, 65, 2001, 1999, 1, 2, 2000, 4, 2205, 2222],
        'D': ['item_c', 100, 'st1', 433, 430, 150, 170, 130, 1, 2, 300, 4, 291, 606],
        'E': ['test', 90, 'st1', 111, 130, 5, 10, 160, 1, 2, 232, 4, 144, 113],
        'F': ['done', 80, 'abc', 765, 755, 5, 10, 160, 1, 2, 733, 4, 666, 500],
        'G': ['nd', 90, 'mag', 500, 420, 5, 10, 160, 1, 2, 300, 4, 469, 500],
        'H': ['prt', 100, 'st1', 999, 200, 5, 10, 160, 1, 2, 477, 4, 620, 7],
        'Name': ['NS', '', '', 'Pat', 'Lucy', 'Lucy', 'Lucy', 'Nick', 'Kirk', 'Kirk', 'Joe', 'Nico', 'Nico', 'Bryan'],
        'Value': [-1, 0, 0, 0, 3, 6, 5, 0, 7, 7, 0, 6, 0, 1]}
df = pd.DataFrame(data)
I need to sort this dataframe so that the following conditions are all satisfied:
- In the "Name" column, identical names must remain grouped (e.g. there are 3 records of "Lucy" next to each other, and they cannot be moved apart).
- Within each group of names, the order of appearance must remain the one given by the "Step" column (e.g. the first appearance of "Lucy" corresponds to the value "First" in the "Step" column, the second to "Second", and so on).
- All remaining names whose "Value" is 0 must be moved below the others (e.g. "Pat" can be moved after the others, but not "Nico", because there are two records of "Nico" and the other one has a value of 6).
- The first three rows cannot be moved.
What I have done is to concatenate different sub-dataframes:
# Names that occur more than once must stay grouped
df_groupnames = df[df.duplicated(subset=['Name'], keep=False)]
df_nogroup = df[~df.duplicated(subset=['Name'], keep=False)]
# Among the singletons, Value > 0 goes above Value == 0
df_nogroup_high = df_nogroup[df_nogroup["Value"] > 0]
df_nogroup_null = df_nogroup[df_nogroup["Value"] == 0]
# Let's concatenate these dataframes to get the sorted one
df_sorted = pd.concat([df_groupnames, df_nogroup_high, df_nogroup_null])
It works, but I wonder if there is a smarter, simpler, and perhaps faster way to obtain the same result.
Thank you for your attention.
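One possible alternative, sketched under the assumption that the three fixed header rows are sliced off and re-attached: give each remaining row a rank (grouped names first, then singletons with Value > 0, then singletons with Value == 0) and let a single stable sort preserve the original Step order within each group. The _rank helper column is illustrative, not part of the original data:

import numpy as np
import pandas as pd

header = df.iloc[:3]                 # the three rows that cannot be moved
body = df.iloc[3:].copy()
in_group = body.duplicated(subset=['Name'], keep=False)
# 0 = grouped names, 1 = singletons with Value > 0, 2 = singletons with Value == 0
body['_rank'] = np.where(in_group, 0, np.where(body['Value'] > 0, 1, 2))
# kind='stable' keeps the original row (and hence Step) order within each rank
df_sorted = pd.concat([header,
                       body.sort_values('_rank', kind='stable').drop(columns='_rank')])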

Related

Feeding Word Embedding Matrix into a Pytorch LSTM Model

I have an LSTM model that I am using to predict the unemployment rate from Federal Reserve filings. It uses GloVe vectors and a vocab2index embedding, and the training went as planned. However, upon attempting to feed a word embedding into the model for prediction testing, it keeps throwing various errors.
Here is the relevant code:
def load_glove_vectors(glove_file=glove_embedding_vectors_text_file):
    """Load the GloVe word vectors."""
    word_vectors = {}
    with open(glove_file) as f:
        for line in f:
            split = line.split()
            word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
    return word_vectors

def get_emb_matrix(pretrained, word_counts, emb_size=300):
    """Create an embedding matrix from word vectors."""
    vocab_size = len(word_counts) + 2
    vocab_to_idx = {}
    vocab = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="float32")
    W[0] = np.zeros(emb_size, dtype="float32")       # vector for padding
    W[1] = np.random.uniform(-0.25, 0.25, emb_size)  # vector for unknown words
    vocab_to_idx["UNK"] = 1
    i = 2
    for word in word_counts:
        if word in pretrained:                       # was the global word_vecs; use the parameter
            W[i] = pretrained[word]
        else:
            W[i] = np.random.uniform(-0.25, 0.25, emb_size)
        vocab_to_idx[word] = i
        vocab.append(word)
        i += 1
    return W, np.array(vocab), vocab_to_idx

word_vecs = load_glove_vectors()
pretrained_weights, vocab, vocab2index = get_emb_matrix(word_vecs, counts)
Unfortunately, when I feed this array
[array([ 3, 10, 6287, 6, 113, 271, 3, 6639, 104, 5105, 7525,
104, 7526, 9, 23, 9, 10, 11, 24, 7527, 7528, 104,
11, 24, 7529, 7530, 104, 11, 24, 7531, 7530, 104, 11,
24, 7532, 7530, 104, 11, 24, 7533, 7534, 24, 7535, 7536,
104, 7537, 104, 7538, 7539, 7540, 6643, 7541, 7354, 7542, 7543,
7544, 9, 23, 9, 10, 11, 24, 25, 8, 10, 11,
24, 3, 10, 663, 168, 9, 10, 290, 291, 3, 4909,
198, 10, 1478, 169, 15, 4621, 3, 3244, 3, 59, 1967,
113, 59, 520, 198, 25, 5105, 7545, 7546, 7547, 7546, 7548,
7549, 7550, 1874, 10, 7551, 9, 10, 11, 24, 7552, 6287,
7553, 7554, 7555, 24, 7556, 24, 7557, 7558, 7559, 6, 7560,
323, 169, 10, 7561, 1432, 6, 3134, 3, 7562, 6, 7563,
1862, 7144, 741, 3, 3961, 7564, 7565, 520, 7566, 4833, 7567,
7568, 4901, 7569, 7570, 4901, 7571, 1874, 7572, 12, 13, 7573,
10, 7574, 7575, 59, 7576, 59, 638, 1620, 7577, 271, 6488,
59, 7578, 7579, 7580, 7581, 271, 7582, 7583, 24, 669, 5932,
7584, 9, 113, 271, 3764, 3, 5930, 3, 59, 4901, 7585,
793, 7586, 7587, 6, 1482, 520, 7588, 520, 7589, 3246, 7590,
13, 7591])
into torch.LongTensor() I keep getting the following error:
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Any ideas on how to remedy this? I am fairly new to AI in general, and I am an economist by trade, so I am almost certain I have made a boneheaded error.
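A hedged note on the likely cause: the printed value is a list containing arrays, and NumPy turns ragged or nested sequences into an object-dtype array, which torch cannot convert. A minimal sketch of the failure mode and one workaround (the toy sequences and the pad_sequence step are illustrative assumptions, not the asker's actual data):

import numpy as np
import torch

# Two token-id sequences of different lengths: np.array(...) on this list
# yields dtype=object, which torch.LongTensor() rejects with the same TypeError.
seqs = [np.array([3, 10, 6287]), np.array([5, 7])]

# Converting each sequence on its own, then padding to a rectangular batch,
# avoids the object dtype entirely.
tensors = [torch.as_tensor(s, dtype=torch.long) for s in seqs]
batch = torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True)
print(batch.dtype, batch.shape)  # torch.int64 torch.Size([2, 3])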

How to merge classes in multiclass image segmentation

I am performing image segmentation with a U-Net model.
My mask has classes from 0 to 50.
I also have a text-file dictionary with codes representing each class, for example:
{1: '1234', 2: '5678', 3: '1245'} etc.
How do I combine classes whose values share the same first two characters? In the example above, keys 1 and 3 belong together because both values start with "12".
How can I do this for all classes?
firstTwoCharDict = {}
for key, value in dictionary.items():
    if key == 0:
        firstTwoCharDict[key] = value       # class 0 keeps its full code
    else:
        firstTwoCharDict[key] = value[:2]   # keep only the first two characters

newDict = {}
for key, value in firstTwoCharDict.items():
    if value not in newDict:
        newDict[value] = [key]
    else:
        newDict[value].append(key)
This produces:
{'62': [1, 39],
 '90': [2, 5, 9, 20, 32, 42, 47, 72, 88, 91, 95],
 '97': [3, 49, 55],
 '98': [4, 24, 34, 40, 53, 76, 81, 90, 96],
 '31': [6, 17, 30, 48, 83],
 '69': [7, 13, 15, 16, 27, 44, 51, 54, 56, 75],
 '79': [8, 50],
 '71': [10, 19, 22, 35, 61, 63, 65],
 '99': [11, 12, 21, 46, 52, 69, 78, 84, 89],
 '48': [14, 36, 74],
 '60': [18],
 '64': [23, 38, 66, 97]}
Now I have a 2D array of integers; how do I replace each value with its new key when the value appears in one of the dict's lists?
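One possible sketch, assuming newDict as printed above and assigning the merged class ids by enumeration order (the lookup-table approach is an assumption, not the asker's code):

import numpy as np

# Record old id -> new merged id for every class in every group.
old_to_new = {}
for new_id, old_ids in enumerate(newDict.values()):
    for old_id in old_ids:
        old_to_new[old_id] = new_id

# Build a lookup table so the whole mask is remapped in one indexing step.
# Note: old ids missing from newDict silently fall back to 0 here.
lut = np.zeros(max(old_to_new) + 1, dtype=np.int64)
for old_id, new_id in old_to_new.items():
    lut[old_id] = new_id

mask = np.array([[1, 39], [2, 18]])   # toy 2-D mask of old class ids
merged = lut[mask]                    # e.g. 1 and 39 map to the same new id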

Outliers in data

I have a dataset like so:
15643, 14087, 12020, 8402, 7875, 3250, 2688, 2654, 2501, 2482, 1246, 1214, 1171, 1165, 1048, 897, 849, 579, 382, 285, 222, 168, 115, 92, 71, 57, 56, 51, 47, 43, 40, 31, 29, 29, 29, 29, 28, 22, 20, 19, 18, 18, 17, 15, 14, 14, 12, 12, 11, 11, 10, 9, 9, 8, 8, 8, 8, 7, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Based on domain knowledge, I know that larger values are the only ones we want to include in our analysis. How do I determine where to cut off the analysis? Should the cutoff exclude 15 and lower, or 50 and lower, etc.?
You can check the distribution with the quantile function and then remove values below the lowest 1st or 2nd percentile. Here is an example:
import numpy as np
data = np.array(data)
print(np.quantile(data, (.01, .02)))
Another method is to calculate the interquartile range (IQR) and set the lowest bar for the analysis at Q1 - 1.5*IQR:
Q1, Q3 = np.quantile(data, (0.25, 0.75))
data_floor = Q1 - 1.5 * (Q3 - Q1)
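For illustration, a short continuation (assuming data and data_floor from the snippets above):

filtered = data[data >= data_floor]   # keep only values at or above the floor
print(f"floor = {data_floor}; kept {filtered.size} of {data.size} values")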

Numpy : How to assign directly a subarray from values when these values are step spaced

I have two global arrays, tab1 and tab2, with dimensions 21x21 and 17x17 respectively.
I would like to assign the block of tab1 indexed by [15:20, 0:7] from the block of tab2 indexed by [7:17:2, 0:7] (so with a step of 2 along the first array dimension). I tried this syntax:
tab1[15:20,0:7] = tab2[7:17:2,0:7]
Unfortunately, this doesn't seem to work: it appears that only the "diagonal" (I mean one-by-one) elements of 15:20 are taken into account, following the values of tab2 along [7:17:2].
Is there a way to assign a subarray of tab1 from another subarray of tab2 whose indexes are step-spaced?
If someone could see what's wrong or suggest another method, that would be nice.
UPDATE 1: indeed, from my latest tests it seems fine, but is the same true for the assignment of the block [15:20, 15:20]:
tab1[15:20,15:20] = tab2[7:17:2,7:17:2]
?
ANSWER: it seems OK for this block assignment too, sorry.
The assignment works as I expect.
In [1]: import numpy as np; arr = np.ones((20,10),int)
The two blocks have the same shape:
In [2]: arr[15:20, 0:7].shape
Out[2]: (5, 7)
In [3]: arr[7:17:2, 0:7].shape
Out[3]: (5, 7)
and assigning something more interesting looks right:
In [4]: arr2 = np.arange(200).reshape(20,10)
In [5]: arr[15:20, 0:7] = arr2[7:17:2, 0:7]
In [6]: arr
Out[6]:
array([[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
...
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 70, 71, 72, 73, 74, 75, 76, 1, 1, 1],
[ 90, 91, 92, 93, 94, 95, 96, 1, 1, 1],
[110, 111, 112, 113, 114, 115, 116, 1, 1, 1],
[130, 131, 132, 133, 134, 135, 136, 1, 1, 1],
[150, 151, 152, 153, 154, 155, 156, 1, 1, 1]])
I see a (5,7) block of values from arr2, with every other row skipped (e.g. the rows starting with 80, 100, ...).

Preventing an animation from looping

I want my animation to play only once and not loop. My understanding is that you can do that by setting "next" to false. However, my animation is still looping. Here is my sprite sheet JSON file:
{
    "images": [
        "ressources/atlas/apparition.png"
    ],
    "framerate": 12,
    "frames": [
        [1, 1, 170, 172, 0, -15, -15],
        [1, 175, 164, 165, 0, -19, -18],
        [1, 342, 156, 160, 0, -23, -21],
        [159, 342, 147, 146, 0, -27, -28],
        [167, 175, 134, 128, 0, -33, -37],
        [173, 1, 122, 96, 0, -40, -52],
        [173, 99, 96, 64, 0, -52, -68]
    ],
    "animations": {
        "apparition": { "frames": [6, 5, 4, 3, 2, 1, 0], "next": false }
    }
}
Ideas?
Well... it seems that you must use gotoAndPlay() if you want to prevent looping. I was using play().