Removing selected features from dataset - pandas

I am following this program: https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance_multicollinear.html
since I have a problem with highly correlated features in my model (different from the one shown in the example). In this step:
selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
I can get the features that I will need to remove from my classifier. They are given as integer indices ([0, 3, 5, 6, 8, 9, 10, 17]). How can I get the names of these features?

OK, I think there are two parts to this problem.
First, you need to get a list of the column names. In the example code you linked, the list of feature names appears to be stored like this:
data.feature_names
Once you have the feature names, you'd need a way to loop through them and grab only the ones you want. Something like this should work:
columns = ['a', 'b', 'c', 'd']
keep_index = [0, 3]
new_columns = [columns[i] for i in keep_index]
new_columns
['a', 'd']
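Applied to your case, a minimal sketch (assuming data is the dataset object from the linked example, so data.feature_names holds the column names, and selected_features is the index list from the clustering step):
feature_names = list(data.feature_names)        # column names from the dataset object
selected_features = [0, 3, 5, 6, 8, 9, 10, 17]  # indices from cluster_id_to_feature_ids
selected_names = [feature_names[i] for i in selected_features]
print(selected_names)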

Related

split content of a column pandas

I have the following Pandas DataFrame, which can also be generated using this list of dictionaries:
list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
    {'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
    {'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented something close to what I need, but it adds a new column per element:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
That gives me a separate column for each people_id, but only up to the second element: when I try to extract the 3rd element into a third column, it crashes, because the first row has no 3rd element to extract.
What I am actually trying to do is extract every people_id from the people_ids column so that each one gets its own row, keeping the associated values from the Project and Hours columns, to get a dataset like this one:
Any idea how I could get this output?
I think what you are looking for is explode on the 'people_ids' column:
df = df.explode('people_ids', ignore_index=True)
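Run end to end on the sample data above, that would look something like this (note that the ignore_index argument requires pandas 1.1 or later; on older versions, call .reset_index(drop=True) after the explode):
import pandas as pd

list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
    {'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
    {'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]

df = pd.DataFrame(list_of_dictionaries)
# each people_id gets its own row; Project and Hours are repeated per row
df = df.explode('people_ids', ignore_index=True)
print(df)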

Rails SQL "select in" across several columns: where (code1, code2) in (("A", 1), ("A", 3), ("Q", 9))

I have a business requirement to select records based on two fields in one table: code1 and code2. The selection is hard-coded, with no codeable rhyme or reason, and covers about a dozen pairs out of the hundred or so pairs that actually exist in the table. For example:
C, 1
C, 2
J, 9
Z, 0
Note that there are other "C" codes in the table, such as (C, 3). There is no combined field that captures them both as a single value, e.g., "C3".
SQL supports a query like this (see "Two columns in subquery in where clause"):
SELECT * FROM rejection_codes
WHERE (code1, code2) IN (('A', 1), ('A', 3), ('Q', 9))
Is there a way to do this with Rails and ActiveRecord's ORM, without resorting to raw SQL?
I'm running Rails 4.2.9 with Postgres, if it matters.
* Why Don't You... *
Add a field: I don't have control over the database schema. If I did, I'd add a new column as a flag for this group. Or a computed column that concatenates the values into a string. Or something... But I can't.
Use raw SQL: Yeah...I might do that if I can't do it through the ORM.
If you want exactly that structure then you can do things like this:
pairs = [['A', 1], ['A', 3], ['Q', 9]]
RejectionCode.where('(code1, code2) in ((?), (?), (?))', *pairs)
Of course, pairs.length presumably won't always be three, so you could say:
pairs = [['A', 1], ['A', 3], ['Q', 9]]
placeholders = (%w[(?)] * pairs.length).join(', ')
RejectionCode.where("(code1, code2) in (#{placeholders})", *pairs)
Yes, that's using string interpolation to build an SQL snippet but it is perfectly safe in this case because you're building all the strings and you know exactly what's in them. If you put this into a scope then at least the ugliness would be hidden and you could easily cover it with your test suite.
Alternatively, you could take advantage of some equivalences. An in is a fancy or, so these do roughly the same thing:
c in (x, y, z)
c = x or c = y or c = z
and records (even anonymous ones) are compared column by column, so these are equivalent:
(a, b) = (x, y)
a = x and b = y
That means that something like this:
pairs = [['A', 1], ['A', 3], ['Q', 9]]
and_pair = ->(a) { RejectionCode.where('code1 = ? and code2 = ?', *a) }
and_pair[pairs[0]].or(and_pair[pairs[1]]).or(and_pair[pairs[2]])
should give you the same result. Or more generally:
pairs = [['A', 1], ['A', 3], ['Q', 9], ... ]
and_pair = ->(a) { RejectionCode.where('code1 = ? and code2 = ?', *a) }
query = pairs[1..-1].inject(and_pair[pairs.first]) { |q, a| q.or(and_pair[a]) }
Again, you'd want to hide this ugliness in a scope.
* This is a decent workaround, but not exactly a solution to the ORM question *
Failing to find the right way to do this in ActiveRecord, I just guessed, hoping for the best:
class ApprovalCode < ActiveRecord::Base
  REJECTION_CODES = [
    ['A', '0'],
    ['R', '1'],
    ['R', '5'],
    ['R', '6'],
    ['X', 'F'],
    ['X', 'G']
  ]

  scope :rejection_allowed, -> { where([:code, :sub_code], REJECTION_CODES) } # This didn't work.
end
That did not work. So, I used raw SQL in the scope, and this did work:
scope :rejection_allowed, -> { where("(code, sub_code) in (#{rejection_list})") }

def self.rejection_list
  REJECTION_CODES
    .map { |code, sub_code| "('#{code}', '#{sub_code}')" }
    .join(', ')
end
I am still hoping to find out how to do this in the ORM, or to hear suggestions for completely different approaches to the problem. Since it's all encapsulated in a scope and a constant, it will be trivial to refactor later, and keeping the constant and the scope separate will allow for painless tests.

Get indices for values of one array in another array

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
    return numpy.argmin(numpy.absolute(ref - val))

order = sorted(range(new.size), key=lambda x: find(new[x]))
However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: This linked solution assumes that the arrays are ordered, or for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which one comes first however, is not important (neither possible based on the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].
Given ref and new, which are shuffled versions of each other, we can compute the unique indices order such that new[order] == ref, using the sorted versions of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order]  # should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:
def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv
order = np.argsort(new)[inverse_argsort(ref)]
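As a quick sanity check, reusing the example arrays from above, both versions produce the same unique indices and reconstruct ref:
import numpy as np

ref = np.array([5, 3, 1, 2, 3, 4])
new = np.array([3, 2, 4, 5, 3, 1])

order_a = np.argsort(new)[np.argsort(np.argsort(ref))]
order_b = np.argsort(new)[inverse_argsort(ref)]

assert np.array_equal(order_a, order_b)   # both give [3, 0, 5, 1, 4, 2]
assert np.array_equal(new[order_a], ref)  # reordering recovers ref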

Is there a graphlab equivalent to df.irow?

I need to pick out a few rows in my sframe by index. Is there an equivalent graphlab command to pandas df.irow()?
There is no direct equivalent in graphlab to DataFrame.iloc (previously irow). One way to achieve the same thing is to add a column of row numbers and use the filter_by method. Suppose I want to get only the 1st and 3rd rows:
import graphlab
sf = graphlab.SFrame({'x': ['a', 'b', 'a', 'c']})
sf = sf.add_row_number('row_id')
new_sf = sf.filter_by(values=[0, 2], column_name='row_id')

sqlAlchemy dynamic where clause

I have a list of dictionaries, each mapping a key to a list of values. The values of each dictionary are the conditions for an update's where clause. Since the length of each list in the dictionaries is variable, I need to be able to build the where clause dynamically.
I'd like to do something like the code below.
sqlAlcUpdateList = []
indexHash = [{1: [1, 6, 11]}, {2: [7, 12]}, {3: [3, 8, 13, 74]}]
for entry in indexHash:
    for key, values in entry.items():
        stmt = (xtable.update()
                .values(ykey=key)
                .where(xtable.c.id.in_(values)))
        sqlAlcUpdateList.append(stmt)
for sqlAlcCommand in sqlAlcUpdateList:
    conn.execute(sqlAlcCommand)
I know this could be split into multiple update commands, but I would like to create one command.
I think there is no reason to prefer a single statement. You're assigning different values to different rows, so I think they are separate actions. But if someone could correct me, I'd like to know how to do it!
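For what it's worth, if you really wanted a single statement, one known approach is an UPDATE with a CASE expression keyed on id. A minimal sketch, assuming the same xtable and conn as in the question (sqlalchemy.case accepts a dict of whens together with value=):
from sqlalchemy import case

indexHash = [{1: [1, 6, 11]}, {2: [7, 12]}, {3: [3, 8, 13, 74]}]

# invert the mapping: id -> new ykey value
whens = {id_: key
         for entry in indexHash
         for key, ids in entry.items()
         for id_ in ids}

stmt = (xtable.update()
        .values(ykey=case(whens, value=xtable.c.id))
        .where(xtable.c.id.in_(list(whens))))
conn.execute(stmt)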