Pandas dataframe: replace value with that of another column based on original value - pandas

I have a pandas dataframe where I want to replace the value in the Prediction column with the value in the column referred to by the prediction column.
A            B              C             D               Prediction
stipulation  interrelation  jurisdiction  interpretation  D
typically    conceivably    tentatively   desperately     C
familiar     imaginative    apparent      logical         A
plan         explain        study         discard         B
I have tried a few methods using df.apply() and map() but they haven't worked. The resulting dataframe would look like this:
A            B              C             D               Prediction
stipulation  interrelation  jurisdiction  interpretation  interpretation
typically    conceivably    tentatively   desperately     tentatively
familiar     imaginative    apparent      logical         familiar
plan         explain        study         discard         explain

# Turn each row into a dict and look up the entry whose key is that row's Prediction
df['val'] = df.apply(lambda x: x.to_dict()[x['Prediction']], axis=1)
df
A B C D Prediction val
0 stipulation interrelation jurisdiction interpretation D interpretation
1 typically conceivably tentatively desperately C tentatively
2 familiar imaginative apparent logical A familiar
3 plan explain study discard B explain

We used to have lookup ... but it has been removed. One workaround (it relies on the default 0..n-1 index, because df.index is used as row positions):
df['new'] = df.values[df.index,df.columns.get_indexer(df.Prediction)]
df
Out[318]:
A B ... Prediction new
0 stipulation interrelation ... D interpretation
1 typically conceivably ... C tentatively
2 familiar imaginative ... A familiar
3 plan explain ... B explain
[4 rows x 6 columns]
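A variant of the same positional-indexing idea that does not depend on the index labels, and that overwrites Prediction in place as the question asks. This is only a sketch: the string index `w..z` and the `cols` list are made up for illustration, and it assumes every value in Prediction is one of the column names in `cols`.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with a non-default index (an assumption
# added here to show that the index labels do not matter).
df = pd.DataFrame(
    {
        "A": ["stipulation", "typically", "familiar", "plan"],
        "B": ["interrelation", "conceivably", "imaginative", "explain"],
        "C": ["jurisdiction", "tentatively", "apparent", "study"],
        "D": ["interpretation", "desperately", "logical", "discard"],
        "Prediction": ["D", "C", "A", "B"],
    },
    index=["w", "x", "y", "z"],
)

cols = ["A", "B", "C", "D"]                      # candidate source columns
values = df[cols].to_numpy()                     # 2-D array of the candidates
col_pos = pd.Index(cols).get_indexer(df["Prediction"])  # column position per row
row_pos = np.arange(len(df))                     # row position per row

# Pick one cell per row: (row i, column named by that row's Prediction).
df["Prediction"] = values[row_pos, col_pos]
print(df)
```

If a Prediction value is not in `cols`, `get_indexer` returns -1 for it, which would silently pick the last column, so it is worth checking `(col_pos >= 0).all()` on real data.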


Solving an underdetermined scipy.sparse matrix using svd

Problem
I have a set of equations in which variables are denoted by lowercase letters and constants by uppercase letters, like so:
A = a + b
B = c + d
C = a + b + c + d + e
I'm provided the information as to the structure of these equations in a pandas DataFrame with two columns: Constants and Variables
E.g.
df = pd.DataFrame([['A','a'],['A','b'],['B','c'],['B','d'],['C','a'],['C','b'],
['C','c'],['C','d'],['C','e']],columns=['Constants','Variables'])
I then convert this to a sparse CSC matrix by using NetworkX
table = nx.bipartite.biadjacency_matrix(nx.from_pandas_dataframe(df,'Constants','Variables')
,df.Constants.unique(),df.Variables.unique(),format='csc')
When converted to a dense matrix, table looks like the following
matrix([[1, 1, 0, 0, 0],[0, 0, 1, 1, 0],[1, 1, 1, 1, 1]], dtype=int64)
What I want from here is to find which variables are solvable (in this example, only e is solvable) and for each solvable variable, what constants is its value dependent on (in this case, since e = C-B-A, it is dependent on A, B, and C)
Attempts at Solution
I first tried to use rref to find the solvable variables. I used the symbolic library sympy and the function sympy.Matrix.rref, which gave me exactly what I wanted: any solvable variable has its own row containing a single 1 in its column and zeros in every other variable column, which I could check for row by row.
However, this solution was not workable. Primarily, it was exceedingly slow and didn't make use of the fact that my datasets are likely to be very sparse. Moreover, rref doesn't do too well with floating point numbers. So I decided to move on to another approach, motivated by "Removing unsolvable equations from an underdetermined system", which suggested using svd.
Conveniently, there is an svd function in the scipy.sparse library, namely scipy.sparse.linalg.svds. However, given my lack of linear algebra background, I don't understand the results returned by running this function on my table, or how to use those results to get what I want.
Further Details in the Problem
The coefficient of every variable in my problem is 1. This is how the data can be expressed in the two column pandas DataFrame shown earlier
The vast majority of variables in my actual examples will not be solvable. The goal is to find the few that are solvable
I'm more than willing to try an alternate approach if it fits the constraints of this problem.
This is my first time posting a question, so I apologize if this doesn't exactly follow guidelines. Please leave constructive criticism but be gentle!
The system you are solving has the form
[ 1 1 0 0 0 ]   [a]   [A]
[ 0 0 1 1 0 ] . [b] = [B]
[ 1 1 1 1 1 ]   [c]   [C]
                [d]
                [e]
i.e., three equations for five variables a, b, c, d, e. As the answer linked in your question mentions, one can tackle such an underdetermined system with the pseudoinverse, which NumPy provides directly as the pinv function.
Since M has linearly independent rows, the pseudoinverse in this case satisfies M.pinv(M) = I, where I denotes the identity matrix (3x3 in this case). Thus formally, we can write the solution as:
v = pinv(M) . b
where v is the 5-component solution vector, and b denotes the right-hand side 3-component vector [A, B, C]. However, this solution is not unique, since one can add a vector from the so-called kernel or null space of the matrix M (i.e., a vector w for which M.w=0) and it will be still a solution:
M.(v + w) = M.v + M.w = b + 0 = b
Therefore, the only variables for which there is a unique solution are those for which the corresponding component of all possible vectors from the null space of M is zero. In other words, if you assemble the basis of the null space into a matrix (one basis vector per column), then the "solvable variables" will correspond to zero rows of this matrix (the corresponding component of any linear combination of the columns will be then also zero).
Let's apply this to your particular example:
import numpy as np
from numpy.linalg import pinv
M = [
[1, 1, 0, 0, 0],
[0, 0, 1, 1, 0],
[1, 1, 1, 1, 1]
]
print(pinv(M))
[[ 5.00000000e-01 -2.01966890e-16 1.54302378e-16]
[ 5.00000000e-01 1.48779676e-16 -2.10806254e-16]
[-8.76351626e-17 5.00000000e-01 8.66819360e-17]
[-2.60659800e-17 5.00000000e-01 3.43000417e-17]
[-1.00000000e+00 -1.00000000e+00 1.00000000e+00]]
From this pseudoinverse, we see that the variable e (last row) is indeed expressible as - A - B + C. However, it also "predicts" that a=A/2 and b=A/2. To eliminate these non-unique solutions (equally valid would be also a=A and b=0 for example), let's calculate the null space borrowing the function from SciPy Cookbook:
print(nullspace(M))
[[ 5.00000000e-01 -5.00000000e-01]
[-5.00000000e-01 5.00000000e-01]
[-5.00000000e-01 -5.00000000e-01]
[ 5.00000000e-01 5.00000000e-01]
[-1.77302319e-16 2.22044605e-16]]
This function already returns the basis of the null space assembled into a matrix (one vector per column), and we see that, within a reasonable precision, the only zero row is the last one, corresponding to the variable e.
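The cookbook function itself is not reproduced in this answer. A minimal sketch of the same idea, built on numpy.linalg.svd, could look like the following. The helper names, the relative tolerance `rtol` and the absolute tolerance `atol` are assumptions made for illustration, not the cookbook's code.

```python
import numpy as np

def nullspace(M, rtol=1e-10):
    """Orthonormal basis of the null space of M, one basis vector per column."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    # Full SVD: the rows of vh beyond the numerical rank span the null space.
    _, s, vh = np.linalg.svd(M)
    tol = rtol * s.max() if s.size else 0.0
    rank = int((s > tol).sum())
    return vh[rank:].T

def solvable_variables(M, names, atol=1e-10):
    """Names of variables whose row in the null-space basis is (numerically) zero."""
    ns = nullspace(M)
    is_zero_row = np.all(np.abs(ns) < atol, axis=1)
    return [name for name, ok in zip(names, is_zero_row) if ok]

M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
solvable = solvable_variables(M, ["a", "b", "c", "d", "e"])
print(solvable)
```

For this matrix only `e` is reported. The basis vectors may differ in sign or rotation from the printed output above, since any orthonormal basis of the null space is equally valid; the zero rows are the same either way.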
EDIT:
For the set of equations
A = a + b, B = b + c, C = a + c
the corresponding matrix M is
[ 1 1 0 ]
[ 0 1 1 ]
[ 1 0 1 ]
Here we see that the matrix is in fact square, and invertible (the determinant is 2). Thus the pseudoinverse coincides with "normal" inverse:
[[ 0.5 -0.5 0.5]
[ 0.5 0.5 -0.5]
[-0.5 0.5 0.5]]
which corresponds to the solution a = (A - B + C)/2, .... Since M is invertible, its kernel / null space is trivial (it contains only the zero vector), which is why the cookbook function returns only []. To see this, suppose some non-zero vector x satisfied M.x = 0. Since M^{-1} exists, x = M^{-1} . 0 = 0, which is a contradiction. Formally, this means that the solution found is unique (or that all variables are "solvable").
To build on ewcz's answer, both the nullspace and pseudo-inverse can be calculated using numpy.linalg.svd. See the links below:
pseudo-inverse
nullspace
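As a rough illustration of the pseudo-inverse half of that remark, the sketch below rebuilds pinv from numpy.linalg.svd and compares it with numpy.linalg.pinv. The function name `pinv_via_svd` and the cutoff `rtol` are assumptions chosen for this example.

```python
import numpy as np

def pinv_via_svd(M, rtol=1e-10):
    """Moore-Penrose pseudo-inverse computed from the reduced SVD of M."""
    M = np.asarray(M, dtype=float)
    u, s, vh = np.linalg.svd(M, full_matrices=False)
    # Invert only the singular values that are clearly non-zero.
    keep = s > rtol * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    # M = u @ diag(s) @ vh  =>  pinv(M) = vh.T @ diag(1/s) @ u.T
    return vh.T @ (s_inv[:, None] * u.T)

M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
P = pinv_via_svd(M)
print(np.round(P, 6))
```

Row i of `P` holds the coefficients of A, B, C in the minimum-norm solution for variable i, so for a variable already known to be solvable it lists the constants that variable depends on (the last row is -1, -1, 1, i.e. e = -A - B + C).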

Arrange numbers in order

I have some variables, let's say a, b, c, d. They all belong to a fixed interval [0, e].
Now I have some relations between them, like
a > b
a > c
b > d
Something like this; I want to write a function which prints all the possible orderings consistent with these relations.
Example:
a b c d
a c b d
a b d c
a c b d
In essence, what you have is a directed acyclic graph.
A relatively simple approach is to store, for each variable, a set of the variables that must precede them. (In your example, this storage would map b to {a}, c to {a}, and d to {b}.) You can then write a recursive function that generates all valid tails consisting of a subset of these variables (in your case, for example, the subset {c,d} produces two valid tails: [c,d] and [d,c]). This recursive function examines each variable in the subset and determines whether its prerequisites are already met. (For example, since b maps to {a}, any subset including both a and b cannot produce a tail that begins with b.) If so, then it can recursively call itself on the subset excluding that variable.
There are some optimizations you can then perform, if desired. For example, you can use dynamic programming to avoid repeatedly re-computing the set of valid tails for the same subset.
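The recursive approach described above can be sketched as follows. The function name `all_orderings` and the `must_precede` dictionary layout are choices made for this illustration, and `lru_cache` stands in for the dynamic-programming optimization.

```python
from functools import lru_cache

def all_orderings(variables, must_precede):
    """List every ordering in which each variable comes after all of its prerequisites.

    must_precede maps a variable to the set of variables that must appear before it.
    """
    everything = frozenset(variables)

    @lru_cache(maxsize=None)          # memoize tails per remaining subset
    def tails(remaining):
        if not remaining:
            return ((),)
        placed = everything - remaining
        result = []
        for v in sorted(remaining):   # sorted only to make the output deterministic
            # v may start the tail only if all its prerequisites are already placed.
            if must_precede.get(v, frozenset()) <= placed:
                result.extend((v,) + rest for rest in tails(remaining - {v}))
        return tuple(result)

    return [list(t) for t in tails(everything)]

# Relations from the question: a > b, a > c, b > d.
orderings = all_orderings("abcd", {"b": {"a"}, "c": {"a"}, "d": {"b"}})
for order in orderings:
    print(" ".join(order))
```

For the relations in the question this yields three distinct orderings: a b c d, a b d c and a c b d.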

What is impossible?

Hi, I recently took an aptitude test and there was a problem that I really can't understand. Please give me some idea of how to solve it (and sorry for my poor English).
(Question)-> Three candidates, Amar, Birendra and Chanchal stand for the local election. Opinion polls are
conducted and show that fraction a of the voters prefer Amar to Birendra, fraction b prefer Birendra to
Chanchal and fraction c prefer Chanchal to Amar. Which of the following is impossible?
(a) (a, b, c) = (0.51, 0.51, 0.51);
(b) (a, b, c) = (0.61, 0.71, 0.67);
(c) (a, b, c) = (0.68, 0.68, 0.68);
(d) (a, b, c) = (0.49, 0.49, 0.49);
(e) None of the above.
If you list the possible preference orderings a voter can have, there are six:
ABC (means you prefer A to B, prefer B to C and therefore also prefer A to C)
ACB
BAC
BCA
CAB
CBA
Writing each ordering's name for the fraction of voters who hold it, the polled fractions are:
a = ABC + ACB + CAB
b = ABC + BAC + BCA
c = BCA + CAB + CBA
Therefore a + b + c = 2(ABC + BCA + CAB) + ACB + BAC + CBA.
As you can see, each ordering is counted at most twice in this sum, and the six fractions add up to 1, so a + b + c can never be more than 2.
Of the options, (c) is the only one whose sum is more than 2 (3 x 0.68 = 2.04), and it is therefore the impossible one.
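The arithmetic can be checked quickly in code. This is only a sketch of the necessary condition derived above; the lower bound of 1 is an addition not stated in the answer (it follows from the same sum, since every ordering is counted at least once), and the small tolerance `eps` is there purely to absorb floating-point rounding.

```python
# Options from the question, keyed by their label.
options = {
    "a": (0.51, 0.51, 0.51),
    "b": (0.61, 0.71, 0.67),
    "c": (0.68, 0.68, 0.68),
    "d": (0.49, 0.49, 0.49),
}

def violates_bound(a, b, c, eps=1e-9):
    """True if a + b + c falls outside [1, 2], which no set of voters can produce."""
    total = a + b + c
    return total < 1 - eps or total > 2 + eps

impossible = [label for label, fractions in options.items() if violates_bound(*fractions)]
for label, fractions in options.items():
    print(label, round(sum(fractions), 2), "impossible" if label in impossible else "passes the bound")
```

Passing the bound is a necessary condition only; the check rules out (c) and says nothing further about the other options.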

Approaches to converting a table of possibilities into logical statements

I'm not sure how to express this problem, so my apologies if it's already been addressed.
I have business rules summarized as a table of outputs given two inputs. For each of five possible values on one axis, and each of five values on another axis, there is a single output. There are ten distinct possibilities in these 25 cells, so it's not the case that each input pair has a unique output.
I have encoded these rules in TSQL with nested CASE statements, but it's hard to debug and modify. In C# I might use an array literal. I'm wondering if there's an academic topic which relates to converting logical rules to matrices and vice versa.
As an example, one could translate this trivial matrix:
     A   B   C
--  --  --  --
X    1   1   0
Y    0   1   0
...into rules like so:
if B OR (A and X) then 1 else 0
...or, in verbose SQL:
CASE WHEN FieldABC = 'B' THEN 1
     WHEN FieldABC = 'A' AND FieldXY = 'X' THEN 1
     ELSE 0
END
I'm looking for a good approach for larger matrices, especially one I can use in SQL (MS SQL 2K8, if it matters). Any suggestions? Is there a term for this type of translation, with which I should search?
Sounds like a lookup into a 5x5 grid of data, with the inputs on the axes and the output in each cell:
     Y=1  Y=2  Y=3  Y=4  Y=5
x=1   A    A    D    B    A
x=2   B    A    A    B    B
x=3   C    B    B    B    B
x=4   C    C    C    D    D
x=5   C    C    C    C    C
You can store this in a table of x,y,outvalue triplets and then just do a look up on that table.
SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = #X and Y = #Y;
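To make the triplet-table idea concrete, here is a small self-contained sketch using Python's built-in sqlite3 module in place of MS SQL. The column types, the primary key, and the `lookup` helper are assumptions made for illustration; the grid values are the ones shown above.

```python
import sqlite3

# The 5x5 grid from above: row index is x (1..5), character position is y (1..5).
GRID = [
    "AADBA",
    "BAABB",
    "CBBBB",
    "CCCDD",
    "CCCCC",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BUSINESS_RULES ("
    " X INTEGER, Y INTEGER, OUTVALUE TEXT, PRIMARY KEY (X, Y))"
)
# One (x, y, outvalue) triplet per cell.
conn.executemany(
    "INSERT INTO BUSINESS_RULES (X, Y, OUTVALUE) VALUES (?, ?, ?)",
    [
        (x, y, out)
        for x, row in enumerate(GRID, start=1)
        for y, out in enumerate(row, start=1)
    ],
)

def lookup(x, y):
    """Return the rule output for (x, y), or None if no rule is stored."""
    row = conn.execute(
        "SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = ? AND Y = ?", (x, y)
    ).fetchone()
    return row[0] if row else None

print(lookup(1, 3), lookup(4, 4), lookup(9, 9))
```

Changing a rule then becomes an UPDATE on one row rather than an edit to nested CASE logic, and the primary key on (X, Y) guarantees a single output per input pair.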

database index: why pairing

I have a table with multiple indexes, several of which duplicate the same columns:
Index 1 columns: X, B, C, D
Index 2 columns: Y, B, C, D
Index 3 columns: Z, B, C, D
I'm not very knowledgeable on indexing in practice, so I'm wondering if somebody can explain why X, Y and Z were paired with these same columns. B is an effective date. C is a semi-unique key ID for this table for a specific effective date B. D is a sequence that identifies the priority of this record for the identifier C.
Why not just create 6 indexes, one for each X, Y, Z, B, C, D?
I want to add an index to another column T, but in some contexts I'll only be querying on T alone while in others I will also be specifying the B, C and D columns... so should I create just one index like above or should I create one for T and one for (T, B, C, D)?
I've not had as much luck as expected when googling for comprehensive coverage of indexing. Are there any resources where I can get a thorough explanation and lots of examples of B-tree indexing?
The rule with indexing is that an index can be used to filter on any list of columns that constitute a prefix of the columns used for that index.
In other words, we can use Index 1 when we filter on X and B, or X, B and C, or just X, or all four.
However, we cannot use the index to filter "in the middle". This is because indexes work not entirely unlike concatenating the values of those columns for each row, and sorting the result. If we know what the thing we're looking for begins with, we can figure out where in the index to look - just like when doing binary search.
That's why a single index is no good: if we need to filter on B, C, D, and one of X, Y and Z, we need three indexes; X, Y is no good as an index for just filtering on Y, because the prefix of the values we're looking for - the X - is not known.
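The prefix rule can be observed with a query planner. The sketch below uses SQLite through Python's sqlite3 module purely as a convenient stand-in; the table `t`, the extra `payload` column and the index name are invented for the example, and other engines may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# payload is an extra, unindexed column so the index does not cover SELECT *.
conn.execute("CREATE TABLE t (X TEXT, B TEXT, C INTEGER, D INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_xbcd ON t (X, B, C, D)")

def plan(sql):
    """Return SQLite's query-plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column holds the detail text

# Filter on a prefix of the index columns (X, then X and B).
prefix_plan = plan("SELECT * FROM t WHERE X = 'x1' AND B = '2020-01-01'")
# Filter "in the middle" of the index (B only, X unknown).
middle_plan = plan("SELECT * FROM t WHERE B = '2020-01-01'")

print(prefix_plan)
print(middle_plan)
```

In SQLite the first plan is reported as a SEARCH using `idx_xbcd`, while the second falls back to a SCAN of the table because the leading column X is not constrained.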
As Daniel mentioned, a covering index is a possible explanation for repeating B, C, and D: even if D is never filtered on, it may be the case that we need exactly the columns which you see in your indexes, and we can then just read the columns from the index instead of just using the index to locate the row.
One reason for having B, C and D in those indexes might be to have a covering index for frequently used queries. You will have a covering index when the index itself contains all the required data fields for a particular query.
A covering index can dramatically speed up data retrieval, since only the index pages, not the data pages, will be used to retrieve the data.
Below is an example query where index 1 would be a covering index:
SELECT B, C, D FROM table WHERE X = '10'
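Whether a given query is served from the index alone can be checked in the query plan. A minimal sketch, again using SQLite via Python's sqlite3 module as a stand-in for the asker's database (the table `t`, the `payload` column and the index name are assumptions for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (X TEXT, B TEXT, C INTEGER, D INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_xbcd ON t (X, B, C, D)")

def plan(sql):
    """Return SQLite's query-plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Every selected column lives in the index, so the table rows are never read.
covered = plan("SELECT B, C, D FROM t WHERE X = '10'")
# payload is not in the index, so each match needs a lookup into the table.
not_covered = plan("SELECT B, C, D, payload FROM t WHERE X = '10'")

print(covered)
print(not_covered)
```

SQLite labels the first plan with "USING COVERING INDEX" and the second with a plain "USING INDEX", which is the distinction described above.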
You should create it in (T, B, C, D).
Let's say you have two fields with an index in a table: A and B. When you create a separate index on each one of the columns, and have a query such as:
SELECT * FROM table WHERE A = 10 AND B = 20
What happens is either:
1) The DB creates two intermediate result-sets, one with rows where A = 10, and another one with rows where B = 20. It then has to merge these two result-sets into one (and also check for duplicate rows).
2) The DB creates one result-set with rows where A = 10. It then has to go manually through all of the rows in this intermediate result-set and check in each one whether B = 20.
However when you know that index B depends on index A, and your query uses A before B, you can create one index for both of the columns: (A, B)
This means that the DB will first find all rows where A = 10, but because B is part of the same index, it can use the same index information to filter the result-set down to rows where B is also 20. It doesn't have to build two intermediate result-sets and merge them, or use only one of the indexes and do a manual scan for the other condition.
There might be other ways that the DB deals with these situations as well, it largely depends on an implementation.
The indexes in the form (X, B, C, D) can be used to optimize queries like:
... WHERE X rel sthg (possibly ORDER BY B, C, D)
... WHERE X = sthg AND B rel sthg (possibly ORDER BY C, D)
... WHERE X = sthg AND B = sthg AND C rel sthg (possibly ORDER BY D)
etc., where rel is an arbitrary relational operator (<, >, =, <=, >=) and sthg stands for a value or expression. The last two in particular, and the sorting variants, wouldn't be optimized by the "single column indexes" variant.
OTOH, it cannot optimize a query
... WHERE B = sthg
because it starts in the middle of the index; here, the single column index would work.
For a resource where you can get a thorough explanation and lots of examples regarding indexes on Oracle (and any other Oracle-related issue), you should visit and bookmark askTom.