Check how many 1 are in the column of a matrix #minizinc - authentication

Given a matrix Z[n,m]:
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
1 0 0 0 0
0 0 1 0 0
I'd like to check how many "1" there are in the different columns of the matrix. So given k=1 in this case, the problem should be unsatisfiable since in the column there are 2 "1", so "number of 1">k. I tried in this way but it doesn't work:
constraint forall(i in n, j in m) forall(k in n) k<=( Z[i,j]\/Z[k,j])
Where am I wrong?
In the case I have this variables how I can do?
int b;
int: k;
set of int: PEOPLE = 1..p;
set of int: STOPS = 1..s;
array [1..b, PEOPLE, STOPS] of var bool: Z;
Z[1]
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
1 0 0 0 0
0 0 1 0 0
Z[2]
0 1 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
p = 5;
s =5;
k=1;
b=2;
So in this case the result should be:
Z[1]: 1 0 1 0 0 , the number of "1" is 2, "2 > K"
Z[2]: 0 1 0 0 0, the number of "1" is 1, "1<=K"
UNSATISFIABLE

I just solved in this way:
array [1..b, STOPS] of var bool: M;
constraint forall (m in 1..b) ( forall (j in STOPS) ( M[m,j]= exists([Z[m,i,j] | i in PEOPLE ])));
constraint forall (m in 1..b) ( let {
var int: s = sum (j in STOPS)(M[m,j]>0);
} in
s <= t );
thank you all for the answers :)

Related

Get_dummies produces more columns than its supposed to

I'm using get_dummies on a column of data that has zeroes or 'D' or "E". Instead of producing 2 columns it produces 5 - C, D, E, N, O. I'm not sure what they are and how to make it do just 2 as its supposed to.
When I just pull that column shows 0's and D and E, but when I put it in get_dummies adds extra columns
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C , N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
0 C D E N O PreferredContactTime
0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 1 0 0 0 0 0 0

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code seems to be super slow since I have about 90.000 rows in that dataframe! Does anyone have a suggestion how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
"02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
"03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
"04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
"05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
for sensor in range(8):
floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
obj = row["DEVICE"]
sensor_data = row["CAPACITANCE"].split(',')
for idx, val in enumerate(sensor_data):
col = obj + "-" + str(idx + 1)
floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
Like commented, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
dev_name = x.DEVICE.iloc[0]
ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
return ret_df
new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)

How can I change my index vector into sparse feature vector that can be used in sklearn?

I am doing a News recommendation system and I need to build a table for users and news they read. my raw data just like this :
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
the first column is userID, the second column is the newsID.newsID is a index column, for example, after transformation, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news (in sparse vector, the 12th column, 456th column and 157th column are 1, while other columns have value 0). And I want to change these data into a sparse vector format that can be used as input vector in Kmeans or DBscan algorithm of sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
if user not in user_row_map:
user_row_map[user] = user_row_index
user_row_index+=1
for movie in movies:
I.append(user_row_map[user])
J.append(movie)
data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
Use MultiLabelBinarizer from sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
how to I get set 1 to rowwise max and 0 to other elements?
I came up with:
>>> for i in range(len(df)):
... df.loc[i][df.loc[i].idxmax(axis=1)] = 1
... df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone has a better way of doing it? May be by getting rid of the for loop or applying lambda?
Use max and check for equality using eq and cast the boolean df to int using astype, this will convert True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# #Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# #Nader Hisham's method
%%timeit
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
​
df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than #Raihan's method
In [4]:
%%timeit
for i in range(len(df)):
df.loc[i][df.loc[i].idxmax(axis=1)] = 1
df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower
import numpy as np
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)

set column value based on distinct values in another column

I am trying to do something very similar to this question: mysql - UPDATEing row based on other rows
I have a table, called modset, of the following form:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 0
a 2 0 0 0 0 0 0 0
a 3 0 0 0 0 0 0 0
b 1 0 0 0 0 0 0 0
b 2 0 0 0 0 0 0 0
c 1 0 0 0 0 0 0 0
c 3 0 0 0 0 0 0 0
d 2 0 0 0 0 0 0 0
Columns 3:9 are binary flags to indicate which combination of years the member has records in. So I wish the result of an SQL update to look as follows:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 1
a 2 0 0 0 0 0 0 1
a 3 0 0 0 0 0 0 1
b 1 0 0 0 1 0 0 0
b 2 0 0 0 1 0 0 0
c 1 0 0 0 0 0 1 0
c 3 0 0 0 0 0 1 0
d 2 0 1 0 0 0 0 0
The code in the question linked above does something very close but only when it is a count of the distinct years in which the member has records. I need to base the columns on the specific values of the years in which the member has records.
Thanks in advance!
SOLUTION
SELECT member,
case when min(distinct(year)) = 1 and max(distinct(year)) = 1 then 1 else 0 end y1,
case when min(distinct(year)) = 1 and max(distinct(year)) = 2 then 1 else 0 end y1y2,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 2 then 1 else 0 end y1y3,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 3 then 1 else 0 end y1y2y3,
case when min(distinct(year)) = 2 and max(distinct(year)) = 2 then 1 else 0 end y2,
case when min(distinct(year)) = 2 and max(distinct(year)) = 3 then 1 else 0 end y2y3,
case when min(distinct(year)) = 3 then 1 else 0 end y3
INTO temp5
FROM modset
GROUP BY member;
UPDATE modset M
SET y1 = T.y1, y2 = T.y1, y3 = T.y3, y1y2 = T.y1y2, y1y3 = T.y1y3, y2y3 = T.y2y3, y1y2y3 = T.y1y2y3
FROM temp5 T
WHERE T.member = M.member;
What is the query you are using to return the indicators of the years the member has records in?
It sounds like you would want take your query results and use it in your update:
http://dev.mysql.com/doc/refman/5.0/en/update.html
It may look something like this:
UPDATE targetTable t, sourceTable s
SET t.y1 = s.y1, t.y2 = s.y2 -- (and so on...)
WHERE t.member = s.member AND t.year = m.year;