Linking one feature column to another feature column - tensorflow

I'm new to TensorFlow and am trying to perform binary classification on my dataset. Essentially, I'm trying to predict whether an item is "attractive" or "not attractive".
I've simplified my training set to look something like this:
lamp; 20cm; description: lightbulb, switch; attractive
lightbulb; 3cm; description: filament; attractive
switch; 1cm; description: switch; not attractive
filament; 0.5cm; description: -; attractive
Explanation of features:
1st column is the name of the item
2nd column is the width of item
3rd column is a list of text related to the item. Note here that this list can be NULL or have >0 items. Note also that each of the items in the list will appear exactly once in the 1st column of one of the rows in the dataset.
And the 4th column shows the classification of the training data.
From what I've read online, if I'm not mistaken, the above data cannot be used just like that - it needs to be converted into a format readable by TensorFlow.
Note: I do not want to do any text classification since the prediction should be based on its attribute (width) and its relation with other items.
My attempt at making the training set usable(?) - by encoding each of the items with an item ID and then using an array to represent the relations:
1; 20; [2, 3]; 1
2; 3; [4]; 1
3; 1; [3]; 0
4; 0.5; []; 1
Test set:
5; 12; [2, 2]; ?
I'm assuming there's no need to create a separate file mapping IDs to item names since, as mentioned above, the item name itself is assumed to have no bearing on the result?
Questions:
If the above format is put into a CSV file, is that alright?
Is there any way to "link" the 3rd column to the 1st? So that TensorFlow knows that the 3rd column is actually an array of keys into the first.
Any available resources/tutorials that might help? I've already gone through the Getting Started guide with the Iris flowers example (but their features are all decimals - with no user-specified relations to other features).

Re 1: TensorFlow supports CSV input just fine.
For 2 and 3, you should look at the documentation for TensorFlow feature columns.
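For illustration, here is a minimal sketch using the TF 1.x tf.feature_column API. The column names (width, related_items) and the ID vocabulary size are assumptions, not something your data format prescribes; the idea is that the variable-length list of related item IDs becomes a multi-hot categorical feature over the same ID space as the first column:
import tensorflow as tf

# numeric feature for the width column
width = tf.feature_column.numeric_column('width')

# the variable-length list of related item IDs, interpreted as categories
# over the same ID space as column 1 (assumed at most 100 distinct items)
related = tf.feature_column.categorical_column_with_identity(
    'related_items', num_buckets=100)
related_multi_hot = tf.feature_column.indicator_column(related)

# a canned estimator for binary classification on these columns
estimator = tf.estimator.DNNClassifier(
    feature_columns=[width, related_multi_hot],
    hidden_units=[16, 8],
    n_classes=2)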

Related

How to do clustering when I have multiple categorical columns and fewer numerical columns in pandas?

Say I have one column (X) which holds the customer ID, and other columns x1, x2, x3, x4, x5, x6 which hold only these 4 distinct values ('High', 'Low', 'Medium', 'Nan') repeatedly.
Update 16/12/2021: I have done one-hot encoding and now have 19 features along with the X column. I need to know how to go ahead with the clustering part for such an unsupervised data set.
Regarding the question of what encoding to use, I found this article helpful for understanding when to use label encoding versus one-hot encoding:
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
In your case, since your data does have an ordinal ordering (High > Medium > Low > Nan), I would suggest using the label encoding technique.
Then, regarding the clustering part: you have identified three different clusters; do you want to identify which samples belong to which cluster, or what is your goal?
You could start by training a model with 3 cluster centroids, as you have identified yourself, but you could also use the elbow method to find an optimal number of clusters for your dataset (https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/), as sketched below.
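A rough sketch of the elbow method with scikit-learn; the name features is a placeholder for a DataFrame that already holds your encoded (numeric) columns, e.g. the 19 one-hot features you mention, and the range of k values is an arbitrary choice:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(features)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()  # pick k near the "elbow" where the curve stops dropping sharply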
For label encoding on a column in your dataframe:
encoding_dict = {}

def label_encode(string_value):
    # sets (and remembers) a numerical value for each distinct string value
    num_value = encoding_dict.setdefault(string_value, len(encoding_dict))
    return num_value

for col in dataframe.columns:
    if dataframe[col].dtype == object:  # object dtype indicates strings
        dataframe[col] = dataframe[col].apply(label_encode)
        encoding_dict = {}  # reset the encoding dict so codes are not reused across columns (or don't, if you always have the same values)
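One caveat: the dictionary above assigns codes in order of appearance, not by rank. If you want the codes to reflect the ordinal ranking explicitly, a hand-written mapping is a simple alternative; the column names and ranking below are illustrative assumptions:
import pandas as pd

ordinal_map = {'Nan': 0, 'Low': 1, 'Medium': 2, 'High': 3}  # assumed ranking
df = pd.DataFrame({'x1': ['High', 'Low', 'Nan', 'High'],
                   'x2': ['Medium', 'Medium', 'Low', 'High']})
df = df.replace(ordinal_map)  # each matching string value becomes its rank
print(df)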

While extracting text from an image using pytesseract, numbers are printed first and then the strings are printed

While extracting text from an image using pytesseract, the numbers are printed first and then the strings are printed. Why is this happening?
This is my input image.
import cv2
import pytesseract
from pytesseract import Output
from PIL import Image
imginput = cv2.imread('ss.png')
x,img1 = cv2.threshold(imginput, 180, 255, cv2.THRESH_BINARY)
img = Image.fromarray(img1)
d = pytesseract.image_to_string(img, output_type=Output.DICT)
print(d)
My output:
'text':
**'71.\n\n72.\n\n73.\n\n74.\n\n75.\n\n76.\n\n77.\n\n78.\n\n79.\n\n80.**n\nPick
out the synonym of the word ‘depositary’ :\n\n(A) inheritor (B) ward
(C) patron (D) trustee\nThe fifth chapter comprises three
sections.\n(A) of (B) with (C) no preposition (D) on\n\nAntonym of
‘abortive’ is :\n(A) _ successful (B) reproductive (C) instantaneous
(D) fruitful\n\nThe one word for a person who doubts in religious
practices :\n(A) _ stoic (B) sceptic (C) theist (D) pantheist\n\nThe
idiom “bury the hatchet’ means .\n(A) keep enmity (B) open enmity (C)
stop enmity (D) have no enmity\n\nVictor seldom visits his uncle, Add
proper tag question.\n(A) doesn’t he ? (B) isn’the? (C) ishe? (D) does
he ?\n\n‘Khalil Gibran is one of the greatest poets of the world.’
Pick out the comparative degree of\nthe sentence.\n\n(A) Khalil Gibran
is greater than many other poets of the world.\n(B) Khalil Gibran is
greater than any other poet of the world.\n(C) Khalil Gibran is
greater than any other poets of the world.\n(D) Khalil Gibran is the
greatest poet of the world.\n\nThe passive form of ‘I keep my books
here.’ is :\n(A) My books keep here (B) My books are keeping here\n(C)
Iam kept the books here (D) My books are kept here\n\nPick out the
correctly spelt word.\n\n(A) Constellation (B) Consistancy\n(C)
Conspirecy (D) Conservatary\nWe need two more players to the team.
Supply suitable phrasal verb.\n(A) make out (B) make up (C) make for
(D) make of\n11 052/2019 - M\n\n{P.T.0}'}
Try running with other segmentation modes:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
Add it like so:
# Example of adding any additional options.
custom_oem_psm_config = r'--psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config, output_type=Output.DICT)
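For example, applied to the thresholded image from the question (a sketch; --psm 4, "assume a single column of text of variable sizes", is just one plausible choice for this kind of sheet, so you may need to experiment with 3, 4, 6 or 11):
d = pytesseract.image_to_string(img, config=r'--psm 4', output_type=Output.DICT)
print(d['text'])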

error in LDA in r: each row of the input matrix needs to contain at least one non-zero entry

I am a beginner in text mining. When I run LDA() over a huge dataset with 996165 observations, it displays the following error:
Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, :
Each row of the input matrix needs to contain at least one non-zero entry.
I am pretty sure that there are no missing values in my corpus. The table() output for the "DocumentTermMatrix" / "simple_triplet_matrix" is:
table(is.na(dtm[[1]]))
#FALSE
#57100956
table(is.na(dtm[[2]]))
#FALSE
#57100956
I am a little confused about where "57100956" comes from. But as my dataset is pretty large, I don't know how to check why this error occurs. My LDA command is:
ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Can anyone provide some insights? Thanks.
In my opinion the problem is not the presence of missing values, but the presence of all-zero rows.
To check it:
row.sum = apply(dtm, 1, FUN=sum)  # sum each row of the document-term matrix
Then you can delete all rows that are entirely zero:
dtm = dtm[row.sum != 0, ]
Now dtm should contain only rows with at least one non-zero entry.
I had the same problem. The document-term matrix (dtm in your case) had rows with all zeroes because some documents did not contain any of the retained words (i.e. their frequencies were zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, the LDA ran successfully and classified each of the documents. Hope this helps!

PsychoPy: "Conditional" but random selection from list across two dimensions

In my experiment, I am presenting images (faces) that are different across 2 dimensions: face identity and emotion.
There are 5 faces displaying 5 different emotional expressions; making 25 unique stimuli in total. These only need to be presented once (so 25 trials).
After I present one of the faces, the next face has to be different on only the emotion OR the identity, but the same on the other.
Example:
Face 1, emotion 1 -> face 3, emotion 1 -> face 3, emotion 4 -> ... etc.
1: Is PsychoPy up to this task? I have mostly worked with the Builder so far, except for some data-logging code, but I'd be happy to get more experienced with the Coder.
My hunch is that I would need to add two columns to the list of trials, one for identity and one for emotion. Then use the getEarlierTrial call somehow, but I pretty much get lost at this point.
2: Would anyone be willing to point me in the right direction please?
Many thanks in advance.
This is difficult to implement in Builder's normal mode of operation, which is to drive trials from a fixed list of conditions. Although the order of rows can be randomised across subjects, the pairings of values across columns remain constant.
The standard answer to this is what you allude to in your comment above: in code, shuffle the conditions file at the beginning of each experiment, so each subject is in essence having their trials driven by a unique conditions file.
You seem happy to do this in Matlab. That would work fine, as this stuff can be done before PsychoPy even launches. But it could also very easily be implemented in Python code. That way you could do everything in PsychoPy, and in this case there would be no need to abandon Builder. You'd just insert a code component with some code to run at the beginning of the experiment that customises a conditions file.
You'll need to create three lists, not two, i.e. you also need a list of pseudo-random choices to alternate between preserving either face or emotion from trial to trial: if you do this fully randomly, the sequence will be unbalanced and you'll exhaust one of the attributes before the other.
from numpy.random import shuffle

# make a list of 25 dictionaries of unique face/emotion pairs:
pairsList = []
for face in ['1', '2', '3', '4', '5']:
    for emotion in ['1', '2', '3', '4', '5']:
        pairsList.append({'faceNum': face, 'emotionNum': emotion})

shuffle(pairsList)

# a list of whether to alternate between preserving face or emotion across trials:
attributes = ['faceNum', 'emotionNum'] * 12  # length 24
shuffle(attributes)

# need to create an initial selection before cycling through the
# next 24 randomised but balanced choices:
pair = pairsList.pop()
currentFace = pair['faceNum']
currentEmotion = pair['emotionNum']
images = ['face_' + currentFace + '_emotion_' + currentEmotion + '.jpg']

for attribute in attributes:
    if attribute == 'faceNum':
        selection = currentFace
    else:
        selection = currentEmotion
    # find another pair with the same selected attribute:
    for pair in pairsList:
        if pair[attribute] == selection:
            # store the combination for this trial:
            currentFace = pair['faceNum']
            currentEmotion = pair['emotionNum']
            images.append('face_' + currentFace + '_emotion_' + currentEmotion + '.jpg')
            # remove this combination so it can't be used again
            pairsList.remove(pair)
            # stop at the first matching pair so only one stimulus is added per trial
            break

images.reverse()
print(images)
Then just write the images list to a single column .csv file to use as a conditions file.
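For example (a minimal sketch; the column name imageFile is just an assumed parameter name, use whatever your image component refers to as $imageFile):
# write the randomised image list as a one-column conditions file
with open('conditions.csv', 'w') as f:
    f.write('imageFile\n')  # header row: the parameter name referenced in Builder
    for name in images:
        f.write(name + '\n')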
Remember to set the loop in Builder to be in a fixed order, not randomised, as the list itself has the randomisation built in.

OpenCV - Variable value range of trackbar

I have a set of images and want to cross-match all of them and display the results using trackbars, using OpenCV 2.4.6 (ROS Hydro package). The matching part is done using a vector of vectors of vectors of cv::DMatch-objects:
image[0] --- image[3] -------- image[8] ------ ...
   |             |                  |
   |        cv::DMatch-vect    cv::DMatch-vect
   |
image[1] --- ...
   |
image[2] --- ...
   |
  ...
   |
image[N] --- ...
Because we omit matching an image with itself (no point in doing that), and because a query image might not be matched with all the rest, each set of matched train images for a query image might have a different size from the rest. Note that the way it's implemented right now I actually match a pair of images twice, which of course is not optimal (especially since I used a BruteForce matcher with cross-check turned on, which basically means that I match a pair of images 4 times!) but for now that's it. In order to avoid on-the-fly drawing of matched pairs of images I have populated a vector of vectors of cv::Mat-objects. Each cv::Mat represents the current query image and some matched train image (I populate it using cv::drawMatches()):
image[0] --- cv::Mat[0,3] ---- cv::Mat[0,8] ---- ...
   |
image[1] --- ...
   |
image[2] --- ...
   |
  ...
   |
image[N] --- ...
Note: In the example above cv::Mat[0,3] stands for cv::Mat that stores the product of cv::drawMatches() using image[0] and image[3].
Here are the GUI settings:
Main window: here I display the current query image. Using a trackbar - let's call it TRACK_QUERY - I iterate through each image in my set.
Secondary window: here I display the matched pair (query,train), where the combination between the position of TRACK_QUERY's slider and the position of the slider of another trackbar in this window - let's call it TRACK_TRAIN - allows me to iterate through all the cv::Mat-match-images for the current query image.
The issue here comes from the fact that each query can have a variable number of matched train images. My TRACK_TRAIN should be able to adjust to the number of matched train images, that is the number of elements in each cv::Mat-vector for the current query image. Sadly, so far I was unable to find a way to do that. The cv::createTrackbar() requires a count-parameter, which from what I see sets the limit of the trackbar's slider and cannot be altered later on. Do correct me if I'm wrong since this is exactly what's bothering me. A possible solution (less elegant and involving various checks to avoid out-of-range errors) is to take the size of the largest set of matched train images and use it as the limit for my TRACK_TRAIN. I would like to avoid doing that if possible. Another possible solution involves creating a trackbar per query image with the appropriate value range and swapping each into my secondary window according to the selected query image. For now this seems to be the easier way to go but poses a big overhead of trackbars, not to mention the fact that I haven't heard of OpenCV allowing you to hide GUI controls. Here are two examples that might clarify things a little bit more:
Example 1:
In main window I select image 2 using TRACK_QUERY. For this image I have managed to match 5 other images from my set. Let's say those are image 4, 10, 17, 18 and 20. The secondary window updates automatically and shows me the match between image 2 and image 4 (first in the subset of matched train images). TRACK_TRAIN has to go from 0 to 4. Moving the slider in both directions allows me to go through image 4, 10, 17, 18 and 20 updating each time the secondary window.
Example 2:
In the main window I select image 7 using TRACK_QUERY. For this image I have managed to match 4 other images from my set. Let's say those are images 0, 1, 11 and 19. The secondary window updates automatically and shows me the match between image 7 and image 0 (first in the subset of matched train images). TRACK_TRAIN has to go from 0 to 3. Moving the slider in both directions allows me to go through images 0, 1, 11 and 19, updating the secondary window each time.
If you have any questions feel free to ask and I'll try to answer them as well as I can. Thanks in advance!
PS: Sadly, the way the ROS package is built, it has the bare minimum of what OpenCV can offer: no Qt integration, no OpenMP, no OpenGL etc.
After doing some more research I'm pretty sure that this is currently not possible. That's why I implemented the first proposition that I gave in my question - use the match-vector with the largest number of matches in it to determine a maximum size for the trackbar and then use some checking to avoid out-of-range exceptions. Below there is a more or less detailed description of how it all works. Since the matching procedure in my code involves some additional checks that do not concern the problem at hand, I'll skip it here. Note that in a given set of images we want to match, I refer to an image as an object-image when that image (example: a card) is currently matched to a scene-image (example: a set of cards) - top level of the matches-vector (see below) and equal to the index in processedImages (see below). I find the train/query notation in OpenCV somewhat confusing. This scene/object notation is taken from http://docs.opencv.org/doc/tutorials/features2d/feature_homography/feature_homography.html. You can change or swap the notation to your liking but make sure you change it everywhere accordingly, otherwise you might end up with some weird results.
// stores all the images that we want to cross-match
std::vector<cv::Mat> processedImages;
// stores keypoints for each image in processedImages
std::vector<std::vector<cv::KeyPoint> > keypoints;
// stores descriptors for each image in processedImages
std::vector<cv::Mat> descriptors;
// fill processedImages here (read images from files, convert to grayscale, undistort, resize etc.), extract keypoints, compute descriptors
// ...
// I use brute force matching since I also used ORB, which has binary descriptors, so NORM_HAMMING is the way to go
cv::BFMatcher matcher(cv::NORM_HAMMING);
// matches contains the match-vectors for each image matched to all other images in our set
// top level index matches.at(X) is equal to the image index in processedImages
// middle level index matches.at(X).at(Y) gives the match-vector for the Xth image and some other Yth from the set that is successfully matched to X
std::vector<std::vector<std::vector<cv::DMatch> > > matches;
// contains images that store visually all matched pairs
std::vector<std::vector<cv::Mat> > matchesDraw;
// fill all the vectors above with data here, don't forget about matchesDraw
// stores the highest count of matches for all pairs - I used simple exclusion by simply comparing the size() of the current std::vector<cv::DMatch> vector with the previous value of this variable
long int sceneWithMaxMatches = 0;
// ...
// after all is ready do some additional checking here in order to make sure the data is usable in our GUI. A trackbar for example requires AT LEAST 2 for its range since a range (0;0) doesn't make any sense
if(sceneWithMaxMatches < 2)
return -1;
// in this window show the image gallery (scene-images); the user can scroll through all image using a trackbar
cv::namedWindow("Images", CV_GUI_EXPANDED | CV_WINDOW_AUTOSIZE);
// just a dummy to store the state of the trackbar
int imagesTrackbarState = 0;
// create the first trackbar that the user uses to scroll through the scene-images
// IMPORTANT: use processedImages.size() - 1 since indexing in vectors is the same as in arrays - it starts from 0 and not reducing it by 1 will throw an out-of-range exception
cv::createTrackbar("Images:", "Images", &imagesTrackbarState, processedImages.size() - 1, on_imagesTrackbarCallback, NULL);
// in this window we show the matched object-images relative to the selected image in the "Images" window
cv::namedWindow("Matches for current image", CV_WINDOW_AUTOSIZE);
// yet another dummy to store the state of the trackbar in this new window
int imageMatchesTrackbarState = 0;
// IMPORTANT: again since sceneWithMaxMatches stores the SIZE of a vector we need to reduce it by 1 in order to be able to use it for the indexing later on
cv::createTrackbar("Matches:", "Matches for current image", &imageMatchesTrackbarState, sceneWithMaxMatches - 1, on_imageMatchesTrackbarCallback, NULL);
while(true)
{
    char key = cv::waitKey(20);
    if(key == 27)
        break;

    // from here on the magic begins
    // show the image gallery; use the position of the "Images:" trackbar to call the image at that position
    cv::imshow("Images", processedImages.at(cv::getTrackbarPos("Images:", "Images")));

    // store the index of the current scene-image by calling the position of the trackbar in the "Images:" window
    int currentSceneIndex = cv::getTrackbarPos("Images:", "Images");
    // we have to make sure that the match of the currently selected scene-image actually has something in it
    if(matches.at(currentSceneIndex).size())
    {
        // store the index of the current object-image that we have matched to the current scene-image in the "Images:" window
        int currentObjectIndex = cv::getTrackbarPos("Matches:", "Matches for current image");
        cv::imshow(
            "Matches for current image",
            matchesDraw.at(currentSceneIndex).at(currentObjectIndex < matchesDraw.at(currentSceneIndex).size() ? // is the current object index within the range of the matches for the current object and current scene
                currentObjectIndex : // yes, return the correct index
                matchesDraw.at(currentSceneIndex).size() - 1)); // if outside the range show the last matched pair!
    }
}
// do something else
// ...
The tricky part is the trackbar in the second window responsible for accessing the matched images for our currently selected image in the "Images" window. As I've explained above I set the trackbar "Matches:" in the "Matches for current image" window to have a range from 0 to (sceneWithMaxMatches - 1). However, not all images have the same amount of matches with the rest in the image set (applies tenfold if you have done some additional filtering to ensure reliable matches, for example by exploiting the properties of the homography, ratio test, min/max distance check etc.). Because I was unable to find a way to dynamically adjust the trackbar's range I needed a validation of the index. Otherwise, for some of the images and their matches the application will throw an out-of-range exception. This is due to the simple fact that for some matches we try to access a match-vector with an index greater than its size minus 1, because cv::getTrackbarPos() goes all the way to (sceneWithMaxMatches - 1). If the trackbar's position goes out of range for the currently selected vector with matches, I simply set the matchesDraw-image in "Matches for current image" to the very last in the vector. Here I exploit the fact that the indexing can't go below zero, and neither can the trackbar's position, so there is no need to check this, only what comes after the initial position 0. If this is not your case make sure you check the lower bound too and not only the upper.
Hope this helps!