Convert string features to numeric features in sklearn and pandas

I'm currently working with sklearn (I'm a beginner) and I want to train and test a very naive classifier.
The structure of my training and testing data is the following:
----|----|----|----|----|----|------|----|----|----|-------
f1 | f2 | f3 | c1 | c2 | c3 | word | c4 | c5 | c6 | label
----|----|----|----|----|----|------|----|----|----|-------
Where:
f1: feature 1, binary numerical type like 0
f2: feature 2, binary numerical type like 1
f3: feature 3, binary numerical type like 0
c1: context 1, string type like "from"
c2: context 2, string type like "this"
c3: context 3, string type like "website"
word: central word (string) of the context like "http://.."
c4: context 4, string type
c5: context 5, string type
c6: context 6, string type
label: this is the label (string) that the classifier has to train and predict like: "URL" (I have only three types of label: REF,IRR,DATA)
What I want to do is convert my context string features into numerical features. Every string field contains at most one word.
The main goal is to assign a numeric value to every context and word string so that the classifier can work with them.
What I thought is that it's possible to define a vocabulary like:
{ from, website, to, ... }
and provide this vocabulary to the DictVectorizer, but I don't know how to do this now.
What I really want to do is to generate a huge number of binary features: the word “from” immediately preceding the word in question is one feature; the word “available” two positions after the word is another one. But I really don't know how.
This is what I tried to do:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
# I tried to read the train csv:
train = pd.read_csv('train.csv')
# Drop the label field:
train_X = train.drop(['label'], axis=1)
# Take the labels:
train_y = train.label.values
# Then I convert the pandas data into a list of dictionaries (one per row):
train_X = train_X.to_dict('records')
# And I tried to vectorize everything:
vec = DictVectorizer()
train_X = vec.fit_transform(train_X).toarray()
Obviously it didn't work. This is because the context and word fields can be very long strings, like a URL.
Any suggestions? I accept all kinds of solutions.
Thank you very much.

If the set of unique words is finite, you can do something like this using pandas (Series.map looks up each value in the dictionary; words missing from it become NaN).
mapping_dict = {'word1': 0,
                'word2': 1,
                'word3': 3}
df[col] = df[col].map(mapping_dict)
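The binary context features described in the question (for example, "the word right before is from") correspond to one-hot encoding each string column. Here is a minimal sketch with scikit-learn's DictVectorizer. The column names follow the question's layout, but the sample rows are made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows in the question's layout (values are illustrative).
rows = [
    {"f1": 0, "c3": "from", "word": "http://example.org", "c4": "available"},
    {"f1": 1, "c3": "to", "word": "data.csv", "c4": "here"},
]

# String values become one binary feature per (column, value) pair;
# numeric values such as f1 are passed through unchanged.
vec = DictVectorizer(sparse=True)  # keep the matrix sparse: most entries are 0
X = vec.fit_transform(rows)

print(sorted(vec.vocabulary_))  # feature names such as 'c3=from', 'c4=available'
print(X.shape)                  # (number of rows, number of distinct features)
```

Keeping the output sparse instead of calling .toarray() may help when URLs make almost every value of the word column unique, since the number of generated features then grows with the number of rows.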

Related

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like the first 5 words of those words that appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words in the order they appear into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NAN in the corresponding columns).
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra for that in that order and do the same for the rest of the courses.
My attempt was to define an apply function that will take in a row and iterate from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create these additional columns that have the correct words for each course. This code will currently only keep the words for one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index=1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <=5) :
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1
df.apply(find_top_description_words,axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!
If I understand correctly:
Create a new DataFrame with just the 10 predicted words:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
Note that each row now holds a list of its predicted words, in order: the first non-empty predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame:
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And the final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
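def common_elements(xx, yy):
    # Position of each description word; keep only the first occurrence of repeated words
    temp = pd.Series(range(0, len(xx)), index=xx)
    temp = temp[~temp.index.duplicated()]
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)
pred_words_lists = df.apply(lambda x: common_elements(x['course_description'].replace(',','').split(), list(x[3:].dropna())), axis = 1)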
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',','').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list
df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
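For comparison, the same selection can be sketched in plain Python without reindexing. This is only an illustration, not the code from either answer: the helper name first_k_in_description is hypothetical, and it assumes that matching should ignore case and punctuation and that the predicted words arrive ordered from most to least likely.

```python
import re

def first_k_in_description(description, predicted_words, k=5):
    """Return the first k predicted words found in the description, padded with None."""
    # Lowercase and keep alphabetic tokens only, so "Polynomial" and "induction," match.
    description_tokens = set(re.findall(r"[a-z]+", description.lower()))
    found = []
    for word in predicted_words:  # assumed ordered from most to least likely
        if word.lower() in description_tokens and word not in found:
            found.append(word)
        if len(found) == k:
            break
    return found + [None] * (k - len(found))

description = ("Polynomial and rational functions, exponential and logarithmic "
               "functions, trigonometry and trigonometric functions. Complex numbers, "
               "fundamental theorem of algebra, mathematical induction, binomial "
               "theorem, series, and sequences.")
predicted = ["irrelevantword1", "induction", "exponential", "logarithmic",
             "irrelevantword2", "polynomial", "algebra"]
top_words = first_k_in_description(description, predicted)
print(top_words)
```

To fill the description_word_1 ... description_word_5 columns, a function like this could be passed to DataFrame.apply with axis=1 and result_type="expand", which turns each returned list into one column per element.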

derive features from date string in TensorFlow

I try to parse a CSV file which contains a date string (format "2018-03-30 09:30:05").
It should be turned into one-hot encoded features in the form of day / hour / minute / second.
One obvious way to do this is using pandas and store in a separate file or HDF store.
But in order to simplify the workflow (and leverage the GPU), I would like to do this directly in TensorFlow.
Assume the date string is on position -2, I thought something like tf.int32(tf.substr(row[-2],0,4)) should work to get the year, but it returns TypeError: 'DType' object is not callable.
with tf.python_io.TFRecordWriter("train_sample_sorted.tfrecords") as tf_writer:
    i = 0
    for row in myArray:
        i +=1
        if(i%10000==0):
            print(row[-2])
        #timefeatures = int(row[-2][0:4]) ## TypeError: Value must be iterable
        #timefeatures = tf.int32(tf.substr(row[-2],0,4)) ## TypeError: 'DType' object is not callable
        features, label = row[:-2], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["timefeatures"].float_list.value.extend(timefeatures)
        example.features.feature["label"].int64_list.value.append(label)
        tf_writer.write(example.SerializeToString())
What is the best practice to handle date strings as input features? Is there a way around pre-processing?
Thanks
The first version, int( row[ -2 ][ 0 : 4 ] ), fails for two reasons: indexing cannot be used on the strings inside a string tensor, and even if it could, the result could not be converted to int like that.
The second version, tf.int32( tf.substr( row[ -2 ], 0, 4 ) ), is almost there. It extracts the substring fine, but tf.int32 is a DType, not a function, so you cannot cast a string tensor to a number that way; to convert strings to numbers you have to use tf.string_to_number.
Without access to the data you use I couldn't test it, but this should work:
tf.string_to_number( tf.substr( row[ -2 ], 0, 4 ), out_type = tf.int32 )
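If the rows are still plain Python values when the TFRecord file is written, the time features can also be derived before TensorFlow is involved. This is a sketch of that preprocessing route, not of the tensor-based one above. The helper names one_hot and time_features are hypothetical, and it assumes "day" means day of the month.

```python
from datetime import datetime

def one_hot(index, size):
    """Return a list of floats with a single 1.0 at position index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def time_features(date_string):
    """One-hot encode day, hour, minute and second of a 'YYYY-MM-DD HH:MM:SS' string."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S")
    return (one_hot(parsed.day - 1, 31)   # day of month, 1..31
            + one_hot(parsed.hour, 24)
            + one_hot(parsed.minute, 60)
            + one_hot(parsed.second, 60))

timefeatures = time_features("2018-03-30 09:30:05")
print(len(timefeatures))  # 31 + 24 + 60 + 60 entries
```

The result is a flat list of floats, so it can be passed to float_list.value.extend in the same way as the other features.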

How to make a new variable based on 30 other variables

I have 30 variables on family history of cancer, e.g. breast cancer father, breast cancer mother, breast cancer sister, etc. I would like to make a new variable and give it the value "1" if any of those columns contains a 1.
Thus:
I have 30 variables with answers 1 to 3: 1 is yes, 2 is no, and 3 is unknown. If any of the 30 variables is 1, I would like my new variable to take the value 1.
Does someone know how I can do this?
You can create a list instead of separate 30 variables and then filter it out to create a new variable. This will make it more dynamic.
// This will be the cancer history for a single family
var cancerHistory = [];
// Add dummy data
cancerHistory.push('yes');
cancerHistory.push('no')
cancerHistory.push('unknown');
cancerHistory.push('no');
// Check if at least one of them is "yes"
var hasHistoryOfCancer = cancerHistory.indexOf('yes') > -1;
alert(hasHistoryOfCancer); // true
You can use a for loop. You did not mention the language, so I am writing the code in Python, which is easy to understand. If you want it in another language, you can apply a similar approach.
import pandas as pd
new_var = []
df = pd.read_csv("DataFile.csv") # Convert the data file to CSV and put its name here.
for i in range(len(df)):
    # List all 30 columns here; the middle ones are elided
    x = [df['column1'][i], df['column2'][i], ..., df['column30'][i]]
    if (1 in x): new_var.append(1)
    else: new_var.append(0)
df['new_var'] = new_var
df.to_csv('NewDataFile.csv', sep=',', encoding='utf-8')
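The same check can be written without an explicit loop by letting pandas compare whole columns at once. This is a small sketch on made-up data: the three column names stand in for the 30 real ones and the values are invented.

```python
import pandas as pd

# Hypothetical data: three of the 30 family-history columns (1 = yes, 2 = no, 3 = unknown).
df = pd.DataFrame({
    "column1": [2, 1, 3],
    "column2": [2, 2, 3],
    "column3": [1, 2, 2],
})

history_columns = ["column1", "column2", "column3"]  # list all 30 names in practice

# True wherever a cell equals 1; any(axis=1) collapses each row to a single flag.
df["new_var"] = (df[history_columns] == 1).any(axis=1).astype(int)

print(df["new_var"].tolist())
```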

Convert cell text in progressive number

I have written this SQL in PostgreSQL environment:
SELECT
ST_X("Position4326") AS lon,
ST_Y("Position4326") AS lat,
"Values"[4] AS ppe,
"Values"[5] AS speed,
"Date" AS "timestamp",
"SourceId" AS smartphone,
"Track" as session
FROM
"SingleData"
WHERE
"OsmLineId" = 44792088
AND
array_length("Values", 1) > 4
AND
"Values"[5] > 0
ORDER BY smartphone, session;
Now I have imported the result into Matlab, and I have six vectors and one cell array (because the UUID text was converted to a cell array), all of size 5710x1.
Now I would like to convert the text in the cell array into a progressive number, like 1, 2, 3..., one for each different session code.
In Excel it is easy with FIND.VERT(obj, matrix, col), but I do not know how to do it in Matlab.
Now I have a big cell with a lot of codes like:
ff95465f-0593-43cb-b400-7d32942023e1
I would like to convert this cell array into an array of numbers, where the first occurrence of
ff95465f-0593-43cb-b400-7d32942023e1 -> 1
and so on, with 2 assigned when a different code appears, and so forth.
OK, I have solved it.
I put the unique session codes in a second cell array, C.
Then, with a for loop, I obtain:
%% Converting the UUIDs into integers
C = unique(session);
N = length(session);
session2 = zeros(N, 1);
for i = 1:N
    session2(i) = find(strcmp(C, session(i)));
end
Thanks to all!

Matplotlib table: individual column width

Is there a way to specify the width of individual columns in a matplotlib table?
The first column in my table contains just 2-3 digit IDs, and I'd like this column to be smaller than the others, but I can't seem to get it to work.
Let's say I have a table like this:
import matplotlib.pyplot as plt
fig = plt.figure()
table_ax = fig.add_subplot(1,1,1)
table_content = [["1", "Daisy", "ill"],
["2", "Topsy", "healthy"]]
table_header = ('ID', 'Name','Status')
the_table = table_ax.table(cellText=table_content, loc='center', colLabels=table_header, cellLoc='left')
fig.show()
(Never mind the weird cropping, it doesn't happen in my real table.)
What I've tried is this:
prop = the_table.properties()
cells = prop['child_artists']
for cell in cells:
    text = cell.get_text()
    if text == "ID":
        cell.set_width(0.1)
    else:
        try:
            int(text)
            cell.set_width(0.1)
        except TypeError:
            pass
The above code seems to have zero effect: the columns are still all equally wide. (cell.get_width() returns 0.3333333333, so I would think that width is indeed the cell width.) So what am I doing wrong?
Any help would be appreciated!
I've been searching the web over and over for solutions to similar problems. I found some answers and used them, but I didn't find them quite straightforward. By chance, while simply trying different table methods, I found the table method get_celld.
It returns a dictionary whose keys are tuples giving each cell's position as table coordinates. So by writing
cellDict=the_table.get_celld()
cellDict[(0,0)].set_width(0.1)
you simply address the upper left cell. Looping over rows or columns is then fairly easy.
A bit late answer, but hopefully others may be helped.
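As a sketch of that loop, assuming the table from the question and that the ID column is column 0 of the cell dictionary, the narrow width can be applied to every cell in that column (header included):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, table_ax = plt.subplots()
table_ax.axis("off")
the_table = table_ax.table(
    cellText=[["1", "Daisy", "ill"], ["2", "Topsy", "healthy"]],
    colLabels=("ID", "Name", "Status"),
    loc="center",
    cellLoc="left",
)

# Keys of get_celld() are (row, column) tuples; column 0 holds the IDs.
for (row, col), cell in the_table.get_celld().items():
    if col == 0:
        cell.set_width(0.1)

id_width = the_table.get_celld()[(0, 0)].get_width()
name_width = the_table.get_celld()[(0, 1)].get_width()
print(id_width, name_width)
```

Selecting cells by their column index avoids inspecting the cell text at all, so it does not depend on what get_text() returns.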
Just for completeness: the column headers are at (0,0) ... (0, n-1), and the row headers are at (1,-1) ... (n,-1).
                 ---------------------------------------------
                 | ColumnHeader (0,0)  | ColumnHeader (0,1)  |
                 ---------------------------------------------
rowHeader (1,-1) | Value (1,0)         | Value (1,1)         |
                 ---------------------------------------------
rowHeader (2,-1) | Value (2,0)         | Value (2,1)         |
                 ---------------------------------------------
The code:
for key, cell in the_table.get_celld().items():
    print(str(key[0]) + ", " + str(key[1]) + "\t" + str(cell.get_text()))
The condition text=="ID" is always False, since cell.get_text() returns a Text object rather than a string:
for cell in cells:
    text = cell.get_text()
    print(text, text == "ID")  # <==== here
    if text == "ID":
        cell.set_width(0.1)
    else:
        try:
            int(text)
            cell.set_width(0.1)
        except TypeError:
            pass
On the other hand, addressing the cells directly works: try cells[0].set_width(0.5).
EDIT: Text objects have a get_text() method themselves, so getting down to the string of a cell can be done like this:
text = cell.get_text().get_text() # yup, looks weird
if text == "ID":