Using TF-IDF on a pandas DataFrame - pandas

I am trying to use TF-IDF with a pandas dataset that has two columns: the first column contains text data and the other contains categorical data, which looks like below
summary type of attack
unknown african american assailants fired seve... Armed Assault
unknown perpetrators detonated explosives paci... Bombing
karl armstrong member years gang threw firebom... Infrastructure
karl armstrong member years gang broke into un... Infrastructure
unknown perpetrators threw molotov cocktail in... Infrastructure
I want to use TF-IDF to convert the first column and then use it to build a model for predicting the second column, which contains the attack type.

Here is a short example that processes your df into an X and y ready to be trained.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

data = {'summary': ['unknown african american assailants fired',
                    'Armed Assault unknown perpetrators detonated explosives',
                    'Bombing karl armstrong member years gang threw'],
        'type of attack': ['bullet', 'explosion', 'gang']}
df = pd.DataFrame(data)

# TF-IDF: turn the text column into a sparse feature matrix
tf = TfidfVectorizer()
X = tf.fit_transform(df['summary'])

# Label encoding: turn the attack-type strings into integer class labels
le = LabelEncoder()
y = le.fit_transform(df['type of attack'])

# Your X and y are ready to be trained
print('X----')
print(X)
print('y----')
print(y)
Output
X----
(0, 9) 0.4673509818107163
(0, 4) 0.4673509818107163
(0, 1) 0.4673509818107163
(0, 0) 0.4673509818107163
(0, 15) 0.35543246785041743
(1, 8) 0.4233944834119594
(1, 7) 0.4233944834119594
(1, 13) 0.4233944834119594
(1, 5) 0.4233944834119594
(1, 2) 0.4233944834119594
(1, 15) 0.3220024178194947
(2, 14) 0.37796447300922725
(2, 10) 0.37796447300922725
(2, 16) 0.37796447300922725
(2, 12) 0.37796447300922725
(2, 3) 0.37796447300922725
(2, 11) 0.37796447300922725
(2, 6) 0.37796447300922725
y----
[0 1 2]
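From there, a minimal sketch of the prediction step (the choice of LogisticRegression and the train/test split are assumptions, not part of the original answer; with only three toy rows this is purely illustrative):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the TF-IDF features and encoded labels, then fit a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predicted labels can be mapped back to the original attack-type strings
print(le.inverse_transform(clf.predict(X_test)))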

Related

Large Sampling with Replacement by index layer of a Pandas multiindexed Dataframe

Imagine a dataframe with the structure below:
>>> print(pair_df)
0 1
centre param h pair_ind
0 x1 1 (0, 1) 2.244282 2.343915
(1, 2) 2.343915 2.442202
(2, 3) 2.442202 2.538162
(3, 4) 2.538162 2.630836
(4, 5) 2.630836 2.719298
... ... ...
9 x3 7 (1, 8) 1.407902 1.417398
(2, 9) 1.407953 1.422860
8 (0, 8) 1.407896 1.417398
(1, 9) 1.407902 1.422860
9 (0, 9) 1.407896 1.422860
[1350 rows x 2 columns]
What is the most efficient way to heavily sample (with replacement, e.g., 1000 times) this dataframe by the index level centre (10 values here) and put the samples all together?
I have found two solutions:
1)
import numpy as np
idx = pd.IndexSlice  # shorthand for slicing the MultiIndex (assumed missing from the snippet)
bootstrap_rand = np.random.choice(list(range(0, 10)), size=10 * 1000, replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = pair_df.unstack(['param', 'h', 'pair_ind']).\
    sample(10 * 1000, replace=True).\
    stack(['param', 'h', 'pair_ind'])
Any more efficient ideas?
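One possibly faster alternative (a sketch, not benchmarked here): precompute the per-centre blocks once with groupby, then concatenate the sampled blocks so the MultiIndex is not re-sliced on every draw:
import numpy as np
import pandas as pd

# Build a dict of per-centre sub-frames once (assumes 'centre' is an index level of pair_df)
groups = {key: block for key, block in pair_df.groupby(level='centre')}

# Draw 10 * 1000 centre labels with replacement and stitch the blocks together
draws = np.random.choice(list(groups), size=10 * 1000, replace=True)
sampled_df = pd.concat([groups[key] for key in draws])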

Julia - Generate 2-matching odd Set

In Julia, given a Set{Tuple{Int, Int}} named S of length greater than 3, for instance:
julia> S = Set{Tuple{Int,Int}}([(1, 4), (2, 5), (2, 6), (3, 6)])
Set{Tuple{Int64,Int64}} with 4 elements:
(2, 5)
(3, 6)
(2, 6)
(1, 4)
I want to return a subset T of S of odd length at least 3 (3, 5, 7, ...) such that all first values of the tuples are unique. For instance, I can't have (2, 5) and (2, 6), because the first value, 2, would not be unique. The same applies to second values, meaning that I can't have (2, 6) and (3, 6).
If it is not possible, returning an empty Set of Tuple is fine.
Finally for the above minimal example the code should return:
julia> T = Set{Tuple{Int,Int}}([(1, 4), (2, 5), (3, 6)])
Set{Tuple{Int64,Int64}} with 3 elements:
(2, 5)
(3, 6)
(1, 4)
I am truly open to any other type of structure if you think it is better than Set{Tuple{Int, Int}} :)
I know how I can do it with integer programming. However, I will run this many times with large instances and I would like to know if there is a better way because I deeply think it can be done in polynomial time and perhaps in Julia with clever map or other efficient functions!
What you need is a way to filter the possible combinations of members of a set, so create a filtering function. If the requirement about an odd length (3, 5, 7, ...) that you mentioned applies here, you may need to add that to the filter logic below:
using Combinatorics

allunique(a) = length(a) == length(unique(a))                    # collection has no duplicates
slice(tuples, position) = [t[position] for t in tuples]          # all values at a given tuple position
uniqueslice(tuples, position) = allunique(slice(tuples, position))
is_with_all_positions_unique(tuples) = all(n -> uniqueslice(tuples, n), 1:length(first(tuples)))
Now you can find combinations. With big sets these will explode in number, so make sure to exit when you have enough. You could use Lazy.jl here, or just a function:
function tcombinations(tuples, len, needed)
    printed = 0
    for combo in combinations(collect(tuples), len)
        if is_with_all_positions_unique(combo)
            printed += 1
            println(combo)
            printed >= needed && break
        end
    end
end
tcombinations(tuples, 3, 4)
[(2, 5), (4, 8), (3, 6)]
[(2, 5), (4, 8), (1, 4)]
[(2, 5), (4, 8), (5, 6)]
[(2, 5), (3, 6), (1, 4)]

Tensorflow tokeniser: the maximum number of words to keep

I am trying to tokenize the IMDB movie reviews with the TensorFlow Keras tokenizer. I want a vocabulary of at most 10,000 words. For unseen words, I use a default OOV token.
type(X), X.shape, X[:3]
(pandas.core.series.Series,(25000,),
0 first think another disney movie might good it...
1 put aside dr house repeat missed desperate hou...
2 big fan stephen king s work film made even gre...
Name: SentimentText, dtype: object)
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='xxxxxxx')
# fit on the input data
tokenizer.fit_on_texts(X)
When I check the number of words in tokenizer dictionary I get:
X_dict=tokenizer.word_index
list(enumerate(X_dict.items()))[:10]
[(0, ('xxxxxxx', 1)),
(1, ('s', 2)),
(2, ('movie', 3)),
(3, ('film', 4)),
(4, ('not', 5)),
(5, ('it', 6)),
(6, ('one', 7)),
(7, ('like', 8)),
(8, ('i', 9)),
(9, ('good', 10))]
print(len(X_dict))
Out: 74120
Why do I get 74120 words instead of 10000 words?
That is because the full dictionary of words is always kept. If you have a look at the source code, you will see that in fit_on_texts() the parameter num_words is ignored. However, when you convert your text to sequences with texts_to_sequences(), it calls texts_to_sequences_generator(), which contains the following piece of code:
for w in seq:
    i = self.word_index.get(w)
    if i is not None:
        if num_words and i >= num_words:
            if oov_token_index is not None:
                vect.append(oov_token_index)
        else:
            vect.append(i)
    elif self.oov_token is not None:
        vect.append(oov_token_index)
yield vect
Here you can see that num_words is honoured when the sequences are generated. This is useful because you can change the number of words without fitting on the whole text again, so you can experiment with whether the limit suits your needs or whether you need more words to work on your task successfully, as nicolewhite states in her GitHub answer.
So what you observe is expected: if you run np.unique() on all of your sequences, you will not get more than 10000 distinct values.
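As a quick check, here is a sketch (the variable names continue the snippet above and are not from the original answer) that converts the reviews and counts the distinct token ids that actually appear:
import numpy as np

# Convert the fitted texts; word ids ranked beyond num_words are replaced by the OOV id
sequences = tokenizer.texts_to_sequences(X)
ids = np.unique(np.concatenate([np.asarray(s, dtype=int) for s in sequences]))

print(len(ids))    # at most roughly num_words distinct ids
print(ids.max())   # ids at or above the num_words cut-off do not appear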

How to exclude business days from date calculations

Dates are in the following format:
YYYY-MM-DD HH:MM:SS
I am using the following imports and calculation in a notebook:
import pandas as pd
import numpy as np
import datetime as dt
F52_metrics['Dock to UPS SAP Receipt'] = (
    (F52_metrics['UPS SAP Receipt Date'].dt.date - F52_metrics['Dock Date'].dt.date)
    .astype(str)
    .map(lambda x: x.rstrip('00:00:00.000000000'))
    .str.replace("NaT", "")
    .str.replace("+", "")
    .str.replace("days", "")
)
I need to replicate the above calculation to exclude business days. I have tried replacing the calculation entirely with numpy.busday_count but have been running into syntax errors.
You can use a NumPy business-day calendar.
import numpy as np

def calendar():
    # Work-week mask (Mon-Fri) plus an optional holidays array
    return np.busdaycalendar(
        weekmask='1111100',
        holidays=['2020-01-01', '2020-01-20', '2020-02-17', '2020-05-25',
                  '2020-07-03', '2020-09-07', '2020-10-12', '2020-11-11',
                  '2020-11-26', '2020-12-25'])

def countWeekDays(fromDate='2020-03-03', toDate='2020-06-03'):
    d = np.arange(fromDate, toDate, dtype=np.datetime64)
    weekdays = d[np.is_busday(d, busdaycal=calendar())]
    workDays = [(m, np.array([i for i in weekdays if i.item().month == m]).size)
                for m in range(1, 13)]
    return workDays  # weekdays, months
>>> countWeekDays()
[(1, 0), (2, 0), (3, 21), (4, 22), (5, 20), (6, 2), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0)]
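To get a per-row business-day difference in the original frame, a hedged sketch (the column names are taken from the question, and the holiday calendar reuses calendar() above; rows with NaT would need to be dropped or handled separately first):
import numpy as np

# np.busday_count takes array-likes of dates, so cast the datetime columns to day precision
start = F52_metrics['Dock Date'].values.astype('datetime64[D]')
end = F52_metrics['UPS SAP Receipt Date'].values.astype('datetime64[D]')

F52_metrics['Dock to UPS SAP Receipt'] = np.busday_count(start, end, busdaycal=calendar())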

Python, Numpy: all UNIQUE combinations of a numpy.array() vector

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations but it's very slow. For an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use combination differences):
import itertools
import numpy as np
import pandas as pd

def a(array):
    temp = pd.Series([])
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0] - i[1])))
    temp.index = range(len(temp))
    return temp
As you can see, there is no repetition.
sklearn.utils.extmath.cartesian is really fast and good, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools, and with much more speed for large vectors.
You could take the upper triangular part of a matrix formed on the Cartesian product with the binary operation (here subtraction, as in your example):
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0. -0.84410597 -0.48752891]
[ 0.84410597 0. 0.35657707]
[ 0.48752891 -0.35657707 0. ]]
[-0.84410597 -0.48752891 0.35657707]
With n=1000 (and output piped to /dev/null) this runs in 0.131s on my relatively modest laptop.
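Wrapped up as a drop-in replacement for the original a() (a sketch; the abs() matches the question's use of absolute differences, which the timing above did not include):
import numpy as np
import pandas as pd

def pairwise_abs_diffs(array):
    # Broadcast the subtraction, then keep only the strict upper triangle
    # so each unordered pair appears exactly once
    a = np.asarray(array)
    diffs = np.abs(a - a[:, np.newaxis])[np.triu_indices(len(a), 1)]
    return pd.Series(diffs)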
For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, ((6,)))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object