I am getting an error when I pass a DataFrame column directly as the stop words. How can I resolve this?
stop_words_corpus = pd.DataFrame(word_dictionary_corpus.Word.unique(), columns=feature_names)
cv = CountVectorizer(max_features=200, analyzer='word', stop_words=stop_words_corpus)
cv_txt = cv.fit_transform(data.pop('Clean_addr'))
**Updated Error**
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
783 vocabulary.default_factory = vocabulary.__len__
784
--> 785 analyze = self.build_analyzer()
786 j_indices = []
787 indptr = _make_int_array()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self)
260
261 elif self.analyzer == 'word':
--> 262 stop_words = self.get_stop_words()
263 tokenize = self.build_tokenizer()
264
I fixed that error, but I am still having the issue.
Try this:
cv = CountVectorizer(max_features=200,
                     analyzer='word',
                     stop_words=stop_words_corpus.stack().unique())
We need to turn the DataFrame column into a NumPy array to pass the stop words into the CountVectorizer:
stop_word = stop_words_corpus['Word'].values
cv = CountVectorizer(max_features=200,
                     analyzer='word',
                     stop_words=stop_word)
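For reference, here is a minimal self-contained sketch of that fix (the stop-word table and documents below are made up for illustration):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical stand-in for word_dictionary_corpus
stop_words_corpus = pd.DataFrame({'Word': ['the', 'of', 'and']})
docs = pd.Series(['the house of cards', 'house and deck of cards'])

# CountVectorizer expects stop_words as a list/array of strings, not a DataFrame
cv = CountVectorizer(max_features=200,
                     analyzer='word',
                     stop_words=stop_words_corpus['Word'].values)
cv_txt = cv.fit_transform(docs)
print(sorted(cv.vocabulary_))  # ['cards', 'deck', 'house']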
What is the solution to the following error in TensorFlow?
ValueError: The two structures don't have the same sequence length.
Input structure has length 1, while shallow structure has length 2.
I tried TensorFlow versions 2.9.1 and 2.4.0. A toy example to reproduce the error is given below.
import tensorflow as tf

d1 = tf.data.Dataset.range(10)
d1 = d1.map(lambda x: tf.cast([x], tf.float32))

def func1(x):
    y1 = 2.0 * x
    y2 = -3.0 * x
    return tuple([y1, y2])

d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
d3 = d2.padded_batch(3, padded_shapes=(None,))

for x, y in d2.as_numpy_iterator():
    pass
The full error is:
ValueError Traceback (most recent call last)
~/Documents/pythonProject/tfProjects/asr/transformer/dataset.py in <module>
256 return tuple([y1, y2])
257 d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
---> 258 d3 = d2.padded_batch(3, padded_shapes=(None,))
259 for x, y in d2.as_numpy_iterator():
260 pass
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in padded_batch(self, batch_size, padded_shapes, padding_values, drop_remainder, name)
1887 padding_values,
1888 drop_remainder,
-> 1889 name=name)
1890
1891 def map(self,
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, input_dataset, batch_size, padded_shapes, padding_values, drop_remainder, name)
5171
5172 input_shapes = get_legacy_output_shapes(input_dataset)
-> 5173 flat_padded_shapes = nest.flatten_up_to(input_shapes, padded_shapes)
5174
5175 flat_padded_shapes_as_tensors = []
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py in flatten_up_to(shallow_tree, input_tree)
377 `input_tree`.
378 """
--> 379 assert_shallow_structure(shallow_tree, input_tree)
380 return list(_yield_flat_up_to(shallow_tree, input_tree))
381
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py in assert_shallow_structure(shallow_tree, input_tree, check_types)
290 if len(input_tree) != len(shallow_tree):
291 raise ValueError(
--> 292 "The two structures don't have the same sequence length. Input "
293 f"structure has length {len(input_tree)}, while shallow structure "
294 f"has length {len(shallow_tree)}.")
ValueError: The two structures don't have the same sequence length. Input structure has length 1, while shallow structure has length 2.
The structure passed to padded_shapes must mirror the structure of the dataset's elements. d2 yields a tuple of two tensors, but padded_shapes=(None,) describes only one component, hence the length mismatch (1 vs. 2). The following modification to the padded_shapes argument resolves the error.
import tensorflow as tf

d1 = tf.data.Dataset.range(10)
d1 = d1.map(lambda x: tf.cast([x], tf.float32))

def func1(x):
    y1 = 2.0 * x
    y2 = -3.0 * x
    return tuple([y1, y2])

d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
# one padded shape per component of the (y1, y2) tuple
d3 = d2.padded_batch(3, padded_shapes=([None], [None]))

for x, y in d3.as_numpy_iterator():  # iterate d3 to exercise the padded batches
    pass
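To see why the two-component shape is needed, you can inspect the dataset's structure directly (a quick check against the d2 defined above):

print(d2.element_spec)
# roughly: (TensorSpec(shape=<unknown>, dtype=tf.float32, name=None),
#           TensorSpec(shape=<unknown>, dtype=tf.float32, name=None))
# two components, so padded_shapes needs one shape per component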
Here is generic code representing what happens in my script:
import pandas as pd
import numpy as np

dic = {}
for i in np.arange(0, 10):
    dic[str(i)] = df = pd.DataFrame(np.random.randint(0, 1000, size=(5000, 20)),
                                    columns=list('ABCDEFGHIJKLMNOPQRST'))

df_out = pd.DataFrame(index=df.index)
for i in np.arange(0, 10):
    df_out['A_' + str(i)] = dic[str(i)]['A'].astype('int')
    df_out['D_' + str(i)] = dic[str(i)]['D'].astype('int')
    df_out['H_' + str(i)] = dic[str(i)]['H'].astype('int')
    df_out['I_' + str(i)] = dic[str(i)]['I'].astype('int')
    df_out['M_' + str(i)] = dic[str(i)]['M'].astype('int')
    df_out['O_' + str(i)] = dic[str(i)]['O'].astype('int')
    df_out['Q_' + str(i)] = dic[str(i)]['Q'].astype('int')
    df_out['R_' + str(i)] = dic[str(i)]['R'].astype('int')
    df_out['S_' + str(i)] = dic[str(i)]['S'].astype('int')
    df_out['T_' + str(i)] = dic[str(i)]['T'].astype('int')
    df_out['C_' + str(i)] = dic[str(i)]['C'].astype('int')
You will notice that as soon as the number of columns inserted into df_out exceeds 100, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead
The question is: how could I use pd.concat() and still have custom column names that depend on the dictionary key?
IMPORTANT: I would still like to keep a specific selection of columns, not all of them, as in the example: A, D, H, I, etc.
SPECIAL EDIT (based on Corralien's answer)
cols = {'A': 'float',
        'D': 'bool'}

out = pd.DataFrame()
for c, df in dic.items():
    for col, ftype in cols.items():
        # cast only the new column; astype on the concat result would
        # re-cast every previously added column to the latest ftype
        out = pd.concat([out, df[[col]].astype(ftype).add_suffix(f'_{c}')],
                        axis=1)
Many thanks for your help!
You can use a comprehension with pd.concat:
cols = {'A': 'float', 'D': 'bool'}
out = pd.concat([df[cols].astype(cols).add_prefix(f'{k}_')
                 for k, df in dic.items()], axis=1)
print(out)
# Output:
0_A 0_D 1_A 1_D 2_A 2_D 3_A 3_D
0 116.0 True 396.0 True 944.0 True 398.0 True
1 128.0 True 102.0 True 561.0 True 70.0 True
2 982.0 True 613.0 True 822.0 True 246.0 True
3 830.0 True 366.0 True 861.0 True 906.0 True
4 533.0 True 741.0 True 305.0 True 874.0 True
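If you prefer the A_0-style names from the question, swap add_prefix for add_suffix (same comprehension otherwise):

out = pd.concat([df[list(cols)].astype(cols).add_suffix(f'_{k}')
                 for k, df in dic.items()], axis=1)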
Use concat and flatten the resulting MultiIndex with map:
cols = ['A','D']
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df_out)
A_0 D_0 A_1 D_1 A_2 D_2 A_3 D_3
0 116 341 396 502 944 483 398 839
1 128 621 102 70 561 656 70 169
2 982 44 613 775 822 379 246 25
3 830 987 366 481 861 632 906 676
4 533 349 741 410 305 422 874 19
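If you also want the per-column dtypes from the SPECIAL EDIT, the two ideas combine naturally (a sketch using the cols mapping from the first answer):

cols = {'A': 'float', 'D': 'bool'}
df_out = pd.concat({k: v[list(cols)].astype(cols) for k, v in dic.items()}, axis=1)
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')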
I am very new to Python and ML. I have been doing a few courses on Kaggle and working on pipelines. Everything seemed to work fine without the pipelines, but I got an XGBoostError when I piped it all together. I have an issue with my code that I cannot figure out. Below is the code, followed by the error:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from xgboost import XGBRegressor

X_full = pd.read_csv(train_path).copy()
X_test = pd.read_csv(test_path).copy()

def cleaning(var):
    q1, q3 = np.percentile(var['Fare'], [25, 75])
    iqr = q3 - q1
    lower_bound_val = q1 - (1.5 * iqr)
    upper_bound_val = q3 + (1.5 * iqr)
    var = var[(var['Fare'] >= lower_bound_val) & (var['Fare'] < upper_bound_val)].copy()
    var['family_size'] = var.SibSp + var.Parch
    drop_cols = ['PassengerId', 'Name', 'Parch', 'SibSp', 'Ticket', 'Cabin', 'Embarked']
    var = var.drop(drop_cols, axis=1)
    return var

get_cleaning = FunctionTransformer(cleaning, validate=False)

age_transformer = SimpleImputer(missing_values=np.nan, strategy='median')
age_col = ['Age']
sex_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)
sex_col = ['Sex']

# Define the model
xgboost_m = XGBRegressor(random_state=0)

prepro_col = ColumnTransformer(
    transformers=[
        ('age', age_transformer, age_col),
        ('sex', sex_transformer, sex_col)
    ])

pl = Pipeline(steps=[('get_cleaning', get_cleaning),
                     ('prepro_col', prepro_col),
                     ('XGBoost', xgboost_m)
                     ])

# Assign target to y and drop it from X_full
y = X_full.Survived
X_full.drop(['Survived'], axis=1, inplace=True)

# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)

pl.fit(X_train, y_train)
And here is the error:
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-887-676d922c8ba5> in <module>
----> 1 pl.fit(X_train, y_train)
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
333 if self._final_estimator != 'passthrough':
334 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 335 self._final_estimator.fit(Xt, y, **fit_params_last_step)
336
337 return self
/opt/conda/lib/python3.7/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
546 obj=obj, feval=feval,
547 verbose_eval=verbose, xgb_model=xgb_model,
--> 548 callbacks=callbacks)
549
550 if evals_result:
/opt/conda/lib/python3.7/site-packages/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
210 evals=evals,
211 obj=obj, feval=feval,
--> 212 xgb_model=xgb_model, callbacks=callbacks)
213
214
/opt/conda/lib/python3.7/site-packages/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
73 # Skip the first update if it is a recovery step.
74 if version % 2 == 0:
---> 75 bst.update(dtrain, i, obj)
76 bst.save_rabit_checkpoint()
77 version += 1
/opt/conda/lib/python3.7/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
1159 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
1160 ctypes.c_int(iteration),
-> 1161 dtrain.handle))
1162 else:
1163 pred = self.predict(dtrain, output_margin=True, training=True)
/opt/conda/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
186 """
187 if ret != 0:
--> 188 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
189
190
XGBoostError: [22:28:42] ../src/data/data.cc:530: Check failed: labels_.Size() == num_row_ (712 vs. 622) : Size of labels must equal to number of rows.
Stack trace:
[bt] (0) /opt/conda/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0xa5dc4) [0x7f27232f2dc4]
[bt] (1) /opt/conda/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x106c92) [0x7f2723353c92]
[bt] (2) /opt/conda/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1a84b7) [0x7f27233f54b7]
[bt] (3) /opt/conda/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1aae4e) [0x7f27233f7e4e]
[bt] (4) /opt/conda/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x55) [0x7f27232e4f35]
[bt] (5) /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f2783ff0630]
[bt] (6) /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f2783feffed]
[bt] (7) /opt/conda/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f278323c60e]
[bt] (8) /opt/conda/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x13044) [0x7f278323d044]
The error, labels_.Size() == num_row_ (712 vs. 622), indicates that you have 712 labels but only 622 rows of features, and the two must be equal. The likely cause is that the get_cleaning step drops outlier rows from X inside the pipeline, while the labels (y = X_full.Survived) are never filtered to match, so features and labels fall out of sync.
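One way to keep them aligned (a sketch, not the only possible fix) is to do the row-dropping before the split, outside the pipeline, so y can be filtered with the same index; get_cleaning would then be removed from the pipeline:

X_clean = cleaning(X_full)        # drops outlier rows and unused columns
y_clean = y.loc[X_clean.index]    # keep only the labels for surviving rows
X_train, X_valid, y_train, y_valid = train_test_split(
    X_clean, y_clean, train_size=0.8, test_size=0.2, random_state=0)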
I have numeric features and a binary response. I am trying to build ensemble decision trees such as random forests and gradient-boosted trees, but I get an error, which I have reproduced with the iris data.
The error is below, and the whole error message is at the bottom.
TypeError: Could not convert 12.631578947368421 to int
import numpy as np
import pandas as pd
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from sklearn import datasets

iris = datasets.load_iris()
y = list(iris.target)

df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df = df.drop(['Species'], axis=1)
df['label'] = y

spark_df = spark.createDataFrame(df).drop('Id')
cols = spark_df.drop('label').columns

assembler = VectorAssembler(inputCols=cols, outputCol='features')
output_dat = assembler.transform(spark_df).select('label', 'features')

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
    .addGrid(rf.numTrees, np.linspace(10, 60, 20)).build()

crossval_rf = CrossValidator(estimator=rf,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(),
                             numFolds=5)
cvModel_rf = crossval_rf.fit(output_dat)
TypeError Traceback (most recent call last)
<ipython-input-24-44f8f759ed8e> in <module>
2 paramGrid_rf = ParamGridBuilder() \
3 .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
----> 4 .addGrid(rf.numTrees, np.linspace(10, 60, 20)) \
5 .build()
6
~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in build(self)
120 return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
121
--> 122 return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
123
124
~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
120 return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
121
--> 122 return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
123
124
~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in to_key_value_pairs(keys, values)
118
119 def to_key_value_pairs(keys, values):
--> 120 return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
121
122 return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
118
119 def to_key_value_pairs(keys, values):
--> 120 return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
121
122 return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in toInt(value)
197 return int(value)
198 else:
--> 199 raise TypeError("Could not convert %s to int" % value)
200
201 #staticmethod
TypeError: Could not convert 12.631578947368421 to int
Both maxDepth and numTrees need to be integers; NumPy linspace produces floats:
import numpy as np
np.linspace(10, 60, 20)
Result:
array([ 10. , 12.63157895, 15.26315789, 17.89473684,
20.52631579, 23.15789474, 25.78947368, 28.42105263,
31.05263158, 33.68421053, 36.31578947, 38.94736842,
41.57894737, 44.21052632, 46.84210526, 49.47368421,
52.10526316, 54.73684211, 57.36842105, 60. ])
So your code trips over the first non-integer value (here 12.63157895) and raises the error.
Use arange instead:
np.arange(10, 60, 20)
# array([10, 30, 50])
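Applied to the grid above, that might look like this (a sketch assuming the rf estimator already defined):

paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [5, 10, 15, 20, 25, 30]) \
    .addGrid(rf.numTrees, list(range(10, 61, 10))) \
    .build()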
I am using the following code to create a clustering model:
import pandas as pd
pandas_df = pd.read_pickle('df_features.pickle')
spark_df = sqlContext.createDataFrame(pandas_df)
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1.0)
modela = kmeans.fit(spark_df)
Then I got errors:
AnalysisException Traceback (most recent call last)
<ipython-input-26-00e1e2ba1983> in <module>()
3
4 kmeans = KMeans(k=2, seed=1.0)
----> 5 modela = kmeans.fit(spark_df)
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/base.pyc in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/wrapper.pyc in _fit(self, dataset)
211
212 def _fit(self, dataset):
--> 213 java_model = self._fit_java(dataset)
214 return self._create_model(java_model)
215
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/wrapper.pyc in _fit_java(self, dataset)
208 """
209 self._transfer_params_to_java()
--> 210 return self._java_obj.fit(dataset._jdf)
211
212 def _fit(self, dataset):
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve '`features`' given input columns: [field_1, field_2, field_3, field_4, field_5, field_6, field_7];"
Did I create the data frame wrong? Does anyone know what I missed? Thanks!
You need to use VectorAssembler:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

vecAssembler = VectorAssembler(inputCols=spark_df.columns, outputCol="features")
vector_df = vecAssembler.transform(spark_df)

n_clusters = 2  # pick the number of clusters you need
kmeans = KMeans().setK(n_clusters).setSeed(1)
model = kmeans.fit(vector_df)
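Once fitted, the model's transform method adds a prediction column with the cluster assignment for each row:

predictions = model.transform(vector_df)
predictions.select('features', 'prediction').show(5)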
KMeans in the older RDD-based MLlib API requires an RDD of DenseVectors, where each vector corresponds to one row of your DataFrame. So, supposing that your DataFrame has three columns you are feeding into the KMeans model, I would refactor it to be along the lines of:
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

spark_rdd = spark_df.rdd
# one DenseVector per row, assuming three feature columns
modelInput = spark_rdd.map(lambda x: Vectors.dense(x[0], x[1], x[2]))
modelObject = KMeans.train(modelInput, 2)
Then, if you want to get the results back from the RDD into a DataFrame, I would do something like:
from pyspark.sql import Row

labels = modelInput.map(lambda x: modelObject.predict(x))
results = labels.zip(spark_rdd)
resultFrame = results.map(lambda x: Row(Label=x[0], Column1=x[1][0],
                                        Column2=x[1][1], Column3=x[1][2])).toDF()
data = [(Vectors.dense([x[0], x[1]]),) for x in pandas_df.iloc[0:, 2:4].values]
spark_df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1.0)
modela = kmeans.fit(spark_df)
For more details, refer to the official manual.