ML - NB: Naive Bayes classification prediction and evaluation with CountVectorizer/TfidfVectorizer and stop-word removal
This article, collected and organized by 生活随笔, introduces classification prediction and evaluation with the Naive Bayes (NB) algorithm, using CountVectorizer/TfidfVectorizer with stop-word removal. It is shared here as a reference.
Contents
Output results
Design approach
Core code
Output results
Design approach
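A minimal sketch of the workflow the title describes: vectorize the text with CountVectorizer and TfidfVectorizer (dropping English stop words), train a MultinomialNB classifier, and evaluate the predictions. The 20 newsgroups corpus, the 75/25 split and the random_state below are illustrative assumptions, not necessarily the data or settings used in the original article.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Illustrative corpus; the article's own data set is not shown in this excerpt.
news = fetch_20newsgroups(subset='all')
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

for Vectorizer in (CountVectorizer, TfidfVectorizer):
    # stop_words='english' removes the built-in English stop-word list.
    vec = Vectorizer(analyzer='word', stop_words='english')
    X_train_vec = vec.fit_transform(X_train)   # learn the vocabulary on training text
    X_test_vec = vec.transform(X_test)         # reuse the same vocabulary on test text

    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)

    print(Vectorizer.__name__, 'accuracy:', clf.score(X_test_vec, y_test))
    print(classification_report(y_test, y_pred, target_names=news.target_names))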
Core code
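The core code shown below is the CountVectorizer implementation from scikit-learn itself (an older release, circa 0.19/0.20). Names that appear without definition, such as BaseEstimator, VectorizerMixin, _document_frequency, _make_int_array, and the np/sp/six aliases, are imports and helpers from sklearn.feature_extraction.text, the module this class is defined in.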
# class CountVectorizer, found at: sklearn.feature_extraction.text

class CountVectorizer(BaseEstimator, VectorizerMixin):
    """Convert a collection of text documents to a matrix of token counts

    This implementation produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

    If you do not provide an a-priori dictionary and you do not use an analyzer
    that does some kind of feature selection then the number of features will
    be equal to the vocabulary size found by analyzing the data.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.
        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.
        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char', 'char_wb'} or callable
        Whether the feature should be made of word or character n-grams.
        Option 'char_wb' creates character n-grams only from text inside
        word boundaries; n-grams at the edges of words are padded with space.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If 'english', a built-in stop word list for English is used.
        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.
        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, True by default
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp select tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents. Indices
        in the mapping should not be repeated and should not have any gap
        between 0 and the largest index.

    binary : boolean, default=False
        If True, all non zero counts are set to 1. This is useful for discrete
        probabilistic models that model binary events rather than integer
        counts.

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    stop_words_ : set
        Terms that were ignored because they either:
          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).
        This is only available if no vocabulary was given.

    See also
    --------
    HashingVectorizer, TfidfVectorizer

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None,
                 tokenizer=None, stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64):
        self.input = input
        self.encoding = encoding
        self.decode_error = decode_error
        self.strip_accents = strip_accents
        self.preprocessor = preprocessor
        self.tokenizer = tokenizer
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.token_pattern = token_pattern
        self.stop_words = stop_words
        self.max_df = max_df
        self.min_df = min_df
        if max_df < 0 or min_df < 0:
            raise ValueError("negative value for max_df or min_df")
        self.max_features = max_features
        if max_features is not None:
            if (not isinstance(max_features, numbers.Integral) or
                    max_features <= 0):
                raise ValueError(
                    "max_features=%r, neither a positive integer nor None"
                    % max_features)
        self.ngram_range = ngram_range
        self.vocabulary = vocabulary
        self.binary = binary
        self.dtype = dtype

    def _sort_features(self, X, vocabulary):
        """Sort features by name

        Returns a reordered matrix and modifies the vocabulary in place
        """
        sorted_features = sorted(six.iteritems(vocabulary))
        map_index = np.empty(len(sorted_features), dtype=np.int32)
        for new_val, (term, old_val) in enumerate(sorted_features):
            vocabulary[term] = new_val
            map_index[old_val] = new_val
        X.indices = map_index.take(X.indices, mode='clip')
        return X

    def _limit_features(self, X, vocabulary, high=None, low=None, limit=None):
        """Remove too rare or too common features.

        Prune features that are non zero in more samples than high or less
        documents than low, modifying the vocabulary, and restricting it to
        at most the limit most frequent.

        This does not prune samples with zero features.
        """
        if high is None and low is None and limit is None:
            return X, set()

        # Calculate a mask based on document frequencies
        dfs = _document_frequency(X)
        tfs = np.asarray(X.sum(axis=0)).ravel()
        mask = np.ones(len(dfs), dtype=bool)
        if high is not None:
            mask &= dfs <= high
        if low is not None:
            mask &= dfs >= low
        if limit is not None and mask.sum() > limit:
            mask_inds = (-tfs[mask]).argsort()[:limit]
            new_mask = np.zeros(len(dfs), dtype=bool)
            new_mask[np.where(mask)[0][mask_inds]] = True
            mask = new_mask

        new_indices = np.cumsum(mask) - 1  # maps old indices to new
        removed_terms = set()
        for term, old_index in list(six.iteritems(vocabulary)):
            if mask[old_index]:
                vocabulary[term] = new_indices[old_index]
            else:
                del vocabulary[term]
                removed_terms.add(term)
        kept_indices = np.where(mask)[0]
        if len(kept_indices) == 0:
            raise ValueError("After pruning, no terms remain. Try a lower"
                             " min_df or a higher max_df.")
        return X[:, kept_indices], removed_terms

    def _count_vocab(self, raw_documents, fixed_vocab):
        """Create sparse feature matrix, and vocabulary where fixed_vocab=False
        """
        if fixed_vocab:
            vocabulary = self.vocabulary_
        else:
            # Add a new value when a new vocabulary item is seen
            vocabulary = defaultdict()
            vocabulary.default_factory = vocabulary.__len__

        analyze = self.build_analyzer()
        j_indices = []
        indptr = _make_int_array()
        values = _make_int_array()
        indptr.append(0)
        for doc in raw_documents:
            feature_counter = {}
            for feature in analyze(doc):
                try:
                    feature_idx = vocabulary[feature]
                    if feature_idx not in feature_counter:
                        feature_counter[feature_idx] = 1
                    else:
                        feature_counter[feature_idx] += 1
                except KeyError:
                    # Ignore out-of-vocabulary items for fixed_vocab=True
                    continue

            j_indices.extend(feature_counter.keys())
            values.extend(feature_counter.values())
            indptr.append(len(j_indices))

        if not fixed_vocab:
            # disable defaultdict behaviour
            vocabulary = dict(vocabulary)
            if not vocabulary:
                raise ValueError("empty vocabulary; perhaps the documents only"
                                 " contain stop words")

        j_indices = np.asarray(j_indices, dtype=np.intc)
        indptr = np.frombuffer(indptr, dtype=np.intc)
        values = np.frombuffer(values, dtype=np.intc)

        X = sp.csr_matrix((values, j_indices, indptr),
                          shape=(len(indptr) - 1, len(vocabulary)),
                          dtype=self.dtype)
        X.sort_indices()
        return vocabulary, X

    def fit(self, raw_documents, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        self
        """
        self.fit_transform(raw_documents)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn the vocabulary dictionary and return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        X : array, [n_samples, n_features]
            Document-term matrix.
        """
        # We intentionally don't call the transform method to make
        # fit_transform overridable without unwanted side effects in
        # TfidfVectorizer.
        if isinstance(raw_documents, six.string_types):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        self._validate_vocabulary()
        max_df = self.max_df
        min_df = self.min_df
        max_features = self.max_features

        vocabulary, X = self._count_vocab(raw_documents,
                                          self.fixed_vocabulary_)

        if self.binary:
            X.data.fill(1)

        if not self.fixed_vocabulary_:
            X = self._sort_features(X, vocabulary)

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary

        return X

    def transform(self, raw_documents):
        """Transform documents to document-term matrix.

        Extract token counts out of raw text documents using the vocabulary
        fitted with fit or the one provided to the constructor.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Document-term matrix.
        """
        if isinstance(raw_documents, six.string_types):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        if not hasattr(self, 'vocabulary_'):
            self._validate_vocabulary()

        self._check_vocabulary()

        # use the same matrix-building strategy as fit_transform
        _, X = self._count_vocab(raw_documents, fixed_vocab=True)
        if self.binary:
            X.data.fill(1)
        return X

    def inverse_transform(self, X):
        """Return terms per document with nonzero entries in X.

        Parameters
        ----------
        X : {array, sparse matrix}, shape = [n_samples, n_features]

        Returns
        -------
        X_inv : list of arrays, len = n_samples
            List of arrays of terms.
        """
        self._check_vocabulary()

        if sp.issparse(X):
            # We need CSR format for fast row manipulations.
            X = X.tocsr()
        else:
            # We need to convert X to a matrix, so that the indexing
            # returns 2D objects
            X = np.asmatrix(X)
        n_samples = X.shape[0]

        terms = np.array(list(self.vocabulary_.keys()))
        indices = np.array(list(self.vocabulary_.values()))
        inverse_vocabulary = terms[np.argsort(indices)]

        return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()
                for i in range(n_samples)]

    def get_feature_names(self):
        """Array mapping from feature integer indices to feature name"""
        self._check_vocabulary()

        return [t for t, i in sorted(six.iteritems(self.vocabulary_),
                                     key=itemgetter(1))]
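As a quick, assumed usage example of the class above (not taken from the original article): stop_words='english' is the argument that performs the stop-word removal mentioned in the title, and vocabulary_ exposes the learned term-to-column mapping.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat"]

vec = CountVectorizer(stop_words='english')   # built-in English stop-word list
X = vec.fit_transform(docs)                   # sparse document-term count matrix

print(vec.vocabulary_)   # term -> column index, e.g. {'cat': 0, 'chased': 1, ...}
print(X.toarray())       # dense view of the token counts
# The class above defines get_feature_names(); recent scikit-learn releases
# rename it to get_feature_names_out().
print(vec.get_feature_names_out())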
Summary
The above is the full content of "ML - NB: Naive Bayes classification prediction and evaluation with CountVectorizer/TfidfVectorizer and stop-word removal" as collected and organized by 生活随笔; hopefully it helps you solve the problems you have run into.