machine learning - Input matches no features in training set; how much more training data do I need?


I am new to text mining and am working on a spam filter. I did the text cleaning, removed stop words, and used n-grams as features. I then built a frequency matrix and trained a model using naive Bayes. I have a limited set of training data, so I am facing the following problem.

When a sentence comes to me for classification and none of its features match the existing features in the training set, my frequency vector has only zeros.

When I send this vector for classification, I get a useless result.
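To make the problem concrete, here is a minimal sketch in plain Python, assuming a multinomial naive Bayes model on token counts with Laplace smoothing. The toy messages and the names `train_nb` and `predict_proba` are made up for illustration and are not taken from my actual code. When no token of the input is in the training vocabulary, no likelihood term contributes, so the posterior is just the class prior.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c, counts in token_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return vocab, log_prior, log_lik


def predict_proba(text, vocab, log_prior, log_lik):
    """Posterior over classes; tokens outside the training vocabulary are ignored."""
    known = [t for t in text.split() if t in vocab]
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in known) for c in log_prior}
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


# Hypothetical toy training set: 2 spam and 3 ham messages.
docs = [
    "win cash now",
    "cheap pills now",
    "meeting at noon",
    "lunch at noon tomorrow",
    "project status update",
]
labels = ["spam", "spam", "ham", "ham", "ham"]
model = train_nb(docs, labels)

# No token matches the vocabulary, so the feature vector is all zeros
# and the posterior equals the class priors (2/5 spam, 3/5 ham).
unseen = predict_proba("quarterly invoice attached", *model)
print(unseen)

# Tokens that were seen in training do move the posterior.
seen = predict_proba("win cash", *model)
print(seen)
```

So the classifier is not broken in this case; it simply has no evidence and falls back to whichever class is more frequent in the training data.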

What would be an ideal size of training data to expect better results?

Generally, the more data you have, the better, although you will hit diminishing returns at some point. It is often a good idea to check whether training set size is really your problem by plotting the cross-validation performance while varying the size of the training set. scikit-learn has an example of this type of "learning curve."

scikit-learn learning curve example

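A rough sketch of how such a learning curve could be computed with `sklearn.model_selection.learning_curve` follows. The synthetic corpus, the word lists, and the helper `make_message` are assumptions made only so the snippet runs on its own; you would substitute your cleaned messages and labels.

```python
import random

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in corpus; replace with real messages and labels.
rng = random.Random(0)
SPAM_WORDS = ["win", "cash", "prize", "free", "offer", "click", "cheap", "pills"]
HAM_WORDS = ["meeting", "lunch", "project", "report", "schedule", "team", "review", "notes"]


def make_message(own_words, other_words, length=6):
    """Draw each token from the class vocabulary about 80% of the time."""
    return " ".join(
        rng.choice(own_words if rng.random() < 0.8 else other_words)
        for _ in range(length)
    )


texts, labels = [], []
for i in range(200):
    if i % 2 == 0:
        texts.append(make_message(SPAM_WORDS, HAM_WORDS))
        labels.append(1)  # spam
    else:
        texts.append(make_message(HAM_WORDS, SPAM_WORDS))
        labels.append(0)  # ham
texts, labels = np.array(texts), np.array(labels)

# Vectorizer and classifier live in one pipeline so that the vocabulary
# is rebuilt from the training portion of every split.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

train_sizes, train_scores, valid_scores = learning_curve(
    model,
    texts,
    labels,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Mean accuracy across folds for each training set size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"{int(n):4d} samples  train={tr:.3f}  validation={va:.3f}")
```

If the validation score is still climbing at the largest size, more data is likely to help. If it has flattened out while the training score sits close to it, more data alone probably will not change much.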

You may also consider bringing in outside sample posts to increase the size of your training set.

As you grow the training set, you may want to try reducing the bias of your classifier. This could be done by adding more n-gram features, or by switching to a logistic regression or SVM model.
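One way to try these options side by side is to cross-validate a few pipelines on the same data. The sketch below is only an illustration: the synthetic corpus and `make_message` are made up so the code runs, and the scores it prints say nothing about how the models would rank on real spam.

```python
import random

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-in corpus; replace with real messages and labels.
rng = random.Random(0)
SPAM_WORDS = ["win", "cash", "prize", "free", "offer", "click", "cheap", "pills"]
HAM_WORDS = ["meeting", "lunch", "project", "report", "schedule", "team", "review", "notes"]


def make_message(own_words, other_words, length=6):
    """Draw each token from the class vocabulary about 80% of the time."""
    return " ".join(
        rng.choice(own_words if rng.random() < 0.8 else other_words)
        for _ in range(length)
    )


texts = np.array(
    [
        make_message(SPAM_WORDS, HAM_WORDS) if i % 2 == 0 else make_message(HAM_WORDS, SPAM_WORDS)
        for i in range(200)
    ]
)
labels = np.array([1 if i % 2 == 0 else 0 for i in range(200)])  # 1 = spam, 0 = ham

# Same features and data, different model families and n-gram ranges.
candidates = {
    "naive Bayes, unigrams": make_pipeline(
        CountVectorizer(ngram_range=(1, 1)), MultinomialNB()
    ),
    "naive Bayes, 1-2 grams": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), MultinomialNB()
    ),
    "logistic regression, 1-2 grams": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000)
    ),
    "linear SVM, 1-2 grams": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), SVC(kernel="linear")
    ),
}

results = {
    name: cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy").mean()
    for name, pipeline in candidates.items()
}

for name, score in results.items():
    print(f"{name:32s} mean accuracy = {score:.3f}")
```

Keep in mind that the more flexible models tend to need more data before they pay off, so it is worth rerunning the comparison as the training set grows.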

