
Sklearn classifier and Flask problem

I have been trying to self-host, with Apache, a sklearn classifier that I put together. I ended up using joblib to serialize the saved model and then load it in a Flask app. The app works perfectly when run with Flask's built-in development server, but when I set it up on a Debian 9 Apache server I get a 500 error. Digging into Apache's error.log, I get:

AttributeError: module '__main__' has no attribute 'tokenize' 

Now, this is interesting to me, because the tokenizer I wrote myself gave the web app no problems when I ran it locally. Also, the saved model I am using was trained on the web server itself, so slightly different library versions should not be the issue.
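
My (possibly wrong) understanding is that pickle/joblib stores a module-level function by reference, i.e. by its module and name, so a tokenizer defined in the training script itself gets recorded as __main__.tokenize. A minimal, self-contained illustration of that behaviour (not my actual code, just a sketch):

import pickle

def tokenize(text):
    return text.split()

if __name__ == "__main__":
    # Run as a script, the function's module is '__main__', and that is
    # what pickle records -- not the file the function was defined in.
    blob = pickle.dumps(tokenize)
    print(tokenize.__module__)    # '__main__'
    print(b"__main__" in blob)    # True
    # Unpickling 'blob' in a process whose __main__ has no 'tokenize'
    # fails with an AttributeError like the one above.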

The code for my web app is:

import re 
import sys 

from flask import Flask, request, render_template 
from nltk import word_tokenize 
from nltk.stem.wordnet import WordNetLemmatizer 
from sklearn.externals import joblib 

app = Flask(__name__) 



def tokenize(text): 
    # text = text.translate(str.maketrans('','',string.punctuation)) 
    text = re.sub(r'\W+', ' ', text) 
    tokens = word_tokenize(text) 
    lemas = [] 
    for item in tokens: 
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas 

@app.route('/') 
def home(): 
    return render_template('home.html') 

@app.route('/analyze',methods=['POST','GET']) 
def analyze(): 
    if request.method == 'POST':
        result = request.form
        input_text = result['input_text']

        clf = joblib.load("model.pkl.z")
        parameters = clf.named_steps['clf'].get_params()
        predicted = clf.predict([input_text])
        # print(predicted)
        certainty = clf.decision_function([input_text])

        # Is it bonkers?
        if predicted[0]:
            verdict = "Not too nuts!"
        else:
            verdict = "Bonkers!"

        return render_template('result.html', prediction=[input_text, verdict, float(certainty), parameters])

if __name__ == '__main__': 
    #app.debug = True 
    app.run() 

Along with the .wsgi file:

import sys 
sys.path.append('/var/www/mysite') 

from conspiracydetector import app as application 

Also, I trained the model with this code:

import logging 
import pprint # Pretty stuff 
import re 
import sys # For command line arguments 
from time import time # to show progress 

import numpy as np 
from nltk import word_tokenize 
from nltk.stem.wordnet import WordNetLemmatizer 
from sklearn import metrics 
from sklearn.datasets import load_files 
from sklearn.externals import joblib # In order to save 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import Pipeline 
from sklearn.svm import LinearSVC 

# Tokenizer that does stemming and strips punctuation 
def tokenize(text): 
    # text = text.translate(str.maketrans('','',string.punctuation)) 
    text = re.sub(r'\W+', ' ', text) 
    tokens = word_tokenize(text) 
    lemas = [] 
    for item in tokens: 
        lemas.append(WordNetLemmatizer().lemmatize(item))
    return lemas 

if __name__ == "__main__": 
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected 
    # block to be able to use a multi-core grid search that also works under 
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows 
    # The multiprocessing module is used as the backend of joblib.Parallel 
    # that is used when n_jobs != 1 in GridSearchCV 

    # Display progress logs on stdout 
    print("Initializing...") 
    # Command line arguments 
    save = sys.argv[1] 
    training_directory = sys.argv[2] 

    logging.basicConfig(level=logging.INFO, 
         format='%(asctime)s %(levelname)s %(message)s') 

    dataset = load_files(training_directory, shuffle=False) 
    print("n_samples: %d" % len(dataset.data)) 

    # split the dataset in training and test set: 
    print("Splitting the dataset in training and test set...") 
    docs_train, docs_test, y_train, y_test = train_test_split(
     dataset.data, dataset.target, test_size=0.25, random_state=None) 

    # Build a vectorizer/classifier pipeline that filters out tokens 
    # that are too rare or too frequent 
    # Also remove stop words 
    print("Loading list of stop words...") 
    with open('stopwords.txt', 'r') as f:
        words = [line.strip() for line in f]

    print("Stop words list loaded...") 
    print("Setting up pipeline...") 
    pipeline = Pipeline([
        # ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
        ('vect', TfidfVectorizer(tokenizer=tokenize, stop_words=words,
                                 min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
        ('clf', LinearSVC(C=5000)),
    ])

    print("Pipeline:", [name for name, _ in pipeline.steps]) 

    # Build a grid search to find out whether unigrams or bigrams are 
    # more useful. 
    # Fit the pipeline on the training set using grid search for the parameters 
    print("Initializing grid search...") 

    # uncommenting more parameters will give better exploring power but will 
    # increase processing time in a combinatorial way 
    parameters = { 
     # 'vect__ngram_range': [(1, 1), (1, 2)], 
     # 'vect__min_df': (0.0005, 0.001), 
     # 'vect__max_df': (0.25, 0.5), 
     # 'clf__C': (10, 15, 20), 
    } 
    print("Parameters:") 
    pprint.pprint(parameters) 
    grid_search = GridSearchCV(
     pipeline, 
     parameters, 
     n_jobs=-1, 
     verbose=True) 

    print("Training and performing grid search...\n") 
    t0 = time() 
    grid_search.fit(docs_train, y_train) 
    print("\nDone in %0.3fs!\n" % (time() - t0)) 

    # Print the mean and std for each candidate along with the parameter 
    # settings for all the candidates explored by grid search. 
    n_candidates = len(grid_search.cv_results_['params']) 
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the testing set and store it in a variable 
    # named y_predicted 
    print("\nRunning against testing set...\n") 
    y_predicted = grid_search.predict(docs_test) 

    # Save model 
    print("\nSaving model to", save, "...") 
    joblib.dump(grid_search.best_estimator_, save) 
    print("Model Saved! \nPrepare for some awesome stats!") 

I have to admit I am pretty stumped. After tweaking, searching, and making sure my server is configured correctly, I feel that perhaps someone here might be able to help. Any help is appreciated, and if there is any information I need to provide, please let me know and I will be happy to.

Also, I am running:

  • Python 3.5.3 with NLTK and sklearn.

Is there any more information I need to add? –

Answer


I solved this problem, though not perfectly, by removing my custom tokenizer and falling back to sklearn's default one.
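
In other words, the workaround amounted to dropping the tokenizer= argument from the vectorizer, roughly like this (a sketch of the change, not my exact code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stop-word list loaded the same way as in the training script above
with open('stopwords.txt', 'r') as f:
    words = [line.strip() for line in f]

# Same pipeline, minus tokenizer=tokenize: TfidfVectorizer's built-in
# tokenization is used, so nothing defined in __main__ ends up inside
# the pickled model.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words=words, min_df=0.001,
                             max_df=0.5, ngram_range=(1, 1))),
    ('clf', LinearSVC(C=5000)),
])

With no custom function referenced by the pipeline, joblib no longer has to look up __main__.tokenize when the model is unpickled under Apache, so the 500 error goes away.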

However, I am still in the dark about how to integrate my own tokenizer.