Twitter - how to train a MaxEnt classifier
[Project stack: Java, OpenNLP, Elasticsearch (datastore), Twitter4J to read data from Twitter]
I intend to use a MaxEnt classifier to classify tweets. I understand that the initial step is to train the model. From the documentation I found that we have a GISTrainer-based train method to train the model. I have managed to put together a simple piece of code that uses OpenNLP's MaxEnt classifier to train the model and predict the outcome.
I have used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt:
positive positive
positive best
positive fantastic
positive super
positive fine
positive nice
Contents of negative.txt:
negative bad
negative ugly
negative worst
negative worse
negative sucks
And the Java methods below generate the outcome:
@Override
public void trainDataset(String source, String destination) throws Exception {
    // Trains on both positive.txt and negative.txt
    File[] inputFiles = FileUtil.buildFileList(new File(source));
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    model.serialize(new FileOutputStream(modelFile));
}

@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs = categorizer.categorize(tokens);
        if (null != probs && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != modelStream) {
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPositive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximumEntropyClassifier me = new MaximumEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("this is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I have the following questions:
1) How do I train the model iteratively? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that the file needs to have a minimum of two words separated by a tab. Is my understanding valid?

2) Are there any publicly available data sets I can use to train the model? I found some sources for movie reviews, but the project I'm working on involves not just movie reviews but other things such as product reviews, brand sentiments, etc.

3) This helps to an extent, but is there a working example publicly available somewhere? I couldn't find documentation for maxent.
Please help me out. I'm kind of blocked on this.
Answer:

1) You can store the samples in a database. I used Accumulo once for this. At some interval, rebuild the model and reprocess your data (a rough sketch of the rebuild step follows below).

2) The format is: categoryname space sample newline. No tabs.

3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder or regex to find the entity, or you could add the entity to your doccat class labels (include the product name etc.), in which case your samples would have to be specific to that entity (see the second sketch below).
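To make points 1 and 2 concrete, here is a minimal sketch of the rebuild step, assuming OpenNLP 1.5.x and a hypothetical tweets.train file that follows the "categoryname space sample newline" format (e.g. a line like "negative worst product ever"). You could regenerate this file from your datastore on a schedule and retrain:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class ModelRebuilder {

    // Rebuilds the doccat model from a plain-text training file where each
    // line is "<category> <sample text>", e.g. "negative worst product ever".
    public static void rebuild(String trainingFile, String modelPath) throws Exception {
        InputStream dataIn = new FileInputStream(trainingFile);
        try {
            ObjectStream<String> lines = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

            // Cutoff 5 drops features seen fewer than 5 times; lower it for tiny data sets.
            DoccatModel model = DocumentCategorizerME.train("en", samples, 5, 100);

            OutputStream modelOut = new FileOutputStream(modelPath);
            try {
                model.serialize(modelOut);
            } finally {
                modelOut.close();
            }
        } finally {
            dataIn.close();
        }
    }
}

DocumentSampleStream splits each line on whitespace and treats the first token as the category and the remaining tokens as the document text, which is why a tab is not required.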
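For point 3, a rough sketch of the name-finder route, assuming you have trained your own sentiment and product name-finder models (the file names sentiment.bin and en-ner-product.bin are placeholders for illustration, not models shipped with OpenNLP):

import java.io.FileInputStream;
import java.util.Arrays;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class EntitySentiment {

    public static void main(String[] args) throws Exception {
        // Placeholder model files: both would be models you trained yourself.
        DoccatModel doccat = new DoccatModel(new FileInputStream("sentiment.bin"));
        TokenNameFinderModel nerModel =
                new TokenNameFinderModel(new FileInputStream("en-ner-product.bin"));

        DocumentCategorizerME categorizer = new DocumentCategorizerME(doccat);
        NameFinderME finder = new NameFinderME(nerModel);

        String tweet = "the camera on this phone is fantastic";
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(tweet);

        // General sentiment of the whole tweet.
        String sentiment = categorizer.getBestCategory(categorizer.categorize(tokens));

        // Attach that sentiment to each entity mention the name finder returns.
        for (Span span : finder.find(tokens)) {
            String[] mention = Arrays.copyOfRange(tokens, span.getStart(), span.getEnd());
            System.out.println(String.join(" ", mention) + " -> " + sentiment);
        }
    }
}

The alternative mentioned above is to fold the entity into the doccat labels themselves (e.g. training lines like "positive-phoneX great battery life"), at the cost of needing labeled samples for every entity you care about.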