Each annotated named entity is enclosed between <START> and <END> markers, and the marker also records the entity's type. Next comes training the named entity recognition model. Code first:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;
/**
 * Training component for a Chinese named entity recognition model.
 *
 * @author ddlovehy
 */
public class NamedEntityMultiFindTrainer {

    // Default parameters
    private int iterations = 80;
    private int cutoff = 5;
    private String langCode = "general";
    private String type = "default";

    // Parameters the caller must supply
    private String nameWordsPath; // path to the named-entity lexicon
    private String dataPath; // path to the word-segmented training corpus
    private String modelPath; // path where the trained model is stored

    public NamedEntityMultiFindTrainer() {
    }

    public NamedEntityMultiFindTrainer(String nameWordsPath, String dataPath,
            String modelPath) {
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    public NamedEntityMultiFindTrainer(int iterations, int cutoff,
            String langCode, String type, String nameWordsPath,
            String dataPath, String modelPath) {
        this.iterations = iterations;
        this.cutoff = cutoff;
        this.langCode = langCode;
        this.type = type;
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    /**
     * Builds the custom feature generators.
     *
     * @return the aggregated feature generator passed to the trainer
     */
    public AggregatedFeatureGenerator prodFeatureGenerators() {
        AggregatedFeatureGenerator featureGenerators = new AggregatedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(), 2, 2),
                new PreviousMapFeatureGenerator());
        return featureGenerators;
    }

    /**
     * Writes the model to disk.
     *
     * @param model the trained model
     * @throws Exception if the model file cannot be written
     */
    public void writeModelIntoDisk(TokenNameFinderModel model) throws Exception {
        File outModelFile = new File(this.getModelPath());
        // try-with-resources closes the stream even if serialization fails
        try (FileOutputStream outModelStream = new FileOutputStream(outModelFile)) {
            model.serialize(outModelStream);
        }
    }

    /**
     * Reads the annotated training corpus.
     *
     * @return the annotated training data as a single string
     * @throws Exception if the corpus cannot be produced
     */
    public String getTrainCorpusDataStr() throws Exception {
        // TODO consider loading persisted annotated data directly, and incremental training
        // NameEntityTextFactory is defined elsewhere in the project and is not shown here
        String trainDataStr = NameEntityTextFactory.prodNameFindTrainText(
                this.getNameWordsPath(), this.getDataPath(), null);
        return trainDataStr;
    }

    /**
     * Trains the model.
     *
     * @param trainDataStr
     *            the whole annotated training data as a single string
     * @return the trained name finder model
     * @throws Exception if training fails
     */
    public TokenNameFinderModel trainNameEntitySamples(String trainDataStr)
            throws Exception {
        ObjectStream<NameSample> nameEntitySample = new NameSampleDataStream(
                new PlainTextByLineStream(new StringReader(trainDataStr)));
        System.out.println("**************************************");
        System.out.println(trainDataStr);
        TokenNameFinderModel nameFinderModel = NameFinderME.train(
                this.getLangCode(), this.getType(), nameEntitySample,
                this.prodFeatureGenerators(),
                Collections.<String, Object> emptyMap(), this.getIterations(),
                this.getCutoff());
        return nameFinderModel;
    }

    /**
     * Entry point of the training component.
     *
     * @return true if the model was trained and saved, false otherwise
     */
    public boolean execNameFindTrainer() {
        try {
            String trainDataStr = this.getTrainCorpusDataStr();
            TokenNameFinderModel nameFinderModel = this
                    .trainNameEntitySamples(trainDataStr);
            // System.out.println(nameFinderModel);
            this.writeModelIntoDisk(nameFinderModel);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    // Getters called by the methods above (setters omitted)
    public int getIterations() { return iterations; }
    public int getCutoff() { return cutoff; }
    public String getLangCode() { return langCode; }
    public String getType() { return type; }
    public String getNameWordsPath() { return nameWordsPath; }
    public String getDataPath() { return dataPath; }
    public String getModelPath() { return modelPath; }
}
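Notes:
Parameters: iterations is the number of iterations the training algorithm runs. With too few the model is undertrained, and with too many it may overfit, so try a few values and compare the results yourself.
cutoff: the minimum number of times a feature must occur in the training data for the model to use it. Setting it to 5 is usually fine. A lower value keeps more rare features, which makes the model larger and training slower.
langCode and type: the language code and the entity type. Because there is no code dedicated to Chinese here, langCode is set to "general". Because we want a single model that recognizes several kinds of entities, type is set to "default".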
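To make the role of a frequency cutoff concrete, here is a minimal sketch in plain Java. It is not OpenNLP code: the class name CutoffSketch, the method keepFrequentFeatures, and the feature strings are all made up for illustration, and the exact comparison OpenNLP applies internally is assumed rather than quoted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only, not OpenNLP code: mimics the idea of a frequency
// cutoff by dropping features seen fewer than `cutoff` times.
class CutoffSketch {

    // Returns the features that occur at least `cutoff` times, in first-seen order.
    static List<String> keepFrequentFeatures(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature strings collected while scanning a corpus
        List<String> observed = Arrays.asList(
                "w=beijing", "w=beijing", "w=daxue", "w=beijing", "w=daxue", "w=rare");
        System.out.println(keepFrequentFeatures(observed, 2));
    }
}
```

With a cutoff of 2 the feature that appears only once is discarded, which is why a higher cutoff gives a smaller model and a lower one keeps more rare evidence.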
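Explanation:
prodFeatureGenerators() builds the custom feature generator, which decides how much context around each token the model gets to see. The code wraps TokenFeatureGenerator and TokenClassFeatureGenerator in a WindowFeatureGenerator with two tokens of context on each side, so features are computed over a window of five tokens (two before the candidate token, two after it, plus the token itself), and it also adds a PreviousMapFeatureGenerator. There may be a deeper and more precise interpretation, so corrections are welcome.
trainNameEntitySamples() is the core of training. It first wraps the annotated training corpus string described above in a character stream, then calls NameFinderME.train() with the parameters set above, the custom feature generator, and so on. For the resources map argument, passing an empty Map as the default is enough.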
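The window used in prodFeatureGenerators() can be pictured with a small hand-written stand-in. This sketch is not the OpenNLP implementation: the class WindowSketch, the method windowFeatures, and the "w[offset]=token" feature naming are assumptions chosen only to show which tokens a 2/2 window covers.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hand-written stand-in for the idea behind a windowed
// token feature, not the OpenNLP implementation. Feature names are made up.
class WindowSketch {

    // Collects the token at `index` plus up to `prev` tokens before it
    // and up to `next` tokens after it.
    static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        for (int offset = -prev; offset <= next; offset++) {
            int position = index + offset;
            if (position < 0 || position >= tokens.length) {
                continue; // the window is clipped at sentence boundaries
            }
            features.add("w[" + offset + "]=" + tokens[position]);
        }
        return features;
    }

    public static void main(String[] args) {
        // A hypothetical, already segmented sentence (romanized for readability)
        String[] tokens = {"wo", "zai", "Beijing", "Daxue", "dushu"};
        System.out.println(windowFeatures(tokens, 2, 2, 2));
    }
}
```

For a token in the middle of a sentence this yields five features, matching the "two before, two after, plus itself" reading above. Near the start or end of a sentence the window is simply shorter.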
The source code is available at https://github.com/Ailab403/ailab-mltk4j. The test package contains a complete demo of how to call the trainer, and the file folder holds the test corpus and models that have already been trained.
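For readers who want to see what one line of the annotated training text looks like, the following sketch builds a sentence in the name finder training format, where tokens are separated by spaces and an entity span is wrapped in <START:type> and <END>. The class AnnotationSketch, the method annotate, the sample sentence, and the type name "organization" are hypothetical. The project's own NameEntityTextFactory may build its text differently.

```java
// Illustration only: builds one training line by hand. The sentence and the
// entity type name are made up for this example.
class AnnotationSketch {

    // Wraps tokens[start] .. tokens[end - 1] in <START:type> ... <END>,
    // joining all tokens with single spaces.
    static String annotate(String[] tokens, int start, int end, String type) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                line.append(' ');
            }
            if (i == start) {
                line.append("<START:").append(type).append("> ");
            }
            line.append(tokens[i]);
            if (i == end - 1) {
                line.append(" <END>");
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "visited", "Peking", "University", "yesterday"};
        System.out.println(annotate(tokens, 2, 4, "organization"));
    }
}
```

Each such line is one sentence of training data. A string made of many of these lines is the kind of input that trainNameEntitySamples() reads line by line.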