Clinical Records of Traditional Chinese Medicine

Corpus construction is a fundamental and indispensable task in exploring NLP techniques for the automatic recognition of valuable TCM knowledge. In this paper, we have presented a method for building a fine-grained annotated entity corpus based on TCM clinical case records. The method comprises detailed steps and their implementation: data selection, draft guideline development, iterative annotation for guideline updating, consistency assessment, and corpus construction. Our annotation work ultimately achieved a high inter-annotator agreement (IAA) value, indicating that the approach is effective and that the corpus is of high quality. This work lays a solid foundation for future TCM corpus construction and NER research.

Our work still has some shortcomings; in particular, the entity types covered are not comprehensive. Because of time constraints, we could not annotate all entities present in our dataset. In the future, we will annotate more entity types, such as symptoms and prescriptions, to enrich the guideline and the corpus using the method introduced in this paper. More types of TCM clinical records from different sources will also be annotated to improve the applicability of the corpus. Furthermore, based on the corpus, we will develop corresponding algorithms to support NLP techniques. Finally, in-depth study of polysemy, abbreviations, and relationships between entities will also be a focus of our future research.
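As a hedged illustration of the consistency assessment step, the following minimal sketch computes an exact-match pairwise F1 score between two annotators' entity sets, which is one common way to quantify IAA for entity annotation. This section does not state which IAA measure was actually used, so the metric, the function name `pairwise_f1`, the span offsets, and the labels `TYPE_A` and `TYPE_B` are all assumptions made purely for illustration.

```python
from typing import Set, Tuple

# An entity is (start offset, end offset, entity type).
# Offsets and type labels below are hypothetical, not taken from the corpus.
Entity = Tuple[int, int, str]


def pairwise_f1(a: Set[Entity], b: Set[Entity]) -> float:
    """Exact-match F1 between two annotators' entity sets.

    Annotator A is treated as the reference; an entity counts as agreed
    only if its span boundaries and its type are both identical.
    """
    if not a and not b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same record by two annotators.
annotator_a = {(0, 2, "TYPE_A"), (5, 9, "TYPE_B"), (12, 15, "TYPE_A")}
annotator_b = {(0, 2, "TYPE_A"), (5, 9, "TYPE_B"), (12, 16, "TYPE_A")}

score = pairwise_f1(annotator_a, annotator_b)
print(f"pairwise F1 = {score:.3f}")
```

In this toy case the annotators agree on two of three entities and disagree on one span boundary. Exact matching is a strict criterion; a relaxed variant that accepts overlapping spans would give a higher score on the same data.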