[Elasticsearch] Korean morphological analyzer nori

Korean morphological analyzer nori

Introduction

Installation

Install the plugin with elasticsearch-plugin. It must be installed on every node in the cluster, and each node has to be restarted after installation.

elasticsearch-plugin install analysis-nori
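Since the plugin has to be present on every node, it can help to compare the node list against the output of the `_cat/plugins` API. The following is only a minimal sketch under stated assumptions: it parses text you have already fetched (for example with `curl "$HOSTNAME:9200/_cat/plugins"`), it assumes the default column order (node name, component, version) and node names without spaces, and the helper name, node names, and version numbers are hypothetical.

```python
def nodes_missing_plugin(cat_plugins_text, node_names, plugin="analysis-nori"):
    """Return the sorted node names that do not list `plugin`."""
    nodes_with_plugin = set()
    for line in cat_plugins_text.splitlines():
        columns = line.split()
        # Assumed default _cat/plugins columns: node name, component, version.
        if len(columns) >= 2 and columns[1] == plugin:
            nodes_with_plugin.add(columns[0])
    return sorted(set(node_names) - nodes_with_plugin)


# Hypothetical sample output for a three-node cluster.
sample = """node-1 analysis-nori 6.4.0
node-2 analysis-nori 6.4.0
node-3 analysis-icu 6.4.0
"""

print(nodes_missing_plugin(sample, ["node-1", "node-2", "node-3"]))
```

Any node name this prints still needs the plugin installed and a restart.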

Analysis

You can test the analysis on any text you want to run through the morphological analyzer, as follows.

curl -X GET "$HOSTNAME:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
'

The analysis result is returned as follows.

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "nori_tokenizer",
      "tokens" : [
        {
          "token" : "뿌리",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0,
          "leftPOS" : "NNG(General Noun)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "NNG(General Noun)"
        },
        {
          "token" : "가",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 1,
          "leftPOS" : "J(Ending Particle)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "J(Ending Particle)"
        },
        {
          "token" : "깊",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 2,
          "leftPOS" : "VA(Adjective)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "VA(Adjective)"
        },
        {
          "token" : "은",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "word",
          "position" : 3,
          "leftPOS" : "E(Verbal endings)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "E(Verbal endings)"
        },
        {
          "token" : "나무",
          "start_offset" : 7,
          "end_offset" : 9,
          "type" : "word",
          "position" : 4,
          "leftPOS" : "NNG(General Noun)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "NNG(General Noun)"
        },
        {
          "token" : "는",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "word",
          "position" : 5,
          "leftPOS" : "J(Ending Particle)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "reading" : null,
          "rightPOS" : "J(Ending Particle)"
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
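The explain output is verbose, so when checking several sentences it may be easier to reduce it to token and part-of-speech pairs. The sketch below assumes the response shape shown above (`detail.tokenizer.tokens[]` with a `leftPOS` attribute); the function name is hypothetical and the embedded JSON is a trimmed copy of the first two tokens.

```python
import json


def token_pos_pairs(analyze_response):
    """Return (token, POS tag) pairs from an _analyze response with explain=true."""
    tokens = analyze_response["detail"]["tokenizer"]["tokens"]
    pairs = []
    for tok in tokens:
        # leftPOS looks like "NNG(General Noun)"; keep only the tag before "(".
        tag = tok["leftPOS"].split("(", 1)[0]
        pairs.append((tok["token"], tag))
    return pairs


# Trimmed copy of the response above (first two tokens only).
response_text = """
{"detail": {"tokenizer": {"name": "nori_tokenizer", "tokens": [
  {"token": "뿌리", "position": 0, "leftPOS": "NNG(General Noun)"},
  {"token": "가", "position": 1, "leftPOS": "J(Ending Particle)"}
]}}}
"""

pairs = token_pos_pairs(json.loads(response_text))
print(pairs)
```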

Applying the analyzer

Creating the index

Define a custom analyzer as shown below, then assign that analyzer to each text field you want morphological analysis applied to.

curl -X PUT "$HOSTNAME:9200/_template/template_ims?pretty" -H 'Content-Type: application/json' -d'
{
  "template": "nori-ims",
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nori_korean": {
            "type": "custom",
            "tokenizer": "nori_tokenizer"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "issue_title": {
          "type": "text",
          "analyzer": "nori_korean"
        },
        "issue_number": {
          "type": "integer"
        },
        .
        .
        .
        "Closed Date": {
          "type": "date",
          "format": "yyyy/MM/dd HH:mm:ss"
        },
        "issue_details": {
          "type": "text",
          "analyzer": "nori_korean"
        },
        "actions": {
          "type": "text",
          "analyzer": "nori_korean"
        }
      }
    }
  }
}
'
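A misspelled analyzer name in a mapping is easy to overlook in a template of this size. As a rough pre-flight check, the sketch below walks a template body and reports fields whose `analyzer` is not defined under `settings.index.analysis.analyzer`. It assumes the layout used above (a single `doc` mapping type with flat `properties`), the function name and the deliberately misspelled sample are hypothetical, and built-in analyzers such as `standard` would also be reported because they are not listed in the settings.

```python
def undefined_analyzers(template_body, mapping_type="doc"):
    """Return {field: analyzer} for fields whose analyzer is not defined in settings."""
    defined = set(
        template_body.get("settings", {})
        .get("index", {})
        .get("analysis", {})
        .get("analyzer", {})
    )
    properties = template_body["mappings"][mapping_type]["properties"]
    return {
        field: spec["analyzer"]
        for field, spec in properties.items()
        if "analyzer" in spec and spec["analyzer"] not in defined
    }


# Hypothetical template body with one misspelled analyzer name.
template_body = {
    "settings": {"index": {"analysis": {"analyzer": {
        "nori_korean": {"type": "custom", "tokenizer": "nori_tokenizer"}
    }}}},
    "mappings": {"doc": {"properties": {
        "issue_title": {"type": "text", "analyzer": "nori_korean"},
        "issue_number": {"type": "integer"},
        "actions": {"type": "text", "analyzer": "nori_koren"},
    }}},
}

print(undefined_analyzers(template_body))
```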

After indexing data into that index, you can check how the text of a document with a specific _id was analyzed.

curl -X POST "$HOSTNAME:9200/target_index_name/target_doc_name/doc_id/_termvectors?fields=field_name&pretty=true"
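The term vectors response groups tokens by term rather than by order of appearance, which makes it harder to compare with the original sentence. The following sketch rebuilds the token order from the `position` values. It assumes the usual response shape (`term_vectors.<field>.terms.<term>.tokens[]`) with positions included; the function name, the field name, and the sample response are hypothetical.

```python
def terms_by_position(termvectors_response, field):
    """Return the indexed terms of `field`, ordered by token position."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    positioned = []
    for term, info in terms.items():
        # A term that occurs several times has one entry per occurrence.
        for tok in info.get("tokens", []):
            positioned.append((tok["position"], term))
    return [term for _, term in sorted(positioned)]


# Hypothetical, trimmed response for a field named "issue_title".
sample_response = {
    "term_vectors": {
        "issue_title": {
            "terms": {
                "나무": {"term_freq": 1, "tokens": [
                    {"position": 4, "start_offset": 7, "end_offset": 9}]},
                "뿌리": {"term_freq": 1, "tokens": [
                    {"position": 0, "start_offset": 0, "end_offset": 2}]},
                "깊": {"term_freq": 1, "tokens": [
                    {"position": 2, "start_offset": 4, "end_offset": 5}]},
            }
        }
    }
}

print(terms_by_position(sample_response, "issue_title"))
```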
