[Elasticsearch] Analyzer, Tokenizer, and Filter: Concepts and Usage

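When you create an index in Elasticsearch, it is a good idea to define its settings and mappings explicitly. The settings hold the analyzers, tokenizers, and filters that the index will use. The mappings declare the fields and their types, and each text field can be assigned an analyzer defined in settings, which makes search more accurate.

Before creating a new index, it helps to use the _analyze API to see how an arbitrary piece of text is processed by a tokenizer and filters. It is good practice for seeing how text is analyzed into the terms that Elasticsearch finally stores. Whatever combination you try in _analyze can later be applied to an index as an analyzer, so you can experiment until the output looks the way you want and then build an analyzer from it.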

_analyze vs analyzer

* _analyze = an API for testing analyzers, tokenizers, and filters

* analyzer = the component that actually processes text before it is stored in the index

1. Understanding analyzers with _analyze

_analyze example

GET _analyze
{
  "text": "The blue light is getting smaller quickly.",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "stop",
    "snowball"
  ]
}

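* Only one tokenizer can be applied, which is why its value is a single "" string rather than a [] array. "whitespace" splits the text on whitespace.

* Several filters can be applied, so they go inside a [] array.

Filters run in the order in which they appear in the array, so it is better to put lowercase first and stop (stopword removal) after it.

e.g. when "the" is a stopword and stop runs first, "The" is not removed; lowercase then turns it into "the", which remains in the result.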
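The effect of filter order can be reproduced outside Elasticsearch with a small Python sketch. This is a toy model, not Elasticsearch code: the function names and the two-word stopword set are assumptions made for illustration.

```python
# Toy model of the pipeline above; NOT Elasticsearch code.
STOPWORDS = {"the", "is"}  # assumed subset of the English stopword list


def whitespace_tokenize(text):
    """Split on whitespace, like the whitespace tokenizer."""
    return text.split()


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stop(tokens):
    """Case-sensitive stopword removal."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Tokenize, then apply the filters in list order."""
    tokens = whitespace_tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The blue light is getting smaller quickly."
lowercase_first = analyze(text, [lowercase, stop])
stop_first = analyze(text, [stop, lowercase])
print(lowercase_first)
print(stop_first)
```

With lowercase first, "The" becomes "the" and is removed. With stop first, "The" does not match the lowercase stopword, survives, and is only lowercased afterwards.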

Result:

{
  "tokens" : [
    {
      "token" : "blue",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "light",
      "start_offset" : 9,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "get",
      "start_offset" : 18,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "smaller",
      "start_offset" : 26,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "quickly.",
      "start_offset" : 34,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    }
  ]
}
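The offsets and positions in this result can be traced with a short Python sketch. It is a simplified model under stated assumptions: the stopword set is a two-word subset, the helper name is hypothetical, and stemming is left out, so the token for "getting" is not reduced to "get" here.

```python
import re

STOPWORDS = {"the", "is"}  # assumed subset of the default English stopwords


def analyze_with_positions(text):
    """Whitespace tokenizer + lowercase + stop, keeping offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        term = match.group().lower()
        if term in STOPWORDS:
            continue  # the token is dropped, but its position is still used up
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


tokens = analyze_with_positions("The blue light is getting smaller quickly.")
for token in tokens:
    print(token)
```

Positions 0 and 3 are missing for the same reason as in the real output: "The" and "is" were counted by the tokenizer and then removed by the stop filter.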

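If you use this combination as an analyzer in a mapping, instead of only testing it with _analyze, and index the text above, five terms are stored in the Elasticsearch index: blue, light, get, smaller, and quickly. (the period stays attached because the whitespace tokenizer splits only on whitespace). When you search with a "match" query, the query text is run through the same analyzer before it is compared with the stored terms.

Because the snowball filter reduces getting to get, a match query for gets, getting, or getted also finds the document. got and gotten do not match, though: the stemmer strips suffixes by rule, so irregular forms like these are not reduced to get.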
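Why regular forms collapse to one term while irregular ones do not can be illustrated with a deliberately tiny suffix stripper. This is a toy written for this post, not the Snowball algorithm; the function name and its three suffix rules are assumptions.

```python
def toy_stem(word):
    """Strip a few common suffixes. A simplification, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    # Undouble a trailing consonant pair such as "gett" -> "get".
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


words = ["get", "gets", "getting", "getted", "got", "gotten"]
stems = {w: toy_stem(w) for w in words}
print(stems)
```

Rules like these only look at the end of a word, so nothing connects got or gotten to get. Mapping irregular forms would need a dictionary, which a rule-based stemmer does not have.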

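A similar combination is available as the built-in "snowball" analyzer, which saves you from listing the tokenizer and filters yourself. According to the older Elasticsearch reference it uses the standard tokenizer rather than whitespace, so details such as the trailing period on "quickly." can come out differently.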

GET _analyze
{
  "text": "The blue light is getting smaller quickly.",
  "analyzer": "snowball"
}

analyzer == (tokenizer + filters)

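That is the formula we end up with. (Character filters, which are not used in this post, can also run before the tokenizer.)

2. Using an analyzer

Besides the analyzers that already exist, as above, you can also build your own analyzer from a tokenizer and filters of your choice.

First, set an analyzer on a field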

PUT test_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "snowball"
      }
    }
  }
}

Under mappings > properties > field name we already set type; now add "analyzer": <analyzer to use> at the same depth as type.

Index a document

PUT test_index/_doc/1
{
  "message": "The blue light is getting smaller quickly."
}

Search

GET test_index/_search
{
  "query": {
    "match": {
      "message": "getted"
    }
  }
}

get, gets, getting, and getted all match.

Result:

"hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "message" : "The blue light is getting smaller quickly."
        }
      }
    ]

3. Creating and applying a custom filter (user-defined token filter)

Create a new index named test_index_3

- Under settings > analysis > analyzer, define your own custom analyzer and add a tokenizer and filters to it. Filters can be custom too: declare them as filter: { "custom filter name" : { } } at the same depth as analyzer, then reference them by name inside the analyzer's filter: [] array.

PUT test_index_3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_stop_filter",
            "snowball"
          ]
        }
      },
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords": [
            "이건빼줘"
          ]
        }
      }
    }
  }
}
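The rule that a custom filter must be declared next to analyzer before an analyzer can reference it can be checked mechanically. The sketch below mirrors the settings of the request above as a Python dict; `undefined_filters` is a hypothetical helper, and `BUILT_IN_FILTERS` is an assumed, partial list covering only the filters used in this post.

```python
# Filters assumed to be built in; this is only the subset used in this post.
BUILT_IN_FILTERS = {"lowercase", "stop", "snowball"}

settings = {
    "analysis": {
        "analyzer": {
            "my_custom_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "my_stop_filter", "snowball"],
            }
        },
        "filter": {
            "my_stop_filter": {"type": "stop", "stopwords": ["이건빼줘"]}
        },
    }
}


def undefined_filters(settings):
    """Return filter names used by analyzers but neither built in nor declared."""
    analysis = settings.get("analysis", {})
    declared = set(analysis.get("filter", {}))
    missing = set()
    for analyzer in analysis.get("analyzer", {}).values():
        for name in analyzer.get("filter", []):
            if name not in BUILT_IN_FILTERS and name not in declared:
                missing.add(name)
    return missing


print(undefined_filters(settings))  # an empty set means every name resolves
```

If the "filter" block were left out, the helper would report my_stop_filter, which is the kind of mistake that makes index creation fail.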

The custom filter configured for test_index_3 is "my_stop_filter". With type: "stop" it is a stopword filter, and its only stopword is "이건빼줘" (Korean for "remove this").

Now analyze some text with _analyze on test_index_3, configured as above

GET test_index_3/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "hi there 이건빼줘 불용어야"
}

Result:

{
  "tokens" : [
    {
      "token" : "hi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "there",
      "start_offset" : 3,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "불용어야",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    }
  ]
}

"이건빼줘" is removed by the stop filter, so it does not appear in the result.

4. Applying the custom analyzer in a mapping

Above we customized analysis > analyzer and filter ourselves; now apply that analyzer to a field in the mapping.

PUT test_index_4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_stop_filter",
            "snowball"
          ]
        }
      },
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords": [
            "이건빼줘"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

Declare "mappings" at the same depth as "settings". Inside "properties", list each field name with its type and the analyzer to apply.

Index a document

PUT test_index_4/_doc/1
{
  "message": "aBcDeF 이건빼줘 HI 이건빼줘"
}

Search

GET test_index_4/_search
{
  "query": {
    "match": {
      "message": "이건빼줘"
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

No hits are returned, because "이건빼줘" was removed by the stop filter and never indexed.

Searching for abcdef or hi, in any letter case, does return the document.

GET test_index_4/_search
{
  "query": {
    "match": {
      "message": "abcDEF"
    }
  }
}

Result:

    "hits" : [
      {
        "_index" : "test_index_4",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.13353139,
        "_source" : {
          "message" : "aBcDeF 이건빼줘 HI 이건빼줘"
        }
      }
    ]
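Both searches can be pictured with a tiny inverted index in Python. This is a simplified model, not how Elasticsearch is implemented: the function names are made up for illustration, and stemming is omitted because it does not change the terms in this example.

```python
from collections import defaultdict

CUSTOM_STOPWORDS = {"이건빼줘"}  # same stopword as my_stop_filter


def analyze(text):
    """Whitespace tokenizer + lowercase + custom stop filter (stemming omitted)."""
    return [t for t in text.lower().split() if t not in CUSTOM_STOPWORDS]


index = defaultdict(set)  # term -> ids of documents containing it


def index_doc(doc_id, message):
    """Store the document id under every term the analyzer produces."""
    for term in analyze(message):
        index[term].add(doc_id)


def match(query):
    """Analyze the query the same way, then look up each remaining term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


index_doc("1", "aBcDeF 이건빼줘 HI 이건빼줘")
print(match("이건빼줘"))
print(match("abcDEF"))
```

The stopword never reaches the index, and the same filter also empties the query, so the first search finds nothing. The second query is lowercased to abcdef, which is exactly the term stored for the document.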

Reference : https://esbook.kimjmin.net/06-text-analysis/6.3-analyzer-1/6.4-custom-analyzer
