Named Entity Recognition (NER) Without Data Using Generative AI: Finance Domain¶

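Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

This post fine-tunes klue/roberta-small on a financial NER dataset generated with GPT-3 to detect new entities.
It then uses the fine-tuned model for inference.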

In [ ]:

from huggingface_hub import notebook_login

notebook_login()


Environment Setup¶

In [ ]:

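# The code below uses the legacy openai.Completion interface, which was removed in openai 1.0
!pip install -q "openai<1.0"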

In [ ]:

openapi_key = 'OPEN API KEY' # enter your OpenAI API key

In [ ]:

import numpy as np
import openai
import pandas as pd
import pyarrow as pa
import re
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import time

In [ ]:

# If you are using Google Drive
from google.colab import drive

drive.mount('/content/drive')
PATH = "/content/drive/MyDrive/"

Getting the Dataset¶

Reference: Training an NER Model Without Data

Creating Sample Data¶

In [ ]:

# Define the seed entity lists as real_entities (the names stay in Korean because the generated dataset is Korean)
real_entities = [
    {
        'class_name': '금융 기업명',  # financial company names
        'entity_names': [
            '삼성증권',
            '하나금융지주',
            'KB금융',
            '신한금융',
            '미래에셋대우'
        ]
    },

    {
        'class_name': '금융 용어',  # financial terms
        'entity_names': [
            '주식 배당률',
            '이자율 스왑',
            '투자 포트폴리오',
            '파생 상품',
            '자산 액면가'
        ]
    },
    {
        'class_name': '금융 지수',  # financial indices
        'entity_names': [
            '다우 산업지수',
            '나스닥 지수',
            'S&P 500',
            '한국종합주가지수',
            '상해 종합지수'
        ]
    },
    {
        'class_name': '금융 이벤트',  # financial events
        'entity_names': [
            '배당 지급일',
            '실적 발표일',
            '주주총회',
            '신규 상장',
            '분할 상장'
        ]
    },
    {
        'class_name': '금융 거래소',  # financial exchanges
        'entity_names': [
            '뉴욕증권거래소',
            '런던증권거래소',
            '도쿄증권거래소',
            '상하이증권거래소',
            '홍콩증권거래소'
        ]
    }
]

Expanding the Entity List with GPT-3¶

In [ ]:

def generate(prompts, model='text-davinci-003', n=1, max_tokens=512):
    openai.api_key = openapi_key

    response = openai.Completion.create(
        model = model,
        prompt = prompts,
        echo = False,
        n = n,
        max_tokens = max_tokens,
        # stop = '\n'
    )

    texts = [c.text.strip() for c in response.choices]
    return texts


def construct_entity_prompt(class_name, entity_names, k=10):
    prompt = f'These are <{class_name}> entity names. Generate {k} new <{class_name}> entity names.\n\n'
    prompt += 'Entity names:\n'
    for e in entity_names:
        prompt += f'- {e}\n'
    prompt += '\nGenerated names:\n-'
    return prompt


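def postprocess_entities(synthetic_entities):
    # Each completion continues the "-" that ends the prompt, so restore the bullet first
    processed = []
    for ents in synthetic_entities:
        for line in f'- {ents}'.split('\n'):
            # Drop the leading bullet; skip blank lines, which previously raised IndexError
            name = line.strip().lstrip('-').strip()
            if name:
                processed.append(name)
    return processed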


synthetic_entities = []
for real_ent in tqdm(real_entities):
    class_name, entity_names = real_ent['class_name'], real_ent['entity_names']
    prompt = construct_entity_prompt(class_name, entity_names)

    syn_entities = generate(prompt, n=10)
    syn_entities = postprocess_entities(syn_entities)
    syn_entities = list(set(syn_entities))

    synthetic_entities.append({'class_name': class_name, 'entity_names': syn_entities})

100%|██████████| 5/5 [00:51<00:00, 10.31s/it]
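
To make the prompt format concrete, here is a minimal sketch that rebuilds the same few-shot prompt and parses a completion without calling the API. The helper names `build_prompt` and `parse_bullets` and the `mock_completion` string are hypothetical stand-ins introduced for illustration; the completion is not real model output.

```python
def build_prompt(class_name, entity_names, k=10):
    # Same layout as construct_entity_prompt: instruction, seed bullets, open bullet
    lines = [
        f"These are <{class_name}> entity names. Generate {k} new <{class_name}> entity names.",
        "",
        "Entity names:",
    ]
    lines += [f"- {name}" for name in entity_names]
    lines += ["", "Generated names:", "-"]
    return "\n".join(lines)


def parse_bullets(completion):
    # The completion continues the trailing "-", so put it back before splitting
    names = []
    for line in ("-" + completion).splitlines():
        name = line.strip().lstrip("-").strip()
        if name:  # ignore blank lines between bullets
            names.append(name)
    return names


prompt = build_prompt("금융 기업명", ["삼성증권", "KB금융"], k=3)
print(prompt)

# Hypothetical completion text, hand-written for this example
mock_completion = " NH금융\n- 나눔금융\n\n- 펀드마켓"
names = parse_bullets(mock_completion)
print(names)
```

Ending the prompt with an open bullet nudges the model to answer as a bulleted list, which is why the parser has to restore the first hyphen.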

In [ ]:

synthetic_entities

'''Example output
[{'class_name': '금융 기업명',
  'entity_names': ['NH금융',
   '인터내셔널증권',
   '나눔금융',
   '한화신한투자',
   '외환종합금융',
   '펀드마켓',
   '메릴린치투자증권',
   'HSBC코리아',
   '신원금융',...
'''

In [ ]:

all_entities = []
for real, synthetic in zip(real_entities, synthetic_entities):
    all_entities.append({
        'class_name': real['class_name'],
        'entity_names': list(set(real['entity_names'] + synthetic['entity_names']))
    })

In [ ]:

all_entities

'''Example output
[{'class_name': '금융 기업명',
  'entity_names': ['NH금융',
   '인터내셔널증권',
   '나눔금융',
   '한화신한투자',
   '외환종합금융',
   '펀드마켓',
   '메릴린치투자증권',
   'HSBC코리아',
   '신원금융',...

'''

Generating an NER Dataset with GPT-3¶

Generate a dataset consisting of sentence, tokens, and ner_tags

In [ ]:

def sample_entities(all_entities, min_k=1, max_k=3):
    k = np.random.randint(min_k, max_k+1)
    idxs = np.random.choice(range(len(all_entities)), size=k, replace=False)

    entities = []
    for i in idxs:
        ents = all_entities[i]
        name = np.random.choice(ents['entity_names'])
        entities.append({'class_name': ents['class_name'], 'entity_name': name})

    return entities


def construct_sentence_prompt(entities, style='dialog'):
    prompt = f'Generate a {style} sentence including following entities.\n\n'

    entities_string = ', '.join([f"{e['entity_name']}({e['class_name']})" for e in entities])
    prompt += f'Entities: {entities_string}\n'
    prompt += 'Sentence:'
    return prompt


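def construct_labels(generated, entities, class2idx):
    # One label per character: B = class index, I = class index + 1
    labels = [class2idx['outside']] * len(generated)
    for ent in entities:
        l = class2idx[ent['class_name']]
        # re.escape keeps characters such as '(' or '+' in an entity name from being read as regex syntax
        for span in re.finditer(re.escape(ent['entity_name'].lower()), generated.lower()):
            s, e = span.start(), span.end()
            labels[s] = l
            labels[s+1:e] = [l+1] * (e-s-1)
    return labels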


class2idx = {e['class_name']: i*2 for i, e in enumerate(all_entities)}
class2idx['outside'] = len(class2idx) * 2

data = []
for _ in tqdm(range(100)):
    batch_entities = [sample_entities(all_entities) for _ in range(10)]
    batch_prompts = [construct_sentence_prompt(ents) for ents in batch_entities]
    batch_generated = generate(batch_prompts, model='text-davinci-003')

    for generated, entities in zip(batch_generated, batch_entities):
        labels = construct_labels(generated, entities, class2idx)
        data.append({'sentence': generated, 'tokens': list(generated), 'ner_tags': labels})

    time.sleep(10)

100%|██████████| 100/100 [25:21<00:00, 15.22s/it]
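
The loop above waits a fixed 10 seconds between batches. One alternative, which is not what the original notebook does, is to retry a failed request with exponential backoff. The sketch below assumes a hypothetical helper `call_with_backoff` and demonstrates it with a fake function rather than a real API call.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in practice, catch the client's rate-limit error specifically
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demo: a fake request that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the call as `call_with_backoff(lambda: generate(batch_prompts))`, so that time is only spent waiting when a request actually fails.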

In [ ]:

data
'''Example output
[{'sentence': 'SC제일금융의 메릴랜드 밸류 지수에 대한 주주총회 사전 인승을 기대합니다.',
  'tokens': ['S',
   'C',
   '제',
   '일',
   '금',
   '융',
   '의',
   ' ',
   '메',
   '릴',
   '랜',
   '드',
   ' ',
   '밸',
   '류',
   ' ',
   '지',
   '수',
   '에',
   ' ',
   '대',
   '한',
   ' ',
   '주',
   '주',
   '총',
   '회',
   ' ',
   '사',
   '전',
   ' ',
   '인',
   '승',
   '을',
   ' ',
   '기',
   '대',
   '합',
   '니',
   '다',
   '.'],
  'ner_tags': [0,
   1,
   1,
   1,
   1,
   1,
   10,
   10,
   4,
   5,
   5,
   5,
   5,
   5,
   5,
   5,
   5,
   5,
   10,
   10,
   10,
   10,
   10,
   6,
   7,
   7,
   7,
   7,
   7,
   7,
   7,
   7,
   7,
   10,
   10,
   10,
   10,
   10,
   10,
   10,
   10]},
'''

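The prefix on each ner_tag indicates the token's position within an entity:

B- marks the beginning of an entity.
I- indicates that the token is inside the same entity (for example, the State token is part of an entity such as Empire State Building).
O indicates that the token does not belong to any entity.

NER tag list
- FC - financial company
- FT - financial term
- FI - financial index
- FE - financial event
- FX - financial exchange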
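
The integer tags in the example output follow from this scheme: class number i gets B = 2i and I = 2i + 1, and O takes the last id, 10. The following is a small self-contained sketch of character-level BIO tagging under that scheme. The function name `char_bio_tags` and the sample sentence are made up for illustration.

```python
import re

CLASS_CODES = ["FC", "FT", "FI", "FE", "FX"]
# B-/I- pair per class, then O at the end: B-FC=0, I-FC=1, ..., O=10
LABELS = [f"{p}-{c}" for c in CLASS_CODES for p in ("B", "I")] + ["O"]
label2id = {label: i for i, label in enumerate(LABELS)}


def char_bio_tags(sentence, entities):
    """entities is a list of (surface string, class code) pairs."""
    tags = ["O"] * len(sentence)
    for surface, code in entities:
        for m in re.finditer(re.escape(surface), sentence):
            tags[m.start()] = f"B-{code}"          # first character of the entity
            for i in range(m.start() + 1, m.end()):
                tags[i] = f"I-{code}"              # remaining characters
    return tags


sentence = "KB금융 주주총회"
tags = char_bio_tags(sentence, [("KB금융", "FC"), ("주주총회", "FE")])
ids = [label2id[t] for t in tags]
for ch, tag, i in zip(sentence, tags, ids):
    print(repr(ch), tag, i)
```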

In [ ]:

# LABELS = ['B-FC', 'I-FC', 'B-FT', 'I-FT', 'B-FI', 'I-FI', 'B-FE', 'I-FE', 'B-FX', 'I-FX', 'O']
# FC - financial company
# FT - financial term
# FI - financial index
# FE - financial event
# FX - financial exchange

In [ ]:

# Declare the id-to-label and label-to-id mapping dictionaries

id2label = {
    0: "B-FC",
    1: "I-FC",
    2: "B-FT",
    3: "I-FT",
    4: "B-FI",
    5: "I-FI",
    6: "B-FE",
    7: "I-FE",
    8: "B-FX",
    9: "I-FX",
    10: "O"
}
label2id = {
    "B-FC": 0,
    "I-FC": 1,
    "B-FT": 2,
    "I-FT": 3,
    "B-FI": 4,
    "I-FI": 5,
    "B-FE": 6,
    "I-FE": 7,
    "B-FX": 8,
    "I-FX": 9,
    "O": 10
}

Save the generated dataset, then load it back as the dataset to train on:

In [ ]:

# Save the generated NER data
pd.DataFrame(data, columns=['sentence', 'tokens', 'ner_tags']).to_csv(PATH+'fin_ner_dataset.csv', index=False)

Mounted at /content/drive

In [ ]:

# df = pd.DataFrame(data, columns=['sentence', 'tokens', 'ner_tags']) # use this instead when not loading from the saved file
df = pd.read_csv(PATH+'fin_ner_dataset.csv')

In [ ]:

# Preprocess the saved data: the CSV stores each list as its string repr, so parse it back into a list
import ast

df['tokens'] = df['tokens'].apply(ast.literal_eval)
# Convert the integer tag ids to their BIO label strings
df['ner_tags'] = df['ner_tags'].apply(lambda x: [id2label[int(tag)] for tag in ast.literal_eval(x)])
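
The parsing step is needed only because CSV flattens each list into a string. One alternative, assuming you control the file format, is JSON Lines, which keeps lists intact across a save and load. This is a sketch of that idea rather than the notebook's method; the file name and the single record are placeholders.

```python
import json
import os
import tempfile

# Placeholder record in the same shape as the generated data
records = [
    {"sentence": "KB금융", "tokens": list("KB금융"), "ner_tags": [0, 1, 1, 1]},
]

# Hypothetical file name; a temporary directory stands in for the Drive path
path = os.path.join(tempfile.mkdtemp(), "fin_ner_dataset.jsonl")

# Write one JSON object per line; ensure_ascii=False keeps Korean text readable
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back: tokens and ner_tags come back as real lists, no string parsing needed
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["tokens"], loaded[0]["ner_tags"])
```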

In [ ]:

df['ner_tags']

Out[ ]:

0      [B-FC, I-FC, I-FC, I-FC, I-FC, I-FC, O, O, B-F...
1      [B-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-F...
2      [O, B-FX, I-FX, I-FX, I-FX, I-FX, I-FX, I-FX, ...
3      [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
4      [O, O, O, O, O, O, O, O, B-FT, I-FT, I-FT, I-F...
                             ...                        
995    [B-FC, I-FC, I-FC, I-FC, I-FC, I-FC, O, O, B-F...
996    [B-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-F...
997    [O, O, O, O, O, O, O, B-FC, I-FC, I-FC, I-FC, ...
998    [O, O, O, O, B-FE, I-FE, I-FE, I-FE, I-FE, I-F...
999    [O, B-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-FI, ...
Name: ner_tags, Length: 1000, dtype: object

In [ ]:

df.head()

Out[ ]:

	sentence	tokens	ner_tags
0	SC제일금융의 메릴랜드 밸류 지수에 대한 주주총회 사전 인승을 기대합니다.	[S, C, 제, 일, 금, 융, 의, , 메, 릴, 랜, 드, , 밸, 류, ...	[B-FC, I-FC, I-FC, I-FC, I-FC, I-FC, O, O, B-F...
1	미국내국제금융지수가 최근 상승하고 있는 것으로 보입니다.	[미, 국, 내, 국, 제, 금, 융, 지, 수, 가, , 최, 근, , 상, ...	[B-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-FI, I-F...
2	"서메니아증권거래소에서는 금융 기금 트렌드 분석이라는 금융 이벤트를 주기적으로 진행...	[", 서, 메, 니, 아, 증, 권, 거, 래, 소, 에, 서, 는, , 금, ...	[O, B-FX, I-FX, I-FX, I-FX, I-FX, I-FX, I-FX, ...
3	Are you up to date with the latest Hang Seng C...	[A, r, e, , y, o, u, , u, p, , t, o, , d, ...	[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...
4	네가 알고있는 주식 배당률은 어떤가요?	[네, 가, , 알, 고, 있, 는, , 주, 식, , 배, 당, 률, 은, ...	[O, O, O, O, O, O, O, O, B-FT, I-FT, I-FT, I-F...

In [ ]:

df.isna().sum()

Out[ ]:

sentence    0
tokens      0
ner_tags    0
dtype: int64

Converting to Hugging Face Datasets Format¶

In [ ]:

x_train, x_test, y_train, y_test = train_test_split(df.drop(['ner_tags'], axis=1), df['ner_tags'], test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(800, 2) (200, 2) (800,) (200,)

In [ ]:

# Create the training dataset
train_data = {"sentence": x_train['sentence'], "tokens": x_train['tokens'], "ner_tags": y_train}
train_dataset = Dataset.from_dict(train_data)

# Create the test dataset
test_data = {"sentence": x_test['sentence'], "tokens": x_test['tokens'], "ner_tags": y_test}
test_dataset = Dataset.from_dict(test_data)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

In [ ]:

print(type(dataset))		# <class 'datasets.dataset_dict.DatasetDict'>
dataset.keys()

<class 'datasets.dataset_dict.DatasetDict'>

Out[ ]:

dict_keys(['train', 'test'])

In [ ]:

dataset

Out[ ]:

DatasetDict({
    train: Dataset({
        features: ['sentence', 'tokens', 'ner_tags'],
        num_rows: 800
    })
    test: Dataset({
        features: ['sentence', 'tokens', 'ner_tags'],
        num_rows: 200
    })
})

In [ ]:

# label_list = dataset["train"].features[f"ner_tags"].feature.names
label_list = list(id2label.values())
label_list

Out[ ]:

['B-FC',
 'I-FC',
 'B-FT',
 'I-FT',
 'B-FI',
 'I-FI',
 'B-FE',
 'I-FE',
 'B-FX',
 'I-FX',
 'O']

In [ ]:

label_list[-1] == 'O'

Out[ ]:

True

Take a look at the following example:

In [ ]:

dataset['train'][0]

Out[ ]:

{'sentence': '블루베리증권거래소에서는 이마트금융과 거래할 수 있습니다.',
 'tokens': ['블',
  '루',
  '베',
  '리',
  '증',
  '권',
  '거',
  '래',
  '소',
  '에',
  '서',
  '는',
  ' ',
  '이',
  '마',
  '트',
  '금',
  '융',
  '과',
  ' ',
  '거',
  '래',
  '할',
  ' ',
  '수',
  ' ',
  '있',
  '습',
  '니',
  '다',
  '.'],
 'ner_tags': ['B-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'I-FX',
  'O',
  'O',
  'O',
  'O',
  'B-FC',
  'I-FC',
  'I-FC',
  'I-FC',
  'I-FC',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O']}
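
More than half of the characters in this example are tagged O, which hints at the class imbalance that the evaluation code later tries to compensate for. A quick way to measure it is to count tag frequencies. The sketch below uses a hypothetical helper `tag_distribution` on the tags of the single example shown above.

```python
from collections import Counter


def tag_distribution(tag_sequences):
    """Return each tag's share of all tokens, most frequent first."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}


# The ner_tags of the example above, written compactly
example_tags = (
    ["B-FX"] + ["I-FX"] * 8 + ["O"] * 4 + ["B-FC"] + ["I-FC"] * 4 + ["O"] * 13
)
dist = tag_distribution([example_tags])
for tag, share in dist.items():
    print(f"{tag}: {share:.3f}")
```

Calling `tag_distribution(dataset["train"]["ner_tags"])` would give the same breakdown for the whole training split.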

Preprocessing¶

Next, load the klue/roberta-small tokenizer to preprocess the tokens field:

In [ ]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")

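Looking at the tokens field in the example above, the input appears to be tokenized already. It has not actually been run through the model's tokenizer yet, so you need to set is_split_into_words=True to tokenize the words into subwords. Check this with an example: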

In [ ]:

example = dataset["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

Out[ ]:

['[CLS]',
 '블',
 '루',
 '베',
 '리',
 '증',
 '권',
 '거',
 '래',
 '소',
 '에',
 '서',
 '는',
 '이',
 '마',
 '트',
 '금',
 '융',
 '과',
 '거',
 '래',
 '할',
 '수',
 '있',
 '습',
 '니',
 '다',
 '.',
 '[SEP]']

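However, this adds the special tokens [CLS] and [SEP], and subword tokenization creates a mismatch between the input and the labels. A single word corresponding to a single label may now be split into two subwords, and in the output above the whitespace characters produce no token at all. You need to realign the tokens and labels as follows:

Map all tokens to their corresponding word with the word_ids method.
Assign the label -100 to the special tokens [CLS] and [SEP] so that the PyTorch loss function ignores them.
Label only the first token of a given word. Assign -100 to the other subtokens of the same word.

Here is how to create a function that realigns the tokens and labels and truncates sequences so they are no longer than the maximum input length: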

In [ ]:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # label2id already maps 'O' to 10, so a single lookup covers every tag
                label_ids.append(label2id[label[word_idx]])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
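
The alignment rule can be checked without a tokenizer by feeding it a hand-written word_ids list. In this sketch the function name `align_labels` and the `word_ids` values are assumptions for illustration: word 1 is imagined to split into two subwords, and word 2 (a space) produces no token.

```python
def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Label the first subword of each word; mask special tokens and later subwords."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            aligned.append(ignore_index)           # special token or continuation subword
        else:
            aligned.append(label2id[word_labels[word_idx]])
        previous = word_idx
    return aligned


label2id = {"B-FC": 0, "I-FC": 1, "O": 10}
word_labels = ["B-FC", "I-FC", "O", "O"]           # one label per input word
word_ids = [None, 0, 1, 1, 3, None]                # hypothetical tokenizer output

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)
```

The label of word 2 never appears in the result because no token points back to it, which is what happens to the space characters in this dataset.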

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:

In [ ]:

tokenized_wnut = dataset.map(tokenize_and_align_labels, batched=True)

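Now create a batch of examples using DataCollatorForTokenClassification. Instead of padding the whole dataset to the maximum length, it is more efficient to use dynamic padding, which pads the sentences to the longest length in each batch.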

In [ ]:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
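
As a rough illustration of what dynamic padding does, here is a pure-Python sketch. It is not the collator's actual implementation: `pad_batch` is a hypothetical helper, the token ids are invented, and `pad_token_id=0` is a placeholder rather than the real id used by the klue/roberta-small tokenizer. Padding labels with -100 mirrors the value the loss function ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded positions get -100 so they do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Invented token ids for two examples of different lengths
features = [
    {"input_ids": [2, 11, 12, 3], "labels": [-100, 0, 1, -100]},
    {"input_ids": [2, 21, 3], "labels": [-100, 10, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```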

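Evaluation¶

Including a metric during training is helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the seqeval metric (see the 🤗 Evaluate quick tour to learn more about loading and computing a metric). Seqeval produces several scores: precision, recall, F1, and accuracy.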

In [ ]:

import evaluate

seqeval = evaluate.load("seqeval")

First get the NER labels, then create a function that passes the model's predictions and the true labels to compute to calculate the scores:

In [ ]:

tokenized_wnut['train']['ner_tags']

In [ ]:

tokenized_wnut['train']['labels']

In [ ]:

labels = list(id2label.values())

def compute_metrics(p):
    predictions, labels = p

    # Scale the 'O' logit by 0.5 to reduce predictions collapsing to 'O' (class imbalance).
    # Note: halving only lowers the logit when it is positive.
    predictions[:, :, 10] *= 0.5

    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Your compute_metrics function is now ready to go, and you'll return to it when you set up training.
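
Precision, recall, and F1 here are computed over whole entities, while accuracy is computed per token, which is why accuracy can look much higher than F1. The sketch below is a simplified pure-Python illustration of entity-level scoring, not seqeval's implementation, and it may differ from seqeval in edge cases. The function names and the toy sequences are made up.

```python
def extract_spans(tags):
    """Collect (start, end, class) spans from one BIO tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        prefix, _, cls = tag.partition("-")
        continues = prefix == "I" and cls == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, cls                  # lenient: a stray I- opens a span
    return spans


def entity_prf(true_seqs, pred_seqs):
    tp = n_pred = n_true = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = extract_spans(t), extract_spans(p)
        tp += len(ts & ps)                         # exact boundary and class match
        n_pred += len(ps)
        n_true += len(ts)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


true = [["B-FC", "I-FC", "O", "B-FE", "I-FE"]]
pred = [["B-FC", "I-FC", "O", "B-FE", "O"]]
precision, recall, f1 = entity_prf(true, pred)
token_accuracy = sum(t == p for t, p in zip(true[0], pred[0])) / len(true[0])
print(precision, recall, f1, token_accuracy)
```

One wrong character costs a whole entity: four of five tokens are correct, yet only one of the two entities counts as a match.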

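Training¶

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load klue/roberta-small with AutoModelForTokenClassification, and specify the number of expected labels and the label mappings: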

In [ ]:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "klue/roberta-small", num_labels=11, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at klue/roberta-small were not used when initializing RobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at klue/roberta-small and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

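At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save the model. Set push_to_hub=True to upload this model to the Hub (you need to be logged in to Hugging Face to upload a model). At the end of each epoch, the Trainer evaluates the seqeval scores and saves a training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to fine-tune the model.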

In [ ]:

# PyTorch only.
training_args = TrainingArguments(
    output_dir="ko_fin_ner_roberta_small_model",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=30,
    weight_decay=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Cloning https://huggingface.co/Hyeonseo/ko_fin_ner_roberta_small_model into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Hyeonseo/ko_fin_ner_roberta_small_model into local empty directory.
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

[750/750 08:15, Epoch 30/30]

Epoch	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1	No log	1.027180	0.121514	0.166213	0.140391	0.723684
2	No log	0.713578	0.236045	0.403270	0.297787	0.769451
3	No log	0.528869	0.342237	0.558583	0.424431	0.828518
4	No log	0.440442	0.418386	0.607629	0.495556	0.872998
5	No log	0.376811	0.412371	0.653951	0.505796	0.886585
6	No log	0.348386	0.475836	0.697548	0.565746	0.895309
7	No log	0.323648	0.547667	0.735695	0.627907	0.903890
8	No log	0.309744	0.570248	0.752044	0.648649	0.901459
9	No log	0.316751	0.616740	0.762943	0.682095	0.909611
10	No log	0.295014	0.617647	0.801090	0.697509	0.914474
11	No log	0.280616	0.667411	0.814714	0.733742	0.919479
12	No log	0.274912	0.685268	0.836512	0.753374	0.926630
13	No log	0.274262	0.700229	0.833787	0.761194	0.929205
14	No log	0.286177	0.677419	0.801090	0.734082	0.923770
15	No log	0.270306	0.687912	0.852861	0.761557	0.927632
16	No log	0.275169	0.703620	0.847411	0.768850	0.929348
17	No log	0.272116	0.699774	0.844687	0.765432	0.930492
18	No log	0.283104	0.697941	0.831063	0.758706	0.929920
19	No log	0.285688	0.725173	0.855586	0.785000	0.931922
20	0.278600	0.279229	0.726027	0.866485	0.790062	0.931922
21	0.278600	0.260436	0.735499	0.863760	0.794486	0.934926
22	0.278600	0.260266	0.709172	0.863760	0.778870	0.935927
23	0.278600	0.302626	0.722727	0.866485	0.788104	0.934211
24	0.278600	0.279959	0.743056	0.874659	0.803504	0.937500
25	0.278600	0.283812	0.728311	0.869210	0.792547	0.936070
26	0.278600	0.281343	0.733945	0.871935	0.797011	0.937071
27	0.278600	0.288116	0.740741	0.871935	0.801001	0.935784
28	0.278600	0.289354	0.737931	0.874659	0.800499	0.936213
29	0.278600	0.288943	0.748252	0.874659	0.806533	0.936785
30	0.278600	0.287272	0.743649	0.877384	0.805000	0.937357

/usr/local/lib/python3.10/dist-packages/seqeval/metrics/v1.py:57: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Out[ ]:

TrainOutput(global_step=750, training_loss=0.1970674082438151, metrics={'train_runtime': 496.2034, 'train_samples_per_second': 48.367, 'train_steps_per_second': 1.511, 'total_flos': 364940822643456.0, 'train_loss': 0.1970674082438151, 'epoch': 30.0})
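
The numbers in this log can be reproduced with a little arithmetic. The sketch below assumes a single device and the Trainer's default logging interval of 500 steps, which would explain why "No log" appears until epoch 20. It also compares the best epoch by validation loss with the best epoch by F1, using a subset of rows copied from the table above. As far as I know, load_best_model_at_end=True selects by validation loss unless metric_for_best_model is set.

```python
import math

train_examples, batch_size, epochs = 800, 32, 30
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

logging_steps = 500  # assumption: the default value, not overridden
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

# (epoch, validation loss, F1): a subset of rows from the results table
rows = [
    (21, 0.260436, 0.794486),
    (22, 0.260266, 0.778870),
    (24, 0.279959, 0.803504),
    (29, 0.288943, 0.806533),
    (30, 0.287272, 0.805000),
]
best_by_loss = min(rows, key=lambda r: r[1])[0]
best_by_f1 = max(rows, key=lambda r: r[2])[0]

print(steps_per_epoch, total_steps, first_logged_epoch)
print(best_by_loss, best_by_f1)
```

If F1 is the metric you care about, setting metric_for_best_model="f1" in TrainingArguments is one way to make checkpoint selection follow it.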

Once training is complete, you can share your model on the Hub with the push_to_hub() method.

In [ ]:

trainer.push_to_hub()

Inference¶

Great, now that you've fine-tuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

In [ ]:

text = dataset['train'][9]['sentence']
text

Out[ ]:

'나스닥투자증권에서 시작된 발동성 가치 상태 효과는 투자자들에게 좋은 기회를 제공합니다.'

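The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for NER with your model, and pass your text to it:

FC - financial company
FT - financial term
FI - financial index
FE - financial event
FX - financial exchange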

In [ ]:

from transformers import pipeline

classifier = pipeline("ner", model="hyeonseo/ko_fin_ner_roberta_small_model")
classifier(text)

Out[ ]:

[{'entity': 'B-FE',
  'score': 0.3856297,
  'index': 6,
  'word': '발동',
  'start': 14,
  'end': 16},
 {'entity': 'I-FE',
  'score': 0.86740315,
  'index': 7,
  'word': '##성',
  'start': 16,
  'end': 17},
 {'entity': 'I-FE',
  'score': 0.7324382,
  'index': 8,
  'word': '가치',
  'start': 18,
  'end': 20},
 {'entity': 'I-FE',
  'score': 0.7178092,
  'index': 9,
  'word': '상태',
  'start': 21,
  'end': 23}]
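
The pipeline returns one entry per token, so the four entries above are really a single FE entity. The sketch below merges them using the character offsets from that output. `merge_entities` is a hypothetical helper written for this example; the pipeline's own aggregation_strategy argument is another way to get grouped results.

```python
def merge_entities(text, token_results):
    """Join a B- token and the I- tokens that follow it into one span."""
    spans = []
    for tok in token_results:
        prefix, _, cls = tok["entity"].partition("-")
        if spans and prefix == "I" and spans[-1]["label"] == cls:
            spans[-1]["end"] = tok["end"]          # extend the current span
        else:
            spans.append({"label": cls, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "나스닥투자증권에서 시작된 발동성 가치 상태 효과는 투자자들에게 좋은 기회를 제공합니다."
# Entity tags and offsets copied from the pipeline output above
token_results = [
    {"entity": "B-FE", "start": 14, "end": 16},
    {"entity": "I-FE", "start": 16, "end": 17},
    {"entity": "I-FE", "start": 18, "end": 20},
    {"entity": "I-FE", "start": 21, "end": 23},
]
merged = merge_entities(text, token_results)
print(merged)
```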

If you like, you can also manually replicate the results of the pipeline:

Tokenize the text and return PyTorch tensors:

In [ ]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hyeonseo/ko_fin_ner_roberta_small_model")
text = dataset['train'][10]['sentence']
print(text)

inputs = tokenizer(text, return_tensors="pt")

"신한저축은행까지 일주일 안에 인출되면 될까요?"

Pass your inputs to the model and return the logits:

In [ ]:

from transformers import AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("hyeonseo/ko_fin_ner_roberta_small_model")
with torch.no_grad():
    logits = model(**inputs).logits

Take the class with the highest probability and use the model's id2label mapping to convert it to a text label:

In [ ]:

# To mitigate the model predicting only the 'O' class, set a threshold that restricts mapping to the 'O' class
threshold = logits.sum()/(len(logits)*10)

for i in range(len(logits)):
    for j in range(len(logits[i])):
        if sum(logits[i][j]) >= threshold:
            logits[i][j][10] = logits[i][j][10] * (-1)

predictions = torch.argmax(logits, dim=2)

predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]

print(text)
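# Note: predictions are per subword token (including [CLS] and [SEP]), while ner_tags are per character,
# so this index-by-index comparison is only approximate
print("Predicted / Actual")
for i in range(len(predicted_token_class)):
    print(predicted_token_class[i] + " / " + dataset['train'][10]['ner_tags'][i])

"신한저축은행까지 일주일 안에 인출되면 될까요?"
Predicted / Actual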
B-FE / O
B-FC / B-FC
B-FC / I-FC
I-FC / I-FC
I-FC / I-FC
I-FC / I-FC
I-FC / I-FC
I-FI / O
I-FE / O
I-FE / O
I-FC / O
I-FE / O
B-FE / O
B-FE / O
I-FI / O
B-FE / O
B-FE / O
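
The comparison above pairs subword-level predictions with character-level tags by index, so the rows do not line up exactly. One way to compare like with like is to spread each token's prediction over the characters it covers, using the offsets a fast tokenizer can return with return_offsets_mapping=True. The sketch below assumes that special tokens have an empty (0, 0) span. The offsets and labels are invented for illustration and are not real model output.

```python
def token_to_char_labels(text, offsets, token_labels):
    """Project token-level labels onto characters using (start, end) offsets."""
    char_labels = ["O"] * len(text)
    for (start, end), label in zip(offsets, token_labels):
        if start == end or label == "O":
            continue                               # special token or outside tag
        cls = label.partition("-")[2]
        for i in range(start, end):
            # The token's own tag goes on its first character, I- on the rest
            char_labels[i] = label if i == start else f"I-{cls}"
    return char_labels


text = "신한저축은행까지"
# Hypothetical offsets and predictions for [CLS], four subwords, [SEP]
offsets = [(0, 0), (0, 2), (2, 4), (4, 6), (6, 8), (0, 0)]
token_labels = ["O", "B-FC", "I-FC", "I-FC", "O", "O"]

char_labels = token_to_char_labels(text, offsets, token_labels)
for ch, label in zip(text, char_labels):
    print(ch, label)
```

The resulting list has one label per character, so it can be compared position by position with the ner_tags column.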

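Areas for improvement
- Performance would likely improve if the dataset were expanded to roughly 10,000 examples for training
- The entity labels mapped onto the GPT-3-generated text are inaccurate -> poor data quality
Takeaways
- This approach looks useful for domains where datasets are scarce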
