ChatGPT Misinformation Embedding

Author
Affiliation

1 Text → Vector DB

1.1 Environment Setup

Code
# !pip install pinecone-client
import os

import pinecone
from dotenv import load_dotenv, find_dotenv

# Load API credentials from a .env file
_ = load_dotenv(find_dotenv())

pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),
    environment=os.getenv("PINECONE_API_ENV"),
)
Source: vector DB Jupyter notebook

2 Text Data

This is a text file created by converting the .srt subtitle file of the "ChatGPT and Misinformation" talk, presented at the Seoul R Meetup, into plain text.

Code
with open('../data/LibriSpeech/misinfo_chatGPT.txt', 'r') as f:
    contents = f.read()
print(contents[:100])
Well, thank you very much, Dr. Ahn. This is a real pleasure to be able to speak across the ocean. I 
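The .srt-to-text conversion step itself is not shown above; a minimal sketch of what it might look like, assuming a standard SRT layout (the subtitle fragment below is illustrative, not taken from the actual talk file):

```python
# A small illustrative SRT fragment (not from the actual file)
srt = """1
00:00:01,000 --> 00:00:04,000
Well, thank you very much, Dr. Ahn.

2
00:00:04,500 --> 00:00:07,000
This is a real pleasure.
"""

def srt_to_text(srt: str) -> str:
    lines = []
    for line in srt.splitlines():
        line = line.strip()
        # Drop cue numbers, timestamp lines, and blank lines
        if not line or line.isdigit() or "-->" in line:
            continue
        lines.append(line)
    return " ".join(lines)

print(srt_to_text(srt))
# Well, thank you very much, Dr. Ahn. This is a real pleasure.
```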

2.1 Token Count

Estimating the token count of the text data is important not only for cost, but also offers useful guidance for planning the subsequent development strategy.

Regular expressions

Code
import re

def estimate_tokens(text):
    tokens = re.findall(r'\b\w+\b|\S', text)
    return len(tokens)

print(f"Regex-estimated token count: {estimate_tokens(contents)}")
Regex-estimated token count: 7974

tiktoken

Code
import tiktoken

def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens

# num_tokens_from_string("Hello World!")
num_tokens = num_tokens_from_string(contents)
num_tokens
7852

2.2 Cost

Code
# 출처: https://github.com/OpsConfig/OpenAI_Lab/blob/3a8c55160a6790fc790ef1c2c797d83c716eee94/Context-based-search-Version2.ipynb
# Based on https://openai.com/api/pricing/ on 01/29/2023
# If you were using this for approximating pricing with Azure OpenAI adjust the values below with: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

#MODEL  USAGE
#Ada     v1 $0.0040 / 1K tokens
#Babbage v1 $0.0050 / 1K tokens
#Curie   v1 $0.0200 / 1K tokens
#Davinci v1 $0.2000 / 1K tokens

#MODEL  USAGE
#Ada     v2 $0.0004 / 1K tokens
#This Ada model, text-embedding-ada-002, is a better and lower cost replacement for our older embedding models. 

n_tokens_sum = num_tokens

ada_v1_embeddings_cost = (n_tokens_sum/1000) *.0040
babbage_v1_embeddings_cost = (n_tokens_sum/1000) *.0050
curie_v1_embeddings_cost = (n_tokens_sum/1000) *.02
davinci_v1_embeddings_cost = (n_tokens_sum/1000) *.2

ada_v2_embeddings_cost = (n_tokens_sum/1000) *.0004

print("Number of tokens: " + str(n_tokens_sum) + "\n")

print("MODEL        VERSION    COST")
print("-----------------------------------")
print("Ada" + "\t\t" + "v1" + "\t$" + '%.8s' % str(ada_v1_embeddings_cost))
print("Babbage" + "\t\t" + "v1" + "\t$" + '%.8s' % str(babbage_v1_embeddings_cost))
print("Curie" + "\t\t" + "v1" + "\t$" + '%.8s' % str(curie_v1_embeddings_cost))
print("Davinci" + "\t\t" + "v1" + "\t$" + '%.8s' % str(davinci_v1_embeddings_cost))
print("Ada" + "\t\t" + "v2" + "\t$" + '%.8s' %str(ada_v2_embeddings_cost))
Number of tokens: 7852

MODEL        VERSION    COST
-----------------------------------
Ada     v1  $0.031408
Babbage     v1  $0.03926
Curie       v1  $0.15704
Davinci     v1  $1.570400
Ada     v2  $0.003140
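The per-model arithmetic above can also be driven from a single price table instead of one variable per model; a minimal sketch using the prices listed in the comments (as of 2023-01-29):

```python
# Price per 1K tokens (USD), as quoted in the comments above (2023-01-29)
PRICES = {
    ("Ada", "v1"): 0.0040,
    ("Babbage", "v1"): 0.0050,
    ("Curie", "v1"): 0.0200,
    ("Davinci", "v1"): 0.2000,
    ("Ada", "v2"): 0.0004,
}

def embedding_cost(n_tokens: int, model: str, version: str) -> float:
    """Cost in USD for embedding n_tokens with the given model/version."""
    return n_tokens / 1000 * PRICES[(model, version)]

for model, version in PRICES:
    print(f"{model:<8}{version:<4}${embedding_cost(7852, model, version):.6f}")
```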

3 Embedding

3.1 Splitting the Text

Code
import pandas as pd

sentences = contents.split(". ")

df = pd.DataFrame(sentences, columns=['text'])

print(df.head())
                                                text
0                      Well, thank you very much, Dr
1                                                Ahn
2  This is a real pleasure to be able to speak ac...
3  I wish I was there in person, but I did get th...
4                               So thank you so much
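As the output shows, naively splitting on ". " breaks abbreviations such as "Dr. Ahn" into fragments. A sketch of a slightly more robust splitter using negative lookbehinds (the abbreviation list is illustrative, not exhaustive):

```python
import re

# Split on ". ", but not when the period follows a common abbreviation.
SENT_RE = re.compile(r"(?<!Dr)(?<!Mr)(?<!Mrs)(?<!Ms)\. +")

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in SENT_RE.split(text) if s.strip()]

print(split_sentences("Well, thank you very much, Dr. Ahn. This is a real pleasure."))
# ['Well, thank you very much, Dr. Ahn', 'This is a real pleasure.']
```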

3.2 Embedding

Code
import uuid

import openai

# The OpenAI key is read from the environment, like the Pinecone credentials
openai.api_key = os.getenv("OPENAI_API_KEY")

def get_embedding(text: str, model="text-embedding-ada-002") -> list[float]:
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

# embedding = get_embedding("Your text goes here", model="text-embedding-ada-002")
# print(len(embedding))

df["embedding"] = df.text.apply(lambda x: get_embedding(x))
df['vector_id'] = [str(uuid.uuid4()) for _ in range(len(df))]

df.to_csv("misinfo-embeddings.csv")
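Note that to_csv stringifies the embedding lists; when the file is read back later, each embedding must be parsed again. A minimal sketch (the two-column CSV below is an illustrative stand-in for misinfo-embeddings.csv):

```python
import ast
import io

import pandas as pd

# Tiny stand-in for misinfo-embeddings.csv: the list is stored as a string
csv_text = ',text,embedding\n0,hello,"[0.1, 0.2, 0.3]"\n'

df2 = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Parse the stringified list back into a Python list of floats
df2["embedding"] = df2["embedding"].apply(ast.literal_eval)

print(df2["embedding"][0])  # [0.1, 0.2, 0.3]
```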

4 Vector Database

With the embeddings complete, we now load them into a vector database so that various downstream tasks can be performed. Just as relational databases are numerous, there are many kinds of vector databases. Pinecone, regarded as one of the front-runners, provides a vector database as an API once you sign up.

4.1 Creating the DB

Code
# Pick a name for the new index
index_name = 'misinfo'

# Check whether the index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    
# Creates new index
pinecone.create_index(name=index_name, dimension=len(df['embedding'][0]))
index = pinecone.Index(index_name=index_name)

# Confirm our index was created
pinecone.list_indexes()
['misinfo']

4.2 Inserting Embeddings

Code
import itertools

def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

for batch in chunks([(str(t), v) for t, v in zip(df.vector_id, df.embedding)]):
    index.upsert(vectors=batch, namespace = "misinfo_namespace")

index.describe_index_stats()    
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'misinfo_namespace': {'vector_count': 368}},
 'total_vector_count': 368}
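The chunks helper simply slices any iterable into fixed-size tuples, so that vectors are upserted in groups of at most 100 per request. Its behavior can be seen in isolation:

```python
import itertools

def chunks(iterable, batch_size=100):
    """Yield successive batch_size-sized tuples from any iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Five items in batches of two: the last batch is simply shorter
print(list(chunks(range(5), batch_size=2)))  # [(0, 1), (2, 3), (4,)]
```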

5 Search

"why misinformation is dangerous?"

Code
pd.set_option('display.max_colwidth', None)

import openai
from openai.embeddings_utils import cosine_similarity

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

def search_docs(df, user_query, top_n=3):
    # Embed the query, then rank documents by cosine similarity
    embedding = get_embedding(user_query, model="text-embedding-ada-002")

    df_similarities = df.copy()
    df_similarities["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df_similarities.sort_values("similarities", ascending=False)
        .head(top_n)
    )

    return res

question = "why misinformation is dangerous?\n\n"

res = search_docs(df, question, top_n=3)
res.text

question = "why misinformation is dangerous?\n\n"

res = search_docs(df, question, top_n=3)
res.text
186    And we know that, you know, our surgeon general and other major leaders around the world have recognized the ways in which misinformation can affect our health, and they can affect the health of democracies
130                                                                                                                                                                And bullshit is a part of the misinformation story
16                                                                                                                We've studied misinformation during the pandemic and during elections and all sorts of other topics
Name: text, dtype: object
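The cosine_similarity used for ranking above is the standard cosine of the angle between two vectors; a minimal NumPy equivalent of that metric:

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1, 0], [1, 0]))  # identical direction -> 1.0
print(cosine_sim([1, 0], [0, 1]))  # orthogonal -> 0.0
```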

6 Visualization