1 파이썬 가상환경

파이썬을 계속 사용하다보니 무조건 가상환경을 사용해야 한다는 걸 절실히 느끼게 된다. 시간이 지나면 어떤 패키지들을 설치했었는지 확인이 되지 않고 어떤 것이 문제가 되어 잘 돌던 코드가 제대로 실행되지 않는지 파악이 힘드는 지경에 이르게 된다.

파이썬3에서 venv, virtualenv 두가지 가상환경 팩키지가 제공되는데 선택을 해야한다. 결론은 파이썬3에서 venv가 지원되니 별도 패키지 설치없이 venv로 가는 것이 좋다.

python3 -m venv <가상환경 명칭>
source <가상환경 명칭>/bin/activate
pip install -U pip
pip install pandas
pip freeze > requirements.txt

가상환경 생성부터 주요한 가상환경 설정 방법을 순차적으로 파악해보자.

py-3.10.9 tidyverse in ~/venv
○ → python3 -m venv venv

py-3.10.9 tidyverse in ~/venv
○ → source venv/bin/activate

## .\venv\Scripts\activate ## 윈도우즈

○ → which python
/Users/tidyverse/venv/venv/bin/python

 |venv|py-3.9.6 tidyverse in ~/venv
○ → pip install -U pip
Requirement already satisfied: pip in ./venv/lib/python3.9/site-packages (21.2.4)
Collecting pip
  Using cached pip-23.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.2.4
    Uninstalling pip-21.2.4:
      Successfully uninstalled pip-21.2.4
Successfully installed pip-23.0

 |venv|py-3.9.6 tidyverse in ~/venv
○ → pip install pandas
Collecting pandas
  Downloading pandas-1.5.3-cp39-cp39-macosx_10_9_x86_64.whl (12.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 5.7 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Using cached pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
Collecting numpy>=1.20.3
  Downloading numpy-1.24.2-cp39-cp39-macosx_10_9_x86_64.whl (19.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 4.3 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.1
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, six, numpy, python-dateutil, pandas
Successfully installed numpy-1.24.2 pandas-1.5.3 python-dateutil-2.8.2 pytz-2022.7.1 six-1.16.0

 |venv|py-3.9.6 tidyverse in ~/venv
○ → pip freeze
numpy==1.24.2
pandas==1.5.3
python-dateutil==2.8.2
pytz==2022.7.1
six==1.16.0

 |venv|py-3.9.6 tidyverse in ~/venv
○ → pip freeze > requirements.txt

 |venv|py-3.9.6 tidyverse in ~/venv
○ → tree -L 2
.
├── requirements.txt
└── venv
    ├── bin
    ├── include
    ├── lib
    └── pyvenv.cfg

4 directories, 2 files

 |venv|py-3.9.6 tidyverse in ~/venv
○ → deactivate

py-3.10.9 tidyverse in ~/venv
○ →

2 전형적인 프로젝트

격리된 파이썬 개발환경과 데이터, 코드, ipynb 등이 모두 갖춰진 프로젝트 디렉토리 구조는 다음과 같다.

project_name/
    venv/
    data/
    code/
        main.py
        module1.py
        module2.py
        ...
    notebooks/
        analysis.ipynb
        exploratory.ipynb
        ...
    requirements.txt
    README.md

3 NLP 작업 분류

Hugging Face 웹사이트에서 Transformer를 다운로드 받아 다양한 자연어 처리 작업을 수행한다.

먼저, 작업흐름은 앞서 준비한 파이썬 가상환경에 허깅 페이스에서 Transfomer를 설치하고 이어 각 NLP 작업(task)에 맞춰 후속작업을 이어나간다.

Hello Transformers from R

R, Reticulate, and Hugging Face Models

graph LR
    A["venv 가상환경"] --> T(("Transformer"))
    B["R reticulate"] --> T(("Transformer"))
    T ---> C["텍스트 분류(Text Classification)"]
    T ---> D["개체명인식(NER)"]
    T ---> E["질의 응답(Question & Answering)"]
    T ---> F["요약(Summarization)"]
    T ---> G["번역(Translation)"]
    T ---> H["텍스트 생성(Text Generation)"]    
    style A fill:#FF6655AA
    style T fill:#88ffFF

환경 설정

파이썬 가상환경을 준비하고 transformers 및 연관 패키지를 설치한다.

코드

library(reticulate)

use_python("~/venv/venv/bin/python")
reticulate::py_config()
reticulate::py_available()

reticulate::py_install("transformers", pip = TRUE)
reticulate::py_install(c("torch", "sentencepiece"), pip = TRUE)

3.1 감정 분류

영문 텍스트 감정을 분류하는 작업을 수행하자.

코드

library(reticulate)
library(tidyverse)

use_python("~/venv/venv/bin/python")

text <- ("Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.")

# Importing 🤗 transformers into R session
transformers <- reticulate::import("transformers")

# model_name <- "bert-base-uncased"
# model <- transformers$AutoModel$from_pretrained(model_name)

# Instantiate a pipeline
classifier <- transformers$pipeline(task = "text-classification")

# Generate predictions
outputs <- classifier(text)

# Convert predictions to tibble
outputs %>% 
  pluck(1) %>% 
  as_tibble()

3.2 NER

개체명 인식은 텍스트 내부에 지명, 인명, 제품 등을 자동으로 인식하는 과정이다.

코드

# Download model for ner task
ner_tagger <- transformers$pipeline(task = "ner", aggregation_strategy = "simple")

# Make predictions
outputs <- ner_tagger(text)

# Convert predictions to tibble
# This takes some bit of effort since some of the variables are numpy objects 

# Function that takes a list element and converts
# it to a character
to_r <- function(idx){
  # Obtain a particular output from entire named list
  output_idx = outputs %>% 
    pluck(idx)
  
  # Convert score from numpy to integer
  output_idx$score = paste(output_idx$score) %>% 
    as.double()
  
  return(output_idx)
  
}

# Convert outputs to tibble
map_dfr(1:length(outputs), ~to_r(.x))

3.3 질의응답

텍스트에 질문을 던지고 해당 대답을 찾아내는 작업을 수행해보자.

코드

# Specify task
reader <- transformers$pipeline(task = "question-answering")

# Question we want answered
question <-  "What does the customer want?"

# Provide model with question and context
outputs <- reader(question = question, context = text)
outputs %>% 
  as_tibble()

3.4 요약

텍스트가 매우 긴 경우 이를 단순히 요약할 수 있다.

코드

summarizer <- transformers$pipeline("summarization")
outputs <- summarizer(text, max_length = 56L, clean_up_tokenization_spaces = TRUE)
outputs

3.5 번역

Language Technology Research Group at the University of Helsinki 에서 사전학습모형을 다운로드 받아 번역작업을 수행할 수 있다.

코드

# This requires python package sentencepiece
sentencepiece <- reticulate::import("sentencepiece")

# Explicitly specifying the model you want
translator <- transformers$pipeline(
  task = "translation",
  model = "Helsinki-NLP/opus-mt-tc-big-en-ko") # model = "Helsinki-NLP/opus-mt-en-de"

outputs <- translator(text, clean_up_tokenization_spaces = TRUE,
                      min_length = 100L)

outputs

3.6 텍스트 생성

고객이 남긴 고객의 소리에 다음과 같이 응답원이 처음을 시작하면 기계가 반응을 자동생성시켜 답신을 작성할 수 있다.

코드

generator <- transformers$pipeline("text-generation")
response <- "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt <- paste(text, "\n\nCustomer service response:\n", response)
outputs <- generator(prompt, max_length = 200L)

outputs %>% 
  pluck(1, "generated_text") %>% 
  cat()

3.7 참고문헌

--- title: "chatGPT" subtitle: "Hugging Face" author: - name: 이광춘 url: https://www.linkedin.com/in/kwangchunlee/ affiliation: 한국 R 사용자회 affiliation-url: https://github.com/bit2r title-block-banner: true #title-block-banner: "#562457" format: html: css: css/quarto.css theme: flatly code-fold: true toc: true toc-depth: 3 toc-title: 목차 number-sections: true highlight-style: github self-contained: false filters: - lightbox - interview-callout.lua lightbox: auto link-citations: true knitr: opts_chunk: message: false warning: false collapse: true comment: "#>" R.options: knitr.graphics.auto_pdf: true editor_options: chunk_output_type: console --- # 파이썬 가상환경 파이썬을 계속 사용하다보니 무조건 가상환경을 사용해야 한다는 걸 절실히 느끼게 된다. 시간이 지나면 어떤 패키지들을 설치했었는지 확인이 되지 않고 어떤 것이 문제가 되어 잘 돌던 코드가 제대로 실행되지 않는지 파악이 힘드는 지경에 이르게 된다. 파이썬3에서 `venv`, `virtualenv` 두가지 가상환경 팩키지가 제공되는데 선택을 해야한다. 결론은 파이썬3에서 `venv`가 지원되니 별도 패키지 설치없이 `venv`로 가는 것이 좋다. 1. `python3 -m venv <가상환경 명칭>` 1. `source <가상환경 명칭>/bin/activate` 1. `pip install -U pip` 1. `pip install pandas` 1. `pip freeze > requirements.txt` 가상환경 생성부터 주요한 가상환경 설정 방법을 순차적으로 파악해보자. ::: {.panel-tabset} ## 생성 ```bash py-3.10.9 tidyverse in ~/venv ○ → python3 -m venv venv ``` ## 활성화 ```bash py-3.10.9 tidyverse in ~/venv ○ → source venv/bin/activate ## .\venv\Scripts\activate ## 윈도우즈 ``` ## 파이썬 ```bash ○ → which python /Users/tidyverse/venv/venv/bin/python ``` ## pip 설치 ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → pip install -U pip Requirement already satisfied: pip in ./venv/lib/python3.9/site-packages (21.2.4) Collecting pip Using cached pip-23.0-py3-none-any.whl (2.1 MB) Installing collected packages: pip Attempting uninstall: pip Found existing installation: pip 21.2.4 Uninstalling pip-21.2.4: Successfully uninstalled pip-21.2.4 Successfully installed pip-23.0 ``` ## 판다스 설치 ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → pip install pandas Collecting pandas Downloading pandas-1.5.3-cp39-cp39-macosx_10_9_x86_64.whl (12.0 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 5.7 MB/s eta 0:00:00 Collecting pytz>=2020.1 Using cached pytz-2022.7.1-py2.py3-none-any.whl (499 kB) Collecting numpy>=1.20.3 Downloading numpy-1.24.2-cp39-cp39-macosx_10_9_x86_64.whl (19.8 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 4.3 MB/s eta 0:00:00 Collecting python-dateutil>=2.8.1 Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB) Collecting six>=1.5 Using cached six-1.16.0-py2.py3-none-any.whl (11 kB) Installing collected packages: pytz, six, numpy, python-dateutil, pandas Successfully installed numpy-1.24.2 pandas-1.5.3 python-dateutil-2.8.2 pytz-2022.7.1 six-1.16.0 ``` ## freeze ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → pip freeze numpy==1.24.2 pandas==1.5.3 python-dateutil==2.8.2 pytz==2022.7.1 six==1.16.0 ``` ## `requirements.txt` ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → pip freeze > requirements.txt ``` ## 가상환경 구조 ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → tree -L 2 . ├── requirements.txt └── venv ├── bin ├── include ├── lib └── pyvenv.cfg 4 directories, 2 files ``` ## `deactivate` ```bash |venv|py-3.9.6 tidyverse in ~/venv ○ → deactivate py-3.10.9 tidyverse in ~/venv ○ → ``` ::: # 전형적인 프로젝트 격리된 파이썬 개발환경과 데이터, 코드, ipynb 등이 모두 갖춰진 프로젝트 디렉토리 구조는 다음과 같다. ```bash project_name/ venv/ data/ code/ main.py module1.py module2.py ... notebooks/ analysis.ipynb exploratory.ipynb ... requirements.txt README.md ``` # NLP 작업 분류 [`Hugging Face`](https://huggingface.co/) 웹사이트에서 Transformer를 다운로드 받아 다양한 자연어 처리 작업을 수행한다. 먼저, 작업흐름은 앞서 준비한 파이썬 가상환경에 허깅 페이스에서 Transfomer를 설치하고 이어 각 NLP 작업(task)에 맞춰 후속작업을 이어나간다. [[Hello Transformers from R](https://rpubs.com/eR_ic/transfoRmers)]{.aside} [[R, Reticulate, and Hugging Face Models](https://cengiz.me/posts/huggingface/)]{.aside} ```{mermaid} graph LR A["venv 가상환경"] --> T(("Transformer")) B["R reticulate"] --> T(("Transformer")) T ---> C["텍스트 분류(Text Classification)"] T ---> D["개체명인식(NER)"] T ---> E["질의 응답(Question & Answering)"] T ---> F["요약(Summarization)"] T ---> G["번역(Translation)"] T ---> H["텍스트 생성(Text Generation)"] style A fill:#FF6655AA style T fill:#88ffFF ``` ## 환경 설정 {-} 파이썬 가상환경을 준비하고 `transformers` 및 연관 패키지를 설치한다. ```{r} #| eval: false library(reticulate) use_python("~/venv/venv/bin/python") reticulate::py_config() reticulate::py_available() reticulate::py_install("transformers", pip = TRUE) reticulate::py_install(c("torch", "sentencepiece"), pip = TRUE) ``` ## 감정 분류 영문 텍스트 감정을 분류하는 작업을 수행하자. ```{r} #| eval: false library(reticulate) library(tidyverse) use_python("~/venv/venv/bin/python") text <- ("Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.") # Importing 🤗 transformers into R session transformers <- reticulate::import("transformers") # model_name <- "bert-base-uncased" # model <- transformers$AutoModel$from_pretrained(model_name) # Instantiate a pipeline classifier <- transformers$pipeline(task = "text-classification") # Generate predictions outputs <- classifier(text) # Convert predictions to tibble outputs %>% pluck(1) %>% as_tibble() ``` ## NER 개체명 인식은 텍스트 내부에 지명, 인명, 제품 등을 자동으로 인식하는 과정이다. ```{r} #| eval: false # Download model for ner task ner_tagger <- transformers$pipeline(task = "ner", aggregation_strategy = "simple") # Make predictions outputs <- ner_tagger(text) # Convert predictions to tibble # This takes some bit of effort since some of the variables are numpy objects # Function that takes a list element and converts # it to a character to_r <- function(idx){ # Obtain a particular output from entire named list output_idx = outputs %>% pluck(idx) # Convert score from numpy to integer output_idx$score = paste(output_idx$score) %>% as.double() return(output_idx) } # Convert outputs to tibble map_dfr(1:length(outputs), ~to_r(.x)) ``` ## 질의응답 텍스트에 질문을 던지고 해당 대답을 찾아내는 작업을 수행해보자. ```{r} #| eval: false # Specify task reader <- transformers$pipeline(task = "question-answering") # Question we want answered question <- "What does the customer want?" # Provide model with question and context outputs <- reader(question = question, context = text) outputs %>% as_tibble() ``` ## 요약 텍스트가 매우 긴 경우 이를 단순히 요약할 수 있다. ```{r} #| eval: false summarizer <- transformers$pipeline("summarization") outputs <- summarizer(text, max_length = 56L, clean_up_tokenization_spaces = TRUE) outputs ``` ## 번역 [Language Technology Research Group at the University of Helsinki](https://huggingface.co/Helsinki-NLP) 에서 사전학습모형을 다운로드 받아 번역작업을 수행할 수 있다. ```{r} #| eval: false # This requires python package sentencepiece sentencepiece <- reticulate::import("sentencepiece") # Explicitly specifying the model you want translator <- transformers$pipeline( task = "translation", model = "Helsinki-NLP/opus-mt-tc-big-en-ko") # model = "Helsinki-NLP/opus-mt-en-de" outputs <- translator(text, clean_up_tokenization_spaces = TRUE, min_length = 100L) outputs ``` ## 텍스트 생성 고객이 남긴 고객의 소리에 다음과 같이 응답원이 처음을 시작하면 기계가 반응을 자동생성시켜 답신을 작성할 수 있다. ```{r} #| eval: false generator <- transformers$pipeline("text-generation") response <- "Dear Bumblebee, I am sorry to hear that your order was mixed up." prompt <- paste(text, "\n\nCustomer service response:\n", response) outputs <- generator(prompt, max_length = 200L) outputs %>% pluck(1, "generated_text") %>% cat() ``` ## 참고문헌 - [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098103231/) - [Reticulate: R Interface to Python](https://rstudio.github.io/reticulate/index.html)