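17 Research Papers and Academic Documents

LaTeX Quality, Markdown Convenience

This chapter shows how to write scientific and technical papers with the typesetting quality of LaTeX and the convenience of Markdown.

17.1 Document as Code for Academic Papers

17.1.1 Limits of the Conventional Workflow

Most researchers run into the same difficulties when writing papers:

Formatting hell: LaTeX compile errors and intricate commands
Hard collaboration: tex files passed back and forth by email
Poor reproducibility: figures and tables generated separately and pasted in
No real version control: the nightmare of paper_final_v5_real_final.tex

17.1.2 The Document as Code Approach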

graph TD
    A[Raw data] --> B[Analysis scripts]
    B --> C[Figure/table generation]
    C --> D[Quarto document]
    D --> E[PDF/HTML/Word output]

    F[Git version control] --> A
    F --> B
    F --> D

    G[AI assistant] --> H[Literature review]
    G --> I[Result interpretation]
    H --> D
    I --> D
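
The diagram can also be read as a dependency graph. As a minimal sketch, the Python snippet below encodes the arrows between the content nodes and derives one valid build order with the standard-library graphlib module. The stage names are illustrative identifiers chosen here, not names used by Quarto, and the Git node is left out because it tracks files rather than producing them.

```python
from graphlib import TopologicalSorter

# Each key depends on the stages listed in its set.
# Stage names are illustrative, mirroring the nodes of the diagram.
PIPELINE = {
    "analysis_script": {"raw_data"},
    "figures_and_tables": {"analysis_script"},
    "quarto_document": {
        "figures_and_tables",
        "literature_review",
        "result_interpretation",
    },
    "rendered_output": {"quarto_document"},
}


def build_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


order = build_order(PIPELINE)
print(order)
```

Any order returned this way places the raw data before the analysis script and the rendered output last, which is the property a build tool or CI job would rely on.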

17.2 Setting Up an Academic Paper Template

17.2.1 YAML Header Configuration

---
title: "A Study of Text Classification Using Machine Learning"
subtitle: "A Case Study in Applying the Document as Code Paradigm"
author:
  - name: "이광춘"
    affiliation: "공익법인 한국 R 사용자회"
    email: "kwangchun@r2bit.com"
    orcid: "0000-0000-0000-0000"
  - name: "Co-author"
    affiliation: "한국대학교"
abstract: |
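  This study applies the Document as Code paradigm to carry out
  reproducible machine learning research. The entire research process
  is kept under version control, and an AI assistant is used to improve
  research efficiency. The proposed methodology shows a 10% performance
  improvement over the existing method.

keywords: ["machine learning", "text classification", "reproducible research", "Document as Code"]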

date: today
date-format: "YYYY년 MM월 DD일"  # Korean year-month-day date format

format:
  pdf:
    documentclass: article
    fontsize: 11pt
    geometry: 
      - margin=2.5cm
    number-sections: true
    toc: false
    bibliography-title: "참고문헌"  # "References"
    citeproc: true
  html:
    theme: cosmo
    number-sections: true
    toc: true

bibliography: assets/references.bib
csl: assets/apa-single-spaced.csl
link-citations: true

crossref:
  fig-title: "그림"  # "Figure"
  tbl-title: "표"    # "Table"
  eq-title: "식"     # "Equation"

lang: ko-KR
---
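
A header like this is easy to break when several authors edit it. The sketch below is one rough way to check that a .qmd file still carries the expected top-level keys. It uses only the standard library and plain line scanning rather than a real YAML parser, so it ignores nested keys, and the REQUIRED set is a hypothetical project convention, not something Quarto enforces.

```python
def split_front_matter(text):
    """Return the lines between the first two '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return lines[1:i]
    return []


def top_level_keys(front_matter_lines):
    """Collect unindented 'key:' names; nested keys and list items are skipped."""
    keys = set()
    for line in front_matter_lines:
        if not line or line[0].isspace() or line[0] in "#-":
            continue
        if ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


# Hypothetical house rule for papers in this project.
REQUIRED = {"title", "author", "abstract", "bibliography", "format"}


def missing_keys(text, required=REQUIRED):
    """Return the required keys that the front matter does not define."""
    return sorted(required - top_level_keys(split_front_matter(text)))


sample = """---
title: "Demo"
author:
  - name: "A"
format:
  pdf:
    number-sections: true
---
Body text
"""
print(missing_keys(sample))  # ['abstract', 'bibliography']
```

For anything beyond a quick check, a proper YAML parser would be the safer choice, since this scan does not understand quoting or multi-document files.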

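17.2.2 Paper Structure Template

# Abstract {.unnumbered}

This study...

**Keywords**: machine learning, text classification, reproducible research

# Introduction

Describe the background and purpose of the research.

## Motivation

In the field of text classification today...

## Objectives

The objectives of this study are as follows:

1. Propose a reproducible research methodology based on the Document as Code paradigm
2. Test whether an AI assistant improves research efficiency
3. Compare performance against existing methodologies

# Literature Review

## Trends in Text Classification Research

A look at the major studies of the past three years shows...

## The Document as Code Concept

According to the reproducible research practices presented by @wilson2017good, ...

# Methodology

## Data Collection

This study collected the following data.

## Preprocessing

Data preprocessing consists of the following steps:

## Model Design

The architecture of the proposed model is shown in @fig-model-architecture.

# Experiments and Results

## Experimental Design

## Analysis of Results

## Performance Evaluation

# Conclusion

## Contributions

## Future Work

# References {.unnumbered}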

::: {#refs}
:::
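
The template mixes two kinds of `@` references: citations such as @wilson2017good, which must exist in the .bib file, and cross-references such as @fig-model-architecture, which point at labels inside the document. The following sketch shows one way to list citation keys that have no matching BibTeX entry. The regular expressions are simplifications of Pandoc's citation syntax, and the key knuth1984 in the example is made up for illustration.

```python
import re

# "@key" not preceded by a word character or dot, so e-mail addresses are skipped.
CITE_RE = re.compile(r"(?<![\w.])@([A-Za-z][\w-]*)", re.ASCII)
# "@article{key," style entry headers in a BibTeX file.
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# Prefixes used for cross-references rather than citations.
CROSSREF_PREFIXES = ("fig-", "tbl-", "eq-", "sec-")


def cited_keys(qmd_text):
    """Citation keys used in the document, excluding cross-references."""
    keys = set(CITE_RE.findall(qmd_text))
    return {k for k in keys if not k.startswith(CROSSREF_PREFIXES)}


def bib_keys(bib_text):
    """Entry keys defined in the BibTeX source."""
    return set(BIB_KEY_RE.findall(bib_text))


def undefined_citations(qmd_text, bib_text):
    """Keys cited in the document but missing from the bibliography."""
    return sorted(cited_keys(qmd_text) - bib_keys(bib_text))


qmd = (
    "See @wilson2017good and [@knuth1984]; "
    "the model is shown in @fig-model-architecture. "
    "Contact: kwangchun@r2bit.com"
)
bib = "@article{wilson2017good,\n  title = {Good enough practices}\n}\n"
print(undefined_citations(qmd, bib))  # ['knuth1984']
```

A check like this could run before rendering so that a missing reference is caught early instead of surfacing as an unresolved key in the PDF.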

17.3 Reproducible Data Analysis

17.3.1 Statistical Analysis with R

library(tidyverse)  # read_csv(), dplyr verbs, stringr helpers
library(gt)         # formatted tables

# Load and preprocess the data
data <- read_csv("data/text_classification_data.csv") %>%
  filter(!is.na(text), !is.na(category)) %>%
  mutate(
    text_length = str_length(text),
    word_count = str_count(text, "\\w+"),
    category = factor(category)
  )

# Inspect the structure of the data
glimpse(data)

# Descriptive statistics by category
desc_stats <- data %>%
  group_by(category) %>%
  summarise(
    count = n(),
    mean_length = mean(text_length),
    sd_length = sd(text_length),
    mean_words = mean(word_count),
    sd_words = sd(word_count),
    .groups = "drop"
  )

# Render the summary as a formatted table
desc_stats %>%
  gt() %>%
  tab_header(
    title = "Text characteristics by category",
    subtitle = "Length and word-count statistics"
  ) %>%
  cols_label(
    category = "Category",
    count = "Count",
    mean_length = "Mean length",
    sd_length = "SD of length",
    mean_words = "Mean word count",
    sd_words = "SD of word count"
  ) %>%
  fmt_number(
    columns = c(mean_length, sd_length, mean_words, sd_words),
    decimals = 1
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )

Table 17.1: Descriptive statistics of the dataset
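
For readers who want to see what the grouped summary computes without the tidyverse, here is a small standard-library Python sketch of the same logic: drop rows with missing values, measure character length and `\w+` word count, then take the count, mean, and sample standard deviation per category. The example rows are invented for illustration and are not taken from the chapter's dataset.

```python
import re
from collections import defaultdict
from statistics import mean, stdev


def describe_by_category(rows):
    """Per-category count, mean, and sample SD of text length and word count."""
    groups = defaultdict(list)
    for row in rows:
        # Mirror filter(!is.na(text), !is.na(category))
        if row.get("text") is None or row.get("category") is None:
            continue
        groups[row["category"]].append(row["text"])

    summary = {}
    for category, texts in groups.items():
        lengths = [len(t) for t in texts]
        words = [len(re.findall(r"\w+", t)) for t in texts]
        # A sample SD needs at least two values; R's sd() returns NA otherwise.
        summary[category] = {
            "count": len(texts),
            "mean_length": mean(lengths),
            "sd_length": stdev(lengths) if len(lengths) > 1 else float("nan"),
            "mean_words": mean(words),
            "sd_words": stdev(words) if len(words) > 1 else float("nan"),
        }
    return summary


rows = [
    {"text": "good movie", "category": "pos"},
    {"text": "great", "category": "pos"},
    {"text": "bad plot here", "category": "neg"},
    {"text": None, "category": "neg"},
]
summary = describe_by_category(rows)
for category, stats in summary.items():
    print(category, stats)
```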

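17.3.2 Visualization

library(ggplot2)

# theme_paper is not defined in this excerpt; a minimal theme is assumed here
theme_paper <- theme_minimal(base_size = 11)

# Histogram of text length, one panel per category
p1 <- ggplot(data, aes(x = text_length, fill = category)) +
  geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
  facet_wrap(~category, scales = "free_y") +
  theme_paper +
  labs(
    title = "Distribution of text length by category",
    x = "Text length (characters)",
    y = "Frequency",
    fill = "Category"
  ) +
  scale_fill_viridis_d()

# Box plot of text length with the individual observations overlaid
p2 <- ggplot(data, aes(x = category, y = text_length, fill = category)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 0.5) +
  theme_paper +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Box plot of text length by category",
    x = "Category",
    y = "Text length (characters)",
    fill = "Category"
  ) +
  scale_fill_viridis_d()

# Place the two plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)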

17.4 Machine Learning Modeling

17.4.1 Python Integration

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Fetch the data frame from the R session through reticulate's r object
py_data = r.data

# Feature extraction
vectorizer = TfidfVectorizer(
    max_features=1000,
    stop_words=None,  # Korean stop words need separate handling
    ngram_range=(1, 2)
)

X = vectorizer.fit_transform(py_data['text'])
y = py_data['category']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}

results = {}
for name, model in models.items():
    # Cross-validation on the training set
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
    
    # Fit on the full training set
    model.fit(X_train, y_train)
    
    # Predict on the held-out test set
    y_pred = model.predict(X_test)
    
    results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'model': model,
        'predictions': y_pred
    }
    
    print(f"\n{name} results:")
    print(f"Cross-validated F1 score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
    print("\nClassification report:")
    print(classification_report(y_test, y_pred))
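
The cross-validation above is scored with `f1_weighted`. To make that metric concrete, the sketch below recomputes it by hand with the standard library: the F1 score of each class is weighted by that class's share of the true labels. It is meant as an illustration of the definition, not as a replacement for scikit-learn's implementation, and the toy labels are invented.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted positives for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
# Class a: F1 = 0.8 with weight 3/4; class b: F1 = 2/3 with weight 1/4
print(f"{weighted_f1(y_true, y_pred):.3f}")  # 0.767
```

Because the weights follow class frequency, a model can score well on this metric while doing poorly on a rare class, which is one reason the per-class classification report is printed alongside it.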

17.4.2 Visualizing the Results

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for i, (name, result) in enumerate(results.items()):
    cm = confusion_matrix(y_test, result['predictions'])
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f'{name} confusion matrix')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('Actual')

plt.tight_layout()
plt.show()

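# Pass the results back to R
# (the loop variable is named res so it does not shadow reticulate's r object)
r.model_results = pd.DataFrame({
    'model': list(results.keys()),
    'cv_f1_mean': [res['cv_mean'] for res in results.values()],
    'cv_f1_std': [res['cv_std'] for res in results.values()]
})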

17.4.3 Performance Comparison Table

model_results %>%
  gt() %>%
  tab_header(
    title = "Machine learning model performance",
    subtitle = "5-fold cross-validation results"
  ) %>%
  cols_label(
    model = "Model",
    cv_f1_mean = "F1 score (mean)",
    cv_f1_std = "F1 score (SD)"
  ) %>%
  fmt_number(
    columns = c(cv_f1_mean, cv_f1_std),
    decimals = 3
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style = cell_fill(color = "lightblue"),
    locations = cells_body(
      rows = cv_f1_mean == max(cv_f1_mean)
    )
  )

Table 17.2: Model performance comparison

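17.5 Working with an AI Assistant

17.5.1 Interpreting Results

#> Model performance analysis:
#> 1. Random Forest outperforms Logistic Regression
#> 2. The cross-validation standard deviation is low, indicating stable performance
#> 3. The result confirms the benefit of ensemble methods on the text classification task

17.5.2 Automated Literature Review

#> Related research trends:
#> Recent text classification research is dominated by transformer-based models,
#> but traditional ensemble methods remain competitive.
#> In particular, Random Forest is reported to perform well on small datasets.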

17.6 Collaboration and Peer Review

17.6.1 GitHub-Based Collaboration

# Create a branch
git checkout -b feature/results-analysis

# Commit the changes
git add sci_paper.qmd data/
git commit -m "Add machine learning results analysis

- Implement Random Forest and Logistic Regression
- Add performance comparison table and figures  
- Include AI-assisted interpretation
- Update references with new citations"

# Push and open a pull request
git push origin feature/results-analysis
gh pr create --title "Add ML Results Analysis" \
             --body "Complete analysis section with reproducible code"

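17.6.2 Responding to Reviews

# Reviewer comment: "Add a test of statistical significance"
# Response: add McNemar's test

library(exact2x2)

# Assumes y_test, rf_predictions, and lr_predictions are already available in R
# as equal-length vectors (for example, transferred from the Python session)

# Mark each test example as correctly or incorrectly classified by each model;
# fixing the factor levels keeps the table 2x2 even if a model is never wrong
rf_correct <- factor(rf_predictions == y_test, levels = c(FALSE, TRUE))
lr_correct <- factor(lr_predictions == y_test, levels = c(FALSE, TRUE))

# McNemar's exact test on the paired outcomes
mcnemar_result <- mcnemar.exact(
  table(rf_correct, lr_correct)
)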

print(mcnemar_result)
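
McNemar's test looks only at the discordant pairs: test examples that one model classified correctly and the other did not. Under the null hypothesis each discordant pair is equally likely to favor either model, so the exact p-value is a two-sided binomial tail with probability 0.5. The sketch below computes that p-value in plain Python. The counts in the example are invented, and the function is an illustration of the idea rather than a reimplementation of exact2x2.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: cases model 1 got right and model 2 got wrong
    c: cases model 1 got wrong and model 2 got right
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 10 examples favor the first model, 2 favor the second
p_value = mcnemar_exact(10, 2)
print(f"{p_value:.4f}")  # 0.0386
```

Examples that both models classify correctly, or both misclassify, do not enter the calculation at all, which is why two models with similar accuracy can still differ significantly.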

17.7 Final Output and Distribution

17.7.1 Multi-Format Output

# PDF (for journal submission)
quarto render sci_paper.qmd --to pdf

# HTML (for web publication)
quarto render sci_paper.qmd --to html

# Word (for co-author review)
quarto render sci_paper.qmd --to docx
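
When the three renders are run together often, they can be wrapped in a small script. The sketch below builds the same `quarto render ... --to <format>` commands shown above and runs them in sequence with subprocess. The function names are hypothetical helpers, and `render_all` assumes the quarto CLI is installed and on the PATH.

```python
import shutil
import subprocess

FORMATS = ("pdf", "html", "docx")


def render_commands(source, formats=FORMATS):
    """Build one 'quarto render' command per output format."""
    return [["quarto", "render", source, "--to", fmt] for fmt in formats]


def render_all(source, formats=FORMATS):
    """Run each render in turn, stopping at the first failure."""
    if shutil.which("quarto") is None:
        raise RuntimeError("quarto CLI not found on PATH")
    for cmd in render_commands(source, formats):
        subprocess.run(cmd, check=True)


# Print the commands without running them
for cmd in render_commands("sci_paper.qmd"):
    print(" ".join(cmd))
```

Calling `render_all("sci_paper.qmd")` would then produce all three outputs, and `check=True` makes a failed PDF build stop the run instead of being silently skipped.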

17.7.2 arXiv Preprint

# arXiv metadata
arxiv:
  primary_class: "cs.CL"
  categories: ["cs.CL", "cs.LG", "stat.ML"]
  title: "Machine Learning Text Classification with Document as Code"
  authors: 
    - "Kwangchun Lee"
    - "Co-Author Name"
  abstract: |
    We present a reproducible approach to machine learning research
    using the Document as Code paradigm...

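17.7.3 Reproducibility Checklist

All data files are included, or a download script is provided
Dependency lists are recorded (requirements.txt, renv.lock)
Random seeds are fixed
Environment setup is documented
Code chunks are executable
Figures and tables are generated automatically
The Git repository is public
A license is stated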
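
Some of these items can be checked mechanically. As a rough sketch, the script below looks for the files that usually back the checklist: a dependency lockfile, a license, a Git directory, and a data folder. The file names other than requirements.txt and renv.lock are common conventions assumed here, and items such as fixed random seeds still need a manual review.

```python
from pathlib import Path

# Checklist item -> file or directory names that would satisfy it.
# Names beyond requirements.txt and renv.lock are assumed conventions.
CHECKS = {
    "dependency lockfile": ("requirements.txt", "renv.lock"),
    "license": ("LICENSE", "LICENSE.md", "LICENSE.txt"),
    "git repository": (".git",),
    "data directory": ("data",),
}


def check_repo(root):
    """Return {item: True/False} for each checklist item under root."""
    root = Path(root)
    return {
        name: any((root / candidate).exists() for candidate in candidates)
        for name, candidates in CHECKS.items()
    }


for name, ok in check_repo(".").items():
    print(f"[{'x' if ok else ' '}] {name}")
```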

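17.8 Next Steps

The next chapter covers other kinds of technical writing, including government reports, corporate technical documents, and API documentation. These have different requirements and audiences from academic papers, and the chapter presents approaches tailored to each.

Exercise

Draft a short paper on research you are currently doing or on a topic that interests you. The goal is to experience managing the whole process, from data analysis to interpretation of results, in a single Quarto document.

Academic Ethics

Content produced with an AI assistant must be disclosed appropriately, in line with the policy of the target journal. Most journals require transparent disclosure of AI use.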

AbdulMajedRaja. (2020). Penguins dataset overview - iris alternative in r using palmerpenguins. Programming with R. https://www.programmingwithr.com/penguins-dataset-overview-iris-alternative-in-r/

Alexander, R. (2023). Telling stories with data: With applications in r. CRC Press.

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a

Curty, R., White, T., Lessing, I., Janee, G., Brun, J., & Liu, kristi. (2024). Introduction to reproducible publications with RStudio. Carpentries. https://github.com/carpentries-incubator/reproducible-publications-quarto

Edwards, A. W. F. (2000). The genetical theory of natural selection. Genetics, 154(4), 1419–1426.

European Commission. (2021). Horizon Europe: Open science. Publications Office of the European Union.

Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). Ecological sexual dimorphism and environmental variability within a community of antarctic penguins (genus pygoscelis). PLoS ONE, 9(3), e90081. https://doi.org/10.1371/journal.pone.0090081

Hyde, A. (2021). Single source publishing - a investigation of what single source publishing is and how this “holy grail” can be achieved. https://coko.foundation/articles/single-source-publishing.html

Knuth, D. E. (1984). Literate programming. Comput. J., 27(2), 97–111. https://doi.org/10.1093/comjnl/27.2.97

Levy, I. (2019). Eugenics and the ethics of statistical analysis. GEORGETOWN PUBLIC POLICY REVIEW. https://gppreview.com/2019/12/16/eugenics-ethics-statistical-analysis/

Markowetz, F. (2015). Five selfish reasons to work reproducibly. Genome Biology, 16(1), 274. https://doi.org/10.1186/s13059-015-0850-7

National Institutes of Health. (2020). NIH Policy for Data Management and Sharing. NIH Office of Science Policy.

Nature Portfolio. (2023). Reporting standards and availability of data, materials, code and protocols. Web page.

Perkel, J. (2016). Democratic databases: Science on GitHub. Nature, 538(7623), 127–128.

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer Science & Business Media. https://doi.org/10.1007/0-387-28695-0

Wilson, G. (2016). Modern scientific authoring. Carpentries. https://swcarpentry.github.io/modern-scientific-authoring/

Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), e1005510.

이광춘. (2023). 공간정보의 역사 및 공간정보 처리기법. PROPBIX, 13.

정환봉. (2020). 여당 의원 176명 중 누가?...차별금지법 발의할 ’의인’을 구합니다. 한겨레 신문. http://www.hani.co.kr/arti/politics/assembly/949422.html