1 자연어 → SQL

code-davinci-002이 사용중단(deprecated)되어 text-davinci-002, text-davinci-003 모델들이 Codex 기반으로 프로그램 작성이 가능하다. (Plaetsen, 2023)

영어 프롬프트

코드

prompt = "Get all the users that are older than 35 years old"
model = "text-davinci-002"
temperature = 0.0
max_tokens = 50

response = openai.Completion.create(
    engine=model,
    prompt=prompt,
    temperature=temperature,
    max_tokens=max_tokens,
)

print(response.choices[0].text)



SELECT * FROM users WHERE age > 35;

원천: SQL 쥬피터 노트북

한글 프롬프트

코드

prompt = "users 테이블에서 35세 이상 사용자를 추출하는 SQL 코드를 작성하세요"
model = "text-davinci-002"
temperature = 0.5
max_tokens = 50

response = openai.Completion.create(
    engine=model,
    prompt=prompt,
    temperature=temperature,
    max_tokens=max_tokens,
)

print(response.choices[0].text)

.

```mysql
SELECT * FROM users WHERE age >= 35;
```

원천: SQL 쥬피터 노트북

1.1 함수

generate_sql_query() 함수를 작성하여 국,영문 프롬프트를 작성하여 바로 SQL 쿼리를 작성하는 것도 가능하다.

코드

def generate_sql_query(prompt, max_tokens = 100):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=max_tokens
    )
    return response.choices[0].text

generate_sql_query("Get all the users that are older than 35 years old from users table")

'\n\nSELECT * FROM users WHERE age > 35;'

원천: SQL 쥬피터 노트북

2 Langchain + SQL

국영문 프롬프트를 주어 자연어를 SQL 쿼리문을 작성하는 것에서 한걸음 더 들어가 데이터베이스에서 SQL 쿼리문을 실행하여 결과값을 얻는 것도 가능하다. 소프트웨어 카펜트리 SQL 교과정에서 사용하는 survey.db를 대상으로 테이블이 존재하는지 존재한다면 몇개인지 테이블 명은 무엇인지 파악할 수 있다.

Databases and SQL

코드

from langchain import OpenAI, SQLDatabase, SQLDatabaseChain
dvd_db = SQLDatabase.from_uri("sqlite:///survey.db")
llm = OpenAI(temperature=0)

db_chain = SQLDatabaseChain.from_llm(llm, dvd_db, verbose=True)
db_chain.run("How many tables are there? If there exists tables, list table names")



> Entering new SQLDatabaseChain chain...
How many tables are there? If there exists tables, list table names
SQLQuery: SELECT name FROM sqlite_master WHERE type='table';
SQLResult: [('Person',), ('Site',), ('Survey',), ('Visited',)]
Answer: There are 4 tables: Person, Site, Survey, Visited.
> Finished chain.

' There are 4 tables: Person, Site, Survey, Visited.'

원천: SQL 쥬피터 노트북

2.1 정렬과 중복제거

Sorting and Removing Duplicates에 언급된 “Person 테이블에 있는 과학자의 전체 이름을 성을 기준으로 정렬하여 표시하는 쿼리를 작성합니다.” 라는 퀴즈문제를 풀어보자.

SWC 코드

코드

library(DBI)
survey_db <- dbConnect(RSQLite::SQLite(), "jupyterlab/survey.db")
dbGetQuery(survey_db, "SELECT personal, family FROM Person ORDER BY family ASC;")
#>    personal   family
#> 1     Frank Danforth
#> 2   William     Dyer
#> 3  Anderson     Lake
#> 4     Frank  Pabodie
#> 5 Valentina  Roerich

챗GPT 코드

코드

db_chain.run("Write a query that displays the full names of the scientists in the Person table, ordered by family name.")



> Entering new SQLDatabaseChain chain...
Write a query that displays the full names of the scientists in the Person table, ordered by family name.
SQLQuery: SELECT "personal" || ' ' || "family" AS "Full Name" FROM "Person" ORDER BY "family" ASC LIMIT 5;
SQLResult: [('Frank Danforth',), ('William Dyer',), ('Anderson Lake',), ('Frank Pabodie',), ('Valentina Roerich',)]
Answer: The full names of the scientists in the Person table, ordered by family name, are Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, and Valentina Roerich.
> Finished chain.

' The full names of the scientists in the Person table, ordered by family name, are Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, and Valentina Roerich.'

원천: SQL 쥬피터 노트북

2.2 총계

Calculating New Values에 언급된 “자세히 읽어본 결과, 발렌티나 로리히가 염도를 백분율로 보고하고 있다는 것을 알 수 있습니다. 설문 조사 테이블에서 그녀의 모든 염도 측정값을 100으로 나눈 값을 반환하는 쿼리를 작성합니다.(After further reading, we realize that Valentina Roerich was reporting salinity as percentages. Write a query that returns all of her salinity measurements from the Survey table with the values divided by 100.)” 라는 퀴즈문제를 풀어보자.

SWC 코드

코드

dbGetQuery(survey_db, "SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal';")
#>   taken reading / 100
#> 1   752         0.416
#> 2   837         0.225

챗GPT 코드

코드

db_chain.run('After further reading, we realize that Valentina Roerich(roe) was reporting salinity as percentages.\
              Write a query that returns all of her salinity measurements from the Survey table with the values divided by 100.')



> Entering new SQLDatabaseChain chain...
After further reading, we realize that Valentina Roerich(roe) was reporting salinity as percentages.              Write a query that returns all of her salinity measurements from the Survey table with the values divided by 100.
SQLQuery: SELECT person, quant, reading/100 AS reading FROM Survey WHERE person = 'roe' AND quant = 'sal';
SQLResult: [('roe', 'sal', 0.41600000000000004), ('roe', 'sal', 0.225)]
Answer: Valentina Roerich reported salinity measurements of 0.416 and 0.225.
> Finished chain.

' Valentina Roerich reported salinity measurements of 0.416 and 0.225.'

원천: SQL 쥬피터 노트북

2.3 챗GPT 모델

앞선 완성(Completion) 모형 대신 챗GPT 모형을 사용하여 고급 SQL문을 작성할 수 있다. 데이터베이스의 데이터베이스 스키마 정보를 확보하자.

코드

db_chain.run('print database schema info')



> Entering new SQLDatabaseChain chain...
print database schema info
SQLQuery: SELECT * FROM sqlite_master;
SQLResult: [('table', 'Person', 'Person', 2, 'CREATE TABLE Person (id text, personal text, family text)'), ('table', 'Site', 'Site', 3, 'CREATE TABLE Site (name text, lat real, long real)'), ('table', 'Survey', 'Survey', 5, 'CREATE TABLE Survey (taken integer, person text, quant text, reading real)'), ('table', 'Visited', 'Visited', 4, 'CREATE TABLE Visited (id integer, site text, dated text)')]
Answer: The database schema info is: Person (id text, personal text, family text), Site (name text, lat real, long real), Survey (taken integer, person text, quant text, reading real), Visited (id integer, site text, dated text).
> Finished chain.

' The database schema info is: Person (id text, personal text, family text), Site (name text, lat real, long real), Survey (taken integer, person text, quant text, reading real), Visited (id integer, site text, dated text).'

원천: SQL 쥬피터 노트북

프롬프트 참조: TalkToSQL

Combining Data에 나오는 사례를 제대로 SQL 쿼리문을 작성하기 위해서는 system_prompt에 역할 부여는 물론 Few-Shot learning을 위한 사례도 전달하고 user_prompt 프롬프트 작성에 데이터베이스 테이블 정보도 넘겨줘야 원하는 쿼리문을 작성할 가능이 높아진다.

코드

system_prompt = """
    You are the world's best SQL expert. Help me convert natural language to valid SQL queries. Only respond with valid SQL queries, nothing else.
    You must learn the column names based on the information the user gives you and build valid SQL queries. Never guess the column names.
    These are the examples:

    query: get all people names
    answer: SELECT name from people;

    query: get all cars whose owner name is aaron
    answer: SELECT c.* FROM people p JOIN cars c ON p.id = c.owner_id WHERE p.name = 'aaron';
"""

query='Write a query that lists all radiation readings from the DR-1 site step-by-step'

user_prompt = f"""
    This is my database information:
    Person (id text, personal text, family text), 
    Site (name text, lat real, long real), 
    Survey (taken integer, person text, quant text, reading real), 
    Visited (id integer, site text, dated text)

    query: {query}
    answer:
"""



completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(completion.choices[0].message.content)

원천: SQL 쥬피터 노트북

SWC 코드

코드

dbGetQuery(survey_db, "SELECT
   Survey.reading
FROM
   Site
   JOIN
      Visited
  JOIN
      Survey
      ON Site.name = Visited.site
      AND Visited.id = Survey.taken
WHERE
   Site.name = 'DR-1'
   AND Survey.quant = 'rad';")
#>   reading
#> 1    9.82
#> 2    7.80
#> 3   11.25

챗GPT 코드

코드

dbGetQuery(survey_db, "SELECT s.reading
           FROM Visited v
           JOIN Survey s ON v.id = s.taken AND v.site = 'DR-1'
           WHERE s.quant = 'rad';")
#>   reading
#> 1    9.82
#> 2    7.80
#> 3   11.25

3 관련정보

참고문헌

Plaetsen, M. V. (2023). Using OpenAI with structured data: A beginner’s guide. In Medium. Medium. https://medium.com/@margauxvanderplaetsen/using-openai-with-structured-data-a-beginners-guide-2d12719a72a9