A source code highlighter implemented in Rust @ntalbs' stuff

How does one build a source code highlighter? I decided to write a simple one in Rust. The plan is to scan the source code into a list of tokens and then render each token according to its type.

Token types

The token kinds are defined as follows.

#[derive(Debug, PartialEq)]
pub(crate) enum Token {
    Whitespace(String),
    NewLine(),
    Punctuation(String),
    Number(String),
    String(String),
    LineComment(String),
    BlockComment(String),
    Name(String),
    Keyword(String),
    Eof,
}

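Whitespace represents blank space. Whitespace could have covered NewLine as well, but the two are kept separate to make printing line numbers easier at render time.

All special characters and operators map to Punctuation. A source code highlighter does not need to know what each operator or special character means, so they are all rendered the same way here.

Number and String exist so that numbers and strings can be rendered differently. Comments are split into LineComment and BlockComment.

Function names and variable names all map to Name, and keywords map to Keyword. Eof is used only internally and never when rendering code.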
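Because whitespace and newlines are kept as tokens, the token list preserves every character of the input, so joining the token texts back together should reproduce the source. A minimal sketch of that property, using a trimmed copy of the enum; the `text` helper and the hand-written token list are illustrative assumptions, not part of the original code:

```rust
// Trimmed copy of the Token enum, for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    Whitespace(String),
    NewLine(),
    Punctuation(String),
    Number(String),
    Name(String),
    Keyword(String),
}

// Hypothetical helper: the source text that a token stands for.
fn text(token: &Token) -> &str {
    match token {
        Token::NewLine() => "\n",
        Token::Whitespace(s)
        | Token::Punctuation(s)
        | Token::Number(s)
        | Token::Name(s)
        | Token::Keyword(s) => s.as_str(),
    }
}

// Hand-written token list for the input "let x = 1;\n".
fn sample_tokens() -> Vec<Token> {
    vec![
        Token::Keyword("let".into()),
        Token::Whitespace(" ".into()),
        Token::Name("x".into()),
        Token::Whitespace(" ".into()),
        Token::Punctuation("=".into()),
        Token::Whitespace(" ".into()),
        Token::Number("1".into()),
        Token::Punctuation(";".into()),
        Token::NewLine(),
    ]
}

fn main() {
    // Concatenating the token texts gives back the original source.
    let source: String = sample_tokens().iter().map(text).collect();
    print!("{source}");
}
```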

Scanner

Now let's build the scanner. A scanner is also called a lexer or a lexical analyser. It reads source code and produces a list of tokens. The Scanner struct is defined as follows.

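use std::collections::HashSet; // used by is_keyword below
use std::iter::Peekable;
use std::str::Chars;

pub struct Scanner<'a> {
    iter: Peekable<Chars<'a>>,
    current_char: Option<char>,
}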

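The Scanner struct has two fields. iter is an iterator over the input. It is wrapped in Peekable so that the next value can be inspected without advancing the iterator.

current_char holds the current character. It has no value at first, so its type is Option<char>.

Combined with peek, this lets the scanner look at two characters at once. As shown later, some token types can only be decided by looking at two characters.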
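The two-character view can be tried in isolation. The sketch below is a standalone illustration, not the Scanner itself: the `classify` function and its return strings are hypothetical, and it only shows how a consumed character plus a peeked one are enough to tell a comment from a plain slash.

```rust
// Hypothetical helper: decide what the first character of `input` starts,
// using the consumed character plus one character of lookahead.
fn classify(input: &str) -> &'static str {
    let mut iter = input.chars().peekable();
    let current_char = iter.next(); // consumes the first character
    // peek() looks at the following character without consuming it.
    match (current_char, iter.peek()) {
        (Some('/'), Some('/')) => "line comment",
        (Some('/'), Some('*')) => "block comment",
        (Some('/'), _) => "punctuation",
        (Some(_), _) => "something else",
        (None, _) => "end of input",
    }
}

fn main() {
    for src in ["// note", "/* note */", "/ 2", "x"] {
        println!("{src:?} starts a {}", classify(src));
    }
}
```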

The constructor can be written as follows.

impl<'a> Scanner<'a> {
    pub(crate) fn new(input: &'a str) -> Self {
        Self {
            iter: input.chars().peekable(),
            current_char: Option::None,
        }
    }
...

Then add the following helper methods to Scanner.

    fn advance(&mut self) -> Option<char> {
        self.iter.next()
    }

    fn peek(&mut self) -> Option<&char> {
        self.iter.peek()
    }

    fn is_at_end(&mut self) -> bool {
        self.peek().is_none()
    }

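The advance method returns the next character and advances the iterator. The peek method returns the next character without advancing the iterator. The is_at_end method checks whether the end of the input has been reached.

The input is now read one character at a time through the iterator to produce tokens. The method that reads the input and builds the token list looks like this.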

    pub(crate) fn scan(&mut self) -> Vec<Token> {
        let mut tokens: Vec<Token> = Vec::new();

        while !self.is_at_end() {
            let token = self.next_token();
            match token {
                Token::Eof => break,
                _ => tokens.push(token),
            }
        }
        tokens
    }

The method that actually produces tokens is next_token().

    fn next_token(&mut self) -> Token {
        self.current_char = self.advance();
        match self.current_char {
            Some(' ') => self.whitespace(),
            Some('\n') => self.newline(),
            Some('"') => self.string(),
            Some('/') => match self.peek() {
                Some('/') => self.line_comment(),
                Some('*') => self.block_comment(),
                // A '/' followed by anything else, or by the end of the
                // input, is ordinary punctuation.
                _ => self.punctuation(),
            },
            Some(c) if Self::is_punctuation(c) => self.punctuation(),
            Some(c) if c.is_ascii_digit() => self.number(),
            Some(_) => self.name_or_keyword(),
            None => Token::Eof,
        }
    }

Whitespace

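If the current character is a space, the whitespace method is called. It gathers consecutive spaces into a single Whitespace token. Only the space character is matched here, so a tab is not treated as whitespace and falls through to name_or_keyword.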

    fn whitespace(&mut self) -> Token {
        let mut buf = String::from(self.current_char.unwrap());
        while let Some(' ') = self.peek() {
            buf.push(' ');
            self.advance();
        }
        Token::Whitespace(buf)
    }

Newline

If the current character is a newline, the newline method is called.

    fn newline(&mut self) -> Token {
        Token::NewLine()
    }

String

Some programming languages allow strings in single quotes, but only double-quoted strings are handled here. When a double quote is encountered, the string method is called.

    fn string(&mut self) -> Token {
        let mut buf = String::from("\"");

        let mut prev_char: char = '\n';
        while let Some(&c) = self.peek() {
            match c {
                '\n' => break,
                '"' if prev_char != '\\' => break, // if prev_char=='\\', then escaped
                _ => {
                    buf.push(c);
                    prev_char = c;
                    self.advance();
                }
            }
        }
        // Consume only a closing quote. A '\n' (unterminated string) is left
        // in place so that it still becomes a NewLine token.
        if let Some('"') = self.peek() {
            buf.push('"');
            self.advance();
        }
        Token::String(buf)
    }

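The string method is a little more involved. The basic logic is to read characters from the opening double quote until the string ends and then return a token, but special cases such as an unterminated string or a double quote inside the string need handling.

Since the code is not going to be executed, the scanner must not return an error just because the input contains mistakes. It simply returns tokens for whatever it scanned and leaves it to the renderer to display them sensibly.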
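One case the prev_char check does not cover is a string that ends in an escaped backslash, such as `"\\"`: the closing quote is preceded by a backslash, so it is treated as escaped. A possible alternative is to track an `escaped` flag instead. The sketch below is a standalone function over a `&str` (the name `scan_string` and its signature are assumptions made for illustration), and it keeps the same error-tolerant behaviour for unterminated strings.

```rust
// Hypothetical standalone version: `input` is expected to start with '"'.
// Returns the text of the string token, which may be unterminated.
fn scan_string(input: &str) -> String {
    let mut iter = input.chars().peekable();
    if iter.next() != Some('"') {
        return String::new();
    }
    let mut buf = String::from('"');
    let mut escaped = false; // true right after an unescaped backslash

    while let Some(&c) = iter.peek() {
        if c == '\n' {
            break; // unterminated string: leave the newline unconsumed
        }
        iter.next();
        buf.push(c);
        if escaped {
            escaped = false; // this character was escaped, whatever it is
        } else if c == '\\' {
            escaped = true;
        } else if c == '"' {
            break; // unescaped closing quote
        }
    }
    buf
}

fn main() {
    println!("{}", scan_string(r#""a\"b" + x"#));
    println!("{}", scan_string(r#""\\" + x"#));
}
```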

Punctuation

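Every special character should become a Punctuation token, with some exceptions. A double quote (") marks the start of a string and needs separate handling. A slash can become a LineComment or a BlockComment depending on the character that follows it, or it can simply be Punctuation.

The slash is covered a little later, so let's handle Punctuation first. Write an is_punctuation helper function as follows. It returns true when the given character is not alphanumeric, not whitespace, and not a double quote.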

    fn is_punctuation(c: char) -> bool {
        match c {
            c if c.is_alphanumeric() => false,
            c if c.is_ascii_whitespace() => false,
            '"' => false,
            _ => true,
        }
    }

Then define the punctuation method as follows. next_token calls this method when it encounters a special character.

    fn punctuation(&mut self) -> Token {
        let mut buf = String::from(self.current_char.unwrap());

        while let Some(&c) = self.peek() {
            if !Self::is_punctuation(c) {
                break;
            }
            buf.push(c);
            self.advance();
        }
        Token::Punctuation(buf)
    }

Consecutive special characters are merged into a single token.
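One consequence of merging is that a comment directly after an operator, as in `x+=// note`, would be swallowed into the punctuation run, because `/` is itself punctuation. If that matters, the run could stop just before `//` or `/*`. The following is only a sketch under that assumption; `punctuation_run` is a hypothetical standalone function working on a `&str` rather than on the Scanner.

```rust
fn is_punctuation(c: char) -> bool {
    !(c.is_alphanumeric() || c.is_ascii_whitespace() || c == '"')
}

// Hypothetical helper: the run of punctuation at the start of `input`,
// ending early if a comment opener ("//" or "/*") begins inside the run.
fn punctuation_run(input: &str) -> &str {
    let mut end = 0;
    for (i, c) in input.char_indices() {
        let rest = &input[i..];
        let comment_starts = i > 0 && (rest.starts_with("//") || rest.starts_with("/*"));
        if !is_punctuation(c) || comment_starts {
            break;
        }
        end = i + c.len_utf8();
    }
    &input[..end]
}

fn main() {
    println!("{}", punctuation_run("+=// note")); // the comment is left for the next token
    println!("{}", punctuation_run("); x"));
}
```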

Number

When a digit is encountered, the number method is called.

    fn number(&mut self) -> Token {
        let mut buf = String::from(self.current_char.unwrap());
        while let Some(&c) = self.peek() {
            if !c.is_ascii_digit() && c != '.' && c != '_' {
                break;
            }
            buf.push(c);
            self.advance();
        }
        Token::Number(buf)
    }

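A number may contain a . as in 3.14 or a _ as in 1_000, so both are handled. The implementation does not check that the decimal point appears only once, so an invalid number such as 1.2.3 is still turned into a single number token. For a syntax highlighter, though, this does not matter.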
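If stricter behaviour were ever wanted, one option is to accept a dot only once and only when a digit follows it, which also keeps a range such as `0..10` from being read as one number. This is a sketch of that variant, not the article's implementation; `number_prefix` is a hypothetical standalone function.

```rust
// Hypothetical variant: returns the numeric prefix of `input`, accepting
// digits, '_' and at most one '.' that is directly followed by a digit.
fn number_prefix(input: &str) -> &str {
    let mut seen_dot = false;
    let mut end = 0;
    let mut chars = input.char_indices().peekable();

    while let Some((i, c)) = chars.next() {
        let next_is_digit = matches!(chars.peek(), Some((_, d)) if d.is_ascii_digit());
        match c {
            '0'..='9' | '_' => {}
            '.' if !seen_dot && next_is_digit => seen_dot = true,
            _ => break,
        }
        end = i + c.len_utf8();
    }
    &input[..end]
}

fn main() {
    for src in ["3.14;", "1_000 ", "1.2.3", "0..10"] {
        println!("{src:?} -> {:?}", number_prefix(src));
    }
}
```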

Name, Keyword

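If the token is not a LineComment or BlockComment and is none of the other token types described above, it can only be a Name or a Keyword. Inside next_token, however, a single character is not enough to tell which of the two it will be, so the name_or_keyword method is written as follows.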

    fn name_or_keyword(&mut self) -> Token {
        let mut buf = String::from(self.current_char.unwrap());

        while let Some(&c) = self.peek() {
            if c == ' ' || c == '\n' || !Self::is_valid_for_identifier(c) {
                break;
            }
            buf.push(c);
            self.advance();
        }
        if Self::is_keyword(&buf) {
            Token::Keyword(buf)
        } else {
            Token::Name(buf)
        }
    }

name_or_keyword uses the is_valid_for_identifier helper function, which returns true or false depending on whether the given character can be used in an identifier.

    fn is_valid_for_identifier(c: char) -> bool {
        c.is_alphanumeric() || c == '_'
    }

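After reading the text as if it were an identifier, the is_keyword helper function decides whether it is a keyword. If is_keyword returns true a Keyword token is returned, otherwise a Name token.

The is_keyword helper function is shown below. It keeps a HashSet of all Rust keywords inside the function. A separate keyword set could be defined with lazy_static or phf, but neither is used here in order to avoid external dependencies.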

    fn is_keyword(name: &str) -> bool {
        let keywords: HashSet<&str> = HashSet::<_>::from_iter([
            "Self",
            "abstract",
            "as",
            "async",
            "await",
             // ...
            "while",
            "yield",
        ]);
        keywords.contains(name)
    }
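The version above rebuilds the set on every call. Staying within the standard library, the set could instead be built once with `std::sync::OnceLock` (stable since Rust 1.70). The sketch below assumes that approach and lists only a few keywords for brevity; it is not the full keyword list.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

// Builds the keyword set on first use and reuses it afterwards.
fn is_keyword(name: &str) -> bool {
    static KEYWORDS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    KEYWORDS
        .get_or_init(|| {
            // Deliberately partial list, for illustration only.
            HashSet::from(["as", "fn", "let", "while"])
        })
        .contains(name)
}

fn main() {
    println!("let: {}", is_keyword("let"));
    println!("foo: {}", is_keyword("foo"));
}
```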

This function only contains Rust keywords. Supporting other languages does not look difficult. How to support multiple languages is left for you to think about.
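As one possible direction, and only as a sketch, the keyword table could be selected by a language value. The `Language` enum, the `keywords` function and the partial keyword lists below are all hypothetical and not part of the original code.

```rust
// Hypothetical language selector.
#[derive(Clone, Copy)]
enum Language {
    Rust,
    Python,
}

// Deliberately partial keyword lists, for illustration only.
fn keywords(lang: Language) -> &'static [&'static str] {
    match lang {
        Language::Rust => &["fn", "let", "match", "while"],
        Language::Python => &["def", "lambda", "while", "yield"],
    }
}

fn is_keyword(lang: Language, name: &str) -> bool {
    keywords(lang).contains(&name)
}

fn main() {
    println!("{}", is_keyword(Language::Rust, "fn"));
    println!("{}", is_keyword(Language::Python, "fn"));
}
```

Keywords are only part of the picture: comment and string syntax also differ between languages, so the scanner itself would need similar per-language switches.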

LineComment

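Now for LineComment and BlockComment, which were postponed earlier. When the current character is /, the token type depends on the next character. If the next character is also /, the token is a LineComment, and if it is *, the token is a BlockComment. Otherwise it is just a / (for example the division operator), so it becomes Punctuation.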

    fn next_token(&mut self) -> Token {
        self.current_char = self.advance();
        match self.current_char {
            // ...
            Some('/') => match self.peek() {
                Some('/') => self.line_comment(),
                Some('*') => self.block_comment(),
                // A '/' followed by anything else, or by the end of the
                // input, is ordinary punctuation.
                _ => self.punctuation(),
            },
            // ...
            None => Token::Eof,
        }
    }

When // is encountered, the line_comment method is called. Everything from // to the end of that line is treated as a LineComment.

    fn line_comment(&mut self) -> Token {
        let mut buf = String::from(self.current_char.unwrap());

        while let Some(&c) = self.peek() {
            if c == '\n' {
                break;
            }
            buf.push(c);
            self.advance();
        }
        Token::LineComment(buf)
    }

BlockComment

When /* is encountered, the block_comment method is called. It has to keep reading until it meets */.

    fn block_comment(&mut self) -> Token {
        // current_char is '/'. Consume the opening '*' up front so that
        // "/*/" is not mistaken for a complete comment.
        let mut buf = String::from("/*");
        self.advance();

        while let Some(c) = self.advance() {
            buf.push(c);
            if c == '*' && self.peek() == Some(&'/') {
                // Closing "*/": consume the '/' and stop.
                self.advance();
                buf.push('/');
                break;
            }
        }

        // An unterminated comment simply runs to the end of the input.
        Token::BlockComment(buf)
    }

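BlockComment needs a little more work, though. block_comment returns a single BlockComment token, which causes trouble later when the output should include line numbers, because one token can span several lines.

This can be handled either by special-casing BlockComment in the renderer or by splitting a BlockComment into lines ahead of time. Having block_comment return a list of per-line BlockComment tokens is not an option, because next_token must return a single token.

Producing the token list is the scanner's job, so the BlockComment is split into lines inside the scanner.

To do that, add a helper function that takes a string, splits it by line, and returns a list of BlockComment tokens. A NewLine token has to be inserted between consecutive BlockComment tokens.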

    fn break_by_line(str: String) -> Vec<Token> {
        let mut bcl = str.split('\n').map(|line| Token::BlockComment(line.into()));
        let mut ret = Vec::new();
        if let Some(line) = bcl.next() {
            ret.push(line);
        }
        for line in bcl {
            ret.push(Token::NewLine());
            ret.push(line);
        }
        ret
    }
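To make the effect concrete, here is a standalone copy of the helper with a trimmed Token enum (only the two variants it needs) and a small usage example. The parameter is renamed to `text` here purely for readability.

```rust
// Trimmed copy of the Token enum, for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    NewLine(),
    BlockComment(String),
}

// Splits a multi-line block comment into one BlockComment per line,
// with a NewLine token between consecutive lines.
fn break_by_line(text: String) -> Vec<Token> {
    let mut lines = text.split('\n').map(|line| Token::BlockComment(line.into()));
    let mut ret = Vec::new();
    if let Some(first) = lines.next() {
        ret.push(first);
    }
    for line in lines {
        ret.push(Token::NewLine());
        ret.push(line);
    }
    ret
}

fn main() {
    let tokens = break_by_line("/* first\n   second */".to_string());
    println!("{tokens:?}");
}
```

A comment that fits on one line comes back as a single BlockComment, so the renderer never sees a newline hidden inside a token.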

Then modify the scan method as follows.

    pub(crate) fn scan(&mut self) -> Vec<Token> {
        let mut tokens: Vec<Token> = Vec::new();

        while !self.is_at_end() {
            let token = self.next_token();
            match token {
                Token::Eof => break,
                // break_by_line is an associated function, so it is called through Self.
                Token::BlockComment(s) => tokens.append(&mut Self::break_by_line(s)),
                _ => tokens.push(token),
            }
        }
        tokens
    }

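That completes the scanner.

Rendering

Once the token list is available, rendering is a piece of cake. Only a console renderer is covered here, but implementing an HTML renderer should not be hard either.

First, the following helper function is needed. It colors each token according to its type. Console colors are produced with colorust, a simple library modeled on colored-rs that changes the output color using ANSI escape codes.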

fn render_token_to_console(token: &Token) -> String {
    match token {
        Token::Whitespace(s) => s.into(),
        Token::NewLine() => "\n".into(),
        Token::Punctuation(s) => s.red(),
        Token::Number(s) => s.yellow(),
        Token::String(s) => s.bright_magenta(),
        Token::LineComment(s) => s.green(),
        Token::BlockComment(s) => s.bright_green(),
        Token::Name(s) => s.white(),
        Token::Keyword(s) => s.blue(),
        _ => "".into(),
    }
}
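For readers unfamiliar with ANSI escape codes, the idea can be shown without any library. The sketch below is not colorust's API; `paint` is a hypothetical helper that wraps text in a standard SGR color sequence and resets the color afterwards.

```rust
// Hypothetical helper, not part of colorust: wrap `text` in an ANSI SGR
// sequence. For example 31 = red, 32 = green, 33 = yellow, 34 = blue.
fn paint(text: &str, sgr_code: u8) -> String {
    // ESC [ <code> m switches the color on; ESC [ 0 m resets all attributes.
    format!("\x1b[{sgr_code}m{text}\x1b[0m")
}

fn main() {
    println!("{} {}", paint("fn", 34), paint("// comment", 32));
}
```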

The function that renders the input source code to the console can now be written as follows.

pub fn render_to_console(input: &Vec<Token>) {
    for token in input {
        print!("{}", render_token_to_console(token));
    }
    println!();
}

To print line numbers as well, use the following function.

pub fn render_to_console_with_line_num(input: &Vec<Token>) {
    let mut num: usize = 1;
    print!("{num:-3} ");
    for token in input {
        print!("{}", render_token_to_console(token));
        if *token == Token::NewLine() {
            num += 1;
            print!("{num:-3} ");
        }
    }
    println!();
}
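An HTML renderer could follow the same shape as the console renderer. The sketch below is one way it might look, under a few assumptions: the enum is a trimmed copy, the CSS class names are made up for illustration, and the output is wrapped in `<pre><code>` so that whitespace and newlines are preserved.

```rust
// Trimmed copy of the Token enum, for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    Whitespace(String),
    NewLine(),
    Punctuation(String),
    Name(String),
    Keyword(String),
}

// Escape the characters that are significant in HTML element content.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

// The class names ("punctuation", "name", "keyword") are hypothetical.
fn render_token_to_html(token: &Token) -> String {
    let (class, text) = match token {
        Token::Whitespace(s) => return s.clone(),
        Token::NewLine() => return "\n".to_string(),
        Token::Punctuation(s) => ("punctuation", s),
        Token::Name(s) => ("name", s),
        Token::Keyword(s) => ("keyword", s),
    };
    format!("<span class=\"{class}\">{}</span>", escape_html(text))
}

fn render_to_html(tokens: &[Token]) -> String {
    let body: String = tokens.iter().map(render_token_to_html).collect();
    format!("<pre><code>{body}</code></pre>")
}

fn main() {
    let tokens = vec![
        Token::Keyword("fn".into()),
        Token::Whitespace(" ".into()),
        Token::Name("lt".into()),
        Token::Punctuation("<".into()),
        Token::NewLine(),
    ];
    println!("{}", render_to_html(&tokens));
}
```

Colors would then come from a stylesheet keyed on the class names rather than from the renderer itself.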

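Wrapping up

This post walked through how to write a syntax highlighter. The implementation is very basic and hardly fit for real use, but it should be enough to understand how a source code highlighter can be built.

The syntax highlighter implemented here supports only Rust, and only console output was covered. With the explanation above understood, though, supporting other languages or adding an HTML renderer should not be difficult.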

References

prism-rs: the full source code of the syntax highlighter.
colorust: a simple library modeled on colored-rs.