Regular Expressions: Parsing Strings with Pattern and Matcher

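You probably know that a regex can extract IP addresses from logs or validate an email format. But did you know that a badly written one can bring a server to a halt?

A regular expression is a tool for matching, extracting, and replacing patterns in strings. In Java you use it through Pattern (immutable, thread-safe) and Matcher (stateful and not thread-safe, so create a new one per use instead of sharing it between threads).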

Basic usage

Pattern and Matcher

JAVA

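import java.util.regex.Matcher;
import java.util.regex.Pattern;

// 1. Compile the pattern (compilation is relatively expensive, so reuse the result)
private static final Pattern EMAIL_PATTERN =
    Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

// 2. Create a Matcher (stateful, so never share one between threads)
Matcher matcher = EMAIL_PATTERN.matcher("contact@example.com");

// 3. Check for a match
if (matcher.matches()) {       // Does the entire string match the pattern?
    System.out.println("Valid email");
}

matcher.reset();               // find() resumes after the previous match, so rewind first
if (matcher.find()) {          // Does the pattern occur anywhere in the string?
    System.out.println("Found: " + matcher.group());
}

if (matcher.lookingAt()) {     // Does the start of the string match the pattern?
    System.out.println("Prefix matches");
}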
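To make the difference between the three methods concrete, here is a minimal sketch. The class name `MatchModesDemo` and the digit pattern are illustrative and not part of the original example. It also shows why calling `find()` right after a successful `matches()` on the same Matcher finds nothing unless the matcher is reset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchModesDemo {
    // Illustrative pattern: one or more digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** True only if the whole input consists of digits. */
    static boolean wholeInput(String s) {
        return DIGITS.matcher(s).matches();
    }

    /** True if the input starts with digits. */
    static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    /** True if digits occur anywhere in the input. */
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    /** Calls matches() and then find() on the same Matcher, optionally resetting in between. */
    static boolean findAfterMatches(String s, boolean resetFirst) {
        Matcher m = DIGITS.matcher(s);
        m.matches();
        if (resetFirst) {
            m.reset();
        }
        return m.find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "123abc", "abc123", "abc"}) {
            System.out.printf("%-7s matches=%b lookingAt=%b find=%b%n",
                s, wholeInput(s), prefix(s), anywhere(s));
        }
    }
}
```

Creating a fresh Matcher per check, as the helper methods do, sidesteps the leftover-state problem entirely.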

Convenience methods

JAVA

// String.matches(): compiles the Pattern on every call, so it is inefficient for repeated use
boolean valid = "test@email.com".matches("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

// String.replaceAll()
String cleaned = "Hello   World".replaceAll("\\s+", " ");
// "Hello World"

// String.split(): a negative limit keeps empty strings, including trailing ones
String[] parts = "a,b,,c".split(",", -1);
// ["a", "b", "", "c"]
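The limit argument of `split` only changes the result when the input ends with empty fields. The sketch below contrasts the two calls; the class name `SplitLimitDemo` and the sample input are assumptions made for illustration.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    /** Default split: trailing empty strings are dropped. */
    static String[] defaultSplit(String csv) {
        return csv.split(",");
    }

    /** Negative limit: every field is kept, including trailing empty ones. */
    static String[] keepAllSplit(String csv) {
        return csv.split(",", -1);
    }

    public static void main(String[] args) {
        String csv = "a,b,,c,,";
        System.out.println(Arrays.toString(defaultSplit(csv))); // 4 elements
        System.out.println(Arrays.toString(keepAllSplit(csv))); // 6 elements
    }
}
```

When parsing fixed-width records such as CSV rows, the negative limit is usually the safer choice because the column count stays stable.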

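Common regex syntax

Character classes

Pattern	Meaning
`.`	Any character except a line terminator
`\d`	A digit `[0-9]`
`\D`	A non-digit
`\w`	A word character `[a-zA-Z0-9_]`
`\W`	A non-word character
`\s`	A whitespace character
`\S`	A non-whitespace character
`[abc]`	One of a, b, or c
`[^abc]`	Any character other than a, b, or c
`[a-z]`	Any character from a through z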

Quantifiers

Pattern	Meaning
`*`	Zero or more times
`+`	One or more times
`?`	Zero or one time
`{n}`	Exactly n times
`{n,}`	At least n times
`{n,m}`	At least n and at most m times

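Anchors

Pattern	Meaning
`^`	Start of the string (or of each line with MULTILINE)
`$`	End of the string (or of each line with MULTILINE)
`\b`	Word boundary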
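The tables above can be combined as in the following sketch. The "order code" format (two capital letters followed by three to five digits) and the class name `SyntaxTablesDemo` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SyntaxTablesDemo {
    // Character class + quantifiers + anchors: exactly two capitals, then 3 to 5 digits.
    static final Pattern ORDER_CODE = Pattern.compile("^[A-Z]{2}\\d{3,5}$");

    // Word boundaries restrict the match to the standalone word "cat".
    static final Pattern WHOLE_WORD_CAT = Pattern.compile("\\bcat\\b");

    // No boundaries: "cat" inside longer words matches as well.
    static final Pattern ANY_CAT = Pattern.compile("cat");

    static boolean isOrderCode(String s) {
        return ORDER_CODE.matcher(s).find();
    }

    /** Counts non-overlapping matches of the pattern in the text. */
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat category cat";
        System.out.println(count(ANY_CAT, text));        // counts every "cat" substring
        System.out.println(count(WHOLE_WORD_CAT, text)); // counts standalone words only
    }
}
```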

Capture groups

Numbered groups

JAVA

Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
Matcher m = datePattern.matcher("2026-03-19");

if (m.matches()) {
    String full = m.group(0);  // "2026-03-19" (the entire match)
    String year = m.group(1);  // "2026"
    String month = m.group(2); // "03"
    String day = m.group(3);   // "19"
}

Named groups

JAVA

Pattern pattern = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
Matcher m = pattern.matcher("2026-03-19");

if (m.matches()) {
    String year = m.group("year");   // "2026"
    String month = m.group("month"); // "03"
    String day = m.group("day");     // "19"
}
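Named groups can also be referenced from a replacement string with the `${name}` syntax. The sketch below reorders an ISO date into day-first form; the class name `DateReformatDemo` and the output format are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class DateReformatDemo {
    private static final Pattern ISO_DATE = Pattern.compile(
        "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    /** Rewrites every yyyy-MM-dd date in the text as dd/MM/yyyy. */
    static String toDayFirst(String text) {
        // ${name} in the replacement refers to the named group of the same name.
        return ISO_DATE.matcher(text).replaceAll("${day}/${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("Released on 2026-03-19."));
    }
}
```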

Non-capturing groups

To group part of a pattern without capturing it, use (?:...).

JAVA

// Capturing group: group(1) holds "http" or "https"
Pattern p1 = Pattern.compile("(https?)://(.+)");

// Non-capturing group: group(1) holds everything after "://" (host and path)
Pattern p2 = Pattern.compile("(?:https?)://(.+)");
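The effect on group numbering can be checked directly with `Matcher.groupCount()`. This is a small sketch; the class name `GroupCountDemo` and the sample URL are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCountDemo {
    static final Pattern CAPTURING = Pattern.compile("(https?)://(.+)");
    static final Pattern NON_CAPTURING = Pattern.compile("(?:https?)://(.+)");

    /** Number of capturing groups in the pattern (group 0 is not counted). */
    static int groupCount(Pattern pattern) {
        return pattern.matcher("").groupCount();
    }

    /** Returns group(1) if the whole input matches, otherwise null. */
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "https://example.com/docs";
        System.out.println(groupCount(CAPTURING) + " -> " + firstGroup(CAPTURING, url));
        System.out.println(groupCount(NON_CAPTURING) + " -> " + firstGroup(NON_CAPTURING, url));
    }
}
```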

Backreferences

A captured group can be referenced again later in the same pattern.

JAVA

// Find consecutive duplicate words (e.g. "the the")
Pattern duplicateWord = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
Matcher m = duplicateWord.matcher("This is is a test test.");

while (m.find()) {
    System.out.println("Duplicate: " + m.group()); // "is is", "test test"
}

\\1 (the regex \1 written as a Java string literal) refers to the text captured by the first group.
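A backreference in the pattern pairs naturally with a group reference (`$1`) in the replacement. The following sketch collapses runs of repeated words into one; the class name `DuplicateWordFixer` and the extended pattern are assumptions built on the example above, not code from the original text.

```java
import java.util.regex.Pattern;

final class DuplicateWordFixer {
    // A word, followed by one or more whitespace-separated repeats of that same word.
    private static final Pattern REPEATED_WORD =
        Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+");

    /** Replaces each run of repeated words with a single occurrence. */
    static String collapse(String text) {
        // $1 in the replacement is the text captured by group 1.
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This is is a test test test."));
    }
}
```

The trailing `\b` keeps the pattern from treating "the theme" as a repeat of "the".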

Greedy vs. lazy vs. possessive matching

Greedy (the default)

JAVA

String html = "<b>bold</b> and <i>italic</i>";
Pattern greedy = Pattern.compile("<.+>");
// Match: "<b>bold</b> and <i>italic</i>"
// Matches as much as possible

Lazy (reluctant)

JAVA

Pattern lazy = Pattern.compile("<.+?>");
// Matches: "<b>", "</b>", "<i>", "</i>"
// Matches as little as possible

Possessive

JAVA

Pattern possessive = Pattern.compile("<.++>");
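// No match: .++ consumes the rest of the input and never gives back the final ">"
// Skipping backtracking is faster, but matches that a greedy quantifier would find can be lost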
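The three modes can be compared side by side on the same input. In this sketch the helper `findAll` and the class name `QuantifierModesDemo` are illustrative; the last call adds a variant not shown above, a possessive quantifier over a negated character class, which does match because `[^>]` can never consume the closing bracket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierModesDemo {
    /** Collects every match of the regex in the input (compiled per call for demo brevity). */
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println(findAll("<.+>", html));      // greedy: one match spanning the whole string
        System.out.println(findAll("<.+?>", html));     // lazy: each tag separately
        System.out.println(findAll("<.++>", html));     // possessive over '.': no match
        System.out.println(findAll("<[^>]++>", html));  // possessive over [^>]: each tag separately
    }
}
```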

Practical examples

Log parsing

JAVA

private static final Pattern LOG_PATTERN = Pattern.compile(
    "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})" +
    "\\s+(?<level>\\w+)" +
    "\\s+\\[(?<thread>[^]]+)]" +
    "\\s+(?<logger>\\S+)" +
    "\\s+-\\s+(?<message>.+)"
);

public record LogEntry(String timestamp, String level, String thread,
                       String logger, String message) {}

public static LogEntry parseLog(String line) {
    Matcher m = LOG_PATTERN.matcher(line);
    if (!m.matches()) return null;
    return new LogEntry(
        m.group("timestamp"), m.group("level"),
        m.group("thread"), m.group("logger"), m.group("message")
    );
}
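Real log files usually contain lines that do not fit the pattern, such as stack-trace continuation lines. The sketch below reuses the same pattern (with the closing bracket escaped for readability) to count entries per level and skip non-matching lines. The class name `LogLevelCounter` and the sample lines are assumptions, not taken from a real log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLevelCounter {
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})" +
        "\\s+(?<level>\\w+)" +
        "\\s+\\[(?<thread>[^\\]]+)\\]" +
        "\\s+(?<logger>\\S+)" +
        "\\s+-\\s+(?<message>.+)"
    );

    /** Counts log entries per level; lines that do not match the format are ignored. */
    static Map<String, Integer> countByLevel(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOG_PATTERN.matcher(line); // a fresh Matcher per line
            if (!m.matches()) {
                continue; // e.g. a stack-trace continuation line
            }
            counts.merge(m.group("level"), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2026-03-19 10:15:30.123 INFO [main] com.example.App - Server started",
            "2026-03-19 10:15:31.456 ERROR [worker-1] com.example.Job - Job failed",
            "\tat com.example.Job.run(Job.java:42)");
        System.out.println(countByLevel(lines));
    }
}
```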

Extracting IP addresses

JAVA

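// Note: \d{1,3} also accepts out-of-range octets such as 999; check the range separately if it matters
private static final Pattern IP_PATTERN = Pattern.compile(
    "\\b(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\b");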

public static List<String> extractIPs(String text) {
    List<String> ips = new ArrayList<>();
    Matcher m = IP_PATTERN.matcher(text);
    while (m.find()) {
        ips.add(m.group(1));
    }
    return ips;
}
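The pattern above accepts any one-to-three-digit octet, so a string like 999.1.1.1 would be extracted too. One way to tighten this, sketched here under the assumption that only syntactically valid IPv4 addresses are wanted, is to capture each octet and check its numeric range in code. The class name `Ipv4Extractor` is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Ipv4Extractor {
    // Same shape as the pattern above, but with one capture group per octet.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\b");

    /** Returns dotted quads whose four octets are all in the range 0..255. */
    static List<String> extractValid(String text) {
        List<String> ips = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            boolean inRange = true;
            for (int g = 1; g <= 4; g++) {
                if (Integer.parseInt(m.group(g)) > 255) {
                    inRange = false;
                    break;
                }
            }
            if (inRange) {
                ips.add(m.group());
            }
        }
        return ips;
    }

    public static void main(String[] args) {
        System.out.println(extractValid("from 192.168.0.1 to 999.1.1.1 via 10.0.0.256 and 8.8.8.8"));
    }
}
```

Doing the range check in code keeps the regex short; encoding 0..255 in the pattern itself is possible but much harder to read.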

String replacement

JAVA

// Convert camelCase to snake_case (Matcher.replaceAll with a function requires Java 9+)
private static final Pattern CAMEL_PATTERN =
    Pattern.compile("([a-z])([A-Z])");

public static String toSnakeCase(String camel) {
    return CAMEL_PATTERN.matcher(camel)
        .replaceAll(mr -> mr.group(1) + "_" + mr.group(2).toLowerCase());
}
// "getUserName" → "get_user_name"
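The `([a-z])([A-Z])` version only splits where a lowercase letter meets an uppercase one, so a run of capitals stays glued together: "parseHTTPResponse" becomes "parse_hTTPResponse". A possible variant, sketched here with lookaround assertions (covered in the next section), inserts underscores at zero-width boundaries and lowercases afterwards. The class name `SnakeCaseDemo` and the acronym rule are assumptions about the desired output.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class SnakeCaseDemo {
    // Zero-width boundaries: lowercase/digit before an uppercase letter,
    // or an uppercase letter before an uppercase+lowercase pair (end of an acronym).
    private static final Pattern WORD_BOUNDARY = Pattern.compile(
        "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    static String toSnakeCase(String camel) {
        return WORD_BOUNDARY.matcher(camel)
            .replaceAll("_")
            .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toSnakeCase("getUserName"));
        System.out.println(toSnakeCase("parseHTTPResponse"));
    }
}
```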

Lookahead and lookbehind

These are zero-width assertions: they check what follows or precedes the current position without including it in the match.

JAVA

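// Positive lookahead: digits followed by "원" (won, the Korean currency unit)
Pattern price = Pattern.compile("\\d+(?=원)");
// "1000원" → matches "1000" (원 is not part of the match)

// Negative lookahead: digits not followed by "원"
// Caution: on "1000원" this still matches "100" via backtracking; \\d++(?!원) avoids that
Pattern notPrice = Pattern.compile("\\d+(?!원)");

// Positive lookbehind: digits preceded by "$"
Pattern dollar = Pattern.compile("(?<=\\$)\\d+");
// "$100" → matches "100"

// Negative lookbehind: digits not preceded by "$"
// Caution: on "$100" this still matches "00"; (?<![\\$\\d])\\d+ avoids that
Pattern notDollar = Pattern.compile("(?<!\\$)\\d+");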
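Negative lookaround combined with a backtracking quantifier can match a shorter piece of the very token it was meant to exclude. The sketch below contrasts the plain negative forms with stricter variants. The class name `LookaroundDemo`, the constant names, and the strict patterns are illustrative additions, and the won character is written as a Unicode escape only to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // The Korean won character (U+C6D0), written as an escape to keep this file ASCII-only.
    static final String WON = "\uC6D0";

    static final Pattern PRICE = Pattern.compile("\\d+(?=" + WON + ")");
    // Backtracks to a shorter digit run, so part of a price still matches.
    static final Pattern NOT_PRICE_NAIVE = Pattern.compile("\\d+(?!" + WON + ")");
    // Possessive quantifier: the digit run is never shortened.
    static final Pattern NOT_PRICE_STRICT = Pattern.compile("\\d++(?!" + WON + ")");
    // Can start in the middle of a number that follows "$".
    static final Pattern NOT_DOLLAR_NAIVE = Pattern.compile("(?<!\\$)\\d+");
    // Also forbids a preceding digit, so the match must start at the number's first digit.
    static final Pattern NOT_DOLLAR_STRICT = Pattern.compile("(?<![\\$\\d])\\d+");

    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll(NOT_PRICE_NAIVE, "1000" + WON));   // a partial number
        System.out.println(findAll(NOT_PRICE_STRICT, "1000" + WON));  // nothing
        System.out.println(findAll(NOT_DOLLAR_NAIVE, "$100"));        // a partial number
        System.out.println(findAll(NOT_DOLLAR_STRICT, "$100 and 200"));
    }
}
```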

Caveats

Caching compiled patterns

String.matches() calls Pattern.compile() on every invocation. When the same regex is used repeatedly, cache the Pattern in a static final field.

JAVA

// Bad: compiles the pattern on every call
return input.matches("\\d{4}-\\d{2}-\\d{2}");

// Good: reuses the compiled pattern
private static final Pattern DATE_PATTERN =
    Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

return DATE_PATTERN.matcher(input).matches();

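ReDoS (catastrophic backtracking)

With nested quantifiers, the amount of backtracking can grow exponentially with the input length, and a single request can tie up a server thread long enough to stall the service.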

JAVA

// Dangerous: matching (a+)+$ against "aaaaaaaaaaaaaaaaaX" can trigger catastrophic backtracking
Pattern dangerous = Pattern.compile("(a+)+$");

The following strategies help prevent it.

Avoid nested quantifiers: (a+)+ → a+
Use possessive quantifiers: a++ does not backtrack
Limit the input length before applying a regex to user input
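The three strategies above can be combined as in this minimal sketch. The length limit of 256, the class name `BoundedMatchDemo`, and the choice to reject oversized input by returning false are assumptions; an application might prefer to throw or to report a validation error instead.

```java
import java.util.regex.Pattern;

final class BoundedMatchDemo {
    // Hypothetical limit; choose a value that fits the real input domain.
    private static final int MAX_INPUT_LENGTH = 256;

    // Rewritten form of (a+)+$ : one possessive quantifier, no nesting.
    private static final Pattern TRAILING_AS = Pattern.compile("a++$");

    /** True if the input is within the length limit and ends with a run of 'a'. */
    static boolean endsWithRunOfA(String input) {
        if (input == null || input.length() > MAX_INPUT_LENGTH) {
            return false; // reject oversized input before the regex engine sees it
        }
        return TRAILING_AS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(endsWithRunOfA("bbaaa"));
        System.out.println(endsWithRunOfA("aaaaaaaaaaaaaaaaaX"));
    }
}
```

The rewritten pattern no longer has the exponential blow-up, and the length check bounds the remaining cost regardless of how the pattern evolves later.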

Useful flags

JAVA

Pattern.compile("hello", Pattern.CASE_INSENSITIVE); // ignore case
Pattern.compile("^line$", Pattern.MULTILINE);        // ^ and $ match at every line
Pattern.compile("hello . world", Pattern.DOTALL);    // . also matches line terminators
Pattern.compile(
    "\\d{4}  # year\n" +
    "-\\d{2} # month\n" +
    "-\\d{2} # day",
    Pattern.COMMENTS  // ignore whitespace and # comments inside the pattern
);
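The effect of each flag is easiest to see by running the same regex with and without it. In this sketch the class name `FlagsDemo`, the constant names, and the sample inputs are illustrative; it also shows the inline form `(?is)`, which is equivalent to passing CASE_INSENSITIVE and DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlagsDemo {
    static final Pattern HELLO_ANY_CASE = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
    static final Pattern LINE_DEFAULT = Pattern.compile("^line$");
    static final Pattern LINE_MULTILINE = Pattern.compile("^line$", Pattern.MULTILINE);
    static final Pattern DOT_DEFAULT = Pattern.compile("a.b");
    static final Pattern DOT_DOTALL = Pattern.compile("a.b", Pattern.DOTALL);
    // Inline flags: (?i) = CASE_INSENSITIVE, (?s) = DOTALL.
    static final Pattern INLINE_FLAGS = Pattern.compile("(?is)hello.world");

    /** Counts non-overlapping matches of the pattern in the text. */
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "line\nother\nline";
        System.out.println(count(LINE_DEFAULT, text));   // ^ and $ apply to the whole input
        System.out.println(count(LINE_MULTILINE, text)); // ^ and $ apply to each line
        System.out.println(DOT_DEFAULT.matcher("a\nb").matches());
        System.out.println(DOT_DOTALL.matcher("a\nb").matches());
    }
}
```

Several flags can be combined with the bitwise OR operator, for example `Pattern.CASE_INSENSITIVE | Pattern.MULTILINE`.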

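Summary

Item	Key point
Pattern	Immutable and thread-safe. Cache it in a `static final` field when it is reused
Matcher	Stateful and not thread-safe. Create a new one per use
Capture groups	Naming them with `(?<name>...)` improves readability
Greedy vs. lazy	`.+` (longest match) vs `.+?` (shortest match). Greedy is the default
Possessive matching	`.++` never backtracks. Useful for preventing ReDoS
ReDoS	Nested quantifiers such as `(a+)+` can cause catastrophic backtracking. Be careful with user input
String.matches()	Compiles the pattern on every call, so it is inefficient for repeated use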