A Look at How Hangul Is Organized in Unicode

Created

2022/10/31

Tags

Korean

Unicode

한글

유니코드


While building a Korean-English typing converter (github), I needed to study how Hangul is laid out in Unicode. This post records what I learned.

Hangul in Unicode falls broadly into three groups.

1. Hangul Jamo (ᄀ, ᄁ, ᄉ, ᄊ, ᅡ, ᅤ, ᅪ, ᅰ, ᆨ, ᆸ, ᆹ, ᆭ)

2. Hangul Compatibility Jamo (ㄱ, ㄴ, ㄷ, ㄹ, ㅏ, ㅑ, ㅓ, ㅕ, ㄹ, ㅁ, ㄺ, ㅄ)

3. Hangul Syllables (가, 나, 밤, 힣)
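As a rough illustration of the three groups, the sketch below classifies a character by its code point. The function name `hangul_block` and the table `HANGUL_BLOCKS` are made up for this example; the ranges are the block boundaries U+1100-U+11FF, U+3130-U+318F, and U+AC00-U+D7AF.

```python
from typing import Optional

# Block ranges (inclusive) for the three groups listed above.
HANGUL_BLOCKS = {
    "Hangul Jamo": (0x1100, 0x11FF),
    "Hangul Compatibility Jamo": (0x3130, 0x318F),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}


def hangul_block(ch: str) -> Optional[str]:
    """Return the name of the Hangul block containing ch, or None."""
    code_point = ord(ch)
    for name, (first, last) in HANGUL_BLOCKS.items():
        if first <= code_point <= last:
            return name
    return None


# One sample character from each group, written as escapes so the
# code points are unambiguous.
for sample in ("\u1100", "\u3131", "\uAC00", "A"):
    print(f"U+{ord(sample):04X}", hangul_block(sample))
```

Other Hangul-related blocks exist (for example the extended jamo blocks), so a real converter may need a longer table than this one.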

Hangul Jamo

Hangul Jamo (Unicode block) - Wikipedia

Hangul Jamo (Korean: 한글 자모, Korean pronunciation: [ˈha̠ːnɡɯɭ t͡ɕa̠mo̞]) is a Unicode block containing positional (choseong, jungseong, and jongseong) forms of the Hangul consonant and vowel clusters. They can be used to dynamically compose syllables that are not available as precomposed Hangul syllables in Unicode, specifically syllables that are not used in standard modern Korean.

https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)

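The Hangul Jamo block holds positional jamo: initial consonants (choseong), medial vowels (jungseong), and final consonants (jongseong).

For modern Korean, the initials occupy U+1100-U+1112, the medials U+1161-U+1175, and the finals U+11A8-U+11C2.

Even for the same letter giyeok, 0x1100 (ᄀ, initial) and 0x11A8 (ᆨ, final) are different characters.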
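The difference between the two giyeok code points can be checked with Python's standard `unicodedata` module. This is only a small sketch; the constant names are chosen for this example, and the code points are written as escapes so the initial and final forms cannot be confused.

```python
import unicodedata

CHOSEONG_GIYEOK = "\u1100"   # initial-position giyeok
JONGSEONG_GIYEOK = "\u11A8"  # final-position giyeok

# The two characters look alike but carry different names and code points.
for ch in (CHOSEONG_GIYEOK, JONGSEONG_GIYEOK):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1100 HANGUL CHOSEONG KIYEOK
# U+11A8 HANGUL JONGSEONG KIYEOK

# NFC normalization composes an initial + medial + final jamo sequence
# into the matching precomposed syllable when one exists.
composed = unicodedata.normalize("NFC", "\u1100\u1161\u11B7")
print(composed, f"U+{ord(composed):04X}")
```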

Hangul Compatibility Jamo

Hangul Compatibility Jamo - Wikipedia

Hangul Compatibility Jamo is a Unicode block containing Hangul characters for compatibility with the South Korean national standard KS X 1001 (formerly KS C 5601). Its block name in Unicode 1.0 was Hangul Elements.

https://en.wikipedia.org/wiki/Hangul_Compatibility_Jamo

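The Hangul Compatibility Jamo block exists for compatibility with KS X 1001.

KS X 1001 (formerly KS C 5601):

- is a code for information interchange covering Hangul and Hanja
- underlies many legacy Korean encodings, such as EUC-KR and Microsoft's Unified Hangul Code (UHC)
- includes Hangul syllables, CJK ideographs (Hanja), Greek, Cyrillic, Japanese (hiragana and katakana), and more

The block consists of consonants and vowels, with no separate code points for initial and final position.

For example, ㄱ (0x3131) + ㅏ (0x314F) + ㅁ (0x3141) = 감. This is the same principle taught in Korean language class.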
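The ㄱ + ㅏ + ㅁ = 감 example can be sketched in code. The version below is one possible approach, not necessarily how the converter mentioned above does it: it looks up each compatibility jamo in an ordered table and applies the standard syllable arithmetic (21 medials, 28 final slots including "no final"). The names `INITIALS`, `MEDIALS`, `FINALS`, and `compose` are made up for this example.

```python
# Compatibility jamo in the order used by the Hangul Syllables block.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initials
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"     # 21 medials
# 28 final slots: index 0 means "no final consonant".
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_BASE = 0xAC00  # code point of the first syllable


def compose(initial: str, medial: str, final: str = "") -> str:
    """Combine compatibility jamo into one precomposed syllable.

    Raises ValueError if a jamo is not valid in its position.
    """
    l_index = INITIALS.index(initial)
    v_index = MEDIALS.index(medial)
    t_index = FINALS.index(final)
    offset = (l_index * len(MEDIALS) + v_index) * len(FINALS) + t_index
    return chr(SYLLABLE_BASE + offset)


print(compose("ㄱ", "ㅏ", "ㅁ"))  # 감
print(compose("ㄱ", "ㅏ"))        # 가
```

A lookup table is needed because the compatibility jamo are not stored in syllable order, so their code points cannot be used in the arithmetic directly.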

Hangul Syllables

Hangul Syllables - Wikipedia

Hangul Syllables is a Unicode block containing precomposed Hangul syllable blocks for modern Korean. The syllables can be directly mapped by algorithm to sequences of two or three characters in the Hangul Jamo Unicode block: one of U+1100-U+1112: the 19 modern Hangul leading consonant jamos; one of U+1161-U+1175: the 21 modern Hangul vowel jamos; none, or one of U+11A8-U+11C2: the 27 modern Hangul trailing consonant jamos.

https://en.wikipedia.org/wiki/Hangul_Syllables

The Hangul Syllables block maps each syllable of modern Korean to its own code point.
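Because the syllables are laid out in a regular order, a syllable can be split back into Hangul Jamo with plain arithmetic. The following is a minimal sketch of that mapping, assuming the 19 initials, 21 medials, and 27 finals (plus "no final") described in the Wikipedia excerpt above; the function name `decompose` and the constants are chosen for this example.

```python
S_BASE = 0xAC00  # first precomposed syllable
L_BASE = 0x1100  # first initial (choseong) jamo
V_BASE = 0x1161  # first medial (jungseong) jamo
T_BASE = 0x11A7  # one before the first final (jongseong) jamo
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no final"


def decompose(syllable: str) -> str:
    """Split one precomposed syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    l_index, rest = divmod(index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:  # index 0 means the syllable has no final consonant
        jamo += chr(T_BASE + t_index)
    return jamo


# Print the code points of each jamo in the decomposed syllable U+AC10.
print([f"U+{ord(j):04X}" for j in decompose("\uAC10")])
```

For a single precomposed syllable this should agree with `unicodedata.normalize("NFD", ...)` from the standard library, which is usually the simpler choice in real code.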

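In this block, the first syllable is 가 (U+AC00) and the last is 힣 (U+D7A3).

That is why a regular expression can find Hangul with /[가-힣]/. Note that this range matches only precomposed syllables, not standalone jamo such as ㄱ or ㅏ.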
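A quick check of the 가-힣 range with Python's `re` module is shown below. The sample text and the name `HANGUL_SYLLABLES` are made up for this example.

```python
import re

# Character class covering the precomposed syllables from U+AC00 to U+D7A3.
HANGUL_SYLLABLES = re.compile(r"[가-힣]+")

text = "Hello 한글 ㅋㅋ Unicode 유니코드!"

# Standalone compatibility jamo such as ㅋ fall outside the range,
# so only the two full words are returned.
print(HANGUL_SYLLABLES.findall(text))  # ['한글', '유니코드']
```

To also match standalone jamo, the character class would need the jamo ranges added, for example `[가-힣ㄱ-ㅎㅏ-ㅣ]`.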

Reference

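Some notes on Unicode....

Unicode is a standard that covers nearly all characters. Within it, Hangul consists of 256 Hangul jamo codes, 11184 Hangul syllable codes, and other extension codes. That is, the jamo are sets for representing the initial, medial, and final positions, and the syllable codes define the characters completed by combining them. Because these characters are arranged in a defined order, the codes and their ordering can be used to compose or decompose the codes you need. (For the Hangul jamo, additional characters follow after the order above.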

https://liveupdate.tistory.com/149

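How to fix Hangul being split into jamo (ㅎㅏㄴㄱㅡㄹ -> 한글)

How to handle the 'Unicode normalization forms' that differ by operating system. While working, I ran into Hangul splitting into separate consonants and vowels when downloading files in a 'Windows' environment. Encoding and decoding did not solve it, and the cause was hard to find, so I lost a lot of time. I will write up what I newly learned and how I solved it. Overview: What is a Unicode normalization form?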

https://egg-programmer.tistory.com/293