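Comparing Text Sentences

Purpose

Inside an Android app, I want to compare a sentence produced by speech-to-text against a prompt sentence.

The goal is to mark the specific words in the sentence that are wrong.

Comparing character by character

The simplest approach that comes to mind is to break the sentences into characters and compare them one at a time.

It could be written as follows.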

// Compare: check the spoken answer against the prompt, character by character.
fun compareText(quote: String, speechText: String): AnswerTexts {
    // The result array is sized to the longer of the two sentences.
    val maxLength = maxOf(quote.length, speechText.length)

    var correctScore = 0

    // Array that records the result for each character position.
    // AnswerText is a data class with default values, defined in TextModel.kt.
    val textArray = ArrayList<AnswerText>(maxLength)

    // Store the answer one character at a time, along with whether it
    // matches the prompt at the same position.
    for (i in 0 until maxLength) {
        // Positions past the end of the answer keep the '_' placeholder.
        val spoken = speechText.getOrNull(i) ?: '_'
        // A position is correct only when both sentences have the same character there.
        val correct = i < speechText.length && quote.getOrNull(i) == spoken
        if (correct) correctScore++
        textArray.add(AnswerText(text = spoken, correctness = correct, index = i))
    }

    // Avoid dividing by zero when both sentences are empty.
    val percentage = if (maxLength == 0) 0f else correctScore.toFloat() / maxLength

    return AnswerTexts(textArray, correctScore, percentage)
}

// One character of the spoken answer
data class AnswerText(
    var text: Char = '_',
    var correctness: Boolean = false,
    var index: Int = 0
)

// Result returned by the comparison
data class AnswerTexts(
    var answerText: ArrayList<AnswerText>,
    var simpleScore: Int = 0,
    var percentageScore: Float = 0.0f
)

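However, comparing this way turned out to have a fatal problem.

The problem is character position. The spacing of a sentence obtained from speech-to-text can differ from the spacing of the prompt, and the code above treats every following character as different as soon as a single position shifts.

If one word in the middle was spoken longer than in the prompt, every character after it moved back and changed position, so all of them were judged wrong.

Prompt	Answer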
Be wise in what is good and innocent in what is evil	Be wise in ~~the good and innocent in what is evil~~

As shown above, once a single character is wrong, every character after it is treated as wrong too.
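The shift is easy to reproduce in isolation. The sketch below uses a hypothetical helper, `positionalMatches` (not part of the code above), that counts the positions at which two strings hold the same character.

```kotlin
// Hypothetical helper for illustration: counts the positions at which
// both strings hold the same character. zip stops at the shorter string.
fun positionalMatches(a: String, b: String): Int =
    a.zip(b).count { (x, y) -> x == y }
```

For example, `positionalMatches("abcdef", "abXcdef")` returns 2: only `a` and `b` line up, although all six characters of the first string are present in the second.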

Java-diff-utils

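1. Introduction

In the Java ecosystem there is a library for comparing text called Java-diff-utils.

It is said to use the Myers diff algorithm and the HistogramDiff algorithm.

With it, only the part of the example sentence that is actually wrong gets picked out, as shown below.

Prompt	Answer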
Be wise in what is good and innocent in what is evil	Be wise in ~~the~~ good and innocent in what is evil
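To get a feel for why a diff survives a shifted position, here is a minimal word-level sketch based on the longest common subsequence (LCS). It is not the algorithm Java-diff-utils uses, and the function name `wrongWords` is made up for this illustration; it only shows the idea of aligning words before judging them.

```kotlin
// Hypothetical sketch: returns the words of `answer` that are not part of
// the longest common subsequence (LCS) it shares with `prompt`.
fun wrongWords(prompt: String, answer: String): List<String> {
    val p = prompt.split(" ").filter { it.isNotEmpty() }
    val a = answer.split(" ").filter { it.isNotEmpty() }

    // lcs[i][j] = length of the LCS of p[i..] and a[j..]
    val lcs = Array(p.size + 1) { IntArray(a.size + 1) }
    for (i in p.indices.reversed()) {
        for (j in a.indices.reversed()) {
            lcs[i][j] = if (p[i] == a[j]) {
                lcs[i + 1][j + 1] + 1
            } else {
                maxOf(lcs[i + 1][j], lcs[i][j + 1])
            }
        }
    }

    // Walk the table from the start; answer words that are skipped are the wrong ones.
    val wrong = mutableListOf<String>()
    var i = 0
    var j = 0
    while (i < p.size && j < a.size) {
        when {
            p[i] == a[j] -> { i++; j++ }              // aligned word
            lcs[i + 1][j] >= lcs[i][j + 1] -> i++     // prompt word missing from the answer
            else -> { wrong.add(a[j]); j++ }          // extra or substituted answer word
        }
    }
    while (j < a.size) wrong.add(a[j++])              // leftover answer words
    return wrong
}
```

Called with the prompt and answer from the table above, it returns a list containing only `the`.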

2. Installation

Add the following dependency to the module-level Gradle file¹⁾.

    // Java Diff - phrase difference compare util
    implementation("io.github.java-diff-utils:java-diff-utils:4.15")

3. Usage example

This is also shown on the Java-diff-utils page, but the library is used as follows.

@Composable
@Throws(DiffException::class)
fun TestGenerator_Second()  {
    val first = listOf("This is a test sentence. and this is very long sentence", "This is the second line.", "And here is the finish.")
    val second = listOf("This is a test for diffutils. and this is very long diffutils sentence", "This is the second line.")
 
    val generator = DiffRowGenerator.create()
        .showInlineDiffs(true)
        .inlineDiffByWord(true) //show the ~ ~ and ** ** around each different word instead of each letter
        .oldTag { f: Boolean? -> "~" } //introduce markdown style for strikethrough
        .newTag { f: Boolean? -> "**" } //introduce markdown style for bold
        .build()
 
    val rows: List<DiffRow> = generator.generateDiffRows(first, second)
 
    rows.forEach() {
//        Text(text = it.oldLine, color = Color.Yellow)
        Text(text = richtextFromMarkdown(it.oldLine, "~"), color = Color.Yellow)
        Text(text = richtextFromMarkdown(it.newLine, "**"), color = Color.White)
    }
}

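Note the argument order: generator.generateDiffRows(original, revised).

In the example above, the differing parts of the original (oldLine) are wrapped in tilde (~) markers, and the differing parts of the revised text (newLine) are wrapped in double-star (**) markers.

Rather than inserting markdown markers before and after the text, the comparison result would be easier to read if the text style itself could be changed, for example by drawing a line through the text or changing the font color.

4. Converting markdown into styled text

Styled text can be built with the AnnotatedString that Jetpack Compose provides.

Because the markers wrap only the parts that differ, it is enough to split the text on the marker and apply the style only to every second segment (the odd indices of the split list), as follows.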

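import androidx.compose.ui.graphics.Color
import androidx.compose.ui.text.AnnotatedString
import androidx.compose.ui.text.SpanStyle
import androidx.compose.ui.text.buildAnnotatedString
import androidx.compose.ui.text.style.TextDecoration
import androidx.compose.ui.text.withStyle

// Utility that converts marker-delimited text into styled (rich) text
fun richtextFromMarkdown(inputText: String, divider: String): AnnotatedString {
    val segments = inputText.split(divider)
    val oldLineStyle = SpanStyle(color = Color.Red)
    val newLineStyle = SpanStyle(color = Color.Blue, textDecoration = TextDecoration.LineThrough)
    // "~" marks oldLine text; any other divider is treated as newLine.
    val markedStyle = if (divider == "~") oldLineStyle else newLineStyle

    return buildAnnotatedString {
        segments.forEachIndexed { index, s ->
            if (index % 2 == 0) {
                append(s) // outside the markers: no style
            } else {
                withStyle(markedStyle) { append(s) } // between a pair of markers
            }
        }
    }
}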
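The index parity rule can be checked without Compose. The following sketch uses a hypothetical helper, `markedSegments`, that pairs each segment with a flag telling whether it sat between two markers. It assumes the sentence itself never contains the marker; if it did, the parity would be thrown off.

```kotlin
// Hypothetical helper: splits on the marker and flags the segments that
// were wrapped in it (odd indices of the split result).
fun markedSegments(line: String, marker: String): List<Pair<String, Boolean>> =
    line.split(marker).mapIndexed { index, segment -> segment to (index % 2 == 1) }
```

`markedSegments("Be wise in ~the~ good", "~")` yields `("Be wise in ", false)`, `("the", true)`, `(" good", false)`. A line that starts with a marker produces an empty first segment, so the wrapped text still lands on an odd index.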

5. Scoring the difference between two texts

There is a measurement algorithm called 'Levenshtein distance'.

https://en.wikipedia.org/wiki/Levenshtein_distance
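A minimal sketch of how this could be turned into a score: the Levenshtein distance is the smallest number of single-character insertions, deletions and substitutions that turn one string into the other. The `similarity` function and its scaling to a value between 0 and 1 are my own assumption, not something prescribed by the algorithm.

```kotlin
// Levenshtein distance computed with two rolling rows of the DP table.
fun levenshtein(a: String, b: String): Int {
    // prev[j] = distance between the first i-1 characters of a and the first j characters of b
    var prev = IntArray(b.length + 1) { it }
    var curr = IntArray(b.length + 1)
    for (i in 1..a.length) {
        curr[0] = i // deleting the first i characters of a
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution or match
            )
        }
        val tmp = prev
        prev = curr
        curr = tmp
    }
    return prev[b.length]
}

// Hypothetical score: 1.0 for identical strings, 0.0 for completely different ones.
fun similarity(a: String, b: String): Float {
    val maxLength = maxOf(a.length, b.length)
    return if (maxLength == 0) 1f else 1f - levenshtein(a, b).toFloat() / maxLength
}
```

Unlike the position-by-position score earlier, one inserted or dropped word costs only as many edits as it has characters, so the rest of the sentence still counts as correct.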

¹⁾ build.gradle.kts (Module :app)
