incorrect extraction of deep text from a document with corrections #49

@kosloot

The text() extraction function fails to extract the correct text from a sentence whose last Word is a Correction when that sentence is followed by another sentence.
This came up in: LanguageMachines/foliautils#66

When the last Word is truly a Word, a space separator is added and everything is fine. In the case of a Correction, however, the space is omitted, gluing the text of the two sentences together.
Example (rather braindead, but it proves the point):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-06T12:10:53" command="ucto -X -L deu --textredundancy=full --id Walter bug.in bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-06T12:11:06" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="Walter.text">
    <p xml:id="Walter.p.1">
      <t>chat... Von</t>
      <s xml:id="Walter.p.1.s.1">
        <t>chat...</t>
        <w xml:id="Walter.p.1.s.1.w.1" class="WORD" processor="ucto.1" space="no">
          <t>chat</t>
        </w>
        <correction xml:id="Walter.p.1.s.1.correction.1">
          <new>
            <w xml:id="Walter.p.1.s.1.w.3.edit.1" processor="FoLiA-correct.1">
              <t>...</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="Walter.p.1.s.1.w.3" class="PUNCTUATION-MULTI" processor="ucto.1">
              <t>...</t>
            </w>
          </original>
        </correction>
      </s>
      <s xml:id="Walter.p.1.s.2">
        <t>Von</t>
        <w xml:id="Walter.p.1.s.2.w.1" class="WORD" processor="ucto.1">
          <t>Von</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

When parsing this file with folialint:

bug.xml failed: inconsistent text: node p(Walter.p.1) has a mismatch for the text in set:current
the element text ='chat... Von'
 the deeper text ='chat...Von'
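
The mismatch can be reproduced outside libfolia with a small sketch. This is not libfolia's actual code: it uses Python's standard `xml.etree.ElementTree` on a stripped-down copy of the sample above, and the helper names (`current_word`, `sentence_text`, `paragraph_text`) are hypothetical. It also assumes that a word without a `space` attribute is followed by a space. The only difference between the two modes is whether the separator decision looks at the Correction element itself or at the Word inside its `<new>` branch.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Stripped-down version of the paragraph from the report (ids and metadata omitted).
SAMPLE = """<p xmlns="http://ilk.uvt.nl/folia">
  <s>
    <w space="no"><t>chat</t></w>
    <correction>
      <new><w><t>...</t></w></new>
      <original auth="no"><w><t>...</t></w></original>
    </correction>
  </s>
  <s>
    <w><t>Von</t></w>
  </s>
</p>"""


def current_word(child):
    """Return the <w> that carries the current text for a sentence child, or None."""
    if child.tag == NS + "w":
        return child
    if child.tag == NS + "correction":
        # The corrected form lives under <new>; <original auth="no"> is skipped.
        new = child.find(NS + "new")
        return new.find(NS + "w") if new is not None else None
    return None


def sentence_text(s, look_inside_corrections):
    """Concatenate word texts, appending a separator after each spaced word."""
    parts = []
    for child in s:
        w = current_word(child)
        if w is None:
            continue  # e.g. the sentence's own <t> element
        parts.append(w.findtext(NS + "t"))
        # Faulty mode: the separator is decided on the direct child, so a
        # <correction> never gets one. Fixed mode: decide on the wrapped <w>.
        node = w if look_inside_corrections else child
        if node.tag == NS + "w" and node.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts)


def paragraph_text(p, look_inside_corrections):
    """Join the deep text of all sentences and drop the trailing separator."""
    sentences = p.findall(NS + "s")
    return "".join(sentence_text(s, look_inside_corrections) for s in sentences).rstrip()


p = ET.fromstring(SAMPLE)
print(repr(paragraph_text(p, False)))  # 'chat...Von'  (the reported deeper text)
print(repr(paragraph_text(p, True)))   # 'chat... Von' (matches the element text)
```

Under these assumptions the first call reproduces the glued `chat...Von`, and the second yields `chat... Von`, which agrees with the `<t>` stored on the paragraph.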
