Add -c option to split-sentences.perl #25

jelmervdl · 2021-02-17T10:07:45Z

Some documents contain extremely long lines of generated text (most often links to search page results) that take forever to parse with the regular expressions in split-sentences.perl. With the -c option, these lines can be ignored completely.
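As a rough illustration of the idea (not the code in this PR), the sketch below drops any line longer than a threshold before the expensive splitting regexes would run on it. The helper name `filter_long_lines` and the threshold of 1000 characters are hypothetical; the exact behaviour of -c is defined by the patch itself.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from split-sentences.perl:
# keep only the lines whose length is at most $max_chars.
# Longer lines are dropped whole, so no regex ever runs on them.
sub filter_long_lines {
    my ($max_chars, @lines) = @_;
    return grep { length($_) <= $max_chars } @lines;
}

# Example input: one ordinary line and one very long generated line.
my @input = ("A short line. Another sentence.", "x" x 10_000);
my @kept  = filter_long_lines(1000, @input);
print scalar(@kept), " of ", scalar(@input), " lines kept\n";
```

The trade-off is that a long line is discarded even if it contains sentences that would have split cleanly.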

jelmervdl · 2021-02-17T10:08:50Z

moses/ems/support/split-sentences.perl

-	$text = $text.$words[$i];
+	if (scalar(@words) > 0) {
+		$text = $text.$words[$i];
+	}


It apparently also contains a fix for warnings caused by blank (or whitespace-only) lines in the input.

kpu · 2021-02-17T20:45:08Z

Ideally we'd replace buffering then splitting with splitting on the fly. Then if there's something long and no split we throw it out. Here I'm a bit concerned we're throwing out stuff that would correctly split. I understand your immediate need though.
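To make the suggestion above concrete, here is a minimal sketch under stated assumptions: it walks the text once, emits a sentence at each boundary, and throws away only those segments that exceed a limit without reaching a boundary. The function name `split_on_the_fly`, the `$max_chars` parameter, and the naive boundary pattern are all hypothetical stand-ins for the real rules in split-sentences.perl, and a true on-the-fly version would read its input incrementally rather than take a whole string.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the script's actual logic.
# A "boundary" here is just ., ! or ? followed by whitespace or end of text.
sub split_on_the_fly {
    my ($text, $max_chars) = @_;
    my @sentences;
    my $start = 0;
    while ($text =~ /[.!?](?:\s+|\z)/g) {
        my $end     = pos($text);
        my $segment = substr($text, $start, $end - $start);
        $segment =~ s/\s+\z//;
        # Keep the segment only if it reached a boundary within the limit.
        push @sentences, $segment if length($segment) <= $max_chars;
        $start = $end;
    }
    # Trailing text without a terminator is kept under the same limit.
    my $rest = substr($text, $start);
    $rest =~ s/^\s+|\s+$//g;
    push @sentences, $rest if length($rest) && length($rest) <= $max_chars;
    return @sentences;
}

# A long run of junk sits between two ordinary sentences.
my $line = "Hello world. " . ("x" x 50) . ". Bye now.";
print "$_\n" for split_on_the_fly($line, 20);
```

In this sketch the two short sentences survive and only the overlong middle segment is dropped, whereas a whole-line cutoff would discard all three.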

Add -c option to split-sentences.perl

e968869
