Skip to content

Conversation

@pmachapman
Copy link
Collaborator

@pmachapman pmachapman commented Feb 4, 2026

Fixes: #372

Example USFM:

\id MAN
\c 1.
\v v1 Verse 1

Example warnings:

warnings: [
  "USFM versification error in project TEA, expected verse “”, actual verse “MAN 1.”, mismatch type InvalidChapterNumber (parallel corpus 6983d7b5415e5f0ee3585264, monolingual corpus 6983d7b5415e5f0ee3585260)",
  "USFM versification error in project TEA, expected verse “”, actual verse “MAN -1:v1”, mismatch type InvalidVerseNumber (parallel corpus 6983d7b5415e5f0ee3585264, monolingual corpus 6983d7b5415e5f0ee3585260)",
]

This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.79%. Comparing base (611fbf2) to head (5ca0162).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   72.73%   72.79%   +0.06%     
==========================================
  Files         423      423              
  Lines       36029    36089      +60     
  Branches     4969     4974       +5     
==========================================
+ Hits        26205    26271      +66     
+ Misses       8733     8727       -6     
  Partials     1091     1091              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pmachapman pmachapman force-pushed the invalid_verse_or_chapter branch from 39cf6ef to 5ca0162 Compare February 5, 2026 00:51
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?

Good idea. That requires a modification to Serval - see sillsdev/serval/#871

@pmachapman made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 reviewed 2 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

        }

        public UsfmVersificationError(

Rather than adding another constructor, is there way to just check the error type in CheckError() itself?


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

                    return true;
                }
                if (!_verseRef.Value.Valid)

I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

                    Type == UsfmVersificationErrorType.ExtraVerse
                    || Type == UsfmVersificationErrorType.InvalidChapterNumber
                    || Type == UsfmVersificationErrorType.InvalidVerseNumber

I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like \c 1^, that's pretty easy to find without an expected ref, but what about something like \c text: Is that possible? That would not be quite so easy to track down.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

            // See whether the chapter number is invalid
            VerseRef verseRef = state.VerseRef.Clone();

Why do we need to do this? Does the number get changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error in CheckError() - same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

                {
                    _errors.Add(versificationError);
                    verseInError = true;

What's the particular duplicate error situation you're trying to avoid?

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Rather than adding another constructor, is there way to just check the error type in CheckError() itself?

I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.

_verseRef.Value.Valid will return false, but as it is not clear just what the cause is (i.e. there is no comparable enum to UsfmVersificationErrorType returned by VerseRef), I deal with the invalid chapter/verse error just above this logic.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like \c 1^, that's pretty easy to find without an expected ref, but what about something like \c text: Is that possible? That would not be quite so easy to track down.

I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why do we need to do this? Does the number get changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error in CheckError() - same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.

I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What's the particular duplicate error situation you're trying to avoid?

A missing verse error

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 5 comments and resolved 4 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)

I see, and you couldn't easily add logic there to differentiate between the cases?

That's interesting. For verses, you can still access the bad verse value via vref.Verse (even though vref.VerseNum doesn't give you any clues), but not so with chapters? It's strange to me that they are handled differently.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

_verseRef.Value.Valid will return false, but as it is not clear just what the cause is (i.e. there is no comparable enum to UsfmVersificationErrorType returned by VerseRef), I deal with the invalid chapter/verse error just above this logic.

I see. So even if you grab the verse's ValidStatus enum, it doesn't clue you in? If you're getting vref.Valid == false, then ValidStatus must be set to something other than ValidStatus.Valid - I wonder what it is 🤔


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?

Well, that is true - that could be 2.0 😆. The USFM could skip chapters or verses, but if it is skipping, then that's the sort of thing we'd want to be catching anyways since this checking should all use the versification as the gold standard. I was just thinking that we could use some of the functions in ScrVers to see what the next expected reference is (e.g. something similar to what's done in ScrVers.FirstIncludedVerse() but only looking at subsequent verses). If it's impossible to have totally uninterpretable verses/chapters like \c text or \v word123word, then I'm fine with not having an expected ref, but if it is possible, I think we ought to include something like that since I believe we can make a reasonable guess and there would be no other way for the error to be actionable without running Ctrl+F in every USFM file.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).

I see. Makes sense! Like I said above, it'd odd to me that the bad chapter value can't be accessed like the verses can, but we'll make do!


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

A missing verse error

OK, yeah, I see. I was just wondering if it were worth allowing both errors to register since they're capturing different problems. And if a user got two warnings like Missing MAT 1:2 and Invalid verse MAT 1:schmutz, that may be helpful in locating/understanding the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The USFM parser should trap the Trace warnings from VerseRef

4 participants