-
Notifications
You must be signed in to change notification settings - Fork 21
Implement ChapmanMeanLengthTrue #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a corrected version of the Chapman Mean Length algorithm by introducing the ChapmanMeanLengthTrue class, which uses Longest Common Subsequence (LCS) for calculating string similarity instead of a simple length-based heuristic.
Key Changes:
- Implements
ChapmanMeanLengthTruemetric using LCS-based similarity calculation - Adds comprehensive unit tests for the new metric
- Updates documentation and fixes spelling errors in existing Chapman classes
- Sets C# language version to 12 to enable modern C# features
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
src/SimMetrics.Net/Metric/ChapmanMeanLengthTrue.cs |
New metric class implementing LCS-based Chapman Mean Length algorithm |
tests/SimMetrics.Net.Tests/SimilarityClasses/LengthBased/ChapmanMeanLengthTrueTests.cs |
Unit tests for the new ChapmanMeanLengthTrue metric |
src/SimMetrics.Net/SimMetrics.Net.csproj |
Sets LangVersion to 12 to enable C# 12 features |
src/SimMetrics.Net/Metric/ChapmanMeanLength.cs |
Adds documentation clarifying this is a heuristic, fixes spelling in description |
src/SimMetrics.Net/Metric/ChapmanLengthDeviation.cs |
Fixes spelling error in description |
tests/SimMetrics.Net.Tests/SimMetrics.Net.Tests.csproj |
Removes redundant AssemblyName, PackageId properties and commented code |
tests/SimMetrics.Net.Tests/AssertUtil.cs |
Refactors to use file-scoped namespace |
src/SimMetrics.Net/Extensions.cs |
Adds braces around single-line if statement |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| namespace SimMetrics.Net.Metric | ||
| { | ||
| /// <summary> | ||
| /// This method only the lengths of the two words, not at the actual characters. |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar issue: The sentence is missing a verb. Should be "This method looks at only the lengths of the two words" or "This method uses only the lengths of the two words".
| /// This method only the lengths of the two words, not at the actual characters. | |
| /// This method uses only the lengths of the two words, not the actual characters. |
| { | ||
| /// <summary> | ||
| /// This method only the lengths of the two words, not at the actual characters. | ||
| /// It uses some cutoff(ChapmanMeanLengthMaxString) and a polynomial scaling(1 - num2^4) to produce a score. |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after opening parenthesis in both "cutoff(ChapmanMeanLengthMaxString)" and "scaling(1 - num2^4)". Should be "cutoff (ChapmanMeanLengthMaxString)" and "scaling (1 - num2^4)".
| /// It uses some cutoff(ChapmanMeanLengthMaxString) and a polynomial scaling(1 - num2^4) to produce a score. | |
| /// It uses some cutoff (ChapmanMeanLengthMaxString) and a polynomial scaling (1 - num2^4) to produce a score. |
| using SimMetrics.Net.Metric; | ||
| using Xunit; | ||
|
|
||
| namespace SimMetrics.Net.Tests.SimilarityClasses.LengthBased; |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test file uses file-scoped namespace declaration (with semicolon), while all other test files in the SimilarityClasses directory use block-scoped namespace declarations (with braces). Consider using block-scoped namespaces for consistency with the existing codebase.
| namespace SimMetrics.Net.Metric; | ||
|
|
||
| /// <summary> | ||
| /// Correct Chapman Mean Length implementation. | ||
| /// </summary> | ||
| public sealed class ChapmanMeanLengthTrue : AbstractStringMetric | ||
| { | ||
| private const double DefaultMismatchScore = 0.0; | ||
| private const double DefaultPerfectScore = 1.0; | ||
|
|
||
| public override double GetSimilarity(string firstWord, string secondWord) | ||
| { | ||
| if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord)) | ||
| { | ||
| return DefaultMismatchScore; | ||
| } | ||
|
|
||
| // Compute LCS length | ||
| var lcs = LongestCommonSubsequence(firstWord, secondWord); | ||
|
|
||
| // Chapman Mean Length formula | ||
| var score = 2.0 * lcs / (firstWord.Length + secondWord.Length); | ||
|
|
||
| return score switch | ||
| { | ||
| < DefaultMismatchScore => DefaultMismatchScore, | ||
| > DefaultPerfectScore => DefaultPerfectScore, | ||
| _ => score | ||
| }; | ||
| } | ||
|
|
||
| public override string GetSimilarityExplained(string firstWord, string secondWord) | ||
| { | ||
| throw new NotImplementedException(); | ||
| } | ||
|
|
||
| public override double GetSimilarityTimingEstimated(string firstWord, string secondWord) | ||
| { | ||
| return 0.0; | ||
| } | ||
|
|
||
| public override double GetUnnormalisedSimilarity(string firstWord, string secondWord) | ||
| { | ||
| return GetSimilarity(firstWord, secondWord); | ||
| } | ||
|
|
||
| public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm"; | ||
|
|
||
| public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue); | ||
|
|
||
| private static int LongestCommonSubsequence(string s1, string s2) | ||
| { | ||
| int m = s1.Length, n = s2.Length; | ||
| int[,] dp = new int[m + 1, n + 1]; | ||
|
|
||
| for (var i = 0; i < m; i++) | ||
| { | ||
| for (var j = 0; j < n; j++) | ||
| { | ||
| if (s1[i] == s2[j]) | ||
| { | ||
| dp[i + 1, j + 1] = dp[i, j] + 1; | ||
| } | ||
| else | ||
| { | ||
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return dp[m, n]; |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file uses file-scoped namespace declaration (with semicolon), while all other files in the Metric directory use block-scoped namespace declarations (with braces). Consider using block-scoped namespaces for consistency with the existing codebase.
| namespace SimMetrics.Net.Metric; | |
| /// <summary> | |
| /// Correct Chapman Mean Length implementation. | |
| /// </summary> | |
| public sealed class ChapmanMeanLengthTrue : AbstractStringMetric | |
| { | |
| private const double DefaultMismatchScore = 0.0; | |
| private const double DefaultPerfectScore = 1.0; | |
| public override double GetSimilarity(string firstWord, string secondWord) | |
| { | |
| if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord)) | |
| { | |
| return DefaultMismatchScore; | |
| } | |
| // Compute LCS length | |
| var lcs = LongestCommonSubsequence(firstWord, secondWord); | |
| // Chapman Mean Length formula | |
| var score = 2.0 * lcs / (firstWord.Length + secondWord.Length); | |
| return score switch | |
| { | |
| < DefaultMismatchScore => DefaultMismatchScore, | |
| > DefaultPerfectScore => DefaultPerfectScore, | |
| _ => score | |
| }; | |
| } | |
| public override string GetSimilarityExplained(string firstWord, string secondWord) | |
| { | |
| throw new NotImplementedException(); | |
| } | |
| public override double GetSimilarityTimingEstimated(string firstWord, string secondWord) | |
| { | |
| return 0.0; | |
| } | |
| public override double GetUnnormalisedSimilarity(string firstWord, string secondWord) | |
| { | |
| return GetSimilarity(firstWord, secondWord); | |
| } | |
| public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm"; | |
| public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue); | |
| private static int LongestCommonSubsequence(string s1, string s2) | |
| { | |
| int m = s1.Length, n = s2.Length; | |
| int[,] dp = new int[m + 1, n + 1]; | |
| for (var i = 0; i < m; i++) | |
| { | |
| for (var j = 0; j < n; j++) | |
| { | |
| if (s1[i] == s2[j]) | |
| { | |
| dp[i + 1, j + 1] = dp[i, j] + 1; | |
| } | |
| else | |
| { | |
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | |
| } | |
| } | |
| } | |
| return dp[m, n]; | |
| namespace SimMetrics.Net.Metric | |
| { | |
| /// <summary> | |
| /// Correct Chapman Mean Length implementation. | |
| /// </summary> | |
| public sealed class ChapmanMeanLengthTrue : AbstractStringMetric | |
| { | |
| private const double DefaultMismatchScore = 0.0; | |
| private const double DefaultPerfectScore = 1.0; | |
| public override double GetSimilarity(string firstWord, string secondWord) | |
| { | |
| if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord)) | |
| { | |
| return DefaultMismatchScore; | |
| } | |
| // Compute LCS length | |
| var lcs = LongestCommonSubsequence(firstWord, secondWord); | |
| // Chapman Mean Length formula | |
| var score = 2.0 * lcs / (firstWord.Length + secondWord.Length); | |
| return score switch | |
| { | |
| < DefaultMismatchScore => DefaultMismatchScore, | |
| > DefaultPerfectScore => DefaultPerfectScore, | |
| _ => score | |
| }; | |
| } | |
| public override string GetSimilarityExplained(string firstWord, string secondWord) | |
| { | |
| throw new NotImplementedException(); | |
| } | |
| public override double GetSimilarityTimingEstimated(string firstWord, string secondWord) | |
| { | |
| return 0.0; | |
| } | |
| public override double GetUnnormalisedSimilarity(string firstWord, string secondWord) | |
| { | |
| return GetSimilarity(firstWord, secondWord); | |
| } | |
| public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm"; | |
| public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue); | |
| private static int LongestCommonSubsequence(string s1, string s2) | |
| { | |
| int m = s1.Length, n = s2.Length; | |
| int[,] dp = new int[m + 1, n + 1]; | |
| for (var i = 0; i < m; i++) | |
| { | |
| for (var j = 0; j < n; j++) | |
| { | |
| if (s1[i] == s2[j]) | |
| { | |
| dp[i + 1, j + 1] = dp[i, j] + 1; | |
| } | |
| else | |
| { | |
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | |
| } | |
| } | |
| } | |
| return dp[m, n]; | |
| } |
| int[,] dp = new int[m + 1, n + 1]; | ||
|
|
||
| for (var i = 0; i < m; i++) | ||
| { | ||
| for (var j = 0; j < n; j++) | ||
| { | ||
| if (s1[i] == s2[j]) | ||
| { | ||
| dp[i + 1, j + 1] = dp[i, j] + 1; | ||
| } | ||
| else | ||
| { | ||
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return dp[m, n]; |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The LCS implementation uses O(m*n) space complexity. For large strings, consider optimizing to use O(min(m, n)) space by only keeping the previous and current rows of the DP table, since you only need the final result and not the actual LCS itself.
| int[,] dp = new int[m + 1, n + 1]; | |
| for (var i = 0; i < m; i++) | |
| { | |
| for (var j = 0; j < n; j++) | |
| { | |
| if (s1[i] == s2[j]) | |
| { | |
| dp[i + 1, j + 1] = dp[i, j] + 1; | |
| } | |
| else | |
| { | |
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | |
| } | |
| } | |
| } | |
| return dp[m, n]; | |
| // Ensure s2 is the shorter string to minimize space | |
| if (n > m) | |
| { | |
| // Swap to always use less space | |
| var temp = s1; s1 = s2; s2 = temp; | |
| m = s1.Length; n = s2.Length; | |
| } | |
| int[] prev = new int[n + 1]; | |
| int[] curr = new int[n + 1]; | |
| for (int i = 0; i < m; i++) | |
| { | |
| for (int j = 0; j < n; j++) | |
| { | |
| if (s1[i] == s2[j]) | |
| { | |
| curr[j + 1] = prev[j] + 1; | |
| } | |
| else | |
| { | |
| curr[j + 1] = Math.Max(prev[j + 1], curr[j]); | |
| } | |
| } | |
| // Swap rows for next iteration | |
| var tempArr = prev; prev = curr; curr = tempArr; | |
| } | |
| return prev[n]; |
| [Theory] | ||
| [InlineData("Davdi", 0.800000)] | ||
| [InlineData("david", 0.800000)] | ||
| [InlineData("David", 1.000000)] | ||
| [InlineData("Maday", 0.400000)] | ||
| [InlineData("Daves", 0.600000)] | ||
| [InlineData("divaD", 0.200000)] | ||
| [InlineData("Dave", 0.666667)] | ||
| [InlineData("Dovid", 0.800000)] | ||
| [InlineData("Dadiv", 0.600000)] | ||
| [InlineData("Da.v.id", 0.833333)] | ||
| [InlineData("Dav id", 0.909091)] | ||
| [InlineData("12345", 0.000000)] | ||
| [InlineData("Divad", 0.600000)] | ||
| [InlineData("D-avid", 0.909091)] | ||
| [InlineData("xxxxx", 0.000000)] |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding test cases for edge conditions like null strings, empty strings, and identical strings to ensure the metric handles these cases correctly (the implementation handles null/empty by returning 0.0).
| <VersionPrefix>1.0.5.0</VersionPrefix> | ||
| <Authors>Hamed Fathi;Stef Heyenrath</Authors> | ||
| <TargetFrameworks>net20;net35;net40;net45;netstandard1.0;netstandard2.0</TargetFrameworks> | ||
| <LangVersion>12</LangVersion> |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting LangVersion to 12 (C# 12) may cause compatibility issues with the older target frameworks (net20, net35, net40). C# 12 features may not be fully supported by these legacy frameworks. Consider using a lower LangVersion that's compatible across all target frameworks, or only use language features that are supported by the oldest target framework.
| <LangVersion>12</LangVersion> | |
| <LangVersion>2</LangVersion> |
| if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord)) | ||
| { | ||
| return DefaultMismatchScore; | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord) returns 0.0 even when both strings are empty. Two empty strings should arguably return 1.0 (perfect match) since they are identical. Consider checking if (firstWord == null || secondWord == null) return DefaultMismatchScore; followed by a separate check for the case when both strings are empty: if (firstWord.Length == 0 && secondWord.Length == 0) return DefaultPerfectScore;
| if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord)) | |
| { | |
| return DefaultMismatchScore; | |
| } | |
| if (firstWord == null || secondWord == null) | |
| { | |
| return DefaultMismatchScore; | |
| } | |
| if (firstWord.Length == 0 && secondWord.Length == 0) | |
| { | |
| return DefaultPerfectScore; | |
| } | |
| if (firstWord.Length == 0 || secondWord.Length == 0) | |
| { | |
| return DefaultMismatchScore; | |
| } |
| namespace SimMetrics.Net.Metric; | ||
|
|
||
| /// <summary> | ||
| /// Correct Chapman Mean Length implementation. |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation says "Correct Chapman Mean Length implementation" but doesn't explain what makes this "correct" compared to the existing ChapmanMeanLength class, or what the algorithm actually does. Consider adding more context about the Chapman Mean Length algorithm and how this differs from the existing implementation.
| /// Correct Chapman Mean Length implementation. | |
| /// Implements the Chapman Mean Length similarity algorithm, which measures the similarity between two strings | |
| /// based on the length of their longest common subsequence (LCS), normalized by the mean length of the input strings. | |
| /// This implementation is considered "correct" as it strictly follows the original Chapman Mean Length definition, | |
| /// addressing inaccuracies or deviations present in the existing <see cref="ChapmanMeanLength"/> class. | |
| /// The similarity score is calculated as 2 * LCS(firstWord, secondWord) / (|firstWord| + |secondWord|). | |
| /// For more details, see: Chapman, S. (1995). "String similarity metrics for information retrieval". |
| if (s1[i] == s2[j]) | ||
| { | ||
| dp[i + 1, j + 1] = dp[i, j] + 1; | ||
| } | ||
| else | ||
| { | ||
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both branches of this 'if' statement write to the same variable - consider using '?' to express intent better.
| if (s1[i] == s2[j]) | |
| { | |
| dp[i + 1, j + 1] = dp[i, j] + 1; | |
| } | |
| else | |
| { | |
| dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]); | |
| } | |
| dp[i + 1, j + 1] = (s1[i] == s2[j]) | |
| ? dp[i, j] + 1 | |
| : Math.Max(dp[i, j + 1], dp[i + 1, j]); |
No description provided.