Skip to content

Conversation

@StefH
Copy link
Owner

@StefH StefH commented Dec 6, 2025

No description provided.

@StefH StefH self-assigned this Dec 6, 2025
@StefH StefH requested a review from Copilot December 6, 2025 17:00
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a corrected version of the Chapman Mean Length algorithm by introducing the ChapmanMeanLengthTrue class, which uses Longest Common Subsequence (LCS) for calculating string similarity instead of a simple length-based heuristic.

Key Changes:

  • Implements ChapmanMeanLengthTrue metric using LCS-based similarity calculation
  • Adds comprehensive unit tests for the new metric
  • Updates documentation and fixes spelling errors in existing Chapman classes
  • Sets C# language version to 12 to enable modern C# features

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/SimMetrics.Net/Metric/ChapmanMeanLengthTrue.cs New metric class implementing LCS-based Chapman Mean Length algorithm
tests/SimMetrics.Net.Tests/SimilarityClasses/LengthBased/ChapmanMeanLengthTrueTests.cs Unit tests for the new ChapmanMeanLengthTrue metric
src/SimMetrics.Net/SimMetrics.Net.csproj Sets LangVersion to 12 to enable C# 12 features
src/SimMetrics.Net/Metric/ChapmanMeanLength.cs Adds documentation clarifying this is a heuristic, fixes spelling in description
src/SimMetrics.Net/Metric/ChapmanLengthDeviation.cs Fixes spelling error in description
tests/SimMetrics.Net.Tests/SimMetrics.Net.Tests.csproj Removes redundant AssemblyName, PackageId properties and commented code
tests/SimMetrics.Net.Tests/AssertUtil.cs Refactors to use file-scoped namespace
src/SimMetrics.Net/Extensions.cs Adds braces around single-line if statement

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

namespace SimMetrics.Net.Metric
{
/// <summary>
/// This method only the lengths of the two words, not at the actual characters.
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar issue: The sentence is missing a verb. Should be "This method looks at only the lengths of the two words" or "This method uses only the lengths of the two words".

Suggested change
/// This method only the lengths of the two words, not at the actual characters.
/// This method uses only the lengths of the two words, not the actual characters.

Copilot uses AI. Check for mistakes.
{
/// <summary>
/// This method only the lengths of the two words, not at the actual characters.
/// It uses some cutoff(ChapmanMeanLengthMaxString) and a polynomial scaling(1 - num2^4) to produce a score.
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after opening parenthesis in both "cutoff(ChapmanMeanLengthMaxString)" and "scaling(1 - num2^4)". Should be "cutoff (ChapmanMeanLengthMaxString)" and "scaling (1 - num2^4)".

Suggested change
/// It uses some cutoff(ChapmanMeanLengthMaxString) and a polynomial scaling(1 - num2^4) to produce a score.
/// It uses some cutoff (ChapmanMeanLengthMaxString) and a polynomial scaling (1 - num2^4) to produce a score.

Copilot uses AI. Check for mistakes.
using SimMetrics.Net.Metric;
using Xunit;

namespace SimMetrics.Net.Tests.SimilarityClasses.LengthBased;
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file uses file-scoped namespace declaration (with semicolon), while all other test files in the SimilarityClasses directory use block-scoped namespace declarations (with braces). Consider using block-scoped namespaces for consistency with the existing codebase.

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +74
namespace SimMetrics.Net.Metric;

/// <summary>
/// Correct Chapman Mean Length implementation.
/// </summary>
public sealed class ChapmanMeanLengthTrue : AbstractStringMetric
{
private const double DefaultMismatchScore = 0.0;
private const double DefaultPerfectScore = 1.0;

public override double GetSimilarity(string firstWord, string secondWord)
{
if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord))
{
return DefaultMismatchScore;
}

// Compute LCS length
var lcs = LongestCommonSubsequence(firstWord, secondWord);

// Chapman Mean Length formula
var score = 2.0 * lcs / (firstWord.Length + secondWord.Length);

return score switch
{
< DefaultMismatchScore => DefaultMismatchScore,
> DefaultPerfectScore => DefaultPerfectScore,
_ => score
};
}

public override string GetSimilarityExplained(string firstWord, string secondWord)
{
throw new NotImplementedException();
}

public override double GetSimilarityTimingEstimated(string firstWord, string secondWord)
{
return 0.0;
}

public override double GetUnnormalisedSimilarity(string firstWord, string secondWord)
{
return GetSimilarity(firstWord, secondWord);
}

public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm";

public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue);

private static int LongestCommonSubsequence(string s1, string s2)
{
int m = s1.Length, n = s2.Length;
int[,] dp = new int[m + 1, n + 1];

for (var i = 0; i < m; i++)
{
for (var j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
}
}

return dp[m, n];
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file uses file-scoped namespace declaration (with semicolon), while all other files in the Metric directory use block-scoped namespace declarations (with braces). Consider using block-scoped namespaces for consistency with the existing codebase.

Suggested change
namespace SimMetrics.Net.Metric;
/// <summary>
/// Correct Chapman Mean Length implementation.
/// </summary>
public sealed class ChapmanMeanLengthTrue : AbstractStringMetric
{
private const double DefaultMismatchScore = 0.0;
private const double DefaultPerfectScore = 1.0;
public override double GetSimilarity(string firstWord, string secondWord)
{
if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord))
{
return DefaultMismatchScore;
}
// Compute LCS length
var lcs = LongestCommonSubsequence(firstWord, secondWord);
// Chapman Mean Length formula
var score = 2.0 * lcs / (firstWord.Length + secondWord.Length);
return score switch
{
< DefaultMismatchScore => DefaultMismatchScore,
> DefaultPerfectScore => DefaultPerfectScore,
_ => score
};
}
public override string GetSimilarityExplained(string firstWord, string secondWord)
{
throw new NotImplementedException();
}
public override double GetSimilarityTimingEstimated(string firstWord, string secondWord)
{
return 0.0;
}
public override double GetUnnormalisedSimilarity(string firstWord, string secondWord)
{
return GetSimilarity(firstWord, secondWord);
}
public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm";
public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue);
private static int LongestCommonSubsequence(string s1, string s2)
{
int m = s1.Length, n = s2.Length;
int[,] dp = new int[m + 1, n + 1];
for (var i = 0; i < m; i++)
{
for (var j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
}
}
return dp[m, n];
namespace SimMetrics.Net.Metric
{
/// <summary>
/// Correct Chapman Mean Length implementation.
/// </summary>
public sealed class ChapmanMeanLengthTrue : AbstractStringMetric
{
private const double DefaultMismatchScore = 0.0;
private const double DefaultPerfectScore = 1.0;
public override double GetSimilarity(string firstWord, string secondWord)
{
if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord))
{
return DefaultMismatchScore;
}
// Compute LCS length
var lcs = LongestCommonSubsequence(firstWord, secondWord);
// Chapman Mean Length formula
var score = 2.0 * lcs / (firstWord.Length + secondWord.Length);
return score switch
{
< DefaultMismatchScore => DefaultMismatchScore,
> DefaultPerfectScore => DefaultPerfectScore,
_ => score
};
}
public override string GetSimilarityExplained(string firstWord, string secondWord)
{
throw new NotImplementedException();
}
public override double GetSimilarityTimingEstimated(string firstWord, string secondWord)
{
return 0.0;
}
public override double GetUnnormalisedSimilarity(string firstWord, string secondWord)
{
return GetSimilarity(firstWord, secondWord);
}
public override string LongDescriptionString => "A true implementation of the Chapman Mean Length algorithm";
public override string ShortDescriptionString => nameof(ChapmanMeanLengthTrue);
private static int LongestCommonSubsequence(string s1, string s2)
{
int m = s1.Length, n = s2.Length;
int[,] dp = new int[m + 1, n + 1];
for (var i = 0; i < m; i++)
{
for (var j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
}
}
return dp[m, n];
}

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +74
int[,] dp = new int[m + 1, n + 1];

for (var i = 0; i < m; i++)
{
for (var j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
}
}

return dp[m, n];
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The LCS implementation uses O(m*n) space complexity. For large strings, consider optimizing to use O(min(m, n)) space by only keeping the previous and current rows of the DP table, since you only need the final result and not the actual LCS itself.

Suggested change
int[,] dp = new int[m + 1, n + 1];
for (var i = 0; i < m; i++)
{
for (var j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
}
}
return dp[m, n];
// Ensure s2 is the shorter string to minimize space
if (n > m)
{
// Swap to always use less space
var temp = s1; s1 = s2; s2 = temp;
m = s1.Length; n = s2.Length;
}
int[] prev = new int[n + 1];
int[] curr = new int[n + 1];
for (int i = 0; i < m; i++)
{
for (int j = 0; j < n; j++)
{
if (s1[i] == s2[j])
{
curr[j + 1] = prev[j] + 1;
}
else
{
curr[j + 1] = Math.Max(prev[j + 1], curr[j]);
}
}
// Swap rows for next iteration
var tempArr = prev; prev = curr; curr = tempArr;
}
return prev[n];

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +25
[Theory]
[InlineData("Davdi", 0.800000)]
[InlineData("david", 0.800000)]
[InlineData("David", 1.000000)]
[InlineData("Maday", 0.400000)]
[InlineData("Daves", 0.600000)]
[InlineData("divaD", 0.200000)]
[InlineData("Dave", 0.666667)]
[InlineData("Dovid", 0.800000)]
[InlineData("Dadiv", 0.600000)]
[InlineData("Da.v.id", 0.833333)]
[InlineData("Dav id", 0.909091)]
[InlineData("12345", 0.000000)]
[InlineData("Divad", 0.600000)]
[InlineData("D-avid", 0.909091)]
[InlineData("xxxxx", 0.000000)]
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding test cases for edge conditions like null strings, empty strings, and identical strings to ensure the metric handles these cases correctly (the implementation handles null/empty by returning 0.0).

Copilot uses AI. Check for mistakes.
<VersionPrefix>1.0.5.0</VersionPrefix>
<Authors>Hamed Fathi;Stef Heyenrath</Authors>
<TargetFrameworks>net20;net35;net40;net45;netstandard1.0;netstandard2.0</TargetFrameworks>
<LangVersion>12</LangVersion>
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting LangVersion to 12 (C# 12) may cause compatibility issues with the older target frameworks (net20, net35, net40). C# 12 features may not be fully supported by these legacy frameworks. Consider using a lower LangVersion that's compatible across all target frameworks, or only use language features that are supported by the oldest target framework.

Suggested change
<LangVersion>12</LangVersion>
<LangVersion>2</LangVersion>

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +19
if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord))
{
return DefaultMismatchScore;
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord) returns 0.0 even when both strings are empty. Two empty strings should arguably return 1.0 (perfect match) since they are identical. Consider checking if (firstWord == null || secondWord == null) return DefaultMismatchScore; followed by a separate check for the case when both strings are empty: if (firstWord.Length == 0 && secondWord.Length == 0) return DefaultPerfectScore;

Suggested change
if (string.IsNullOrEmpty(firstWord) || string.IsNullOrEmpty(secondWord))
{
return DefaultMismatchScore;
}
if (firstWord == null || secondWord == null)
{
return DefaultMismatchScore;
}
if (firstWord.Length == 0 && secondWord.Length == 0)
{
return DefaultPerfectScore;
}
if (firstWord.Length == 0 || secondWord.Length == 0)
{
return DefaultMismatchScore;
}

Copilot uses AI. Check for mistakes.
namespace SimMetrics.Net.Metric;

/// <summary>
/// Correct Chapman Mean Length implementation.
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says "Correct Chapman Mean Length implementation" but doesn't explain what makes this "correct" compared to the existing ChapmanMeanLength class, or what the algorithm actually does. Consider adding more context about the Chapman Mean Length algorithm and how this differs from the existing implementation.

Suggested change
/// Correct Chapman Mean Length implementation.
/// Implements the Chapman Mean Length similarity algorithm, which measures the similarity between two strings
/// based on the length of their longest common subsequence (LCS), normalized by the mean length of the input strings.
/// This implementation is considered "correct" as it strictly follows the original Chapman Mean Length definition,
/// addressing inaccuracies or deviations present in the existing <see cref="ChapmanMeanLength"/> class.
/// The similarity score is calculated as 2 * LCS(firstWord, secondWord) / (|firstWord| + |secondWord|).
/// For more details, see: Chapman, S. (1995). "String similarity metrics for information retrieval".

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +70
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both branches of this 'if' statement write to the same variable - consider using '?' to express intent better.

Suggested change
if (s1[i] == s2[j])
{
dp[i + 1, j + 1] = dp[i, j] + 1;
}
else
{
dp[i + 1, j + 1] = Math.Max(dp[i, j + 1], dp[i + 1, j]);
}
dp[i + 1, j + 1] = (s1[i] == s2[j])
? dp[i, j] + 1
: Math.Max(dp[i, j + 1], dp[i + 1, j]);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants