(Improvements, need help testing) Rune related and span-ish optimizations #1247

pl752 · 2025-12-11T07:40:45Z

I want to propose a set of changes aimed at improving performance, which I have implemented and used for some time in my (private) projects.
The main goal of these changes is to significantly reduce heap allocations by using stack allocations and the array pool, and by avoiding unnecessary allocations in the first place.
I have created a topic on the mailing list.
I would appreciate opinions and help with testing. I have been using these changes for a while without any anomalies, though I haven't run thorough tests against all versions (I am using an FB 3 server). The changes should not alter observable behavior.

niekschoemaker · 2025-12-12T00:02:07Z

Personally, most of the changes seem to make sense, but I would make the case that the Auth part becomes way too complex with these changes (I'm also not sure how often that code even runs; I suppose it runs once per connection, so it's probably not much of a hot path).

The other parts do seem to make sense, especially the ReaderWriter optimizations, as those run for each query.

Did you however happen to run the benchmarks against this to see what actual change it makes to performance?

pl752 · 2025-12-12T06:15:04Z

Unfortunately, I haven't gotten to running benchmarks yet. However, the changes resulted in a significant reduction of CPU time and allocations in application performance profiling runs. I will try to perform more thorough benchmarks and correctness tests soon, when I have some free time.

pl752 · 2025-12-12T06:20:58Z

Also, I agree that the auth part is a case of over-optimization and can be omitted. I just applied the change pattern to everything I noticed that allocates temporary buffers. Optimizations for things that run once per session/connection aren't necessary.

pl752 · 2025-12-12T07:05:16Z

Upd: I have run the Perf project I found in the solution (I don't know how representative it is). And yeah, the speed difference is pretty negligible; however, the reduction in allocations can be clearly observed.

BenchmarkDotNet v0.15.8, Windows 10 (10.0.19044.6691/21H2/November2021Update)
AMD Ryzen 7 5800H with Radeon Graphics 3.20GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.101
  [Host]  : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  NuGet   : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3
  Project : .NET 8.0.22 (8.0.22, 8.0.2225.52707), X64 RyuJIT x86-64-v3

Jit=RyuJit  Platform=X64  Toolchain=.NET 8.0
WarmupCount=3

| Method  | Job     | BuildConfiguration | DataType             | Count | Mean        | Error     | StdDev    | Ratio | Gen0    | Allocated | Alloc Ratio |
|-------- |-------- |------------------- |--------------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|------------:|
| Execute | NuGet   | ReleaseNuGet       | bigint               | 100   | 20,322.3 us | 212.53 us | 188.40 us |  1.00 | 31.2500 |  307.4 KB |        1.00 |
| Execute | Project | Release            | bigint               | 100   | 20,160.8 us | 175.47 us | 146.52 us |  0.99 |       - | 237.61 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | bigint               | 100   |    482.7 us |   4.17 us |   3.90 us |  1.00 |  6.8359 |  56.64 KB |        1.00 |
| Fetch   | Project | Release            | bigint               | 100   |    484.2 us |   3.33 us |   2.78 us |  1.00 |  4.8828 |  40.35 KB |        0.71 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Execute | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   | 20,406.9 us | 217.86 us | 193.12 us |  1.00 | 31.2500 | 311.34 KB |        1.00 |
| Execute | Project | Release            | varch(...) utf8 [30] | 100   | 20,251.5 us | 118.63 us | 110.97 us |  0.99 |       - | 238.43 KB |        0.77 |
|         |         |                    |                      |       |             |           |           |       |         |           |             |
| Fetch   | NuGet   | ReleaseNuGet       | varch(...) utf8 [30] | 100   |    490.7 us |   3.71 us |   3.47 us |  1.00 |  6.8359 |  60.51 KB |        1.00 |
| Fetch   | Project | Release            | varch(...) utf8 [30] | 100   |    494.8 us |   6.60 us |   5.85 us |  1.01 |  4.8828 |   41.1 KB |        0.68 |

// * Hints *
Outliers
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.28 ms)
  CommandBenchmark.Execute: Project -> 2 outliers were removed (20.81 ms, 21.15 ms)
  CommandBenchmark.Fetch: Project   -> 2 outliers were removed (499.44 us, 507.61 us)
  CommandBenchmark.Execute: NuGet   -> 1 outlier  was  removed (21.71 ms)
  CommandBenchmark.Fetch: Project   -> 1 outlier  was  removed (528.00 us)

Also, Firebird 3 is used; the disk is an OEM Samsung NVMe 2 TB (PM9A1, aka OEM 980 Pro), with 32 GB of DDR4 RAM @ 3200 MT/s JEDEC, dual channel of course.

pl752 · 2025-12-12T09:17:49Z

Upd2: Ran the tests with Firebird 3 (not embedded), so it still needs further testing with other versions (especially embedded, and batch operations in modern FB). There was an issue with boolean reading because _smallbuffer was used both for reading the useful bytes and for the padding (which doesn't affect types that don't get padded). Also, a small reduction in test run time was observed (24.1 -> 23.5 min, but without repeatability checks), and no changes in the passed/failed/skipped numbers were noticed (after the fix).

pl752 · 2025-12-12T09:53:49Z

Upd3: performed tests with embedded engine, all passed

pl752 · 2025-12-12T15:43:57Z

Upd4:
TL;DR: I wrote some benchmarks specific to my (unfortunately private) solution's queries. Changes in query execution timing are sometimes hard to register because the FB 3 engine is the main bottleneck in the testing scenarios, even in ideal conditions (localhost with a fast CPU and NVMe). However, it seems that query creation/preparation benefited significantly, there is a massive boost in string operations due to the rune conversion rework, and positive side effects on memory and local CPU time utilization can be observed.

Benchmark results:

//Update multiple: Optimized (local_opt2)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,695.5 us |  33.42 us |  39.78 us |  3.9063 |  42.78 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,000.3 us | 481.56 us | 426.89 us | 83.3333 | 867.56 KB |


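//Update multiple: Original (master)
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Update_MainDeliveryById_Merge_RollbackAsync | 25         |  1,704.9 us |  33.17 us |  52.62 us |  3.9063 |  46.98 KB |
| Update_MainDeliveryById_Merge_RollbackAsync | 1000       | 42,416.1 us | 634.02 us | 593.06 us | 83.3333 |  985.1 KB |


//Single insert/upsert: Optimized
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Select_LoadWBSellerAccountsAsync            | -          |    717.2 us |  14.08 us |  14.46 us |  1.9531 |  29.26 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -          |    708.6 us |  12.61 us |  11.18 us |  3.9063 |  33.58 KB |


//Single insert/upsert: Original
| Method                                      | UpdateRows | Mean        | Error     | StdDev    | Gen0    | Allocated |
|-------------------------------------------- |----------- |------------:|----------:|----------:|--------:|----------:|
| Select_LoadWBSellerAccountsAsync            | -          |    741.5 us |  14.50 us |  18.86 us |  3.9063 |   33.5 KB |
| Insert_Upsert_WbDocCache_RollbackAsync      | -          |    724.0 us |  13.94 us |  18.13 us |  3.9063 |  37.22 KB |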


//Select multiple mixed (3 int, 1 literal char string): Optimized
| Method                              | Rows   | Mean           | Error        | StdDev        | Gen0       | Gen1      | Allocated    |
|------------------------------------ |------- |---------------:|-------------:|--------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |       741.1 us |     19.82 us |      57.83 us |          - |         - |     47.52 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |     3,507.2 us |     66.56 us |      81.74 us |          - |         - |     421.2 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    31,362.5 us |  5,421.76 us |  15,986.17 us |          - |         - |   4078.15 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   321,426.4 us | 17,596.63 us |  51,884.06 us |  4000.0000 |         - |  40710.48 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,394,208.6 us | 67,371.50 us | 193,301.47 us | 49000.0000 | 9000.0000 | 407078.14 KB |


//Select multiple mixed (3 int, 1 literal char string): Original
//(Yes, 1.09x to >2x in speed and 10x in allocation volume; when profiling, the allocate/free event counters actually differ by ~100x)
| Method                              | Rows   | Mean         | Error      | StdDev      | Median       | Gen0        | Gen1        | Allocated     |
|------------------------------------ |------- |-------------:|-----------:|------------:|-------------:|------------:|------------:|--------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10     |     1.611 ms |  0.0506 ms |   0.1453 ms |     1.604 ms |           - |           - |     457.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100    |    11.138 ms |  0.1882 ms |   0.1760 ms |    11.129 ms |           - |           - |    4511.55 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000   |    34.017 ms |  4.9421 ms |  14.4162 ms |    25.360 ms |   5000.0000 |   1000.0000 |   44988.63 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000  |   346.085 ms | 23.2037 ms |  68.4167 ms |   337.300 ms |  55000.0000 |  11000.0000 |  449544.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000 | 3,709.593 ms | 73.9280 ms | 194.7560 ms | 3,695.932 ms | 550000.0000 | 110000.0000 | 4494710.22 KB |

//Select multiple int only (3 int): Optimized
| Method                              | Rows    | Mean           | Error       | StdDev      | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       376.7 us |    17.69 us |    50.19 us |          - |         - |     11.73 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,102.8 us |    54.87 us |   160.06 us |          - |         - |     63.99 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,537.8 us |   689.68 us | 2,033.53 us |          - |         - |    497.61 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,131.8 us |   137.98 us |   115.22 us |          - |         - |   4927.88 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   176,431.1 us |   956.29 us |   798.54 us |  6000.0000 |         - |  49230.36 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,743,465.8 us | 7,326.00 us | 6,494.31 us | 60000.0000 | 6000.0000 | 497846.96 KB |

//Select multiple int only (3 int): Original
| Method                              | Rows    | Mean           | Error       | StdDev      | Median         | Gen0       | Gen1      | Allocated    |
|------------------------------------ |-------- |---------------:|------------:|------------:|---------------:|-----------:|----------:|-------------:|
| SelectAndMap_Main_ReusedBufferAsync | 10      |       357.9 us |     9.24 us |    25.44 us |       355.4 us |          - |         - |     12.89 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100     |     1,182.7 us |    49.32 us |   142.29 us |     1,158.6 us |          - |         - |     70.78 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000    |     4,541.1 us |   718.80 us | 2,119.41 us |     3,485.7 us |          - |         - |    561.27 KB |
| SelectAndMap_Main_ReusedBufferAsync | 10000   |    18,280.0 us |   340.21 us |   454.17 us |    18,246.9 us |          - |         - |   5561.08 KB |
| SelectAndMap_Main_ReusedBufferAsync | 100000  |   173,885.9 us |   916.01 us |   764.91 us |   173,896.6 us |  6000.0000 |         - |  55558.87 KB |
| SelectAndMap_Main_ReusedBufferAsync | 1000000 | 1,745,630.3 us | 4,109.23 us | 3,642.73 us | 1,745,432.2 us | 68000.0000 | 7000.0000 | 561128.53 KB |

It was a little tricky to obtain measurements that actually show the improvements; however, some interesting observations can be made.
The main explanation for the small timing improvements is that the DB engine seems to use a whole CPU core, while the application thread is idle most of the time. This holds even though my benchmarks do pretty much nothing aside from opening a connection, opening a configured transaction, creating queries, filling in parameters, preparing (if run multiple times in a row), executing/reading, mapping the selected fields to a single instance of a structure (to avoid performance noise as much as possible), rolling back the transaction and closing the connection.
However, string reading benefited heavily from the optimizations, which reduced the overall number of allocated objects 10-100x. The original rune char enumerator allocated every (!) rune as a separate char array, resulting in tens of millions of char[1] and char[2] objects being allocated and then collected shortly after, while the new methods avoid allocation as much as possible. The situation was made worse by the original rune counting method, which just ran the full enumeration, creating all the char arrays, and then simply counted them without ever using the char data itself. Reducing allocations to just the final buffers and strings saves a lot of CPU time (heap allocation is not a cheap operation even in .NET, and during string conversions the client library actually becomes the bottleneck instead of the engine).
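For illustration, here is a minimal sketch of counting and copying runes without a per-rune array. It is not the code from this PR: it assumes the standard `System.Text.Rune` APIs (.NET Core 3.0 and later), and `CountRunes` and `CopyRunes` are hypothetical names. It is written as a top-level program.

```csharp
using System;
using System.Text;

// Hypothetical helper: counts runes (Unicode scalar values) without
// materializing any of them as a separate char[] on the heap.
static int CountRunes(ReadOnlySpan<char> text)
{
    int count = 0;
    // The span rune enumerator is a struct, so this loop does not allocate.
    foreach (Rune rune in text.EnumerateRunes())
    {
        count++;
    }
    return count;
}

// Hypothetical helper: copies at most maxRunes runes into a caller-supplied
// buffer and returns the number of chars written.
// The destination must be large enough (up to 2 chars per rune).
static int CopyRunes(ReadOnlySpan<char> text, int maxRunes, Span<char> destination)
{
    int written = 0;
    int taken = 0;
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (taken == maxRunes)
        {
            break;
        }
        // Encodes the rune straight into the destination: 1 or 2 chars.
        written += rune.EncodeToUtf16(destination[written..]);
        taken++;
    }
    return written;
}

// 'a', one emoji (a surrogate pair, 2 chars), 'b': 4 chars but 3 runes.
string sample = "a\U0001F600b";
Console.WriteLine($"{sample.Length} chars, {CountRunes(sample)} runes");
```

The only buffer involved is the one the caller provides, so the caller decides whether it lives on the stack, in a pooled array, or in the final string.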
Also, the 10x difference in memory volume when working with strings comes from each char[1]/char[2] array holding not only 2-4 bytes of useful raw data, but also 0-6 bytes of padding (in some cases) and 8-16 bytes of array object metadata (effectively a Span, i.e. a pointer to the real data and the array length), and that's not counting the object type and reference management data.
Also, the tests of queries returning small numbers of rows usually yielded bigger percentage improvements (1 to 4% and 9 to 100+%), as, I think, the better string processing helped the query and parameter preparation phase.
Also, the timings are not the whole story, as the changes have some pretty beneficial side effects. Fewer allocations of course mean fewer GC invocations. stackalloc is also essentially free (it is not a complex allocator function, but rather a tiny sub esp, size ... add esp, size). And there is a reduction in CPU time that is observable even without the profiler: I could clearly see the main thread using 2-3% (5-6% during select with char) of the whole CPU, while the optimized version consumed only 1-3%. This means that on low-end client systems, or when the application uses the thread pool heavily, the DB reading task will occupy its thread less, leaving more time for other tasks when the pool is exhausted and work is queued, and, in theory, for other programs on low-end or heavily loaded machines.
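The general shape of the temporary-buffer pattern (stack for small sizes, array pool for larger ones) can be sketched as follows. This is an illustration under assumptions, not the PR's code: `ChecksumOfPattern` and the 256-byte `StackLimit` threshold are hypothetical, and only the standard `stackalloc`, `Span<T>` and `ArrayPool<T>.Shared` APIs are used.

```csharp
using System;
using System.Buffers;

// Hypothetical helper: fills a temporary buffer of the requested size and
// returns a simple checksum, without allocating a new array on each call.
static int ChecksumOfPattern(int length)
{
    const int StackLimit = 256; // assumed threshold; stack buffers should stay small

    byte[]? rented = null;
    Span<byte> buffer = length <= StackLimit
        ? stackalloc byte[StackLimit]                      // small: lives on the stack
        : (rented = ArrayPool<byte>.Shared.Rent(length));  // large: borrowed from the pool
    buffer = buffer[..length]; // Rent may return a larger array than requested

    try
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)i; // wraps around every 256 values
        }

        int sum = 0;
        foreach (byte b in buffer)
        {
            sum += b;
        }
        return sum;
    }
    finally
    {
        // Only pooled arrays need to be handed back; stack memory is released on return.
        if (rented is not null)
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}

Console.WriteLine(ChecksumOfPattern(4));    // stack path: 0 + 1 + 2 + 3 = 6
Console.WriteLine(ChecksumOfPattern(1000)); // pool path
```

The size cap matters because stack space is limited, and a rented array has to be returned on every exit path, which is what the `finally` block is for.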
Also, the lack of proper benchmark/test coverage is due to the rework starting as a small experiment out of curiosity, when I noticed that Firebird was the top 1-2 consumer of CPU time in my application. I then decided that the experiment was pretty successful and that the contribution might be useful for other developers and their solutions, so I decided to reach out with this proposal.

pl752 added 5 commits December 11, 2025 11:01

Optimized memory allocations using stackalloc, spans and pooled arrays

5556482

Reworked the rune enumerator to not spam byte[1...4] allocations and optimized rune operations

ddaf012

Removed necessity to allocate 0 size array

3a1e3ce

Elliminated some linq queries and allocations

c058fc5

Adjusted code style

8a921c2

pl752 added 2 commits December 12, 2025 13:57

Fixed boolean buffer overwrite mishap

1c46f86

Fixed static/nonstatic call mismatch (Some internal methods were turned static, breaking tests)

c5fadc8
