588 changes: 588 additions & 0 deletions Source/DigitViewer2/DigitScanner/DigitScanner.cpp

Large diffs are not rendered by default.

28 changes: 28 additions & 0 deletions Source/DigitViewer2/DigitScanner/DigitScanner.h
@@ -0,0 +1,28 @@
/* DigitScanner.h
*
* Author : Michael Kleber
* Date Created : 01/15/2026
* Last Modified : 01/15/2026
* Copyright 2026 Google LLC
*
*/

#pragma once
#include "PublicLibs/Types.h"

namespace DigitViewer2 {
using namespace ymp;

class BasicDigitReader;

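//  Scans the digits supplied by a BasicDigitReader until every d-digit
//  string has appeared (see README.md in this directory).
class DigitScanner {
public:
    DigitScanner(BasicDigitReader& reader, upL_t d);
    void search();

private:
    BasicDigitReader& m_reader;     //  source of the digits to scan
    upL_t m_d;                      //  length of the strings searched for
};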

}
85 changes: 85 additions & 0 deletions Source/DigitViewer2/DigitScanner/README.md
@@ -0,0 +1,85 @@
Scanning for All Strings of Digits
========
by Michael Kleber

Code in this directory implements a way to scan through a large file of digits until _every_ sequence of $d$ digits has appeared.

Are you wondering "Does my 10-digit phone number appear in the digits of pi?"
Yes it does, somewhere in the first 241,641,121,048 digits.
What about your 16-digit credit card number?
I don't know — we haven't calculated enough digits of pi to see every 16-digit number.
(Yet.)

## Background

Pi, and many other numbers you can compute with y-cruncher, are believed to be [normal numbers](https://en.wikipedia.org/wiki/Normal_number).
If so, every sequence of $d$ decimal digits appears in it, at approximately $1/10^d$ of the possible locations.
(That's what you would expect if the digits were random... and we have every reason to believe that pi's digits behave like random ones _from this particular point of view_.)

That leads to asking the very natural question:
"Out of the $10^d$ sequences of $d$ digits, which one takes the longest to appear, and how many digits does it take?"

* For d=1, the digit 0 is the last one to show up in pi, all the way out at the 32nd place after the decimal point: 3.1415926535897932384626433832795**0**2...
* For d=2 you need to go out to 606 places before you finally see the two-digit sequence 68.
* When Fabrice Bellard calculated 2.7 trillion digits of pi, he scanned for all sequences up to d=11, reported [here](https://bellard.org/pi/pi2700e9/pidigits.html#:~:text=scan%20decimal%20expansion%20of%20pi) in 2010.
* The scan for d=12 used the code in this directory, running on the [100 trillion digits computed by Google](https://pi.delivery/).
* The scan for d=13 used the code in this directory, running on the [314 trillion digits computed by StorageReview](https://www.storagereview.com/review/storagereview-sets-new-pi-record-314-trillion-digits-on-a-dell-poweredge-r7725).

| d | digits needed | last d-digit seq |
|:-:|---------------------:|:----------------:|
| 1| 32 | `0` |
| 2| 606 | `68` |
| 3| 8,555 | `483` |
| 4| 99,849 | `6716` |
| 5| 1,369,564 | `33394` |
| 6| 14,118,312 | `569540` |
| 7| 166,100,506 | `1075656` |
| 8| 1,816,743,912 | `36432643` |
| 9| 22,445,207,406 | `172484538` |
| 10| 241,641,121,048 | `5918289042` |
| 11| 2,512,258,603,207 | `56377726040` |
| 12| 27,261,146,164,637 | `717542605965` |
| 13| 294,420,436,740,325 | `8683109988379` |

* These are recorded in the [On-line Encyclopedia of Integer Sequences](https://oeis.org/) as entries [A036903](https://oeis.org/A036903) and [A032510](https://oeis.org/A032510).

For a 50-50 chance of seeing all sequences of 14 digits, you would need
[around 3.26 _quadrillion_](https://www.wolframalpha.com/input?i=N%5Bexp%28-n+exp%28-w%2Fn%29%29%5D+where+n+%3D+10%5E14+and+w+%3D+3.26+quadrillion)
random digits, so don't hold your breath.
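
The linked estimate uses the coupon-collector approximation: with $n = 10^d$ equally likely strings and $w$ random digits, the chance of having seen every string is roughly $\exp(-n e^{-w/n})$. Solving for $w$ at a target probability $p$ gives $w \approx n(\ln n - \ln(-\ln p))$. Here is a minimal sketch of that arithmetic. The function name `digits_for_probability` is made up for illustration and is not part of the code in this directory, and the formula treats overlapping positions as independent, which is only an approximation.

```cpp
#include <cassert>
#include <cmath>

// Approximate number of random digits needed so that all 10^d strings of
// d digits have appeared with probability p (coupon-collector estimate).
// Hypothetical helper for illustration only.
double digits_for_probability(unsigned d, double p) {
    assert(d >= 1 && p > 0.0 && p < 1.0);
    const double n = std::pow(10.0, static_cast<double>(d));
    // Solve p = exp(-n * exp(-w / n)) for w.
    return n * (std::log(n) - std::log(-std::log(p)));
}
```

Calling `digits_for_probability(14, 0.5)` gives about $3.26 \times 10^{15}$, the figure quoted above. For $d=13$ the same formula gives about $3.03 \times 10^{14}$, in the same range as the 294 trillion digits pi actually needed.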


## Algorithm

### Basic idea
To search for every string of $d$ digits:
* Make a bitvector of $10^d$ zeros.
* Look at strings of $d$ digits one at a time, each considered as a $d$-digit number $n$.
  * If the $n$'th bit in the bitvector is a $0$, then you've found a new string!
    * Go you! Set that bit to $1$ and add one to the variable "how many strings I've found so far."
    * If that variable equals $10^d$, you've seen them all! Have a party.
  * If the $n$'th bit in the bitvector is already a $1$, nothing to see here, move along.

If you have a lot of digits, a lot of memory, and a lot of time, this will do the job.
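
The basic idea can be sketched as a single-threaded loop. This is an illustration only, not the code in `DigitScanner.cpp`: the names `ScanResult` and `scan_all_strings` are made up, the digits are assumed to be in an in-memory string, and $d$ is capped at 9 to keep the bitvector small.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct ScanResult {
    bool complete;               // true if every d-digit string appeared
    std::uint64_t digits_needed; // digits read when the last string completed
    std::uint64_t last_value;    // the last string to appear, as a number
};

// Hypothetical single-threaded version of the bitvector scan.
ScanResult scan_all_strings(const std::string& digits, unsigned d) {
    assert(d >= 1 && d <= 9);
    std::uint64_t modulus = 1;
    for (unsigned k = 0; k < d; ++k) modulus *= 10;

    std::vector<bool> seen(modulus, false); // the bitvector of 10^d zeros
    std::uint64_t found = 0;                // strings found so far
    std::uint64_t window = 0;               // value of the last d digits

    for (std::uint64_t i = 0; i < digits.size(); ++i) {
        assert(digits[i] >= '0' && digits[i] <= '9');
        // Slide the window: append the new digit, drop the oldest one.
        window = (window * 10 + static_cast<std::uint64_t>(digits[i] - '0')) % modulus;
        if (i + 1 < d) continue;            // window not yet d digits long
        if (!seen[window]) {
            seen[window] = true;
            if (++found == modulus) return ScanResult{true, i + 1, window};
        }
    }
    return ScanResult{false, 0, 0};
}
```

Run on the first 32 decimals of pi, `scan_all_strings("14159265358979323846264338327950", 1)` reports completion after 32 digits with last value 0, matching the $d=1$ row of the table.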

If you don't have $10^d$ bits of memory (for $d=13$ that is 1.25 TB), then you could scan the digits more than once:
"Okay, _this_ time I'm going to only pay attention to $d$-digit strings that start with a 7."
This multi-scan idea is not implemented here. Call a friend with more RAM.

### Parallelization and efficiency
To run this search faster, we use many threads. We can't have all those threads writing to the same memory at once
(their changes might clobber each other), so we implement a little mapreduce-like arrangement. The mapper threads each own a
chunk of digits and convert them into $d$-digit values; the reducer threads each own a chunk of the bitvector and flip bits from 0 to 1
when a value is seen. The shuffle between $N$ mappers and $N$ reducers is implemented as an $N \times N$ array
of vectors of values, where vector $(i,j)$ holds values produced by mapper $i$ and consumed by reducer $j$.
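
To make the shuffle layout concrete, here is a small sketch of the bucket array. It is not the code in `DigitScanner.cpp`. Several things in it are assumptions: each reducer is taken to own one contiguous slice of the value range, each slice gets its own bitmap so that ownership is explicit, and plain loops stand in for the mapper and reducer threads. All names (`Bucket`, `slice_size`, `make_seen`, `shuffle_and_mark`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bucket = std::vector<std::uint64_t>;

// Size of the contiguous slice of [0, total) owned by each of n reducers.
std::uint64_t slice_size(std::uint64_t total, std::size_t n) {
    assert(n > 0);
    return (total + n - 1) / n;
}

// One private bitmap per reducer, so no two reducers share memory.
std::vector<std::vector<bool>> make_seen(std::uint64_t total, std::size_t n) {
    return std::vector<std::vector<bool>>(n, std::vector<bool>(slice_size(total, n), false));
}

// mapper_output[i] holds the values produced by mapper i.
// Returns how many values were seen for the first time.
std::uint64_t shuffle_and_mark(const std::vector<Bucket>& mapper_output,
                               std::vector<std::vector<bool>>& seen,
                               std::uint64_t total) {
    const std::size_t n = mapper_output.size();
    assert(n > 0 && seen.size() == n);
    const std::uint64_t slice = slice_size(total, n);

    // Map side: mapper i writes only to row i, i.e. buckets[i][*].
    std::vector<std::vector<Bucket>> buckets(n, std::vector<Bucket>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::uint64_t v : mapper_output[i]) {
            assert(v < total);
            buckets[i][static_cast<std::size_t>(v / slice)].push_back(v);
        }
    }

    // Reduce side: reducer j reads only column j and writes only seen[j].
    std::uint64_t newly_seen = 0;
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            for (std::uint64_t v : buckets[i][j]) {
                const std::uint64_t local = v - j * slice;
                if (!seen[j][local]) { seen[j][local] = true; ++newly_seen; }
            }
        }
    }
    return newly_seen;
}
```

Because row $i$ has a single writer and column $j$ has a single reader, the two sides need no locking on individual buckets, only a barrier between the map step and the reduce step.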

We stop that approach when the bitvector is getting close to all 1's, and switch to a new phase where we track the arrival
of the last few thousand strings in a (mutex-guarded) hash map that remembers at what position those strings finally appear.
This lets us keep using many threads and still find out which string took the longest to first show up.
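
A minimal sketch of the second phase follows, under stated assumptions. The class name `LateArrivals` and its interface are invented for illustration. It keeps the smallest position reported for each tracked string, on the assumption that threads may report positions out of order; the real code may handle ordering differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <mutex>
#include <unordered_map>
#include <vector>

constexpr std::uint64_t kNotSeen = std::numeric_limits<std::uint64_t>::max();

// Hypothetical tracker for the strings still missing at the cutover.
class LateArrivals {
public:
    explicit LateArrivals(const std::vector<std::uint64_t>& missing) {
        for (std::uint64_t v : missing) m_first_pos.emplace(v, kNotSeen);
    }

    // Called from any thread; values that are not tracked are ignored.
    void record(std::uint64_t value, std::uint64_t position) {
        assert(position != kNotSeen);
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_first_pos.find(value);
        if (it != m_first_pos.end()) {
            it->second = std::min(it->second, position);
        }
    }

    // If every tracked string has appeared, report the one whose first
    // appearance is latest and return true; otherwise return false.
    bool latest(std::uint64_t& value, std::uint64_t& position) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        bool any = false;
        std::uint64_t best_value = 0, best_pos = 0;
        for (const auto& entry : m_first_pos) {
            if (entry.second == kNotSeen) return false;
            if (!any || entry.second > best_pos) {
                best_value = entry.first;
                best_pos = entry.second;
                any = true;
            }
        }
        if (any) { value = best_value; position = best_pos; }
        return any;
    }

private:
    mutable std::mutex m_mutex;
    std::unordered_map<std::uint64_t, std::uint64_t> m_first_pos;
};
```

The map stays small because only the last few thousand strings are tracked, so a single mutex is a reasonable starting point in this sketch.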

The bitvector phase of the search is sped up by issuing memory prefetch hints, since the CPU spending all its time
asking for randomly-placed individual bits in a very large span of memory is a latency-pessimal access pattern.
The hash map phase uses a quick little Bloom filter to do less hashing.
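
The prefetch idea can be sketched as follows. This is an illustration rather than the code in `DigitScanner.cpp`: it assumes the bitvector is stored as 64-bit words, uses the GCC/Clang builtin `__builtin_prefetch` (a no-op fallback is used on other compilers), and picks a lookahead distance of 16 arbitrarily. The names `prefetch_for_write` and `mark_batch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hint that the cache line holding *p will be written soon.
inline void prefetch_for_write(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 1, 0);   // rw = 1 (write), locality = 0
#else
    (void)p;                       // no portable equivalent; do nothing
#endif
}

// Set the bit for every value in the batch and return how many bits were
// newly set. While handling values[i], request the word that will be
// needed for values[i + lookahead].
std::uint64_t mark_batch(std::vector<std::uint64_t>& words,
                         const std::vector<std::uint64_t>& values,
                         std::size_t lookahead = 16) {
    std::uint64_t newly_set = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i + lookahead < values.size()) {
            prefetch_for_write(&words[values[i + lookahead] / 64]);
        }
        assert(values[i] / 64 < words.size());
        std::uint64_t& word = words[values[i] / 64];
        const std::uint64_t mask = std::uint64_t{1} << (values[i] % 64);
        if ((word & mask) == 0) {
            word |= mask;
            ++newly_set;
        }
    }
    return newly_set;
}
```

The result is the same with or without the hint; only the timing changes, and the best lookahead distance depends on the hardware.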

The cutover point between the two search phases, the memory prefetch hint details, and the number of threads to use
are definitely sensitive to what exact hardware you're running on. If you plan to run this code for large $d$
(say 10 or up), you may profit from tuning these to your setup.
14 changes: 14 additions & 0 deletions Source/DigitViewer2/DigitViewer/DigitViewerTasks.cpp
@@ -30,6 +30,7 @@
#include "DigitViewer2/DigitWriters/BasicDigitWriter.h"
#include "DigitViewer2/DigitWriters/BasicTextWriter.h"
#include "DigitViewer2/DigitWriters/BasicYcdSetWriter.h"
#include "DigitViewer2/DigitScanner/DigitScanner.h"
#include "DigitViewerTasks.h"
namespace DigitViewer2{
////////////////////////////////////////////////////////////////////////////////
@@ -479,8 +480,21 @@ void to_ycd_file_partial(BasicDigitReader& reader){
);
process_write(reader, start_pos, end_pos - start_pos, writer, start_pos);
}
void find_last_d_string(BasicDigitReader& reader){
    Console::println("\n\nFind Last d-Digit String");
    Console::println();

    //  Get d from the user.
    upL_t d = Console::scan_label_upL_range("Enter d (1-13): ", 1, 13);
    Console::println();

    DigitScanner scanner(reader, d);
    scanner.search();
}
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
}


2 changes: 2 additions & 0 deletions Source/DigitViewer2/DigitViewer/DigitViewerTasks.h
@@ -25,9 +25,11 @@ void compute_stats(BasicDigitReader& reader);
void to_text_file(BasicDigitReader& reader);
void to_ycd_file_all(BasicDigitReader& reader);
void to_ycd_file_partial(BasicDigitReader& reader);
void find_last_d_string(BasicDigitReader& reader);
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
}
#endif

19 changes: 16 additions & 3 deletions Source/DigitViewer2/DigitViewer/DigitViewerUI2.cpp
@@ -52,9 +52,11 @@ void Menu_TextFile(BasicTextReader& reader){
Console::println("Compress digits 1 - N into one or more .ycd files.", 'G');
Console::print(" 4 ", 'w');
Console::println("Compress a subset of digits into .ycd files.", 'G');
Console::print(" 5 ", 'w');
Console::println("Search for all d-digit strings.", 'G');

Console::println("\nEnter your choice:", 'w');
- upL_t c = Console::scan_label_upL_range("option: ", 0, 4);
+ upL_t c = Console::scan_label_upL_range("option: ", 0, 5);
Console::println();

switch (c){
@@ -73,6 +75,9 @@ void Menu_TextFile(BasicTextReader& reader){
case 4:
to_ycd_file_partial(reader);
return;
case 5:
find_last_d_string(reader);
return;
default:;
}
}
@@ -115,14 +120,16 @@ void Menu_YcdFile(BasicYcdSetReader& reader){
Console::println("Compress digits 1 - N into one or more .ycd files.", 'G');
Console::print(" 4 ", 'w');
Console::println("Compress a subset of digits into .ycd files.", 'G');
Console::print(" 5 ", 'w');
Console::println("Search for all d-digit strings.", 'G');
Console::println();

- Console::print(" 5 ", 'w');
+ Console::print(" 6 ", 'w');
Console::print("Add search directory.", 'G');
Console::println(" (if .ycd files are in multiple paths)", 'Y');

Console::println("\nEnter your choice:", 'w');
- upL_t c = Console::scan_label_upL_range("option: ", 0, 5);
+ upL_t c = Console::scan_label_upL_range("option: ", 0, 6);
Console::println();

switch (c){
@@ -142,6 +149,10 @@ void Menu_YcdFile(BasicYcdSetReader& reader){
to_ycd_file_partial(reader);
return;
case 5:
find_last_d_string(reader);
return;

case 6:
Console::println("\nEnter directory:");
reader.add_search_path(Console::scan_utf8());
break;
@@ -200,3 +211,5 @@ void Menu_Main(){
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
}


3 changes: 3 additions & 0 deletions Source/DigitViewer2/Objects.mk
@@ -23,9 +23,12 @@ CURRENT += DigitWriters/BasicTextWriter.cpp
CURRENT += DigitWriters/BasicYcdFileWriter.cpp
CURRENT += DigitWriters/BasicYcdSetWriter.cpp

CURRENT += DigitScanner/DigitScanner.cpp

CURRENT += DigitViewer/DigitViewerTasks.cpp
CURRENT += DigitViewer/DigitViewerUI2.cpp


SOURCES := $(SOURCES) $(addprefix $(CURRENT_DIR)/, $(CURRENT))
endif

2 changes: 2 additions & 0 deletions Source/DigitViewer2/SMC_DigitViewer2.cpp
@@ -26,3 +26,5 @@

#include "DigitViewer/DigitViewerTasks.cpp"
#include "DigitViewer/DigitViewerUI2.cpp"

#include "DigitScanner/DigitScanner.cpp"
7 changes: 7 additions & 0 deletions Source/PublicLibs/BasicLibs/StringTools/ToString.cpp
@@ -216,6 +216,13 @@ YM_NO_INLINE std::string tostrln(uiL_t x, NumberFormat format){
YM_NO_INLINE std::string tostrln(siL_t x, NumberFormat format){
return tostr(x, format) += "\r\n";
}
//  Format x in decimal, left-padded with '0' to at least "width" characters.
YM_NO_INLINE std::string tostr_width(uiL_t x, int width){
    std::ostringstream out;
    out << std::setfill('0');
    out << std::setw(width);
    out << x;
    return out.str();
}
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
1 change: 1 addition & 0 deletions Source/PublicLibs/BasicLibs/StringTools/ToString.h
@@ -40,6 +40,7 @@ YM_NO_INLINE std::string tostrln (uiL_t x, NumberFormat format = NORMAL);
YM_NO_INLINE std::string tostrln (siL_t x, NumberFormat format = NORMAL);
static std::string tostrln (u32_t x, NumberFormat format = NORMAL){ return tostrln((uiL_t)x, format); }
static std::string tostrln (s32_t x, NumberFormat format = NORMAL){ return tostrln((siL_t)x, format); }
YM_NO_INLINE std::string tostr_width (uiL_t x, int width);
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
// Float
1 change: 1 addition & 0 deletions TinyTestData/README.md
@@ -0,0 +1 @@
Minimal .ycd file of 1 million decimal digits, included for testing purposes.
Binary file added TinyTestData/pi1m - 0.ycd
Binary file not shown.
32 changes: 17 additions & 15 deletions VSS - DigitViewer2/DigitViewer2/DigitViewer2.vcxproj
@@ -62,102 +62,102 @@
<VCProjectVersion>15.0</VCProjectVersion>
<ProjectGuid>{78460907-F11F-45DF-A8B3-BCF1D8E54EC5}</ProjectGuid>
<RootNamespace>DigitViewer2</RootNamespace>
<WindowsTargetPlatformVersion>10.0.17763.0</WindowsTargetPlatformVersion>
<WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='04-SSE3|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='07-Penryn|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='13-Haswell|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='17-Skylake|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='00-x86|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='04-SSE3|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='07-Penryn|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='13-Haswell|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='17-Skylake|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='00-x86|x64'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<PlatformToolset>v141</PlatformToolset>
<PlatformToolset>v145</PlatformToolset>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
</PropertyGroup>
@@ -564,6 +564,7 @@
<ClCompile Include="..\..\Source\DigitViewer2\DigitReaders\BasicYcdSetReader.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitReaders\InconsistentMetadataException.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitReaders\ParsingTools.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitScanner\DigitScanner.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitViewer\DigitViewerTasks.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitViewer\DigitViewerUI2.cpp" />
<ClCompile Include="..\..\Source\DigitViewer2\DigitWriters\BasicTextWriter.cpp" />
@@ -699,6 +700,7 @@
<ClInclude Include="..\..\Source\DigitViewer2\DigitReaders\BasicYcdSetReader.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitReaders\InconsistentMetadataException.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitReaders\ParsingTools.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitScanner\DigitScanner.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitViewer\DigitViewerTasks.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitViewer\DigitViewerUI2.h" />
<ClInclude Include="..\..\Source\DigitViewer2\DigitWriters\BasicDigitWriter.h" />