- This kind of benchmark is not perfect and percentages can vary over time, but it gives a good idea of overall performance
- Languages evaluated in this benchmark:
  - Asia: jpn, cmn, kor, hin
  - Europe: fra, spa, por, ita, nld, eng, deu, fin, rus
  - Middle East: tur, heb, ara
- This page and graphs are auto-generated from the code
Here is the list of libraries in this benchmark
| Library | Script | Languages | Properly Identified | Improperly Identified | Not Identified | Avg Execution Time | Disk Size |
|---|---|---|---|---|---|---|---|
| TinyLD Heavy | `yarn bench:tinyld-heavy` | 64 | 99.249% | 0.7478% | 0.0032% | 0.096ms | 2.0MB |
| TinyLD | `yarn bench:tinyld` | 64 | 98.5231% | 1.3712% | 0.1057% | 0.1191ms | 580KB |
| TinyLD Light | `yarn bench:tinyld-light` | 24 | 97.8778% | 1.9842% | 0.138% | 0.0947ms | 68KB |
| langdetect | `yarn bench:langdetect` | 53 | 95.675% | 4.325% | 0% | 0.3647ms | 1.8MB |
| node-cld | `yarn bench:cld` | 160 | 92.3654% | 1.6213% | 6.0133% | 0.0711ms | > 10MB |
| franc | `yarn bench:franc` | 187 | 74.2577% | 25.7423% | 0% | 0.2242ms | 267KB |
| franc-min | `yarn bench:franc-min` | 82 | 70.3891% | 23.1888% | 6.422% | 0.084ms | 119KB |
| franc-all | `yarn bench:franc-all` | 403 | 66.7081% | 33.2919% | 0% | 0.4763ms | 509KB |
| languagedetect | `yarn bench:languagedetect` | 52 | 65.2835% | 11.2808% | 23.4357% | 0.1896ms | 240KB |
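As a rough sketch of how the three accuracy columns above can be computed, each detection attempt falls into one of three buckets. The `samples` shape and `detectLang` callback here are hypothetical placeholders, not the benchmark's actual code:

```javascript
// Classify each detection attempt into one of three buckets:
// properly identified, improperly identified, or not identified.
function scoreBenchmark(samples, detectLang) {
  const stats = { proper: 0, improper: 0, unidentified: 0 };
  for (const { text, expected } of samples) {
    const detected = detectLang(text); // assumed to return an ISO code or null
    if (detected === null) stats.unidentified++;
    else if (detected === expected) stats.proper++;
    else stats.improper++;
  }
  // Convert raw counts into the percentages shown in the table.
  const total = samples.length;
  return {
    proper: (100 * stats.proper) / total,
    improper: (100 * stats.improper) / total,
    unidentified: (100 * stats.unidentified) / total,
  };
}
```

The three percentages always sum to 100, which is why a library can trade "not identified" (refusing to answer) against "improperly identified" (guessing wrong).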
We see two groups of libraries:

- `tinyld`, `langdetect` and `cld`: over 90% accuracy
- `franc` and `languagedetect`: under 75% accuracy
We see big differences between languages:
- Japanese or Korean are almost at 100% for every library (they have a lot of unique characters)
- Spanish and Portuguese are really close to each other, which causes more false positives and a higher error rate
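The "unique characters" point can be illustrated with a trivial Unicode-range check: a single kana or Hangul character is enough to narrow the language down, while Latin-script text stays ambiguous and needs statistical analysis. This is a simplified sketch, not how any of the benchmarked libraries actually work:

```javascript
// Guess the script from Unicode ranges; Japanese kana and Korean Hangul
// are nearly unambiguous, while Latin text could be dozens of languages.
function guessScript(text) {
  if (/[\u3040-\u30ff]/.test(text)) return 'japanese'; // Hiragana + Katakana
  if (/[\uac00-\ud7af]/.test(text)) return 'korean';   // Hangul syllables
  if (/[a-z]/i.test(text)) return 'latin';             // ambiguous: fra/spa/por/...
  return 'unknown';
}
```

Spanish vs. Portuguese both land in the `latin` branch, which is exactly why they are harder to tell apart than Japanese vs. anything else.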
Most libraries use statistical analysis, so the longer the input text, the better the detection. So we often see quotes like this in those libraries' documentation:

> Make sure to pass it big documents to get reliable results.
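The intuition behind that advice: these libraries compare n-gram frequencies of the input against precomputed per-language profiles, and a longer input yields more n-grams, so the frequency estimates become more stable. A minimal trigram-counting sketch (illustrative only, not any library's actual implementation):

```javascript
// Count trigram frequencies in a text; statistical detectors compare
// counts like these against precomputed per-language profiles.
function trigrams(text) {
  const counts = new Map();
  // Normalize: lowercase and collapse non-letters into single spaces.
  const clean = text.toLowerCase().replace(/[^a-z]+/g, ' ');
  for (let i = 0; i + 3 <= clean.length; i++) {
    const gram = clean.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}
```

A 10-character input produces at most 8 trigrams, while a 512-character input produces ~510, so distinctive patterns like `the` (English) or `sch` (German) accumulate far more evidence in longer texts.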
Let's see if this statement is true, and how those libraries behave for different input sizes (from short to long).
So the previous quote is right: over 512 characters, all the libraries become accurate enough.
But for a ~95% accuracy threshold:

- `tinyld` (green) reaches it around 24 characters
- `langdetect` (cyan) and `cld` (orange) reach it around 48 characters
Here we can notice a few things about performance:

- `langdetect` (cyan) and `franc` (pink) seem to slow down at a similar rate
- `tinyld` (green) slows down too, but at a really flat rate
- `cld` (orange) is definitely the fastest and doesn't show any apparent slowdown
But we've seen previously that some of those libraries need more than 256 characters to be accurate. It means they start to slow down at the same time they start to give decent results.
- For NodeJS: `TinyLD`, `langdetect` or `node-cld` (fast and accurate)
- For Browser: `TinyLD Light` or `franc-min` (small, with decent accuracy; franc is less accurate but supports more languages)
- Short text (chatbot, keywords, database, ...): `TinyLD` or `langdetect`
- Long text (documents, webpages): `node-cld` or `TinyLD`
- `franc-all` is the worst in terms of accuracy, which is not a surprise because it tries to detect 400+ languages with only 3-grams. It's a technical demo to show off big numbers, but useless for real usage: even a language like English barely reaches a ~45% detection rate.
- `languagedetect` is light, but just not accurate enough
Thanks for reading this article. These metrics are really helpful for the development of tinyld: they are used to measure the impact of every modification and feature.
If you want to contribute, or would like to see another library added to this benchmark, open an issue.