docs/abcd-advanced/data-models/index.md
Lines changed: 68 additions & 42 deletions
@@ -318,6 +318,7 @@ Following version non longer supported by ABCD
318
318
319
319
This table defines the character codes of all alphabetic characters. It is used each time CDS/ISIS needs to know whether a given character is alphabetic (e.g. when performing word indexing using indexing technique 4, or validating alphabetic fields).\\
320
320
A given text character whose code is stored in this table will be considered an alphabetic character.
321
+
321
322
## Syntax actab for ANSI/ISO-8859-1
322
323
323
324
The standard table supplied by UNESCO is given below. Note that each line holds 32 decimal ANSI codes.
@@ -395,78 +396,103 @@ See the [full ANSI table](/docs/3.1/abcd-advanced/cisis-utilities/ansi-table) fo
395
396
If you want to include other symbols in Technique 4 or 8 indexing, look up the symbol's ANSI code and insert it at the corresponding position in the sequence.\\
396
397
If you do not want the numbers to be included in the indexing by techniques 4 or 8, eliminate the codes 048 to 057.
397
398
398
-
# Upper case conversion table (uctab)
399
+
## Upper Case Conversion Table (uctab / isisuc.tab)
399
400
400
-
This table is used to convert database text (i.e. as stored in the database) to upper case.
401
+
The **uctab** (upper case table) is used to convert database text (as stored in the database) to upper case. It is one of the most important and frequently underestimated resources in CDS/ISIS and ABCD.
401
402
402
-
One of the characteristics of the information search process in CDS/Isis structures is transparency to the presence of accented characters and upper or lower case in search expressions. CDS/Isis will locate the information regardless of whether the accent is incorrectly placed or whether the keywords were written in upper or lower case.\\
403
-
To achieve this goal, the keys in the inverted file are stored in uppercase and all search expressions are automatically converted to uppercase.\\
404
-
This lowercase to uppercase conversion is performed with the help of a table called **uctab** (uc = upper). When a search expression is read each character is indexed in the **uctab** table and replaced by the equivalent value placed in that table.
403
+
One of the vital characteristics of the information retrieval process is its transparency regarding the presence of accented characters and the use of upper or lower case. The system must be able to locate the information even when the user's search term differs from the stored text in accents or letter case.
405
404
406
-
## Syntax uctab for ANSI/ISO-8859-1
405
+
To achieve this goal, the keys in the Inverted File (index) are stored in uppercase, and all search expressions entered by the user are automatically converted to uppercase before the search is executed.
407
406
408
-
This table consists of 256 characters, and each character represents an Ansi Code.
407
+
### Why is Standardization Necessary?
408
+
Identical concepts can be entered into the database in various ways. For example:
409
+
* População
410
+
* população
411
+
* POPULAÇÃO
409
412
410
-
Example:\\
411
-
The letter **ñ** is at (decimal) position 241 in the ANSI character table.\\
412
-
In file **isisuc.tab** the letter **ñ** is also at position 241.\\
413
-
The uppercase of **ñ** is **Ñ**.\\
414
-
At position 241 of the table we have to place the code 209, which corresponds to the **Ñ** in the ANSI character table.
413
+
For these three variations to constitute a **single entry** in the search index, they must undergo a standardization process, resulting in a single access key (e.g., `POPULACAO`).
415
414
416
-
The standard table supplied by UNESCO (without any conversions) is given below:
417
-
```
415
+
This conversion is guided by the `isisuc.tab` table. During indexing, the system reads the extraction commands in the FST (such as `mhu`, `mpu`, `mdu`) and refers to the table to convert the characters. During a search, the system does the same with the term entered by the user, ensuring the searched term matches the key in the index.
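
The symmetry between indexing and searching can be illustrated with a minimal sketch. It does not reproduce how CDS/ISIS works internally: Python's `unicodedata` module stands in for `isisuc.tab`, and `access_key`, `build_index`, and `search` are hypothetical helper names introduced only for this illustration.

```python
import unicodedata


def access_key(term: str) -> str:
    """Stand-in for an accent-cleaning uctab: strip accents, then upper-case."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


def build_index(records: dict[int, str]) -> dict[str, set[int]]:
    """Index every word of every record under its normalized key."""
    index: dict[str, set[int]] = {}
    for record_id, text in records.items():
        for word in text.split():
            index.setdefault(access_key(word), set()).add(record_id)
    return index


def search(index: dict[str, set[int]], term: str) -> list[int]:
    """Normalize the search term with the same rule used at indexing time."""
    return sorted(index.get(access_key(term), set()))


records = {
    1: "População urbana",
    2: "POPULAÇÃO rural",
    3: "população indígena",
}
index = build_index(records)
print(search(index, "populacao"))
```

Because both sides pass through the same conversion, the three spellings share one key and any of them retrieves all three records.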
416
+
417
+
### Syntax for ANSI/ISO-8859-1
418
+
419
+
The ANSI/ISO-8859-1 table is a fixed matrix map containing **exactly 256 values** (codes from 000 to 255).
420
+
The position of the value in the table represents the original character, and the number written in that position represents the character it should be converted to.
421
+
422
+
**Practical Example:**
423
+
The lowercase letter **a** occupies the (decimal) position 097 in the ASCII table.
424
+
In the `isisuc.tab` file, the entry at position 097 (counting from 000, so the 98th value in the file) holds the value **065**.
425
+
Code 065 corresponds to the uppercase letter **A**. Thus, the system knows that "a" converts to "A".
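
The positional lookup can be sketched in a few lines of Python. This is an illustration under assumptions, not the CDS/ISIS implementation: the table here is built in memory rather than read from `isisuc.tab`, and `identity_table`, `with_ascii_uppercase`, and `apply_uctab` are hypothetical names.

```python
def identity_table() -> list[int]:
    """A 256-entry table in which every code maps to itself (no conversion)."""
    return list(range(256))


def with_ascii_uppercase(table: list[int]) -> list[int]:
    """Return a copy in which positions 097-122 (a-z) hold 065-090 (A-Z)."""
    updated = list(table)
    for code in range(ord("a"), ord("z") + 1):
        updated[code] = code - 32
    return updated


def apply_uctab(table: list[int], text: str, encoding: str = "latin-1") -> str:
    """Replace each byte by the value stored at that byte's position."""
    raw = text.encode(encoding)
    return bytes(table[byte] for byte in raw).decode(encoding)


table = with_ascii_uppercase(identity_table())
print(table[97])
print(apply_uctab(table, "abc ñ"))
```

In this sketch position 241 still maps to itself, so `ñ` passes through unchanged; only the positions that were explicitly remapped are converted.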
426
+
427
+
### The Issue with Special Characters (Ñ, Ç, and Accents)
428
+
Due to the strict limit of 256 positions, **you do not add or remove positions from the file**. You only change the mapping of one code to another.
418
429
430
+
Historically, many standard tables are configured to "clean" accents by mapping accented characters to their unaccented versions:
431
+
* `ñ` (position 241) is converted to `N` (value 078).
432
+
* `ç` (position 231) is converted to `C` (value 067).
433
+
* `á` (position 225) is converted to `A` (value 065).
434
+
435
+
If users in Spanish- or Portuguese-speaking countries want **Ñ** and **Ç** to be indexed as independent letters that keep their spelling in the index, change the mapping value at the respective positions:
436
+
* Go to position **241** (which represents `ñ`) and change the value from `078` to **`209`** (which is the ANSI code for uppercase `Ñ`).
437
+
* Go to position **231** (which represents `ç`) and change the value from `067` to **`199`** (which is the ANSI code for uppercase `Ç`).
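
The two changes above can be sketched as a small script that edits values without ever changing the number of positions. It is a hedged illustration: the helper names are hypothetical, and the single-space separator used by `format_table` is an assumption, so compare the output with your installed `isisuc.tab` before replacing any real file.

```python
def parse_table(text: str) -> list[int]:
    """Read whitespace-separated decimal codes and enforce the 256-value rule."""
    values = [int(token) for token in text.split()]
    if len(values) != 256:
        raise ValueError(f"expected 256 values, found {len(values)}")
    if any(not 0 <= value <= 255 for value in values):
        raise ValueError("every value must be between 000 and 255")
    return values


def remap(values: list[int], changes: dict[int, int]) -> list[int]:
    """Return a copy with selected positions pointing to new target codes."""
    updated = list(values)
    for position, target in changes.items():
        updated[position] = target
    return updated


def format_table(values: list[int], per_row: int = 32) -> str:
    """Write 8 rows of 32 three-digit codes (separator is an assumption)."""
    rows = [values[i:i + per_row] for i in range(0, len(values), per_row)]
    return "\n".join(" ".join(f"{v:03d}" for v in row) for row in rows) + "\n"


# Start from an accent-cleaning table fragment: ñ -> N, ç -> C.
table = list(range(256))
table[241] = 78
table[231] = 67

# Keep Ñ and Ç as independent letters instead.
table = remap(table, {241: 209, 231: 199})
text = format_table(table)
print(len(text.splitlines()), table[241], table[231])
```

Round-tripping the result through `parse_table` is a cheap way to confirm that the file still holds exactly 256 values before regenerating the inverted file.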
438
+
439
+
Below is an example of the file using the UNESCO standard, which preserves the strict structure of 8 rows and 32 columns. Note that line breaks and spaces must be strictly maintained; otherwise, a fatal error will occur during the inverted file generation.
*(Note: The highlighted numbers in this block determine the accentuation behavior in your database. Edit them according to your library's phonetic policies).*
428
452
429
-
Notice that the **isisuc.tab** table has 8 rows and 32 columns of 3 numbers. This format must be preserved, otherwise an error would be generated when updating the inverted list.
430
-
## Syntax uctab for UTF-8
431
-
- Each line contains: decimal value lowercase = decimal value uppercase
432
-
- Optionally followed by a hash mark (`#`) with comment.
433
-
- One assignment per line
434
-
- It is mandatory to fill in ascending order
435
-
- Empty lines and lines starting with # are considered comment and ignored
436
-
Excerpt from an actual uctab with ~300 lines (if no case conversion is required the character can be omitted)
453
+
### Syntax for UTF-8
437
454
438
-
```
455
+
The configuration for Unicode (UTF-8) databases is much more user-friendly. It does not require a strict positional matrix but rather a direct declaration format:
456
+
* Each line contains: `decimal value lowercase = decimal value uppercase`
457
+
* Optionally followed by a hash mark (`#`) and a comment.
458
+
* Entries must be listed in ascending order.
439
459
440
-
# One assignment per line
441
-
# It is mandatory to fill in ascending order
460
+
Example of explicit mapping (excerpt from the `isisuc_utf8.tab` file):
461
+
462
+
```text
463
+
# One assignment per line, in ascending order
442
464
443
465
097=065 # a -> A
444
466
098=066 # b -> B
445
-
467
+
...
446
468
122=090 # z -> Z
447
469
448
-
195 128=065 # À -> A LATIN CAPITAL LETTER A WITH GRAVE
449
-
195 129=065 # Á -> A LATIN CAPITAL LETTER A WITH ACUTE
470
+
195 128=065 # À -> A (accent removed)
471
+
195 129=065 # Á -> A (accent removed)
450
472
473
+
195 164=195 132 # ä -> Ä (Keeps the umlaut)
451
474
```
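
A minimal parser for this declaration format might look like the sketch below. It is not the CISIS reader: `parse_utf8_uctab` and `convert` are hypothetical names, and interpreting "ascending order" as byte-wise ordering of the left-hand side is an assumption.

```python
def parse_utf8_uctab(text: str) -> dict[str, str]:
    """Parse 'decimal bytes = decimal bytes # comment' lines into a mapping."""
    mapping: dict[str, str] = {}
    previous = None
    for line_number, line in enumerate(text.splitlines(), start=1):
        body = line.split("#", 1)[0].strip()  # drop comments
        if not body:
            continue  # empty or comment-only line
        left, separator, right = body.partition("=")
        if not separator:
            raise ValueError(f"line {line_number}: missing '='")
        source = bytes(int(token) for token in left.split())
        target = bytes(int(token) for token in right.split())
        # Assumption: ascending order means byte-wise ordering of the source.
        if previous is not None and source <= previous:
            raise ValueError(f"line {line_number}: not in ascending order")
        previous = source
        mapping[source.decode("utf-8")] = target.decode("utf-8")
    return mapping


def convert(mapping: dict[str, str], text: str) -> str:
    """Characters without an entry are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


sample = """\
# One assignment per line, in ascending order
097=065 # a -> A
098=066 # b -> B
122=090 # z -> Z
195 128=065 # À -> A
195 129=065 # Á -> A
195 164=195 132 # ä -> Ä
"""
mapping = parse_utf8_uctab(sample)
print(convert(mapping, "zäb À"))
```

Each side of the `=` is read as the UTF-8 byte sequence of one character, which is why `195 164` decodes to `ä` and `195 132` to `Ä`.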
452
475
453
-
## Location in ABCD
454
-
By default it is placed in the root of the base folder and referenced in the `par/<dbn>.par` file. If you want to use a specific table for a database, place the table in the data folder of the database and modify the `<dbn>.par` file to indicate the new path.
455
-
456
-
Note: An actual installation contains normally an `uctab` file for ANS/ISO-8859-1 **and** an `uctab` file for UTF-8.
476
+
### Location in ABCD
477
+
By default, the table files are located in the root of the `bases` folder and are applied globally to all databases referenced by the `par/<dbn>.par` files. If you want to use a specific table for only one database (e.g., a database with an indigenous language indexing policy), place the table inside the `data` folder of that database and modify the respective `.par` file to indicate the new path.
www/<bases>/par/<dbn>.par # file that points to the tables
458
485
```
459
486
460
-
www/<bases>/isisuc.tab # default
461
-
www/<bases>/isisuc_utf8.tab # default
462
-
www/<bases>/<dbn>/data/isisuc.tab # database specific
463
-
www/<bases>/<dbn>/data/isisuc_utf8.tab # database specific
464
-
www/<bases>/par/<dbn>.par # reference to the table
465
-
```
487
+
:::warning Important Post-Editing Step
488
+
If you decide to change the mapping behavior of a character (for example, making `Ç` index as `Ç` instead of `C`), you **must** run the **Full Inverted File Generation** utility on your database immediately after saving the `isisuc.tab` file. This ensures that the old index keys are recreated using the new conversion rules. If you skip this step, search results will be inconsistent!
489
+
:::
490
+
491
+
466
492
467
493
## Details
468
494
### Link for decimal UTF-8
469
-
[Unicode to decimal converter](https///onlineunicodetools.com/convert-unicode-to-decimal)
495
+
[Unicode to decimal converter](https://onlineunicodetools.com/convert-unicode-to-decimal)