The field extraction table (FST) is the file that the CDS/Isis structures use to update and maintain the search indexes (inverted lists). It is also used in processes that exchange information or generate the keys used to alphabetize reports. When constructing the field extraction table, the database designer must keep in mind the types of searches that information users should be able to run and try to make those queries retrieve the relevant information whenever possible. CDS/Isis provides a large number of facilities to support successful information retrieval, such as:
- 9 different indexing techniques, so that the same field can be stored in indexes in different ways
- Key extraction is formulated through the format language, which allows analysis and transformations on the data before sending it to the indexes
- Transparency in the use of uppercase, lowercase or accented characters in search terms
- Identification of search keys, which makes it easier to determine the origin (mfn, field, occurrence and relative position within the field) of each of the terms contained in the dictionary
The field extraction table is a plain-text (TXT) file consisting of three columns, in which the following elements are identified:
ID | Key identification. Identifies the field tag that will be used to identify the term. |
IT | Indexing technique. Specifies the indexing technique to be applied to the lines obtained after applying the extraction format to each record in the database. |
Extraction format | Indicates the extraction format to be applied to the record to obtain the key. |
ID Key identification
The index file keys (inverted lists) of the CDS/Isis structures consist of five elements:
- Search term (key)
- ID
- Mfn
- Occurrence number
- Sequence number
The value supplied in column 1 of the FST generates the ID component of the inverted file, which assigns an identifier to each of the keys generated by the extraction format. This identification is very important to ABCD when using authority lists and will generally match the field tag.
Indexing techniques
To date there are 9 indexing techniques:
Technique | Description |
0 | Passes each line generated by the extraction format to the inverted list |
1 | Passes each subfield generated by the extraction format to the inverted list |
2 | Passes the elements enclosed between <…> to the inverted list |
3 | Passes the elements enclosed between /…/ to the inverted list |
4 | Passes each word generated by the extraction format to the inverted list |
5 | Same as technique 1, adding a prefix to each generated key |
6 | Same as technique 2, adding a prefix to each generated key |
7 | Same as technique 3, adding a prefix to each generated key |
8 | Same as technique 4, adding a prefix to each generated key |
Techniques 2 and 3 have similar effects on key generation. The difference lies in the delimiter used to identify the terms to be extracted. If the <…> delimiter is used to mark the key terms, it can later be removed or replaced by punctuation marks in printed reports or screen output by applying the MHx or MDx mode commands. The /…/ delimiter cannot be replaced, so it will always be present in printed or screen output.
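As a rough illustration of how techniques 2 and 3 differ only in the delimiter they look for, the following Python sketch pulls the delimited terms out of a line of text. The function name `extract_delimited` and the sample sentences are hypothetical, and the sketch ignores the uppercase conversion and key-length limits that CDS/Isis applies.

```python
import re

def extract_delimited(line, technique):
    """Return the terms enclosed in <...> (technique 2) or /.../ (technique 3)."""
    if technique == 2:
        pattern = r"<([^<>]*)>"
    elif technique == 3:
        pattern = r"/([^/]*)/"
    else:
        raise ValueError("only techniques 2 and 3 are sketched here")
    # Keep non-empty terms, trimmed of surrounding blanks.
    return [term.strip() for term in re.findall(pattern, line) if term.strip()]

print(extract_delimited("Study of <sea level> and <tide gauges>", 2))
print(extract_delimited("Study of /sea level/ and /tide gauges/", 3))
```

Both calls yield the same two terms; only the marker used in the data changes.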
When an FST is applied to a record to obtain a key, the operations are performed in the following order:
- The extraction format is used to capture the data from the record.
- The corresponding indexing technique is applied to the information obtained.
- Each individual key that results from this process is assigned the specified ID and is stored in the inverted list together with the MFN of the record, the occurrence number from which the key was extracted and, if the indexing is by word (technique 4 or 8), the relative position of the word within the line generated by the extraction format.
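The three steps above can be sketched in Python as follows. This is only a simplified model, not the CDS/Isis implementation: the function `build_postings` is hypothetical, it assumes the extraction format has already produced one line per field occurrence, it covers only techniques 0 and 4, and it finds words with a regular expression instead of the ISISAC.TAB table.

```python
import re

def build_postings(mfn, key_id, technique, lines):
    """Return (key, mfn, id, occurrence, position) tuples for the given lines."""
    postings = []
    for occ, line in enumerate(lines, start=1):
        if technique == 0:
            # The whole line is one key; its relative position is always 1.
            postings.append((line.strip().upper(), mfn, key_id, occ, 1))
        elif technique == 4:
            # One key per word, numbered by its position within the line.
            for pos, word in enumerate(re.findall(r"\w+", line), start=1):
                postings.append((word.upper(), mfn, key_id, occ, pos))
        else:
            raise ValueError("only techniques 0 and 4 are sketched here")
    return postings

for posting in build_postings(1, 245, 4, ["Sea levels and tide gauges"]):
    print(posting)
```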
Example: Suppose the following record (in MARC format):
<35> $9(DLC) 90049743l</35>
<10> ^a 90049743</10>
<20> ^a0387974490 (alk. paper)</20>
<40> ^aDLC^cDLC^dDLC</40>
<41>0 ^aeng^bfregerhebjapsparus</41>
<50>00^aGC89^b.E54 1991</50>
<82>00^a551.4/58$220</82>
<100>1 ^aEmery, K. O.^q(Kenneth Orris),^d1914-</100>
<245>10^aSea levels and tide gauges /^cK.O Emery, David G. Aubrey.</245>
<260> ^aNew York :^bSpringer-Verlag,^cc1991.</260>
<300> ^axiv, 237 p. :^bill., maps :^c29 cm.</300>
<500> ^aIn English, with summaries in French, German, Hebrew, Japanese, Spanish, and Russian.</500>
<504> ^aIncludes bibliographical references (p. 207-226) and indexes.</504>
<650> 0^aSea level.</650>
<650> 0^aSubsidences (Earth movements)</650>
<650> 0^aTide-gages.</650>
<650> 0^aDatabase management^xCongresses.</650>
<650> 0^aArtificial intelligence^xCongresses.</650>
<700>1 ^aAubrey, David G.</700>
<5>20000113 35151</5>
<935>LA</935>
We want to obtain the following keys:
Title (245) | retrievable by each of its words |
Authors (100 and 700) | retrievable by the full name (last name + first name) and independently by last name or first name |
Subjects (650) | retrievable by the complete phrase or by any of the words that form it |
Languages (41) | all languages (note: in subfield b of field 41 the languages are stored in a single string in which every 3 characters represent the code of a different language) |
Publisher (260) | as it appears in the document |
Edition date (260) | as it appears in the document |
LC Classification (50) | so that it allows a general search by the first level of the classification and also by the complete classification |
Date of entry into the database (5) | retrievable by year, by year and month, and by full date |
The FST that we need to define for these purposes is the following:
Title (245) | 245 4 v245^a |
Authors (100 and 700) | 100 0 v100^a/ 700 0 (v700^a/) 100 4 v100^a/ 700 4 (v700^a/) |
Subjects (650) | 650 1 (v650*2/) 650 4 MHU(v650*2/) |
Languages (41) | 41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3 |
Publisher (260) | 260 0 v260^b |
Edition date (260) | 260 0 v260^c |
LC Classification (50) | 50 0 v50^a/v50^a,v50^b |
Date of entry into the database (5) | 5 0 v5.4/v5.6/v5 |
Explanation:
When we prepare an FST it is necessary to be clear about how the terms are stored in the inverted list (see Structure of inverted files).

The inverted list is a set of 6 files, 5 of which are indexes to the term dictionary, which (with the .ifp extension) holds all the keys extracted from the database by applying the field extraction table (FST) to each of the records. The term dictionary is an alphabetical list of all the access points that we have extracted from the database (with the help of the .fst), and each key has an associated list of pointers that identify the place from which the term was extracted. This list of pointers is called “postings” and each “posting” has 4 components:
- **Mfn** of the record from which the key was extracted
- **Id** of the field, as specified in the first column of the FST
- **Occurrence number** of the field from which the key was extracted
- **Relative position of the word** within the field from which the key was extracted (when the field was indexed with technique 4)
For example, suppose the term Education appears in the subject field (v76) of records 1 and 20 and also in the title field (v16) of record 35, whose title is Distance education methods. Applying the following FST to the records:
76 0 (v76/)
16 4 v16
the dictionary of terms will refer to the term Education in the following way:
EDUCATION
1 76 1 1
20 76 1 1
35 16 1 2
Three “postings” have been generated for the term Education. The first, 1 76 1 1, indicates that the key comes from MFN 1, first occurrence of field 76, and is located at the beginning of the field. The second, 20 76 1 1, specifies that it is also found in MFN 20, field 76, first occurrence and first word. Finally, 35 16 1 2 indicates that record 35 contains the term Education, extracted from field 16, first occurrence, where it is the second word in the field.
Indexing technique 0 always stores the value 1 as the relative position of the key within the field. The rest of the indexing techniques record the position of the key within the field.
The relative position of a term within the field that contains it is what makes proximity searches possible (operators . and $ of the CDS/Isis search language). The distance between two terms is obtained as the difference between their relative positions.
The occurrence number is used by the (F) operator, for which the search expression is true only when all the combined terms come from the same occurrence of a repeatable field. It is also used to generate the authority lists that assist record entry (see: Terminology control: Lists of authorities).
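To make the role of the relative position concrete, here is a small Python sketch of a distance test between two terms. It does not reproduce the exact semantics of the . and $ operators; the function `proximity_hits`, the `max_distance` parameter and the sample postings (MFN 1, field 245) are assumptions made for illustration.

```python
def proximity_hits(postings_a, postings_b, max_distance):
    """Return the MFNs where both terms occur in the same field occurrence
    at most max_distance positions apart. A posting is (mfn, id, occ, pos)."""
    hits = set()
    for mfn_a, id_a, occ_a, pos_a in postings_a:
        for mfn_b, id_b, occ_b, pos_b in postings_b:
            same_place = (mfn_a, id_a, occ_a) == (mfn_b, id_b, occ_b)
            # The distance is the difference between the relative positions.
            if same_place and abs(pos_a - pos_b) <= max_distance:
                hits.add(mfn_a)
    return sorted(hits)

# Hypothetical postings for the title "Sea levels and tide gauges".
sea = [(1, 245, 1, 1)]
levels = [(1, 245, 1, 2)]
gauges = [(1, 245, 1, 5)]

print(proximity_hits(sea, levels, 1))  # adjacent words
print(proximity_hits(sea, gauges, 1))  # too far apart for distance 1
```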
Let’s now analyze the FST that we mentioned at the beginning of this page:
245 4 v245^a
Extracts subfield a from field 245 and applies indexing technique 4 to the result. Each word obtained is sent to the inverted list with the identifier 245
100 0 v100^a/
700 0 (v700^a/)
Extracts subfield a from field 100 and subfield a from field 700. The result is analyzed to identify lines (technique 0), and each line obtained is sent to the inverted list with the identifier 100 or 700, as appropriate. Note that the extraction format v100^a, v700^a would not produce the required results, for the following reasons:
- Indexing technique 0 looks for lines in the text generated by the extraction format. If we do not generate a line break (/) per occurrence, the extraction format produces a single string with all the authors run together, and only the first 60 characters of that string are stored in the inverted list.
- Field 700 is repeatable. Therefore, if it is not formatted as a repeatable group, the format extracts all occurrences of the field as a single phrase, and each author is not sent to the inverted list as a separate key.
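The effect of the missing line breaks can be simulated with a short Python sketch. The function `technique0_keys` is hypothetical; it treats the output of the extraction format as plain text, uses a newline where the format language would use /, and applies the 60-character limit quoted above.

```python
MAX_KEY_LENGTH = 60  # key length quoted in the text above

def technique0_keys(format_output):
    """Technique 0: every non-empty line becomes one key, cut to the maximum length."""
    return [line.strip().upper()[:MAX_KEY_LENGTH]
            for line in format_output.split("\n") if line.strip()]

authors = ["Emery, K. O.", "Aubrey, David G."]

# With a line break per author, each author becomes a separate key.
print(technique0_keys("\n".join(authors)))

# Without line breaks, all authors collapse into a single key.
print(technique0_keys("".join(authors)))
```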
100 4 v100^a/
700 4 (v700^a/)
Extracts subfield a from fields 100 and 700 of the record. Each occurrence is placed on a new line. From the generated lines, each word is extracted (technique 4) and sent to the inverted list with the identifier 100 or 700, as appropriate. Each word carries the occurrence number of the author within the field as well as its relative position within that occurrence.
If we are indexing by words, why is it necessary to include line breaks to separate fields and occurrences? Because without a separator between v100^a and v700^a, the last word of v100^a would be joined to the first word of the first occurrence of v700^a, producing an erroneous entry in the index. Likewise, if occurrences of v700 are not separated by a line break, the first word of each occurrence would be joined to the last word of the previous one.
650 1 (v650*2/)
In this example we are extracting each subfield of field 650 and generating, for each one, an entry in the inverted list with the identification 650. Why v650*2? The record presented in the example is cataloged according to the Marc format and the two indicators are being included before the subfield a:
00^aDatabase management^xCongresses.
00^aArtificial intelligence^xCongresses.
If we express the FST entry as 650 1 (v650/), an attempt will be made to identify all the subfields of each occurrence of field 650. The portion corresponding to the indicators will therefore be taken as a subfield, and we will get a series of keys generated only from the indicators of field 650. By writing the extraction format as (v650*2/) we specify an offset of 2 positions from the start of the field, so the indicators are not taken into account.
When indexing technique 1 is applied, it is necessary to verify that the output of the extraction format still contains subfields. For example, if we set the key extraction format to 650 1 mhu,(v650*2/) we will generate wrong keys since, by definition, the MHU mode replaces the subfield delimiters with punctuation marks, so the subfields disappear when the format is applied to the record. In this case a single key is built from one phrase containing all the subfields, and the index contains only the first 60 characters of that phrase (for example: ARTIFICIAL INTELLIGENCE. CONGR), with the consequent loss of access points to the record.
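The role of the *2 offset with technique 1 can be illustrated with the following Python sketch. The function `subfield_keys` is hypothetical and only approximates the behavior described: any text that precedes the first ^ delimiter is treated as a value of its own, which is how the indicators end up as a spurious key.

```python
def subfield_keys(occurrence, offset=0):
    """Split one field occurrence into subfield values, skipping `offset` characters."""
    data = occurrence[offset:]
    head, *subfields = data.split("^")
    # Each subfield starts with its one-letter code, which is not part of the value.
    values = [head] + [sf[1:] for sf in subfields]
    return [v.strip().upper() for v in values if v.strip()]

field_650 = " 0^aDatabase management^xCongresses."

print(subfield_keys(field_650))      # the indicators leak in as a key
print(subfield_keys(field_650, 2))   # the offset skips the two indicators
```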
650 4 MHU(v650*2/)
Same reasoning as in the previous case, but the words are extracted from the lines obtained
41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3
Since subfield b of field 41 has an input pattern that specifies that each language occupies 3 positions, using the offset and length options we can send each language to the inverted list
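The offset and length arithmetic behind v41^b.3, v41^b*3.3, v41^b*6.3 and so on amounts to slicing the string every 3 characters. A minimal Python sketch, with the hypothetical helper `language_keys`:

```python
def language_keys(subfield_a, subfield_b, width=3):
    """Return subfield a plus every `width`-character slice of subfield b."""
    keys = [subfield_a] if subfield_a else []
    keys += [subfield_b[i:i + width] for i in range(0, len(subfield_b), width)]
    return [key.upper() for key in keys]

# Values taken from field 41 of the sample record.
print(language_keys("eng", "fregerhebjapsparus"))
# ['ENG', 'FRE', 'GER', 'HEB', 'JAP', 'SPA', 'RUS']
```

Unlike the FST entry, which lists a fixed number of offsets, the loop handles any number of codes.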
50 0 v50^a/v50^a,v50^b
In this example we are generating two keys for each LC classification. The first v50^a will allow us to search by thematic groups. The second key generated will allow us to locate a particular classification number. Note again the presence of the line break character (/), which forces two independent keys to be generated
5 0 v5.4/v5.6/v5
With the document entry date we generate three keys: the first, v5.4, lets us quickly locate all the materials entered in a given year; the second, v5.6, retrieves the entries for one month; and the third, v5, the entries for one day.
Note that generating these three keys (by year, year-month and year-month-day) makes retrieval more efficient than generating a single year-month-day key and applying the right truncation operator to search by year or by year and month.
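The three date keys are simply prefixes of different lengths of the same value. The Python sketch below assumes, for illustration only, that field 5 holds just an 8-digit date in YYYYMMDD form (the sample record's field 5 contains additional characters after the date); `date_keys` is a hypothetical name.

```python
def date_keys(entry_date):
    """Return the year, year-month and full-date keys, like v5.4/v5.6/v5."""
    return [entry_date[:4], entry_date[:6], entry_date]

print(date_keys("20000113"))
# ['2000', '200001', '20000113']
```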
Using prefixes in the key generation process
As the dictionary of terms is a single file with all the keys arranged alphabetically, the authors are presented mixed with the titles, with the keywords and, in general, with all the fields that have been indexed in the FST.
If we want to have the keys separated according to the field from which they were generated, we have two solutions:
- Use prefixes when generating indexing keys in order to create subdictionaries within the dictionary of terms
- Create separate dictionaries according to the content of each of the fields.
According to the first option, if the fst
Title (245) | 245 4 v245^a |
Authors (100 and 700) | 100 0 v100^a/,(v700^a/) 100 4 v100^a/(v700^a/) |
Subjects (650) | 650 1 (v650*2/) 650 4 (v650*2/) |
Languages (41) | 41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3 |
Publisher (260) | 260 0 v260^b |
Edition date(260) | 260 0 v260^c |
LC Classification (50) | 50 0 v50^a/v50^a,v50^b |
Date of entry into the database (5) | 5 0 v5.4/v5.6/v5 |
we change it to
**Title (245)** | 245 8 '/T:/',v245^a |
**Authors (100 and 700)** | 100 0 "A:"v100^a/,(|A:|v700^a/) 100 8 '/A:/',v100^a/(|A:|v700^a/) |
**Subjects (650)** | 650 5 '/M:/',(v650\*2/) 650 8 '/M:/',(v650\*2/) |
**Languages (41)** | 41 8 '/I:/',v41^a," "v41^b.3," "v41^b\*3.3," "v41^b\*6.3," "v41^b\*9.3," "v41^b\*12.3," "v41^b\*15.3," "v41^b\*18.3 |
**Publisher (260)** | 260 0 "E:"v260^b |
**Edition date (260)** | 260 0 "F:"v260^c |
**LC Classification (50)** | 50 0 "C:"v50^a/"C:"v50^a,v50^b |
**Date of entry into the database (5)** | 5 0 "F:"v5.4/"F:"v5.6/"F:"v5 |
As can be seen we have made the following changes:
Technique | Changed to |
1 | 5 |
4 | 8 |
For the keys indexed with technique 0, it is enough to add a conditional prefix literal containing the prefix that we want to use to differentiate the data. For the prefixed indexing techniques (5, 6, 7 and 8) the prefix must be indicated before the extraction format with the following syntax:
- The prefix is enclosed in apostrophes (an unconditional literal).
- The literal corresponding to the prefix is enclosed between two special characters that do not appear in the prefix itself.
Example:
'/A:/'
'#A:#'
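The delimited-prefix convention can be modeled with a short Python sketch. It is an assumption-laden illustration of technique 8, not the CDS/Isis code: `split_prefix` and `prefixed_word_keys` are hypothetical names, the input is the text the extraction format would produce, and words are found with a regular expression rather than ISISAC.TAB.

```python
import re

def split_prefix(format_output):
    """Separate a delimited prefix such as /A:/ or #A:# from the text after it."""
    delimiter = format_output[0]
    end = format_output.index(delimiter, 1)  # position of the closing delimiter
    return format_output[1:end], format_output[end + 1:]

def prefixed_word_keys(format_output):
    """Technique 8 sketch: one key per word, each carrying the prefix."""
    prefix, text = split_prefix(format_output)
    return [prefix + word.upper() for word in re.findall(r"\w+", text)]

print(prefixed_word_keys("/T:/Sea levels and tide gauges /"))
print(prefixed_word_keys("#A:#Aubrey, David G."))
```

Because the delimiter is read from the first character, any special character works as long as it does not occur inside the prefix.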
In addition to letting us view the content of a field in an orderly manner, without mixing in the terms obtained from other fields, searching through a prefix is faster than a qualified search. That is:
Searching for M:Education is more efficient than Education/(650), because the qualified search requires reviewing each of the postings for the term.
However, depending on the experience of our end users and the capacity of the computer where the database is installed, it may be advisable to index the data in several different ways, with and without prefixes, in order to give users greater flexibility in searches. More search keys mean more disk space but not necessarily slower retrieval, given the structure of the inverted list (a B* tree), which is constantly reorganized so that the height of the tree is the same in all its branches (the tree height reflects the number of accesses required to locate a term in the inverted list).
The CDS/Isis family products allow you to define more than one dictionary of terms for a database. That is, we can create one dictionary for authors, another for titles, and so on. However, for terms from different fields to be combined in the same search expression through Boolean operators, it will always be necessary to define a general dictionary that groups them all, since terms from one dictionary cannot be crossed with terms from another. Separate dictionaries replace the use of prefixes as a way to present the user with the terms extracted from a particular field, and the terms of a single dictionary can be combined logically.
Transparency in the use of uppercase, lowercase and special characters
One of the benefits of the CDS/Isis search mechanism is the transparency it provides in the use of uppercase, lowercase or special characters in search terms. To achieve this, all keys are stored in the inverted list in uppercase and, if we have configured it so, accented characters are transformed into their unaccented uppercase equivalents. The search expressions provided by the user are also converted to uppercase, which minimizes errors caused by the way users type their terms.
The conversion of keys and search expressions is done using the ISISUC.TAB file, which must match the character set adopted for the database (see the lowercase to uppercase conversion table).
When we index the fields with technique 4 or 8 (by words), CDS/Isis uses the ISISAC.TAB table to establish the composition of the “word” concept; that is, the ISISAC.TAB table tells CDS/Isis which characters to consider as alphabetical to form the words. Any character not inserted in ISISAC.TAB will be considered a separator and will terminate the word.
Suppose that in ISISUC.TAB we make the letter ñ equivalent to its uppercase form Ñ. If ISISAC.TAB does not include the code corresponding to Ñ (209 in ANSI), then:
The words | will appear in the index as |
niño | NI O |
cañería | CA ERIA |
cañaveral | CA AVERAL |
acuñación | ACU ACION |
That is, each word is split in two, generating two entries in the dictionary: because Ñ is not included in ISISAC.TAB, it is treated as a separator, just like a punctuation mark.
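The splitting behavior can be reproduced with a small Python sketch in which a set of character codes plays the role of ISISAC.TAB. The function `split_words` is hypothetical, the input is assumed to be already converted to uppercase, and the code set below contains only A to Z for brevity.

```python
def split_words(text, alphabetic_codes):
    """Split text into words; any character whose code is not listed is a separator."""
    words, current = [], []
    for ch in text:
        if ord(ch) in alphabetic_codes:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

letters = set(range(65, 91))  # codes 065 to 090, the letters A to Z

# Ñ is written as \u00d1 (code 209) to avoid encoding ambiguity.
print(split_words("NI\u00d1O", letters))          # Ñ missing: two words
print(split_words("NI\u00d1O", letters | {209}))  # Ñ listed: one word
```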
See:
ANSI character set
Alphabetic character table (ISISAC.TAB)
048 049 050 051 052 053 054 055 056 057 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 097 098 099 100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121 122 192 193 194 195 196 197 199 200 201 202 203 204 205 206 207 209 210 211 212 213 214 216 217 218 219 220 221 224 225 226 227 228 229 231 232 233 234 235 236 237 238 239 241 242 243 244 245 246 248 249 250 251 252 253 255
Each line must contain 32 ANSI character codes. In this table the digits have been included as alphabetic characters so that numeric references can be indexed with technique 4; they correspond to the values 048 049 050 051 052 053 054 055 056 057 (the digits 0 1 2 3 4 5 6 7 8 9). Ñ (209) has also been included as an alphabetic character.
Lowercase to uppercase conversion table (ISISUC.TAB)
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
096 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 123 124 125 126 127
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
065 065 065 065 065 065 065 199 069 069 069 069 073 073 073 073 208 209 079 079 079 079 079 215 216 085 085 085 085 089 222 223
065 065 065 065 065 065 065 199 069 069 069 069 073 073 073 073 079 209 079 079 079 079 079 247 248 085 085 085 085 089 254 089
All 256 ANSI characters are represented in this table, which must be laid out with 32 ANSI character codes per line, for a total of 8 lines.
Each 3-digit position corresponds to the original value of the ANSI character and must be set to the value you want to assign to the character in the conversion to uppercase when generating the inverted list or using the mpu, mhu, mdu format language commands.
For example, in the case of ñ and Ñ, the table is interpreted as follows:
ñ has the ANSI code value 241. Therefore, if you want to obtain Ñ in the uppercase conversion processes, you must place the value 209, which is the ANSI code of the character Ñ, in position 241 of the table.
Likewise, if you want the uppercase Ñ (ANSI code 209) to keep its representation, you must place the value 209 in position 209 of the table.
Conversely, if you want both ñ and Ñ to be converted to the letter N, you must change positions 209 and 241 of the table and set them to the value 078.
In the isisuc.tab table distributed with ABCD, the characters ñ and Ñ are both converted to Ñ. If you want them converted to N, find each 209 in the isisuc.tab table and replace it with 078.
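The position-to-value lookup described above can be sketched in Python as a 256-entry list. This is a simplified model: `build_table` and `convert` are hypothetical names, and the table maps only a to z and the two ñ/Ñ positions, leaving every other code unchanged, whereas the real ISISUC.TAB also folds the accented vowels.

```python
def build_table(n_tilde_target=209):
    """Return a 256-entry table: position = original code, value = converted code."""
    table = list(range(256))          # start with every character mapped to itself
    for code in range(97, 123):       # a to z map to A to Z
        table[code] = code - 32
    table[209] = n_tilde_target       # position of Ñ
    table[241] = n_tilde_target       # position of ñ
    return table

def convert(text, table):
    """Convert text through the table; characters above code 255 pass through."""
    return "".join(chr(table[ord(ch)]) if ord(ch) < 256 else ch for ch in text)

# ñ is written as \u00f1 (code 241) to avoid encoding ambiguity.
print(convert("ni\u00f1o", build_table()))     # keeps Ñ (value 209)
print(convert("ni\u00f1o", build_table(78)))   # folds ñ and Ñ to N (value 078)
```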