Records are stored sequentially in the master file (.mst), and the physical position of each one is saved in the corresponding entry of the reference file (.xrf). That is, the address and status of record 1 of the master file are stored in position 1 of the reference file, the address and status of record 2 are stored in position 2, and so on. New records are appended to the end of the master file, and a corresponding entry is created in the reference file.
Because the master file holds variable-length records, updating an existing record is a special case: an edited record may not keep the position it was originally assigned in the database. What happens depends on the following conditions:
- If the modified version of the record is longer than the previous version, the record is rewritten at the end of the master file and the previous version is marked as inactive. The reference file is updated to point to the new position of the record.
- If the modified version is no longer than the previous one, the inverted list has not yet been updated, and this is the second modification made to the record, the modified record may overwrite its previous version. Intermediate versions generated between the first and the latest version do not need to be kept, because the two extreme versions are enough to bring the indexes up to date.
Superseded versions of modified records remain in the master file until a master file reorganization is run.
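The update rules above can be illustrated with a small in-memory model. This is only a sketch, not the on-disk format: the class `ToyMaster`, its attribute names and the `pending` set are hypothetical, and the handling of a first modification that does not grow the record (appending, so that the original version survives for the index update) is an assumption, since the text does not spell that case out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """In-memory stand-in for the .mst/.xrf pair (hypothetical, not the real layout)."""
    mst: list = field(default_factory=list)    # stored versions: [mfn, data, active]
    xrf: dict = field(default_factory=dict)    # mfn -> index of the current version in mst
    pending: set = field(default_factory=set)  # MFNs edited since the last index update

    def add(self, data: str) -> int:
        """Append a new record and create its reference entry."""
        mfn = len(self.xrf) + 1
        self.mst.append([mfn, data, True])
        self.xrf[mfn] = len(self.mst) - 1
        return mfn

    def update(self, mfn: int, data: str) -> None:
        """Apply the two update rules described in the text."""
        pos = self.xrf[mfn]
        old = self.mst[pos][1]
        if mfn in self.pending and len(data) <= len(old):
            # Already edited since the last index update and no growth:
            # overwrite the intermediate version in place.
            self.mst[pos][1] = data
        else:
            # Longer record (or, by assumption, a first edit): mark the old
            # version inactive, append the new one and repoint the xrf entry.
            self.mst[pos][2] = False
            self.mst.append([mfn, data, True])
            self.xrf[mfn] = len(self.mst) - 1
            self.pending.add(mfn)


db = ToyMaster()
m = db.add("abc")
db.update(m, "abcdef")   # grows: appended at the end
db.update(m, "xy")       # second edit, shorter: overwrites in place
```

After these calls the model holds two stored versions: the original one, marked inactive, and the current one, which the reference entry points to.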
All products in the CDS/ISIS family provide functions for record input and output operations, which automatically manage the physical storage of master and reference files and their corresponding update mechanisms.
Master File Integrity #
Given the close relationship between the master file and the reference file, the integrity of data in CDS/ISIS structures depends on the integrity of both files: if the .xrf file becomes corrupted, the master file will also appear corrupted, even when the data itself is physically intact. There are programs that rebuild the .xrf file from the master file in order to regain access to the database. One of them is MKXRF, which belongs to the CISIS libraries distributed by Bireme.
The best way to protect data is to ensure joint backup of .mst and .xrf files, have defined audit procedures and know how to apply the database recovery tools that already exist to maintain these structures.
The Master File Control Record #
Every master file has a record with MFN 0 called the Master File Control Record. It is always stored at the beginning of the file, has a fixed length and does not generate any entry in the reference file. It stores the following information:
- Master file number (always zero)
- Next MFN to be assigned
- Address of next available record: Points to the start of the next available 512-byte block
- Address of the next record in the master file (points to the offset within the next available block)
- Master file type (always zero)
- Number of applications that have requested a data entry lock
- Database read lock flag
Structure of master file records #
In the master file records, the information is organized into fields of variable length and some are optional, that is, they are not always present in all records. The master file record structure has three sections:
- Leader
- Directory
- Data area
The data is stored in the data area one after the other, without any separator between them. The directory section contains fixed-length entries which store pointers to each field contained in the data area. The Leader is also fixed length and contains information about the general characteristics of the record itself, such as: master file number (MFN), record length, the number of fields stored, etc. In schematic form the structure of a record is as follows
The Leader #
The leader data structure occupies 18 bytes with the following information:
- MFN of the record
- Record length
- Pointer to the previous version of the record if it has been modified (block number in the master file)
- Pointer to the previous version of the record if it has been modified (offset within the block)
- Starting position of the data area, where the record fields are located
- Number of fields contained in the record (that is, the number of entries present in the Directory section)
- Record status (0 = active record, 1 = record marked for deletion)
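As an illustration, the seven leader items can be unpacked from an 18-byte buffer as sketched below. The text gives only the total size and the order of the items; the individual widths (32-bit MFN and block pointer, 16-bit for the rest), the little-endian byte order and all Python names are assumptions chosen so that the sizes add up to 18 bytes.

```python
import struct
from typing import NamedTuple

# Assumed layout: 4 + 2 + 4 + 2 + 2 + 2 + 2 = 18 bytes, little-endian, no padding.
LEADER = struct.Struct("<ihihhhh")


class Leader(NamedTuple):
    mfn: int          # MFN of the record
    length: int       # record length
    back_block: int   # previous version: block number in the master file
    back_offset: int  # previous version: offset within the block
    base: int         # starting position of the data area
    nvf: int          # number of directory entries (fields)
    status: int       # 0 = active, 1 = marked for deletion


def parse_leader(buf: bytes) -> Leader:
    """Unpack the leader from the first 18 bytes of a record buffer."""
    return Leader(*LEADER.unpack_from(buf, 0))


# Build a sample leader in memory and read it back.
sample = LEADER.pack(7, 40, 0, 0, 30, 2, 0)
leader = parse_leader(sample)
```

Here `leader.nvf` tells a reader how many directory entries follow, and `leader.base` where the data area starts.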
The Directory #
The directory section of the record has one entry for each of the fields present in the record. When a field, such as the author’s name, occurs more than once in the record (repeatable field), there is a directory entry for each occurrence of the field. Likewise, the data area contains as many fields as there are occurrences of the repeatable field. The order in which these occurrences appear corresponds to the order in which they were entered when the information was captured.
Each directory entry occupies 6 bytes and has the following structure:
- Field Tag
- Starting position of the field within the record (expressed as an offset from the beginning of the data area)
- Field length
Entries in the directory are stored in the same order as the fields were entered.
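The way the directory drives access to the data area can be sketched as follows. The three 16-bit little-endian integers per entry are an assumption consistent with the 6-byte entry size, the Latin-1 decoding is also an assumption, and the function name and sample values are made up for the illustration.

```python
import struct

# Assumed entry layout: tag, position, length as 16-bit unsigned integers (6 bytes).
ENTRY = struct.Struct("<HHH")


def read_fields(directory: bytes, data: bytes, nvf: int):
    """Return (tag, value) pairs in directory order."""
    fields = []
    for i in range(nvf):
        tag, pos, length = ENTRY.unpack_from(directory, i * ENTRY.size)
        # The position is an offset from the start of the data area.
        fields.append((tag, data[pos:pos + length].decode("latin-1")))
    return fields


# Data area: values stored back to back, with no separators.
data = b"BOSIAN, G.WENT, F.W.Plant chamber"
# Two occurrences of tag 70 (repeatable field) and one of tag 24.
directory = ENTRY.pack(70, 0, 10) + ENTRY.pack(70, 10, 10) + ENTRY.pack(24, 20, 13)

fields = read_fields(directory, data, 3)
```

Because the values carry no separators, the directory is the only way to know where one field ends and the next begins; a repeatable field simply shows up as several entries with the same tag.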
The data area #
Following the record directory is the data area, which holds the information entered in the record.
An ISIS database provides three ways to store information in fields:
- Elementary fields
- Repeatable fields
- Fields with subfields
Elementary fields are those that have a single instance (occurrence) in the record, e.g.: a person’s date of birth, marital status, etc.
Repeatable fields are those that can have more than one instance in the record, e.g.: the authors of a publication, the subjects it deals with, etc.
Fields with subfields allow the information to be structured within the field in order to provide access to each of the data elements that compose it. Example: identify within the author’s name, the portion corresponding to the first name and the portion corresponding to the last name (name subfield and last name subfield). Each subfield will be preceded by a delimiter consisting of a subfield indicator (the ^ character) and a subfield identifier (a letter or a number).
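A minimal sketch of how a field value could be split into its subfields, using the ^ delimiter described above. The function name and the sample value are hypothetical; text that appears before the first delimiter is returned with an empty identifier.

```python
def split_subfields(value: str) -> list[tuple[str, str]]:
    """Split 'text^aXXX^bYYY' into (identifier, content) pairs."""
    parts = value.split("^")
    result = []
    if parts[0]:
        # Text before the first ^ has no subfield identifier.
        result.append(("", parts[0]))
    for part in parts[1:]:
        if part:
            # First character after ^ is the identifier, the rest is the content.
            result.append((part[0], part[1:]))
    return result


author = split_subfields("^aGale^bJ.")
```

With this sample value, the last name and first name become separately addressable as subfields `a` and `b`.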
Example of a record in MARC format stored under a CDS/ISIS structure. It was obtained through direct cataloging from the Library of Congress (LC). The value at the beginning of each field, shown between <…>, corresponds to the field tag.
Index Files #
Data files (.mst and .xrf) only allow records to be retrieved sequentially by MFN. Since other forms of access are required (for example by author, country or subject), an additional structure is needed that, given a keyword or a search formula, locates the records containing the requested terms. In CDS/ISIS this structure is called the Inverted Lists.
The inverted file of the CDS/ISIS structures is actually made up of 6 physical files, five of which contain the dictionary of search terms (organized as a B* tree) and the sixth contains the list of pointers associated with each term.
To optimize disk storage, two separate B* trees are maintained: one for terms up to 30 characters (stored in the .N01 and .L01 files) and another for terms over 30 and up to 60 characters (stored in the .N02 and .L02 files). The .CNT file contains control fields for both B* trees. In each B* tree, the .N0x file contains the tree nodes and the .L0x file contains the leaves. The leaf records point to the place where the pointers (postings) used to locate the records in the database are stored. This file is identified by the .IFP extension. The physical relationship between these files can be represented as follows:
The physical relationship among the six files that make up the inverted list is given by pointers, each of which represents the relative position of the record being pointed to. A relative address is the ordinal number of the record in a given file (for example, the first record is record 1, the second is record 2, etc.). The .CNT file points to the .N0x file, the .N0x file points to the .L0x file, and the .L0x file points to the .IFP file. Since the records of the .IFP file are not necessarily the same length, the pointer from .L0x to .IFP has two components: the block number and the offset within the block, each expressed as an integer.
.IFP file format #
The .IFP file contains the list of pointers (postings) for each term in the dictionary. Each pointer consists of 4 elements to identify the record from which the key is generated:
- MFN: MFN of the record
- TAG: Field identifier
- OCC: Occurrence number of the field from which the key is extracted
- CNT: Sequential number of the term in the field
Each term has as many pointers as there are fields that refer to it in the database. The list of pointers is stored in ascending MFN/TAG/OCC/CNT sequence. When the inverted list is loaded by a full generation process, each list is made up of one or more adjacent segments. As updates are made, additional segments can be created when new pointers need to be added. In that case, a new segment is created and linked to the other segments so that the MFN/TAG/OCC/CNT sequence is maintained.
Each time a split of this type occurs, the pointers of the segment where the new pointer was to be inserted are distributed equally between that segment and the newly created one. New segments are always created at the end of the file.
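The split behaviour can be sketched in a few lines. This is a simplified model under stated assumptions: postings are tuples (MFN, TAG, OCC, CNT), the list `segments` represents the logical (linked) order of the segments rather than their physical position in the file, and `SEGMENT_CAPACITY` is a hypothetical value chosen only to keep the example small.

```python
import bisect

SEGMENT_CAPACITY = 4  # hypothetical; real segment sizes are not given in the text


def insert_posting(segments: list, posting: tuple) -> None:
    """Insert a posting, splitting the target segment in two if it is full."""
    # Target: the last segment whose first posting is <= the new one.
    idx = 0
    for i, seg in enumerate(segments):
        if seg and seg[0] <= posting:
            idx = i
    if len(segments[idx]) >= SEGMENT_CAPACITY:
        seg = segments[idx]
        half = len(seg) // 2
        # Distribute the postings equally between the old and the new segment.
        segments[idx:idx + 1] = [seg[:half], seg[half:]]
        if posting >= segments[idx + 1][0]:
            idx += 1
    # Keep ascending MFN/TAG/OCC/CNT order inside the segment.
    bisect.insort(segments[idx], posting)


# Postings for one term, held in a single full segment.
segments = [[(1, 24, 1, 1), (2, 24, 1, 6), (3, 24, 1, 6), (5, 24, 1, 17)]]
insert_posting(segments, (4, 24, 1, 2))
```

Reading the segments in linked order still yields the postings in ascending sequence, which is the property the linking is meant to preserve.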
The keys are generated according to the specifications in the Field Extraction Table (.fst), which specifies how the access points to the database are generated for each field. There are 8 indexing techniques, that is, different methods for obtaining the keys, in order to satisfy all the information retrieval requirements placed on a database.
Inverted lists are normally updated in the data entry procedure. However, there are situations that require these files to be regenerated (corruption of indexes, loading large batches of information into the database, changes in indexing strategies). Therefore, it is necessary to activate special processes for the maintenance of the indexes, to process the entire database and build the inverted lists again. This process is called Complete generation of the inverted list and in schematic form it consists of the following steps:
- Generation of the unsorted key file
In this first step, each database record is read and the indexing techniques specified in the Field Extraction Table (.fst) are applied to each field. This process generates two files: .LN1, with terms of 30 characters or fewer, and .LN2, with terms longer than 30 characters. Both are plain text files, so they can be viewed with a text editor. The sample files shown here were produced with CISIS version 10-30, so their short keys are limited to 10 characters.
Example of the .LN1 file generated for MFNs 1-5 of the CDS database
1 24 1 1 TECHNIQUES
1 24 1 8 INDIVIDUAL
1 24 1 9 PLANTS
2 70 1 1 BOSIAN, G.
2 24 1 2 CONTROLLED
2 24 1 3 CLIMATE
2 24 1 6 PLANT
2 24 1 7 CHAMBER
2 24 1 10 INFLUENCE
3 70 1 1 BOSIAN, G.
3 24 1 1 CONTROL
3 24 1 3 CONDITIONS
3 24 1 6 PLANT
3 24 1 7 CHAMBER
3 24 1 8 FULLY
3 24 1 9 AUTOMATIC
3 24 1 10 REGULATION
3 24 1 12 WIND
3 24 1 13 VELOCITY
3 24 1 16 RELATIVE
3 24 1 17 HUMIDITY
3 24 1 19 CONFORM
3 24 1 22 FIELD
3 24 1 23 CONDITIONS
3 69 1 2 MOISTURE
3 69 1 4 WIND
3 69 1 6 ECOSYSTEMS
4 70 1 2 WENT, F.W.
4 24 1 2 ELECTRIC
4 24 1 3 HYGROMETER
4 24 1 4 APPARATUS
4 24 1 6 MEASURING
4 24 1 7 WATER
4 24 1 8 VAPOUR
4 24 1 9 LOSS
4 24 1 11 PLANTS
4 24 1 14 FIELD
4 69 1 3 MOISTURE
5 70 1 1 GALE, J.
5 24 1 1 ANTI
5 24 1 5 RESEARCH
5 24 1 6 TOOL
5 24 1 9 STUDY
5 24 1 12 EFFECTS
5 24 1 14 WATER
5 24 1 15 STRESS
5 24 1 17 PLANT
5 24 1 18 BEHAVIOUR
Example of the .LN2 file generated for MFNs 1-5 of the CDS database
1 70 1 1 MAGALHAES, A.C.
1 70 1 2 FRANCO, C.M.
1 24 1 4 MEASUREMENT
1 24 1 6 TRANSPIRATION
1 69 1 1 PLANT PHYSIOLOGY
1 69 1 2 PLANT TRANSPIRATION
1 69 1 3 MEASUREMENT AND INSTRUMENTS
2 24 1 12 ASSIMILATION
2 24 1 14 TRANSPIRATION
2 69 1 1 PLANT EVAPOTRANSPIRATION
3 24 1 14 TEMPERATURE
3 24 1 21 MICROCLIMATIC
3 69 1 1 PLANT PHYSIOLOGY
3 69 1 3 TEMPERATURE
3 69 1 5 MEASUREMENT AND INSTRUMENTS
4 70 1 1 GRIEVE, B.J.
4 69 1 1 HYGROMETERS
4 69 1 2 PLANT TRANSPIRATION
4 69 1 4 WATER BALANCE
5 70 1 2 POLJAKOFF-MAYBER, A.
5 24 1 2 TRANSPIRANTS
5 69 1 1 PLANT PHYSIOLOGY
5 69 1 2 SOIL MOISTURE
5 69 1 3 PLANT TRANSPIRATION
5 69 1 4 EVAPOTRANSPIRATION
5 69 1 5 MEASUREMENT AND INSTRUMENTS
The first four columns contain the information that will give rise to the .IFP file pointer: the values of MFN, TAG, OCC and CNT. The output is ordered by MFN, since it comes from a sequential reading of the master file.
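The lines of these files are easy to process programmatically. The sketch below parses one line into its five columns and separates short keys from long ones; the function names are hypothetical, and the default limit of 10 characters is taken from the CISIS 10-30 configuration used for the sample files, not from a fixed rule.

```python
def parse_link_line(line: str) -> tuple:
    """Split 'MFN TAG OCC CNT KEY' into typed columns; the key may contain spaces."""
    mfn, tag, occ, cnt, key = line.strip().split(None, 4)
    return int(mfn), int(tag), int(occ), int(cnt), key


def split_by_length(entries, short_limit: int = 10):
    """Separate entries into (short keys, long keys) by key length."""
    short = [e for e in entries if len(e[4]) <= short_limit]
    long_ = [e for e in entries if len(e[4]) > short_limit]
    return short, long_


lines = [
    "1 24 1 1 TECHNIQUES",
    "1 70 1 1 MAGALHAES, A.C.",
    "1 69 1 3 MEASUREMENT AND INSTRUMENTS",
    "2 24 1 6 PLANT",
]
short, long_ = split_by_length([parse_link_line(l) for l in lines])
```

Limiting the split to four separators keeps multi-word keys such as subject headings intact in the last column.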
- Key sorting
Since the inverted list is kept in alphabetical order of the keys, the second step consists of sorting the keys alphabetically. The result is the files .LK1 and .LK2, which contain the same keys as .LN1 and .LN2, but in ascending order by key.
Example of the .LK1 file generated for MFNs 1-5 of the CDS database
5 24 1 1 ANTI
4 24 1 4 APPARATUS
3 24 1 9 AUTOMATIC
5 24 1 18 BEHAVIOUR
2 70 1 1 BOSIAN, G.
3 70 1 1 BOSIAN, G.
2 24 1 7 CHAMBER
3 24 1 7 CHAMBER
2 24 1 3 CLIMATE
3 24 1 3 CONDITIONS
3 24 1 23 CONDITIONS
3 24 1 19 CONFORM
3 24 1 1 CONTROL
2 24 1 2 CONTROLLED
3 69 1 6 ECOSYSTEMS
5 24 1 12 EFFECTS
4 24 1 2 ELECTRIC
3 24 1 22 FIELD
4 24 1 14 FIELD
3 24 1 8 FULLY
5 70 1 1 GALE, J.
3 24 1 17 HUMIDITY
4 24 1 3 HYGROMETER
1 24 1 8 INDIVIDUAL
2 24 1 10 INFLUENCE
4 24 1 9 LOSS
4 24 1 6 MEASURING
3 69 1 2 MOISTURE
4 69 1 3 MOISTURE
2 24 1 6 PLANT
3 24 1 6 PLANT
5 24 1 17 PLANT
1 24 1 9 PLANTS
4 24 1 11 PLANTS
3 24 1 10 REGULATION
3 24 1 16 RELATIVE
5 24 1 5 RESEARCH
5 24 1 15 STRESS
5 24 1 9 STUDY
1 24 1 1 TECHNIQUES
5 24 1 6 TOOL
4 24 1 8 VAPOUR
3 24 1 13 VELOCITY
4 24 1 7 WATER
5 24 1 14 WATER
4 70 1 2 WENT, F.W.
3 24 1 12 WIND
3 69 1 4 WIND
Example of the .LK2 file generated for MFNs 1-5 of the CDS database
2 24 1 12 ASSIMILATION
5 69 1 4 EVAPOTRANSPIRATION
1 70 1 2 FRANCO, C.M.
4 70 1 1 GRIEVE, B.J.
4 69 1 1 HYGROMETERS
1 70 1 1 MAGALHAES, A.C.
1 24 1 4 MEASUREMENT
1 69 1 3 MEASUREMENT AND INSTRUMENTS
3 69 1 5 MEASUREMENT AND INSTRUMENTS
5 69 1 5 MEASUREMENT AND INSTRUMENTS
3 24 1 21 MICROCLIMATIC
2 69 1 1 PLANT EVAPOTRANSPIRATION
1 69 1 1 PLANT PHYSIOLOGY
3 69 1 1 PLANT PHYSIOLOGY
5 69 1 1 PLANT PHYSIOLOGY
1 69 1 2 PLANT TRANSPIRATION
4 69 1 2 PLANT TRANSPIRATION
5 69 1 3 PLANT TRANSPIRATION
5 70 1 2 POLJAKOFF-MAYBER, A.
5 69 1 2 SOIL MOISTURE
3 24 1 14 TEMPERATURE
3 69 1 3 TEMPERATURE
5 24 1 2 TRANSPIRANTS
1 24 1 6 TRANSPIRATION
2 24 1 14 TRANSPIRATION
4 69 1 4 WATER BALANCE
The first four columns contain the information that will give rise to the .IFP file pointer: the values of MFN, TAG, OCC and CNT. The output is ordered by key.
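The sorting step, and the way sorted entries turn into one list of pointers per term, can be sketched as follows. This is only an illustration: the function name is hypothetical, and plain Python string comparison stands in for whatever collation the real tools apply.

```python
from itertools import groupby


def build_postings(entries):
    """Sort (MFN, TAG, OCC, CNT, KEY) entries by key, then group them per key."""
    # Order by key first, then by MFN/TAG/OCC/CNT within the same key.
    ordered = sorted(entries, key=lambda e: (e[4], e[:4]))
    return {
        key: [e[:4] for e in group]
        for key, group in groupby(ordered, key=lambda e: e[4])
    }


entries = [
    (2, 24, 1, 6, "PLANT"),
    (1, 24, 1, 9, "PLANTS"),
    (5, 24, 1, 17, "PLANT"),
    (3, 24, 1, 6, "PLANT"),
]
postings = build_postings(entries)
```

Each dictionary key plays the role of a dictionary term, and its value is the ascending list of pointers that would be stored for that term.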
Note: These examples and graphics were produced with CISIS version 10-30 which supports short keys of up to 10 characters and long keys of up to 30 characters. The standard version of ABCD works with CISIS 16-60, which works with short keys of up to 16 characters and long keys of up to 60 characters (see: http://wiki.bireme.org/es/index.php/CISIS)
Note the handling of repeatable fields, e.g. field 650, where each occurrence is maintained in a separate field but identified with the same label. The art of working with CDS/Isis tools lies in good use of the formatting language, to extract and manage the information stored under these structures.