Notes from Data Modeling Essentials, Simsion & Witt (2005), pt. 1 (Ch. 1-3)
Chapter 1: What is Data Modeling?
- On the rhetorical nature of data modeling: “In some cases, there is a single, correct approach. Far more often there will be several options. Asking the right questions (and coming up with the best answers) requires a detailed understanding of the relevant business area, as well as knowledge of data modeling principles and techniques” (6).
- The process of data modeling: 1) analysis of business requirements; 2) design for those requirements. Typically this is an evolutionary process.
- The impact of data modeling on program design is profound: good data models make initial programming simpler and cheaper, revising programming simpler and cheaper and encourage systemic sustainability (i.e., no scrapping an entire system when the data model changes).
- The authors note that the usefulness of a data model lies in its conciseness (9). The overlaps with perspicuity from the rhetorical tradition might be generative.
- What makes a good data model? 1) Completeness; 2) Nonredundancy; 3) Enforcement of rules; 4) Data reusability (the rhetorical flexibility of the model is essential for multi-domain analyses) . . . as such, data should be organized (as much as possible) independent of any specific application; 5) Stability and flexibility; 6) Elegance (“Elegant models are typically simple, consistent and easily described and summarized” (13); 7) Communication: is there slippage between the meaning of the data appended by the designers, coders and users?; 8) Integration – how easily can data models be integrated across multiple sections of an industry/business without confusion/redundancy?
- There are three levels or tiers of data modeling: 1) conceptual; 2) logical; and 3) physical. Here’s a handy workflow chart that demonstrates the current process of creating all three layers. For us, the emphases are on the “Develop Information Requirements” and “Build Conceptual Data Model” steps:
- The conceptual data model is developed (mostly) independent of specific technology; rather, it’s purpose/focus is the communication among the data modeler and business stakeholders. The logical model is the translation of the conceptual model into structures that can be implemented by the DBMS (database management system). Typically these DBMSs are relational in nature. The physical model “incorporates any changes necessary to achieve adequate performance and is presented in terms of tables and columns, together with a specification of physical storage and access mechanisms” (17).
- The biggest challenge in building a data model: “In our experience, the most serious problem with terminology is that its ambiguity frequently reflects a lack of clarity in methodology, roles, and deliverables. In particular, it may effectively license a database technician to make changes to tables and columns without the involvement of the data modeler” (20).
- Different approaches to data modeling: 1) process-driven approach: this is the traditional approach that works back from what systems do. Essentially, the data modeler identifies all the process currently working and the data that makes each of them possible. Then the data modeler designs a model “to support this fairly precise set of data requirements, typically using ‘mechanical’ techniques such as normalization” (20); 2) data-driven approach: develop the data model before the process-model. This promotes: a) reusability of data (data can be used for different processes – not one process); b) consistency in naming (rhetorical problems); c) generate a list of the mechanics that might be useful to pull from the database. The value of the data-driven approach is that it provides a broad but specific-enough overview of an entire system’s scope. This allows for understandings of what is possible and not based on the data being collected. As such, the data model actually can function as a mechanism for describing the boundaries of an entire process-system in leaner ways than a massive “large process model” (22).
- Most approaches are blended: process-driven + data-driven.
- Who are the stakeholders in the construction of a data model? 1) system users, owners and/or sponsors; 2) Business specialists or subject matter experts (SME); 3) data modler; 4) process modeler (designs programs that utilize the data model); 5) database designer/administrator (assess the connections/disconnections between the logical model and the data model; 6) systems integration manager (enterprise architect/data administrator that defines standards for consistency, data reuse and data flexibility beyond the immediate project (24).
- Two ways to think about data modeling conventions: 1) Entity Relationship (E-R) approach; and 2) Unified Modeling Language (UML) approach.
- Normalization: “a set of rules for allocating data to tables in such a way as to eliminate certain types of redundancy and incompleteness” (33). Typically accomplished later in the data modeling process but, if implemented from early on, can get around some terminological problems.
- Normalization consists of: 1) put data in tabular form (by removing repeating groups); and 2) remove duplicated data to separate tables.
- In essence, the process of normalization often results in many, many smaller tables that contain exclusive criteria. For example, a table that contains information about employee qualifications, departmental locations and pay grades shouldn’t be co-located; rather, each table should have a definable and exclusive (non-overlapping) purpose. You might think about this in much the same way as a linguistic coding scheme for coding data – each piece of data must be assignable to only one code . . . any more than that would render the data difficult to interpret. Database tables work much the same way. One table = one code. Finally, when you can derive data from existing columns, don’t go through the process of adding another space on the table – that’s redundant.
- Primary Key: the combination of two column areas that can unambiguously allow one to identify a specific row in the table. IOW, a column or combination of columns that produces a different value for every row in the table.
- Steps involved in normalization:
- Step 1: Put the data in table form by identifying and eliminating repeating groups
- Identify any determinants (pieces of data that determine the other columns), other than the primary key, and the columns they determine.
- Establish a separate table for each determinant and the columns it determines. The determinant is the primary key in this new table.
- Name the new tables.
- Remove the determined columns from the original table. Leave the determinants to provide links between tables (49).
- Iterate process until you reach 3NF (3rd Normal Form, or when the only determinants of nonkey columns are candidate keys. Candidate keys are the alternative keys that could serve as the primary key).
- In determining determiners, the authors argue that you must ask a SME/business specialist to detail the determinants and dependencies. MY QUESTION: CAN WE AUTOMATE THIS PROCESS IN ESTABLISHING DETERMINANTS VIA DISTANT VIEWING TECHNIQUES WITH ANY ACCURACY?
- Because the process or normalization is really a process of mapping determinants, many have charged normalization of data for models with being arhetorical, or at least non-creative and overly deterministic. The authors argue against this notion (60), noting that “In our prenormalization tidying-up phase, we divide complex facts into more primitive facts. There is a degree of subjectivity in this process. By eliminating the multifact column, we add apparent complexity to the model (the extra columns); on the other hand, if we use a single column, we may hide important relationships amongst data” (61).
Chapter Three: The Entity-Relationship (ER) Approach
- Data structure diagram: composed of 1) a box (represents a table); and 2) an arrow (drawn between two boxes represents a foreign key pointing back to the table where it appears as a primary key) (66)
- Once a rudimentary data structure diagram is produced, it is typically checked against the BI/SME to be sure that it is an accurate (and useful) diagram for understanding the relationship among data.
- Annotations to diagrams are key and provide detail on the kind of information we can make from databases (see diagram below):
- Some key terminology: 1) entity classes: categories of things of interest to the business; represented by boxes on the diagram, and generally implemented as tables; 2) attributes: what we want to know about entity classes; not usually shown on the diagram and generally implemented as columns in tables; 3) relationships: represented by lines with crows’ feet and generally implemented through foreign keys (75).
- The authors note that entity class definitions are essential as they provide working commonplaces from which data modelers can develop their drawings. A good entity class definition answers two questions: 1) what distinguishes instances of this enity class from instances of other entity classes?; and 2) What distinguishes one instance from another? (81).
- The authors note the problem of inexact language, noting that “A glance at a thesaurus will reveal that many common words have multiple meanings, yet these same words are often used with qualification in definitions. In one model, an entity class named ROLE had the definition, “Part, task, or function,” which, far from providing the reader with additional information as to what the entity class represented, widened the range of possibilities” (81).
- Entity class definitions can contain the entity class word itself but should try to avoid other entity class names (re:reference) in the definition. An example:
- Drug: An antibiotic drug as marketed by a particular manufacturer. Variants that are registered as separate entities in Smith’s Index of Therapeutic Drugs are treated as separate instances. Excluded are generic drugs such as penicillin. Examples are: Maxicillin, Minicillin, Extracycline. (82)
- The goal of ER modeling: “We are trying to come up with a set of nonoverlapping categories so that each fact in our world fits into one category only. Different modelers will choose differently shaped grids to achieve the same purpose. Current business terminology is invariably a powerful influence, but we still have room to select, clarify and depart” (106).