
Copyright 1997 CAUSE. From CAUSE/EFFECT Volume 20, Number 2, Summer 1997, pp. 31-33, 38-40, 60. Permission to copy or disseminate all or part of this material is granted provided that the copies are not made or distributed for commercial advantage, the CAUSE copyright and its date appear, and notice is given that copying is by permission of CAUSE, the association for managing and using information resources in higher education. To disseminate otherwise, or to republish, requires written permission. For further information, contact Julia Rudy at CAUSE, 4840 Pearl East Circle, Suite 302E, Boulder, CO 80301 USA; 303-939-0308; e-mail: jrudy@cause.org
by Charles R. Thomas
Colleges and universities are initiating data warehouse projects in order to provide integrated administrative information for planning and reporting purposes, but few institutions are also undertaking the important process of constructing an information architecture that is based on a taxonomy of academic and administrative activities and a detailed data dictionary. This article describes a survey of forty institutions that have active data warehouse projects and outlines the essential steps to constructing an integrated information architecture.
An overall institutional information architecture1 is important to the development of integrated administrative information systems at every college and university, but not many institutions have addressed this basic issue. Independent administrative systems can meet the needs of individual operational units reasonably well without an information architecture, but improved administrative efficiency requires coordination between operational units. Further, increased emphasis on planning by top administrators creates a demand for integrated information that cannot easily be met by independent and non-integrated administrative systems. Many top administrators are frustrated by the lack of integrated information, but are not acquainted with the activities necessary to achieve the required integration.
The trade press is full of articles about the value of a data warehouse for dozens of companies, so many colleges and universities have initiated activities to capitalize on this trend. Such initiation of a data warehouse project will quickly reveal any lack of integration among an institution's core administrative systems. The genesis of this problem is usually the absence of an explicit institutional information architecture. While every college or university has an implicit information architecture, unfortunately the documentation of that architecture is usually buried in file layouts in source COBOL computer program code somewhere in the depths of the administrative computing organization. Further, because they were developed at different times by different people for different offices, seldom are the definitions of basic data elements in legacy systems consistent.
The purpose of this article is to outline some of the technical, organizational, and process requirements essential to the construction of an integrated information architecture, a key prerequisite for implementing a data warehouse. In preparing this article, the author conducted an informal survey of colleges and universities concerning their data warehouse projects. The great diversity of efforts under way at institutions and the rapid progress in this arena make it difficult to present many detailed analyses, but some obvious trends do emerge, and are discussed.
Industry publications regularly state that over 90 percent of medium and large companies already have data warehouse projects in operation.2 Detailed reports of most of the corporate data warehouse projects indicate that they capture data about production operations from a variety of systems on regular cycles, normalize3 that data on the way into the data warehouse, then provide managers with friendly front-end analytical tools for analysis. Many of these reports minimize the effort required to capture the "meta data," that is, the data about the data -- the definitions, codes, categories, and descriptions of data elements in the operational systems. In most colleges and universities, capturing accurate meta data is a sizable task, followed by the process of coming to agreement on an integrated set of definitions for the data warehouse project.
Before writing this article, the author conducted an informal electronic mail survey of approximately 250 institutions who indicated in the CAUSE Institutional Database that they had a data warehouse project either in progress or planned. About 100 indicated projects in progress, which amounts to less than 25 percent of the 425 institutions in the CAUSE database. While this was not a random, stratified sample, the numbers do indicate that colleges and universities are well behind the power curve on implementing data warehouse projects when compared to companies. It is, of course, entirely possible that the industry figures are inflated, but the conclusion would still be reasonable. Still, 85 of the 100 institutions that indicated active data warehouse projects did respond to the informal survey. In the final analysis, only forty of the institutions responded with detailed information about their data warehouse projects, and the other forty-five said they actually did not have an active project, but were still in the planning stages.
As might be expected, the majority of the institutions surveyed and the majority of those with active data warehouse projects were universities. Table 1 shows a comparison of the distribution by type of institution for those surveyed, those with active data warehouse projects, and the general U.S. higher education institutional population.
| Institution Type | Distributions of Institutions Surveyed | With Active Data Warehouse Project | Distribution of All U.S. Institutions |
|---|---|---|---|
University |
60% | 80% | 6% |
Four-Year Institution |
30% | 15% | 56% |
Two-Year Institution |
10% | 5% | 38% |
| Total | 100% | 100% | 100% |
Also as might be expected, the detailed descriptions from the forty responding institutions vary widely. Some clear trends do, however, emerge from examining the data warehouse project descriptions provided. These are summarized below.
"A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process."
-Bill Inmon, Industry Expert
The formal definition of a data warehouse leaves a lot of latitude for interpretation, and colleges and universities appear to have taken full advantage of that latitude. Some of the responding institutions provide online access to operational data files and refer to this activity as their data warehouse. Most, however, extract data from the operational systems on some regular cycle, normalize the data, then refresh the data warehouse with that data. Most have recognized the need for longitudinal data and provide multiple time-date stamped sets of data. A few keep only the current snapshot plus a limited amount of historical data online.
If the general population of colleges and universities is similar to the group responding to the author's informal survey, institutions are using a limited number of database tools for data warehouse projects. Three database tools accounted for 85 percent of the responses. Some acquired specific tools just for the data warehouse, but most decisions were driven by database tools already in use for other purposes. The two most popular database tools used by the responding institutions were Oracle (50 percent) and Sybase (20 percent), with Microsoft's SQL Server coming in third with 15 percent, ahead of a variety of other tools. Most of the data warehouses reported are maintained on a separate server designated for that purpose.
The institutions responding to the survey reported a fairly wide variety of tools either in use or being considered for use in accessing their data warehouse. Unlike the database decision, it appeared that the choice of client tool was more often than not driven by the requirements of the data warehouse rather than by tools already in place for other purposes. At most responding institutions the choice was based on staff and/or committee recommendations after trials. The predominant direction reported was toward the use of Web tools, with Microsoft desktop tools as a second choice.
Most responding institutions reported support for Windows/Intel and Apple Macintosh desktop computing environments as clients on the institutional network and for the data warehouse. In some cases, this dual client platform support affected the choice of client tools. Web tools were most often mentioned as being in use (20 percent), with MS-Access, MS-Excel, and BrioQuery also being used by 10 percent or more of the respondents.
Only a very small number of the responding institutions reported having all of the major institutional data areas available in their data warehouse. Most reported building the institutional data warehouse gradually, starting with either student data, human resources, or finance data, then planning to evolve toward a comprehensive set of data. According to comments from the responding institutions, this incremental approach is most often due to resource constraints. Table 2 summarizes the frequency of the data areas reported in institutional data warehouses.
| Data Area | Percent |
|---|---|
| Student | 80% |
| Course | 20% |
| Facilities | 23% |
| Finance | 64% |
| Human Resources | 64% |
| Alumni | 18% |
| Other | 41% |
Most responding institutions reported having developed a schedule of data gathering that follows the natural cycle for each major data area. These cycles vary from annual and monthly snapshots to as frequent as every three minutes. In almost all cases, the data cycle and the time and date of the last refresh of the data warehouse are noted for each data area.
As a part of the research for this article, the author visited over thirty of the institutional data warehouse Websites for the responding institutions to learn about the state of the data definitions provided to users. The level of meta data, or information provided to the user describing the data in the warehouses, varied from nearly none to barely adequate in all but a few institutions. Many of the data warehouse sites visited on the Web provide some data descriptions, even though a few of these appeared to be mere reproductions of COBOL file definitions. Few of the data warehouse Websites visited included plain language definitions of data elements, and none included text descriptions of the categories within elements. This would indicate that either more attention should be given to meta data, or complete documentation is not available.
Many of the responding institutions allowed Web visitors to view the description and documentation of their data warehouse, and a few even include the campus data dictionary showing definitions, codes, and categories. Some provided only the briefest description of their data warehouse project, then required special security clearance even for access to the meta data. Most of the responding institutions control access to information in their data warehouse even within the institution, requiring special "log on" techniques with password protection. Security of data and access to the institutional data warehouse are issues requiring attention at the highest levels of institutional administration. A CAUSE task force on privacy and handling of student information in a networked environment recently released a paper reporting on policy issues in this area.4
Industry literature stresses the need for top executive sponsorship for a successful data warehouse project. In spite of these admonitions, all but a few of the institutional efforts appear to be first started in the information technology department without top institutional executive sponsorship. One institutional representative even described the origins of their data warehouse project as "insidious." Institutions responding to the informal electronic mail survey stated a number of interesting motivations for construction of a data warehouse.
Several institutions mentioned "defense" as a primary motivation for constructing a data warehouse. The data warehouse is seen as a way to meet the increasing demand for both data and information with limited staff resources.
Some data warehouse projects were triggered by the installation of new operational systems that were not capable of delivering all of the traditional reports. Many respondents mentioned data integration as a primary motivation; the data warehouse provides a way to normalize data for administrative access without major disruptions to diverse legacy operational systems that are doing the job for an administrative department.
Security is another motivation; the legacy system is protected from hackers by its inaccessibility, and data are made available through a "copy" in the data warehouse. The author has labeled this level of security the Principle of Least Regret -- "We will regret it a lot less if someone sees something they should not, than if the master file is blown away."
Some respondents describe a "mission creep" approach to their data warehouse project. They plan to start small, make some information available to executives, then leverage the availability of that information to acquire resources to make more available.
Many institutions, as well as many companies, see no immediate direct financial savings resulting from their data warehouse project; however, they anticipate important benefits, including:
It is also beneficial that data in the warehouse are "time-date" stamped, so users will not have to be as familiar with the production schedules of the operational data systems, and data will be accessible on a "24x7" (twenty-four hours a day, seven days a week) basis instead of only during traditional business hours.
Information systems are an important competitive resource for most corporations, so most maintain them as close as possible to the current technological state of the art. Because of limited resources, many colleges and universities have, on the other hand, been limping along with much older information technology to support administrative operations. The administrative information systems in many colleges and universities have grown out of the technologies of the '60s, so it is not surprising to discover that many institutions do not have an integrated information architecture.
Many of the administrative applications evolved from the relatively simple automation of clerical tasks, office by office. The systems were designed to support specific office functions in such locations as the registrar's office, the accounting office, the alumni office, etc., and few individuals outside of those offices were required to know either the structure or the definitions of the data. Requests for information from top administrators were typically satisfied by those functional offices, and integrated information was usually just not available. Some institutional executives have referred to their campus as a collection of "data fiefdoms," and it was very important to know which questions to ask of which operational office any time information was needed.
Increased federal, state, and other agency reporting requirements have forced institutions to continuously augment already decrepit information systems until some have been "maintained" to the point of near collapse. Informally, many chief information officers will admit that as much as 85 percent of their technical talent is occupied with maintaining legacy systems, leaving few resources for attention to tasks like documentation and information architecture. Some institutions have installed new proprietary software to solve operational problems, but have not addressed the fundamental information architecture issues. Some executives have mistakenly thought that proprietary software would provide the information architecture for their institution.
The economics of information systems in colleges and universities are also quite different from the corporate environment. A university will typically have over 150 major administrative application systems, each processing a relatively small number of transactions per year. The accounting system in a major university might process one or two million transactions in a year. The registrar in an institution with 25,000 students might record six course registrations per student for each of three terms, which would amount to fewer than 500,000 transactions in one year. A major corporation will have far fewer systems, and will number transactions in the billions per year. For example, the peak processing volume of United Airlines Apollo system is 17,500 transactions per second, which translates to over a billion transactions in one day. The significant differences between corporate and academic information systems are best described in Table 3.
| Corporate | Academic | |
|---|---|---|
| Application Systems | Few | Many |
| Transactions | Many | Few |
| Mission | Clear | Hazy |
| Managment Style | Autocratic | Collegial |
The juxtaposition of the number of systems and the number of transactions between corporations and academic institutions has created a very different set of information economics over the years. These differences, when combined with the significant differences in both mission and management style between corporations and higher education institutions, have resulted in a less than integrated institutional information architecture at most colleges and universities. The same comparison illustrates the difference in complexity of data warehouse projects between the corporate and academic environments.
An institutional information architecture should include a comprehensive taxonomy of academic and administrative activities and a complete data dictionary with detailed descriptions of all of the information stored in all administrative systems. While every organization has an implicit information architecture, only at a few institutions is it both integrated and explicit. Such a resource can provide valuable information about the campus and the way it operates for both new and existing administrators. Many institutions are well on the way to an institutional intranet as a way to improve communication about administrative matters, and an explicit information architecture is a perfect candidate for that publication medium.
At the root of an integrated information architecture for any specific college or university should be a comprehensive inventory of all administrative activities on the campus. The most accurate list of academic and administrative departments on any campus is usually the chart of accounts in the business office, but this document seldom provides more than a fiscal picture of the organizational unit. In fact, in the '70s, when the National Center for Higher Education Management Systems (NCHEMS, http://www. nchems.com/) was developing a standard Program Classification Structure,5 the analysts started with institutional charts of accounts, but soon discovered that the institutional telephone directory contained a more complete listing of all of the academic and administrative organizations on each campus. Some campus organizations do not use the institution's fiscal system, but few operate without a telephone or a network connection.
A series of campus focus group meetings should be held to develop the institution's local taxonomy of academic and administrative activities. These meetings can be coordinated either by an outside consultant or with an internal facilitator serving as a catalyst, but some individual should be responsible for producing draft copies of the developing taxonomy. The important point is that each institutional information architecture should include an explicit inventory of all of the organizational units with the local organizational unit identifier and a brief description of the unit's activities. In the process of constructing an institutional taxonomy, most institutions generally discover that several units are performing the same activities. Of course, activities that should be performed, but are not, are also sometimes discovered in the process. Administrative discussions can then focus on which organizational unit should be responsible for which activities. It may be appropriate for more than one organizational unit to be responsible for a specific administrative activity, but that should be by design and properly coordinated, rather than by default living with uncoordinated duplicative efforts that inevitably produce different results.
The resulting taxonomy will be a major reference in data dictionary activities described below. Each data element description will include references to the administrative units using that element, and the cross references from this source will provide a list of all of the data elements used by each administrative unit. When changes to data definitions, codes, categories, or descriptions are anticipated, this cross-reference list will provide the means to determine which organizational units will be affected by the change. The institutional taxonomy should be a dynamic document, maintained by a non-parochial office, and accessible to the appropriate individuals on the campus intranet. It is particularly valuable to new executives for learning about the campus organizational structure and administrative activities.6
The administrative systems in colleges and universities have always involved data elements with definitions, codes, categories, and descriptions. In most cases these data elements were always known by someone, and in some cases even documented. The problem has always been the form and location of this documentation or the identification of the person who knew the definitions. In any major operational office, like the registrar's office, some individual is usually completely familiar with all of the data elements maintained by that office. Sometimes this information is even available in printed form, but seldom is it widely available to other offices.
Over the years, institutional administrative systems were developed sequentially, and usually independently. As a result, different administrative systems might have entirely different sets of descriptions and codes for the same data element describing the same entity on the campus. As a part of any Information Technology Review, NCHEMS consultants typically ask administrators if their campus maintains a data dictionary. The answer is invariably yes, but when a copy is requested, the administrator usually refers the consultant to the computing center. Later the data dictionary typically turns out to be a COBOL file definition in a source program listing, somewhere in the computing center files.
In years past, the primary, and sometimes only, purpose of most administrative information systems was to support a single operational office. In recent years the amount of data maintained in administrative information systems has increased dramatically, and access to that data has become important to many individuals outside the one operational office. Some of the increase in the number of data elements maintained by colleges and universities is driven by external reporting requirements, and some represents the centralization of "fugitive" data previously maintained on paper by a wide variety of individual offices. To illustrate the dramatic increase in the number of data elements maintained by colleges and universities, the original NCHEMS Data Element Dictionary,7 published in 1971 and updated in 1973, contained approximately 250 data elements in five separate administrative areas. The CHESS Data Definitions for Colleges and Universities, published in 1996, contains definitions for 781 data elements in six separate administrative areas.8
From an organizational perspective, the development of an institutional data dictionary is a daunting task that should involve a wide variety of campus offices. Beyond the operational office with primary responsibility for maintaining data, several other offices must be consulted before the definition, codes, categories, or descriptions of any data element can be changed. Campuses with well developed data dictionaries usually have an information policy and advisory committee structure for dealing with any changes to their data dictionary. The campus data dictionary is also a handy reference for the many individuals who now have access to the core administrative information systems.
Maintenance of the campus data dictionary should be assigned to a non-parochial organization, which many times is the office of institutional research, analytical studies, planning, or similar administrative department. Unfortunately this task is usually viewed as a technical problem to be assigned to a data administrator within the institution's information technology unit. In these cases it should be no surprise that the data dictionary addresses the technical aspects of data, and is sometimes less than useful to non-technical administrators. Like the institutional taxonomy of academic and administrative activities, the institutional data dictionary should be a dynamic document, accessible to the appropriate individuals on the campus intranet.
Higher education institutions may be behind corporations in the development
of data warehouses, but the dramatic differences between the two environments
require very different approaches. Institutions would be better served by
adopting a strategic approach to all administrative information systems,
including data warehouses, rather than the incremental approach currently
being employed by many colleges and universities. Data warehouse activities
may provide some of the anticipated benefits, and they may even prolong
the life of some legacy systems in the short term. The more important predecessor
activity for most institutions, however, is the campus information architecture.
It is key to the integration of data in the operational administrative information
systems, and therefore key to the long-term viability of the data warehouse.
The process of constructing a taxonomy of administrative activities will
document which campus offices perform which administrative activities, and
is the first step toward building an institutional information architecture.
When undertaken as suggested earlier in this paper, the process can be at
least as important as the product. Individuals throughout the campus will
become more aware of all of the administrative processes and of their interrelationships.
The building and maintenance of a campus data dictionary requires even more detailed work than the taxonomy, but is no less important to the institution's information architecture. The process will undoubtedly uncover many discrepancies in data definitions within the campus community, so it is important to have an appropriate forum for resolving differences. It is also important to have a top level information policy group to emphasize the importance of maintaining data in formats that are consistent with institutional standards.
Responsibility for the construction and maintenance of the campus information architecture is typically lodged in the administrative information systems unit, but the task is far too important to be relegated to a technical solution. An office with a campuswide perspective and a non-technical view of information should lead the endeavor. The effort should be supported by an information policy and advisory committee structure that represents the perspectives of all information users on the campus.
Participation in the committee structure surrounding the construction and maintenance of the campus information architecture can also enhance intra-campus communication about data and information and improve individual office productivity significantly. The availability of the institutional taxonomy and data dictionary as dynamic documents on a campus intranet will provide current and future benefits to the institution, as every individual has access to, and is able to understand, the appropriate information to perform his or her function.
1 Information architecture refers to the specification of the identifiers, codes, names, and abbreviations of the entities (students, faculty, courses, organizational units, etc.) about which data are maintained in an organization's record systems and the definitions, allowable categories, and coding conventions for the set of data elements used to describe each of these entities. The term also encompasses the identification of the data elements needed by each organizational unit for the conduct of their activities.
2 Mary Ryan Garcia, "Data Warehouses Grow Up," Beyond Computing, September 1996, 25-26.
3 Normalize is the term used by database administrators to describe the process of translating different data coding structures from disparate operational systems into a "standard" coding structure for the data warehouse.
4 See http://www.cause.org/issues/issue.asp?issue=privacy for resources about privacy policy and issues, as well as information about how to order the CAUSE white paper produced by this task force.
5 Douglas J. Collier, Program Classification Structure, Second Edition, Technical Report 106 (Boulder, Colo.: National Center for Higher Education Management Systems, 1978).
6 Recent NCHEMS activities have led to the development and 1996 publication on diskette of the author's CHESS Taxonomy of Administrative Activities in Colleges and Universities. This resource identifies over 700 different administrative activities organized according to the structure of the Integrated PostSecondary Education Data System (IPEDS), data generated by an annual survey conducted by the National Center for Education Statistics. For more information, contact the author at crt@nwu.edu.
7 Data Element Dictionary, Second Edition, Technical Report 51 (Boulder, Colo.: NCHEMS, 1973).
8 The CHESS Data Definitions for Colleges and Universities, published on diskette in 1996, contains definitions for 781 data elements supported by 130 tables of codes, categories, and descriptions. Approximately 300 of these data elements are defined in detail by a government or other agency external to the institution. The CHESS Data Definitions include the text of all of the definitions and descriptions of these standards, as well as example entries for the other 480 data elements. The example entries are based on the data dictionaries of several higher education institutions, and all of the definitions include probable uses by referencing the appropriate administrative activities in the CHESS Taxonomy. For more information, contact the author at crt@nwu.edu.
Charles R. Thomas (crt@nwu.edu) is a senior consultant with NCHEMS, where he specializes in assisting colleges, universities, and other non-profit organizations with strategic planning for information technology. He is also President of the Consortium for Higher Education Software Services (CHESS). Thomas was the first executive director of CAUSE and one of the four founding staff members of NCHEMS.