After reading this article you will learn about:- 1. Meaning of Data Warehouse 2. Characteristics of Data Warehouse 3. Benefits 4. Dimensions.
Meaning of Data Warehouse:
As companies have grown larger they have become separated both geographically and culturally from the markets and customers they serve. Disney, an American corporation, has operations in Europe, Asia and Australasia, as well as in the USA. Benetton, the French fashion brand has operations across five continents. In retailing alone it operates over 7000 stores and concessions.
Companies such as these generate a huge volume of data that needs to be converted into information that can be used for both operational and analytical purposes. The data warehouse is a solution to that problem. Data ware houses are really no more than repositories of large amounts of operational, historical and other customer-related data.
Data volume can reach terabyte levels, i.e. 2 40 bytes of data. A warehouse is a repository for data imported from other databases. Attached to the front end of the warehouse is a set of analytical procedures for making sense out of the data. Retailers, home shopping companies and banks have been early adopters of data warehouses. Different people have different definitions for a data warehouse.
The most popular definition came from Bill Inmon, who provided the following:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
Integrated:
A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
“A data warehouse is a copy of transaction data specifically structured for query and analysis.”
This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built like Inmon did, rather he focused on the functionality of a data warehouse. A data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.
A data warehouse maintains its functions in three layers:
Layer:1 Staging
Layer: 2 Integration
Layer: 3 access.
Staging is used to store raw data for use by developers. The integration layer is used to integrate data and to have a level of abstraction from users. The access layer is for getting data out for users. One thing to mention about data warehouse is that they can be subdivided into data marts.
With data marts it stores subsets of data from a warehouse, which focuses on a specific aspect of a company like sales or a marketing process. This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support.
However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
Characteristics of Data Warehouse:
i. Subject-oriented :
The warehouse organizes data around the essential subjects of the business (customers and products) rather than around applications such as inventory management or order processing.
i.Integrated:
It is consistent in the way that data from several sources is extracted and transformed. For example, coding conventions are standardized: M _ male, F _ female.
ii. Time-variant:
Data are organized by various time-periods (e.g. months).
iii. Non-volatile:
The warehouse’s database is not updated in real time. There is periodic bulk uploading of transactional and other data. This makes the data less subject to momentary change. There are a number of steps and processes in building a warehouse.
First, you must identify where the relevant data is stored. This can be a challenge.When the Commonwealth Bank opted to implement CRM in its retail banking business, it found that relevant customer data were resident on over 80 separate systems.
Secondly, data must be extracted from those systems. It is possible that when these systems were developed they were not expected to align with other systems. The data then needs to be transformed into a standardized, consistent and clean format. Data in different systems may have been stored in different forms. Also, the cleanliness of data from different parts of the business may vary.
The culture in sales may be very driven by quarterly performance targets. Getting sales representatives to maintain their customer fi les may be not straightforward. Much of their information may be in their heads. On the other hand, direct marketers may be very dedicated to keeping their data in good shape.
After transformation, the data then needs to be uploaded into the warehouse. Archival data that have little relevance to today’s operations may be set aside, or only uploaded if there is sufficient space. Recent operational and transactional data from the various functions, channels and touch points will most probably be prioritized for uploading. Refreshing the data in the warehouse is important. This may be done on a daily or weekly basis depending upon the speed of change in the business and its environment
Benefits of a Data Warehouse:
A data warehouse maintains a copy of information from the source transaction systems.
This architectural complexity provides the opportunity to:
a. Maintain data history, even if the source transaction systems do not.
b. Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
c. Improve data, by providing consistent codes and descriptions, flagging or even fixing bad data.
d. Present the organization’s information consistently.
e. Provide a single common data model for all data of interest regardless of the data’s source.
f. Restructure the data so that it makes sense to the business users.
g. Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
h. Add value to operational business applications, notably customer relationship management (CRM) systems.
Dimensions of Data Warehouse:
A dimension is a data element that categorizes each item in a data set into non-overlapping regions. A data warehouse dimension provides the means to “slice and dice” data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, “Customer”, “Date”, and “Product” are all dimensions that could be applied meaningfully to a sales receipt. A dimensional data element is similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female or unknown, a user of the data warehouse would then be able to filter or categorize each presentation or report by either filtering based on the gender dimension or displaying results broken out by the gender.
Each dimension in a data warehouse may have one or more hierarchies applied to it. For the “Date” dimension, there are several possible hierarchies: “Day > Month > Year”, “Day > Week > Year”, “Day > Month > Quarter > Year”, etc.
Types of dimension:
a. Conformed dimension
b. Junk dimension
c. Degenerate dimension
d. Role-playing dimension
a. Conformed dimension:
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation.
Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Most important, the row headers produced in the answer sets from two different conformed dimensions must be able to match perfectly.
Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several different flavors.
At the most basic level, conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.
b. Junk dimension:
A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework.
A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.
One solution is to create a new dimension for each of the remaining attributes, but due to their nature, it could be necessary to create a vast number of new dimensions resulting in a fact table with a very large number of foreign keys. The designer could also decide to leave the remaining attributes in the fact table but this could make the row length of the table unnecessarily large if, for example, the attributes is a long text string.
The solution to this challenge is to identify all the attributes and then put them into one or several Junk Dimensions. One Junk Dimension can hold several true/false or yes/no indicators that have no correlation with each other, so it would be convenient to convert the indicators into a more describing attribute. An example would be an indicator about whether a package had arrived, instead of indicating this as “yes” or “no”, it would be converted into “arrived” or “pending” in the junk dimension.
The designer can choose to build the dimension table so it ends up holding all the indicators occurring with every other indicator so that all combinations are covered. This sets up a fixed size for the table itself which would be 2x rows, where x is the number of indicators. This solution is appropriate in situations where the designer would expect to encounter a lot of different combinations and where the possible combinations are limited to an acceptable level.
In a situation where the number of indicators are large, thus creating a very big table or where the designer only expect to encounter a few of the possible combinations, it would be more appropriate to build each row in the junk dimension as new combinations are encountered. To limit the size of the tables, multiple junk dimensions might be appropriate in other situations depending on the correlation between various indicators.
Junk dimensions are also appropriate for placing attributes like non-generic comments from the fact table. Such attributes might consist of data from an optional comment field when a customer places an order and as a result will probably be blank in many cases. Therefore the junk dimension should contain a single row representing the blanks as a surrogate key that will be used in the fact table for every row returned with a blank comment field
c. Degenerate dimension:
A dimension key, such as a transaction number, invoice number, ticket number, or bill-of- lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table’s primary key.
d. Role-playing dimensions:
Dimensions are often recycled for multiple applications within the same database. For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”. This is often referred to as a “role-playing dimension”.