Author:David C. Sisk
Date: 03/06/98
Note: Confidential and proprietary document
-Application Transmittal Form
- Fee Transmittal Form
Title of the Invention:
Method and apparatus for utilizing Summary Indexes in a Database Management System
Cross Reference to related applications:
None.
Statement offederally sponsored research/development:
None.
Reference to a microfiche appendix:
None.
Background of the invention:
Database management systems typically use some form of indexing method to efficiently perform read, update, and delete operations on rows or records in a table.Very typical index types include B-tree indexes and Bitmap indexes.These index types store information about each row in the associated table.Some conceptual background is provided on these common index types below.
A B-tree index on a particular table, for instance, would store a row ID or record number that points back to the specific row in the associated table (ie., there is a one-to-one correspondence between a row in the table and a row in each index assigned to the table).Along with the row ID, the value of the key field or column upon which the index has been created would also be stored, typically in the order specified upon creation of the index in the DBMS.Please see figure 1A and 1B below.In other words, given a table with row ID, column1, column2,…, column N, a B-tree index created on column1 would contain row ID and column1 with the records in the index sorted by column1 rather than the random order of the column1 values in the table.In this way, a query that filters the rows to be operated on (whether read, update, or delete) can be identified more quickly by the DBMS by first reading the index to retrieve the row ID(s), then retrieving the full row(s) from the table using the row ID(s) from the index.B-tree indexes tend to be very effective when built on high-selectivity columns (ie. columns where the ratio of number of distinct values in the column divided by the total number of rows in the table is close or equal to the value of 1.00) if the number of rows retrieved is a relatively small percentage of the total number of rows in the entire table.This characteristic tends to be particularly useful in on-line transaction processing systems.Most DBMS's (particularly relational DBMS's using standard Structured Query Language or SQL to access rows) provide query optimizer functionality that will estimate or calculate whether a particular query can be satisfied most efficiently by accessing the table directly, or by using one or more indexes to access the table in question.In some cases, a B-tree index can be constructed such that the results of a read-only query can be satisfied entirely by accessing only the index (ie. does not have to perform reads directly on the table because all of the necessary values are already physically stored in the index).Please see figure 1C below.When an index is created with all of the columns necessary to satisfy a particular variety of read-only queries in this manner, it is typically referred to as a "full-table" index.
Figure 1A: Logical structure of sample table TABLE1
Row ID | Column1 | Column2 | Column3 | Column4 |
01 | XBAA | 1687 | AAAA | 5498 |
02 | ZEZA | 8903 | AAAA | 3245 |
03 | BBDX | 4598 | CCCC | 7865 |
04 | TSAZ | 2382 | AAAA | 1967 |
05 | RRCA | 1496 | BBBB | 3491 |
06 | GZFB | 2597 | BBBB | 5917 |
07 | LSAE | 3623 | CCCC | 8319 |
08 | BRRA | 6413 | CCCC | 7227 |
09 | ETLP | 3791 | AAAA | 4628 |
10 | JTTL | 5615 | AAAA | 1671 |
Figure 1B: Logical structure of single-level B-tree index on TABLE1.COLUMN1.
Branch node | Column1 | Row ID |
AAAA-FZZZ | BBDX | 03 |
BRRA | 08 | |
ETLP | 09 | |
GAAA-RZZZ | GZFB | 06 |
JTTL | 10 | |
LSAE | 07 | |
RRCA | 05 | |
SAAA-ZZZZ | TSAZ | 04 |
XBAA | 01 | |
ZEZA | 02 |
Figure 1C:Logical structure of "full-table" single-level B-tree index on all columns of TABLE1.
Branch node | Column1 | Column3 | Column2 | Column4 | Row ID |
AAAA-FZZZ | BBDX | CCCC | 4598 | 7865 | 03 |
BRRA | CCCC | 6413 | 7227 | 08 | |
ETLP | AAAA | 3791 | 4628 | 09 | |
GAAA-RZZZ | GZFB | BBBB | 2597 | 5917 | 06 |
JTTL | AAAA | 5615 | 1671 | 10 | |
LSAE | CCCC | 3623 | 8319 | 07 | |
RRCA | BBBB | 1496 | 3491 | 05 | |
SAAA-ZZZZ | TSAZ | AAAA | 2382 | 1967 | 04 |
XBAA | AAAA | 1687 | 5498 | 01 | |
ZEZA | AAAA | 8903 | 3245 | 02 |
In comparison, a Bitmap index on a particular table would typically store in the index header the set of distinct values contained in the indexed table column, then the row ID and a string of bits with each bit representing the on/off value of the respective distinct value stored in the index header.Please see figure 1D below.If the selectivity of the indexed column is very low (0.1 or less), then the physical size of the entire bitmap index will be much smaller than a B-tree index created on the same column.Since this index type is physically much smaller for low selectivity columns, it is much faster to read than a B-tree index on the same column.Because of it's method of construction, however, this index type is very inefficient to update.For instance, if an insert were performed on the associated table that introduced a new distinct value into the indexed column, it would be necessary to append a new bit to each row in the index as well as appending the new distinct value to the header.Due to this characteristic, this index type is typically limited to read-only systems (such as data warehouses or decision support systems) rather than online transaction processing systems.
Figure 1D:Logical structure of bitmap index TABLE1_I03 on TABLE1.COLUMN3
AAAA | BBBB | CCCC | Row ID |
1 | 0 | 0 | 01 |
1 | 0 | 0 | 02 |
1 | 0 | 0 | 04 |
1 | 0 | 0 | 09 |
1 | 0 | 0 | 10 |
0 | 1 | 0 | 05 |
0 | 1 | 0 | 06 |
0 | 0 | 1 | 03 |
0 | 0 | 1 | 07 |
0 | 0 | 1 | 08 |
Please note that the above figures present logical representations of the contents of the index structures.Various database vendors have chosen to physically implement these index structures in various manners.For instance, Row ID may be the first attribute stored in the index rather than the last attribute.Likewise, the bitmap index may actually be physically stored in a B-tree structure.The B-tree structure may be single-level or multi-level.The B-tree index may be implemented as an ISAM index by omitting the branch nodes from the logical structure.The physical implementation specifics will vary with the particular database vendor and will typically be decided by each vendor based upon their particular objectives and practices.
Data warehouses, decision support systems, operational reporting systems, and other types of read-only access systems have gained tremendous popularity in the recent decade because of the strategic value they provide to corporate knowledge workers.These types of database systems tend to consist of tables with low-selectivity columns (rather than the high-selectivity columns found in OLTP systems), and can typically grow to sizes that are orders of magnitude greater than typical OLTP systems.The two most common indexing techniques utilized in these types of systems include B-tree indexes constructed as "full-table indexes", and bitmap indexes.In either case, there is a one-to-one correspondence between a row in the table, and the record stored in each associated index.In other words, if there are 100 million rows in the table, then there will be 100 million records in the index, regardless of whether it is of type B-tree or Bitmap (or any other currently existing index type).
An index type is needed that will allow data warehousing or decision support type queries (which typically perform reads for summary or calculation on large numbers of rows) to be satisfied in a much more efficient manner than is currently possible with available indexing methods.The summary index will satisfy this need.
Brief Summary of the Invention:
summary indexes will provide the necessary abbreviated structure to store the information from the associated table in a pre-summarized or pre-aggregated format.The number of records in the aggregate index will be a function of the selectivity of the indexed column(s) in the associated table, rather than a function of the number of rows in the associated table.By storing the data pre-summarized based on chosen key columns in the associated table, the size of the summary index will typically be several orders of magnitude smaller than the associated table, given that the selectivity of the indexed columns is typical of data warehousing and decision support systems.Since the physical size of an summary index is orders of magnitude smaller than the associated table, read-only queries based on the key column upon which the index is built can execute orders of magnitude faster.One or more summary indexes can be created on a particular table's column set in much the same manner that B-tree or Bitmap indexes are created.summary indexes have the following distinctive characteristics:
1) An summary index will have a logical structure different from B-tree, Bitmap, or any other known and common index type in that the structure will not contain a row ID that references a specific row in the associated table.
2) An summary index will indirectly reference one or more rows in the associated table, rather than directly referencing one and only one row in the table (as in other index types such as B-tree or Bitmap).
3) A summary index will be orders of magnitude smaller in size and record count (and orders of magnitude faster to read) than a standard B-tree or Bitmap index build on the same column set, given low selectivity of the column set.
4) The physical size and number of records stored in the summary index will be a function of the selectivity of the indexed columns in the associated table rather than a function of the number of rows in the associated table.
5) The summary index can be utilized (typically by the DBMS's query optimizer) to satisfy read-only calculation queries without having to access the associated table.
6) The summary index can be maintained by the DBMS's index maintenance procedures as rows are added, changed, or removed from the associated table in a manner similar to maintenance that occurs with B-tree or Bitmap indexes.The algorithm to perform this maintenance will be quite different given that the records in the aggregate index indirectly reference a group of rows in the associated table rather than directly referencing a single row in the associated table.
7) One or more aggregate indexes can be created on any particular table's collection of columns in a manner very similar to current methods of creating a B-tree or Bitmap index on a table's columns.The procedure used by the DBMS to create the aggregate index will be quite different from the procedure used to create a B-tree, Bitmap, or other currently known index types, although the index creation command could (and should) look very similar in syntax to commands currently used to create more common index types.
Brief Description of the several views of the drawing:
None.
Detailed Description of the Invention:
Most database systems for online transaction processing or data warehousing are of the relational variety, and typically use near-ANSI standard Structured Query Language (or SQL) for data access.Therefore, all examples given in this section will employ the use of SQL and will specifically refer to relational DBMS's, even though the indexing method presented is applicable to other types of DBMS's, including hierarchical, object, object-relational, etc.
An summary index is a pre-summarized or pre-aggregated collection of information indirectly related back to the detailed information contained in it's associated table.According to common data warehousing and decision support system terminology, the set of columns in a table that would typically be used in filtering or grouping conditions for a query are called "dimensions" or "dimension columns";the set of columns that would typically have calculations performed on them (such as sums, counts, or averages) are called "measures" or "measure columns".The table containing the rows that populate the above-mentioned columns is typically called the "fact table".It is typical of a data warehouse, decision support system, data mart, etc., to have dimension columns that have a low selectivity.An aggregate index would be built by summarizing the chosen measure columns based on a chosen set of dimension columns, then storing this pre-aggregated information in the index structure.
Consider the fact table logical structure with sample values listed in figure 1.The table contains 100 million rows.For illustration purposes, consider that Column DIMENSION_1 contains 100 distinct values, column DIMENSION_2 contains 50 distinct values, column DIMENSION_3 contains 20 distinct values, and column DIMENSION_4 contains 1000 distinct values.Each of these dimension columns has an extremely low selectivity (Number of distinct values/total number of rows).In typical data warehouse or decision support systems, the selectivity of each dimension column will be very low, normally ranging from as low as 0.0001 to possibly as high as 1.0.The collection of dimension columns forms the primary key of FACT_TABLE_1 (ie. the four dimension columns concatenated together form a unique identifier for each row in addition to the DBMS-assigned row ID).
Figure 1:FACT_TABLE_1
Row ID | Dimension_1 | Dimension_2 | Dimension_3 | Dimension_4 | Measure_1 | Measure_2 |
000000001 | AAAA | 1111 | XXXX | 1122 | 100 | 2000 |
000000002 | AAAA | 2222 | YYYY | 1133 | 150 | 1250 |
000000003 | AAAA | 2222 | YYYY | 2266 | 75 | 1750 |
000000004 | BBBB | 1111 | YYYY | 7799 | 225 | 1000 |
000000005 | CCCC | 3333 | XXXX | 8800 | 200 | 1500 |
000000006 | BBBB | 1111 | ZZZZ | 7755 | 125 | 2500 |
000000007 | BBBB | 4444 | YYYY | 4488 | 100 | 2250 |
000000008 | AAAA | 2222 | XXXX | 2211 | 175 | 2000 |
000000009 | AAAA | 6666 | ZZZZ | 6600 | 200 | 750 |
000000010 | CCCC | 1111 | ZZZZ | 5522 | 75 | 1500 |
000000011 | BBBB | 2222 | ZZZZ | 1144 | 50 | 1000 |
000000012 | EEEE | 2222 | ZZZZ | 3322 | 100 | 1250 |
000000013 | BBBB | 4444 | ZZZZ | 8855 | 250 | 2500 |
000000014 | DDDD | 0000 | XXXX | 5522 | 175 | 750 |
000000015 | CCCC | 7777 | YYYY | 1122 | 25 | 500 |
000000016 | CCCC | 5555 | ZZZZ | 1122 | 225 | 500 |
000000017 | AAAA | 8888 | YYYY | 4488 | 175 | 1000 |
000000018 | CCCC | 1111 | YYYY | 3322 | 100 | 1500 |
000000019 | BBBB | 2222 | XXXX | 4488 | 200 | 750 |
000000020 | DDDD | 9999 | XXXX | 7755 | 150 | 2000 |
…. | …. | …. | …. | …. | …. | …. |
100000000 | BBBB | 4444 | YYYY | 5522 | 250 | 1500 |
Figure 2A below is a logical representation of a summary index created on column DIMENSION_1, along with a very probable SQL statement that would be used to create the index. (Note that DBMS vendors will typically have a semi-proprietary index creation statement syntax specific to their particular DBMS.The index creation statement is given merely as an example of the likely syntax of any particular DBMS vendor's index creation statement for this particular index type.)Example SQL queries that could be satisfied by being directed to the aggregate index by the DBMS's query optimizer are shown in Figure 2B.Each dimension value appears only once in the index while it appears many times in the associated table.Each chosen measure value is summed and stored based on the chosen dimension value, and an optional count is recorded as well.The count value is optional, but adds value in that it can be used to satisfy a query requesting a count, or can be used as the divisor in a query requesting an average.Note that since column DIMENSION_1 in the fact table contains 100 distinct values, the aggregate index can contain a maximum of 100 records, regardless of the number of rows in FACT_TABLE_1.Regardless of the details surrounding the physical implementation of this index, SUM_INDEX_01 will be approximately one millionth the size of FACT_TABLE_1, and will consequently require approximately one-millionth the time and computing resources to read it, thus satisfying any of the queries listed in Figure 2B.In fact, almost any query requesting summary information with dimension_1 as the predicate of the query could be satisfied by directing the execution toward the summary index rather than the associated table.Any of these queries would execute tremendously faster (approximately 1 million times faster) by having the DBMS's query optimizer direct the read activity toward the aggregate index rather than the table.
Figure 2A:summary index on FACT_TABLE_1.DIMENSION_1
CREATE SUMMARY INDEX sum_index_01 ON fact_table_1 DIMENSIONS (dimension_1) MEASURES (measure_1, measure_2)…
Branch node | Dimension_1 | Sum(Measure_1) | Sum(Measure_2) | Count (Dimension_1) |
AAAA-DZZZ | AAAA | 200550 | 3500250 | 15050 |
BBBB | 550150 | 1250005 | 10025 | |
CCCC | 325550 | 9050550 | 50525 | |
DDDD | 925025 | 5506250 | 25050 | |
EAAA-RZZZ | EEEE | 700575 | 7752500 | 31075 |
…. | …. | |||
…. | …. |
Figure 2B:Sample SQL queries that can be resolved using SUM_INDEX_1
1 |
SELECT sum(measure_1) FROM fact_table_1 WHERE dimension_1 = 'AAAA' |
2 |
SELECT dimension_1, sum(measure_1) FROM fact_table_1 GROUP BY dimension_1 |
3 |
SELECT sum(measure_1), average(measure_2) FROM fact_table_1 WHERE dimension_1 IN ('AAAA','CCCC','EEEE') |
4 |
SELECT dimension_1, average(measure_1), average(measure_2) FROM fact_table_1 GROUP BY dimension_1 HAVING average(measure_1) > 10000 |
5 |
SELECT count(*) FROM fact_table_1 WHERE dimension_1 IN ('AAAA','EEEE') |
An summary index with a compound key (that is, more than one dimension column) can also be created on the fact table in question.In figure 3A below, an summary index has been created on columns DIMENSION_1 and DIMENSION_2.Recall that column DIMENSION_1 contains 100 distinct values and column DIMENSION_2 contains 50 distinct values in FACT_TABLE_1.Therefore, a summary index created on the DIMENSION_1 and DIMENSION_2 columns of FACT_TABLE_1 will contain a maximum of5000 records (100 * 50), making it approximately 1/20,000 the size of the associated table.Therefore, queries utilizing this index will execute approximately 20,000 times faster than queries reading the table directly.See figure 3B below for sample queries that can be resolved using SUM_INDEX_2.
Figure 3A:summary index on FACT_TABLE_1.DIMENSION_1, DIMENSION_2
CREATE SUMMARY INDEX sum_index_2 ON fact_table_1 DIMENSIONS (dimension_1, dimension_2) MEASURES (measure_1, measure_2) …
Branch node | Dimension_1 | Dimension_2 | Sum(Measure_1) | Sum(Measure_2) | Count (dimension1, dimension2) |
AAAA0000-EZZZ9999 | AAAA | 0000 | 150050 | 1250500 | 10250 |
AAAA | 1111 | 100150 | 1000250 | 7525 | |
AAAA | 2222 | 250500 | 1500750 | 5510 | |
AAAA | 3333 | 125050 | 1125250 | 11250 | |
AAAA | 4444 | 100500 | 2505200 | 9250 | |
AAAA | 5555 | 305105 | 2250250 | 8000 | |
AAAA | 6666 | 150250 | 1250025 | 7250 | |
AAAA | 7777 | 175750 | 3100500 | 6240 | |
AAAA | 8888 | 225000 | 1200075 | 12260 | |
AAAA | 9999 | 125750 | 1502450 | 10450 | |
BBBB | 1111 | 425250 | 2225400 | 15051 | |
BBBB | 2222 | 100225 | 500250 | 2575 | |
BBBB | 3333 | 225550 | 1125750 | 8250 | |
BBBB | 4444 | 150025 | 1750250 | 7500 | |
…. | …. | …. | …. | …. | |
…. | …. | …. | …. | …. |
Figure 3B:Sample SQL queries that can be resolved using SUM_INDEX_2
1 |
SELECT dimension_2, sum(measure_1) FROM fact_table_1 WHERE dimension_1 = 'AAAA' GROUP BY dimension_2 |
2 |
SELECT dimension_1, dimension_2, sum(measure_1) FROM fact_table_1 GROUP BY dimension_1, dimension_2 |
3 |
SELECT sum(measure_1), average(measure_2) FROM fact_table_1 WHERE dimension_1 IN ('AAAA','CCCC','EEEE') AND dimension_2 = '2222' |
4 |
SELECT dimension_1, dimension_2, average(measure_1), average(measure_2) FROM fact_table_1 WHERE dimension_1 in ('AAAA','CCCC') GROUP BY dimension_1, dimension_2 HAVING average(measure_1) > 10000 |
5 |
SELECT count(*) FROM fact_table_1 WHERE dimension_1 IN ('AAAA','EEEE') AND dimension_2 = '3333' |
Logical structure of a summary index:
Purpose:
Definition:
Creation:
Maintenance:
Utilization:
Claim or claims:
Abstract of the Disclosure:
The logical structure of an summary index is defined, as well as the techniques involved in creating one or more summary indexes on a database table, allowing the DBMS to maintain the contents of the summary index(es), and allowing the DBMS's query optimizer to utilize the summary index(es) to resolve read-only queries.Summary indexes tremendously reduce the execution time of read-only queries that summarize or perform calculations on large numbers of logically related rows or records.
Drawings:
None.
Executed Oath or Declaration:
Sequence listing:
None.
Document converted from word 8 by
MSWordView (mswordview 0.5.9)
MSWordView written by Caolan McNamara