Method and Apparatus for utilizing Summary Indexes in a database management system.


Author:David C. Sisk
Date: 03/06/98
Note: Confidential and proprietary document


-Application Transmittal Form
- Fee Transmittal Form



Title of the Invention:

Method and apparatus for utilizing Summary Indexes in a Database Management System


Cross Reference to related applications:

None.


Statement offederally sponsored research/development:

None.


Reference to a microfiche appendix:

None.


Background of the invention:

Database management systems typically use some form of indexing method to efficiently perform read, update, and delete operations on rows or records in a table.Very typical index types include B-tree indexes and Bitmap indexes.These index types store information about each row in the associated table.Some conceptual background is provided on these common index types below.

A B-tree index on a particular table, for instance, would store a row ID or record number that points back to the specific row in the associated table (ie., there is a one-to-one correspondence between a row in the table and a row in each index assigned to the table).Along with the row ID, the value of the key field or column upon which the index has been created would also be stored, typically in the order specified upon creation of the index in the DBMS.Please see figure 1A and 1B below.In other words, given a table with row ID, column1, column2,…, column N, a B-tree index created on column1 would contain row ID and column1 with the records in the index sorted by column1 rather than the random order of the column1 values in the table.In this way, a query that filters the rows to be operated on (whether read, update, or delete) can be identified more quickly by the DBMS by first reading the index to retrieve the row ID(s), then retrieving the full row(s) from the table using the row ID(s) from the index.B-tree indexes tend to be very effective when built on high-selectivity columns (ie. columns where the ratio of number of distinct values in the column divided by the total number of rows in the table is close or equal to the value of 1.00) if the number of rows retrieved is a relatively small percentage of the total number of rows in the entire table.This characteristic tends to be particularly useful in on-line transaction processing systems.Most DBMS's (particularly relational DBMS's using standard Structured Query Language or SQL to access rows) provide query optimizer functionality that will estimate or calculate whether a particular query can be satisfied most efficiently by accessing the table directly, or by using one or more indexes to access the table in question.In some cases, a B-tree index can be constructed such that the results of a read-only query can be satisfied entirely by accessing only the index (ie. does not have to perform reads directly on the table because all of the necessary values are already physically stored in the index).Please see figure 1C below.When an index is created with all of the columns necessary to satisfy a particular variety of read-only queries in this manner, it is typically referred to as a "full-table" index.

Figure 1A: Logical structure of sample table TABLE1
Row ID Column1 Column2 Column3 Column4
01 XBAA 1687 AAAA 5498
02 ZEZA 8903 AAAA 3245
03 BBDX 4598 CCCC 7865
04 TSAZ 2382 AAAA 1967
05 RRCA 1496 BBBB 3491
06 GZFB 2597 BBBB 5917
07 LSAE 3623 CCCC 8319
08 BRRA 6413 CCCC 7227
09 ETLP 3791 AAAA 4628
10 JTTL 5615 AAAA 1671

Figure 1B: Logical structure of single-level B-tree index on TABLE1.COLUMN1.
Branch node Column1 Row ID
AAAA-FZZZ BBDX 03
  BRRA 08
  ETLP 09
GAAA-RZZZ GZFB 06
  JTTL 10
  LSAE 07
  RRCA 05
SAAA-ZZZZ TSAZ 04
  XBAA 01
  ZEZA 02

Figure 1C:Logical structure of "full-table" single-level B-tree index on all columns of TABLE1.
Branch node Column1 Column3 Column2 Column4 Row ID
AAAA-FZZZ BBDX CCCC 4598 7865 03
  BRRA CCCC 6413 7227 08
  ETLP AAAA 3791 4628 09
GAAA-RZZZ GZFB BBBB 2597 5917 06
  JTTL AAAA 5615 1671 10
  LSAE CCCC 3623 8319 07
  RRCA BBBB 1496 3491 05
SAAA-ZZZZ TSAZ AAAA 2382 1967 04
  XBAA AAAA 1687 5498 01
  ZEZA AAAA 8903 3245 02


In comparison, a Bitmap index on a particular table would typically store in the index header the set of distinct values contained in the indexed table column, then the row ID and a string of bits with each bit representing the on/off value of the respective distinct value stored in the index header.Please see figure 1D below.If the selectivity of the indexed column is very low (0.1 or less), then the physical size of the entire bitmap index will be much smaller than a B-tree index created on the same column.Since this index type is physically much smaller for low selectivity columns, it is much faster to read than a B-tree index on the same column.Because of it's method of construction, however, this index type is very inefficient to update.For instance, if an insert were performed on the associated table that introduced a new distinct value into the indexed column, it would be necessary to append a new bit to each row in the index as well as appending the new distinct value to the header.Due to this characteristic, this index type is typically limited to read-only systems (such as data warehouses or decision support systems) rather than online transaction processing systems.

Figure 1D:Logical structure of bitmap index TABLE1_I03 on TABLE1.COLUMN3
AAAA BBBB CCCC Row ID
1 0 0 01
1 0 0 02
1 0 0 04
1 0 0 09
1 0 0 10
0 1 0 05
0 1 0 06
0 0 1 03
0 0 1 07
0 0 1 08

Please note that the above figures present logical representations of the contents of the index structures.Various database vendors have chosen to physically implement these index structures in various manners.For instance, Row ID may be the first attribute stored in the index rather than the last attribute.Likewise, the bitmap index may actually be physically stored in a B-tree structure.The B-tree structure may be single-level or multi-level.The B-tree index may be implemented as an ISAM index by omitting the branch nodes from the logical structure.The physical implementation specifics will vary with the particular database vendor and will typically be decided by each vendor based upon their particular objectives and practices.

Data warehouses, decision support systems, operational reporting systems, and other types of read-only access systems have gained tremendous popularity in the recent decade because of the strategic value they provide to corporate knowledge workers.These types of database systems tend to consist of tables with low-selectivity columns (rather than the high-selectivity columns found in OLTP systems), and can typically grow to sizes that are orders of magnitude greater than typical OLTP systems.The two most common indexing techniques utilized in these types of systems include B-tree indexes constructed as "full-table indexes", and bitmap indexes.In either case, there is a one-to-one correspondence between a row in the table, and the record stored in each associated index.In other words, if there are 100 million rows in the table, then there will be 100 million records in the index, regardless of whether it is of type B-tree or Bitmap (or any other currently existing index type).

An index type is needed that will allow data warehousing or decision support type queries (which typically perform reads for summary or calculation on large numbers of rows) to be satisfied in a much more efficient manner than is currently possible with available indexing methods.The summary index will satisfy this need.




Brief Summary of the Invention:

summary indexes will provide the necessary abbreviated structure to store the information from the associated table in a pre-summarized or pre-aggregated format.The number of records in the aggregate index will be a function of the selectivity of the indexed column(s) in the associated table, rather than a function of the number of rows in the associated table.By storing the data pre-summarized based on chosen key columns in the associated table, the size of the summary index will typically be several orders of magnitude smaller than the associated table, given that the selectivity of the indexed columns is typical of data warehousing and decision support systems.Since the physical size of an summary index is orders of magnitude smaller than the associated table, read-only queries based on the key column upon which the index is built can execute orders of magnitude faster.One or more summary indexes can be created on a particular table's column set in much the same manner that B-tree or Bitmap indexes are created.summary indexes have the following distinctive characteristics:

1) An summary index will have a logical structure different from B-tree, Bitmap, or any other known and common index type in that the structure will not contain a row ID that references a specific row in the associated table.
2) An summary index will indirectly reference one or more rows in the associated table, rather than directly referencing one and only one row in the table (as in other index types such as B-tree or Bitmap).
3) A summary index will be orders of magnitude smaller in size and record count (and orders of magnitude faster to read) than a standard B-tree or Bitmap index build on the same column set, given low selectivity of the column set.
4) The physical size and number of records stored in the summary index will be a function of the selectivity of the indexed columns in the associated table rather than a function of the number of rows in the associated table.
5) The summary index can be utilized (typically by the DBMS's query optimizer) to satisfy read-only calculation queries without having to access the associated table.
6) The summary index can be maintained by the DBMS's index maintenance procedures as rows are added, changed, or removed from the associated table in a manner similar to maintenance that occurs with B-tree or Bitmap indexes.The algorithm to perform this maintenance will be quite different given that the records in the aggregate index indirectly reference a group of rows in the associated table rather than directly referencing a single row in the associated table.
7) One or more aggregate indexes can be created on any particular table's collection of columns in a manner very similar to current methods of creating a B-tree or Bitmap index on a table's columns.The procedure used by the DBMS to create the aggregate index will be quite different from the procedure used to create a B-tree, Bitmap, or other currently known index types, although the index creation command could (and should) look very similar in syntax to commands currently used to create more common index types.



Brief Description of the several views of the drawing:

None.


Detailed Description of the Invention:

Most database systems for online transaction processing or data warehousing are of the relational variety, and typically use near-ANSI standard Structured Query Language (or SQL) for data access.Therefore, all examples given in this section will employ the use of SQL and will specifically refer to relational DBMS's, even though the indexing method presented is applicable to other types of DBMS's, including hierarchical, object, object-relational, etc.

An summary index is a pre-summarized or pre-aggregated collection of information indirectly related back to the detailed information contained in it's associated table.According to common data warehousing and decision support system terminology, the set of columns in a table that would typically be used in filtering or grouping conditions for a query are called "dimensions" or "dimension columns";the set of columns that would typically have calculations performed on them (such as sums, counts, or averages) are called "measures" or "measure columns".The table containing the rows that populate the above-mentioned columns is typically called the "fact table".It is typical of a data warehouse, decision support system, data mart, etc., to have dimension columns that have a low selectivity.An aggregate index would be built by summarizing the chosen measure columns based on a chosen set of dimension columns, then storing this pre-aggregated information in the index structure.

Consider the fact table logical structure with sample values listed in figure 1.The table contains 100 million rows.For illustration purposes, consider that Column DIMENSION_1 contains 100 distinct values, column DIMENSION_2 contains 50 distinct values, column DIMENSION_3 contains 20 distinct values, and column DIMENSION_4 contains 1000 distinct values.Each of these dimension columns has an extremely low selectivity (Number of distinct values/total number of rows).In typical data warehouse or decision support systems, the selectivity of each dimension column will be very low, normally ranging from as low as 0.0001 to possibly as high as 1.0.The collection of dimension columns forms the primary key of FACT_TABLE_1 (ie. the four dimension columns concatenated together form a unique identifier for each row in addition to the DBMS-assigned row ID).


Figure 1:FACT_TABLE_1

Row ID Dimension_1 Dimension_2 Dimension_3 Dimension_4 Measure_1 Measure_2
000000001 AAAA 1111 XXXX 1122 100 2000
000000002 AAAA 2222 YYYY 1133 150 1250
000000003 AAAA 2222 YYYY 2266 75 1750
000000004 BBBB 1111 YYYY 7799 225 1000
000000005 CCCC 3333 XXXX 8800 200 1500
000000006 BBBB 1111 ZZZZ 7755 125 2500
000000007 BBBB 4444 YYYY 4488 100 2250
000000008 AAAA 2222 XXXX 2211 175 2000
000000009 AAAA 6666 ZZZZ 6600 200 750
000000010 CCCC 1111 ZZZZ 5522 75 1500
000000011 BBBB 2222 ZZZZ 1144 50 1000
000000012 EEEE 2222 ZZZZ 3322 100 1250
000000013 BBBB 4444 ZZZZ 8855 250 2500
000000014 DDDD 0000 XXXX 5522 175 750
000000015 CCCC 7777 YYYY 1122 25 500
000000016 CCCC 5555 ZZZZ 1122 225 500
000000017 AAAA 8888 YYYY 4488 175 1000
000000018 CCCC 1111 YYYY 3322 100 1500
000000019 BBBB 2222 XXXX 4488 200 750
000000020 DDDD 9999 XXXX 7755 150 2000
…. …. …. …. …. …. ….
100000000 BBBB 4444 YYYY 5522 250 1500

Figure 2A below is a logical representation of a summary index created on column DIMENSION_1, along with a very probable SQL statement that would be used to create the index. (Note that DBMS vendors will typically have a semi-proprietary index creation statement syntax specific to their particular DBMS.The index creation statement is given merely as an example of the likely syntax of any particular DBMS vendor's index creation statement for this particular index type.)Example SQL queries that could be satisfied by being directed to the aggregate index by the DBMS's query optimizer are shown in Figure 2B.Each dimension value appears only once in the index while it appears many times in the associated table.Each chosen measure value is summed and stored based on the chosen dimension value, and an optional count is recorded as well.The count value is optional, but adds value in that it can be used to satisfy a query requesting a count, or can be used as the divisor in a query requesting an average.Note that since column DIMENSION_1 in the fact table contains 100 distinct values, the aggregate index can contain a maximum of 100 records, regardless of the number of rows in FACT_TABLE_1.Regardless of the details surrounding the physical implementation of this index, SUM_INDEX_01 will be approximately one millionth the size of FACT_TABLE_1, and will consequently require approximately one-millionth the time and computing resources to read it, thus satisfying any of the queries listed in Figure 2B.In fact, almost any query requesting summary information with dimension_1 as the predicate of the query could be satisfied by directing the execution toward the summary index rather than the associated table.Any of these queries would execute tremendously faster (approximately 1 million times faster) by having the DBMS's query optimizer direct the read activity toward the aggregate index rather than the table.

Figure 2A:summary index on FACT_TABLE_1.DIMENSION_1

CREATE SUMMARY INDEX sum_index_01 ON fact_table_1 DIMENSIONS (dimension_1) MEASURES (measure_1, measure_2)…

Branch node Dimension_1 Sum(Measure_1) Sum(Measure_2) Count (Dimension_1)
AAAA-DZZZ AAAA 200550 3500250 15050
  BBBB 550150 1250005 10025
  CCCC 325550 9050550 50525
  DDDD 925025 5506250 25050
EAAA-RZZZ EEEE 700575 7752500 31075
  …. ….    
  …. ….    

Figure 2B:Sample SQL queries that can be resolved using SUM_INDEX_1

1 SELECT sum(measure_1)
FROM fact_table_1
WHERE dimension_1 = 'AAAA'
2 SELECT dimension_1, sum(measure_1)
FROM fact_table_1
GROUP BY dimension_1
3 SELECT sum(measure_1), average(measure_2)
FROM fact_table_1
WHERE dimension_1 IN ('AAAA','CCCC','EEEE')
4 SELECT dimension_1, average(measure_1), average(measure_2)
FROM fact_table_1
GROUP BY dimension_1
HAVING average(measure_1) > 10000
5 SELECT count(*)
FROM fact_table_1
WHERE dimension_1 IN ('AAAA','EEEE')

An summary index with a compound key (that is, more than one dimension column) can also be created on the fact table in question.In figure 3A below, an summary index has been created on columns DIMENSION_1 and DIMENSION_2.Recall that column DIMENSION_1 contains 100 distinct values and column DIMENSION_2 contains 50 distinct values in FACT_TABLE_1.Therefore, a summary index created on the DIMENSION_1 and DIMENSION_2 columns of FACT_TABLE_1 will contain a maximum of5000 records (100 * 50), making it approximately 1/20,000 the size of the associated table.Therefore, queries utilizing this index will execute approximately 20,000 times faster than queries reading the table directly.See figure 3B below for sample queries that can be resolved using SUM_INDEX_2.


Figure 3A:summary index on FACT_TABLE_1.DIMENSION_1, DIMENSION_2

CREATE SUMMARY INDEX sum_index_2 ON fact_table_1 DIMENSIONS (dimension_1, dimension_2) MEASURES (measure_1, measure_2) …

Branch node Dimension_1 Dimension_2 Sum(Measure_1) Sum(Measure_2) Count (dimension1, dimension2)
AAAA0000-EZZZ9999 AAAA 0000 150050 1250500 10250
  AAAA 1111 100150 1000250 7525
  AAAA 2222 250500 1500750 5510
  AAAA 3333 125050 1125250 11250
  AAAA 4444 100500 2505200 9250
  AAAA 5555 305105 2250250 8000
  AAAA 6666 150250 1250025 7250
  AAAA 7777 175750 3100500 6240
  AAAA 8888 225000 1200075 12260
  AAAA 9999 125750 1502450 10450
  BBBB 1111 425250 2225400 15051
  BBBB 2222 100225 500250 2575
  BBBB 3333 225550 1125750 8250
  BBBB 4444 150025 1750250 7500
  …. …. …. …. ….
  …. …. …. …. ….

Figure 3B:Sample SQL queries that can be resolved using SUM_INDEX_2
1 SELECT dimension_2, sum(measure_1)
FROM fact_table_1
WHERE dimension_1 = 'AAAA'
GROUP BY dimension_2
2 SELECT dimension_1, dimension_2, sum(measure_1)
FROM fact_table_1
GROUP BY dimension_1, dimension_2
3 SELECT sum(measure_1), average(measure_2)
FROM fact_table_1
WHERE dimension_1 IN ('AAAA','CCCC','EEEE')
AND dimension_2 = '2222'
4 SELECT dimension_1, dimension_2, average(measure_1), average(measure_2)
FROM fact_table_1
WHERE dimension_1 in ('AAAA','CCCC')
GROUP BY dimension_1, dimension_2
HAVING average(measure_1) > 10000
5 SELECT count(*)
FROM fact_table_1
WHERE dimension_1 IN ('AAAA','EEEE')
AND dimension_2 = '3333'




Logical structure of a summary index:



Purpose:



Definition:



Creation:



Maintenance:



Utilization:




Claim or claims:




Abstract of the Disclosure:

The logical structure of an summary index is defined, as well as the techniques involved in creating one or more summary indexes on a database table, allowing the DBMS to maintain the contents of the summary index(es), and allowing the DBMS's query optimizer to utilize the summary index(es) to resolve read-only queries.Summary indexes tremendously reduce the execution time of read-only queries that summarize or perform calculations on large numbers of logically related rows or records.


Drawings:

None.


Executed Oath or Declaration:


Sequence listing:

None.








Document converted from word 8 by MSWordView (mswordview 0.5.9)
MSWordView written by Caolan McNamara