Histogram are ways to store detailed information about column data for CBO optimizer to interpret the optimal access path for the query.
If there are no Histogram or table statistics are gathered with method_opt=>'FOR ALL COLUMNS SIZE 1', the optimizer checks the dictionary for:
1. High/Low (USER_TAB_COL_STATISTICS.LOW_VALUE and HIGH_VALUE) of column value,
2. Number of Distinct value (NDV),
3. The number of NULL and number of rows.
CBO generates the column selectivity of 1/NDV.
if there are a non-uniform distribution of data in the column, then CBO using default way of column selectivity cause performance degradation.
In terms of implementation, we could choose to store every distinct value together with the number of rows for that value.
For a small number of values this is efficient and 'width balanced' histograms are used.
For a higher number of distinct values, we should use Height Balanced Histogram
If the number of distinct values is less than or equal to the number of histogram buckets specified (up to 254) then a Frequency Histogram is created.
If the number of distinct values is greater than the number of histogram buckets specified, a Height Balanced Histogram is created.
Frequency Histograms:
These use buckets to record the row count for each distinct value.
Height Balanced Histograms:
These are implemented by dividing the data up into different 'buckets' where
each bucket contains the same number of values. The highest value in each bucket (or END_POINT) is recorded together with the lowest value in the "zero" bucket.
Once the data is recorded in buckets we recognize 2 types of data value - Non-popular values and popular values.
Non-popular values - are those that do not occur multiple times as endpoints.
Popular values - occur multiple times as end points.
We can use Popular and Non-Popular Values to provide users with various statistics. Since we know how many values there are in a bucket we can use this information to estimate the number of rows in total that is covered by Popular and Non-Popular values.
The selectivity for popular values can be obtained by calculation the proportion of bucket endpoints filled by that popular value.
The selectivity for nonpopular values can now be calculated as 1/number non-popular bucket endpoints, so we can now be more accurate about selectivities than the original 1/NDV because we have removed the popular values from the equation.
ALL_TAB_HISTOGRAMS describes histograms on tables and views accessible to the current user.
The ALL_TAB_HISTOGRAMS view contains a one-bucket histogram, which in fact signifies "No histogram" to the Oracle Database software.
Therefore, it should not be queried to indicate the presence or absence of a histogram on a particular column. Instead, query the value of column HISTOGRAM in the ALL_TAB_COL_STATISTICS view.
Density is a column statistic and provides selectivity estimates for
equi-join predicates (e.g. and A.COL1 = B.COL1) and equality predicates
(e.g. COL1 = 'Y').
The density is expressed as a decimal number between 0 and 1.
Values close to 1 indicate that this column is unselective
Values close to 0 indicate that this column is highly selective
The more selective a column, the less rows are likely to be returned by a
query referencing this column in its predicate list.
The column selectivity is part of the equation used to decide on the best
path for a query to take to retrieve the data required in the most effective
manner and hence impacts the final cost value for the query.
If there are no Histogram or table statistics are gathered with method_opt=>'FOR ALL COLUMNS SIZE 1', the optimizer checks the dictionary for:
1. High/Low (USER_TAB_COL_STATISTICS.LOW_VALUE and HIGH_VALUE) of column value,
2. Number of Distinct value (NDV),
3. The number of NULL and number of rows.
CBO generates the column selectivity of 1/NDV.
if there are a non-uniform distribution of data in the column, then CBO using default way of column selectivity cause performance degradation.
In terms of implementation, we could choose to store every distinct value together with the number of rows for that value.
For a small number of values this is efficient and 'width balanced' histograms are used.
For a higher number of distinct values, we should use Height Balanced Histogram
If the number of distinct values is less than or equal to the number of histogram buckets specified (up to 254) then a Frequency Histogram is created.
If the number of distinct values is greater than the number of histogram buckets specified, a Height Balanced Histogram is created.
Frequency Histograms:
These use buckets to record the row count for each distinct value.
Height Balanced Histograms:
These are implemented by dividing the data up into different 'buckets' where
each bucket contains the same number of values. The highest value in each bucket (or END_POINT) is recorded together with the lowest value in the "zero" bucket.
Once the data is recorded in buckets we recognize 2 types of data value - Non-popular values and popular values.
Non-popular values - are those that do not occur multiple times as endpoints.
Popular values - occur multiple times as end points.
We can use Popular and Non-Popular Values to provide users with various statistics. Since we know how many values there are in a bucket we can use this information to estimate the number of rows in total that is covered by Popular and Non-Popular values.
The selectivity for popular values can be obtained by calculation the proportion of bucket endpoints filled by that popular value.
The selectivity for nonpopular values can now be calculated as 1/number non-popular bucket endpoints, so we can now be more accurate about selectivities than the original 1/NDV because we have removed the popular values from the equation.
ALL_TAB_HISTOGRAMS describes histograms on tables and views accessible to the current user.
The ALL_TAB_HISTOGRAMS view contains a one-bucket histogram, which in fact signifies "No histogram" to the Oracle Database software.
Therefore, it should not be queried to indicate the presence or absence of a histogram on a particular column. Instead, query the value of column HISTOGRAM in the ALL_TAB_COL_STATISTICS view.
Density is a column statistic and provides selectivity estimates for
equi-join predicates (e.g. and A.COL1 = B.COL1) and equality predicates
(e.g. COL1 = 'Y').
The density is expressed as a decimal number between 0 and 1.
Values close to 1 indicate that this column is unselective
Values close to 0 indicate that this column is highly selective
The more selective a column, the less rows are likely to be returned by a
query referencing this column in its predicate list.
The column selectivity is part of the equation used to decide on the best
path for a query to take to retrieve the data required in the most effective
manner and hence impacts the final cost value for the query.