1 entries

KOLMOV Software Notebook
------------------------

===============================================================================
Entry #1     Tom Lougheed     07/24/86     Kolmogorov-Smirnov Test

Introduction
------------

Today I began writing the Fortran code for a program that will perform the
Kolmogorov-Smirnov test on one or more columns of data.  The purpose of the
test is to measure how closely the statistical behavior of a sample (or a
population) is approximated either by a distribution function or by another
sample.  Appropriate questions one may have in mind when performing the test
are "Are the differences between the observed data and my theoretical model
normally distributed, with zero mean?" or "Is the statistical clumping of
quasars (sample 1 -- thought by some to be nearby objects) closer to the
clumping of distant galaxies (sample 2 -- far away objects) or to the
clumping of globular clusters (sample 3 -- nearby objects)?"

Requirements
------------

Ideally, the program will provide all of the options listed below.  The
output in each case is the Kolmogorov statistic, which is the maximum
absolute difference in cumulative probability between the two sample
distributions, or between the sample and the proposed distribution.  The
desirable options follow:

     (1) Compare two sample input tables.

     (2) Compare a single sample input table with a user-selected
         distribution (from a list of common distribution functions).  The
         parameters of the distribution are specified by the user.

     (3) Compare a single sample input table with a user-selected
         distribution (again, from a list of common distribution functions).
         The parameters of the distribution are to be calculated from a
         maximum-likelihood fit to the input table (or, alternatively, from
         a minimization of the Kolmogorov statistic?).  In addition to the
         Kolmogorov statistic, the determined "best" distribution parameters
         are returned to the user.

     (4) Compare a single sample input table with several user-selected
         distributions.  The parameters of each distribution are to be
         calculated from a maximum-likelihood fit to the input table (or by
         minimizing the Kolmogorov statistic, as above).  The output shall
         be a list ordered from the "best" distribution, with the smallest
         Kolmogorov statistic, to the "worst" distribution, with the
         largest.  The distribution parameters will also be written out.

The distributions provided in the standard list should include the following:

     (1) Normal (or Gaussian) -- central limit distribution for mean values
         of large samples.

     (2) Uniform -- default distribution, when all possibilities are assumed
         equally likely.

     (3) Gamma -- general distribution for data which are bounded below by
         zero, but (conceptually) unlimited above.  The chi-square
         distribution is a special case of the gamma.  Closely related to
         the discrete Poisson distribution.

     (4) Beta -- general distribution for data which are bounded between two
         limits.  Closely related to the binomial distribution.

     (5) Pareto (or power law) -- simple in form, it is popular among
         scientists for the ease with which it may be compared to theory.

Provision should be made to add others as the need arises.  Also, for these
applications, additional generality should be provided in the form of
automatic affine transformations for those distributions (gamma, beta, ...)
where the location of the origin is important, but not commonly a parameter.

Analysis
--------

Option (1) should be simple to produce, and option (2) should not be much
harder, with the exception of finding on-hand subroutines to evaluate
uncommon distribution functions (remember, the DENSITY functions are usually
the easy ones to evaluate).

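The two statistics behind options (1) and (2) can be sketched briefly.  The
following is a minimal illustration in Python rather than the Fortran planned
for the program.  The function names ks_two_sample, ks_one_sample, and
normal_cdf are hypothetical, the normal distribution is used only as an
example of a cumulative distribution function, and empty input tables are
not handled.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Maximum absolute difference between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that change only at observed
    # values, so it is enough to compare them at every data point.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def ks_one_sample(sample, cdf):
    """Maximum absolute difference between the empirical CDF and cdf(x)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n up to (i+1)/n at x, so the gap is
        # checked just below and just above the step.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


def normal_cdf(x, mean=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


# Option (1): two sample tables.
print(ks_two_sample([1.0, 2.0], [2.0, 3.0]))

# Option (2): one sample table against a user-specified distribution.
print(ks_one_sample([-0.8, -0.1, 0.3, 1.1], lambda x: normal_cdf(x, 0.0, 1.0)))
```

Both routines rely on sorting, and option (2) needs the cumulative
distribution function rather than the density, which is why those two
utilities dominate the work for these options.
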
Option (3) is simple in the case of the normal (or Gaussian) distribution,
but may be far more difficult for other useful distributions.  The option
may also be very difficult to provide and, once provided, very time
consuming to run if the parameters are fit by reducing the Kolmogorov
statistic to a minimum rather than by maximum likelihood, even though the
former is more intuitive in this context.  Option (4) is simply a useful
embellishment of option (3), which would allow the user to pull the "best"
of all (listed) distributions out of the blue; it would, however, require a
lot of computer time.

In reference to the case where the user does not provide input parameters:
if one desires the parameters which make the Kolmogorov statistic as small
as possible, one still may want to calculate a maximum-likelihood estimate
as a starting value.  Maximum-likelihood estimators have many desirable
properties: they are usually consistent and efficient, and they achieve a
minimum-variance estimate if one is possible.  (Their only failing is that
maximum-likelihood estimates are often biased, although the bias generally
diminishes as the number of data points in the sample grows; sometimes the
formula for the bias is known and can be used to correct the estimate.)
Because of these good properties, I expect that a maximum-likelihood
estimate will probably be very close to the estimate of the distribution
parameters which minimizes the Kolmogorov statistic.  Because the
calculation of the latter must generally involve time-consuming bounded
nonlinear optimization methods, one would probably save computer time by
starting with the former.

A further important and troubling problem is how to treat the case in which
the user knows some of the distribution parameters, but not all.  For such a
case, the remaining unknown parameters are well defined, and may be
determined in a manner almost identical to that described above.

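The idea of starting from a maximum-likelihood estimate and then reducing the
Kolmogorov statistic, with the option of holding a user-supplied parameter
fixed, might look like the following for the normal distribution.  This is a
minimal Python sketch, not the planned Fortran.  The names normal_mle,
ks_normal, and refine_by_ks are hypothetical, and the crude pattern search
only stands in for whatever bounded nonlinear optimization routine is
eventually chosen.  It assumes the sample is not constant, so that sigma is
positive.

```python
import math


def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def ks_normal(sample, mean, sigma):
    """Kolmogorov statistic of the sample against Normal(mean, sigma)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, sigma)
        # The empirical CDF steps from i/n to (i+1)/n at x; test both sides.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


def normal_mle(sample, known_mean=None):
    """Maximum-likelihood (mean, sigma); a user-supplied mean is kept as is."""
    n = len(sample)
    mean = sum(sample) / n if known_mean is None else known_mean
    # The ML variance divides by n, not n - 1, so it is biased low when the
    # mean is itself estimated from the sample.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return mean, sigma


def refine_by_ks(sample, mean, sigma, fix_mean=False, max_rounds=200, tol=1e-6):
    """Crude pattern search that lowers the Kolmogorov statistic."""
    best = ks_normal(sample, mean, sigma)
    step = 0.5 * sigma
    moves = [(0.0, 1.0), (0.0, -1.0)]          # vary sigma
    if not fix_mean:
        moves += [(1.0, 0.0), (-1.0, 0.0)]     # vary the mean as well
    for _ in range(max_rounds):
        if step < tol:
            break
        improved = False
        for dm, ds in moves:
            m, s = mean + dm * step, sigma + ds * step
            if s <= 0.0:                       # bound: sigma stays positive
                continue
            d = ks_normal(sample, m, s)
            if d < best:
                best, mean, sigma, improved = d, m, s, True
        if not improved:
            step *= 0.5                        # no gain: search more finely
    return mean, sigma, best


sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, 2.0]
m0, s0 = normal_mle(sample)                    # starting value
print(ks_normal(sample, m0, s0))               # statistic at the ML estimate
print(refine_by_ks(sample, m0, s0))            # refined (mean, sigma, statistic)
```

Each call to ks_normal sorts the sample and evaluates the cumulative
distribution function at every point, so the cost of the search grows with
the number of trial parameter values; a good starting value keeps that number
small.  Passing known_mean to normal_mle and fix_mean=True to refine_by_ks
covers the case where the user supplies the mean but not sigma.
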
However, the burden of programming for all the above cases will be increased
significantly.  The distributions listed above generally require two
parameters (three if a shift of origin is allowed); this will roughly triple
(or, with three parameters, octuple) the programming effort required to
provide the extra formulae needed.  Further discussion with management is
suggested.

Schedule
--------

Work will begin immediately on option (1).  This will probably be completed
in the month remaining before this writer leaves the Science Data Analysis
System (SDAS) team to participate in another project.  The second option
will probably be only partially completed before the writer's departure.
Management shall be obliged to resolve the scheduling problem.

Needed Utilities
----------------

The following subroutines will be needed:

     For options (1) through (4):
          (Many-dimensional?) sorting routine.

     For option (2):
          Cumulative distribution function values (one for each
          distribution function).

     For options (3) and (4):
          Maximum-likelihood estimator routines (one for each distribution
          function).
          (If a minimum Kolmogorov statistic estimate of parameters is
          desired, a nonlinear bounded optimization routine will be needed
          as well, or instead.)
          (If the user may specify some but not all of the parameters, then
          many additional estimation routines may be needed.)

     For option (4):
          No additional utilities are required.

Concerning the need for additional estimates when one or more of the
parameters are fixed by the user, see the Analysis section above.

Summary
-------

The first step of the work of coding the Kolmogorov-Smirnov test will
commence immediately.  Issues concerning advanced options must be resolved.

Close
-----

I remain, on this 24th day of July, in the year 1986, your humble servant,

Thomas-William Lougheed