33 entries

                        STATISTICS Software Notebook
                        ----------------------------

===============================================================================

Entry #1        scivax::hanisch         08/26/87        possible project

Dave,

Now that Samy has left us (at least unofficially) I need someone to pick up
a portion of the projects I had expected him to do.  Someone will need to
maintain the Kolmogorov-Smirnov statistics task.  More importantly, Eric
Feigelson and Takashi Isobe have developed a set of routines for dealing
with data sets having known intrinsic errors and upper/lower limits.  Doing
statistics with such data sets (which are typical of astronomy) is extremely
important and, so far, has not been done properly by most observers.  Thus,
we have the chance to build into SDAS some nice stats routines that no one
else has.

The point is that Takashi will be coming to visit here in late September.
I need someone to work with him for a day or two in order to understand his
code (it is all F77 and has already been run successfully on a variety of
machines, so it should be pretty clean) and to understand how to build an
IRAFish shell around it.  The basic I/O will be tables, but it may make
sense to work from simple ASCII lists as well.  I have asked them to send
us a tape and documentation now so that we'll have some idea of questions
to ask, etc., before Takashi actually shows up.

Would you be willing to take this on?  Phil is tremendously overloaded, so
is Chris, and you seemed to enjoy the cooperative project with Chris Blades
and Jon Wheatley.  If you're up for it, I will see that you get this tape
and related documentation as soon as it arrives.

Bob

===============================================================================

Entry #2        scivax::hanisch         09/09/87        takashi isobe

Takashi just called from Penn State.  (He's the guy with the non-parametric
statistics routines.)  I've asked him to postpone his visit until Oct 1/2
to avoid downtime on the cluster.  Is this ok with you?

Bob

===============================================================================

Entry #3        kepler::ball            09/09/87        RE: takashi isobe

Yes.  I need to start studying the material you gave me, so a few weeks
from now would be fine.

Dave

===============================================================================

Entry #4        ti@jabba                10/07/87        Test

From: ti@jabba.psu.edu

Hello, Dave.  I'm now writing the electronic mail that you want to use
between us.  In fact, I need several of the subroutines that we made, since
I found that the subroutines on the tape are old ones.  Please send the
following subroutines:

        aarray  arisk   cox     lrank   pwlcxn  wlcxn
        ftest   twost   grdprb  bin     twokm

The return address is ti@jabba.psu.edu; if jabba is not in your file, try
128.118.30.150.  If all else fails, try t1i@psuvm.psu.edu.

Takashi Isobe

PS  This mail is a second trial.  The former one did not make it to
STScI...

===============================================================================

Entry #5        scivax::ball            10/09/87        Survival Routines

Hello, Takashi.  Glad you found a way to send us mail!  I'm not sure what
happened with the routines you mentioned, since we changed the main file,
but I am sending them to you in separate messages (a different one for
each) because you're less likely to lose them if there should be an error.
Just to verify which ones you need, here is your list:

        aarray  arisk   cox     lrank   pwlcxn  wlcxn
        ftest   twost   grdprb  bin     twokm

Please let me know if you need others or if this is not correct.
Dave

===============================================================================

Entry #6        scivax::ball            10/12/87        Eleven Routines Mailed

Takashi,

I have just sent all the routines you asked for (see the list I gave in my
first two messages).  If I don't hear from you by email by tomorrow I will
call you on the phone to verify that you have received the routines.  Hope
everything is okay.

Dave

===============================================================================

Entry #7        scivax::ball            10/12/87        Network Address

Takashi,

I checked with our computer staff about your email address.  They say that
you are not in the network information center address tables (and we got a
new copy just this morning), so even if you are getting mail from me you
should probably talk to your local network experts.  Our staff did add your
address to our local copy of the tables, with jabba.psu.edu as
128.118.30.150.

Dave

===============================================================================

Entry #8        scivax::ball            10/15/87        Survival Routines

Hello, Takashi.  Okay, let's try this again!  I am sending the subroutines
to you in separate messages (a different one for each) because you're less
likely to lose them if there should be an error.  Just to verify which ones
you need, here is your list:

        aarray  arisk   cox     lrank   pwlcxn  wlcxn
        ftest   twost   grdprb  bin     twokm

Please let me know if you need others or if this is not correct.

Dave

P.S.  Please let me know as soon as you get this one, and then again when
you have received all of them.  Thanks.

===============================================================================

Entry #9        scivax::ball            10/15/87        Takashi's network address

FYI: My first attempt to send email to Takashi Isobe at Penn State with the
most recent copies of eleven of the ASURV package routines failed because
the network was down at his end.  Instead of using the address I have
tried .

===============================================================================

Entry #10       ti@jabba                10/16/87        Message from ra.arpa received!

From: ti@jabba.psu.edu

Hello Dave:

I got the mail on the Sun which was sent Oct. 15 afternoon from you.  It
seems to be working!  Since it is much easier for me to work on the Sun
rather than on the IBM mainframe, please use this channel every time.

It is a very pleasant day at Penn State.  I will take off quite soon to
take a walk; maybe visit the farmer's market.  Have a nice weekend.

Takashi Isobe

===============================================================================

Entry #11       ball@ra                 10/19/87        Survival routines

From: Dave Ball

Takashi,

I am still having trouble with the mailers here, so here we go again.  I
might have tried to get too fancy in what I was trying to do.  I hope this
time this message gets through.  The eleven routines you need should be
following close behind this e-mail; when you get wlcxn.for, I will have
sent all eleven (regardless of the order you have received them in).  I
guess you should wait until you have them all to let me know how things
are.  Here's hoping all arrive safely this time.  Good luck.

Dave

P.S.  If this looks out of sequence with the other messages, it probably
is (or if you already got this message, sorry!).  I think I forgot to send
this first!

===============================================================================

Entry #12       ti@jabba                10/20/87        Survival

From: ti@jabba.psu.edu

Hi, Ball:

I received all the subroutines that I asked for.  Thanks.  Although there
are a few bugs in other subroutines, I will not send these now: it might
confuse you.
I will send them when I find a few more (there are always MORE bugs!).

Takashi Isobe

PS  This is sent to STSCI.ARPA.

===============================================================================

Entry #13       ti@jabba                10/29/87        ASURV Rev0.0

From: ti@jabba.psu.edu

Hi, Dave:

Since we started sending out ASURV, we have decided not to change Rev 0.0
any more.  There are, however, a few changes that I made:

    XVAR.FOR  : Add ISIGN=1 (near the top).
    ASURV.FOR : Add CLOSE(UNIT=2) (near the end).
    EM.FOR    : Move PRINT 2070 to after PRINT 2055; also add GOTO 3000
                before that line.

I will not change the code any further for at least a few months, so
please swap in these subroutines and work with them.  Thanks.

Takashi Isobe

C
C **********************************************************************
C ********************* SUBROUTINE XVAR ********************************
C **********************************************************************
C
      SUBROUTINE XVAR(IND,X,J,NTOT,ISIGN,ZU,ZC,IU,IC)
C
C     * THIS SUBROUTINE DISTINGUISHES UNCENSORED AND CENSORED    *
C     * DATA IN THE X VARIABLE AND PUTS THEM INTO ZU AND ZC.     *
C     * ALSO, IF THE DATA CONTAIN UPPER LIMITS, THE SIGN OF THE  *
C     * VALUES IS CHANGED SO THAT THE PROGRAM FOR THE LOWER      *
C     * LIMITS CAN BE USED.  ADAPTED FROM SMSDA.                 *
C     *                                                          *
C     * INPUT   IND(J,I) : INDICATOR OF CENSORING                *
C     *         X(J,I)   : VARIABLE                              *
C     *         J        : J-TH DATA SET                         *
C     *         NTOT     : TOTAL NUMBER OF DATA POINTS           *
C     *                                                          *
C     * OUTPUT  ISIGN    : IF LOWER LIMIT, ISIGN =  1            *
C     *                    IF UPPER LIMIT, ISIGN = -1            *
C     *         ZU(K)    : UNCENSORED DATA POINTS IN X(J,I)      *
C     *         ZC(K)    : CENSORED DATA POINTS IN X(J,I)        *
C     *         IU       : NUMBER OF UNCENSORED DATA POINTS      *
C     *         IC       : NUMBER OF CENSORED DATA POINTS        *
C     *                                                          *
C
      IMPLICIT REAL*8 (A-H,O-Z)
      DIMENSION IND(10,200),X(10,200),ZU(200),ZC(200)
C
C     ******* NEXT LINE IS ADDED ON OCT 9, 1987 *******
C
      ISIGN=1
      DO 100 I=1,NTOT
      IF(IND(J,I).EQ.0) GOTO 100
      ISIGN=IND(J,I)/IABS(IND(J,I))
  100 CONTINUE
C
      IC=0
      IU=0
      DO 300 I=1,NTOT
      K=IABS(IND(J,I))
      IF(K.EQ.0) GOTO 200
C
  150 IC=IC+1
      ZC(IC)=ISIGN*X(J,I)
      GOTO 300
  200 IU=IU+1
      ZU(IU)=ISIGN*X(J,I)
  300 CONTINUE
      RETURN
      END
C
C
C     *       SURVIVAL ANALYSIS PACKAGE FOR ASTRONOMERS              *
C     *                                                              *
C     *       DEVELOPED BY: TAKASHI ISOBE                            *
C     *                     DEPARTMENT OF ASTRONOMY                  *
C     *                     THE PENNSYLVANIA STATE UNIVERSITY        *
C     *                     525 DAVEY LAB, UNIVERSITY PARK PA 16802  *
C     *                                                              *
C     *       REV. 0.0   TEST VERSION   JUNE 1987                    *
C     *                                                              *
C     *       THIS PACKAGE IS WRITTEN TO PROVIDE SEVERAL             *
C     *  SURVIVAL ANALYSIS METHODS WHICH ARE USEFUL IN ANALYZING     *
C     *  ASTRONOMICAL DATA.  SURVIVAL ANALYSIS IS A GROUP OF         *
C     *  STATISTICAL METHODS WHICH TREAT PROBLEMS WITH CENSORED      *
C     *  DATA (UPPER OR LOWER LIMITS).  THIS PACKAGE INCLUDES SOME   *
C     *  TECHNIQUES DEVELOPED IN OTHER FIELDS (E.G. ECONOMICS,       *
C     *  ACTUARIAL SCIENCE, RELIABILITY MATHEMATICS), AND A FEW      *
C     *  METHODS DEVELOPED BY ASTRONOMERS.                           *
C     *                                                              *
C     *  THE METHODS PROVIDED IN THIS PACKAGE ARE :                  *
C     *                                                              *
C     *  UNIVARIATE DISTRIBUTION : KAPLAN-MEIER ESTIMATOR            *
C     *  TWO-SAMPLE TESTS        : GEHAN TEST                        *
C     *                            LOGRANK TEST                      *
C     *                            COX-MANTEL TEST                   *
C     *                            PETO AND PETO TEST                *
C     *                            F TEST [THOUGH PROVIDED IN THE    *
C     *                            SMSDA CODE, WE OMIT THE F-TEST    *
C     *                            BECAUSE NO PROBABILITIES ARE      *
C     *                            ASSIGNED TO F-STATISTIC VALUES.]  *
C     *  CORRELATION TESTS       : COX PROPORTIONAL HAZARDS MODEL    *
C     *                            GENERALIZED KENDALL'S TAU         *
C     *                            (BHK METHOD)                      *
C     *  LINEAR REGRESSIONS      : EM ALGORITHM WITH NORMAL          *
C     *                            DISTRIBUTION                      *
C     *                            BUCKLEY-JAMES METHOD              *
C     *                            TWO-DIMENSIONAL KAPLAN-MEIER      *
C     *                            REGRESSION FOR DUAL-CENSORED DATA *
C     *                                                              *
C     *                                                              *
C     *  INPUTS                                                      *
C     *                                                              *
C     *     IS0 : IF 1 : UNIVARIATE PROBLEM                          *
C     *            2 : CORRELATION/REGRESSION PROBLEM                *
C     *            3 : EXIT                                          *
C     *                                                              *
C     *  SUBROUTINES  UNIVAR, BIVAR                                  *
C
      IMPLICIT REAL*8 (A-H,O-Z)
      CHARACTER*1 BLANK,CHECK,CHAR(4,10)
C
C
      PRINT *
      PRINT *,' **************************************************'
      PRINT *,' *                                                *'
      PRINT *,' *           WELCOME TO ASTROSURV                 *'
      PRINT *,' *         SURVIVAL ANALYSIS PACKAGE              *'
      PRINT *,' *             FOR ASTRONOMERS                    *'
      PRINT *,' *                                                *'
      PRINT *,' *         DEVELOPED BY T. ISOBE                  *'
      PRINT *,' *           DEPT. OF ASTRONOMY                   *'
      PRINT *,' *    THE PENNSYLVANIA STATE UNIVERSITY           *'
      PRINT *,' *                                                *'
      PRINT *,' *          REV 0.0   JUNE 1987                   *'
      PRINT *,' **************************************************'
      PRINT *
      PRINT *
      PRINT *
      PRINT *
      PRINT *,'        (CARRIAGE RETURN TO CONTINUE) '
      READ(*,50) BLANK
   50 FORMAT(A1)
      PRINT *
C
C     * START CONVERSATION WITH A USER *
C
      PRINT *
      PRINT *
      PRINT *
  100 PRINT *,'                      MENU '
      PRINT *
      PRINT *
      PRINT *,'   UNIVARIATE DATA              BIVARIATE DATA '
      PRINT *
      PRINT *
      PRINT *,'   DISTRIBUTION FUNCTION        CORRELATION '
      PRINT *,'   1 KAPLAN-MEIER ESTIMATOR     1 COX REGRESSION '
      PRINT *,'                                2 GEN. KENDALL TAU'
      PRINT *
      PRINT *
      PRINT *,'   TWO-SAMPLE TESTS             LINEAR REGRESSION '
      PRINT *,'   1 GEHAN TEST                 1 EM ALGORITHM WITH '
      PRINT *,'   2 LOGRANK TEST                 GAUSSIAN RESIDUALS '
      PRINT *,'   3 COX-MANTEL TEST            2 BUCKLEY-JAMES METHOD '
      PRINT *,'   4 PETO AND PETO TEST           WITH KM RESIDUALS '
      PRINT *,'                                3 SCHMITT METHOD FOR '
      PRINT *,'                                  DUAL CENSORED DATA '
      PRINT *
      PRINT *
      PRINT *
      PRINT *,'        (CARRIAGE RETURN TO CONTINUE) '
      READ(*,50) BLANK
C
      PRINT *
C
C     * CHOICE : UNIVARIATE PROBLEM OR CORRELATION/REGRESSION PROBLEM *
C
      PRINT *
      PRINT *,'  SELECT DATA TYPE: '
      PRINT *,'     1 UNIVARIATE DATA '
      PRINT *,'     2 BIVARIATE DATA '
      PRINT *,'     3 EXIT '
  200 WRITE(*,210)
  210 FORMAT($,'  CHOICE ? ')
C
      CALL DATA(IS0)
C
      IF((IS0.EQ.1).OR.(IS0.EQ.2).OR.(IS0.EQ.3)) GOTO 300
      PRINT *,'PLEASE TYPE ONCE MORE'
      GOTO 200
C
  300 IBACK=0
      IF(IS0.EQ.1) CALL UNIVAR(IBACK)
      IF(IS0.EQ.2) CALL BIVAR(IBACK)
      IF(IS0.EQ.3) STOP
C
C     ******* NEXT LINE IS ADDED ON OCT. 14, 1987 *******
C
      CLOSE(UNIT=2)
      IF(IBACK.EQ.1) GOTO 100
      STOP
      END
C
C **********************************************************************
C ********************* SUBROUTINE EM **********************************
C **********************************************************************
C
      SUBROUTINE EM(IND,X,Y,NTOT,TOL,MAXITS,NVAR,IBET,ALPHA,
     +              ND,NC,OUTPUT,FILE)
C
C     *     MAXIMUM LIKELIHOOD ESTIMATION IN A LINEAR MODEL       *
C     *     FROM CONFINED AND CENSORED NORMAL DATA                *
C     *                                                           *
C     *  FORMAL PARAMETERS :                                      *
C     *                                                           *
C     *  NTOT   INPUT : THE NUMBER OF OBSERVATIONS                *
C     *  MPLONE INPUT : THE TOTAL NUMBER OF PARAMETERS TO BE      *
C     *                 ESTIMATED (I.E. NB+1)                     *
C     *  MAXITS INPUT : THE MAXIMUM NUMBER OF ITERATIONS ALLOWED  *
C     *  IBET   INPUT : IF THERE ARE DATA WHICH ARE CONFINED      *
C     *                 BETWEEN TWO VALUES, IBET=1, OTHERWISE 0   *
C     *  ALPHA  INPUT : IF ALPHA(NVAR+2).LE.0.0, THE SUBROUTINE   *
C     *                 CALCULATES INITIAL PARAMETER ESTIMATES.   *
C     *                 IF >0.0, IT CONTAINS THE INITIAL          *
C     *                 ESTIMATE OF THE J-TH LOCATION PARAMETER   *
C     *                 FOR J=1,2,.....,NB.                       *
C     *         OUTPUT: THE MOST RECENT PARAMETER ESTIMATES       *
C     *                 BEFORE EXIT FROM THE SUBROUTINE.          *
C     *  TOL    INPUT : CONVERGE TO THE MAXIMUM LIKELIHOOD        *
C     *                 PARAMETER ESTIMATES IF THE DIFFERENCE     *
C     *                 BETWEEN CONSECUTIVE ESTIMATES OF THE      *
C     *                 J-TH PARAMETER IS LESS THAN TOL(J) FOR    *
C     *                 J=1,2,........,M.                         *
C     *  IND    INPUT : INDICATOR OF CENSORED DATA                *
C     *  XX     INPUT : THE DESIGN MATRIX.  XX(I,J) CONTAINS THE  *
C     *                 COEFFICIENT OF THE J-TH LOCATION PARA-    *
C     *                 METER FOR THE I-TH OBSERVATION.           *
C     *  Y      INPUT : IF PP(I)=0, THE I-TH OBSERVATION IS       *
C     *                 COMPLETELY SPECIFIED IN Y(I); IF          *
C     *                 PP(I)=-1, Y(I) IS LEFT-CENSORED; IF       *
C     *                 PP(I)=1, Y(I) IS RIGHT-CENSORED; IF       *
C     *                 PP(I)=5, Y(I) IS CONFINED BETWEEN TWO     *
C     *                 VALUES.                                   *
C     *  ROWX   WORK  : THE NUMBER OF ROWS OF X (=NTOT)           *
C     *  COLX   WORK  : THE NUMBER OF COLUMNS OF X (=NB)          *
C     *  W      WORK  :                                           *
C     *  LENW   WORK  : = NB+NTOT                                 *
C     *  WORK   WORK  :                                           *
C     *  LENWRK WORK  : = NB*NTOT                                 *
C     *  VCOV   OUTPUT: IF THE PROCEDURE CONVERGED TO THE MAX-    *
C     *                 IMUM LIKELIHOOD ESTIMATES, THE FIRST      *
C     *                 (NB+1)*(NB+1) POSITIONS CONTAIN AN ESTI-  *
C     *                 MATE OF THE VARIANCE-COVARIANCE MATRIX    *
C     *                 OF THESE ESTIMATES.                       *
C     *  SIGMAA OUTPUT: THE (J,J) COMPONENT CONTAINS THE STANDARD *
C     *                 DEVIATION OF THE J-TH PARAMETER.          *
C     *  IFAULT OUTPUT: FAILURE INDICATOR                         *
C     *  ICHECK OUTPUT: IF THE ERROR ANALYSIS IS NOT COMPLETED,   *
C     *                 ICHECK HAS THE VALUE 1, OTHERWISE 0.      *
C     *                                                           *
C     *  SUBROUTINES: EMALGO.                                     *
C     *                                                           *
C     *  REF : M.S. WOLYNETZ, AS 139, APPL. STATIST. VOL. 28,     *
C     *        195 (1979), PLUS CORRECTIONS IN LATER ISSUES OF    *
C     *        APPLIED STATISTICS.                                *
C     *                                                           *
C     *  SUBROUTINE EMALGO IS COPIED DIRECTLY FROM WOLYNETZ,      *
C     *  EXCEPT FOR A FEW CHANGES TO PRINT AND COMMENT            *
C     *  STATEMENTS.                                              *
C
C
      IMPLICIT REAL*8 (A-H,O-Z)
      INTEGER ROWX,COLX
      DIMENSION X(10,200),Y(200),IND(200),XX(200,10),Y2(200)
      DIMENSION W(200),WCEN(200),VCOV(1000),WORK(1000)
      DIMENSION ALPHA(10),SIGMAA(10,10),TOLA(10)
      CHARACTER*9 FILE,OUTPUT
C
C
C     * INITIALIZATION *
C
      MPLONE=NVAR+2
      NB=NVAR+1
      ROWX=NTOT
      COLX=MPLONE
      LENW=NTOT+MPLONE
      LENWRK=NTOT*MPLONE
C
      DO 10 I=1,MPLONE
      TOLA(I)=TOL
   10 CONTINUE
C
C     *                                                           *
C     * IF THERE ARE DATA POINTS CONFINED BETWEEN TWO VALUES,     *
C     * READ THE INPUT DATA AGAIN.                                *
C     *                                                           *
      IF(IBET.EQ.0) GOTO 80
      OPEN(UNIT=2,FILE=FILE,STATUS='OLD',FORM='FORMATTED')
      DO 40 I=1,NTOT
      READ(2,30) IND(I),(XX(I,J),J=2,NB),Y(I),Y2(I)
   30 FORMAT(I3,3F10.3)
   40 CONTINUE
C
   80 DO 100 I=1,NTOT
      XX(I,1)=1
      DO 90 J=1,NVAR
      JJ=J+1
      XX(I,JJ)=X(J,I)
   90 CONTINUE
  100 CONTINUE
C
      ICHECK=0
C
C     * THE SUBPROGRAM FOR THE MAXIMUM LIKELIHOOD ESTIMATES *
C
  195 CALL EMALGO(NTOT,Y,Y2,IND,MPLONE,XX,ROWX,COLX,W,WCEN,LENW,
     +            VCOV,WORK,LENWRK,ALPHA,TOLA,MAXITS,IFAULT,ICHECK,NC)
C
      IF(ICHECK.NE.0) GOTO 360
      IF((OUTPUT.EQ.' ').AND.(IFAULT.LE.0)) GOTO 2000
      IF((OUTPUT.NE.' ').AND.(IFAULT.LE.0)) GOTO 2005
C
C     * CHANGE THE VCOV ARRAY FROM ONE DIM. TO TWO DIM. *
C
      K=1
      DO 300 I=1,MPLONE
      DO 200 J=1,MPLONE
      SIGMAA(I,J)=DSQRT(DABS(VCOV(K)))
      K=K+1
  200 CONTINUE
  300 CONTINUE
C
C     PRINT THE FINAL REGRESSION COEFFICIENTS IN THE OUTPUT FILE.
C
  360 IF(OUTPUT.EQ.' ') GOTO 3360
      WRITE(6,1050)
      WRITE(6,1020)
      WRITE(6,1200) ALPHA(1),SIGMAA(1,1)
      DO 455 J=2,NB
      JI=J-1
      WRITE(6,1250) JI,ALPHA(J),SIGMAA(J,J)
  455 CONTINUE
      WRITE(6,1300) ALPHA(MPLONE)
      IF(ICHECK.NE.0) GOTO 525
      ITE=IFAULT
      GOTO 545
  525 ITE=ICHECK
  545 WRITE(6,1350) ITE
      GOTO 3000
C
C     IN THE FOLLOWING WE HAVE COLLECTED ALL OF WOLYNETZ'S ERROR FLAGS
C     AND PRINT OUT THE APPROPRIATE ERROR MESSAGE.
C
 2005 WRITE(6,1000)
      WRITE(6,1020)
      WRITE(6,1050)
      WRITE(6,2001)
      IF(IFAULT.NE.-1) GOTO 2019
      WRITE(6,2002)
      WRITE(6,2003)
      GOTO 3000
 2019 IF(IFAULT.NE.-4) GOTO 2029
      WRITE(6,2012)
      WRITE(6,2014)
      GOTO 3000
 2029 IF(IFAULT.NE.-5) GOTO 2038
      WRITE(6,2021)
      WRITE(6,2022)
      WRITE(6,2023)
      WRITE(6,2024)
      WRITE(6,2025)
      WRITE(6,2026)
      WRITE(6,2027)
      GOTO 3000
 2038 IF(IFAULT.NE.-6) GOTO 2049
      WRITE(6,2031)
      WRITE(6,2032)
      WRITE(6,2033)
      WRITE(6,2034)
      WRITE(6,2035)
      WRITE(6,2036)
      WRITE(6,2037)
      GOTO 3000
 2049 IF(IFAULT.NE.-7) GOTO 2056
      WRITE(6,2045)
      GOTO 3000
 2056 IF(IFAULT.NE.-8) GOTO 2065
      WRITE(6,2055)
C     ******* NEXT LINE ADDED TO MATCH THE TERMINAL BRANCH BELOW *******
      GOTO 3000
 2065 WRITE(6,2070)
C
C     PRINT THE FINAL REGRESSION COEFFICIENTS AND ERROR MESSAGES ON
C     THE TERMINAL.
C
 3360 PRINT 1050
      PRINT 1020
      PRINT 1200,ALPHA(1),SIGMAA(1,1)
      DO 450 J=2,NB
      JI=J-1
      PRINT 1250,JI,ALPHA(J),SIGMAA(J,J)
  450 CONTINUE
      PRINT 1300,ALPHA(MPLONE)
      IF(ICHECK.NE.0) GOTO 520
      ITE=IFAULT
      GOTO 540
  520 ITE=ICHECK
  540 PRINT 1350,ITE
      GOTO 3000
C
 2000 PRINT 1000
      PRINT 1020
      PRINT 1050
      PRINT 2001
      IF(IFAULT.NE.-1) GOTO 2010
      PRINT 2002
      PRINT 2003
      GOTO 3000
 2010 IF(IFAULT.NE.-4) GOTO 2020
      PRINT 2012
      PRINT 2014
      GOTO 3000
 2020 IF(IFAULT.NE.-5) GOTO 2030
      PRINT 2021
      PRINT 2022
      PRINT 2023
      PRINT 2024
      PRINT 2025
      PRINT 2026
      PRINT 2027
      GOTO 3000
 2030 IF(IFAULT.NE.-6) GOTO 2040
      PRINT 2031
      PRINT 2032
      PRINT 2033
      PRINT 2034
      PRINT 2035
      PRINT 2036
      PRINT 2037
      GOTO 3000
 2040 IF(IFAULT.NE.-7) GOTO 2050
      PRINT 2045
      GOTO 3000
 2050 IF(IFAULT.NE.-8) GOTO 2060
      PRINT 2055
      GOTO 3000
 2060 PRINT 2070
C
 1000 FORMAT(T10,'REGRESSION ANALYSIS WITH CENSORED DATA')
 1020 FORMAT(T10,'      PARAMETRIC EM ALGORITHM')
 1050 FORMAT(T5,' ')
 1100 FORMAT(T8,'DATA TITLE :',T25,60A1)
 1150 FORMAT(T8,'TOTAL # OF DATA :',T26,I3,T33,'CENSORED DATA :'
     +       ,T48,I3)
 1200 FORMAT(T8,'INTERCEPT COEFF :',F8.4,T38,'#',T39,F8.4)
 1250 FORMAT(T8,'SLOPE COEFF ',I1,' :',F8.4,T38,'#',T39,
     +       F8.4,5X)
 1300 FORMAT(T8,'STANDARD DEVIATION :',F8.4)
 1350 FORMAT(T8,'ITERATIONS :',I3)
 2001 FORMAT(T5,'NOTICE :')
 2002 FORMAT(T5,'MAXIMUM NUMBER OF ITERATIONS REACHED AND')
 2003 FORMAT(T5,'CONVERGENCE HAS NOT BEEN OBTAINED.')
 2012 FORMAT(T5,'NUMBER OF COMPLETELY SPECIFIED DATA IS LESS')
 2014 FORMAT(T5,'THAN NB+1')
 2021 FORMAT(T5,'THE MATRIX IS NOT POSITIVE DEFINITE, AS')
 2022 FORMAT(T5,'DETERMINED BY SUBROUTINE "SYMINV", A MATRIX')
 2023 FORMAT(T5,'INVERSION PROCEDURE (HEALY, 1968); THE VALUES')
 2024 FORMAT(T5,'OF "NULLTY" AND "IFAULT" RETURNED BY SYMINV')
 2025 FORMAT(T5,'ARE PLACED IN THE FIRST TWO POSITIONS OF ')
 2026 FORMAT(T5,'THE ARRAY "VCOV" BEFORE RETURNING TO THE ')
 2027 FORMAT(T5,'CALLING PROGRAM.')
 2031 FORMAT(T5,'THE ESTIMATE OF THE VARIANCE-COVARIANCE')
 2032 FORMAT(T5,'MATRIX IS NOT POSITIVE DEFINITE, AS DETER-')
 2033 FORMAT(T5,'MINED BY SUBROUTINE "SYMINV" (HEALY, 1968);')
 2034 FORMAT(T5,'THE VALUES OF "NULLTY" AND "IFAULT", RETURNED')
 2035 FORMAT(T5,'BY SYMINV, ARE PLACED IN THE FIRST TWO ')
 2036 FORMAT(T5,'POSITIONS OF THE ARRAY "VCOV" BEFORE RETURNING')
 2037 FORMAT(T5,'TO THE CALLING PROGRAM')
 2045 FORMAT(T5,'"ROWX" IS LESS THAN NTOT')
 2055 FORMAT(T5,'"COLX" IS LESS THAN NB')
 2070 FORMAT(T5,'THE PROGRAM IS STOPPED FOR UNKNOWN REASONS')
 3000 RETURN
      END

===============================================================================

Entry #14       ball@ra                 10/29/87        Asurv Updates

From: Dave Ball

Takashi,

Thanks for the updates to xvar, em and asurv.  I will incorporate them
today.

Dave

===============================================================================

Entry #15       ti@jabba                12/16/87        Question about reading a tape

From: ti@jabba.psu.edu

Hello Dave,

We have a minor problem...
Although we sent tapes made under UNIX, apparently many people tried to
read them on a VAX and failed.  The problem is the Control-J and Control-M
difference (or so they said).  Since you had the same problem before, if
you have a conversion program from UNIX to VAX, could you send it to me?
If you do not have one, could you tell me how to do the conversion?  I
will stay here over the Xmas holidays, so please let me know.  Thanks.

Takashi Isobe

===============================================================================

Entry #16       ball@stsci              12/16/87        Tape question

From: Dave Ball

Takashi,

In answer to your question about tapes: I spent about half a day on the
VAX examining the tape you sent, and realized after a conversation with a
local UNIX expert that the easiest way to read your tape was on the UNIX
machine, which was how I finally got the programs.  Sorry, I don't have a
conversion program, although I think one would be fairly easy to write.

Next, I have finally gotten into the implementation of ASURV in IRAF, now
that several other projects are done (not completed, mind you, just this
or that phase is done!), and I have a few minor questions.  The following
is my "questions" file.  There is no rush on these.  In fact, there will
probably be more to come.

Questions about ASURV code                     Last updated 15-Dec-87

1. In kmestm, just after the call to prnt, there is a "NTOT=IXU+IXC".
   Is this necessary, since ntot is input to this routine?

2. In kmestm, he refers to a "PL estimator": what is the difference
   between that and the KM estimator, if any?

3. In kmestm, at the end, there are several lines changing ixc and ixu
   back to their original values if the last censored point was changed
   to a detection, but these variables are not passed back to the
   caller.  Why bother?

Maybe you will have some snow up there for the holidays!

Dave

===============================================================================

Entry #17       ball@stsci              01/04/88        More ASURV questions

From: Dave Ball

Takashi,

I am still working madly on the ASURV conversion and I have some more
questions for you.  Your last message must have been cut short somehow;
there wasn't any text in the message except a "q"!  Anyway, here is my
expanded questions file, including the first few I sent you.  I hope you
are having a pleasant holiday and are not working TOO hard!  Hope you had
a Merry Christmas and a happy new year!  (I tried to send this to you last
week but I think I messed up; if you DID get this, sorry!, and please
ignore!)

Dave

Questions about ASURV code                     Last updated 30-Dec-87

1. In kmestm, just after the call to prnt, there is a "NTOT=IXU+IXC".
   Is this necessary, since ntot is input to this routine?

2. In kmestm, he refers to a "PL estimator": what is the difference
   between that and the KM estimator, if any?

3. In kmestm, at the end, there are several lines changing ixc and ixu
   back to their original values if the last censored point was changed
   to a detection, but these variables are not passed back to the
   caller.  Why bother?

4. In twost, for the Peto & Peto test, when there are no censored data
   and the test reduces to Gehan, the call to subroutine wlcxn contains
   only two arguments; shouldn't there be four?

5. In subroutine stat, there are several print statements which I have
   commented out in the IRAF version.  What are they printing?  Should
   some kind of output be produced there in the IRAF version?  Or should
   I just comment them out as I did?

6. In subroutine sumry, there is a variable in the calling sequence
   called TIME which is not referenced in the routine.
   I removed it in the IRAF code, but should it be there or not?  If
   yes, what is it used for?

7. In subroutine aarray, there was a statement IFULL = IPR, which would
   reset IFULL to zero since IPR is not defined in aarray.  This would
   result in the minimum printout even if the user asked for all.  This
   is left over from moving all the printout statements up to a higher
   level.

8. In subroutines grdprb and bin, I removed references to OUTPUT,
   IPRINT, and IO, which are all related to "verbose" mode printouts,
   both in the code and in the calling sequences.  The IRAF version of
   twokm will use a verbose flag to control the volume of printout.

9. In subroutine datreg, near the end, there is a misspelling of ICENS.

10. For the EM algorithm, there is an input variable called IBET to
    determine if the dependent variable is a pair of limits.  It appears
    to me that the censoring indicator has a value of 5 in that case, so
    I included a check in the datreg routine which reads in the data to
    determine if a second value should be read, rather than reopening
    the input file and rereading the data as is done in subroutine EM.
    Will this get me into trouble?

===============================================================================

Entry #18       scivax::ball            01/04/88        Asurv output

Bob,

Here is the output file from the statistics stuff.  The mistake you noted
has been fixed!  By the way, do you want a little demo of the stuff?

Dave

     *** Two-Sample test ***

     Title: IR Luminosity of Galaxies
     Data set: IR Luminosity
     Tested groups are Normal and Starburst
     Number of data points: 12; number of upper limits: 5
     Number of data points in Group I : 6
     Number of data points in Group II: 6

     Distinct Failures      R(I)      M(I)   M(I)/R(I)      H(I)
          31.100          11.000   -31.100      0.091      1.000
          30.200          10.000   -30.200      0.100      0.909
          30.100           9.000   -30.100      0.111      0.818
          29.000           7.000   -29.000      0.143      0.727
          28.500           5.000   -28.500      0.400      0.623
          26.900           1.000   -26.900      1.000      0.374
                                                           0.000

     Gehan's Generalized Wilcoxon Test

        T(I)   ID1(I)   ID2(I)    R1(I)    R2(I)
      32.400      1        2        0.0      1.0
      31.100      0        2      -10.0     11.0
      30.200      0        2       -8.0     10.0
      30.100      0        1       -6.0      9.0
      29.700      1        1        3.0      1.0
      29.000      0        2       -3.0      7.0
      29.000      1        2        4.0      1.0
      28.500      0        1        1.0      4.0
      28.500      0        2        1.0      4.0
      28.100      1        1        6.0      1.0
      27.600      1        1        6.0      1.0
      26.900      0        1        6.0      1.0

     Test Statistic = 1.652     Probability = 0.0986

     Logrank Test

        T(I)   ID1(I)   ID2(I)   Score(I)
      32.400      1        2       0.000
      31.100      0        2       0.909
      30.200      0        2       0.809
      30.100      0        1       0.698
      29.700      1        1      -0.302
      29.000      0        2       0.555
      29.000      1        2      -0.445
      28.500      0        1       0.155
      28.500      0        2       0.155
      28.100      1        1      -0.845
      27.600      1        1      -0.845
      26.900      0        1      -0.845

     Test Statistic = 1.742     Probability = 0.0815

     Cox-Mantel Test

      Risk(I)     A(I)
       5.000     0.455
       4.000     0.400
       3.000     0.333
       3.000     0.429
       1.000     0.200
       0.000     0.000

     Test Statistic = 1.814     Probability = 0.0696

     Peto and Peto's Generalized Wilcoxon Test

        T(I)   Score(I)
      32.400     0.000
      31.100     0.909
      30.200     0.727
      30.100     0.545
      29.700    -0.273
      29.000     0.351
      29.000    -0.377
      28.500    -0.003
      28.500    -0.003
      28.100    -0.626
      27.600    -0.626
      26.900    -0.626

     Test Statistic = 1.730     Probability = 0.0837

     Kaplan-Meier Estimator

     Data name: Normal
     Number of data points: 6; number of lower limits: 3

      Variable    PL Estimator    Error
       30.100        0.833        0.152
       29.700  C
       28.500        0.625        0.213
       28.100  C
       27.600  C
       26.900        0.000        0.000

     Percentiles    75 TH     50 TH     25 TH
                    0.000     0.000     0.000

     Mean= 27.767 +/- 0.515

     Kaplan-Meier Estimator

     Data name: Starburst
     Number of data points: 6; number of lower limits: 2

      Variable    PL Estimator    Error
       32.400  C
       31.100        0.800        0.179
       30.200        0.600        0.219
       29.000  C
       29.000        0.400        0.219
       28.500        0.000        0.000

     Percentiles    75 TH     50 TH     25 TH
                   30.875    29.600    28.812

     Mean= 29.460 +/- 0.460

     Correlation Test by Cox Proportional Hazard Model
     Title: Optical infrared luminosities of Galaxies
     Data: optirlum.2ddat
     Number of data points is 16
     Upper limits in      Y      X   Both    Mix
                          6      0      0      0
     Global chi-square is 18.458 with 1 degree of freedom
     Probability is 0.0000

     Correlation Test by Generalized Kendall's Tau

     Title: Optical infrared luminosities of Galaxies
     Data: optirlum.2ddat
     Number of data points is 16
     Upper limits in      Y      X   Both    Mix
                          6      0      0      0
     Z-value is 4.380
     Probability is 0.0000 (H0, two sided)

===============================================================================

Entry #19       ti@jabba                01/07/88        ASURV

From: ti@jabba.psu.edu

Happy New Year Dave:

Although I sent the last mail just before Xmas, I think something happened
while it went through the main computer.  It was shut down around Xmas.
Anyway, how was your holiday?  I had a good time in N.Y.

Thank you for working on ASURV.  I hope that you do not have too many
problems modifying the program (I expect dozens of mistakes, but not
hundreds of them).

Takashi Isobe

The answers to your questions:

1. In kmestm, just after the call to prnt, there is a "NTOT=IXU+IXC".
   Is this necessary, since ntot is input to this routine?

   *********** It is not necessary.  This one and the third one are
   remnants of an older version.

2. In kmestm, he refers to a "PL estimator": what is the difference
   between that and the KM estimator, if any?

   *********** None in my program.  The official name of the method is
   the Kaplan-Meier Product-Limit estimator.  Under the PL estimator
   there are several different estimation methods.

3. In kmestm, at the end, there are several lines changing ixc and ixu
   back to their original values if the last censored point was changed
   to a detection, but these variables are not passed back to the
   caller.  Why bother?

   ********** Same as the first.  You can erase them.

4. In twost, for the Peto & Peto test, when there are no censored data
   and the test reduces to Gehan, the call to subroutine wlcxn contains
   only two arguments; shouldn't there be four?

   ********** Yes, you are right.  Thanks.

5. In subroutine stat, there are several print statements which I have
   commented out in the IRAF version.  What are they printing?  Should
   some kind of output be produced there in the IRAF version?  Or should
   I just comment them out as I did?

   ********** These printouts are for statisticians' interest.
   Astronomers may not want to know them (or don't care).  If you want,
   you can drop them, except for the variance.

6. In subroutine sumry, there is a variable in the calling sequence
   called TIME which is not referenced in the routine.  I removed it in
   the IRAF code, but should it be there or not?  If yes, what is it
   used for?

   ********* This is another fossil from an older version.  Please
   remove it.

7. In subroutine aarray, there was a statement IFULL = IPR, which would
   reset IFULL to zero since IPR is not defined in aarray.  This would
   result in the minimum printout even if the user asked for all.  This
   is left over from moving all the printout statements up to a higher
   level.

   ******** Yes, please remove it.

8. In subroutines grdprb and bin, I removed references to OUTPUT,
   IPRINT, and IO, which are all related to "verbose" mode printouts,
   both in the code and in the calling sequences.  The IRAF version of
   twokm will use a verbose flag to control the volume of printout.

   ******** O.K.

9. In subroutine datreg, near the end, there is a misspelling of ICENS.

   ******** Thank you.  I'll correct it.

10. For the EM algorithm, there is an input variable called IBET to
    determine if the dependent variable is a pair of limits.
    It appears to me that the censoring indicator has a value of 5 in
    that case, so I included a check in the datreg routine which reads
    in the data to determine if a second value should be read, rather
    than reopening the input file and rereading the data as is done in
    subroutine EM.  Will this get me into trouble?

    ******* I do not think that there is any problem.

===============================================================================

Entry #20       scivax::hanisch         01/27/88        help info for Penn State stats pkg

Dave,

Karen and Judy did a little typing for us.  This is the ASURV help manual.
It clearly needs to be redone a bit for use in IRAF help files, but at
least the real grunt work is done.

From: SCIVAX::WIESSNER  12-JAN-1988 10:41
To:   HANISCH
Subj: Here's the typing

1.0  Introduction

Observational astronomers frequently encounter the situation where they
observe a particular property (e.g. far infrared emission, calcium line
absorption, CO emission) of a previously defined sample of objects, but
fail to detect all of them.  The data then contain "upper limits" as well
as detections, preventing the use of simple and familiar statistical
techniques in the analysis.  Statistical quantities frequently sought
include the mean value of the measured variable for the sample, its
distribution function, a comparison of this distribution with other
samples, and the relation of the measured variable to other properties of
the objects.  A number of astronomers have recently recognized the
existence of a variety of statistical methods, or have derived similar
methods on their own, to deal with these problems.  The methods are
frequently collectively called "survival analysis" or the "analysis of
lifetime data," from their origins in actuarial and related fields.

ASURV is a menu-driven stand-alone computer package designed to assist
astronomers in using some of these methods.  Rev. 0.0 of ASURV provides
all of the methods described in our papers in the Astrophysical Journal
(Feigelson and Nelson, 1985; Isobe, Feigelson and Nelson, 1986).

No statistical procedure can magically recover information that was never
measured at the telescope.  However, there is frequently important
information implicit in the failure to detect some objects, which can be
partially recovered under reasonable assumptions.  We purposely provide
several two-sample tests, correlation tests and linear regressions - each
based on a different model of where the upper limits truly lie - so that
the astronomer can judge the importance of the different assumptions.

There are reasons for confidence in the methods.  First, the volume of
scholarly effort devoted to the problem is impressive.  A dozen books and
scores of papers have been published during the last decade.  Second,
other people in enterprises involving thousands of lives and billions of
dollars appear to believe the methods.  Third, our Monte Carlo
simulations of a typical astronomical problem (flux-limited satellite
observations of a magnitude limited population of uniformly distributed
objects) showed several methods to be quite reliable in this context
(Isobe et al. 1986).  More simulations of astronomical problems are
definitely needed.  There are also reasons for not applying these
methods.  We describe some of their known limitations in section 2.3.

Because astronomers are frequently unfamiliar with this field of
statistics, we devote the first section to background material.  Both
general issues concerning censored data and specifics regarding the
methods used in ASURV are discussed.
A more mathematical presentation can be found in many of the references
given in section 2.6.  Those wishing to get right down to work should
read section 2.4, find the appropriate program in section 3, and follow
the instructions in section 4.

2.0  Background

2.1  Overview of Survival Analysis

Statistical methods dealing with censored data have a long and confusing
history.  It began in the 17th century with the development of actuarial
mathematics for the life insurance industry and demographic studies.
Astronomer Edmund Halley was one of the founders.  In the mid-20th
century, this application grew along with biomedical and clinical
research into a major field of biometrics.  Similar (and sometimes
identical) statistical methods were being developed in two other fields:
the theory of reliability, with industrial and engineering applications;
and econometrics, with attempts to understand the relations between
economic forces in groups of people.  Finally, we find the same
mathematical problems and some of the same solutions arising in modern
astronomy, as discussed in section 1 above.

Let us give an illustration of the use of survival analysis in three
disparate fields.  Consider a linear regression problem.  First, an
epidemiologist wishes to determine how the longevity or "survival time"
(dependent variable) depends on the number of cigarettes smoked per day
(independent variable).  The experiment lasts 10 years, during which some
individuals die and others do not.  The survival time of the living
individuals is only known to be greater than their age when the
experiment ends.  These values are therefore "right censored data
points."  Second, a company manufacturing engines wishes to know the
average time between breakdowns as a function of engine speed, to
determine the optimal operating range.  Test engines are set running
until 20% of them fail; the average "lifetime" and its dependence on
speed are then calculated with 80% of the data points censored.  Third,
an astronomer observes a sample of nearby galaxies in the far-infrared to
determine the relation between dust and molecular gas.  Half have
detected infrared luminosities, but half are upper limits ("left censored
data points").  The infrared luminosities are compared to CO
observations, which themselves may contain censored values.

Astronomers have dealt with their upper limits in a number of fashions.
One is to consider only detections in the analysis; while possibly
acceptable for some purposes (e.g. regression), this will clearly bias
the results in others (e.g. estimating luminosity functions).  A second
procedure considers the ratio of detected objects to observed objects in
a given sample.  When plotted in a range of luminosity bins, this has
been called the "fractional luminosity function" and has been frequently
used in extragalactic radio astronomy.  This is mathematically the same
as actuarial life tables.  But it is sometimes used incorrectly in
comparing different samples: the detected fraction clearly depends on the
(usually uninteresting) distance to the objects as well as on their
(interesting) luminosity.  Also, square-root-of-N error bars do not apply
in these fractional luminosity functions.  A third procedure is to take
all of the data, including the exact values of the upper limits, and
model the properties of the parent population under certain mathematical
constraints, such as maximizing the likelihood that the data fit the
model.
This technique, which is at the root of much of modern survival analysis,
was independently developed by astrophysicist Yoram Avni (Avni et al.
1980; Avni and Tananbaum 1986) and is now commonly used in X-ray
astronomy.

The advantage accrued in learning and using survival analysis methods
developed in biometrics, econometrics, and actuarial and reliability
mathematics is the wide range of useful techniques developed in these
fields.  In general, astronomers can have faith in the mathematical
validity of survival analysis.  Issues of great social importance (e.g.
establishing Social Security benefits, strategies for dealing with the
AIDS epidemic, MIL-STD reliability standards) are based on it.  In
detail, however, astronomers must study the assumptions underlying each
model and be aware that some methods at the forefront of survival
analysis may not be fully understood.

2.2  Overview of ASURV

The statistical methods for dealing with censored data might be divided
into a 2x2 grid: parametric vs. nonparametric, and univariate vs.
multivariate.  Parametric methods assume that the censored survival times
are drawn from a parent population with a known distribution function
(e.g. Gaussian, exponential); this is the principal assumption of
industrial reliability models.  Nonparametric models make no assumption
about the parent population, and frequently rely on maximum-likelihood
techniques.  Univariate methods are devoted to determining the
characteristics of the population from which the censored sample is
drawn, and to comparing such characteristics for two or more censored
samples.  Bivariate methods measure whether the censored property of the
sample is correlated with another property of the objects and, if so,
perform a regression that quantifies the relationship between the two
variables.  There are few truly multivariate methods for censored data.
We have chosen to concentrate on nonparametric models, since the
underlying distribution of astronomical populations is usually unknown.
Some methods (e.g. EM algorithm regression), however, are fully
parametric, while others (e.g. Cox regression, Buckley-James regression)
are partially parametric.  Most of the methods are derived and described
in more detail in Schmitt (1985), Feigelson and Nelson (1985) and Isobe
et al. (1986).

The program within ASURV giving univariate methods is UNIVAR.  Its first
subprogram is KMESTM, giving the Kaplan-Meier estimator for the
distribution function of a randomly censored sample.  First derived in
1958, it lies at the root of nonparametric survival analysis.  It is the
unique, self-consistent, generalized maximum-likelihood estimator for the
population from which the sample was drawn.  When formulated in
cumulative form, it has analytic asymptotic (for large N) error bars.
The median is always well-defined, though the mean is not if the lowest
point in the sample is an upper limit.  It is identical to the
differential "redistribute-to-the-right" algorithm independently derived
by Avni et al. (1980), but when formulated in integral form it has a
simple analytic error analysis.

The second univariate program is TWOST, giving a variety of ways to test
whether two censored samples are drawn from the same parent population.
They are mostly generalizations of standard tests for uncensored data,
such as the Wilcoxon and logrank nonparametric two-sample tests.  They
differ in how the censored data are weighted or "scored" in calculating
the statistic.  Thus each is more sensitive to differences at one end or
the other of the distribution.  The Gehan and logrank tests are widely
used in biometrics, while some of the others are not.
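For concreteness, here is a minimal sketch of the product-limit
calculation that KMESTM performs, written in the same Fortran 77 style as
the package.  This is NOT the ASURV code: the sample below is made up,
ties are not treated specially, and upper limits would first have to be
sign-flipped (as subroutine XVAR does) so that the right-censored formula
applies.

      PROGRAM KMDEMO
C
C     * MINIMAL SKETCH OF THE KAPLAN-MEIER PRODUCT-LIMIT         *
C     * ESTIMATOR FOR RIGHT-CENSORED DATA.  NOT THE ASURV        *
C     * ROUTINE KMESTM; THE SAMPLE BELOW IS MADE UP.             *
C     * ICEN(I) = 0 FOR A DETECTION, 1 FOR A CENSORED POINT.     *
C
      IMPLICIT REAL*8 (A-H,O-Z)
      DIMENSION X(6),ICEN(6)
      DATA N /6/
      DATA X    /26.9D0,27.6D0,28.1D0,28.5D0,29.7D0,30.1D0/
      DATA ICEN /0,1,1,0,1,0/
C
C     SORT INTO ASCENDING ORDER (SIMPLE EXCHANGE SORT),
C     CARRYING THE CENSORING FLAGS ALONG.
      DO 20 I=1,N-1
      DO 10 J=I+1,N
      IF(X(J).GE.X(I)) GOTO 10
      T=X(I)
      X(I)=X(J)
      X(J)=T
      IT=ICEN(I)
      ICEN(I)=ICEN(J)
      ICEN(J)=IT
   10 CONTINUE
   20 CONTINUE
C
C     AT THE I-TH ORDERED POINT, N-I+1 OBJECTS REMAIN "AT RISK".
C     EACH DETECTION MULTIPLIES THE SURVIVAL ESTIMATE BY
C     (N-I)/(N-I+1); A CENSORED POINT ONLY SHRINKS THE RISK SET.
      S=1.0D0
      DO 30 I=1,N
      IF(ICEN(I).EQ.0) S=S*DBLE(N-I)/DBLE(N-I+1)
      PRINT 100, X(I),ICEN(I),S
   30 CONTINUE
  100 FORMAT(1X,F8.3,I4,F10.4)
      END

Running the sketch prints the estimator after each ordered point: it
drops to 5/6 (about 0.83) at the first detection and to zero at the last,
which is how the censored points are "redistributed" among the
detections.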
BIVAR provides methods for bivariate data, giving two correlation tests
and three linear regression analyses.  The Cox hazard model (correlation
test), EM algorithm, and Buckley-James method (linear regressions) can
treat several independent variables if the dependent variable contains
only one kind of censoring (i.e. upper or lower limits).  The generalized
Kendall's tau (correlation test) and Schmitt's binned linear regression
can treat mixed censoring, including censoring in the independent
variable, but can have only one independent variable.  We currently do
not provide error analysis for Schmitt's method.

2.3  Cautions and Caveats

The Kaplan-Meier estimator works with any underlying distribution (e.g.
Gaussian, power law, bimodal), but only if the censoring is "random."
That is, the probability that the measurement of an object is censored
cannot depend on the value of the censored variable.  At first glance,
this may seem to be inapplicable to most astronomical problems: we detect
the brighter objects in a sample, so the distribution of upper limits
always depends on brightness.  However, two factors often serve to
randomize the censoring distribution.  First, the censored variable may
not be correlated with the variable by which the sample was initially
identified.  Thus, infrared observations of a sample of radio bright
objects will be randomly censored if the radio and infrared emission are
unrelated to each other.  Second, astronomical objects in a sample
usually lie at different distances, so that the brighter objects are not
always the most luminous.

In cases where the variable of interest is censored at a particular value
(e.g. flux, spectral line equivalent width, stellar rotation rate) rather
than randomly censored, parametric methods appropriate to "Type 1"
censoring should be used.  These are described by Lawless (1982) and
Schneider (1986), but are not included in this package.  Thus, the
censoring mechanism of each study MUST be understood individually to
judge whether the censoring is or is not likely to be random.  The
appearance of the data, even if the upper limits are clustered at one end
of the distribution, is NOT a reliable measure.  A frequent (if
philosophically distasteful) escape from the difficulty of determining
the nature of the censoring in a given experiment is to define the
population of interest to be the observed sample.  The Kaplan-Meier
estimator then always gives a valid redistribution of the upper limits,
though the result may not be applicable in wider contexts.

The two-sample tests are somewhat less restrictive than the Kaplan-Meier
estimator, since they seek only to compare two samples rather than to
determine the true underlying distribution.  The Gehan test assumes that
the censoring patterns of the two samples are the same, but does not
require that they be random.  The logrank test results appear to be
correct even when the censoring patterns differ somewhat.  Little is
known about the limitations of the other two-sample tests given here or
of the bivariate generalized Kendall's tau rank statistic.

The two bivariate correlation coefficients, generalized Kendall's tau and
Cox regression, are both known to be inaccurate when many tied values are
present in the data.  Ties are particularly common when the data are
binned.  Caution should be used in these cases.
Cox regression, though widely used in biostatistical applications,
formally applies only if the "hazard rate" is constant; that is, if the
cumulative distribution function of the censored variable falls
exponentially with increasing values.  Astronomical luminosity functions,
in contrast, are frequently modeled by power law distributions.  It is
not clear whether or not the results of a Cox regression are
significantly affected by this difference.

There are a variety of limitations to the three linear regression methods
-- EM, BJ, and TWOKM -- presented here.  First, only TWOKM works when
censoring is present in both variables.  Second, EM requires that the
residuals about the fitted line follow a Gaussian distribution.  BJ and
TWOKM are less restrictive, requiring only that the censoring
distribution about the fitted line is random.  Both we and Schneider
(1986) find little difference in the regression coefficients derived from
these two methods.  Third, there is considerable uncertainty regarding
the error analysis of the regression coefficients of all three models.
EM gives analytic formulae based on asymptotic normality, while Avni and
Tananbaum (1986) numerically calculate and examine the likelihood
surface.  BJ gives an analytic formula for the slope only, and it lies on
a weak theoretical foundation.  We do not provide Schmitt's (1985)
bootstrap error analysis for TWOKM, partly because of the high
computational expense for large data sets.  Users may thus wish to run
all methods and evaluate the uncertainty with caution.  Fourth, the use
of TWOKM is questionable in certain cases with heavy censoring.  The user
must arbitrarily choose a bin size.  If it is too small, many censored
points at the end of the distribution will be changed to detected points.
If the bins are too large, accuracy in the regression calculation is
reduced.  Fifth, the BJ method has a defect in that the final solution
occasionally oscillates rather than converging smoothly.

If we consider the field of survival analysis from a broader perspective,
we note a number of deficiencies with respect to censored statistical
problems in astronomy.  Most importantly, survival analysis assumes that
the upper limits in a given experiment are precisely known, while in
astronomy they frequently represent n times the RMS noise level in the
experimental detector, where n = 2, 3, 5, or whatever.  Methods adapted
to this situation are clearly needed.  A related deficiency is the
absence of weighted means or regressions.  Survival analysis also has not
yet produced any true multivariate techniques, such as a Principal
Components Analysis that permits censoring.  There also seems to be a
dearth of nonparametric goodness-of-fit tests like the
Kolmogorov-Smirnov test.

Finally, we note that this package - ASURV - is not unique.  Nearly a
dozen software packages for analyzing censored data have been developed
(Wagner and Meeker 1985).  Four are part of large multi-purpose
commercial statistical software systems: SAS, SPSS, BMDP, and GLIM.
These packages are available on many university mainframes.  We have
found BMDP to be the most useful for astronomers (see the Appendices to
Feigelson and Nelson 1985 and Isobe et al. 1986 for instructions on its
use).  It provides most of the functions in KMESTM and TWOST, as well as
a full Cox regression, but no linear regressions.  Other packages
(CENSOR, DASH, LIMDEP, STATPAC, STAR, SURVAN, SURVREG) were written at
various universities, medical institutes and industrial labs, and have
not been evaluated by us.
It is not clear whether ASURV should be maintained and developed, or
whether the astronomical community should adopt one of these other
packages.

2.4  Acknowledgments

The production and distribution of this package is partially supported by
NSF Presidential Young Investigator Award AST 83-51447 and NASA Research
Grant NAG 8-555.  We are grateful to Paul Nelson (Dept. of Statistics,
Kansas State University) for an enduring and fruitful collaboration; to
Gillian Knapp and Mark Wardle (Dept. of Astronomy, Princeton) and Mark
Halpern (Dept. of Statistics, Stanford) for generously lending us
software.

2.5  Annotated Bibliography

Amemiya, T.  Advanced Econometrics (Harvard U. Press:Cambridge MA) 1985.
   [Reviews econometric "Tobit" regression models, including censoring.]

Avni, Y., Soltan, A., Tananbaum, H. and Zamorani, G.  "A Method for
   Determining Luminosity Functions Incorporating Both Flux Measurements
   and Flux Upper Limits, with Applications to the Average X-ray to
   Optical Luminosity Ratio for Quasars," Astrophys. J. 235:694 1980.
   [The discovery paper in the astronomical literature for the
   differential Kaplan-Meier estimator.]

Avni, Y. and Tananbaum, H.  "X-ray Properties of Optically Selected
   QSOs," Astrophys. J. 305:57 1986.  [The discovery paper in the
   astronomical literature of linear regression with censored data for
   the Gaussian model.]

Brown, B.J. Jr., Hollander, M., and Korwar, R.M.  "Nonparametric Tests
   of Independence for Censored Data, with Applications to Heart
   Transplant Studies," in Reliability and Biometry, eds. F. Proschan
   and R.J. Serfling (Philadelphia: SIAM) p. 327 1974.  [Reference for
   the generalized Kendall's tau correlation coefficient.]

Buckley, J. and James, I.  "Linear Regression with Censored Data,"
   Biometrika 66:429 1979.  [Reference for the linear regression with
   Kaplan-Meier residuals.]

Feigelson, E.D. and Nelson, P.I.  "Statistical Methods for Astronomical
   Data with Upper Limits: I. Univariate Distributions," Astrophys. J.
   293:192 1985.  [Our first paper, presenting the procedures in UNIVAR
   here.]

Isobe, T., Feigelson, E.D., and Nelson, P.I.  "Statistical Methods for
   Astronomical Data with Upper Limits: II. Correlation and Regression,"
   Astrophys. J. 306:490 1986.  [Our second paper, presenting the
   procedures in BIVAR here.]

Kalbfleisch, J.D. and Prentice, R.L.  The Statistical Analysis of
   Failure Time Data (Wiley:New York) 1980.  [A well-known monograph,
   with particular emphasis on Cox regression.]

Lee, E.T.  Statistical Methods for Survival Data Analysis (Lifetime
   Learning Pub.:Belmont CA) 1980.  [Hands-on approach, with many useful
   examples and FORTRAN programs.]

Lawless, J.F.  Statistical Models and Methods for Lifetime Data
   (Wiley:New York) 1982.  [A very thorough monograph, emphasizing
   parametric models and engineering applications.]

Miller, R.G. Jr.  Survival Analysis (Wiley:New York) 1981.  [A good
   introduction to the field; brief and informative lecture notes from a
   graduate course at Stanford.]

Schmitt, J.H.M.M.  "Statistical Analysis of Astronomical Data Containing
   Upper Bounds: General Methods and Examples Drawn from X-ray
   Astronomy," Astrophys. J. 293:178 1985.  [Similar to our papers, a
   presentation of survival analysis for astronomers.  Derives the TWOKM
   regression for censoring in both variables.]

Schneider, H.  Truncated and Censored Samples from Normal Populations
   (Dekker:New York) 1986.  [A monograph specializing in Gaussian
   models, with a good discussion of linear regression.]

Wagner, A.E. and Meeker, W.Q. Jr.
"A Survey of Statistical Software for Life Data Analysis," in 1985 Proceedings of the Statistical Computing Section, (Amer. Stat. Assn.:Washington DC), p. 441. [Summarizes capabilities and gives addresses for other software packages.] Wardle, M. and Knapp, G.R. "The Statistical Distribution of the Neutral-Hydrogen Content of SO Galaxies," Astron. J. 91:23 1986. [Discusses the differential Kaplan-Meier distribution and its error analysis.] Wolynetz, M.S. "Maximum Likelihood Estimation in a Linear Model from Confined and Censored Normal Data," Appl. Stat. 28:195 1979. [Published Fortran code for the EM algorithm linear regression.] 3.0 Outline of the Programs and Subroutines UNIVAR deals with univariate censored data, and is divided into two parts. KMESTM calculates and prints the Kaplan-Meier product-limit estimator of a randomly censored distribution. TWOST computes several nonparametric two-sample tests for comparing censored two or more censored data sets. The code closely follows that given by Lee (1980, Appendix B). Figure 1 shows the various subroutines and their relationships. Detailed remarks can be found in comments are the beginning of each subroutine. BIVAR deals with bivariate censored data, and provides two correlation coefficients (subroutines COXREG and BHK) and three linear regressions (EM, BJ, and TWOKM). Figure 2 shows the structure. COXREG computes the correlation probabilities by Cox's proportional hazard model and BHK computes the generalized Kendall's tau. EM calculates linear regression coefficients by the EM algorithm assuming a normal distribution of residuals, BJ calculates the Buckley-James regression with Kaplan-Meier residuals, and TWOKM calculates the binned two-dimensional Kaplan-Meier distribution and associated linear regression coefficients derived by Schmitt (1985). The code for EM algorithm follows that given by Wolynetz (1979) except minor changes. The code for Buckley-James method is adopted from Halpern (Stanford Department of Statistics; private communication). COXREG, BHK, and TWOKM code were developed entirely by us. 4.0 How to Run ASURV 4.1 General Setup ASURV is designed to be menu-driven. There are two basic input structures: a data file, and a command file. The data file is assumed to reside on disk. For each observed object, it contains the measured value of the variable of interest and an indicator as to whether it is detected or not. The indicator is 0 if detected, -1 if an upper limit (most common in astronomical applications), or +1 if a lower limit (most common in lifetime applications). The command file may either reside on disk, or may be given interactively at the terminal. For the univariate program UNIVAR, the format of the data file is determined by the subroutine DATAIN. It is currently set at 10*(I4,F10.3) for KMESTM, where each column represents a distinct subsample. For TWOST, it is set at I4, 10*(I4,F10.3), where the first column gives the sample identification number. Table 1 gives an example of a TWOST data file called 'Galax.dat'. It shows a hypothetical study where six normal galaxies, six starburst galaxies and six Seyfert galaxies were observed with the IRAS satellite. The variable is the log of the 60-micron luminosity. The problem we address are the estimation of the luminosity functions of each sample, and a determination whether they are different from each other at a statistically significant level. Command file input formats are given in subroutine UNIVAR, statements 2690 to 3225. 
The input lines are parsed in subroutines DATA and DATA2.  Note that the
True/False indicators are given in the command file as 1/0 in the first
column (not in the 4th column as required by I4 formats).

For the bivariate program BIVAR, the data file format is determined by
the subroutine DATREG.  It is currently set at I4,10F10.3.  Table 2 shows
an example of a bivariate censored problem.  Here one wishes to
investigate possible relations between the infrared and optical
luminosities in a sample of galaxies.

Note that adjustment of the format statements in DATAIN and DATREG would
allow any program to be run from a single multi-variable, multi-sample
data file if desired.  Users are encouraged to change the format
statements to suit their needs.

BIVAR command file input formats are sometimes a bit complex.  The
examples below illustrate various common cases.  The formats for the
basic command inputs are given in subroutine BIVAR, statements 6660 to
7115.  Additional inputs for the EM algorithm, the Buckley-James method,
and Schmitt's binned two-dimensional Kaplan-Meier method are given in
subroutines R3, R4, and R5 respectively.

The current version of ASURV is set up for data sets with fewer than 200
data points and occupies about 0.35 MBy of core memory.  For problems
with more than 200 points, the array sizes in all subroutines must be
changed.  Note that for the generalized Kendall's tau correlation
coefficient, and possibly other programs, the computation time goes up as
a high power of the number of data points.

                          Table 1
            Example of UNIVAR Data Input for TWOST

     IR, Optical and Radio Luminosities of Normal, Starburst
     and Seyfert Galaxies

        0     0    28.5   ____
        0     0    26.9      |
        0    -1    29.7      |  Normal galaxies
        0    -1    28.1      |
        0     0    30.1      |
        0    -1    27.6   ____
        1    -1    29.0      |
        1     0    29.0      |
        1     0    30.2      |  Starburst galaxies
        1    -1    32.4      |
        1     0    28.5      |
        1     0    31.1   ____
        2     0    31.9      |
        2    -1    32.3      |  Seyfert galaxies
        2     0    30.4      |
        2     0    31.8   ____
        |     |     |
       #1    #2    #3

     Note:  #1 : indicator of subsamples (or groups); if you need only
                 the K-M estimator, leave out this indicator.
            #2 : indicator of censoring
            #3 : variable (in this case, the values are in log form)

                          Table 2
          Example of Bivariate Data Input for MULVAR

     Optical and IR luminosity relation of IRAS galaxies

        0    27.2    28.5
        0    25.4    26.9
       -1    27.2    29.7
       -1    25.9    28.1
        0    28.8    30.1
       -1    25.3    27.6
       -1    26.5    29.0
        0    27.1    29.0
        0    28.9    30.2
       -1    29.9    32.4
        0    27.0    28.5
        0    29.8    31.1
        0    30.1    31.9
       -1    29.7    32.3
        0    28.4    30.4
        0    29.3    31.8
        |     |       |
       #1    #2      #3
           Optical   IR

     Note:  #1 : indicator of censoring
            #2 : independent variable (may be more than one)
            #3 : dependent variable
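As a concrete illustration of the fixed-format input described above, the
following sketch (in the same Fortran 77 style as the package) reads a
bivariate file laid out like Table 2, using the I4,10F10.3 format of
subroutine DATREG.  The unit number and the file name 'galax2.dat' are
made up for the example; the real subroutine also handles multiple
independent variables and the confined-data case.

      PROGRAM RDDEMO
C
C     * SKETCH ONLY: READ A BIVARIATE DATA FILE LAID OUT LIKE   *
C     * TABLE 2, USING THE I4,10F10.3 FORMAT OF SUBROUTINE      *
C     * DATREG.  COLUMN 1 IS THE CENSORING INDICATOR (0 =       *
C     * DETECTION, -1 = UPPER LIMIT, +1 = LOWER LIMIT), COLUMN  *
C     * 2 THE INDEPENDENT AND COLUMN 3 THE DEPENDENT VARIABLE.  *
C     * THE FILE NAME 'galax2.dat' IS HYPOTHETICAL.             *
C
      IMPLICIT REAL*8 (A-H,O-Z)
      DIMENSION IND(200),X(200),Y(200)
      OPEN(UNIT=1,FILE='galax2.dat',STATUS='OLD',FORM='FORMATTED')
      NTOT=0
C
C     READ UNTIL END-OF-FILE; THE 200-POINT LIMIT MATCHES THE
C     ARRAY SIZES USED THROUGHOUT THE PACKAGE.
   10 READ(1,20,END=30) IND(NTOT+1),X(NTOT+1),Y(NTOT+1)
   20 FORMAT(I4,10F10.3)
      NTOT=NTOT+1
      GOTO 10
   30 CLOSE(UNIT=1)
      PRINT *,NTOT,' POINTS READ'
      END

Note that with an F10.3 descriptor the values must sit in the fixed
ten-column fields; free-form, space-separated data would need a
list-directed READ instead.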
===============================================================================
Entry #21 ball@stsci 03/10/88 Multivariate method restriction in ASURV
From: Dave Ball
Hi Takashi, Sorry to have been out of touch for a while. I have been working on other things recently and have just gotten back to the survival analysis work. I have a question about some checks that you make in the bivariate/multivariate algorithms to restrict the choice of methods a user may choose after the subroutine datreg reads in the data. I need to know what to check for in each individual algorithm to prohibit its use on inappropriate data, i.e., what type of data is each of the five methods (the Buckley-James method, the EM method, etc.) not usable for? I have implemented the BHK method and the Cox-hazard method already, but I would appreciate having this information for them also. I think I will have all methods implemented by the end of next week (or shortly thereafter!), so as soon as you can get this to me I would appreciate it! Thanks. Dave
===============================================================================
Entry #22 ball@stsci 03/10/88 ASURV parameter initialization
From: Dave Ball
Hi Takashi, It's me again! I almost forgot: particularly for the EM and Schmitt methods, some initial values (or guesses) for some parameters are to be supplied by the user, but we had agreed that the program would do that; what are good values for these parameters in most cases? Dave
===============================================================================
Entry #23 ball@stsci 03/23/88 Recent mail
From: Dave Ball
Takashi, Thanks for the information. I have not yet had a chance to assimilate all the news, but I do have a file with the data from Kriss in it if you're interested. Let me know and I'll mail you a copy. Dave
===============================================================================
Entry #24 ball@stsci 03/28/88 More questions
From: Dave Ball
Takashi, How's spring in State College? The daffodils are blooming here. I need some more information on the icheck/ifault usage in the EM method; I don't really understand how they are used in the routines em and emalgo, so I am having a little trouble seeing how to code them into my driver routine for the EM method. Also, the information you gave me on how the use of each method is restricted by the type of censoring was good (for the EM, BJ, and Schmitt methods); I need the same information for coxreg and bhk. Thanks! Dave
===============================================================================
Entry #25 scivax::ball 03/28/88 questions answered
From ti@jabba.psu.edu Tue Mar 22 14:18:47 1988
Received: Tue, 22 Mar 88 14:14:59 EST from jabba.psu.edu by stsci.arpa (3.2/1.0)
Received: by jabba.psu.edu (3.2/Psu1.0) id AA00153; Tue, 22 Mar 88 12:53:59 EST
Date: Tue, 22 Mar 88 12:53:59 EST
From: ti@jabba.psu.edu (Takashi Isobe)
Message-Id: <8803221753.AA00153@jabba.psu.edu>
To: ball
Subject: ASURV
Status: R
Hi, Dave: Sorry not to answer your questions sooner, but I just came back from IPAC last weekend and did not start work till yesterday (in fact, I was in Baltimore on Sunday to see the gun show). Here are the answers:
1. The EM algorithm and Buckley-James methods can treat multivariate problems. The restriction is that censoring is allowed only in the dependent variable; all independent variables should be completely detected. On the other hand, Schmitt's method cannot treat multivariate problems, but it can take censored points in both variables. The censoring indicators which can be used for EM and B-J are -1, 0, 1, which indicate the censoring status of the dependent variable. Those for Schmitt's are -4, -3, -2, -1, 0, 1, 2, 3, 4, which indicate the censoring status of the dependent and/or independent variables.
2. Please set the iteration limit around 400. If convergence does not occur before that, the data are not good enough to examine. A tolerance level of 10**-5 is normally good enough. For Schmitt's method, the number of bins on both axes can be set around 15, though that might be too many for a small data set.
If you have any more questions, please let me know. Takashi
P.S. These days, I'm using a Mac and not the department computer, so sometimes it takes a few extra days to find out that I have received mail.
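The censoring-indicator rules in point 1 above lend themselves to the kind of validation check Dave asks about in Entry #21. A hedged Fortran 77 sketch; CHKIND and its argument convention are hypothetical, not ASURV code:

      LOGICAL FUNCTION CHKIND (IND, N, ISCHM)
C     Returns .TRUE. if every censoring indicator is legal for the
C     chosen method: EM and B-J accept -1, 0, +1 (censoring in the
C     dependent variable only); Schmitt's method accepts -4 to +4.
      INTEGER N, IND(N), I, LIM
      LOGICAL ISCHM
      LIM = 1
      IF (ISCHM) LIM = 4
      CHKIND = .TRUE.
      DO 10 I = 1, N
         IF (IND(I) .LT. -LIM .OR. IND(I) .GT. LIM) CHKIND = .FALSE.
   10 CONTINUE
      END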
===============================================================================
Entry #26 scivax::ball 03/28/88 more questions answered
From ti@jabba.psu.edu Tue Mar 22 14:21:46 1988
Received: Tue, 22 Mar 88 14:20:34 EST from jabba.psu.edu by stsci.arpa (3.2/1.0)
Received: by jabba.psu.edu (3.2/Psu1.0) id AA00133; Tue, 22 Mar 88 12:51:01 EST
Date: Tue, 22 Mar 88 12:51:01 EST
From: ti@jabba.psu.edu (Takashi Isobe)
Message-Id: <8803221751.AA00133@jabba.psu.edu>
To: ball
Subject: ASURV
Status: R
Hi Dave: I can't send the mail to you yet, but I'll try again. This mail contains the answers for the mail on the 18th.
o In these tasks, as well as in the EM algorithm, you compute alpha and sigmaa (slope-intercept coefficients). In most places they are dimensioned at 10, but shouldn't they be nvar+2? Some related arrays probably need to be increased also then.
*******You are right. Since not many astronomers use more than 3 independent variables, I just ignored the inconsistency. Please change it in your version.
o When I was testing the IRAF version of the Schmitt algorithm, I got slightly different answers between the original ASURV and the IRAF version for the IRAS luminosity galaxy sample with 10 bins in x and y, letting the program compute sizes and origin, but the same results when I chose 9 bins in both. Your comments in the Schmitt routines suggest that this algorithm is VERY sensitive to the bin size and number. Is this right? If so, it would be good if we could figure out a way for the program to compute optimal values for these parameters.
*******This is a very difficult problem. I do not know the best way. There are two factors. First, if you take only a few bins, you lose the positional information. Second, if you take hundreds of bins, though you get good positional information, you can't use upper limits as weights (you need to drop the upper limits or change them to detections). From my experience, 10 to 15 bins are relatively safe. If you have any thoughts, let me know.
o I need a more extensive set of test data, something with more points and more independent variables (>=2). Do you have something like that I could use to test the ASURV stuff with?
******I don't have the data set that you want, but I suggest using the data from Table 1 of G. Kriss (1988, Ap. J. 324, 809; Jan. 15, 1988). The regression among logL(1micron)-logL(2500)-logL(2kev) is a good one. I haven't had time to test it yet; I'll compute it and let you know the result.
o In the em routine, you use icheck and ifault (returned from emalgo) almost like counters of the number of iterations used. I would like to combine them if possible into ite and return that to the IRAF driver I am working on, and let the driver decode and print the Wolynetz error messages rather than do that in em. Is this possible?
******Yes, I do not find any problem.
I hope that you receive this mail. Takashi
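A small sketch of the dimensioning fix agreed on in the first point above. MXVAR is an illustrative parameter; in ASURV itself the new bound would have to be applied to alpha, sigmaa, and the related arrays in each affected routine:

      PROGRAM DIMFIX
C     Size the coefficient arrays as nvar+2 rather than a fixed 10.
C     MXVAR (the maximum number of independent variables) is
C     hypothetical; ASURV hard-codes its dimensions per routine.
      INTEGER MXVAR
      PARAMETER (MXVAR = 3)
      REAL ALPHA(MXVAR+2), SIGMAA(MXVAR+2)
      ALPHA(1)  = 0.0
      SIGMAA(1) = 0.0
      WRITE (*, *) 'Coefficient arrays dimensioned at ', MXVAR + 2
      END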
===============================================================================
Entry #27 ball@stsci 03/31/88 More questions on the bi/multivariate tasks
From: Dave Ball
Takashi, How's spring in State College? The daffodils are blooming here. I need some more information on the icheck/ifault usage in the EM method; I don't really understand how they are used in the routines em and emalgo, so I am having a little trouble seeing how to code them into my driver routine for the EM method. Also, the information you gave me on how the use of each method is restricted by the type of censoring was good (for the EM, BJ, and Schmitt methods); I need the same information for coxreg and bhk. Thanks! Dave
P.S. I tried to send this to you Monday Mar 28, but your end of the network was down, so I am trying again!
===============================================================================
Entry #28 ti@jabba 04/11/88 ASURV
From: ti@jabba.psu.edu (Takashi Isobe)
Hi Dave: Flowers are blooming and ladies are wearing very attractive clothes... I received your March 31 letter (returned to you on Apr. 4). It looks like there are some problems between our computers. Here is the answer to the question about icheck:
icheck is an indicator related to the subroutine syminv. The subroutine syminv computes the errors for the linear regression coefficients, but it sometimes fails to get results due to zero entries in the diagonal component. If the subroutine em gets ifault<0, it prints only the error messages and not the results, even though all the other values are available; so, to print out the linear regression coefficients (without errors) in that case, I needed to add icheck, which is not in the original code. This problem rarely happens, but it is a good idea to keep the check.
The coxreg can use multiple independent variables if they are completely detected (in the same category as EM and B-J), and BHK can use one independent and one dependent variable, but they can contain upper (lower) limits in either variable.
By the way, my department email address might be changed. Could you try the next few options?
ti@astro.psu.edu
ti@jabba.psu.edu
ti%jabba@psuvm.bitnet
I do not know which one is working now. Takashi Isobe
===============================================================================
Entry #29 scivax::ball 04/11/88 Test messages done for now
Takashi, I have discovered that our Unix machine is having lots of problems today, so I won't be sending you any test messages that way for a while. Let me know which of the other two addresses worked (astro did NOT work from here: unknown host). Thanks. Dave
===============================================================================
Entry #30 ti@jabba 04/15/88 ASURV
From: ti@jabba.psu.edu (Takashi Isobe)
Dave: I received three mails dated Apr. 11 and 12. I guess this means that the experiments are successful. By the way, Eric wants to add two more methods in the regression/correlation part. Are you interested in that? Also, if you want, I can add the error analysis for twokm (though, since it uses the bootstrapping method, this error analysis takes lots of cpu time). Takashi
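For reference, a hedged sketch of the icheck logic Takashi describes in Entry #28. The variable names follow that discussion, but the structure is illustrative rather than ASURV's actual code:

      PROGRAM ICHKEX
C     If the error computation failed (ifault < 0), still print
C     the regression coefficient and flag the missing errors.
      INTEGER IFAULT, ICHECK
      REAL ALPHA, SIGMAA
      ALPHA  = 1.5
      SIGMAA = 0.2
C     Pretend syminv failed on a zero diagonal entry:
      IFAULT = -1
      ICHECK = 0
      IF (IFAULT .LT. 0) ICHECK = 1
      IF (ICHECK .EQ. 1) THEN
         WRITE (*, *) 'Coefficient = ', ALPHA, ' (no errors)'
      ELSE
         WRITE (*, *) 'Coefficient = ', ALPHA, ' +/- ', SIGMAA
      END IF
      END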
===============================================================================
Entry #31 ball@scivax 04/15/88 Additions to ASURV
From: "DAVID BALL"
Takashi, Thanks for the quick response. It looks like our network connections are back together again. In answer to your question, I would say that we are definitely interested in any further major enhancements to ASURV, and since new algorithms for correlation/regression fit that description, we would be interested in adding them to the IRAF implementation. All methods have now been integrated into IRAF; I have begun to work on the help files that users can read online, so the entire package should be ready for you and Dr. Feigelson to review and see demonstrated very soon. Bob would like for you both to come here for a brief visit to do that, but I don't know any more details, such as how soon. Perhaps if you have the new code ready by the time you visit, we could add it while you are here.
As you may recall, the IRAF version is seven separate tasks; if you would like to see some kind of structure diagram of the IRAF programs before you include the new material in your version, I would be glad to oblige. You might see easier ways to build the new code so that it fits easily into both versions. The only real concern I would voice is the one about print statements and parameter-getting which we discussed before. I have supplied main driver routines for all the methods which do that; in some cases I have also rewritten your routines in the IRAF SPP language to do some printout. In any case, however you decide to do it won't be a problem unless you pass data to the lower-level subroutines in common blocks rather than in the calling sequence. That would make it more difficult to expand the programs to allow more than 200 data values later.
Ah, well, I rattle on here. We can discuss any of this by email, or perhaps even a "conference" phone call would be useful(?). I'm sure Bob will be in touch about a visit, and I certainly will be with material for you to read. Hope the weather is nice. Thanks! Dave
------
===============================================================================
Entry #32 scivax::hanisch 05/25/88 visit by Takashi
Would Thu/Fri 16/17 June be clear for you if Takashi were to come visit? I want him to give a talk to the sci staff and to close the loop on the installation of the ASURV package.
===============================================================================
Entry #33 scivax::biemes 05/26/88 statistics$twost
Dave, I see there is a sort2 module used by this twost task. I had to write a "double sort" VOPS routine for the ks test, and maybe this could be used by twost as well. Then we could move the asrt2.gx code to statistics$lib/. I notice that their code would not permit this to be done really easily, but it would be a good idea anyway, especially since their sort2 is a bubble sort. (I just modified the regular VOPS asrt routine, which is a quicksort.) Chris
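For reference, a minimal Fortran 77 sketch of the kind of "double sort" Chris describes: sort one array while carrying a companion array through the same permutation. An insertion sort is used here only for brevity (twost's sort2 is a bubble sort; the VOPS asrt Chris modified is a quicksort), and the name ASRT2X is hypothetical:

      SUBROUTINE ASRT2X (X, Y, N)
C     Sort X into ascending order, applying the identical element
C     moves to Y.  Insertion sort is used for brevity; a production
C     routine would use a quicksort, as in the VOPS asrt.
      INTEGER N, I, J
      REAL X(N), Y(N), TX, TY
      DO 20 I = 2, N
         TX = X(I)
         TY = Y(I)
         J = I - 1
   10    CONTINUE
         IF (J .GE. 1) THEN
            IF (X(J) .GT. TX) THEN
               X(J+1) = X(J)
               Y(J+1) = Y(J)
               J = J - 1
               GO TO 10
            END IF
         END IF
         X(J+1) = TX
         Y(J+1) = TY
   20 CONTINUE
      END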