OzData: Insurance Premiums
Keywords: Cubic spline regression
Description

Age-specific term life premium rates for a sum insured of $50,000 are given in the table. The first column is the age of the insured, the next two columns are the rates for male smokers and non-smokers, and the last two columns are the rates for female smokers and non-smokers. The four sets of points may be plotted separately and fitted by cubic spline regression.

Age     M-Smoke M-Non   F-Smoke F-Non
33      130     100     110     95
34      135     105     110     95
35      140     105     115     100
36      145     110     120     100
37      155     110     125     105
38      160     115     130     105
39      170     120     140     110
40      180     125     145     115
41      195     130     155     120
42      210     140     165     130
43      230     145     175     135
44      250     155     190     145
45      270     170     205     155
46      295     180     225     165
47      325     200     245     180
48      360     215     265     195
49      395     235     290     210
50      435     260     320     230
51      485     285     350     250
52      535     315     380     275
53      590     350     420     305
54      650     390     460     335
55      715     435     505     370
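
As a rough illustration of the cubic spline regression suggested in the description, the following sketch fits the male smoker column by least squares using a truncated power basis. The single knot at age 44, the rescaling of age to [0, 1], and all function names are assumptions made for this example, not part of the original page.

```python
# Cubic regression spline via the truncated power basis (illustrative sketch).
# The knot location is an arbitrary assumption, not taken from the source.

AGE = list(range(33, 56))
M_SMOKE = [130, 135, 140, 145, 155, 160, 170, 180, 195, 210, 230, 250,
           270, 295, 325, 360, 395, 435, 485, 535, 590, 650, 715]
KNOTS = [44.0]  # assumed single interior knot


def basis(x, knots, lo=33.0, hi=55.0):
    """Design-matrix row: 1, u, u^2, u^3 and (u - k)_+^3 for each knot."""
    u = (x - lo) / (hi - lo)  # rescale age to [0, 1] for better conditioning
    row = [1.0, u, u * u, u ** 3]
    for k in knots:
        ku = (k - lo) / (hi - lo)
        row.append(max(u - ku, 0.0) ** 3)
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_spline(xs, ys, knots):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    rows = [basis(x, knots) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x, knots):
    """Evaluate the fitted spline at age x."""
    return sum(c * b for c, b in zip(coef, basis(x, knots)))


coef = fit_spline(AGE, M_SMOKE, KNOTS)
fitted = [predict(coef, a, KNOTS) for a in AGE]
max_resid = max(abs(f - y) for f, y in zip(fitted, M_SMOKE))
print("largest absolute residual: %.2f" % max_resid)
```

The same function can be applied to the other three columns; more knots can be added to KNOTS if the fit looks too stiff.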

Skin Cancer in Texas and Minnesota
Keywords: Logistic regression, Poisson regression.
Description

The data show the incidence of nonmelanoma skin cancer among women in Minneapolis-St Paul, Minnesota, and Dallas-Fort
Worth, Texas. The towns are coded 0 for St Paul and 1 for Fort Worth.

One would expect sun exposure to be greater in Texas than in Minnesota.

Source

Kleinbaum, D., Kupper, L., and Muller, K. (1989). Applied regression analysis and other multivariate methods. PWS-Kent,
Boston, Massachusetts.

Hand, D. et al, (1994). A handbook of small data sets. Chapman and Hall, London.

Analysis

Either fit a logistic regression to the proportion Cases/Population, or treat Cases as a Poisson response with log(Population) as an offset.


                         This page maintained by Gordon Smyth, Department of Mathematics,
                          University of Queensland. (c) 1997. Last modified: 17 October 1998
Cases   Town    Age     Population
1       0       15-24   172675
16      0       25-34   123065
30      0       35-44   96216
71      0       45-54   92051
102     0       55-64   72159
130     0       65-74   54722
133     0       75-84   32185
40      0       85+     8328
4       1       15-24   181343
38      1       25-34   146207
119     1       35-44   121374
221     1       45-54   111353
259     1       55-64   83004
310     1       65-74   55932
65      1       85+     7583
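
With log(Population) as an offset, the Poisson model describes the rate Cases/Population rather than the raw count. The sketch below is not a model fit; it only computes the crude rates per 100,000 and the Fort Worth to St Paul rate ratio for each age group present in both towns. The dictionary layout and function names are assumptions for illustration.

```python
# Crude incidence rates and rate ratios from the table above.
# Each entry is (cases, population); only rows present in the table are used.
ST_PAUL = {
    "15-24": (1, 172675), "25-34": (16, 123065), "35-44": (30, 96216),
    "45-54": (71, 92051), "55-64": (102, 72159), "65-74": (130, 54722),
    "75-84": (133, 32185), "85+": (40, 8328),
}
FORT_WORTH = {
    "15-24": (4, 181343), "25-34": (38, 146207), "35-44": (119, 121374),
    "45-54": (221, 111353), "55-64": (259, 83004), "65-74": (310, 55932),
    "85+": (65, 7583),
}


def rate_per_100k(cases, population):
    """Crude incidence rate per 100,000 women."""
    return 1e5 * cases / population


def rate_ratios(reference, other):
    """Rate in `other` divided by rate in `reference`, per shared age group."""
    ratios = {}
    for age, (c0, n0) in reference.items():
        if age in other:
            c1, n1 = other[age]
            ratios[age] = (c1 / n1) / (c0 / n0)
    return ratios


ratios = rate_ratios(ST_PAUL, FORT_WORTH)
for age, rr in ratios.items():
    print("%-6s St Paul %8.2f  Fort Worth %8.2f  ratio %.2f" % (
        age, rate_per_100k(*ST_PAUL[age]), rate_per_100k(*FORT_WORTH[age]), rr))
```

A town effect in the Poisson model with offset corresponds to a common value for these ratios on the log scale.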

 Passengers on the Titanic
Keywords: binomial regression, contingency table
Description
The data give the survival status of passengers on the Titanic, together with their names, age, sex and passenger class. 
About half of the ages for the 3rd Class passengers are missing, although a good many of these could be filled in from the original source below.

Variable   Description
Name       Recorded name of passenger
Pclass     Passenger class: 1st, 2nd or 3rd
Age        Age in years
Sex        male or female
Survived   1 = Yes, 0 = No

Source
Hinde, Philip (1998). Encyclopedia Titanica.

Analysis
Age, sex and passenger class all have strong relationships with whether the individual survived. There should also be strong
interactions of passenger class with the other two variables. 

This page maintained by Gordon Smyth, Department of Mathematics,
                          University of Queensland. (c) 1998. Last modified: 23 October 1998


 Aboriginal Deaths in Custody
Keywords: binomial regression.
Description
The data give the number of deaths in prison custody in Australia in each of the six years 1990 to 1995, given separately for Aboriginal and Torres Strait Islanders (indigenous) and others (non-indigenous).
Variable     Description
Year         1990 through 1995
Indigenous   Yes = Aboriginal or Torres Strait Islander, No = Non-indigenous
Prisoners    Total number in prison custody
Deaths       Number of deaths in prison custody
Population   Adult population (15+ years)

The data were collected in response to the Royal Commission into Aboriginal Deaths in Custody, the final report of which was tabled in the Federal Parliament on 9 May 1991.
     The report of the Royal Commission has two streams. One is concerned with the ninety-nine Aboriginal and Torres Strait Islander deaths in custody which occurred throughout Australia during the period 1 January 1980 to 31 May 1989. Issues around the causes of death, culpability of custodians and their employers, and the prevention of future deaths were addressed in depth. The second stream concerned what the Royal Commission called the 'underlying issues': the social, cultural, and legal factors which, in the view of the Commissioners, had some bearing on the deaths. These underlying issues, as revealed from the chapter headings of the Royal Commission's National Report, included the Legacy of History, Aboriginal Society Today, Relations With the Non-Aboriginal Community, The Harmful Use of Alcohol and Other Drugs, Schooling, Employment, Unemployment and Poverty, Housing and Infrastructure, Land Needs, and Self-determination.
     The link between the Royal Commission's discussion of the individual deaths investigated, the prevention of future deaths and the underlying issues, is its position on the over-representation of Indigenous people in custody in Australia. A central conclusion of the Royal Commission, illustrating this point, was as follows:
     The work of the commission has established that Aboriginal people in custody do not die at a greater rate than non-Aboriginal people in custody.
     However, what is overwhelmingly different is the rate at which Aboriginal people come into custody, compared with the rate of the general community ... The ninety-nine who died in custody illustrate that over-representation and, in a sense, are the victims of it.
     The conclusions are clear. Aboriginal people die in custody at a rate relative to their proportion of the whole population which is totally unacceptable and which would not be tolerated if it occurred in the non-Aboriginal community. But this occurs not because Aboriginal people in custody are more likely to die than others in custody, but because the Aboriginal population is grossly over-represented in custody. Too many Aboriginal people are in custody too often (Johnston, 1991, Vol 1, p6).
Source
Indigenous deaths in custody 1989 - 1996: a report prepared by the Office of the Aboriginal and Torres Strait Islander Social Justice Commissioner for the Aboriginal and Torres Strait Islander Commission. Aboriginal and Torres Strait Islander Commission, Canberra, 1996.
Analysis
> p <- Deaths/Prisoners
> glm.deaths <- glm(p~Indigenous*Year,family="binomial",weights=Prisoners)
> anova(glm.deaths,test="Chi" )
Analysis of Deviance Table
Binomial model
Response: p
Terms added sequentially (first to last)
                Df Deviance Resid. Df Resid. Dev   Pr(Chi)
NULL                               11   16.45645
Indigenous       1 2.740514        10   13.71594 0.0978333
Year             1 4.700794         9    9.01515 0.0301487
Indigenous:Year  1 1.259585         8    7.75556 0.2617297
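
The Pr(Chi) column can be checked by hand: each term has 1 degree of freedom, and for a chi-square variable with 1 degree of freedom the upper tail probability is erfc(sqrt(x/2)). The sketch below recomputes the residual deviances and p-values from the Deviance column; the function and variable names are chosen for this example only.

```python
import math


def chisq_sf_1df(x):
    """Upper tail probability P(X > x) for chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


NULL_DEVIANCE = 16.45645
TERMS = [("Indigenous", 2.740514), ("Year", 4.700794),
         ("Indigenous:Year", 1.259585)]

resid = NULL_DEVIANCE
results = []
for name, deviance in TERMS:
    resid -= deviance                      # residual deviance after adding the term
    p_value = chisq_sf_1df(deviance)       # sequential test on 1 df
    results.append((name, resid, p_value))
    print("%-16s resid. dev %8.5f  Pr(Chi) %.7f" % (name, resid, p_value))
```

Only the Year term falls below the conventional 0.05 level, which agrees with the printed table.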

This page maintained by Gordon Smyth, Department of Mathematics,
                         University of Queensland. (c) 1998. Last modified: 25 September 1998
http://www.maths.uq.edu.au/~gks/data/oz/custody.html

 Heart Valves in Dogs on Different Exercise Regimens

Keywords: ordinal regression

Description

A new type of heart valve has been developed and is implanted in 63 dogs that have been raised on various levels of exercise.
The numbers of implants that succeed are recorded. Is the proportion of successful implants the same for dogs on all
exercise regimens? Is there a trend with amount of exercise in the proportion of successful implants?

     Variable    Description
     Exercise    Amount of exercise: 1=None, 2=Slight, 3=Moderate, 4=Vigorous
     Implant     1=Successful, 0=Unsuccessful
     Frequency   Number of dogs

Source

Zar, J. H. (1999). Biostatistical Analysis, Fourth Edition. Prentice-Hall International, Upper Saddle River, New Jersey.
Exercise 24.20.

Analysis

This can be used as an example of ordinal logistic regression, with Exercise as the response and Implant as the explanatory
variable.

                         This page maintained by Gordon Smyth, Department of Mathematics,
                          University of Queensland. (c) 1999. Last modified: 03 March 1999

Exercise        Implant Frequency
1       1       8
1       0       7
2       1       9
2       0       3
3       1       17
3       0       3
4       1       14
4       0       2
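
Before fitting any regression model, the two questions in the description can be explored directly from the table. The sketch below computes the proportion of successful implants at each exercise level and a Pearson chi-square statistic for homogeneity of those proportions. This is a simple stand-in chosen for illustration, not the ordinal logistic regression suggested above, and the names are assumptions.

```python
# (successes, failures) by exercise level, taken from the table above.
COUNTS = {1: (8, 7), 2: (9, 3), 3: (17, 3), 4: (14, 2)}


def success_proportions(counts):
    """Proportion of successful implants at each exercise level."""
    return {level: s / (s + f) for level, (s, f) in counts.items()}


def pearson_chisq(counts):
    """Pearson chi-square statistic for equal success proportions across levels."""
    total_s = sum(s for s, _ in counts.values())
    total_f = sum(f for _, f in counts.values())
    n = total_s + total_f
    stat = 0.0
    for s, f in counts.values():
        row = s + f
        exp_s = row * total_s / n          # expected successes under homogeneity
        exp_f = row * total_f / n          # expected failures under homogeneity
        stat += (s - exp_s) ** 2 / exp_s + (f - exp_f) ** 2 / exp_f
    return stat


props = success_proportions(COUNTS)
chisq = pearson_chisq(COUNTS)
for level in sorted(props):
    print("exercise %d: proportion successful %.3f" % (level, props[level]))
print("Pearson chi-square on %d df: %.3f" % (len(COUNTS) - 1, chisq))
```

The proportions rise with exercise level, which is the trend the second question asks about; a trend test or ordinal model would use that ordering, which this homogeneity statistic ignores.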

 Width of Ore-Bearing Layer

Keywords: non-parametric regression, thin plate splines.
Description
Data were collected from a mine in Cobar, NSW, Australia. At each of 38 sampling points, several measurements were taken, one of which is the 'true-width' of an ore-bearing rock layer. Also given are the co-ordinates t1 and t2 of the data sites. Green and Silverman (1994) use this data set to illustrate thin-plate splines for fitting a smooth surface.
Source
O'Connor, D. P. H., and Leach, B. G. (1979). Geostatistical analysis of 18CC Stope block, CSA mine, Cobar, NSW.
Estimation of statement of mineral reserves, pp. 145-153. Australian IMM, Melbourne.
Green, P. J., and Silverman, B. W. (1994). Nonparametric regression and generalized linear models. Chapman and Hall, London.

                         This page maintained by Gordon Smyth, Department of Mathematics,
                           University of Queensland. (c) 1997. Last modified: 19 July 1998

t1      t2      Width
-16     -15     17.0
-14     -4      18.0
-13     4       17.5
-7      5       19.0
-6      -43     22.0
-6      -36     24.0
1       -50     17.4
2       -39     23.0
2       -8      23.5
2       -51     15.0
9       -16     23.5
9       -42     25.0
17      -37     16.5
18      -12     19.5
24      -57     12.0
25      -29     18.5
26      -40     18.0
32      -7      14.0
33      -35     19.0
40      4       13.5
40      -61     18.0
44      -29     19.4
48      -65     13.0
48      -7      14.0
49      -32     19.5
55      -71     16.0
56      -14     16.0
59      -38     19.0
62      7       19.0
62      -3      21.5
64      -29     22.0
69      -28     20.5
70      -72     11.0
77      -19     26.0
78      -53     22.0
79      -37     26.0
84      -52     16.0
84      -16     16.0

 Prawn Trawling in the Great Barrier Reef
Keywords: regression, non-parametric regression. 
Description
These data refer to a survey of the fauna on the sea bed lying between the coast of northern Queensland and the Great Barrier Reef. The sampling region covered a zone which was closed to commercial fishing, as well as neighbouring zones where fishing was permitted. In view of the large numbers and types of species captured in the survey the catch was summarized as a score, on a log weight scale, which combines information across species. Two such scores are available. The details of the survey, and a full analysis of the data, are in Poiner et al (1997). 
Variable    Description
Zone        an indicator for the closed (1) and open (0) zones
Year        an indicator of 1992 (0) or 1993 (1)
Latitude    latitude of the sampling position
Longitude   longitude of the sampling position
Depth       bottom depth
Score1      catch score 1
Score2      catch score 2
Source
Poiner, IR, Balber, SJM, Brewer, DT, Burrdige, CY, Caeser, D, Connell, M, Denniss, D, Dews, GD, Ellis, AN, Farmer, M,
 Fry, GJ, Glaister, J, Gribble, N, Hill, BJ, Long, BG, Milton, DA, Pitcher, CR, Proh D, Salini, JP, Thomas, MR, Toscas, P,
 Veronise, S, Wang, YG, Wassenberg, TJ (1997). The effects of prawn trawling in the far northern section of the Great
 Barrier Reef, CSIRO Division of Marine Research, Queensland Department of Primary Industries. 
Bowman, A. W., and Azzalini, A. (1997). Applied smoothing techniques for data analysis. Clarendon Press, Oxford. 

                         This page maintained by Gordon Smyth, Department of Mathematics,
                         University of Queensland. (c) 1998. Last modified: 23 October 1998 
http://www.maths.uq.edu.au/~gks/data/oz/reef.txt

 Measurements on Babies
Keywords: analysis of covariance, spurious correlation.
Description
The data consist of measurements (x1, x2, Age in months) on 23 babies, collected in the Faculty of Medicine at the University of Hong Kong. It would be of great medical interest to find a relationship between x1 and x2. However, any correlation between them is likely to be spurious, because both x1 and x2 change systematically with age. See Chris Lloyd's original posting to the ANZStat mailing list.
Source
Chris Lloyd, University of Hong Kong.
Analysis
x2 is independent of x1 after adjustment for Age.
     Dependence of x2 on Age is approximately linear, as the ANOVA below shows.
     There is some evidence of increasing variance, which can be handled by using gamma rather than normal regression.
Analysis of Variance Table
Response: x2
Terms added sequentially (first to last)
                  Df Sum of Sq  Mean Sq  F Value     Pr(F) 
Age                1  25928.03 25928.03 15.68836 0.0010093
as.factor(Age)     1   6420.35  6420.35  3.88479 0.0652303
x1                 1   2098.36  2098.36  1.26966 0.2754847
as.factor(Age):x1  2    515.02   257.51  0.15581 0.8569284
Residuals         17  28095.76  1652.69

                         This page maintained by Gordon Smyth, Department of Mathematics,
                           University of Queensland. (c) 1998. Last modified: 25 June 1998

x1      x2      Age
0.729   280.1   3
0.785   402.2   3
0.625   351.4   3
0.604   315.5   3
0.701   306     3
0.957   315     3
0.664   220.2   3
0.64    223.6   12
0.464   214.3   12
0.684   224.5   12
0.517   256     12
0.581   285.4   12
0.814   215.1   12
0.636   231     12
1.051   269.6   12
0.41    222.5   24
0.701   221.1   24
0.65    208.9   24
0.234   170.1   24
0.674   254.5   24
0.545   263.9   24
0.429   249.1   24
0.358   210.8   24
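
To see how a shared dependence on Age can produce a spurious correlation, the sketch below compares the pooled correlation of x1 and x2 with the correlation after subtracting the age-group means from both variables. Centering on group means is a simple stand-in for the covariance adjustment described in the Analysis, assumed here for illustration; the function names are likewise assumptions.

```python
import math

X1 = [0.729, 0.785, 0.625, 0.604, 0.701, 0.957, 0.664,
      0.64, 0.464, 0.684, 0.517, 0.581, 0.814, 0.636, 1.051,
      0.41, 0.701, 0.65, 0.234, 0.674, 0.545, 0.429, 0.358]
X2 = [280.1, 402.2, 351.4, 315.5, 306.0, 315.0, 220.2,
      223.6, 214.3, 224.5, 256.0, 285.4, 215.1, 231.0, 269.6,
      222.5, 221.1, 208.9, 170.1, 254.5, 263.9, 249.1, 210.8]
AGE = [3] * 7 + [12] * 8 + [24] * 8


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


def group_means(values, groups):
    """Mean of `values` within each distinct group label."""
    return {g: mean([v for v, h in zip(values, groups) if h == g])
            for g in sorted(set(groups))}


def center_by_group(values, groups):
    """Subtract each observation's group mean."""
    means = group_means(values, groups)
    return [v - means[g] for v, g in zip(values, groups)]


x2_means = group_means(X2, AGE)
r_pooled = pearson(X1, X2)
r_within = pearson(center_by_group(X1, AGE), center_by_group(X2, AGE))
print("mean x2 by age:", {g: round(m, 1) for g, m in x2_means.items()})
print("pooled correlation      %.3f" % r_pooled)
print("within-age correlation  %.3f" % r_within)
```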

VISTA
Analyses

     Exploratory and Descriptive Data Analysis

          Dynamic Exploratory Graphics include Spinplots, Scatterplots,
          Scatterplot Matrices, Histograms, Boxplots, Parallel Coordinate Plots,
          Mosaic Plots, Quantile Plots, Normal Probability Plots,
          Quantile-Quantile Plots, Diamond Plots, Dotplots, Biplots, and Guided
          Tour Plots. 
          Plots support brushing and labeling, and are dynamically linked. 
          Smoothers and Contours can be added to several plots. 
          Descriptive Statistics including Means, Standard Deviations,
          Variances, Ranges, Quartiles, Medians, Correlations, Covariances,
          Distances 



     Univariate Analysis

          Univariate Tests including T- and Z-tests (confidence intervals) for
          single sample, paired samples and two independent samples data,
          with Wilcoxon Signed-Rank and Mann-Whitney tests in appropriate
          situations. 
          ANOVA - Univariate Analysis of Variance for balanced and
          unbalanced, one or multi-way data (data must be complete). Model
          may or may not include two-way (but not higher-way) interactions.
          The model visualization is a spreadplot composed of a boxplot,
          diamond plot, quantile plot, quantile-quantile plot and effects plot. 
          Multiple Regression - Univariate regression includes simple, multiple,
          robust, and monotonic regression. The model visualization is a
          spreadplot comprised of a regression, added-variable, influence,
          leverage, and residuals plots. Weight plots are also included for robust
          and monotonic regression. 



     Multivariate Analysis

          Multiple Regression - Multivariate Multiple Regression Analysis. The
          spreadplot consists of a biplot, spinplot, histogram and
          scatterplot-matrix. 
          Principal Component Analysis of correlations or covariances. The
          model visualization is a spreadplot composed of a biplot, spin-plot,
          scree-plot and scatterplot-matrix. 
          Multidimensional Scaling of one or more symmetric or asymmetric
          matrices. The model visualization is a spreadplot composed of a
          scatterplot, spin-plot, scree-plot and scatterplot-matrix. The
          spreadplot supports graphical re-estimation of model parameters. 
          Correspondence Analysis of two-way contingency tables. The
          model visualization is a spreadplot composed of a biplot, spinplot,
          residuals plot and scree-plot. The spreadplot supports graphical
          re-estimation of model parameters. 


                           Copyright (c) 1998 by Forrest W. Young.
                                  All rights reserved.

FFGRID and DENSITY display unevenly distributed data in 2-D and 3-D as
ordinary colour-coded, regularly spaced data. This is very useful when
dealing with data which are difficult to view using plot3 (which, in
my experience, is the case with all 3-D data that are not completely
smooth).

FFGRID is a Fast `n' Furious way to do the same job that griddata does
for you.  The difference is that there is no interpolation. Empty
points are left empty rather than trying to fill them using
neighbouring points. Also, FFGRID has no problem with multiple points
that fall in the same grid cell. Data are displayed using PCOLOR
unless output arguments are specified, in which case the matrix of
(regularly spaced) data is returned along with the vectors specifying
the x- and y-dimensions.
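
The idea of gridding without interpolation can be sketched in a few lines. The Python function below is not FFGRID and does not reproduce its interface; it is an assumed minimal version in which each point is assigned to a grid cell, points sharing a cell are averaged, and empty cells are left as None. The per-cell counts it also returns are a 2-D histogram.

```python
import math


def bin_average(x, y, z, dx, dy):
    """Average z over regular dx-by-dy cells; cells with no points stay None."""
    x0, y0 = min(x), min(y)
    nx = int(math.floor((max(x) - x0) / dx)) + 1
    ny = int(math.floor((max(y) - y0) / dy)) + 1
    total = [[0.0] * nx for _ in range(ny)]
    count = [[0] * nx for _ in range(ny)]
    for xi, yi, zi in zip(x, y, z):
        i = int((yi - y0) // dy)           # row index from y
        j = int((xi - x0) // dx)           # column index from x
        total[i][j] += zi
        count[i][j] += 1
    grid = [[total[i][j] / count[i][j] if count[i][j] else None
             for j in range(nx)] for i in range(ny)]
    xs = [x0 + (j + 0.5) * dx for j in range(nx)]   # cell-centre coordinates
    ys = [y0 + (i + 0.5) * dy for i in range(ny)]
    return grid, count, xs, ys


x = [0.0, 0.2, 1.5, 3.0]
y = [0.0, 0.1, 0.4, 2.0]
z = [1.0, 3.0, 10.0, 7.0]
grid, count, xs, ys = bin_average(x, y, z, 1.0, 1.0)
for row in grid:
    print(row)
```

The first two points fall in the same cell and are averaged, while cells that receive no data stay empty rather than being filled from neighbours.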

DENSITY is akin to HIST in that it displays a density distribution,
but this time in 2-D rather than 1-D.

BIN is a small M-file that is needed for FFGRID and DENSITY to do the
job right.

Oyvind Breivik
Oyvind.Breivik@gfi.uib.no

1/9/98
PPLOT is a graphical plot layout and design tool for both Matlab 4 and
Matlab 5 (both PC and UNIX versions). PPLOT() is a substitute for the Matlab
PLOT command, and PPLOT without arguments is a substitute for the Matlab
FIGURE command.
 
 
Now you can create legends, insert text, titles and labels. You can place,
move and resize objects simply by 'click and drag'. You can change
properties on any object like colors, font, linewidth, linetype etc., you
can even rotate text.
 
 
The original data are saved, so complex data can still be analysed. You can make
all kinds of calculations and analyses on the plotted data.
 
 
Any number of figures containing any number of axes can be created. The plot
goes to the active figure and you can select destination axes simply by
clicking with the mouse. An unlimited undo makes it easy to test different
layouts. PPLOT comes with a large number of plugins to plot Smith charts,
draw arrows, filters for a number of file formats etc.
 
 
- Everything you wanted to do with your plots but were afraid to ask...
 
 
See also: http://extwww.lulea.trab.se/users/joajoh/pplot/
 
 
Joachim Johansson 
Joachim.K.Johansson@telia.se
 
 
1/18/99
STACKFIGS is used to display multiple figures simultaneously by stacking 
all open figures. 
 
        STACKFIGS usage: stackfigs  (no arguments)
        Restriction: max number of figures = (screen_vertical_rez_in_pixels/20) 
        STACKFIGS has only been run under Matlab 5
 
Charles Plum
cplum@nichols.com
 
12/24/98
plots a 3-D surface of constant value: f(x,y,z) = const. 
Ruslan L. Davidchack
davidchack@kuphsx.phsx.ukans.edu
http://weizen.chem.ukans.edu/ruslan
11/4/97
The TILEFIGS program is used to display multiple figures simultaneously
by tiling the screen with all open figures. 
 
   TILEFIGS usage: tilefigs ([nrows ncols],border_in pixels)
   Restriction: maximum of 100 figure windows
   Without arguments, tilefigs will determine the closest N x N
   grid for all open figures.
   
   TILEFIGS has only been run under Matlab 5
 

Charles Plum                   
cplum@nichols.com 
 
12/24/98 
Numerical integration: quadg.m   quad2dg.m
These functions are modified versions of the quadg.m and quad2dg.m
files found in the NIT (Numerical Integration Toolbox).
 
The code has been vectorized in order to be able to perform fast 
integration of several integration limits.  As before quadg and 
quad2dg only calculate one and two dimensional integrals, 
respectively,  but you may specify several integration limits in a 
single call to the functions.  It is also possible to integrate
functions given directly as expressions enclosed in parentheses.
 
Example: integration from 0 to 2 and from 2 to 4 for x is done in 
a single call by:
 
>>quadg('(x.^2)',[0 2],[2 4])
 
ans=
 
2.6667   18.6667
 
similarly integration from 0 to 2 and from 2 to 4 for both x and y 
is done in a single call by:
 
quad2dg('(x.^2.*y)',[0 2],[2 4],[0 2],[2 4]) 
 
ans=
 5.3333  112.0000 
 
The files were tested under Matlab version 5.2. 
 
It should be noted that both quadg and quad2dg require the Numerical 
Integration Toolbox (NIT) to calculate the weights.  Also note that 
quad2dg uses distchk function in the Statistics Toolbox to check the 
integration limits and make sure they are of common size. This call 
is not strictly necessary and may be omitted if you do not have the 
statistics toolbox.
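
The two example results above can be reproduced with a small Gauss-Legendre sketch. This is not the NIT code and does not mirror its interface; it is an assumed minimal version using a fixed 3-point rule, which is exact for polynomials up to degree 5 and therefore exact for both example integrands.

```python
import math

# 3-point Gauss-Legendre nodes and weights on [-1, 1].
NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss1d(f, a, b):
    """Integrate f over [a, b] with the 3-point Gauss-Legendre rule."""
    half, mid = 0.5 * (b - a), 0.5 * (a + b)
    return half * sum(w * f(mid + half * t) for t, w in zip(NODES, WEIGHTS))


def quad_many(f, lower, upper):
    """One 1-D integral per pair of limits, in a single call."""
    return [gauss1d(f, a, b) for a, b in zip(lower, upper)]


def quad2d_many(f, xlo, xhi, ylo, yhi):
    """One 2-D integral per set of limits, as a nested 1-D rule."""
    out = []
    for ax, bx, ay, by in zip(xlo, xhi, ylo, yhi):
        inner = lambda x, ay=ay, by=by: gauss1d(lambda y: f(x, y), ay, by)
        out.append(gauss1d(inner, ax, bx))
    return out


one_d = quad_many(lambda x: x ** 2, [0, 2], [2, 4])
two_d = quad2d_many(lambda x, y: x ** 2 * y, [0, 2], [2, 4], [0, 2], [2, 4])
print(["%.4f" % v for v in one_d])   # integrals of x^2 over [0,2] and [2,4]
print(["%.4f" % v for v in two_d])   # integrals of x^2*y over the two squares
```

Analytically the integrals are 8/3 and 56/3 in one dimension and 16/3 and 112 in two, matching the quoted output.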

Per  A. Brodtkorb 
pab@marin.ntnu.no 
02/17/99
Saves current variables in a delimited ASCII file:
variable names first, horizontally, with the data for each below its name.
Usage/Input:  save_ascii(loadname,savename,dataformat,delineator);
  -"loadname"    = filename of the *.mat file to save as ASCII
  -"savename"    = filename to save this text output to
  -"dataformat"  = format of 'double array' data (e.g. '%6f' for
                   six digit fixed-point notation)
  -"delineator"  = what to delimit data blocks with (e.g. '\t' for tab)
 
 eg. save_ascii('data.mat','textfile.txt','%6f','\t');
 
 Limitations:  This script can only handle two data types: 'char' and
 'double', where the 'char' types can only be one dimensional
 (e.g. size = 1X15), and the 'double array's can be one or two dimensional
 (e.g. sizes = 52X1, 1X52, or 30X344).

Kirk Ireson 
kireson@ucsd.edu 
3/9/1999
LETSROLL is a simple MATLAB script that demonstrates how a cycloid is
made by tracing a point on a rolling circle.  It is nothing flash, but
a school teacher wanted a quick demo, and this is the result.  I
thought others might like it as well.  There are two buttons: One
named Let's Roll that starts the circle rolling, and another Quits.
ENJOY!
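
For reference, a point on the rim of a circle of radius r that has rolled through angle theta traces x = r(theta - sin(theta)), y = r(1 - cos(theta)). The sketch below generates such a path; it is an illustration of the formula only, not the LETSROLL script, and its function names are assumptions.

```python
import math


def cycloid_point(r, theta):
    """Position of the traced point after the circle has rolled through theta."""
    return r * (theta - math.sin(theta)), r * (1.0 - math.cos(theta))


def cycloid_path(r, turns=1.0, steps=100):
    """Sampled cycloid over the given number of full turns of the circle."""
    return [cycloid_point(r, 2.0 * math.pi * turns * i / steps)
            for i in range(steps + 1)]


path = cycloid_path(1.0)
top = cycloid_point(1.0, math.pi)   # highest point of the arch
print("start", path[0], "top", top, "end", path[-1])
```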

Peter Dunn
(dunnp@dpi.qld.gov.au / dunn@romulus.sci.usq.edu.au)
05 June 1997
hilbert.m
A .m-file which creates a square matrix with the indices of the
Hilbert space filling curve.
A .m-file which creates vectors containing the row and column
coordinates for the Hilbert space filling curve for an arbitrary
sized matrix.
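
Both outputs described above (a matrix of curve indices and vectors of coordinates) can be sketched with the standard index-to-coordinate mapping for the Hilbert curve. This version is an assumption for illustration: it handles only square grids whose side is a power of two, unlike the arbitrary sizes mentioned above, and it is not the code in the .m-files.

```python
def d2xy(order, d):
    """Map position d along the Hilbert curve to (x, y) on a 2^order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate or flip the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def curve_coordinates(order):
    """Column (x) and row (y) coordinates in curve order."""
    pts = [d2xy(order, d) for d in range(4 ** order)]
    return [p[0] for p in pts], [p[1] for p in pts]


def index_matrix(order):
    """Square matrix whose entry [y][x] is the curve index of that cell."""
    n = 1 << order
    m = [[0] * n for _ in range(n)]
    for d in range(n * n):
        x, y = d2xy(order, d)
        m[y][x] = d
    return m


for row in index_matrix(2):
    print(row)
```

Consecutive indices always land in neighbouring cells, which is the property that makes the curve useful as a scan order.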

Hlbrtcrv.m
The Hilbert space filling curve has recently been introduced to
digital halftoning as a scan order for spatial dithering.  The
advantage to using space filling curves is the error diffusion can be
done in one dimension and the resulting patterns exhibit clustering.
For related literature see works by Velho and Gomes in their book
"Image processing for computer graphics."

Daniel Leo Lau
lau@eecis.udel.edu
6/26/98
group.m
Returns the summary of the columns of X grouped by the first
column of X. 'FUNC' is the summary function.  If 'FUNC' is omitted,
'mean' is used.  'length' can be used to give a count of data
by group.
 
If X contains more than 2 columns, each additional row of 'FUNC'
may contain a function used to summarize each additional column.
 
If only one function is given, it is used to summarize all columns of X.
If the function names are not the same length, pad the strings with
trailing spaces.
 
With one output argument, the summaries are returned in a table,
[G Xbar].  With two output arguments, the group vector and table of
summaries are returned as two variables, G and Xbar.
 
Example:  Summarize a set of data X grouped by measurement time TIME.
          The returned grouped table has columns: TIME, MEAN, STD, N
 
      group ([TIME X X X], ['mean  '; 'std   '; 'length']);
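
The behaviour described above can be sketched in Python as follows. This is an assumed re-implementation for illustration, not group.m itself: functions are passed as callables rather than as rows of a padded character matrix, and the sample data are made up.

```python
from statistics import mean, stdev


def group(rows, funcs=(mean,)):
    """Summarize columns 2..end of `rows`, grouped by the first column.

    `funcs` holds either one function (used for every column) or one
    function per summarized column. Returns rows of [key, summary, ...].
    """
    ncols = len(rows[0]) - 1
    funcs = tuple(funcs)
    if len(funcs) == 1:
        funcs = funcs * ncols
    table = []
    for key in sorted({r[0] for r in rows}):
        members = [r for r in rows if r[0] == key]
        summaries = [f([m[c + 1] for m in members])
                     for c, f in enumerate(funcs)]
        table.append([key] + summaries)
    return table


# Hypothetical measurements X taken at times 1 and 2.
TIME = [1, 1, 2, 2, 2]
X = [2.0, 4.0, 1.0, 3.0, 5.0]
rows = [[t, x, x, x] for t, x in zip(TIME, X)]
table = group(rows, (mean, stdev, len))
for line in table:
    print(line)   # columns: TIME, MEAN, STD, N
```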
 
Don R. Maszle
maze@sparky.berkeley.edu
1/25/99
str2strs.m
This function takes a delimited string s and breaks it up into
sub-strings stored in a cell-array of strings.  Spaces in the string
are converted to underscore characters.  It is currently set up for
tab delimiters, though it may be modified for another character.  Strings
of any length, with any number of elements of any length, are
accepted as input. The char function can then be used to recover the
individual elements of the cell-array output.
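
The described behaviour amounts to a split followed by a character replacement. A Python equivalent, assumed here purely for illustration with a list standing in for the cell-array, might look like this:

```python
def str2strs(s, delimiter="\t"):
    """Split s on the delimiter and replace spaces with underscores."""
    return [part.replace(" ", "_") for part in s.split(delimiter)]


parts = str2strs("left knee\tright hip\tx")
print(parts)
```

Passing a different `delimiter` corresponds to modifying the function for another character.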

Tested with the Student version of Matlab 5.

David Malicky
University of Michigan
malicky@umich.edu
5/11/98

This is a collection of Matlab files for time-frequency analysis.
These programs are either a result of my research or something that I
found useful enough to spend the time to implement (sometimes they
even intersect).

Included are: a rigorous implementation of time-frequency
distributions (Cohen class), some quartic time-frequency
distributions, atomic decomposition based on maximum likelihood
estimation, fractional Fourier transform, time-varying filtering, and
other useful little utilities.

A README file is included and information on each function is
available through the MATLAB "help" command.  There isn't a manual,
but you can find details in my papers at http://www.eecs.umich.edu/~jeffo
or you can send me an email.

Jeff O'Neill
jeffo@eecs.umich.edu
9/3/98
FFTMSPEC Module Spectra of the Fourier Transform
 
  FFTMSPEC(XT,T) Plots the signal XT versus time T and the absolute value
  of the discrete Fourier transform (DFT) of the signal vector XT
  versus a frequency vector F.
 
  [XFM,F] = FFTMSPEC(XT,T) Returns the Fourier Transform of XT and the
  frequency range in vectors XFM and F.
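
The quantity being plotted can be sketched with a direct DFT. The version below is an assumption for illustration: it returns the unnormalized two-sided magnitude spectrum with frequencies k/(N*dt), which may differ from the scaling and frequency range FFTMSPEC actually uses, and it assumes evenly spaced samples.

```python
import cmath
import math


def dft_magnitude(xt, t):
    """Magnitude of the DFT of xt and the matching frequency vector."""
    n = len(xt)
    dt = t[1] - t[0]                       # assumes uniform sampling
    f = [k / (n * dt) for k in range(n)]
    xfm = [abs(sum(xt[m] * cmath.exp(-2j * math.pi * k * m / n)
                   for m in range(n)))
           for k in range(n)]
    return xfm, f


# One second of a 4 Hz cosine sampled at 32 Hz.
N = 32
t = [m / N for m in range(N)]
xt = [math.cos(2.0 * math.pi * 4.0 * tm) for tm in t]
xfm, f = dft_magnitude(xt, t)
peak = max(range(N // 2), key=lambda k: xfm[k])
print("peak at %.1f Hz with magnitude %.3f" % (f[peak], xfm[peak]))
```

A direct DFT costs on the order of N squared operations; an FFT gives the same values faster for long signals.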
 
Jesus A. Rojas Zavarce.
jrojasz@telcel.net.ve
12/07/98

fourgraph.m is a demo of plotting a Fourier series for a given function of one
variable; fourgraph.mat and draw.m are required for the demo
to run.

Yaniv Hollander.
aet1417@aerodyne.technion.ac.il

2/26/98