These data files accompany the SPSS Guide for Barnard biology students. They are compiled from various publicly available sources, to give a range of real data examples to illustrate the concepts taught in this course. Click on the name of a data set to link to the Excel file.
Contents
Descriptions of data sets
Old Faithful
Statistical Methods
Examination of frequency distributions, time series analysis, regression
Summary
These are data on the eruptions of the Old Faithful geyser in Yellowstone National Park, for one month (July 1995). The data were collected to estimate how the duration of eruption can be used to predict when the next eruption will occur. These data are modified from the original data set for easier use.
Variables
Date: Day of July 1995 on which the data were collected.
Time: Time of day (minutes since midnight) at which the eruption actually began, i.e., when water first came out of the geyser.
Duration: Length of the previous eruption (in minutes).
Interval: Length of time (in minutes) between this eruption and the previous eruption.
Height: Estimated height of the previous eruption (in feet).
Prediction: Time of day (minutes since midnight) at which eruption was predicted to begin. The prediction was based on the Yellowstone National Park estimate that an eruption occurs after 10*(previous duration time) + 30 minutes.
Accurate: Did the predicted eruption time fall within 1% of the observed eruption time? I.e., observed*0.99 < predicted < observed*1.01.
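The prediction rule and the accuracy check described above can be sketched in Python. This is only an illustration: the function names and sample values are hypothetical, and it assumes the predicted time is the start of the previous eruption plus the predicted interval.

```python
def predicted_interval(previous_duration_min):
    """Minutes until the next eruption: 10 * (previous duration) + 30."""
    return 10 * previous_duration_min + 30


def is_accurate(predicted, observed, tolerance=0.01):
    """True if predicted lies strictly within 1% of observed.

    Both times are in minutes since midnight, as in the data file.
    """
    return observed * (1 - tolerance) < predicted < observed * (1 + tolerance)


# Hypothetical values for illustration, not taken from the data file.
previous_start = 600       # previous eruption began at 10:00
previous_duration = 4.0    # it lasted 4 minutes
prediction = previous_start + predicted_interval(previous_duration)
print(prediction, is_accurate(prediction, observed=672))
```

Because the tolerance is a percentage of minutes since midnight, the allowed window is wider late in the day than early in the morning, which is worth keeping in mind when interpreting the Accurate variable.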
Dow Jones
Statistical Methods
Time series analysis, frequency distributions.
Summary
Dow Jones Industrial Average (DJIA) closing values from 1900 to 1993. The first column contains the date (yymmdd); the second column contains the value.
These data are used in: E. Ley (1996): "On the Peculiar Distribution of the U.S. Stock Indices" in The American Statistician.
Variables
Date
Value: The index value at the close of that day.
Change: The change in index value compared to the last day for which there is data available.
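A short Python sketch of how the yymmdd dates and the Change variable might be handled. The helper names and the sample rows are hypothetical; the code assumes every date falls in 1900-1999 (consistent with the 1900 to 1993 range) and that dates stored as numbers may have lost their leading zeros.

```python
from datetime import date


def parse_yymmdd(text):
    """Parse a yymmdd string, assuming every year falls in 1900-1999."""
    text = text.zfill(6)  # restore leading zeros lost if stored as a number
    return date(1900 + int(text[:2]), int(text[2:4]), int(text[4:6]))


def daily_changes(values):
    """Change from the previous available value; the first day has none."""
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical rows for illustration only, not actual DJIA values.
rows = [("930104", 100.0), ("930105", 101.5), ("930106", 101.0)]
dates = [parse_yymmdd(d) for d, _ in rows]
changes = daily_changes([v for _, v in rows])
print(dates[0].isoformat(), changes)
```

The century is added by hand because a generic two-digit-year parser may map years such as "00" to 2000 rather than 1900.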
Kidney
Statistical Methods
Chi-square
Summary
Data on the recurrence times to infection, at the point of insertion of the catheter, for kidney patients using portable dialysis equipment. There are two observations per patient. Originally from McGilchrist and Aisbett, Biometrics 47, 461-66, 1991.
Variables
Patient: ID code
Time: Time of recurrence
Age
Sex: 1 = male, 2 = female
Disease type: 0 = Glomerulo Nephritis, 1 = Acute Nephritis, 2 = Polycystic Kidney Disease, 3 = Other
Frailty: Authors' estimate of the frailty of the patient.
Canopy Leaf Gas Exchange
Statistical Methods
T-test, one-way ANOVA, linear regression
Summary
Measurements were made from canopy access towers between July 1991 and October 1992. Four species were observed: Oak = red oak (Quercus rubra); RM = red maple (Acer rubrum); WB = white birch (Betula papyrifera); YB = yellow birch (Betula alleghaniensis). Each observation represents a single measurement of a leaf in the canopy. Investigators: Susan Bassow & Fakhri Bazzaz.
Variables
Month: Month in which the measurement was made
Species: Four species were observed: Oak = red oak (Quercus rubra); RM = red maple (Acer rubrum); WB = white birch (Betula papyrifera); YB = yellow birch (Betula alleghaniensis).
Photosyn: Photosynthetic rate of the leaf. Units are micromoles of CO_{2} / m^{2} / sec.
T_leaf: Leaf temperature measured by a leaf thermocouple touching the underside of the leaf when enclosed in the measurement device. Units are degrees Celsius.
RH: Relative humidity (%) measured in the measurement device.
Clouds
Statistical Methods
Mann-Whitney U test
Summary
Clouds were randomly seeded or not with silver nitrate, and rainfall amounts were recorded from the clouds. The purpose of the experiment was to determine whether cloud seeding increases rainfall. The data give the rainfall in acre-feet from 52 clouds, 26 of which were chosen at random and seeded with silver nitrate.
References: Chambers, Cleveland, Kleiner, and Tukey. (1983). Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, CA, 351. Original Source: Simpson, Alsen, and Eden. (1975). A Bayesian analysis of a multiplicative treatment effect in weather modification. Technometrics 17, 161-166.
Variables
Rainfall: Amount of rainfall from clouds (in acre-feet).
Treatment: 0: Unseeded; 1: Seeded with silver nitrate.
Fluoride
Statistical Methods
Paired T-test, Wilcoxon Signed Ranks Test
Summary
Data representing the number of carie-free subjects per 100 children before and after city water fluoridation projects in 16 cities.
Originally from Effect of Water Fluoridation on Carie-Free Rates (Data from Osborne, 1980, p. 40).
Variables
ID: City number
Before: Cases of tooth decay per 100 children before treatment.
After: Cases per 100 after treatment.
Diversity Patterns of Bird Community
Statistical Methods
ANOVA, regression, correlation (Pearson and Spearman)
Summary
Data were gathered at 40 different oak woodland sites along the west coast of North America. The data were collected in spring, during the bird breeding season, in 1994.
The sites were around 5 hectares in size in relatively homogeneous habitat. At least three visits were made to each site, and the number of breeding pairs of each bird species was counted by sight and by analyzing sound recordings. Measurements of vegetation structure were also made during the visits. These data were collected to determine how vegetation density affects the species richness and population density of bird communities.
Variables
Elevation: The height of the site above sea level. Given in meters.
Vegetation Density: A measure of the total amount of foliage at a site. This number is obtained by plotting vegetation density versus height above ground and taking the total area under the graph. The units are specially designed to measure profile area, and are called f.p. units (foliage profile units).
Height: The height of the top of the forest canopy. Technically, the height above which horizontal distance to semi-obscurity exceeds 30 meters. Given in meters.
Latitude: The latitude of the site. Given in degrees.
Longitude: The longitude of the site. Given in degrees.
No. Species: The number of different species of breeding bird pairs.
Total Density: A measure of the bird population density, given as the number of pairs of birds per hectare.
Sea Slug Larvae and Seaweed
Statistical Methods
ANOVA, linear regression, time series, test for equal variances.
Summary
Sea slug larvae locate a patch of seaweed to stay in before developing into adults. These data are from a study on the ability of larvae to detect a particular kind of seaweed at different tide heights. Instead of randomly swimming until they find this seaweed (as was previously believed), these larvae actually "smell" the seaweed when they are passing over it in the water. They do this, it is believed, by detecting chemicals that slowly leach out of the seaweed. This study tests this theory by analyzing the ability of larvae to distinguish the smell of this seaweed at two stages: just as the tide is coming in, when the chemicals are most concentrated, and at high tide, when the chemicals are more dilute due to the rising tide.
Just before the tide came in, one water sample containing filtered sea water was collected away from the patch of seaweed. This sample is the control (it is coded 99). Once the tide washed in, water samples were collected above a patch of seaweed every five minutes, for a total of thirty minutes. There are six replicates for each time point and seven time points (0-30 minutes), so there are 42 observations in total, excluding the control. The control has only five replicates. Data from Dr. Patrick Krug, UCLA, Department of Biology.
Data Description
Fifteen slug larvae were then injected into each of the replicates, and the percentage of larvae that metamorphosed was recorded. This percentage is a function of the ability of the larvae to detect the chemicals from the seaweed.
Variables
Time: When the sample was collected.
Percent: Percentage of larvae that underwent metamorphosis.
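The design arithmetic and the Percent variable can be checked with a few lines of Python. This is a sketch under the assumption that Percent is simply the number of metamorphosed larvae divided by the fifteen larvae added to each replicate; the names and the example count are hypothetical.

```python
LARVAE_PER_REPLICATE = 15

# Samples every five minutes from 0 to 30 minutes after the tide came in.
time_points = list(range(0, 31, 5))
replicates_per_time_point = 6
n_observations = len(time_points) * replicates_per_time_point


def percent_metamorphosed(n_metamorphosed, n_larvae=LARVAE_PER_REPLICATE):
    """Percentage of larvae in one replicate that metamorphosed."""
    return 100.0 * n_metamorphosed / n_larvae


print(len(time_points), n_observations)
print(percent_metamorphosed(9))  # hypothetical count, not from the data file
```

With fifteen larvae per replicate, Percent can only take sixteen distinct values (multiples of 100/15), which matters when checking the assumptions of ANOVA or regression.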
Foraging Habits of Seed Harvesting Ants
Statistical Methods
ANOVA, T-test, linear regression
Summary
The data for both files were collected at the Sierra Nevada Aquatic Research Laboratory (SNARL) in the Great Basin Desert Province. Collection trays were placed into the ground at different distances from the entrance to the ant colonies' mounds, and any ants walking into them were trapped. Thus the data may be considered as random samples of ants at various distances.
General Explanation Of The Study
The two accompanying data files are from a study on the foraging habits of two species of ants: thatch ants (Formica planipilis) and seed harvester ants (Pogonomyrmex salinus). The study was concerned with the ant colonies' different "strategies" for optimizing the balance between collecting food and exposure to risk. The idea, in its simplest form, is that sending more ants to look for food farther away from the colony is more likely to increase the colony's food supply; traveling away from the colony, however, exposes ants to the risk of death by starvation and predation. Natural selection suggests that colonies will develop foraging strategies that maximize the colony's net gain, defined as how much the colony grows (this depends primarily on its food supply) minus how many ants die. A variety of different maximizing strategies exist, and different species often have different strategies. Some colonies develop "worker-conservative" foraging strategies, in which ants foraging at greater distances consume relatively more food: this minimizes the risk of starvation and leads to fewer deaths. Other colonies use strategies that conserve energy (the colony's overall supply of food). In this case long-distance foragers, who are more likely to die, will consume less food so that their deaths will not be as much of a strain on the colony's food supply.
More specifically, this study investigated the relationship between the size of the ants and the distance at which they foraged. Other studies have shown that larger ants use energy more efficiently than smaller ones, and this suggests that larger ants would be more likely to forage at greater distances. Thus ants were collected at various distances from the colony, weighed, and measured. Because an ant's weight provides a measure of how much food, or energy, it carries, and because headwidth measurements allow the ants to be classified by size, the data provide detailed information on the correlations among ant size, foraging distance, and energy supply. This information, in turn, gives insight into foraging strategies: we can, for example, determine whether a colony's strategy is "worker-conservative" or "energy-conservative," or a more complex combination of the two. We might find, for example, by analyzing the relationships among size, foraging distance, and energy supply both for the colony as a whole and within the separate size classes, that a colony's overall strategy is "worker-conservative," but that within size classes the strategy is "energy-conservative." These analyses, finally, illuminate the colony's survival strategies and provide evidence for or against competing evolutionary theories.
Brief Description Of The Data
The data on the thatch ants contains a total of 1199 samples, taken from a total of 11 different colonies.
The data on the seed harvester ants contains a total of 577 samples, taken from 8 different colonies.
Variables
COLONY: This is a number or letter that identifies which colony the sample was taken from. It is important only for distinguishing between ants from the same colony and ants from different colonies.
DISTANCE: Tells how far from the mound's entrance the sample was taken. Given in meters.
MASS or WT. (MG): How much the sample weighed in milligrams. This variable is used as a measure of how much food (energy) each sample had.
HEADWIDTH: A measure of the sample's maximum headwidth. The units here are basically arbitrary, and correspond to ruled increments that can be seen in the microscope with which the sample is measured. Headwidth is a good indicator of an ant's size, and can be used to classify the ants by size.
WORKER CLASS: A size classification. In the thatch ant data there are five possible classes: "<30", "30-34", "35-39", "40-43", and ">43". In the seed harvester data there are four size classes: "<37", "37-38", "39-40", and ">40".
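The size classes can be reproduced from HEADWIDTH with a small lookup. The sketch below is hypothetical: it assumes headwidths are recorded in whole ruled units (so that, for example, nothing falls between 34 and 35), and the function and constant names are not part of the data files.

```python
# Each entry is (exclusive upper bound in headwidth units, class label).
THATCH_CLASSES = [(30, "<30"), (35, "30-34"), (40, "35-39"), (44, "40-43")]
THATCH_TOP = ">43"

HARVESTER_CLASSES = [(37, "<37"), (39, "37-38"), (41, "39-40")]
HARVESTER_TOP = ">40"


def worker_class(headwidth, classes, top_label):
    """Return the size-class label for a headwidth given in ruled units."""
    for upper_bound, label in classes:
        if headwidth < upper_bound:
            return label
    return top_label


# Hypothetical measurements for illustration.
print(worker_class(33, THATCH_CLASSES, THATCH_TOP))
print(worker_class(41, HARVESTER_CLASSES, HARVESTER_TOP))
```

Recomputing the class this way and comparing it with the WORKER CLASS column is a quick consistency check before running an ANOVA by size class.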
Relationship between IQ and Brain Size
Statistical Methods
Linear regression, chi-square
Summary
This data file contains 20 observations (10 pairs of twins) on 9 variables.
Twins share numerous physical, psychological, and pathological traits. Recent advances in in vivo brain image acquisition and analysis have made it possible to determine quantitatively whether: 1) twins share neuroanatomical traits; and 2) neuroanatomical measures correlate with brain size.
Using magnetic resonance imaging and computer-based image analysis techniques, measurements of the volume of the forebrain, the surface area of the cerebral cortex, and the midsagittal area of the corpus callosum were obtained in 10 pairs of monozygotic twins. Head circumference, body weight, and Full-Scale IQ were also measured.
Originally from Tramo MJ, Loftus WC, Green RL, Stukel TA, Weaver JB, Gazzaniga MS. Brain Size, Head Size, and IQ in Monozygotic Twins. Neurology 1998; 50:1246-1252.
Variables
CCMIDSA: Corpus Callosum Surface Area (cm2)
FIQ: Full-Scale IQ
HC: Head Circumference (cm)
ORDER: Birth Order
PAIR: Pair ID (Genotype)
SEX: Sex (1=Male 2=Female)
TOTSA: Total Surface Area (cm2)
TOTVOL: Total Brain Volume (cm3)
WEIGHT: Body Weight (kg)
Natality
Statistical Methods
ANOVA
Summary
Data from the CDC WONDER database on low birthweight (<1.5 kg) births in the United States, summarized by region and tobacco use of the mother, for 1995-2002.
Variables
RegionCode: 1: Northeast; 2: Midwest; 3: South; 4: West.
Births: Number of births in 2002.
Tobacco Use: 1: Yes, 2: No, 9: Unknown.
Percentage Body Fat
Statistical Methods
Multiple regression
Summary
Lists estimates of the percentage of body fat determined by underwater weighing, together with various body circumference measurements, for 252 men. Accurate measurement of body fat is inconvenient and costly, so it is desirable to have easy methods of estimating body fat that are neither.
Details
A variety of popular health books suggest that the readers assess their health, at least in part, by estimating their percentage of body fat. In Bailey (1994), for instance, the reader can estimate body fat from tables using their age and various skinfold measurements obtained by using a caliper. Other texts give predictive equations for body fat using body circumference measurements (e.g. abdominal circumference) and/or skinfold measurements. These data were gathered to develop a better predictive model of body fat.
The variables listed below, from left to right, are:
Density determined from underwater weighing
Percent body fat from Siri's (1956) equation
Age (years)
Weight (lbs)
Height (inches)
Neck circumference (cm)
Chest circumference (cm)
Abdomen 2 circumference (cm)
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Biceps (extended) circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm)
These data are used to produce the predictive equations for lean body weight. The data were generously supplied by Dr. A. Garth Fisher.
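The relationship between the first two variables can be illustrated in Python. Siri's equation is commonly written as percent fat = 495/D - 450, where D is body density in g/cm^3; that form is assumed here rather than taken from the data file, and the function name and the sample density are hypothetical.

```python
def siri_percent_fat(density_g_per_cm3):
    """Percent body fat from body density, using the commonly cited form
    of Siri's (1956) equation: 495 / D - 450 (an assumption of this sketch)."""
    return 495.0 / density_g_per_cm3 - 450.0


# Hypothetical density for illustration, not a row from the data file.
print(round(siri_percent_fat(1.05), 2))
```

Because percent body fat is computed directly from density, the two variables are almost perfectly related, so only one of them should serve as the response in a multiple regression on the circumference measurements.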
Resins and termites
Statistical Methods
Repeated measures ANOVA
Summary
The bark of some tropical trees seems to offer protection from termites. If this protection could be harnessed, the trees that produce the resin might become a valuable resource. The data collected here are from an experiment to investigate the effects of these tree resins on termites. A resin derived from the bark was dissolved in a solvent and placed on filter paper in two doses (5 mg and 10 mg). For each dosage level, eight dishes were set up with 25 termites in each dish. The termites were fed the dosed filter paper, and a daily count was made of the number of termites surviving. The dishes were observed for fifteen (15) days, but no observations were made on days 3 and 9.
Variables
dish: Dish number
dose: 5 or 10 mg of resin
day1 to day15: Number of termites still alive on that day
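For a repeated measures analysis it can help to list the days that actually have counts and to express each count as a fraction of the starting 25 termites. The following Python sketch does both; the names are hypothetical, and it assumes days 3 and 9 are simply absent or empty in the file.

```python
TERMITES_PER_DISH = 25
MISSING_DAYS = {3, 9}  # no observations were made on these days

# Days 1 to 15 that have a recorded count.
observed_days = [day for day in range(1, 16) if day not in MISSING_DAYS]


def survival_fraction(alive, initial=TERMITES_PER_DISH):
    """Fraction of the dish's original termites still alive."""
    return alive / initial


print(len(observed_days), observed_days)
print(survival_fraction(20))  # hypothetical count, not from the data file
```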
Sources
Harvard Forest Long-Term Ecological Research Site

StatLib
W.H. Freeman Publishing

Data and Story Library (DASL)
EpiInfo
US Centers for Disease Control
