Data Appendix

SPSS Guide Data Appendix

These data files accompany the SPSS Guide for Barnard biology students. They are compiled from various publicly available sources, to give a range of real data examples to illustrate the concepts taught in this course. Click on the name of a data set to link to the Excel file.

Contents

Old Faithful
Dow Jones, 1900-1993
Kidney
Canopy Leaf Gas Exchange
Clouds
Fluoride
Sea Slug Larvae and Seaweed Detection
Foraging Habits of Seed Harvesting Ants
Diversity Patterns of Bird Communities
Relationship between IQ and Brain Size
Natality
Percentage Body Fat
Resins and Termites
Sources

 

Descriptions of data sets

Old Faithful

Statistical Methods

Examination of frequency distributions, time series analysis, regression

Summary

These are data on the eruptions of the Old Faithful geyser in Yellowstone National Park, for one month (July 1995). The data were collected to estimate how the duration of eruption can be used to predict when the next eruption will occur.  These data are modified from the original data set for easier use.

Variables

Date: Day of July 1995 on which data was collected.

Time: Time of day (minutes since midnight) at which the eruption actually began, i.e., beginning of time at which water comes out of geyser.

Duration: Length of the previous eruption in minutes

Interval: Length of time (in minutes) between this eruption and the previous eruption.

Height: Estimated height of previous eruption (in feet).

Prediction: Time of day (minutes since midnight) at which eruption was predicted to begin. The prediction was based on the Yellowstone National Park estimate that an eruption occurs after 10*(previous duration time) + 30 minutes.

Accurate: Did the predicted eruption time fall within 1% of the observed eruption time? I.e., observed*0.99 < predicted < observed*1.01.
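
The Prediction and Accurate variables can be reproduced with a few lines of code. The following is a minimal sketch, assuming that the predicted time is the previous eruption's start time plus the park's estimated wait; the function names and the sample values are hypothetical and are not taken from the data file.

```python
def predicted_interval(previous_duration_min):
    """Park rule of thumb: minutes to wait until the next eruption."""
    return 10 * previous_duration_min + 30


def is_accurate(predicted, observed, tolerance=0.01):
    """True if predicted lies strictly within +/- tolerance of observed."""
    return observed * (1 - tolerance) < predicted < observed * (1 + tolerance)


# Hypothetical values (minutes since midnight), not taken from the data file.
previous_start = 600       # previous eruption began at 10:00
previous_duration = 4.5    # previous eruption lasted 4.5 minutes

# Assumption: the wait is counted from the start of the previous eruption.
predicted_time = previous_start + predicted_interval(previous_duration)
```

With these values the predicted time is 675 minutes since midnight. An observed time of 680 would count as accurate, while an observed time of 700 would not.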

Dow Jones

Statistical Methods

Time series analysis, frequency distributions.

Summary

Dow-Jones Industrial Average (DJIA) closing values from 1900 to 1993. The first column contains the date (yymmdd), second column contains the value.

These data are used in: E. Ley (1996): "On the Peculiar Distribution of the U.S. Stock Indices" in The American Statistician.

Variables

Date

Value: The index value at the close of that day.

Change: The change in index value compared to the last day for which there is data available.
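
As an illustration of how the Date and Change variables relate to the raw columns, here is a minimal sketch. It assumes that every two-digit year belongs to the 1900s (the series ends in 1993) and that a date stored as a number may have lost its leading zeros; the function names are hypothetical.

```python
from datetime import date


def parse_yymmdd(raw):
    """Convert a yymmdd value to a date, assuming every year is 19yy."""
    text = str(raw).zfill(6)  # restore leading zeros lost if stored as a number
    return date(1900 + int(text[:2]), int(text[2:4]), int(text[4:6]))


def daily_changes(values):
    """Change from the previous available value; None for the first row."""
    if not values:
        return []
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]
```

For example, `parse_yymmdd(102)` is read as 2 January 1900, and `daily_changes([100, 102, 99])` gives `[None, 2, -3]`. Because the change is taken against the previous row rather than the previous calendar day, weekends and holidays need no special handling.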

Kidney

Statistical Methods

Chi square

Summary

Data on the recurrence times to infection, at the point of insertion of the catheter, for kidney patients using portable dialysis equipment.  There are two observations per patient. Originally from McGilchrist and Aisbett, Biometrics 47, 461-66, 1991.

Variables

Patient: ID code

Time: Time of recurrence

Age

Sex: 1 = male, 2 = female

Disease type: 0 = Glomerulo Nephritis, 1 = Acute Nephritis, 2 = Polycystic Kidney Disease, 3 = Other

Frailty: Authors' estimate of the frailty of the patient.

Canopy Leaf Gas Exchange

Statistical Methods

T-test, One-way Anova, Linear regression

Summary

Measurements were made in canopy access towers between July 1991 and October 1992. Four species were observed: Oak - red oak (Quercus rubra); RM - red maple (Acer rubrum); WB - white birch (Betula papyrifera); YB - yellow birch (Betula alleghaniensis). Each observation represents a single measurement of a leaf in the canopy. Investigators: Susan Bassow & Fakhri Bazzaz.

Variables

Month: Month in which the measurement was made

Species: Four species were observed: Oak - red oak (Quercus rubra); RM - red maple (Acer rubrum); WB - white birch (Betula papyrifera); YB - yellow birch (Betula alleghaniensis).

Photosyn: Photosynthetic rate of the leaf. Units are micromoles of CO2 / m2 / sec.

T_leaf: Leaf temperature measured by a leaf thermocouple touching the underside of the leaf when enclosed in the measurement device. Units are degrees Celsius.

RH: Relative humidity (%) measured in the measurement device.

Clouds

Statistical Methods

Mann-Whitney U test

Summary

Clouds were randomly seeded or not with silver nitrate, and rainfall amounts were recorded from the clouds. The purpose of the experiment was to determine whether cloud seeding increases rainfall. The data give the rainfall in acre-feet from 52 clouds, 26 of which were chosen at random and seeded with silver nitrate.

References: Chambers, Cleveland, Kleiner, and Tukey. (1983). Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, CA, 351. Original Source: Simpson, Alsen, and Eden. (1975). A Bayesian analysis of a multiplicative treatment effect in weather modification. Technometrics 17, 161-166.

Variables

Rainfall: Amount of rainfall from clouds (in acre-feet).

Treatment: 0: Unseeded; 1: Seeded with silver nitrate.

Fluoride

Statistical Methods

Paired T-test, Wilcoxon Signed Ranks Test

Summary

Data representing the number of caries-free subjects per 100 children before and after city water fluoridation projects in 16 cities.

Originally from Effect of Water Fluoridation on Carie-Free Rates (Data from Osborne, 1980, p. 40).

Variables

ID: City number

Before: Cases of tooth decay per 100 children before treatment.

After: Cases per 100 after treatment.

Diversity Patterns of Bird Communities

Statistical Methods

Anova, regression, correlation (Pearson and Spearman)

Summary

Data were gathered at 40 different oak woodland sites along the west coast of North America. The data were collected in spring, during the bird breeding season, in 1994.

The sites were around 5 hectares in size in relatively homogeneous habitat. At least three visits were made to each site, and the number of breeding pairs of each bird species was counted by sight and by analyzing sound recordings. Measurements of vegetation structure were also made during the visits. These data were collected to determine how vegetation density affects the species richness and population density of bird communities.

Variables

Elevation: The height of the site above sea level. Given in meters.

Vegetation Density: A measure of the total amount of foliage at a site. This number is obtained by plotting vegetation density versus height above ground and taking the total area under the graph. The units are specially designed to measure profile area, and are called f.p. units (foliage profile units).

Height: The height of the top of the forest canopy. Technically, the height above which horizontal distance to semi-obscurity exceeds 30 meters. Given in meters. 

Latitude: The latitude of the site. Given in degrees.

Longitude: The longitude of the site. Given in degrees.

No. Species: The number of different species of breeding bird pairs.

Total Density: A measure of the bird population density. It is simply the number of pairs of birds per hectare.
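
The Vegetation Density variable is described as the area under a plot of vegetation density versus height above ground. The source does not say how that area was computed; the following is a minimal sketch that assumes the trapezoidal rule, with a hypothetical function name and made-up sample points.

```python
def profile_area(heights, densities):
    """Area under a density-versus-height curve by the trapezoidal rule."""
    if len(heights) != len(densities):
        raise ValueError("heights and densities must have the same length")
    area = 0.0
    for i in range(1, len(heights)):
        width = heights[i] - heights[i - 1]
        # Each slice is a trapezoid: width times the mean of its two sides.
        area += width * (densities[i] + densities[i - 1]) / 2
    return area


# Hypothetical profile: heights in meters, densities in arbitrary units.
example_area = profile_area([0, 2, 4, 8], [0.0, 1.0, 1.0, 0.0])
```

For this made-up profile the three slices contribute 1, 2, and 2, for a total area of 5.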

Sea Slug Larvae and Seaweed

Statistical Methods

Anova, linear regression, time series, test for equal variances.

Summary

Sea slug larvae locate a patch of seaweed to stay in before developing into adults. These data are from a study on the ability of larvae to detect a particular kind of seaweed at different tide heights. Instead of randomly swimming until they find this seaweed (as was previously believed), these larvae actually "smell" the seaweed when they are passing over it in the water. They do this, it is believed, by detecting chemicals that slowly leach out of the seaweed. This study tests this theory by analyzing the ability of larvae to distinguish the smell of this seaweed just as the tide is coming in -- when the chemicals are most concentrated -- and at high tide -- when the chemicals are more dilute due to the rising tide.

Just before the tide came in, one water sample containing filtered sea water was collected away from the patch of seaweed. This sample is the control (it is coded 99). Once the tide washed in, water samples were collected above a patch of seaweed every five minutes, for a total of thirty minutes. There are six replicates for each time point. There are seven time points (0-30 minutes) so there are a total of 42 observations, excluding the control. The control has only five replicates. Data from Dr. Patrick Krug, UCLA, Department of Biology.
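
The layout of the experiment described above can be written out to check the observation counts. This is a minimal sketch, assuming that the Time variable is coded in minutes (0, 5, ..., 30); only the control code 99 and the replicate counts come from the description, and the variable names are hypothetical.

```python
CONTROL_CODE = 99            # control samples are coded 99 in the data
REPLICATES_PER_TIME = 6
CONTROL_REPLICATES = 5

# Assumption: time points are coded in minutes after the tide washed in.
time_points = list(range(0, 31, 5))  # 0, 5, 10, 15, 20, 25, 30

# One (time, replicate) pair per water sample.
design = [(t, rep) for t in time_points
          for rep in range(1, REPLICATES_PER_TIME + 1)]
design += [(CONTROL_CODE, rep) for rep in range(1, CONTROL_REPLICATES + 1)]

# Drop the control rows before treating Time as a numeric predictor.
treated = [row for row in design if row[0] != CONTROL_CODE]
```

This gives 42 treated samples and 47 rows in total. Filtering out the control code matters for regression, since 99 is a label rather than a time.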

Data Description

Fifteen slug larvae were then injected into each of the replicates, and the percentage of larvae that metamorphosed was recorded.  This percentage is a function of the ability of the larvae to detect the chemicals from the seaweed.

Variables

Time: When the sample was collected.

Percent: Percentage of larvae that underwent metamorphosis.

Foraging Habits of Seed Harvesting Ants

Statistical Methods

Anova, T-test, linear regression

Summary

The data for both files were collected at the Sierra Nevada Aquatic Research Laboratory (SNARL) in the Great Basin Desert Province. Collection trays were placed into the ground at different distances from the entrance to the ant colonies' mounds, and any ants walking into them were trapped. Thus the data may be considered as random samples of ants at various distances.

General Explanation Of The Study

The two accompanying data files are from a study on the foraging habits of two species of ants: thatch ants (Formica planipilis) and seed harvester ants (Pogonomyrmex salinus). The study was concerned with the ant colonies' different "strategies" for optimizing the balance between collecting food and exposure to risk. The idea, in its simplest form, is that sending more ants to look for food farther away from the colony is more likely to increase the colony's food supply; traveling away from the colony, however, exposes ants to the risk of death by starvation and predation. Natural selection suggests that colonies will develop foraging strategies that maximize the colony's net gain, defined as how much the colony grows--this depends primarily on its food supply--minus how many ants die. A variety of different maximizing strategies exist, and different species often have different strategies. Some colonies develop "worker-conservative" foraging strategies, in which ants foraging at greater distances consume relatively more food: this minimizes the risk of starvation and leads to fewer deaths. Other colonies use strategies that conserve energy (the colony's overall supply of food). In this case long-distance foragers, who are more likely to die, will consume less food so that their deaths will not be as much of a strain on the colony's food supply.

More specifically, this study investigated the relationship between the size of the ants and the distance at which they foraged. Other studies have shown that larger ants use energy more efficiently than smaller ones, and this suggests that larger ants would be more likely to forage at greater distances. Thus ants were collected at various distances from the colony, weighed, and measured. Because an ant's weight provides a measure of how much food, or energy, it carries, and because headwidth measurements allow the ants to be classified by size, the data provide detailed information on the correlations among ant size, foraging distance, and energy supply. This information, in turn, gives insight into foraging strategies: we can, for example, determine whether a colony's strategy is "worker-conservative" or "energy-conservative," or a more complex combination of the two. We might find, for example--by analyzing the relationships among size, foraging distance, and energy supply both for the colony as a whole and within the separate size classes--that a colony's overall strategy is "worker-conservative," but that within size classes the strategy is "energy-conservative." These analyses, finally, illuminate the colony's survival strategies and provide evidence for or against competing evolutionary theories.

Brief Description Of The Data         

The data on the thatch ants contains a total of 1199 samples, taken from a total of 11 different colonies.

The data on the seed harvester ants contains a total of 577 samples, taken from 8 different colonies.

Variables

COLONY: This is a number or letter that identifies which colony the sample was taken from. It is important only for distinguishing between ants from the same colony and ants from different colonies.

DISTANCE: Tells how far from the mound's entrance the sample was taken. Given in meters.

MASS or WT. (MG): How much the sample weighed in milligrams. This variable is used as a measure of how much food (energy) each sample had.

HEADWIDTH: A measure of the sample's maximum headwidth. The units here are basically arbitrary, and correspond to ruled increments that can be seen in the microscope with which the sample is measured. Headwidth is a good indicator of an ant's size, and can be used to classify the ants by size.   

WORKER CLASS: A size classification. In the thatch ant data there are five possible classes: "<30", "30-34", "35-39", "40-43", and ">43". In the seed harvester data there are four size classes: "<37", "37-38", "39-40", and ">40".
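
The WORKER CLASS labels can be derived from HEADWIDTH with a simple lookup. The following is a minimal sketch, assuming that headwidths are recorded as whole ruled increments; the species keys and the function name are hypothetical, while the class boundaries are the ones listed above.

```python
from bisect import bisect_right

# Lower bound of each class after the first, paired with the class labels.
CLASS_BOUNDS = {
    "thatch": ([30, 35, 40, 44], ["<30", "30-34", "35-39", "40-43", ">43"]),
    "seed_harvester": ([37, 39, 41], ["<37", "37-38", "39-40", ">40"]),
}


def worker_class(headwidth, species):
    """Return the size-class label for a headwidth in microscope units."""
    bounds, labels = CLASS_BOUNDS[species]
    # bisect_right counts how many lower bounds the headwidth has reached.
    return labels[bisect_right(bounds, headwidth)]
```

For example, `worker_class(34, "thatch")` falls in "30-34" and `worker_class(41, "seed_harvester")` falls in ">40". A fractional value such as 34.5 would be assigned to "30-34" under this scheme, which is a choice of the sketch and not something the source specifies.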

Relationship between IQ and Brain Size

Statistical Methods

Linear regression, chi-square

Summary

 This data file contains 20 observations (10 pairs of twins) on 9 variables. 

Twins share numerous physical, psychological, and pathological traits.  Recent advances in in vivo brain image acquisition and analysis have made it possible to determine quantitatively whether: 1) twins share neuroanatomical traits; and 2) neuroanatomical measures correlate with brain size.

Using magnetic resonance imaging and computer-based image analysis techniques, measurements of the volume of the forebrain, the surface area of the cerebral cortex and the mid-sagittal area of the corpus callosum were obtained in 10 pairs of monozygotic twins. Head circumference, body weight, and Full-Scale IQ were also measured.

Originally from Tramo MJ, Loftus WC, Green RL, Stukel TA, Weaver JB, Gazzaniga MS.  Brain Size, Head Size, and IQ in Monozygotic Twins.  Neurology  1998; 50:1246-1252.

Variables

CCMIDSA: Corpus Callosum Surface Area (cm2)

FIQ: Full-Scale IQ

HC: Head Circumference (cm)

ORDER: Birth Order

PAIR: Pair ID (Genotype)

SEX: Sex (1=Male 2=Female)

TOTSA: Total Surface Area (cm2)

TOTVOL: Total Brain Volume (cm3)

WEIGHT: Body Weight (kg)

Natality

Statistical Methods

ANOVA

Summary

Data from the CDC WONDER database on low birthweight (<1.5 kg) births in the United States, summarized by region and tobacco use of the mother, for 1995-2002.

Variables

RegionCode: 1: Northeast; 2: Midwest; 3: South; 4: West.

Births: Number of births in 2002.

Tobacco Use: 1: Yes, 2: No, 9: Unknown.

Percentage Body Fat

Statistical Methods

Multiple regression

Summary

Estimates of the percentage of body fat, determined by underwater weighing, and various body circumference measurements for 252 men. Accurate measurement of body fat is inconvenient and costly, so it is desirable to have easy methods of estimating body fat that are neither.

Details

A variety of popular health books suggest that the readers assess their health, at least in part, by estimating their percentage of body fat. In Bailey (1994), for instance, the reader can estimate body fat from tables using their age and various skin-fold measurements obtained by using a caliper. Other texts give predictive equations for body fat using body circumference measurements (e.g. abdominal circumference) and/or skin-fold measurements. These data were gathered to develop a better predictive model of body fat.

The variables listed below, from left to right, are:

 Density determined from underwater weighing

 Percent body fat from Siri's (1956) equation

 Age (years)

 Weight (lbs)

 Height (inches)

 Neck circumference (cm)

 Chest circumference (cm)

 Abdomen 2 circumference (cm)

 Hip circumference (cm)

 Thigh circumference (cm)

 Knee circumference (cm)

 Ankle circumference (cm)

 Biceps (extended) circumference (cm)

 Forearm circumference (cm)

 Wrist circumference (cm)

These data are used to produce the predictive equations for lean body weight. The data were generously supplied by Dr. A. Garth Fisher.
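
The second variable is derived from the first through Siri's (1956) equation, percent body fat = 495 / density - 450. Here is a minimal sketch of that conversion, assuming that density is recorded in g/cm3 (the description above does not state the units); the function name is hypothetical.

```python
def siri_percent_body_fat(density):
    """Siri (1956): percent body fat from body density in g/cm^3."""
    if density <= 0:
        raise ValueError("density must be positive")
    return 495.0 / density - 450.0


# Hypothetical density value, not taken from the data file.
example_fat = siri_percent_body_fat(1.05)
```

A density of 1.05 gives roughly 21.4 percent body fat, and a density of 1.10 gives about 0 percent, so small errors in measured density translate into large changes in the estimate.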

Resins and Termites

Statistical Methods

Repeated measures ANOVA

Summary

The bark of some tropical trees seems to offer protection from termites. If this protection could be harnessed, the trees that produce the resin might become a valuable resource. The data collected here are from an experiment to investigate the effects of these tree resins on termites. A resin derived from the bark was dissolved in a solvent and placed on filter paper in two doses (5 mg and 10 mg). For each dosage level, eight dishes were set up with 25 termites in each dish. The termites were fed the dosed filter paper, and a daily count was made of the number of termites surviving. The dishes were followed for fifteen days, but no observations were made on days 3 and 9.

Variables

dish: Dish number

dose: 5 or 10 mg of resin

day1 - day15: Number of termites still alive on this day

Sources

UCLA Statistics Data Sets

Sea Slugs / Seaweed

Foraging habits / Ants

Diversity Patterns / Birds

Harvard Forest Long-Term Ecological Research Site

Canopy Leaf Gas Exchange

StatLib

Dow Jones 1900 - 1993

Kidney

IQ / Brain Size

Clouds

W.H. Freeman Publishing

Old Faithful

Data and Story Library (DASL)    

Termites

EpiInfo        

Fluoride

US Centers for Disease Control

Natality