Daniel Majka


Modeling bird distributions

For my MS thesis, I modeled the distributions of 41 species of birds in a local area of the Tilaran Mountains surrounding Monteverde, Costa Rica. I used 5 different statistical modeling techniques to relate species occurrence to topographic variables derived from the Shuttle Radar Topography Mission (SRTM) digital elevation model. These models were interfaced with GIS to map predicted probabilities of occurrence. With my research, I focused on two main questions:

Comparison of distribution modeling techniques

Figure: Costa Rica study site map showing the locations of the 89 point counts.

Which modeling methods have the highest predictive accuracy? Are some techniques more likely to overfit (and thus poorly generalize) to a given data set? How accurate are techniques which only use a species' presence data compared to techniques which use both presence and absence data?

Methods

To answer these questions, I modeled the distribution of 41 bird species using logistic regression (GLM), generalized additive models (GAM), genetic algorithm for rule-set production (GARP), ecological niche factor analysis (ENFA), and artificial neural networks (ANN). I used 10-fold cross-validation so that predictive accuracy, as judged by receiver operating characteristic (ROC) plots, was less biased than it would have been if the models were tested on the data used to build them.
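As a rough illustration of this evaluation scheme, the Python sketch below deals sites into 10 folds, predicts each held-out fold from a model fit on the other nine, and scores the pooled predictions with the area under the ROC curve (computed here with the rank-based Mann-Whitney formulation). The function names, the toy elevation data, and the stand-in "model" are all assumptions made for illustration; this is not the code used in the thesis.

```python
import random

def k_fold_indices(n_sites, k=10, seed=0):
    """Shuffle site indices and deal them into k roughly equal folds."""
    idx = list(range(n_sites))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def roc_auc(labels, scores):
    """AUC as the chance that a random presence outscores a random absence.

    Ties between a presence and an absence count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(x, y, fit, predict, k=10, seed=0):
    """Fit on k-1 folds, predict the held-out fold, then score all held-out predictions."""
    held_out = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        test = set(fold)
        train = [i for i in range(len(y)) if i not in test]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            held_out[i] = predict(model, x[i])
    return roc_auc(y, held_out)

# Hypothetical stand-in model: score a site by how close its elevation is
# to the mean elevation of the training presences.
def fit_mean(xs, ys):
    present = [v for v, label in zip(xs, ys) if label == 1]
    return sum(present) / len(present)

def predict_closeness(center, v):
    return -abs(v - center)

# Toy data: 89 sites along an elevation gradient, species present in a mid-elevation band.
elevation = [700 + 12.5 * i for i in range(89)]
presence = [1 if 1200 <= e <= 1500 else 0 for e in elevation]

auc = cross_validated_auc(elevation, presence, fit_mean, predict_closeness)
print(f"cross-validated AUC: {auc:.3f}")
```

Pooling the held-out predictions before computing a single AUC is one common choice; averaging per-fold AUCs is another, and the text does not say which was used.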

Results

I successfully modeled nearly 75% of the species (cross-validated area under the ROC curve > 0.8). Distribution maps often differed among techniques, even for species that were modeled accurately with all of them. Surprisingly, I found that simpler techniques such as logistic regression best predicted species occurrences, while more complicated techniques such as artificial neural networks tended to overfit their training datasets.

Inference into bird-topography relationships

Predictive habitat models have been fairly successful in montane environments using topographic GIS indices as surrogate variables. However, a majority of these models have been examined in temperate environments. Does the same hold true for tropical environments, particularly those with high species turnover along an altitudinal gradient? Which topographic variables are most relevant in these habitat models?

Methods

I derived 9 topographic variables from the Shuttle Radar Topography Mission (SRTM) digital elevation model. I hypothesized that bird distributions were related either to a single primary gradient (elevation or the distance of a site to the continental divide) or to a single primary gradient plus one of 7 potential secondary gradients (e.g. topographic relative moisture, distance to drainages). To make inferences about the overall importance of topographic variables in modeling species distributions, I used a multi-model logistic regression approach based on Akaike's information criterion (AIC).
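A common way to do multi-model inference with AIC is to convert each candidate model's AIC into an Akaike weight and then sum the weights of all models that contain a given variable. The sketch below assumes that approach; the text does not spell out the exact procedure. The variable names and log-likelihood values are invented purely to make the example run and are not results from the study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def akaike_weights(aic_values):
    """Relative support for each model in the candidate set (weights sum to 1)."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def variable_importance(models, weights):
    """Sum the Akaike weights of every model that includes each variable."""
    importance = {}
    for variables, w in zip(models, weights):
        for v in variables:
            importance[v] = importance.get(v, 0.0) + w
    return importance

# Hypothetical candidate set for one species: a primary gradient alone,
# or a primary gradient plus one secondary gradient.
candidate_models = [
    ("elevation",),
    ("divide_distance",),
    ("elevation", "drainage_distance"),
    ("divide_distance", "drainage_distance"),
]
# Made-up maximized log-likelihoods, one per candidate model (illustration only).
log_likelihoods = [-52.0, -47.5, -51.2, -45.9]

# Parameters per model = number of predictors + 1 intercept.
aics = [aic(ll, len(m) + 1) for m, ll in zip(candidate_models, log_likelihoods)]
weights = akaike_weights(aics)

for name, score in sorted(variable_importance(candidate_models, weights).items()):
    print(f"{name}: {score:.2f}")
```

Summed weights are only comparable across variables when each variable appears in a similar number of candidate models, which is worth keeping in mind when primary and secondary gradients are mixed in one candidate set.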

Results

This approach revealed that distance from the continental divide, which separates the Caribbean and Pacific mountain slopes in Costa Rica, was the most important predictor variable in accounting for species distributions. The biological importance of this variable may be due to its high correlation with microclimatic precipitation, suggesting that precipitation may directly or indirectly place the largest constraint on many avian species distributions in the study. The AIC approach also revealed that the distance of a location to drainages was potentially an important variable, indicating that topographic structure across the mountain may play a substantial role in explaining species distributions.

Modeling challenges

Topographic variables

There were no GIS-based distribution modeling studies in the Tilaran Mountains that I could base my study on. GIS data were much less available than in the U.S., and vegetation complexity prevented me from using vegetation communities as a modeling variable. This left topography as my best option. After spending several days attempting to open a 30m ASTER elevation model for the region, only to find huge holes in the data due to cloud cover, I was relieved to find that a slightly coarser resolution (90m) elevation model had just been released by NASA. By using radar, the Shuttle Radar Topography Mission sensor was able to 'shoot through the clouds' to sense the elevation of the ground below. After pre-processing the DEM to fill in several small holes, I derived 9 topographic indices that had been used in previous distribution modeling studies.
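The text does not say how the holes in the DEM were filled. One simple approach, shown here only as a sketch under that caveat, is to replace each no-data cell with the mean of its valid neighbors and repeat until nothing more can be filled. The grid representation (a list of rows, with None marking no-data cells) and the function name are assumptions for illustration.

```python
def fill_small_holes(dem):
    """Fill None cells with the mean of their valid 8-connected neighbors.

    Repeats until every fillable hole is filled, so holes wider than one
    cell are closed from the edges inward. Returns a new grid.
    """
    grid = [row[:] for row in dem]
    n_rows, n_cols = len(grid), len(grid[0])
    while True:
        updates = {}
        for r in range(n_rows):
            for c in range(n_cols):
                if grid[r][c] is not None:
                    continue
                neighbors = [
                    grid[rr][cc]
                    for rr in range(max(0, r - 1), min(n_rows, r + 2))
                    for cc in range(max(0, c - 1), min(n_cols, c + 2))
                    if grid[rr][cc] is not None
                ]
                if neighbors:
                    updates[(r, c)] = sum(neighbors) / len(neighbors)
        if not updates:
            return grid
        # Apply all fills at once so one pass only uses values from the previous pass.
        for (r, c), value in updates.items():
            grid[r][c] = value

# Toy 3x3 elevation grid (meters) with a single hole in the center.
toy_dem = [
    [100, 110, 120],
    [105, None, 125],
    [110, 120, 130],
]
filled = fill_small_holes(toy_dem)
print(filled[1][1])  # 115.0, the mean of the eight surrounding cells
```

Neighbor averaging is reasonable only for small gaps; larger voids usually call for interpolation or data from a second elevation source.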

Data organization & workflow optimization

With this study I created over 25,000 total models and lost a good deal of my sanity. I was determined to compare many different modeling techniques for many birds so I could get an honest feel for which techniques worked best. I was also determined to use cross-validation so I didn't just test the models on the data used to build them. This resulted in a data management nightmare, since I essentially had to model each species 11 times for each technique (for each species: 1 full dataset containing all 89 point counts and 10 cross-validation datasets, each with 9/10 of the data). To compound the number of models even further, I tested 4 sub-algorithms of the ecological niche factor analysis and used an experimental approach to the GARP algorithm in which I created and averaged 50 models for each of the 11 datasets per species, resulting in 550 total GARP models per species.
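The numbers above can be tallied as a quick check. The sketch below assumes one model per dataset for GLM, GAM, and ANN (the text does not state this explicitly) and ignores any additional AIC candidate models, so it is only one plausible accounting of how the total exceeds 25,000.

```python
species = 41
datasets_per_species = 1 + 10  # full dataset + 10 cross-validation datasets

# Models built per dataset for each technique.
# GLM, GAM, and ANN at one model per dataset is an assumption;
# the ENFA and GARP counts come from the text.
models_per_dataset = {
    "GLM": 1,
    "GAM": 1,
    "ANN": 1,
    "ENFA": 4,   # 4 sub-algorithms
    "GARP": 50,  # 50 models averaged per dataset
}

per_species = {
    name: count * datasets_per_species
    for name, count in models_per_dataset.items()
}
total_per_species = sum(per_species.values())
total_models = species * total_per_species

print(per_species["GARP"])   # 550
print(total_per_species)     # 627
print(total_models)          # 25707
```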

I would not have been able to complete this modeling study without writing scripts. I used ArcInfo AMLs to sum up 50 GARP models at a time and tweak output from generalized additive models, DOS batch scripts to run neural network models, SAS scripts to run batch AIC logistic regression models, and batch scripts to create generalized additive models in the statistics program R. I also mashed up a fair number of Avenue scripts in ArcView 3.3 to handle data extraction back and forth between GIS and statistics software.
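The original summing was done with ArcInfo AML; the Python sketch below shows only the general idea of combining many model outputs cell by cell. It assumes each GARP run yields a binary presence/absence grid of the same shape, represented here as a list of rows. The function name and toy grids are hypothetical.

```python
def average_grids(grids):
    """Sum same-shaped grids cell by cell and divide by the number of grids.

    With binary (0/1) inputs, each output cell is the proportion of models
    that predicted presence at that cell.
    """
    if not grids:
        raise ValueError("need at least one grid")
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    for g in grids:
        if len(g) != n_rows or any(len(row) != n_cols for row in g):
            raise ValueError("all grids must have the same shape")
    summed = [[0] * n_cols for _ in range(n_rows)]
    for g in grids:
        for r in range(n_rows):
            for c in range(n_cols):
                summed[r][c] += g[r][c]
    return [[cell / len(grids) for cell in row] for row in summed]

# Toy example: four 2x2 binary predictions standing in for 50 GARP runs.
runs = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
]
print(average_grids(runs))  # [[1.0, 0.25], [0.25, 1.0]]
```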

If I had to do it all over again, I would probably write master scripts in Python to glue together the data going in and out of all the stats programs I used. Unfortunately, ESRI didn't introduce Python to ArcGIS until I was almost finished with my research.

This page last updated 6 March 2007 by Dan Majka

dan@corridordesign.org