Programme And Abstracts For Tuesday 28th Of November

Keynote: Tuesday 28th 9:00 Mantra

Cluster Capture-Recapture: A New Framework For Estimating Population Size

Rachel Fewster
University of Auckland

Ask any wildlife manager: their first burning question is “How many are there?”, and their second is “Are they trending upwards or downwards?” Capture-recapture is one of the most popular methods for estimating population size and trends. As the name suggests, it relies on being able to identify the same animal upon multiple capture occasions. The pattern of captures and recaptures among identified animals is used to estimate the number of animals never captured.

Physically capturing and tagging animals can be a dangerous and stressful experience for both the animals and their human investigators - or if it transpires that the animals actually enjoy it, biased inference may result. Consequently, researchers increasingly favour non-invasive sampling using natural tags that allow animals to be identified by features such as coat markings, dropped DNA samples, acoustic profiles, or spatial locations. These innovations greatly broaden the scope of capture-recapture estimation and the number of capture samples achievable. However, they are imperfect measures of identity, effectively sacrificing sample quality for quantity and accessibility. As a result, capture-recapture samples no longer generate capture histories in which the matching of repeated samples to a single identity is certain. Instead, they generate data that are informative—but not definitive—about animal identity.

I will describe a new framework for drawing inference from capture-recapture studies when there is uncertainty in animal identity. In the cluster capture-recapture framework, we assume that repeated samples from the same animal will be similar, but not necessarily identical, to each other. Overlap is also possible between clusters of samples generated by different animals. We treat the sample data as a clustered point process, and derive the necessary probabilistic properties of the process to estimate abundance and other parameters using a Palm likelihood approach.

Because it avoids any attempts at explicit sample-matching, the cluster capture-recapture method can be very fast, taking much the same time to analyse millions of sample-comparisons as it does to analyse hundreds. I will describe a preliminary framework for abundance estimation from acoustic monitoring. Cluster capture-recapture can also be used for behavioural studies, and I will show an example using camera-trap data from a partially-marked population of forest ship rats.

Tuesday 28th 10:30 Narrabeen

Propensity Score Approaches In The Presence Of Missing Data: Comparison Of Balance And Treatment Effect Estimates

Jannah Baker1, Tim Watkins1,2, and Laurent Billot1,3
1The George Institute for Global Health
2University of Sydney
3University of New South Wales

The use of propensity score methods can potentially improve balance between groups in observational study data, thus minimising confounding. However, a frequent problem with such studies is the presence of missing data. We compare three approaches to generating the propensity score accounting for missing data within the context of a clinical case study examining the effect of telemonitoring on hypertension. Overall, 4,642 patients diagnosed with hypertension receiving online health support in My Health Guardian – a telephone chronic disease support program offered by private health insurer HCF – were offered a telemonitoring intervention. Of these, 2,729 accepted and started treatment between July 2014 and April 2015 (designated “cases”), and 1,913 declined (designated “controls”). Data were available from cases and controls on several baseline variables including demographic, lifestyle and clinical characteristics. Outcomes were the number of hospitalisations, total length of stay and total cost of hospitalisation between 1 January and 31 December 2016. Propensity score methods were used to balance baseline variables between groups. Three approaches were used to generate propensity scores from a logistic regression model accounting for missing data: 1) categorisation of all variables with “Missing” as a category, 2) multiple imputation of treatment effect (MIte), where treatment effect estimates are combined over 20 imputed datasets, and 3) multiple imputation of propensity score (MIps), where propensity scores are first averaged over imputed datasets prior to estimation of treatment effects. The propensity score from each approach was then used in two ways: a) matching cases to controls, and b) inverse probability of treatment weighting (IPTW). The balance achieved by each approach was compared using standardised differences of means and proportions in baseline characteristics between groups. The treatment effect estimates from each approach were also compared. The discussion will canvass our findings and recommendations for handling missing data when using propensity score approaches.
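
As a rough illustration of the MIps idea described above (not the authors' implementation), the following R sketch uses the mice package with hypothetical variable names (treat, age, sex, bmi, smoker, n_hosp): propensity scores are averaged over the imputed datasets and then used once as IPTW weights.

```r
library(mice)

# dat: data frame with binary treatment 'treat', baseline covariates
# (some missing) and an outcome count 'n_hosp'. Names are hypothetical.
m <- 20
imp <- mice(dat, m = m, seed = 1)                 # multiple imputation

# MIps: average the propensity score over the imputed datasets
ps_mat <- sapply(seq_len(m), function(i) {
  d_i <- complete(imp, i)
  fitted(glm(treat ~ age + sex + bmi + smoker, data = d_i,
             family = binomial))
})
ps <- rowMeans(ps_mat)

# IPTW weights from the averaged propensity score
dat$w <- ifelse(dat$treat == 1, 1 / ps, 1 / (1 - ps))
summary(glm(n_hosp ~ treat, data = dat, weights = w, family = quasipoisson))

# MIte would instead fit the weighted outcome model within each imputed
# dataset and pool the 20 treatment-effect estimates with Rubin's rules.
```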

Tuesday 28th 10:30 Gunnamatta

Visualising Model Selection Stability In High-Dimensional Regression Models

Garth Tarr
University of Sydney

The mplot R package provides an implementation of model stability and variable inclusion plots to help researchers better inform the variable selection process. The initial focus was on exhaustive searches through the model space; however, this quickly becomes infeasible for high dimensional models. An alternative approach for high dimensional models is to combine bootstrap model selection with regularisation procedures. There exist a number of fast and efficient regularisation methods for variable selection in high dimensional regression settings. We have implemented variable inclusion plots and model stability plots using the glmnet package. We demonstrate the utility of the mplot package in identifying stable regularised model selection choices with respect to two main sources of uncertainty. Firstly, by resampling the data we are able to determine how often various models are chosen when the data change. Secondly, we are able to evaluate how often competing models are chosen across a range of values for the tuning parameter. Exploring these two sources of uncertainty in model selection generates a large amount of raw data that needs to be processed. The mplot package provides a variety of methods to visualise this raw data to help inform a researcher’s model selection choice.
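
The bootstrap-plus-regularisation computation underlying a variable inclusion plot can be sketched directly with glmnet. This is a simplified illustration of the idea, not the mplot interface itself; the predictor matrix x, response y and the lambda grid are assumed to be available.

```r
library(glmnet)

# x: n x p predictor matrix, y: response vector (assumed available)
B <- 100                                           # bootstrap resamples
lambdas <- exp(seq(log(1), log(1e-3), length.out = 50))  # decreasing grid

incl <- array(0, dim = c(B, length(lambdas), ncol(x)),
              dimnames = list(NULL, NULL, colnames(x)))

for (b in seq_len(B)) {
  idx <- sample(nrow(x), replace = TRUE)           # resample the data
  fit <- glmnet(x[idx, ], y[idx], lambda = lambdas)
  beta <- as.matrix(coef(fit))[-1, ]               # drop intercept; p x 50
  incl[b, , ] <- t(beta != 0)                      # which variables are in
}

# Empirical inclusion proportion of each variable at each lambda,
# the raw ingredient of a variable inclusion plot.
incl_prob <- apply(incl, c(2, 3), mean)
matplot(log(lambdas), incl_prob, type = "l", lty = 1,
        xlab = "log(lambda)", ylab = "inclusion proportion")
```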

Tuesday 28th 10:50 Narrabeen

Dimensionality Reduction Of LIBS Data For Bayesian Analysis

Anjali Gupta1, James Curran1, Sally Coulson2, and Christopher Triggs1
1University of Auckland
2ESR

In 2004, Aitken and Lucy published an article detailing a two-level likelihood ratio for multivariate trace evidence. This model has been adopted in a number of forensic disciplines such as the interpretation of glass, drugs (MDMA), and ink. Modern instrumentation is capable of measuring many elements in very low quantities and, not surprisingly, forensic scientists wish to exploit the potential of this extra information to increase the weight of this evidence. The issue, from a statistical point of view, is that the increase in the number of variables (dimension) in the problem leads to an increased demand for data to understand both the variability within a source and between sources. Such information will come in time, but usually we do not have enough. One solution to this problem is to attempt to reduce the dimensionality through methods such as principal component analysis. This practice is quite common in high dimensional machine learning problems. In this talk, I will describe a study where we attempt to quantify the effects of this approach on the resulting likelihood ratios using data obtained from a Laser Induced Breakdown Spectroscopy (LIBS) instrument.
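
A minimal R sketch of the dimension-reduction step, assuming a matrix `elements` of LIBS elemental measurements with one row per measurement; the two-level (within- and between-source) likelihood ratio of Aitken and Lucy (2004) would then be computed on the retained scores rather than the full profile.

```r
# elements: matrix of elemental intensities, rows = measurements,
# columns = elements (assumed available)
pca <- prcomp(elements, center = TRUE, scale. = TRUE)

# proportion of variance explained, used to choose how many components
var_prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_prop >= 0.95)[1]     # e.g. retain 95% of the variance

scores <- pca$x[, 1:k]              # reduced-dimension data for the
                                    # two-level likelihood ratio
```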

Tuesday 28th 11:10 Narrabeen

Analysis Of Melanoma Data With A Mixture Of Survival Models Utilising Multi-Class DLDA To Inform Mixture Class

Sarah Romanes, John Ormerod, and Jean Yang
University of Sydney

Melanoma is a prevalent skin cancer in Australia, with close to 14,000 new cases estimated to be diagnosed in 2017. Survival times are markedly different from one individual to the next. In particular, there appear to be three classes of survival outcome. This talk considers integrating survival time data with microarray gene expression data. We construct a hybrid model that seamlessly integrates a three-class linear discriminant analysis model, a mixture of parametric survival models, and model selection components. We fit this model using a variational expectation maximization (VEM) approach. Our model selection component naturally simplifies as a function of likelihood ratio statistics, allowing natural comparisons with traditional hypothesis testing methods. We compare our method with several naïve approaches which address only the classification aspect or the survival model aspect in isolation.
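
The diagonal linear discriminant analysis (DLDA) component can be sketched on its own. This base-R illustration, with a hypothetical expression matrix X and class labels y, covers only the classification piece and ignores the survival-mixture, model selection and variational parts of the full model.

```r
# Diagonal LDA: class-specific means with a shared diagonal covariance.
dlda_fit <- function(X, y) {
  classes <- sort(unique(y))
  mu <- sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE]))
  # pooled within-class variances (diagonal covariance assumption)
  v <- Reduce(`+`, lapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    colSums(sweep(Xk, 2, colMeans(Xk))^2)
  })) / (nrow(X) - length(classes))
  prior <- as.numeric(table(y)[as.character(classes)] / length(y))
  list(classes = classes, mu = mu, v = v, prior = prior)
}

dlda_predict <- function(fit, Xnew) {
  scores <- sapply(seq_along(fit$classes), function(j) {
    d <- sweep(Xnew, 2, fit$mu[, j])                 # centre at class mean
    -0.5 * rowSums(sweep(d^2, 2, fit$v, "/")) + log(fit$prior[j])
  })
  fit$classes[max.col(scores)]                       # highest discriminant
}
```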

Tuesday 28th 11:10 Gunnamatta

Forecasting Hotspots Of Potentially Preventable Hospitalisations With Spatially Aggregated Longitudinal Health Data: All Subset Model Selection With A Novel Implementation Of Repeated K-Fold Cross-Validation

Matthew Tuson, Berwin Turlach, Kevin Murray, Mei Ruu Kok, Alistair Vickery, and David Whyatt
University of Western Australia

It is sometimes difficult to target individuals for health intervention due to limited information on their behaviour and risk factors. In such cases, place-based interventions targeting geographical ‘hotspots’ with higher than average rates of health service utilisation may be effective. Many studies examine predictors of hotspots, but they often do not consider that place-based interventions are typically costly and take time to develop and implement, and that hotspots often regress to the mean in the short term. Long-term geographical forecasting of hotspots using validated statistical models is therefore essential for effectively prioritising place-based health interventions.

Existing methods for forecasting hotspots tend to prioritise positive predictive value (i.e. correct predictions) at the expense of sensitivity. This work introduces methods to develop models that optimise both positive predictive value and sensitivity concurrently. These methods utilise spatially aggregated administrative health data, WA census population data, and ABS geographic boundaries, combining all subset model selection with a novel implementation of repeated cross-validation for longitudinal data. Results from models forecasting 3-year hotspots for four potentially preventable hospitalisations are presented, namely: type II diabetes mellitus, heart failure, high risk foot, and chronic obstructive pulmonary disease (COPD).
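
A minimal R sketch of the repeated K-fold cross-validation idea, estimating positive predictive value and sensitivity of a hotspot forecast. The data frame d, its column names and the logistic forecasting model are assumptions for illustration; the authors' spatial and longitudinal fold construction and classification threshold are not reproduced here.

```r
# d: one row per small area, with a binary indicator 'hotspot' (observed
# 3-year hotspot) and candidate predictors x1, x2, x3 (names hypothetical).
set.seed(1)
K <- 5; R <- 20                                  # 5-fold CV, repeated 20 times
perf <- matrix(NA, R, 2, dimnames = list(NULL, c("ppv", "sensitivity")))

for (r in seq_len(R)) {
  fold <- sample(rep(seq_len(K), length.out = nrow(d)))  # folds over areas
  pred <- rep(NA_real_, nrow(d))
  for (k in seq_len(K)) {
    fit <- glm(hotspot ~ x1 + x2 + x3, data = d[fold != k, ],
               family = binomial)
    pred[fold == k] <- predict(fit, d[fold == k, ], type = "response")
  }
  cls <- pred > 0.5                              # threshold would be tuned
  perf[r, "ppv"]         <- sum(cls & d$hotspot) / sum(cls)
  perf[r, "sensitivity"] <- sum(cls & d$hotspot) / sum(d$hotspot)
}
colMeans(perf)   # average PPV and sensitivity over the repeats
```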

Tuesday 28th 11:30 Narrabeen

Identifying Clusters Of Patients With Diabetes Using A Markov Birth-Death Process

Mugdha Manda, Thomas Lumley, and Susan Wells
University of Auckland

Estimating disease trajectories has become increasingly important for clinical practitioners in administering effective treatment to their patients. Part of describing disease trajectories involves taking patients’ medical histories and sociodemographic factors into account and grouping patients into clusters with similar profiles. Advances in computerised patient databases have paved the way for identifying such trajectories by recording each patient’s medical history over a long period of time (longitudinal data). We studied data from the PREDICT-CVD dataset, a national primary-care cohort from which people with diabetes from 2002-2015 were identified through routine clinical practice. We fitted a Bayesian hierarchical linear model with latent clusters to the repeated measurements of HbA1c and eGFR, using the Markov birth-death process proposed by Stephens (2000) to handle the changes in dimensionality as clusters were added or removed.

Tuesday 28th 11:30 Gunnamatta

Challenges Analysing Combined Agricultural Field Trials With Partially Overlapping Treatments

Kerry Bell and Michael Mumford
Queensland Department of Agriculture and Fisheries

To make recommendations on which management practices have the potential to increase crop yield, there needs to be a consistent pattern demonstrated across trials from many environments. This presentation considers a case study looking at 31 mungbean trials in northern Australia from 2014 to 2016. The trials did not always have consistent factors (e.g. variety, row spacing or target plant density) or even consistent factor levels. To overcome the issue of inconsistent factors, environments were defined as the combination of site, year and any management factors not common across trials (e.g. time of sowing, irrigation, fertiliser).

There were numerous full factorial combinations within subsets of the data that could be considered for investigation, so the first challenge was to determine which factorial combinations to focus on to best address the research questions and reporting requirements. Once this was determined, all the data from the trials that contributed to the factorial were included in a combined analysis using linear mixed models. In this model, the factorial of interest was partitioned in the test of fixed effects, while each trial’s design parameters and residual variances were estimated using all the data from that trial. An example of the above-mentioned factorial combinations is environment by row spacing for one particular variety. The next challenge was that, with so many environments, there was usually an environment by row spacing interaction, which was not useful for making recommendations about row spacing.
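
A minimal sketch of a combined mixed-model analysis of this kind, using nlme with hypothetical variable names; the authors' actual models would include each trial's own design terms, and other mixed-model software could equally be used.

```r
library(nlme)

# dat: plot-level yields from all contributing trials, with factors
# environment, spacing, trial and block (names hypothetical).
fit <- lme(yield ~ environment * spacing,           # factorial of interest
           random = ~ 1 | trial/block,              # blocking within trials
           weights = varIdent(form = ~ 1 | trial),  # separate residual
           data = dat)                              #   variance per trial
anova(fit)   # fixed-effect tests, including the environment x spacing term
```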

Clustering of environments allowed groups to be formed that did not have a significant interaction between row spacing and environment. These groups were then generalised to types of environments with certain responses to row spacing.

Tuesday 28th 11:50 Narrabeen

A Hidden Markov Model For Sleep Stage Detection Using Raw Tri-Axial Wrist Actigraphy

Michelle Trevenen1, Kevin Murray1, Berwin Turlach1, Leon Straker2, and Peter Eastwood1
1University of Western Australia
2Curtin University

Sleep is a complex yet organised process consisting of regular cycles of sleep stages. These stages are rapid eye movement and non-rapid eye movement (light sleep and slow-wave sleep). Sleep staging is of great importance in the physiological world, as sleep disorders occur in around 20% of the population and are associated with a multitude of serious health implications and considerable economic burden. Hidden Markov models have been successfully used in the classification of individual sleep stages measured by polysomnography, which is considered the ‘gold standard’ in assessing sleep; however, polysomnography is intrusive and costly.

Actigraphy is increasingly being considered as a non-intrusive and cost-effective alternative method to objectively measure sleep patterns. However, there is limited research on the ability of actigraphy to detect individual sleep stages, and what research exists indicates that current methodologies are unable to do so. Current actigraphic approaches to sleep detection use filtered uni-axial data measured at a low sampling rate, whereas raw tri-axial data measured at high sampling rates is frequently used in the assessment of day-time activities.

Using simultaneously measured actigraphy and polysomnography data from 100 healthy young adults in the Western Australian Pregnancy Cohort Study, we created and validated an algorithm to determine sleep stages utilising raw, tri-axial acceleration data from wrist actigraphy. Ten feature variables were created from each 30-second block of data, and 50 subjects were used to train the hidden Markov model with the feature variables used as input parameters. The remaining 50 subjects were used to validate the trained hidden Markov model against polysomnography.

Validation suggested that our model is able to classify sleep stages using raw tri-axial actigraphy data. These results demonstrate that actigraphy-based hidden Markov models can feasibly be used for automatic sleep staging.
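
A minimal sketch of fitting a Gaussian-emission hidden Markov model to epoch-level feature variables, assuming the depmixS4 package, hypothetical feature names (f1, f2, f3) and four states (wake plus the three sleep stages); this illustrates the general approach rather than the authors' algorithm.

```r
library(depmixS4)

# feats: one row per 30-second epoch for the training subjects, with
# feature variables derived from the raw tri-axial signal (names
# hypothetical), ordered in time within each subject.
mod <- depmix(list(f1 ~ 1, f2 ~ 1, f3 ~ 1),
              data = feats, nstates = 4,              # wake, REM, light, deep
              family = list(gaussian(), gaussian(), gaussian()),
              ntimes = as.numeric(table(feats$subject)))  # one series per subject
fm <- fit(mod)

# Decoded state sequence for the training data; held-out subjects would be
# decoded with the estimated parameters and compared against PSG staging.
states <- posterior(fm)$state
```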

Tuesday 28th 11:50 Gunnamatta

Sparse Phenotyping Designs For Early Stage Selection Experiments In Plant Breeding Programs

Nicole Cocks, Alison Smith, David Butler, and Brian Cullis
University of Wollongong

The early stages of cereal and pulse breeding programs typically involve in excess of 500 test lines. The test lines are promoted through a series of trials based on their performance (yield) and other desirable traits such as heat/drought tolerance, disease resistance, etc. It is therefore important to ensure the design (and analysis) of these trials is efficient in order to appropriately and accurately guide breeders through their selection decisions, until only a small number of elite lines remain.

The design of early stage variety trials in Australia provided the motivation for developing a new design strategy. The preliminary stages of these programs have limited seed supply, which limits the number of trials and replicates of test lines that can be sown. Traditionally, completely balanced block designs or grid plot designs were sown at a small number of environments in order to select the highest performing lines for promotion to the later stages of the program. Given our understanding of variety (i.e. line) by environment interaction, this approach is not a sensible or optimal use of the limited resources available.

We will discuss a new method that allows a larger number of environments to be sampled in situations where seed supply is limited and the number of test lines is large. This strategy, referred to as sparse phenotyping, is developed within the linear mixed model framework as a model-based design approach to generating optimal trial designs for early stage selection experiments.

Tuesday 28th 12:10 Narrabeen

Comparisons Of Two Large Long-Term Studies In Alzheimer’s Disease

Charley Budgeon
University of Western Australia

The incidence of Alzheimer’s disease (AD), the leading cause of dementia, is predicted to increase at least threefold by 2050. Curing this disease is a global priority. Currently, two major studies are attempting to gain further understanding of this disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarker and Lifestyle Study (AIBL). We describe these two cohorts to assess the impact of combining them to provide a larger cohort for analyses.

An initial comparison of the protocols was carried out, and recruitment strategies were shown to be marginally different between the studies. Inclusion criteria specified ages between 55 and 90 years in ADNI and > 65 years in AIBL. Marginally different specifications for disease stage classifications of healthy controls (HC), mild cognitive impairment (MCI) and AD individuals were observed, for example, different Mini-Mental State Exam (MMSE) cut-offs. However, both studies had AD diagnosis supported by the NINCDS-ADRDA criteria. Baseline characteristics were compared between the ADNI and AIBL cohorts. Overall, AIBL had more HCs than ADNI (69% vs 30%), but fewer MCI individuals (12% vs 50%). The ADNI cohort had a higher level of education and, generally, within a disease classification there were minimal differences in baseline age, sex, MMSE, and Preclinical Alzheimer Cognitive Composite (PACC) scores.

Longitudinal analyses compared the change over time between the two cohorts within each disease classification for PACC and MMSE. There were no significant differences between cohorts within the HC and MCI groups, but within the AD group, subjects in the ADNI cohort had generally higher predicted PACC and MMSE scores over time than those in AIBL.
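
A minimal sketch of the kind of longitudinal comparison described above, assuming the lme4 package and hypothetical variable names (PACC, years since baseline, cohort, subject id); the authors' actual model specification may differ.

```r
library(lme4)

# long: one row per visit, with years since baseline, cohort (ADNI/AIBL),
# disease classification 'group' and PACC score (names hypothetical).
ad <- subset(long, group == "AD")
m  <- lmer(PACC ~ years * cohort + (1 + years | id), data = ad)
summary(m)   # the years:cohort term captures cohort differences in change over time
```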

Our results suggest there is the potential to combine the ADNI and AIBL cohorts for analysis purposes to provide one more powerful data set; however, consideration should be taken for some measures.

Tuesday 28th 12:10 Gunnamatta

A One-Stage Mixed Model Analysis Of Canola Chemistry Trials

Daniel Tolhurst, Ky Mathews, Alison Smith, and Brian Cullis
University of Wollongong

The National Variety Trials (NVT) program is used by plant breeding companies to evaluate the yield potential of new crop varieties independently across a large range of Australian growing conditions. Compared with the remaining NVT crops, grower decisions are further complicated in canola because of its vulnerability to weed infestation. A measure historically used by farmers for the management of weeds is the application of a herbicide (chemistry) treatment. The choice of chemistry is important as it restricts variety selection to those bred with the specific tolerance. The varieties currently evaluated in NVT are each tolerant to one of three chemistries, namely imidazolinone (I; marketed as Clearfield), glyphosate (Roundup Ready; R) or triazine (T), or have no specific tolerance (i.e. conventional canola; C). Consequently, canola has a more complex testing regime than the remaining NVT crops as each trial has a nested treatment structure involving both chemistries and varieties.

Canola trials are conducted in locations across the Australian grain belt and reflect best farmer practice for each district. Every site is partitioned into several field blocks and plots are allocated to the treatments according to orthogonal block designs. A spray boom is used to administer each chemistry, but in practice large areas are treated simultaneously. This precludes the application of different sprays to plots in the same block. Randomisation is therefore restricted so that varieties in a single block are tolerant to the same chemistry. However, as the numbers of chemistries and blocks are exactly equal, there is no information to estimate the experimental error variation and the two are statistically confounded. Consequently, growers are limited to evaluating varieties with the same tolerance, as comparisons across chemistries are invalid. This also has important implications for the statistical analysis, which are discussed in this talk.

Tuesday 28th 12:30 Narrabeen

A Semi-Parametric Linear Mixed Model For Longitudinally Measured Fasting Blood Sugar Level Of Adult Diabetic Patients

Tafere Tilahun1, Belay Birlie1, and Legesse Kassa Debusho2
1Jimma University
2University of South Africa

This paper focused on a longitudinal data analysis of the fasting blood sugar (FBS) level of adult diabetic patients at the Jimma University Specialized Hospital diabetic clinic using a semi-parametric mixed model. The study revealed that the rate of change in FBS level in diabetic patients, due to the clinic interventions, does not continue at a steady pace but changes with time and patient weight. Furthermore, it clarified associations between FBS level and characteristics of adult diabetic patients: patient weight had a significant negative effect, whereas gender, age, type of diabetes and family history of diabetes did not have a significant effect on the change in FBS level. Under various variance structures for the subject-specific random effects, the semi-parametric mixed models had a better fit than the linear mixed model. This was likely due to the localized splines, which captured more variability in FBS level than the linear mixed model.
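
A minimal sketch of a semi-parametric (penalised-spline) mixed model for repeated FBS measurements, assuming the mgcv package and hypothetical variable names; the authors' exact model and variance structures may differ.

```r
library(mgcv)

# fbs: one row per clinic visit, with patient id, months since enrolment,
# weight and baseline covariates (variable names hypothetical).
fit <- gamm(FBS ~ s(month) + weight + sex + age + diabetes_type,
            random = list(id = ~ 1),    # patient-specific random intercept
            data = fbs)
summary(fit$gam)    # smooth time trend and covariate effects
summary(fit$lme)    # variance components
```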

Tuesday 28th 12:30 Gunnamatta

Individual And Joint Analyses Of Sugarcane Experiments To Select Test Lines

Alessandra Dos Santos1, Chris Brien2,4, Clarice G. B. Demétrio1, Renata Alcarde Sermarini1,5, Guilherme A. P. Silva3, and Sandro R. Fuzatto3
1University of São Paulo
2University of South Australia
3CTC - Piracicaba
4University of Adelaide
5University of Adelaide

In the early stages of a breeding program many field trials are conducted, covering a range of soil and weather conditions. In the case of sugarcane, these experiments are established with a large number of test lines, but limitations of field space and of the amount of genetic material do not allow many of them to be replicated. This work evaluated 21 trials from different regions of a Brazilian sugarcane breeding program. Each of these experiments occupied a rectangular array of around 20 rows by 25 columns in most instances; the plots were 12 m long, double-furrow, with 0.9 m between furrows within a plot, 1.5 m spacing between different plots and 1 m between columns. All the trials had at least 79% of the area planted with unreplicated test lines, and at most 21% of the plots were occupied by four commercial varieties used as checks. A special check was planted systematically along diagonals of the grid; the other three checks were equally replicated, and each replicate was spread over three neighbouring row plots. Seven of the 21 experiments had no significant direct genetic effects, 11 presented significant competition at the residual level and only one had significant competition at the genetic level. The correlation between the selected test lines for the different experiments in the same region was less than 0.54. However, the genetic correlation was significant in the joint analyses and stronger than that from the individual analyses. Two simulation studies were performed: the first investigated the analysis of a single experiment, and the results show that it is difficult to fit a model when there is genetic competition, with or without residual competition. The same difficulty was observed in the second study, which compared the results from individual and joint analyses. It showed that, even for the joint analyses, only around 45 to 55% of the true best test lines were selected.