Programme And Abstracts For Wednesday 29th Of November

Keynote: Wednesday 29th 9:00 Mantra

Statistics On Street Corners

Dianne Cook
Monash University

Perceptual research is often conducted on the street, with convenience sampling of pedestrians who happen to be passing by. It is through experiments conducted with passers-by that we have learned that effects such as change blindness (https://www.youtube.com/watch?v=FWSxSQsspiQ) are in play outside the laboratory.

In data science, plots of data become important tools for observing patterns, making decisions, and communicating findings. But plots of data can be viewed differently by different observers, and often provoke skepticism about whether what you see “is really there”. With the availability of technology that harnesses statistical randomisation techniques and input from crowds, we can provide an objective evaluation of structure read from plots of data.

This talk describes an inferential framework for data visualisation, and the protocols that can be used to provide estimates of p-values and power. I will discuss experiments we have conducted that (1) show that crowd-sourcing provides results similar to classical statistical hypothesis testing, (2) demonstrate how it can be used to improve plot design, and (3) yield p-values in situations where no classical tests exist. Examples from ecology and agriculture will be shown.
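
Under the lineup protocol that underlies this framework, the data plot is hidden among null plots and shown to independent observers; a minimal sketch of the resulting p-value calculation is given below (illustrative R code, not the speaker's; the counts are hypothetical).

```r
## Lineup p-value: K independent observers each see a lineup of m plots, one of
## which is the data plot; x of them pick the data plot. Under the null hypothesis
## every pick is a random guess, so the number of correct picks is Binomial(K, 1/m).
lineup_pvalue <- function(x, K, m = 20) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)   # P(X >= x)
}

lineup_pvalue(x = 7, K = 30, m = 20)   # small p-value: the data plot stands out
lineup_pvalue(x = 2, K = 30, m = 20)   # consistent with guessing
```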

Joint work with Heike Hofmann, Andreas Buja, Deborah Swayne, Hadley Wickham, Eun-kyung Lee, Mahbubul Majumder, Niladri Roy Chowdhury, Lendie Follett, Susan Vanderplas, Adam Loy, Yifan Zhao, Nathaniel Tomasetti

Wednesday 29th 10:30 Narrabeen

Estimating Overdispersion In Sparse Multinomial Data

Farzana Afroz
University of Otago

The phenomenon of overdispersion arises when the data are more variable than we expect from the fitted model. This issue often arises when fitting a Poisson or a binomial model. When overdispersion is present, ignoring it may lead to misleading conclusions, with standard errors being underestimated and overly complex models being selected. In our research we consider overdispersed multinomial data, which can arise in many research areas. Two approaches can be used to analyse overdispersed multinomial data: the quasilikelihood method, or explicit modelling of the overdispersion using, for example, a Dirichlet-multinomial or finite-mixture distribution. Use of quasilikelihood has the advantage of only requiring specification of the first two moments of the response variable. For sparse data, such as in a contingency table with many low expected counts, use of quasilikelihood to estimate the amount of overdispersion will be particularly useful, as it may be difficult to obtain reliable estimates of the parameters in a Dirichlet-multinomial or finite-mixture model. I consider four estimators of the amount of overdispersion in sparse multinomial data, discuss their theoretical properties and provide simulation results showing their performance in terms of bias, variance and mean squared error.
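
As a point of reference for the quasilikelihood approach, the sketch below computes the familiar Pearson-based estimate of overdispersion, \(\hat{c} = X^2/\text{df}\), for multinomial counts (illustrative R code with simulated data; it is not one of the four estimators studied in the talk).

```r
## Pearson / quasilikelihood overdispersion estimate for multinomial counts.
chat_pearson <- function(counts, phat) {
  # counts: matrix of observed counts, one row per multinomial observation
  # phat:   matrix of fitted cell probabilities (same dimensions)
  n <- rowSums(counts)
  expected <- phat * n
  X2 <- sum((counts - expected)^2 / expected)
  df <- nrow(counts) * (ncol(counts) - 1)   # assumes no parameters were estimated
  X2 / df
}

## Toy example: 50 sparse multinomial observations from a common 6-cell distribution
set.seed(1)
p <- c(0.4, 0.3, 0.15, 0.1, 0.03, 0.02)
y <- t(rmultinom(50, size = 10, prob = p))
chat_pearson(y, matrix(p, nrow = 50, ncol = 6, byrow = TRUE))   # close to 1: no overdispersion
```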

Wednesday 29th 10:30 Gunnamatta

Assessing Mud Crab Meat Fullness Using Non-Invasive Technologies

Carole Wright, Steve Grauf, Brett Wedding, Paul Exley, John Mayze, and Sue Poole
Queensland Department of Agriculture and Fisheries

The decision of whether a mud crab should be retained at harvest has traditionally been based on shell hardness. This is most commonly assessed by applying thumb pressure to the carapace of the mud crab. The carapace of a recently moulted mud crab will flex considerably, and such crabs are therefore returned to the water. This assessment has also been used to divide mud crabs into three meat fullness grades (A, B and C). The higher meat fullness grade A mud crabs fetch a greater price at market than the lower B and C grades. The subjective nature of this assessment will always result in disputes at the boundaries of the grades. A more objective, science-based method would reduce downgrades at market while increasing consumer satisfaction and overall industry profitability.

A scoping study was conducted that evaluated innovative non-invasive technologies to assess mud crab meat fullness based on percentage yield recovery of cooked meat from the dominant individual mud crab claws. The non-invasive technologies assessed included near infrared spectroscopy (NIRS), candling using visible light, and acoustic velocity. NIRS showed the most potential and was reassessed in a second study with slight improvements to spectra capture methods and NIR light sources.

The second study used 94 live mud crabs from the Moreton Bay area. Partial least squares regression (PLS-R) was performed to build a calibration model predicting the percentage yield recovery of cooked meat from the spectral data. The PLS-R model had \(R^2=0.77\) and \(\mathrm{RMSECV}=4.8\).

A principal components linear discriminant analysis (PC-LDA) was also conducted to discriminate between the standard three grades of mud crab meat fullness. This was compared to the industry standard shell hardness method. The NIRS PC-LDA achieved a minimum of 76% correct classification for each of the three grades, compared to 24% for the shell hardness method.
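
A hedged sketch of the PC-LDA step is given below (simulated stand-in data, not the study's spectra): reduce the NIR spectra to a few principal component scores, then apply linear discriminant analysis to those scores to classify the three fullness grades.

```r
library(MASS)

set.seed(42)
spectra <- matrix(rnorm(94 * 100), nrow = 94)    # stand-in for 94 spectra x 100 wavelengths
grade   <- factor(sample(c("A", "B", "C"), 94, replace = TRUE))

pc     <- prcomp(spectra, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:5]                            # keep the first few components

fit  <- lda(scores, grouping = grade)
pred <- predict(fit)$class
table(observed = grade, predicted = pred)        # cross-validation would be used in practice
```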

The non-invasive technologies trialled along with the results will be discussed in this talk.

Wednesday 29th 10:30 Bundeena

Analysis Of Multivariate Binary Longitudinal Data: Metabolic Syndrome During Menopausal Transition

Geoff Jones
Massey University

Metabolic syndrome (MetS) is a major multifactorial condition that predisposes adults to type 2 diabetes and CVD. It is defined as having at least three of five cardiometabolic risk components: 1) high fasting triglyceride level, 2) low high-density lipoprotein cholesterol, 3) elevated fasting plasma glucose, 4) large waist circumference (abdominal obesity), and 5) hypertension. In the US Study of Women’s Health Across the Nation (SWAN), a 15-year multi-centre prospective cohort study of women from five racial/ethnic groups, the incidence of MetS increased as midlife women underwent the menopausal transition (MT). A model is sought to examine the interdependent progression of the five MetS components and the influence of demographic covariates.

Wednesday 29th 10:50 Narrabeen

Statistical Analysis Of Coastal And Oceanographic Influences On The Queensland Scallop Fishery

Wen-Hsi Yang1, Anthony J. Courtney2, Michael F. O'Neill2, Matthew J. Campbell2, George M. Leigh2, and Jerzy A. Filar1
1University of Queensland
2Queensland Department of Agriculture and Fisheries

The saucer scallop (Ylistrum balloti) otter-trawl fishery used to be the most valuable commercial fishery in Queensland ocean waters. Over the last few years, there has been growing concern among fishers, fishery managers and scientists over the decline in catch rates and annual harvest. A quantitative assessment conducted in 2016 showed that scallop abundance was at an historic low. The assessment used data sourced from the fishery and from independent surveys. Further information on coastal and oceanographic influences is available and may reveal new factors that influence the population abundance of scallops and improve management of the fishery. In this study, scallop catch-rate data and coastal and physical oceanographic variables (e.g. sea surface temperature anomalies, coastal freshwater flow and chlorophyll-a) were modelled to identify spatial and temporal environmental processes important for consideration in fishery management procedures.
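
A minimal sketch of the kind of model involved is given below (simulated data and hypothetical variable names; not the authors' model), relating catch to smooth functions of the environmental covariates with trawl effort as an offset.

```r
library(mgcv)

set.seed(1)
scallop <- data.frame(
  lon = runif(300, 151, 153), lat = runif(300, -27, -24),
  year = sample(1990:2015, 300, replace = TRUE),
  sst_anom = rnorm(300), flow = rexp(300), chla = rexp(300),
  effort = runif(300, 1, 5)
)
scallop$catch <- rpois(300, lambda = scallop$effort * exp(0.3 * scallop$sst_anom))

fit <- gam(catch ~ s(lon, lat) + s(year, k = 5) + s(sst_anom) + s(flow) + s(chla) +
             offset(log(effort)),
           family = quasipoisson(), data = scallop)
summary(fit)          # smooth terms indicate which environmental processes matter
```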

Wednesday 29th 10:50 Gunnamatta

Saved By The Experimental Design: Testing Bycatch Reduction And Turtle Exclusion Devices In The PNG Prawn Trawl Fishery

Emma Lawrence and Bill Venables
CSIRO

In trawling for prawns, the prawn catch is often only a small part of the results of any one trawl, with the remainder called “bycatch”. Reducing the bycatch component, while maintaining the prawn catch, is an important industry goal, primarily for environmental accreditation purposes, but also for economic reasons.

We designed an at-sea trial for the Gulf of Papua Prawn Fishery, involving four vessels each towing “quad gear” (that is, 4 separate but linked trawl nets) in each trawl shot, over 18 days. The experiment was designed to assess the effectiveness of 27 combinations of Turtle Excluder Devices (TEDs) and Bycatch Reduction Devices (BRDs), with a control net without any attached device as one of the nets in each quad. At Biometrics 2015 we discussed how we used simulated annealing to generate a highly efficient design, in several stages, to meet the large number of highly specific logistical constraints.

The focus of this talk will be the analysis, which also proved somewhat challenging. We will present the results of our analysis and demonstrate why putting the time into thinking about and generating a non-standard experimental design allowed us to accommodate the various glitches and misfortunes that always seem to happen at sea.

Wednesday 29th 10:50 Bundeena

Rethinking Biosecurity Inspections: A Case Study Of The Asian Gypsy Moth (AGM) In Australia

Petra Kuhnert1, Dean Paini1, Paul Mwebaze1, and John Nielsen2
1CSIRO
2Department of Agriculture and Water Resources

The Asian gypsy moth (AGM) (Lymantria dispar asiatica) is a serious biosecurity risk to Australia’s forestry and horticultural industries. While similar in appearance to the European gypsy moth (Lymantria dispar dispar), the Asian gypsy moth is capable of flying up to 40 kilometres and therefore has the potential to establish and spread in other areas like Australia. In addition, females are attracted to light and will oviposit (lay eggs) indiscriminately. As a result, females are attracted to shipping ports at night and will oviposit on ships. These ships therefore have the potential to spread this moth around the world.

The life-cycle of the moth has been well documented and is heavily dependent on temperature, with eggs undergoing three phases of diapause before hatching. Current inspection of vessels arriving at Australian ports from what is deemed an “at risk” port is a lengthy and costly process.

To assist the Department of Agriculture with their prioritisation of ships, we developed an AGM Tool in the form of an R Shiny App that (1) shows the shortest maritime path from an at-risk port to an Australian port for a vessel of interest and (2) predicts the probability of a potential hatch and its reliability using a classification tree model developed to emulate the lifecycle of the moth from simulated data. In this talk we will discuss the methodology that (1) simulates the AGM biology and potential hatches of eggs, along with how we extracted the relevant temperature data that was the primary driver of the AGM lifecycle, and (2) emulates this simulated data using a statistical model, namely a classification tree, to predict the probability of a potential hatch. We will also discuss a bootstrap approach to explore the reliability of the predicted hatch probabilities.
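
The sketch below illustrates the emulation-plus-bootstrap idea with toy data and hypothetical variables (it is not the AGM Tool itself): fit a classification tree to simulated lifecycle output, predict a hatch probability for a new voyage, and bootstrap the tree to gauge the reliability of that prediction.

```r
library(rpart)

set.seed(1)
sim <- data.frame(
  chill_days   = rpois(500, 60),     # days of chilling driving the diapause phases
  mean_temp    = rnorm(500, 12, 4),
  transit_days = rpois(500, 20)
)
# toy rule standing in for the output of the biological simulator
sim$hatch <- factor(ifelse(sim$chill_days > 55 & sim$mean_temp > 10, "yes", "no"))

tree <- rpart(hatch ~ chill_days + mean_temp + transit_days, data = sim, method = "class")
newvoyage <- data.frame(chill_days = 70, mean_temp = 14, transit_days = 18)
predict(tree, newvoyage, type = "prob")          # predicted hatch probability

# bootstrap the emulator to assess the reliability of that prediction
boot_p <- replicate(200, {
  b  <- sim[sample(nrow(sim), replace = TRUE), ]
  bt <- rpart(hatch ~ chill_days + mean_temp + transit_days, data = b, method = "class")
  predict(bt, newvoyage, type = "prob")[, "yes"]
})
quantile(boot_p, c(0.025, 0.975))
```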

Wednesday 29th 11:10 Narrabeen

Subtractive Stability Measures For Improved Variable Selection

Connor Smith1, Samuel Müller1, and Boris Guennewig1,2
1University of Sydney
2University of New South Wales

This talk builds upon the Invisible Fence (Jiang et al., 2011), a promising model selection method. Utilising a combination of coefficient, scale and deviance estimates, we are able to improve this resampling-based model selection method for regression models, both linear and generalised linear models. The introduction of a variable inclusion plot allows for a visual representation of the stability of the model selection method as well as the variables' bootstrapped ranks. The suggested methods will be applied to both simulated and real examples, with comparisons of computational time and effectiveness against alternative selection procedures. We will report on our latest results from ongoing work in scaling up subtractive stability measures when the number of features is large.

References: Jiang, J., Nguyen, T., & Rao, J. S. (2011). Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, 4(3), 403-415.
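
A generic sketch of a bootstrap variable-inclusion summary is shown below (it uses stepwise AIC purely as a stand-in selector on simulated data; it is not the Invisible Fence or the subtractive stability measures themselves): refit the selector on bootstrap resamples and tabulate how often each predictor is chosen.

```r
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

B <- 100
selected <- replicate(B, {
  b   <- dat[sample(n, replace = TRUE), ]
  fit <- step(glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = b), trace = 0)
  names(coef(fit))
}, simplify = FALSE)

vars <- paste0("x", 1:4)
inclusion <- sapply(vars, function(v) mean(sapply(selected, function(s) v %in% s)))
barplot(sort(inclusion, decreasing = TRUE),
        ylab = "bootstrap inclusion proportion")   # a simple variable inclusion plot
```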

Wednesday 29th 11:10 Bundeena

A Comparison Of Multiple Imputation Methods For Missing Data In Longitudinal Studies

Md Hamidul Huque1, Katherine Lee1, Julie Simpson2, and John Carlin1
1Murdoch Childrens Research Institute
2University of Melbourne

Multiple imputation (MI) is increasingly used to handle missing data in longitudinal studies, where data are missing due to non-response and loss to follow-up. Standard multivariate normal imputation (MVNI) and fully conditional specification (FCS) are the principal imputation frameworks available for imputing cross-sectional missing data. A number of methods have been suggested in the literature to impute longitudinal data, including (i) use of standard FCS and MVNI with repeated measurements treated as separate distinct variables, and (ii) use of imputation methods based on generalized linear mixed models. There has been no clear evaluation of the relative performance of the available MI methods in the context of longitudinal data. We present a comprehensive comparison of the available methods for imputing longitudinal data in the context of estimating coefficients for both linear regression and linear mixed-effects models. We also compare the performance of the methods for imputing both binary and continuous data. A total of 10 different methods (MVNI, JM-pan, JM-jomo, standard FCS, FCS-twofold, FCS-MTW, FCS-2lnorm, FCS-2lglm, FCS-2ljomo and FCS-Blimp) are compared in terms of bias, standard error and coverage probability of the estimated regression coefficients. These methods are compared using a simulation study based on a previously conducted analysis exploring the association between the burden of overweight and quality of life (QoL) using data from the Longitudinal Study of Australian Children (LSAC). We found that both standard FCS and MVNI provide reliable estimates and coverage of the regression parameters. Among the other methods, the linear mixed model based approaches JM-jomo and FCS-Blimp hold great promise.
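
For orientation, the sketch below shows the standard FCS workflow with the mice package on toy wide-format data, with repeated measurements as separate columns (an illustration of approach (i) above, not the simulation study itself; variable names are hypothetical).

```r
library(mice)

set.seed(1)
dat <- data.frame(bmi = rnorm(200, 22, 3))
dat$qol1 <- 50 - 0.8 * dat$bmi + rnorm(200, sd = 5)
dat$qol2 <- 0.7 * dat$qol1 + rnorm(200, sd = 4)
dat$qol2[rbinom(200, 1, 0.3) == 1] <- NA          # roughly 30% missing at wave 2

imp  <- mice(dat, m = 5, printFlag = FALSE)       # FCS imputation with default methods
fits <- with(imp, lm(qol2 ~ bmi))                 # analysis model fitted in each imputed data set
summary(pool(fits))                               # estimates combined by Rubin's rules
```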

Wednesday 29th 11:30 Narrabeen

Species Distribution Modelling For Combined Data Sources

Ian Renner1 and Olivier Gimenez2
1University of Newcastle
2Centre d’Ecologie Fonctionnelle et Evolutive

Increasingly, multiple sources of species occurrence data are available for a particular species, collected through different protocols. For single-source models, a variety of methods have been developed: point process models for presence-only data, logistic regression for presence-absence data obtained through single-visit systematic surveys, and occupancy modelling for detection/non-detection data obtained through repeat-visit surveys. In situations for which multiple sources of data are available to model a species, these sources may be combined via a joint likelihood expression. Nonetheless, there are questions about how to interpret the output from such a combined model and how to diagnose potential violations of model assumptions such as the assumption of spatial independence among points.
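
Schematically, and with notation assumed rather than taken from the talk, a joint likelihood of this kind can combine a point process term for the presence-only records with a presence-absence term that shares the same intensity surface:

```latex
% \lambda(s) = \exp\{x(s)^\top \beta\} is the shared intensity over the study region A;
% s_i are presence-only locations, y_j the presence-absence outcomes at surveyed sites
% with areas a_j.
\ell(\beta)
  = \sum_{i \in \mathrm{PO}} \log \lambda(s_i) - \int_{A} \lambda(s)\, ds
  + \sum_{j \in \mathrm{PA}} \big[\, y_j \log p_j + (1 - y_j)\log (1 - p_j) \,\big],
\qquad p_j = 1 - \exp\{-\lambda(s_j)\, a_j\}.
```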

In this presentation, I will explore questions of interpretation of the output from these combined approaches, as well as propose extensions to current practice through the introduction of a LASSO penalty, source weights to account for differing quality of data, and models which account for spatial dependence among points. This approach will be demonstrated by modelling the distribution of the Eurasian lynx in eastern France.

Wednesday 29th 11:30 Gunnamatta

A Factor Analytic Mixed Model Approach For The Analysis Of Genotype By Treatment By Environment Data

Lauren Borg, Brian Cullis, and Alison Smith
University of Wollongong

The accurate evaluation of genotype performance for a range of traits, including disease resistance, is of great importance to the productivity and sustainability of major Australian commercial crops. Typically, the data generated from crop evaluation programmes arise from a series of field trials known as multi-environment trials (METs), which investigate genotype performance over a range of environments.

In evaluation trials for disease resistance, it is not uncommon for some genotypes to be chemically treated against the afflicting disease. An important example in Australia is the assessment of genotypes for resistance to blackleg disease in canola crops where it is common practice to treat canola seeds with a fungicide. Genotypes are either grown in trials as treated, untreated or as both.

There are a number of methods for the analysis of MET data. These methods, however, do not specifically address the analysis of data with an underlying three-way structure of genotype by treatment by environment (GxTxE). Here, we propose an extension of the factor analytic mixed model approach for MET data, using the canola blackleg data as the motivating example.
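
For reference, a schematic of an order-\(k\) factor analytic model for the genetic effects is given below (notation assumed, not taken from the talk); in the GxTxE extension the "environment" dimension is replaced by treatment-by-environment combinations.

```latex
% u: genetic effects of m genotypes across p environments (or p treatment-by-environment
% combinations); \Lambda: p x k loadings; f: genotype scores; \Psi: diagonal specific variances.
u = (\Lambda \otimes I_m)\, f + \delta,
\qquad
\operatorname{var}(u) = (\Lambda \Lambda^{\top} + \Psi) \otimes I_m .
```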

Historically in the analysis of blackleg data, the factorial genotype by treatment structure of the data was not accounted for. Entries, which are the combinations of genotypes and fungicide treatments present in trials, were regarded as 'genotypes' and a two-way analysis of 'genotypes' by environments was conducted.

The analysis of our example showed that the accuracy of genotype predictions, and thence information for growers, was substantially improved with the use of the three-way GxTxE approach compared with the historical approach.

Wednesday 29th 11:30 Bundeena

The Impact Of Cohort Substance Use Upon Likelihood Of Transitioning Through Stages Of Alcohol And Cannabis Use And Use Disorder: Findings From The Australian National Survey On Mental Health And Well-Being

Louisa Degenhardt1, Meyer Glantz2, Chrianna Bharat1, Amy Peacock1, Luise Lago1, Nancy Sampson3, and Ronald Kessler3
1National Drug and Alcohol Research Centre
2National Institute on Drug Abuse
3Harvard University

The aims of the present study were to use population-level Australian data to estimate the prevalence and speed of transitions across stages of alcohol and cannabis use, abuse and dependence, and remission from disorder, and to consider the extent to which the level of substance use in an individual's age and sex cohort predicted transitions into and out of substance use. Data on lifetime history of use, DSM-IV use disorders, and remission from these disorders were collected from participants (n=8,463) in the 2007 Australian National Survey of Mental Health and Wellbeing using the Composite International Diagnostic Interview.

Lifetime prevalence of alcohol use, regular use, abuse, dependence, and remission from abuse and dependence were 94.1%, 64.5%, 22.1%, 4.0%, 16.1% and 2.1%, respectively. Unconditional lifetime prevalence of cannabis use, abuse, dependence, and remission from abuse and dependence were 19.8%, 6.1%, 1.9%, 4.0% and 1.5%. Increases in the estimated proportion of people in the respondent’s sex and age cohort who used alcohol/cannabis as of a given age were significantly associated with most transitions from use through to remission beginning at the same age. Clear associations were documented between cohort-level prevalence of substance use and personal risk of subsequent transitions of individuals in the cohort from use to greater substance involvement. This relationship remained significant over and above associations involving the individual’s age of initiation. These findings have important implications for our understanding of the causal pathways into and out of problematic substance use.

Wednesday 29th 11:50 Narrabeen

The LASSO On Latent Indices For Ordinal Predictors In Regression

Francis Hui1, Samuel Mueller2, and Alan Welsh1
1ANU
2University of Sydney

Many applications of regression models involve ordinal categorical predictors. A motivating example we consider is ordinal ratings from individuals responding to questionnaires regarding their workplace in the Household Income and Labour Dynamics in Australia (HILDA) survey, with the aim being to study how workplace conditions (main and possible interaction effects) affect their overall mental wellbeing. A common approach to handling ordinal predictors is to treat each predictor as a factor variable. This can lead to a very high-dimensional problem, and has spurred much research into penalized likelihood methods for handling categorical predictors while respecting the marginality principle. On the other hand, given that the ordinal ratings are often regarded as manifestations of latent indices concerning different aspects of job quality, a more sensible approach may be to first perform some sort of dimension reduction and then enter the predicted indices into a regression model. In applied research this is often performed as a two-stage procedure, which fails to utilize the response to better predict the latent indices themselves.

In this talk, we propose the LASSO on Latent Indices (LoLI) for handling ordinal categorical predictors in regression. The LoLI model simultaneously constructs a continuous latent index for each ordinal predictor (or group of ordinal predictors) and models the response as a function of these indices (and other predictors if appropriate), including potential interactions, with a composite LASSO-type penalty added to perform selection on main and interaction effects between the latent indices. As a single-stage approach, the LoLI model is able to borrow strength from the response to improve construction of the continuous latent indices, which in turn produces better estimation of the corresponding regression coefficients. Furthermore, because of the construction of latent indices, the dimensionality of the problem is substantially reduced before any variable selection is performed. For estimation, we propose first estimating the cutoffs relating the observed ordinal predictors to the latent indices. Then, conditional on these cutoffs, we apply a penalized Expectation Maximization algorithm via importance sampling to estimate the regression coefficients. A simulation study demonstrates the improved power of the LoLI model at detecting truly important ordinal predictors compared to both two-stage approaches and the use of factor variables, and better predictive and estimation performance compared to the commonly used two-stage approach.
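
Schematically (with notation assumed, and without the hierarchy details of the authors' composite penalty), the penalised objective has the flavour of

```latex
% z_1, ..., z_q are the latent indices underlying the ordinal items; cutpoints linking
% each observed rating to its index are estimated first, and the penalised fit is then
% obtained by an EM-type algorithm conditional on them.
\hat{\theta} = \arg\min_{\theta}
  \; -\ell\Big(y \,\Big|\, \beta_0 + \sum_{k} \beta_k z_k + \sum_{k < l} \gamma_{kl} z_k z_l \Big)
  \; + \; \lambda \Big( \sum_{k} |\beta_k| + \sum_{k < l} |\gamma_{kl}| \Big).
```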

Wednesday 29th 11:50 Gunnamatta

Whole‑Genome QTL Analysis For Nested Association Mapping Populations

Maria Valeria Paccapelo1, Alison Kelly1, Jack Christopher2, and Arunas Verbyla3,4
1Queensland Department of Agriculture and Fisheries
2Queensland Alliance for Agriculture and Food
3Data61
4CSIRO

Genetic dissection of quantitative traits in plants has become an important tool in breeding of improved varieties. The most commonly used methods to map QTL are linkage analysis in bi-parental populations and association mapping in diversity panels. However, bi-parental populations are restricted in terms of allelic diversity and recombination events. Despite the fact that association mapping overcomes these limitations, it has low power to detect rare alleles associated with a trait of interest. Multi-parent populations such as multi-parent advanced generation inter-cross (MAGIC) and nested association mapping (NAM) populations have been developed to combine strengths of both mapping approaches, capturing more recombination events and allelic diversity than bi-parental populations and at a greater frequency than a diversity panel. Nested association mapping uses multiple recombinant inbred line (RIL) families connected by a single common parent. Such a population structure presents some additional challenges compared to traditional mapping, in particular the population design and the large number of molecular markers that need to be integrated simultaneously into the analysis. We present a method for QTL mapping for NAM populations adapted from multi-parent whole genome average interval mapping (MPWGAIM), in which the NAM design is incorporated through the probability of inheriting founder alleles for every marker across the genome. This method is based on a mixed linear model in a one-stage analysis of raw phenotypes together with markers. It simultaneously scans the whole genome through an iterative process leading to a multi-locus model. The approach was applied to a wheat NAM population in order to perform QTL mapping for plant height. The method was developed in R, with main dependencies being the R packages MPWGAIM and asreml. This approach establishes the basis for further studies and extensions such as the combination of multiple NAM populations.

Wednesday 29th 11:50 Bundeena

An Asymmetric Measure Of Population Differentiation Based On The Saddlepoint Approximation Method

Louise McMillan and Rachel Fewster
University of Auckland

In the field of population genetics there are many measures of genetic diversity and population differentiation. The best known is Wright's Fst, later expanded by Cockerham and Weir, which is very widely used as a measure of separation between populations. More recently a multitude of other measures have been developed, from Gst to D, all with different features and disadvantages. One thing these measures all have in common is that they are symmetric, which is to say that the Fst between population A and population B is the same as that between population B and population A. Following my work on GenePlot, a visualization tool for genetic assignment, I am now working on the development of an asymmetric measure, where the fit of A into B may not be the same as the fit of B into A. This measure will enable the detection of scenarios such as "subsetting", the relationship between a large, diverse population A and a smaller population B that has experienced genetic drift since being separated from A. The measure has several features that distinguish it from existing measures, and is constructed using the same saddlepoint approximation method that underlies GenePlot and that is used to approximate the multi-locus genetic distributions of populations.

Wednesday 29th 12:10 Narrabeen

Fast And Approximate Exhaustive Variable Selection For GLMs With APES

Kevin Wang, Samuel Mueller, Garth Tarr, and Jean Yang
University of Sydney

Obtaining maximum likelihood estimates for generalised linear models (GLMs) is computationally intensive and remains the major obstacle to performing all-subsets variable selection. Exhaustive exploration of the model space, even for a moderately large number of covariates, remains a formidable challenge for modern computing capabilities. On the other hand, efficient algorithms for exhaustive searches do exist for linear models, most notably the leaps-and-bounds algorithm and, more recently, the mixed integer optimisation algorithm. In this talk, we present APES (APproximated Exhaustive Search), a new method that approximates all-subsets selection for a given GLM by reformulating the problem as a linear model. The method works by learning from observational weights in a correct/saturated generalised linear regression model. APES can be used in partnership with any other state-of-the-art linear model selection algorithm, thus enabling (approximate) exhaustive model exploration in dimensions much higher than previously feasible. We will demonstrate that APES model selection is competitive against genuine exhaustive search via simulation studies and applications to health data. Extensions to a robust setting are also possible.
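
The core idea can be sketched as follows (simulated data; a generic stand-in rather than the APES implementation, and assuming the leaps package for the exhaustive linear-model search): approximate the GLM by its IRLS working linear model, then run an exhaustive subset search on the weighted pseudo-response.

```r
library(leaps)

set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 - 1.5 * dat$x3))

full <- glm(y ~ ., family = binomial, data = dat)                 # full GLM with all covariates
z <- full$linear.predictors + residuals(full, type = "working")   # IRLS working response
w <- full$weights                                                  # IRLS weights at convergence

lin <- data.frame(z = z, dat[, paste0("x", 1:5)])
search <- regsubsets(z ~ ., data = lin, weights = w, nvmax = 5, method = "exhaustive")
summary(search)$which                                              # best subset of each size
```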

Wednesday 29th 12:10 Gunnamatta

Order Selection Of Factor Analytic Models For Genotype X Environment Interaction

Emi Tanaka1, Francis Hui2, and David Warton3
1University of Sydney
2ANU
3University of New South Wales

Factor analytic (FA) models are widely used across a range of disciplines, owing to the computational advantages of dimension reduction and the possible ability to interpret the factors. In plant breeding, the FA model provides a natural framework for modelling the genotype x environment interaction. An FA model is dictated by its number of factors (the order of the model). A higher order leads to more parameters in the model, which necessitates order selection to achieve parsimony. We introduce an order selection method via the ordered factor lasso (OFAL). We illustrate its performance using a simulation based on a real wheat yield multi-environment trial.

Wednesday 29th 12:10 Bundeena

Multiple Sample Hypothesis Testing Of The Human Microbiome Through Evolutionary Trees

Martina Mincheva1, Hongzhe Li2, and Jun Chen3
1Temple University
2University of Pennsylvania
3Mayo Clinic

Next generation sequencing technologies make it possible to survey microbial communities by sequencing nucleic acid material extracted from multiple samples. The metagenomic read counts are summarized as empirical distributions on a reference phylogenetic tree. The distance between them is evaluated by the Kantorovich-Rubinstein (KR) metric, equivalent to the commonly used weighted UniFrac distance on a tree. This paper proposes a method to test the hypothesis that two sets of samples have the same microbial composition. The asymptotic distributions of the KR distance between the two Fréchet means and of the Fréchet variances are derived and are shown to be independent. The test statistic is defined as the ratio of those quantities and is shown to follow an asymptotic F-distribution. Its generality stems from the fact that the test is nonparametric and requires no assumptions on the probability distributions of the count data. It is also applicable for varying set sizes and sample sizes. It extends Evans and Matsen (2012), who suggested a test comparing only two single samples. The computational efficiency of the proposed test comes from the exact asymptotic distribution of the proposed test statistic. Extensive data analysis shows that the test is significantly faster than the permutation-based multivariate analysis of variance using distance matrices (PERMANOVA; McArdle and Anderson, 2001). At the same time, it has correct type I error rates and comparable power, which makes it preferable in the analysis of large-scale microbiome data.

Keynote: Wednesday 29th 13:40 Mantra

Statistical Strategies For The Analysis Of Large And Complex Data

Louise Ryan1,2, Stephen Wright1,3, and Hon Hwang1
1University of Technology Sydney
2Harvard T. H. Chan School of Public Health
3Australian Red Cross

This talk will focus on challenges that arise when faced with the analysis of datasets that are too large for standard statistical methods to work properly. While one can always go for the expensive solution of getting access to a more powerful computer or cluster, it turns out that there are some simple statistical strategies that can be used. In particular, we'll discuss the use of so-called “Divide and Recombine” strategies that allow some of the work to be done in a distributed fashion, for example via Hadoop. Combining these strategies with clever subsampling and data coarsening ideas can result in datasets that are small enough to manage on a standard desktop machine, with only minimal efficiency loss. The ideas are illustrated with data from the Australian Red Cross.
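
A minimal sketch of one divide-and-recombine scheme is given below (simulated data; an illustration of the general idea rather than the authors' pipeline): fit the same GLM on each block of data and recombine the block estimates by inverse-variance weighting.

```r
set.seed(1)
n <- 1e5
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 0.5 * dat$x))

blocks <- split(dat, rep(1:20, length.out = n))       # 20 subsets; in practice, distributed tasks
fits <- lapply(blocks, function(d) {
  f <- glm(y ~ x, family = binomial, data = d)
  list(est = coef(f), var = diag(vcov(f)))
})

est <- sapply(fits, `[[`, "est")                      # 2 x 20 matrix of block estimates
v   <- sapply(fits, `[[`, "var")
combined <- rowSums(est / v) / rowSums(1 / v)         # inverse-variance weighted recombination
cbind(combined, full_data = coef(glm(y ~ x, family = binomial, data = dat)))
```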

Wednesday 29th 14:30 Narrabeen

Optimal Experimental Design For Functional Response Experiments

Jeff Zhang and Christopher Drovandi
Queensland University of Technology

Functional response models are important in understanding predator-prey interactions. The development of functional response methodology has progressed from mechanistic models to more statistically motivated models that can account for variance and the over-dispersion commonly seen in the datasets collected from functional response experiments. However, little information seems to be available to those wishing to prepare optimal parameter estimation designs for functional response experiments. We develop a so-called exchange design optimisation algorithm suitable for integer-valued design spaces, which for the motivating functional response experiment involves selecting the number of prey used for each observation. Further, we develop and compare new utility functions for performing robust optimal design in the presence of parameter uncertainty, which are generally applicable. The methods are illustrated using a published beta-binomial functional response model for an experiment involving the freshwater predator Notonecta glauca (an aquatic insect) preying on Asellus aquaticus (a small crustacean) as a case study.
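
A toy version of an exchange algorithm on an integer design space is sketched below; the utility here is simple D-optimality for a quadratic trend in prey number, a stand-in for the robust parameter-estimation utilities developed in the talk.

```r
utility <- function(design) {
  X <- cbind(1, design, design^2)
  as.numeric(determinant(crossprod(X), logarithm = TRUE)$modulus)   # log det(X'X)
}

candidates <- 1:100                                   # admissible numbers of prey
set.seed(1)
design <- sample(candidates, 20, replace = TRUE)      # random starting design

repeat {
  improved <- FALSE
  for (i in seq_along(design)) {
    u_try <- sapply(candidates, function(v) utility(replace(design, i, v)))
    best  <- candidates[which.max(u_try)]
    if (max(u_try) > utility(design) + 1e-8) {        # exchange point i if it improves the utility
      design[i] <- best
      improved  <- TRUE
    }
  }
  if (!improved) break
}
sort(design)                                          # optimised numbers of prey per observation
```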

Wednesday 29th 14:30 Gunnamatta

To PCA Or Not To PCA

Catherine M. McKenzie1, Wei Zhang1, Stuart D. Card1, Cory Matthew2, Wade J. Mace1, and Siva Ganesh1
1AgResearch
2Massey University

When there are groupings of observations present in the data, many researchers resort to utilising a Principal Components Analysis (PCA) a priori for identifying patterns in the data, and then look to map the patterns obtained from PCA to differences among the groupings, or attribute biological signal to them. Is this appropriate, given that PCA is not designed to discriminate between the groupings? Is a group-oriented multivariate methodology such as multivariate analysis of variance (MANOVA) or Canonical Discriminant Analysis (CDA) preferable? Which method has more relevance when investigating factor effects on biochemical pathways? We explore this question via a biological example.

Two biological materials (B1 and B2) were analysed for the same 19 primary metabolites, with three factors: Method (M1 and M2), Treatment (T1, T2, T3 and T4), and Age (A1 and A2), with three replicate values giving a total of 48 observations for each biological material. Univariate and multivariate analyses of variance (ANOVA and MANOVA, respectively) were carried out, and many interaction effects were statistically significant. In addition, other multivariate techniques such as PCA and CDA were used to explore relationships between the variables. The question remains whether it is appropriate to carry out PCA to explore biochemical pathways: should the pattern extraction be tailored a priori to match the known groupings within the data, or should one start with an unconstrained pattern analysis and seek to explain the detected patterns post-analysis?
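
The contrast can be seen in a small sketch (toy data, not the study's metabolite profiles): PCA finds directions of maximum total variance regardless of grouping, whereas a canonical/linear discriminant analysis finds directions that separate the groups.

```r
library(MASS)

set.seed(1)
grp  <- factor(rep(c("T1", "T2", "T3", "T4"), each = 12))
mets <- matrix(rnorm(48 * 19), nrow = 48)          # 48 observations on 19 "metabolites"
mets[, 1] <- mets[, 1] + 2 * as.numeric(grp)       # group signal confined to one variable

pca <- prcomp(mets, scale. = TRUE)
cda <- lda(mets, grouping = grp)

summary(pca)$importance[, 1:3]     # the leading PCs need not reflect the group structure
head(predict(cda)$x)               # canonical variates are constructed to discriminate the groups
```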

Wednesday 29th 14:50 Narrabeen

An Evaluation Of Error Variance Bias In Spatial Designs

Emlyn Williams1 and Hans-Peter Piepho2
1ANU
2University of Hohenheim

Spatial design and analysis are widely used, particularly in field experimentation. However, it is often the case that spatial analysis does not enhance more traditional approaches such as row-column analysis. It is then of interest to gauge the degree of error variance bias that accrues when a spatially-designed experiment is analysed as a row-column design. This talk builds on the work of Tedin (1931) who, with R.A. Fisher as advisor, studied error variance bias in knight’s move Latin squares.

Wednesday 29th 14:50 Gunnamatta

New Model-Based Ordination Data Exploration Tools For Microbiome Studies

Olivier Thas1, Stijn Hawinkel1, and Luc Bijnens2
1Ghent University
2Janssen Pharmaceutics

High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods for analysing microbiome data are still in their infancy. Data exploration often relies on classical dimension reduction methods such as Principal Coordinate Analysis (PCoA), which is basically a Multidimensional Scaling (MDS) method starting from ecologically relevant distance measures between the vectors of relative abundances of the microorganisms (e.g. the Bray-Curtis distance).

We will demonstrate that these classical visualisation methods fail to deal with microbiome-specific issues such as variability due to library-size differences and overdispersion. Next we propose a new technique that is based on a negative binomial regression model with log-link, and which relies on the connection between correspondence analysis and the log-linear RC(M) models of Goodman (Annals of Statistics, vol. 13, 1985); see also Zhu et al. (Ecological Modelling, vol. 187, 2005). Instead of assuming a Poisson distribution for the counts, a negative binomial distribution is assumed. To better account for library size effects, we adopt a different weighting scheme, which naturally arises from the parameterisation of the model. An iterative parameter estimation method is proposed and implemented in R. The new method is illustrated on several example datasets, and it is empirically evaluated in a simulation study. It is concluded that our method succeeds better in discovering structure in microbiome datasets than conventional methods do.

In the second part of the presentation we extend the model-based method to a constrained ordination method by using sample-specific covariate data. The method looks for a two-dimensional visualisation that optimally discriminates between species with respect to their sensitivity to environmental conditions. Again we build upon results of Zhu et al. (2005) and Zhang and Thas (Statistical Modelling, vol. 12, 2012). The method is illustrated on real data.

All methods are available as an R package.

Wednesday 29th 15:10 Narrabeen

Always Randomize?

Chris Brien1,2
1University of South Australia
2University of Adelaide

Fisher gave us three fundamental principles for designed experiments: replication, randomization and local control. Consonant with this, Brien et al. (2011) [Brien, C. J., Harch, B. D., Correll, R. L., & Bailey, R. A. (2011) Multiphase experiments with at least one later laboratory phase. I. Orthogonal designs. Journal of Agricultural, Biological, and Environmental Statistics, 16, 422-450.] exhort the use of randomization in multiphase experiments via their Principle 7 (Allocate and randomize in the laboratory). This principle is qualified with ‘wherever possible’, which leads to the question ‘when is randomization not possible?’.

Situations where randomization is not applicable will be described for both single-phase and multiphase experiments. The reasons for not randomizing include practical limitations and, for multiphase experiments, difficulty in estimating variance parameters when randomization is employed. For the latter case, simulation studies canvassing a number of potential difficulties will be described. A Nonrandomization Principle, and an accompanying analysis strategy, for multiphase experiments will be proposed.

Wednesday 29th 15:10 Gunnamatta

Comparing Classical Criteria For Selecting Intra-Class Correlated Features For Three-Mode Three-Way Data

Lynette Hunt1 and Kaye Basford2
1University of Waikato
2University of Queensland

Many unsupervised learning tasks involve data sets with both continuous and categorical attributes. One possible approach to clustering such data is to assume that the data to be clustered come from a finite mixture of populations. There has been extensive use of mixtures where the component distributions are multivariate normal and where the data would be described as two-mode two-way data. The finite mixture model can also be used to cluster three-way data. The mixture model approach requires specification of the number of components to be fitted and the form of the density functions of the underlying components.

This talk illustrates the performance of several commonly used model selection criteria in selecting both the number of components and the form of the correlation structure amongst the attributes when fitting a finite mixture model to cluster three-way data containing mixed categorical and continuous attributes.
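
As a simplified point of comparison (continuous, two-mode data only, not the three-mode mixed setting of the talk), the mclust package illustrates how BIC can be used to select both the number of components and the covariance structure:

```r
library(mclust)

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))

fit <- Mclust(x, G = 1:5)      # compares 1 to 5 components across covariance parameterisations
summary(fit)                   # reports the BIC-selected model (here, two components)
plot(fit, what = "BIC")
```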

Wednesday 29th 15:50 Narrabeen

Exploring The Social Relationships Of Dairy Goats

Vanessa Cave1, Benjamin Fernoit2, Jim Webster1, and Gosia Zobel1
1AgResearch
2Agrosup Dijon

Goats are sentient beings capable of an emotional response to their lives. Yet despite this, the social relationships between animals are largely overlooked in commercial systems. Goats have been shown to recognise, and make decisions based on, the presence of other specific goats. The degree to which they choose to associate with individuals has not been established, but social bonds in other animals have been shown to buffer against stressful situations, and conversely can be a significant source of stress when such bonds are disrupted.

To investigate whether social relationships exist among dairy goats, four non-consecutive days of video focusing on a group of 12 goats were analysed. At one-minute scan intervals, the proximity of every goat relative to every other goat was recorded as an ordinal variable with four levels (in contact, within a head length, within a body length, or alone). Each goat's location in the pen was also noted (feeding, bedding, climbing platform).

A variety of statistical techniques, including heatmaps and network analyses, were used to study the social relationships among goats based on proximity. Social relationships were characterised by specific pairs of goats reliably spending a lot of time in close proximity.
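
A hedged sketch of the kind of summary involved is shown below (simulated proximity data in a hypothetical format, not the study's observations): build a goat-by-goat association matrix of the proportion of scans each pair spent in contact, then view it as a heatmap and as a network.

```r
library(igraph)

set.seed(1)
goats <- paste0("G", 1:12)
assoc <- matrix(runif(144, 0, 0.4), 12, 12, dimnames = list(goats, goats))
assoc <- (assoc + t(assoc)) / 2                     # symmetric pairwise proportions
diag(assoc) <- 0

heatmap(assoc, symm = TRUE)                         # which pairs co-occur most often

g <- graph_from_adjacency_matrix(1 * (assoc > 0.3), mode = "undirected", diag = FALSE)
plot(g, vertex.size = 25)                           # edges link frequently associating pairs
```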

Results indicate that whilst some goats were “sociable” (e.g., spending more than 60% of their time with other goats), others tended to be “loners” (e.g., spending more than 60% of their time alone). Interestingly, there was evidence of both preferred and avoided companionships.

This small-scale study provides the first evidence to suggest that common management practices resulting in the regrouping of dairy goats could have an impact on their welfare.

Wednesday 29th 15:50 Gunnamatta

Bayesian Semi-Parametric Spectral Density Estimation With Applications To The Southern Oscillation Index

Claudia Kirch1, Matt Edwards2, Alexander Meier1, and Renate Meyer2
1University of Magdeburg
2University of Auckland

Standard time series modelling is dominated by parametric models like ARMA and GARCH models. Even though nonparametric Bayesian inference has been a rapidly growing area over the last decade, only very few nonparametric Bayesian approaches to time series analysis have been developed. Most notably, Carter and Kohn (1997), Gangopadhyay (1998), Choudhuri et al. (2004), and Rosen et al. (2012) used Whittle's likelihood for Bayesian modelling of the spectral density as the main nonparametric characteristic of stationary time series. On the other hand, frequentist time series analyses are often based on nonparametric techniques encompassing a multitude of bootstrap methods (Kreiss and Lahiri, 2011; Kirch and Politis, 2011).
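
For reference, the Whittle likelihood referred to above approximates the log-likelihood of a stationary series through its periodogram and spectral density:

```latex
% I(\lambda_j): periodogram at the Fourier frequencies \lambda_j = 2\pi j/n;
% f: the spectral density, the nonparametric target in the approaches cited above.
\ell_W(f) \;=\; -\sum_{j=1}^{\lfloor (n-1)/2 \rfloor}
  \left\{ \log f(\lambda_j) + \frac{I(\lambda_j)}{f(\lambda_j)} \right\} .
```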

As shown in Contreras-Cristan et al. (2006), the loss of efficiency of the nonparametric approach using Whittle's likelihood approximation can be substantial. On the other hand, parametric methods are more powerful than nonparametric methods if the observed time series is close to the considered model class, but fail if the model is misspecified. Therefore, we suggest a nonparametric correction of a parametric likelihood that takes advantage of the efficiency of parametric models while mitigating sensitivities through a nonparametric amendment. We use a nonparametric Bernstein polynomial prior on the spectral density with weights induced by a Dirichlet process. Contiguity and posterior consistency for Gaussian stationary time series have been shown in a preprint by Kirch et al. (2017). Bayesian posterior computations are implemented via an MH-within-Gibbs sampler and the performance of the nonparametrically corrected likelihood is illustrated in a simulation study. We use this approach to analyse the monthly time series of the Southern Oscillation Index, one of the key atmospheric indices for gauging the strength of El Niño events and their potential impacts on the Australian region.

Wednesday 29th 16:10 Narrabeen

Efficient Multivariate Sensitivity Analysis Of Agricultural Simulators

Daniel Gladish
CSIRO

Complex mechanistic computer models often produce multivariate output. Sensitivity analysis can be used to help understand sources of uncertainty in the system. Much of the literature on sensitivity analysis has focused on univariate output, with some recent advances for multivariate correlated output. One promising method for multivariate sensitivity analysis involves decomposition through basis function expansion. However, these methods often require several model runs and may still be computationally intensive for practical purposes. Emulators are a proven method for reducing computational time in univariate sensitivity analysis, with some recent development for multivariate computer models. We propose the use of generalized additive models and random forests, combined with principal component analysis, to emulate the simulator for multivariate sensitivity analysis. We demonstrate our method using a complex agricultural simulator.
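
A hedged sketch of the emulation step follows (a toy simulator and hypothetical inputs, not the agricultural simulator itself): decompose the multivariate output with PCA, then emulate each retained principal component score as a smooth function of the inputs.

```r
library(mgcv)

set.seed(1)
inputs <- data.frame(x1 = runif(200), x2 = runif(200))
# toy "simulator" returning a 30-dimensional output curve for each run
out <- t(sapply(1:200, function(i)
  sin(2 * pi * (1:30) / 30) * inputs$x1[i] + (1:30) / 30 * inputs$x2[i] + rnorm(30, sd = 0.05)))

pc   <- prcomp(out, center = TRUE)
keep <- 1:2                                          # principal component scores to emulate
emulators <- lapply(keep, function(k)
  gam(score ~ s(x1) + s(x2), data = cbind(inputs, score = pc$x[, k])))

# predictions in the original output space are rebuilt from the emulated scores
new        <- data.frame(x1 = 0.3, x2 = 0.7)
scores_hat <- sapply(emulators, predict, newdata = new)
yhat       <- pc$center + pc$rotation[, keep] %*% scores_hat
```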

Wednesday 29th 16:10 Gunnamatta

Bayesian Hypothesis Tests With Diffuse Priors: Can We Have Our Cake And Eat It Too?

John Ormerod, Michael Stewart, Weichang Yu, and Sarah Romanes
University of Sydney

We introduce a new class of priors for Bayesian hypothesis testing, which we name “cake priors”. These priors circumvent Bartlett's paradox (also called the Jeffreys-Lindley paradox): the problem of diffuse priors leading to nonsensical statistical inferences. Cake priors allow the use of diffuse priors (having one's cake) while achieving theoretically justified inferences (eating it too). We demonstrate this methodology for Bayesian hypothesis tests in the scenarios under which the one- and two-sample \(t\)-tests, and linear models, are typically derived. The resulting test statistics take the form of a penalized likelihood ratio test statistic. By considering the sampling distribution under the null and alternative hypotheses, we show for independent and identically distributed regular parametric models that Bayesian hypothesis tests using cake priors are strongly Chernoff-consistent, i.e., they achieve zero type I and type II errors asymptotically. Lindley's paradox is also discussed.