Programme And Abstracts For Wednesday 29th Of November
Keynote: Wednesday 29th 9:00 Mantra
Statistics On Street Corners
Dianne Cook
Monash University
Perceptual research is often conducted on the street, with convenience sampling of pedestrians who happen to be passing by. It is through experiments conducted with passers-by that we have learned that change-blindness (https://www.youtube.com/watch?v=FWSxSQsspiQ) is in play outside the laboratory.
In data science, plots of data become important tools for observing patterns, making decisions, and communicating findings. But plots of data can be viewed differently by different observers, and often provoke skepticism about whether what you see “is really there?” With the availability of technology that harnesses statistical randomisation techniques and input from crowds we can provide objective evaluation of structure read from plots of data.
This talk describes an inferential framework for data visualisation, and the protocols that can be used to provide estimates of p-values and power. I will discuss the experiments we have conducted, which (1) show that crowd-sourcing provides results similar to statistical hypothesis testing, (2) show how this can be used to improve plot design, and (3) provide p-values in situations where no classical tests exist. Examples from ecology and agriculture will be shown.
Joint work with Heike Hofmann, Andreas Buja, Deborah Swayne, Hadley Wickham, Eun-kyung Lee, Mahbubul Majumder, Niladri Roy Chowdhury, Lendie Follett, Susan Vanderplas, Adam Loy, Yifan Zhao, Nathaniel Tomasetti
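One widely used protocol in this line of work is the "lineup", in which the plot of the real data is hidden among null plots generated under a hypothesis of no structure, and independent observers are asked to pick the plot that stands out. A minimal sketch using the nullabor R package follows; the data set (mtcars) and the null hypothesis (permuting mpg) are purely illustrative, not examples from the talk.

```r
## Minimal lineup sketch with nullabor; illustrative data and null hypothesis only.
library(nullabor)
library(ggplot2)

d <- lineup(null_permute("mpg"), mtcars)   # real data hidden among 19 permuted nulls
ggplot(d, aes(x = mpg, y = wt)) +
  geom_point() +
  facet_wrap(~ .sample)                    # observers pick the panel that stands out

## The proportion of independent observers who identify the true panel gives an
## empirical ("visual") p-value for the presence of structure.
```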
Wednesday 29th 10:30 Narrabeen
Estimating Overdispersion In Sparse Multinomial Data
Farzana Afroz
University of Otago
Wednesday 29th 10:30 Gunnamatta
Assessing Mud Crab Meat Fullness Using Non-Invasive Technologies
Carole Wright, Steve Grauf, Brett Wedding, Paul Exley, John Mayze, and Sue Poole
Queensland Department of Agriculture and Fisheries
The decision of whether a mud crab should be retained at harvest has traditionally been based on shell hardness, most commonly assessed by applying thumb pressure to the carapace. The carapace of a recently moulted mud crab will flex considerably, and such crabs are therefore returned to the water. This assessment has also been used to divide mud crabs into three meat fullness grades (A, B and C). Grade A mud crabs, with higher meat fullness, fetch a greater price at market than the lower B and C grades. The subjective nature of this assessment will always result in disputes at the boundaries of the grades. Developing a more objective, science-based method will reduce downgrades at market while increasing consumer satisfaction and overall industry profitability.
A scoping study was conducted that evaluated innovative non-invasive technologies for assessing mud crab meat fullness, based on the percentage yield recovery of cooked meat from the dominant claw of individual mud crabs. The non-invasive technologies assessed included near infrared spectroscopy (NIRS), candling using visible light, and acoustic velocity. NIRS showed the most potential and was reassessed in a second study with slight improvements to spectral capture methods and NIR light sources.
Ninety-four live mud crabs from the Moreton Bay area were used in the second study. Partial least squares regression (PLS-R) was used to build a calibration model predicting the percentage yield recovery of cooked meat from the spectral data. The PLS-R model achieved \(R^2=0.77\) and \(\mathrm{RMSECV}=4.8\).
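A hedged sketch of this kind of PLS-R calibration, using the pls package, is below; it is not the authors' code, and `nir_spectra` (a samples-by-wavelengths matrix) and `meat_yield` (percentage cooked-meat recovery) are assumed objects.

```r
## PLS-R calibration sketch with leave-one-out cross-validation (assumed inputs).
library(pls)

crab <- data.frame(meat_yield = meat_yield, spectra = I(nir_spectra))

fit <- plsr(meat_yield ~ spectra, ncomp = 15, data = crab,
            validation = "LOO", scale = TRUE)

RMSEP(fit)   # cross-validated RMSEP by number of components
R2(fit)      # cross-validated R^2 by number of components

## Pick the number of components minimising the cross-validated error
ncomp_opt <- which.min(RMSEP(fit)$val["CV", 1, -1])
```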
A principal components linear discriminant analysis (PC-LDA) was also conducted to discriminate between the standard three grades of mud crab meat fullness. This was compared to the industry standard shell hardness method. The NIRS PC-LDA achieved a minimum of 76% correct classification for each of the three grades, compared to 24% for the shell hardness method.
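A small sketch of a PC-LDA of this kind is shown below (again not the authors' code); `nir_spectra` and a three-level factor `grade` (A, B, C) are assumed, and the number of principal components retained is purely illustrative.

```r
## PC-LDA sketch: principal components of the spectra fed into an LDA of grade.
library(MASS)

pc     <- prcomp(nir_spectra, scale. = TRUE)
scores <- pc$x[, 1:10]                              # first 10 PCs (illustrative choice)

pc_lda <- lda(scores, grouping = grade, CV = TRUE)  # leave-one-out cross-validation
table(observed = grade, predicted = pc_lda$class)   # per-grade classification rates
```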
The non-invasive technologies trialled, along with the results, will be discussed in this talk.
Wednesday 29th 10:30 Bundeena
Analysis Of Multivariate Binary Longitudinal Data: Metabolic Syndrome During Menopausal Transition
Geoff Jones
Massey University
Wednesday 29th 10:50 Narrabeen
Statistical Analysis Of Coastal And Oceanographic Influences On The Queensland Scallop Fishery
Wen-Hsi Yang1, Anthony J. Courtney2, Michael F. O’Neill2, Matthew J. Campbell2, George M. Leigh2, and Jerzy A. Filar1
1University of Queensland
2Queensland Department of Agriculture and Fisheries
Wednesday 29th 10:50 Gunnamatta
Saved By The Experimental Design: Testing Bycatch Reduction And Turtle Exclusion Devices In The PNG Prawn Trawl Fishery
Emma Lawrence and Bill Venables
CSIRO
In trawling for prawns, the prawn catch is often only a small part of the results of any one trawl, with the remainder called “bycatch”. Reducing the bycatch component, while maintaining the prawn catch, is an important industry goal, primarily for environmental accreditation purposes, but also for economic reasons.
We designed an at-sea trial for the Gulf of Papua Prawn Fishery, involving four vessels each towing “quad gear” (that is, 4 separate but linked trawl nets) in each trawl shot, over 18 days. The experiment was designed to assess the effectiveness of 27 combinations of Turtle Excluder Devices (TEDs) and Bycatch Reduction Devices (BRDs), with a control net without any attached device as one of the nets in each quad. At Biometrics 2015 we discussed how we used simulated annealing to generate a highly efficient design, in several stages, to meet the large number of highly specific logistical constraints.
The focus of this talk will be the analysis, which also proved somewhat challenging. We will present the results of our analysis and demonstrate why putting the time into thinking about and generating a non-standard experimental design allowed us to accommodate the various glitches and misfortunes that always seem to happen at sea.
Wednesday 29th 10:50 Bundeena
Rethinking Biosecurity Inspections: A Case Study Of The Asian Gypsy Moth (AGM) In Australia
Petra Kuhnert1, Dean Paini1, Paul Mwebaze1, and John Nielsen2
1CSIRO
2Department of Agriculture and Water Resources
The Asian gypsy moth (AGM) (Lymantria dispar asiatica) is a serious biosecurity risk to Australia’s forestry and horticultural industries. While similar in appearance to the European gypsy moth (Lymantria dispar dispar), the Asian gypsy moth is capable of flying up to 40 kilometres and therefore has the potential to establish and spread in new areas such as Australia. In addition, females are attracted to light and will oviposit (lay eggs) indiscriminately. As a result, females are attracted to shipping ports at night and will oviposit on ships. These ships therefore have the potential to spread the moth around the world.
The life-cycle of the moth has been well documented and is heavily dependent on temperature, with eggs undergoing three phases of diapause before hatching. Current inspection of vessels arriving at Australian ports from what is deemed an “at risk” port is a lengthy and costly process.
To assist the Department of Agriculture with their prioritisation of ships, we developed an AGM Tool in the form of an R Shiny app that (1) shows the shortest maritime path from an at-risk port to an Australian port for a vessel of interest and (2) predicts the probability of a potential hatch, and its reliability, using a classification tree model developed to emulate the life-cycle of the moth from simulated data. In this talk we will discuss the methodology that (1) simulates the AGM biology and potential hatches of eggs, along with how we extracted the temperature data that is the primary driver of the AGM life-cycle, and (2) emulates the simulated data using a statistical model, namely a classification tree, to predict the probability of a potential hatch. We will also discuss a bootstrap approach for exploring the reliability of the predicted hatch probability.
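The sketch below illustrates the general idea of emulating simulated hatch outcomes with a classification tree and bootstrapping its predictions; it is not the AGM Tool itself, and `sim_agm` and its columns (`hatch`, a factor with levels "no"/"yes"; `temp_sum`; `chill_days`; `voyage_days`) are hypothetical names.

```r
## Classification-tree emulator sketch with a bootstrap reliability check.
library(rpart)

tree <- rpart(hatch ~ temp_sum + chill_days + voyage_days,
              data = sim_agm, method = "class")

new_voyage <- data.frame(temp_sum = 1200, chill_days = 45, voyage_days = 18)
predict(tree, new_voyage, type = "prob")[, "yes"]   # predicted hatch probability

## Bootstrap the emulator to express the reliability of that prediction
boot_p <- replicate(500, {
  idx <- sample(nrow(sim_agm), replace = TRUE)
  b   <- rpart(hatch ~ temp_sum + chill_days + voyage_days,
               data = sim_agm[idx, ], method = "class")
  predict(b, new_voyage, type = "prob")[, "yes"]
})
quantile(boot_p, c(0.025, 0.975))   # interval for the predicted hatch probability
```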
Wednesday 29th 11:10 Narrabeen
Subtractive Stability Measures For Improved Variable Selection
Connor Smith1, Samuel Müller1, and Boris Guennewig1,2
1University of Sydney
2University of New South Wales
This talk builds upon the Invisible Fence (Jiang et al., 2011), a promising model selection method. Utilizing a combination of coefficient, scale and deviance estimates, we are able to improve this resampling-based model selection method for regression models, both linear and general linear models. The introduction of a variable inclusion plot allows for a visual representation of the stability of the model selection method as well as of the variables’ bootstrapped ranks. The suggested methods will be applied to both simulated and real examples, with comparisons of computational time and effectiveness against alternative selection procedures. We will report on our latest results from ongoing work in scaling up subtractive stability measures when the number of features is large.
References: Jiang, J., Nguyen, T., & Rao, J. S. (2011). Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, 4(3), 403-415.
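For intuition, a generic bootstrap variable-inclusion summary of the kind visualised in a variable inclusion plot can be sketched as follows; this uses simulated data and stepwise-AIC selection purely as a stand-in, not the subtractive stability measures of the talk.

```r
## Generic bootstrap inclusion-proportion sketch (illustrative selector and data).
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat <- cbind(y = 2 * X$x1 - 1.5 * X$x2 + rnorm(n), X)

B <- 200
kept <- replicate(B, {
  d   <- dat[sample(n, replace = TRUE), ]
  fit <- step(lm(y ~ ., data = d), trace = 0)       # refit selection on each resample
  names(X) %in% attr(terms(fit), "term.labels")     # was each variable retained?
})
rownames(kept) <- names(X)
rowMeans(kept)   # bootstrap inclusion proportion for each candidate variable
```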
Wednesday 29th 11:10 Bundeena
A Comparison Of Multiple Imputation Methods For Missing Data In Longitudinal Studies
Md Hamidul Huque1, Katherine Lee1, Julie Simpson2, and John Carlin1
1Murdoch Childrens Research Institute
2University of Melbourne
Wednesday 29th 11:30 Narrabeen
Species Distribution Modelling For Combined Data Sources
Ian Renner1 and Olivier Gimenez2
1University of Newcastle
2Centre d’Ecologie Fonctionnelle et Evolutive
Increasingly, multiple sources of species occurrence data are available for a particular species, collected through different protocols. For single-source models, a variety of methods have been developed: point process models for presence-only data, logistic regression for presence-absence data obtained through single-visit systematic surveys, and occupancy modelling for detection/non-detection data obtained through repeat-visit surveys. In situations for which multiple sources of data are available to model a species, these sources may be combined via a joint likelihood expression. Nonetheless, there are questions about how to interpret the output from such a combined model and how to diagnose potential violations of model assumptions such as the assumption of spatial independence among points.
In this presentation, I will explore questions of interpretation of the output from these combined approaches, as well as propose extensions to current practice through the introduction of a LASSO penalty, source weights to account for differing quality of data, and models which account for spatial dependence among points. This approach will be demonstrated by modelling the distribution of the Eurasian lynx in eastern France.
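A toy illustration of the joint-likelihood idea (not the authors' implementation): a shared log-linear intensity links a presence-only point process likelihood, approximated by quadrature, with a complementary log-log model for presence-absence data. All inputs are assumed objects.

```r
## Toy combined negative log-likelihood with shared coefficients `beta`.
## Assumed inputs: design matrices at presence points (X_pres), quadrature points
## (X_quad, weights w_quad), and presence-absence sites (X_pa, responses y_pa,
## site areas area_pa).
joint_nll <- function(beta, X_pres, X_quad, w_quad, X_pa, y_pa, area_pa) {
  ## presence-only: inhomogeneous Poisson point process log-likelihood,
  ## with the intensity integral approximated by the quadrature sum
  ll_po <- sum(X_pres %*% beta) - sum(w_quad * exp(X_quad %*% beta))
  ## presence-absence: probability of at least one individual per site,
  ## implied by the same intensity (complementary log-log form)
  p_pa  <- 1 - exp(-area_pa * exp(X_pa %*% beta))
  ll_pa <- sum(dbinom(y_pa, 1, p_pa, log = TRUE))
  -(ll_po + ll_pa)
}

## e.g. fit <- optim(rep(0, ncol(X_pa)), joint_nll, X_pres = X_pres, X_quad = X_quad,
##                   w_quad = w_quad, X_pa = X_pa, y_pa = y_pa, area_pa = area_pa,
##                   method = "BFGS")
```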
Wednesday 29th 11:30 Gunnamatta
A Factor Analytic Mixed Model Approach For The Analysis Of Genotype By Treatment By Environment Data
Lauren Borg, Brian Cullis, and Alison Smith
University of Wollongong
The accurate evaluation of genotype performance for a range of traits, including disease resistance, is of great importance to the productivity and sustainability of major Australian commercial crops. Typically, the data generated from crop evaluation programmes arise from a series of field trials known as multi-environment trials (METs), which investigate genotype performance over a range of environments.
In evaluation trials for disease resistance, it is not uncommon for some genotypes to be chemically treated against the afflicting disease. An important example in Australia is the assessment of genotypes for resistance to blackleg disease in canola crops, where it is common practice to treat canola seeds with a fungicide. Genotypes are grown in trials as treated, untreated, or both.
There are a number of methods for the analysis of MET data. These methods, however, do not specifically address the analysis of data with an underlying three-way structure of genotype by treatment by environment (GxTxE). Here, we propose an extension of the factor analytic mixed model approach for MET data, using the canola blackleg data as the motivating example.
Historically in the analysis of blackleg data, the factorial genotype by treatment structure of the data was not accounted for. Entries, which are the combinations of genotypes and fungicide treatments present in trials, were regarded as ‘genotypes’ and a two-way analysis of ‘genotypes’ by environments was conducted.
The analysis of our example showed that the accuracy of genotype predictions, and hence the information available to growers, was substantially improved with the use of the three-way GxTxE approach compared with the historical approach.
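For readers less familiar with the approach, one standard way of writing an order-\(k\) factor analytic model for the genotype effects in this setting is (illustrative notation only; the exact parameterisation used in the talk may differ):
\[
u_{jh} = \sum_{r=1}^{k} \lambda_{hr} f_{jr} + \delta_{jh},
\qquad
\operatorname{var}(\mathbf{u}) = \left(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}\right) \otimes \mathbf{I}_{m},
\]
where \(u_{jh}\) is the effect of genotype \(j\) in level \(h\) of the treatment-by-environment classification, \(\lambda_{hr}\) are loadings, \(f_{jr}\) are genotype scores, \(\delta_{jh}\) are lack-of-fit effects with specific variances collected in the diagonal matrix \(\boldsymbol{\Psi}\), and \(m\) is the number of genotypes.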
Wednesday 29th 11:30 Bundeena
The Impact Of Cohort Substance Use Upon Likelihood Of Transitioning Through Stages Of Alcohol And Cannabis Use And Use Disorder: Findings From The Australian National Survey On Mental Health And Well-Being
Louisa Degenhardt1, Meyer Glantz2, Chrianna Bharat1, Amy Peacock1, Luise Lago1, Nancy Sampson3, and Ronald Kessler3
1National Drug and Alcohol Research Centre
2National Institute on Drug Abuse
3Harvard University
The aims of the present study were to use population-level Australian data to estimate the prevalence and speed of transitions across stages of alcohol and cannabis use, abuse and dependence, and remission from disorder, and to consider the extent to which the level of substance use in an individual’s age and sex cohort predicted transitions into and out of substance use. Data on lifetime history of use, DSM-IV use disorders, and remission from these disorders were collected from participants (n=8,463) in the 2007 Australian National Survey of Mental Health and Wellbeing using the Composite International Diagnostic Interview.
Lifetime prevalence of alcohol use, regular use, abuse, dependence, and remission from abuse and dependence were 94.1%, 64.5%, 22.1%, 4.0%, 16.1% and 2.1%, respectively. Unconditional lifetime prevalence of cannabis use, abuse, dependence, and remission from abuse and dependence were 19.8%, 6.1%, 1.9%, 4.0% and 1.5%. Increases in the estimated proportion of people in the respondent’s sex and age cohort who used alcohol/cannabis as of a given age were significantly associated with most transitions from use through to remission beginning at the same age. Clear associations were documented between cohort-level prevalence of substance use and personal risk of subsequent transitions of individuals in the cohort from use to greater substance involvement. This relationship remained significant over and above associations involving the individual’s age of initiation. These findings have important implications for our understanding of the causal pathways into and out of problematic substance use.
Wednesday 29th 11:50 Narrabeen
The LASSO On Latent Indices For Ordinal Predictors In Regression
Francis Hui1, Samuel Mueller2, and Alan Welsh1
1ANU
2University of Sydney
Many applications of regression models involve ordinal categorical predictors. A motivating example we consider is ordinal ratings from individuals responding to questionnaires about their workplace in the Household Income and Labour Dynamics in Australia (HILDA) survey, with the aim of studying how workplace conditions (main and possible interaction effects) affect overall mental wellbeing. A common approach to handling ordinal predictors is to treat each predictor as a factor variable. This can lead to a very high-dimensional problem, and has spurred much research into penalized likelihood methods for handling categorical predictors while respecting the marginality principle. On the other hand, given that the ordinal ratings are often regarded as manifestations of latent indices concerning different aspects of job quality, a more sensible approach may be to first perform some form of dimension reduction before entering the predicted indices into a regression model. In applied research this is often performed as a two-stage procedure, which fails to utilize the response in order to better predict the latent indices themselves.
In this talk, we propose the LASSO on Latent Indices (LoLI) for handling ordinal categorical predictors in regression. The LoLI model simultaneously constructs a continuous latent index for each ordinal predictor (or group of ordinal predictors) and models the response as a function of these indices (and other predictors if appropriate), including potential interactions, with a composite LASSO-type penalty added to perform selection on main and interaction effects between the latent indices. As a single-stage approach, the LoLI model is able to borrow strength from the response to improve construction of the continuous latent indices, which in turn produces better estimation of the corresponding regression coefficients. Furthermore, because of the construction of latent indices, the dimensionality of the problem is substantially reduced before any variable selection is performed. For estimation, we propose first estimating the cutoffs relating the observed ordinal predictors to the latent indices. Then, conditional on these cutoffs, we apply a penalized Expectation Maximization algorithm via importance sampling to estimate the regression coefficients. A simulation study demonstrates the improved power of the LoLI model at detecting truly important ordinal predictors compared to both two-stage approaches and the use of factor variables, and its better predictive and estimation performance compared to the commonly used two-stage approach.
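To fix ideas, one way of writing the basic structure down is the following (illustrative notation only; the parameterisation, link functions and penalty weights in the LoLI model itself may differ). With the ordinal ratings grouped into \(G\) latent indices \(z_{i1},\dots,z_{iG}\),
\[
y_i = \beta_0 + \sum_{g=1}^{G} \beta_g z_{ig} + \sum_{g<h} \gamma_{gh}\, z_{ig} z_{ih} + \varepsilon_i ,
\]
with estimation subject to a composite penalty of the form \(\lambda\big(\sum_{g} |\beta_g| + \sum_{g<h} |\gamma_{gh}|\big)\), so that main and interaction effects of the latent indices are selected simultaneously.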
Wednesday 29th 11:50 Gunnamatta
Whole-Genome QTL Analysis For Nested Association Mapping Populations
Maria Valeria Paccapelo1, Alison Kelly1, Jack Christopher2, and Arunas Verbyla3,4
1Queensland Department of Agriculture and Fisheries
2Queensland Alliance for Agriculture and Food
3Data61
4CSIRO
Wednesday 29th 11:50 Bundeena
An Asymmetric Measure Of Population Differentiation Based On The Saddlepoint Approximation Method
Louise McMillan and Rachel Fewster
University of Auckland
Wednesday 29th 12:10 Narrabeen
Fast And Approximate Exhaustive Variable Selection For GLMs With APES
Kevin Wang, Samuel Mueller, Garth Tarr, and Jean Yang
University of Sydney
Wednesday 29th 12:10 Gunnamatta
Order Selection Of Factor Analytic Models For Genotype X Environment Interaction
Emi Tanaka1, Francis Hui2, and David Warton3
1University of Sydney
2ANU
3University of New South Wales
Wednesday 29th 12:10 Bundeena
Multiple Sample Hypothesis Testing Of The Human Microbiome Through Evolutionary Trees
Martina Mincheva1, Hongzhe Li2, and Jun Chen3
1Temple University
2University of Pennsylvania
3Mayo Clinic
Keynote: Wednesday 29th 13:40 Mantra
Statistical Strategies For The Analysis Of Large And Complex Data
Louise Ryan1,2, Stephen Wright1,3, and Hon Hwang1
1University of Technology Sydney
2Harvard T. H. Chan School of Public Health
3Australian Red Cross
Wednesday 29th 14:30 Narrabeen
Optimal Experimental Design For Functional Response Experiments
Jeff Zhang and Christopher Drovandi
Queensland University of Technology
Wednesday 29th 14:30 Gunnamatta
To PCA Or Not To PCA
Catherine M. McKenzie1, Wei Zhang1, Stuart D. Card1, Cory Matthew2, Wade J. Mace1, and Siva Ganesh1
1AgResearch
2Massey University
When there are groupings of observations present in the data, many researchers resort to using a Principal Components Analysis (PCA) a priori for identifying patterns in the data, and then look to map the patterns obtained from the PCA to differences among the groupings, or attribute biological signal to them. Is this appropriate, given that PCA is not designed to discriminate between groupings? Is a group-oriented multivariate methodology such as multivariate analysis of variance (MANOVA) or Canonical Discriminant Analysis (CDA) preferable? Which method has more relevance when investigating factor effects on biochemical pathways? We explore these questions via a biological example.
Two biological materials (B1 & B2) were analysed for the same 19 primary metabolites, under three factors: Method (M1 & M2), Treatment (T1, T2, T3 and T4), and Age (A1 & A2), with three replicates, giving a total of 48 observations for each biological material. Univariate and multivariate analyses of variance (ANOVA and MANOVA, respectively) were carried out, and revealed many statistically significant interaction effects. In addition, other multivariate techniques such as PCA and CDA were used to explore relationships between the variables. The question remains as to the appropriateness of carrying out PCA to explore biochemical pathways: should the pattern extraction be tailored a priori to match the known groupings within the data, or should one start with an unconstrained pattern analysis and seek to explain the detected patterns post hoc?
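A small sketch of the two strategies being contrasted is given below; `metab` is an assumed data frame holding the 19 metabolite variables plus a grouping factor `group` (assumed to have at least three levels so that two discriminant axes exist).

```r
## Unsupervised PCA versus a group-aware canonical discriminant analysis (LDA).
library(MASS)

vars <- setdiff(names(metab), "group")
pca  <- prcomp(metab[, vars], scale. = TRUE)   # unsupervised: ignores the groupings
cda  <- lda(group ~ ., data = metab)           # supervised: discriminates the groupings

op <- par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = as.integer(metab$group), main = "PCA scores")
plot(predict(cda)$x[, 1:2], col = as.integer(metab$group), main = "CDA scores")
par(op)
```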
Wednesday 29th 14:50 Narrabeen
An Evaluation Of Error Variance Bias In Spatial Designs
Emlyn Williams1 and Hans-Peter Piepho2
1ANU
2University of Hohenheim
Wednesday 29th 14:50 Gunnamatta
New Model-Based Ordination Data Exploration Tools For Microbiome Studies
Olivier Thas1, Stijn Hawinkel1, and Luc Bijnens2
1Ghent University
2Janssen Pharmaceutics
High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods for analysing microbiome data are still in their infancy. Data exploration often relies on classical dimension reduction methods such as Principal Coordinate Analysis (PCoA), which is essentially a Multidimensional Scaling (MDS) method starting from ecologically relevant distance measures between the vectors of relative abundances of the microorganisms (e.g. the Bray-Curtis distance).
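For reference, the classical exploration referred to above can be sketched as follows (this is the baseline, not the new method proposed in the talk); `counts` is an assumed samples-by-taxa matrix of read counts.

```r
## Classical PCoA/MDS on Bray-Curtis distances between relative abundances.
library(vegan)

rel_ab <- counts / rowSums(counts)
d_bc   <- vegdist(rel_ab, method = "bray")
pcoa   <- cmdscale(d_bc, k = 2, eig = TRUE)
plot(pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2")
```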
We will demonstrate that these classical visualisation methods fail to deal with microbiome-specific issues such as variability due to library-size differences and overdispersion. Next we propose a new technique that is based on a negative binomial regression model with log link, and which relies on the connection between correspondence analysis and the log-linear RC(M) models of Goodman (Annals of Statistics, vol. 13, 1985); see also Zhu et al. (Ecological Modelling, vol. 187, 2005). Instead of assuming a Poisson distribution for the counts, a negative binomial distribution is assumed. To better account for library-size effects, we adopt a different weighting scheme, which arises naturally from the parameterisation of the model. An iterative parameter estimation method is proposed and implemented in R. The new method is illustrated on several example datasets and empirically evaluated in a simulation study. We conclude that our method succeeds better in discovering structure in microbiome datasets than conventional methods do.
In the second part of the presentation we extend the model-based method to a constrained ordination method by using sample-specific covariate data. The method looks for a two-dimensional visualisation that optimally discriminates between species with respect to their sensitivity to environmental conditions. Again we build upon results of Zhu et al. (2005) and Zhang and Thas (Statistical Modelling, vol. 12, 2012). The method is illustrated on real data.
All methods are available as an R package.
Wednesday 29th 15:10 Narrabeen
Always Randomize?
Chris Brien1,2
1University of South Australia
2University of Adelaide
Fisher gave us three fundamental principles for designed experiments: replication, randomization and local control. Consonant with this, Brien et al. (2011) [Brien, C. J., Harch, B. D., Correll, R. L., & Bailey, R. A. (2011) Multiphase experiments with at least one later laboratory phase. I. Orthogonal designs. Journal of Agricultural, Biological, and Environmental Statistics, 16, 422-450.] exhort the use of randomization in multiphase experiments via their Principle 7 (Allocate and randomize in the laboratory). This principle is qualified with ‘wherever possible’, which leads to the question ‘when is randomization not possible?’.
Situations where randomization is not applicable will be described for both single-phase and multiphase experiments. The reasons for not randomizing include practical limitations and, for multiphase experiments, difficulty in estimating variance parameters when randomization is employed. For the latter case, simulation studies canvassing a number of potential difficulties will be described. A Nonrandomization Principle, and an accompanying analysis strategy, for multiphase experiments will be proposed.
Wednesday 29th 15:10 Gunnamatta
Bayesian Semi-Parametric Spectral Density Estimation With Applications To The Southern Oscillation Index
Claudia Kirch1, Matt Edwards2, Alexander Meier1, and Renate Meyer2
1University of Magdeburg
2University of Auckland
Standard time series modelling is dominated by parametric models such as ARMA and GARCH models. Even though nonparametric Bayesian inference has been a rapidly growing area over the last decade, only very few nonparametric Bayesian approaches to time series analysis have been developed. Most notably, Carter and Kohn (1997), Gangopadhyay (1998), Choudhuri et al. (2004), and Rosen et al. (2012) used Whittle’s likelihood for Bayesian modelling of the spectral density as the main nonparametric characteristic of stationary time series. On the other hand, frequentist time series analyses are often based on nonparametric techniques encompassing a multitude of bootstrap methods (Kreiss and Lahiri, 2011; Kirch and Politis, 2011).
As shown in Contreras-Cristan et al. (2006), the loss of efficiency of the nonparametric approach using Whittle’s likelihood approximation can be substantial. On the other hand, parametric methods are more powerful than nonparametric methods if the observed time series is close to the considered model class, but fail if the model is misspecified. Therefore, we suggest a nonparametric correction of a parametric likelihood that takes advantage of the efficiency of parametric models while mitigating sensitivity to model misspecification through a nonparametric amendment. We use a nonparametric Bernstein polynomial prior on the spectral density with weights induced by a Dirichlet process. Contiguity and posterior consistency for Gaussian stationary time series have been shown in a preprint by Kirch et al. (2017). Bayesian posterior computations are implemented via an MH-within-Gibbs sampler, and the performance of the nonparametrically corrected likelihood is illustrated in a simulation study. We use this approach to analyse the monthly time series of the Southern Oscillation Index, one of the key atmospheric indices for gauging the strength of El Niño events and their potential impacts on the Australian region.
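The Whittle likelihood that both the nonparametric and the corrected approaches build on treats the periodogram ordinates as approximately independent exponentials with mean equal to the spectral density. A minimal sketch of evaluating it in R is given below; this is not the authors' sampler, and the AR(1) example is illustrative only.

```r
## Whittle log-likelihood sketch: periodogram ordinates vs. a candidate spectral density.
whittle_loglik <- function(x, spec_fun) {
  n     <- length(x)
  pgram <- Mod(fft(x - mean(x)))^2 / (2 * pi * n)   # raw periodogram
  m     <- floor((n - 1) / 2)
  freq  <- 2 * pi * (1:m) / n                       # Fourier frequencies in (0, pi)
  I_j   <- pgram[2:(m + 1)]
  f_j   <- spec_fun(freq)
  -sum(log(f_j) + I_j / f_j)
}

## Example: evaluate at the true spectral density of a simulated AR(1) series
ar1_spec <- function(w, phi = 0.5, sigma2 = 1)
  sigma2 / (2 * pi * (1 - 2 * phi * cos(w) + phi^2))

x <- arima.sim(model = list(ar = 0.5), n = 512)
whittle_loglik(x, ar1_spec)
```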
Wednesday 29th 16:10 Narrabeen
Efficient Multivariate Sensitivity Analysis Of Agricultural Simulators
Daniel Gladish
CSIRO
Wednesday 29th 16:10 Gunnamatta
Bayesian Hypothesis Tests With Diffuse Priors: Can We Have Our Cake And Eat It Too?
John Ormerod, Michael Stewart, Weichang Yu, and Sarah Romanes
University of Sydney