Programme And Abstracts For Monday 27th Of November

Keynote: Monday 27th 9:40 Mantra

Agricultural And Agri-Environmental Statistics With Support Of Geospatial Information: Methodological Issues

Elisabetta Carfagna
University of Bologna

Agri-environmental trade-offs are critical issues for policy makers charged with managing both food supply and the sustainable use of land. Reliable data are crucial for developing effective policies and for evaluating their impact. However, the reliability of agricultural and agro-environmental statistics is often low.

Owing to technological development over recent decades, different kinds of geospatial data have become easily accessible at decreasing prices and have become an important support to the statistics production process.

In this paper, we focus on methodological issues related to the use of geospatial information for sampling frame construction, sample design, stratification, ground data collection and estimation of agricultural and agri-environmental parameters. Particular attention is devoted to the impact of the spatial resolution of data, change of support, and aggregation and disaggregation of spatial data when remote sensing data, Global Positioning Systems and Geographic Information Systems (GIS) are used for producing agricultural and agro-environmental statistics.

Monday 27th 11:00 Narrabeen

Developing A Regulatory Definition For The Authentication Of Manuka Honey

Claire McDonald1, Suzanne Keeling1, Mark Brewer2, and Steve Hathaway1
1Ministry for Primary Industries
2BioSS

Manuka honey is a premium export product from New Zealand that has been under scrutiny due to claims of fraud, adulteration and mislabelling. Although there are several industry approaches for defining manuka honey, there is currently no scientifically robust definition suitable for use in a regulatory setting. As such, ensuring the authenticity of manuka honey is challenging.

Here we present the results of a three year science programme which developed scientifically robust definitions for monofloral and multifloral manuka honey produced in New Zealand. The programme involved: selecting appropriate markers to identify honey sourced from Leptospermum scoparium (manuka), establishing plant and honey reference collections, developing test methods to determine the levels of the markers and analysing the data generated to develop the definitions.

The suitability of 16 markers (chemical and DNA-based) was evaluated for use in a regulatory definition for manuka honey. Plant samples were collected over two flowering seasons, representing both manuka and non-manuka species from both New Zealand and Australia. Honey samples, also representing manuka and non-manuka floral types, were sourced from seven New Zealand production seasons. Additionally, honey samples were sourced from another 12 countries to enable comparison. All samples were tested for the markers being evaluated using the developed test methods.

The method of CART (Classification and Regression Trees) was used to develop the monofloral and multifloral manuka honey definitions. The CART outputs were further processed using a simulation approach to determine the sensitivity and the robustness of the definitions. The definitions use a combination of 5 markers (4 chemical and 1 DNA) at set thresholds to classify a sample as manuka honey or otherwise. We discuss the practicalities of using the science-based definitions within a regulatory context.
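
As an illustration of how such a CART-based definition operates, the sketch below fits a classification tree in R with the rpart package. The data frame honey, its floral_type label and the marker_1 to marker_5 columns are hypothetical placeholders, not the programme's actual variables or thresholds.

```r
# Illustrative sketch only: a classification tree for honey floral type from
# marker measurements. All variable names are hypothetical placeholders.
library(rpart)

fit <- rpart(floral_type ~ marker_1 + marker_2 + marker_3 + marker_4 + marker_5,
             data = honey, method = "class")

print(fit)                                        # splits = thresholds on markers
pred <- predict(fit, newdata = honey, type = "class")
table(observed = honey$floral_type, predicted = pred)
```

The fitted splits play the role of the set thresholds mentioned above; in practice the tree would be assessed on independent samples (or by simulation, as the abstract describes) before being used as a definition.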

Monday 27th 11:00 Gunnamatta

On Testing Random Effects In Linear Mixed Models

Alan Welsh1, Francis Hui1, and Samuel Mueller2
1ANU
2University of Sydney

We can approach problems involving random effects in linear mixed models directly through the random effects or through parameters such as the variance components that describe the distributions of the random effects. Both approaches are useful but lead to different issues. For example, working with the random effects raises questions of how to estimate them and can mean dealing with a large number of random effects, while working with variance components leads to testing hypotheses on the boundary of the parameter space. In both cases, finding good approximations to the null distribution of the test statistic can be challenging, so modern approaches often rely on simulation. In this talk, we re-examine the F-test, based on linear combinations of the responses, for testing random effects in linear mixed models. We present a general derivation of the test; highlight its computational speed, its generality and its exactness as a test; and report empirical studies of the finite sample performance of the test. We conclude by reporting our latest results from ongoing research into connections between testing and model selection, discussing some tests of significance of random effects and exploring their relationship to model selection procedures.
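
The F-test itself is the authors' contribution and is not reproduced here. As a point of comparison, the sketch below shows the kind of simulation-based (parametric bootstrap) likelihood ratio test that is often used when the variance component under test lies on the boundary of the parameter space; the data frame dat with columns y, x and group is a hypothetical placeholder.

```r
# Minimal sketch (not the authors' F-test): parametric-bootstrap LRT of
# H0: no random intercept, using lme4. 'dat' is a hypothetical data frame.
library(lme4)

fit0 <- lm(y ~ x, data = dat)                                  # null: no random effect
fit1 <- lmer(y ~ x + (1 | group), data = dat, REML = FALSE)    # alternative
obs  <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))          # observed LRT statistic

lrt_sim <- replicate(999, {
  ysim <- simulate(fit0)[[1]]                                  # simulate under the null
  f0 <- lm(ysim ~ x, data = dat)
  f1 <- lmer(ysim ~ x + (1 | group), data = dat, REML = FALSE)
  as.numeric(2 * (logLik(f1) - logLik(f0)))
})
mean(lrt_sim >= obs)                                           # simulation-based p-value
```

With REML turned off the log-likelihoods of the null and alternative fits are comparable, and the simulated reference distribution avoids relying on the usual chi-square mixture approximation at the boundary.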

Monday 27th 11:20 Narrabeen

Analysing Digestion Data

Maryann Staincliffe, Debbie Frost, Mustafa Farouk, and Guojie Wu
AgResearch

Food technologists are interested in understanding whether adding grain and/or vegetables to beef increases its digestibility over meat alone. Digestibility of meat was measured at up to 4 hours using a pepsin and pancreatin in vitro model. This method uses gels in lanes, where the density of the colour of the gel indicates the presence of a protein or peptide at that level of kilodaltons (kDa). Typically, bands above 12 kDa are considered to be proteins and those below are peptides. In the past we have analysed these data using two approaches. The first approach is to select 4 or 5 bands that are expected to be important and then use a mixed effects model to compare the mean Trace Quantity of each of the bands, where Meat type, Additive type (vegetable and/or grain) and Time point are fixed factors and the gel is a random effect. The second approach is to fit a curve to the change in the proportion of Trace Quantity above 12 kDa from Time zero. The problem with these two approaches is that we struggle to provide a consistent interpretation of the results. Therefore, we will explore alternative methods for analysing this type of data.
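
A minimal lme4 sketch of the first approach described above, fitting the mixed model to the trace quantity of one selected band; the data frame bands, its column names and the band label are hypothetical placeholders.

```r
# Sketch of the first approach: mixed model for one selected band.
# 'bands' and its columns (and the "45kDa" band label) are hypothetical.
library(lme4)

fit <- lmer(TraceQuantity ~ MeatType * Additive * Time + (1 | Gel),
            data = subset(bands, Band == "45kDa"))
anova(fit)   # sequential F statistics for the fixed effects (no p-values by default)
```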

Monday 27th 11:20 Gunnamatta

A Permutation Test For Comparing Predictive Values In Clinical Trials

Kouji Yamamoto and Kanae Takahashi
Osaka City University

Screening tests and diagnostic tests are important for early detection and treatment of disease. There are four well-known measures in diagnostic studies: sensitivity (SE), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV). For comparing SEs/SPs, McNemar's test is widely used, but there are only a few methods for the comparison of PPVs/NPVs. Moreover, all of these methods are based on large-sample theory.

In this talk, we first investigate the performance of those methods when the sample size is small. In addition, we propose a permutation test for comparing two PPVs/NPVs that can be applied even when the sample size is small. Finally, we compare the performance of the proposed method with some existing methods via simulation studies.
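
The abstract does not spell out the authors' permutation scheme; the sketch below shows one plausible paired permutation test for a difference in PPVs, in which the two tests' results are randomly swapped within each subject under the null hypothesis that the tests are exchangeable.

```r
# Hedged sketch of a paired permutation test for PPV_A - PPV_B (not
# necessarily the authors' exact scheme). testA, testB are 0/1 test results
# for the same subjects; disease is the 0/1 reference standard.
ppv <- function(test, disease) sum(disease == 1 & test == 1) / sum(test == 1)

perm_test_ppv <- function(testA, testB, disease, B = 2000) {
  obs <- ppv(testA, disease) - ppv(testB, disease)
  n <- length(disease)
  perm <- replicate(B, {
    swap <- runif(n) < 0.5              # swap the two tests within a subject
    a <- ifelse(swap, testB, testA)
    b <- ifelse(swap, testA, testB)
    ppv(a, disease) - ppv(b, disease)
  })
  mean(abs(perm) >= abs(obs))           # two-sided permutation p-value
}
```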

Monday 27th 11:40 Narrabeen

Challenges And Opportunities Working As A Consulting Statistician With A Food Science Research Group

M. Gabriela Borgognone
Queensland Department of Agriculture and Fisheries

When an established research group has been functioning for many years without a statistician as an integral part of the team, welcoming one into the group can present challenges as well as opportunities for all involved.

Challenges for the research group include, for example, involving the statistician at the beginning of the study instead of once the experiments have been completed and the data collected; acquiring or increasing knowledge of experimental design principles; understanding the limitations of some statistical analyses, expanding the range of methods they feel familiar with, and learning when/how to apply each one; and improving the presentation of results in this era where poor presentation is perpetuated by the general lack of sound statistical methods in the literature of the research area. Challenges for the statistician include, for example, overcoming his/her lack of general knowledge of the underlying scientific area and its specific vocabulary; determining what experimental designs would work from a practical point of view; developing understanding of their scientific questions, data management practices, and types of data collected; navigating the various software they use and checking their adequacies and limitations; and, above all, communicating with patience and perseverance.

Correspondingly, all challenges present opportunities for improvement and collaboration between scientists and statisticians. Working as a team supports a decision-making process that is relevant to industry and that is based on good statistical practices. Additionally, it helps scientists become more statistically aware and empowered. A little more than a year ago I started working as a consulting statistician with a food science research group. In this presentation I will share some of the challenges, and the opportunities to incorporate good statistical practice, that I have identified, as well as some of the improvements we have made so far working together in this partnership.

Monday 27th 11:40 Gunnamatta

Robust Semiparametric Inference In Random Effects Models

Michael Stewart1 and Alan Welsh2
1University of Sydney
2ANU

We report on recent work using semiparametric theory to derive procedures with desirable robustness and efficiency properties in the context of inference concerning scale parameters for random effect models.

Monday 27th 12:00 Gunnamatta

Robust Penalized Logistic Regression Through Maximum Trimmed Likelihood Estimator

Hongwei Sun, Yuehua Cui, and Tong Wang
Shanxi Medical University

Penalized logistic regression is used to identify genetic markers in many high-dimensional datasets, such as gene expression, GWAS and DNA methylation studies. However, outliers sometimes occur due to missed diagnosis or misdiagnosis of subjects, heterogeneity of samples, technical problems in experiments or other problems, and they can greatly influence the estimation of penalized logistic regression. Few studies focus on the robustness of penalized methods when the response variable is categorical, which is standard in medical research. This study proposed a robust LASSO-type penalized logistic regression based on the maximum trimmed likelihood (MTL-LASSO). The definition of the breakdown point (BDP) for penalized logistic regression was given and its property for the proposed method was proved. A modification of the FAST-LTS algorithm was used to implement the estimation, and a reweighting step was added to improve performance while guaranteeing robustness. The simulation study shows that the proposed method can resist outliers. A real dataset of gene expression profiles of multiple sclerosis patients and healthy controls was analysed. Outliers in the control group identified by the reweighted MTL-LASSO behave differently from the other samples, suggesting there may be a heterogeneity problem in the control group. A much better fit is obtained after removing the outliers.
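
The sketch below illustrates the trimming idea in the spirit of FAST-LTS concentration steps, alternating between fitting an L1-penalised logistic model (via glmnet) on the current subset and retaining the h observations with the largest log-likelihood contributions. It is a rough illustration under our own assumptions, not the authors' implementation, and it omits the reweighting step and the breakdown-point results.

```r
# Rough sketch of the trimming idea behind a trimmed-likelihood LASSO logistic
# regression (not the authors' code). x is a numeric matrix, y a 0/1 vector,
# h the subset size to retain, lambda the L1 penalty.
library(glmnet)

trimmed_lasso_logistic <- function(x, y, h, lambda, n_iter = 20) {
  idx <- sample(nrow(x), h)                       # crude initial subset
  fit <- NULL
  for (i in seq_len(n_iter)) {
    fit <- glmnet(x[idx, ], y[idx], family = "binomial", lambda = lambda)
    p   <- as.vector(predict(fit, newx = x, type = "response"))
    ll  <- y * log(p) + (1 - y) * log(1 - p)      # per-observation log-likelihood
    new_idx <- order(ll, decreasing = TRUE)[1:h]  # keep the h best-fitting cases
    if (setequal(new_idx, idx)) break             # converged: subset unchanged
    idx <- new_idx
  }
  list(fit = fit, subset = sort(idx))
}
```

Observations left outside the final subset are the candidates flagged as outliers, analogous to the control-group samples discussed above.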

Keynote: Monday 27th 13:30 Mantra

A Multi-Step Classifier Addressing Cohort Heterogeneity Improves Performance Of Prognostic Biomarkers In Complex Disease

Jean Yang
University of Sydney

Recent studies in cancer and other complex diseases continue to highlight the extensive genetic diversity between and within cohorts. This intrinsic heterogeneity poses one of the central challenges to predicting patient clinical outcome and the personalization of treatments. Here, we will discuss the concept of classifiability observed in multi-omics studies where individual patients’ samples may be considered as either hard or easy to classify by different platforms, reflected in moderate error rates with large ranges. We demonstrate in a cohort of 45 AJCC stage III melanoma patients that clinico-pathologic biomarkers can identify those patients that are most likely to be misclassified by a molecular biomarker. The process of modelling the classifiability of patients was then replicated in independent data from other diseases.

A multi-step procedure incorporating this information not only improved classification accuracy overall but also indicated the specific clinical attributes that had made classification problematic in each cohort. In statistical terms, our strategy models cohort heterogeneity via the identification of interaction effects in a high dimensional setting. At the translational level, these findings show that even when cohorts are of moderate size, including features that explain the patient-specific performance of a prognostic biomarker in a classification framework can significantly improve the modelling and estimation of survival, as well as increase understanding.

Monday 27th 14:20 Narrabeen

How To Analyse Five Data Points With Fun

Pauline O’Shaughnessy1, Stephen Robson2, and Louise Rawlings2
1University of Wollongong
2ANU

While “big data” is one of the biggest buzzwords, we occasionally come across data with very few data points. This dataset comes from a study of the incidence rates of commonly performed medical procedures in Australia, where data are available only at the state level. The data cover five Australian states, giving just five data points. So what can we do with only five data points? One approach is to fit a regression model to the data. However, given the small sample size, there is no guarantee that the linearity and homoscedasticity assumptions of linear regression are satisfied, and in turn inference from standard linear model theory is no longer valid. A double bootstrap is used to provide valid statistical inference for the best linear approximation to the relationship between the variables in an assumption-lean regression setting.
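
The abstract does not spell out the exact double bootstrap used; the sketch below shows one common form, a bootstrap-t interval for the slope in which the standard error for each outer resample is itself estimated by an inner bootstrap. The vectors x and y stand for the five observed data points.

```r
# Hedged sketch of a double (nested) bootstrap for a regression slope:
# outer resamples give the bootstrap-t statistics, inner resamples estimate
# each outer resample's standard error. Not necessarily the authors' scheme.
slope <- function(x, y) coef(lm(y ~ x))[2]

boot_se <- function(x, y, B) {
  sd(replicate(B, {
    i <- sample(length(x), replace = TRUE)
    slope(x[i], y[i])
  }), na.rm = TRUE)
}

double_boot_ci <- function(x, y, B_outer = 2000, B_inner = 200, level = 0.95) {
  est <- slope(x, y)
  se  <- boot_se(x, y, B_outer)
  tstar <- replicate(B_outer, {
    i <- sample(length(x), replace = TRUE)
    (slope(x[i], y[i]) - est) / boot_se(x[i], y[i], B_inner)   # inner bootstrap
  })
  q <- unname(quantile(tstar, c((1 - level) / 2, (1 + level) / 2), na.rm = TRUE))
  c(lower = est - q[2] * se, upper = est - q[1] * se)
}
```

With only five points many resamples are degenerate (all identical x values), which is part of why careful calibration, rather than standard normal-theory intervals, is needed here.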

Monday 27th 14:20 Gunnamatta

An Approach To Poisson Mixed Models For -Omics Expression Data

Irene Suilan Zeng and Thomas Lumley
University of Auckland

We are interested in regression models for multivariate data from high-throughput biological assays (‘omic’ data). These data have correlations between variables, and may also come from structured experiments, so a generalised linear mixed model is appropriate to fit the experimental variables and different types of omics data. However, the number of variables is often larger than the number of observations: a structured covariance model is necessary and sparsity induction is biologically appropriate. In this presentation we describe an approach to Poisson mixed models, suitable for RNAseq gene expression data, based on transcript-specific random effects with a sparse precision matrix. We show by simulations that the optimal sparseness penalty for regression modelling is not the same as in the usual graph estimation problem and compare some estimation strategies in simulations.
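
One plausible formalisation of the model described above (our notation, not necessarily the authors' exact specification): for transcript g = 1, ..., G and sample i,

\[
y_{gi} \mid b_{gi} \sim \mathrm{Poisson}(\mu_{gi}), \qquad
\log \mu_{gi} = o_i + x_i^{\top}\beta_g + b_{gi}, \qquad
(b_{1i},\dots,b_{Gi})^{\top} \sim N_G\!\big(0,\ \Omega^{-1}\big),
\]

where \(o_i\) is an offset (e.g. library size), \(\beta_g\) are transcript-specific coefficients for the experimental variables, and the transcript-by-transcript precision matrix \(\Omega\) is estimated under a sparsity penalty such as \(\lambda \sum_{g \neq g'} |\Omega_{gg'}|\). The point made in the abstract is that the optimal \(\lambda\) for this regression problem differs from the optimal penalty in the usual graph estimation problem.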

Monday 27th 14:40 Narrabeen

Bayesian Spatial Estimation When Areas Are Few

Aswi Aswi, Susanna Cramb, Earl Duncan, and Kerrie Mengersen
Queensland University of Technology

Spatial modelling when there are few (< 20) small areas can be challenging. Bayesian methods can be beneficial in this situation due to the ease of specifying structure and additional information through priors. However, care is needed as there are often fewer neighbours and more edges, which may influence results. Here we investigate Bayesian spatial model specification when there are few areas, first through a simulation study (with the number of areas ranging from 4 to 2500) and then through a case study of dengue fever in 2015 in Makassar, Indonesia (14 areas). Four Bayesian spatial models were applied: an independent model and three models based on a CAR (conditional autoregressive) prior, namely the Besag, York and Mollié model, the Leroux model, and a localised model (which augments the CAR prior with a cluster model using piecewise constant intercepts). Data for the simulation study were generated under low and high spatial autocorrelation and low and high disease incidence. Model goodness of fit was compared using the Deviance Information Criterion, and analysis of variance with Bonferroni's method was used to determine which models were significantly different. The simulation study showed that the models differed in their performance mainly in two situations: (1) when there were at least 25 areas and both the disease rate and spatial autocorrelation were low, and (2) for all numbers of areas when spatial autocorrelation was low but the overall disease rate was high. In the case study, all four models performed similarly, probably due to the low number of areas and the low disease incidence.
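
For reference, the standard textbook forms of two of the CAR-based priors named above (our notation, not necessarily the authors' exact parameterisation): the Besag, York and Mollié model writes the log relative risk for area i as

\[
\log \theta_i = \beta_0 + u_i + v_i, \qquad
u_i \mid u_{-i} \sim N\!\left(\frac{\sum_j w_{ij} u_j}{\sum_j w_{ij}},\ \frac{\sigma_u^2}{\sum_j w_{ij}}\right), \qquad
v_i \sim N(0, \sigma_v^2),
\]

while the Leroux prior replaces \(u_i + v_i\) with a single random effect whose full conditional is

\[
b_i \mid b_{-i} \sim N\!\left(\frac{\rho \sum_j w_{ij} b_j}{\rho \sum_j w_{ij} + 1 - \rho},\ \frac{\sigma^2}{\rho \sum_j w_{ij} + 1 - \rho}\right),
\]

where \(w_{ij}\) indicates neighbouring areas and \(\rho \in [0, 1]\) controls the strength of spatial dependence. With very few areas each \(\sum_j w_{ij}\) is small, which is one reason the choice among these priors can matter.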

Monday 27th 14:40 Gunnamatta

Knowledge-Guided Generalized Biclustering Analysis For Integrative –Omics Analysis

Changgee Chang1, Yize Zhao2, Mingyao Li1, and Qi Long1
1University of Pennsylvania
2Cornell University

Advances in technology have enabled generation of multiple types of -omics data in many biomedical and clinical studies, and it is desirable to pool such data in order to improve the power of identifying important molecular signatures and patterns. However, such integrative analyses present new analytical and computational challenges. To address some of these challenges, we propose a Bayesian sparse generalized biclustering analysis (GBC) which enables integrating multiple omics modalities with incorporation of biological knowledge through the use of adaptive structured shrinkage priors. The proposed methods can accommodate both continuous and discrete data. MCMC and EM algorithms are developed for estimation. Numerical studies are conducted to demonstrate that our methods achieve improved feature selection and prediction in identifying disease subtypes and latent drivers, compared to existing methods.

Monday 27th 15:00 Narrabeen

Understanding The Variation In Harvester Yield Map Data For Estimating Crop Traits

Dean Diepeveen1, Karyn Reeves1, Adrian Baddeley1, and Fiona Evans2
1Curtin University
2Murdoch University

Our recent exploratory research involves extracting tangible crop traits from images such as yield maps. Research by Diepeveen et al (2012) demonstrated that genetic information can be extracted from near-infrared (NIR) images using the implicit knowledge in the data, environmental data and a multivariate approach. Yield map data are geo-referenced grain-yield data generated by a harvester cutting the crop, threshing the straw and extracting the grain. Our results show that yield-map data have significant issues associated with them. One issue is the delay from the time the crop is cut to the time the grain enters the storage bin for measurement; this depends on the speed of the harvester and is compounded by the need to maintain a critical volume of material moving through the harvester for it to operate efficiently. There is also variation arising from the density and plant size of the crop within the paddock being harvested. Our preliminary results highlight the significant challenges in extracting precise crop traits from yield map data.

Monday 27th 15:00 Gunnamatta

Citizen Science To Surveillance: Estimating Reporting Probabilities Of Exotic Insect Pests

Peter Caley1, Marijke Welvaert2, and Simon Barry1
1CSIRO
2University of Canberra

Up until mid-2016, citizen science uploads to the Atlas of Living Australia included c. 400 bug species, and c. 1,000 beetle species. Given the short time period (c. 3 years) over which most of these records have accumulated, this represents a considerable reporting effort. The key applied question from a biosecurity context is how this level of reporting translates to the detection and reporting of exotic insect pests in the event of an incursion.

We use a case-control design to model the probability of existing insect species being reported via citizen science channels feeding into the Atlas of Living Australia. Insect features (size, colour, pattern, morphology) and geographic distribution are explored as explanatory variables for reporting rates. We then apply the model to exotic high-priority pest species to predict their reporting rates in the event of their introduction.
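
A hedged sketch of the kind of model this describes: a logistic regression of reporting status on insect features, fitted to case-control data and then applied to exotic species. The data frames insects and exotic_pests and their column names are hypothetical placeholders, not the authors' actual variables.

```r
# Illustrative sketch only: case-control logistic regression of reporting
# status on insect features. All object and column names are hypothetical.
fit <- glm(reported ~ body_size + colour + pattern + morphology + range_size,
           family = binomial, data = insects)
summary(fit)

# Predicted reporting probabilities for (hypothetical) exotic pest profiles
predict(fit, newdata = exotic_pests, type = "response")
```

Under a case-control design the slope coefficients are estimable while the intercept reflects the sampling fractions, so predicted probabilities would need the usual case-control adjustment before being interpreted as absolute reporting rates.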

Monday 27th 15:50 Narrabeen

Introduction To “Deltagen” - A Comprehensive Decision Support Tool For Plant Breeders Using R And Shiny

Dongwen Luo and Zulfi Jahufer
AgResearch

The objective of this presentation is to introduce a unique new plant breeding decision support software tool, “DeltaGen”, implemented in R and its package Shiny. DeltaGen provides plant breeders with a single integrated solution for experimental design generation, data quality control, statistical and quantitative genetic analyses, breeding strategy evaluation/simulation and cost analysis, pattern analysis, index selection and the underlying basic theory of quantitative genetics. This software tool could also be used as a teaching resource in plant breeding courses. DeltaGen is available as freeware at: http://agrubuntu.cloudapp.net/shiny-apps/PlantBreedingTool/

Monday 27th 15:50 Gunnamatta

Estimating Nitrous Oxide Emission Factors

Alasdair Noble and Tony Van Der Weerden
AgResearch

Nitrous oxide (N2O) is an important greenhouse gas with a global warming potential nearly 300 times that of carbon dioxide. Under the Kyoto Protocol, New Zealand is required to report a greenhouse gas inventory annually, which includes N2O. In New Zealand, 95% of N2O emissions are derived from nitrogen (N) inputs to agricultural soils (e.g. animal excreta and fertiliser). Field experiments are conducted to estimate these N2O emissions, and the data are collated and analysed following a standard methodology to determine emission factors, which estimate the amount of N2O lost per unit of N applied to soil. However, for individual datasets some aspects of the data are incompatible with the proposed model, so ad hoc adjustments are made. A more rigorous Bayesian approach is proposed and some results will be discussed.
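
For reference, emission factors of the kind described above are typically calculated as the proportion of applied N lost as N2O (our paraphrase of the standard approach, not necessarily the exact inventory formula):

\[
\mathrm{EF} \;=\; \frac{\text{N}_2\text{O-N}_{\text{treatment}} - \text{N}_2\text{O-N}_{\text{control}}}{\text{N applied}} \times 100\%,
\]

where the numerator is the cumulative N2O-N emitted over the measurement period from the treated plots minus that from untreated (control) plots.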