This work by Peter Sefton, Steve Cassidy, Dominique Estival, Jared Berghold & Denis Burnham is licensed under a Creative Commons Attribution 4.0 International License.
This presentation about the HCS vLab was delivered by Peter Sefton at Digital Humanities Australasia 2014 in Perth.
The HCS vLab is a UWS-led project funded by the National eResearch Collaboration Tools and Resources project (NeCTAR), an Australian Government Super Science project; it brings together almost 50 active researchers from 16 institutions and is one of 10 such Virtual Labs across Australia. The project builds upon two previous UWS-administered projects, the 2000-member Human Communication Science Network (HCSNet, ARC RN0460284) and the 30-investigator, 12-institution Big Australian Speech Corpus (ARC LIEF LE100100211), as well as the ANDS-funded Australian National Corpus project led by Griffith University.
The main purpose of the HCS vLab is to provide an environment that will foster inter-disciplinary research in Human Communication Science (HCS). While HCS is a broad field which encompasses speech science, speech technology, computer science, language technology, behavioural science, linguistics, music science, phonetics, phonology, sonics and acoustics, research is often conducted in isolation within each discipline. Too often the data sets used in research are difficult to share between researchers, and even more so between disciplines; tools are rarely shared across disciplines. HCS research in Australia, and the development of successful real-life applications, demands a new model of research, beyond that of the isolated desk/lab/university-bound research environment. The HCS vLab environment aims to eliminate the waste involved in repeated unshared analyses, provide the impetus for new collaborations, encourage new tool-data combinations, and improve scientific replicability by moving data, tools and the analyses conducted with them into an easily accessible, shared environment.
Architecturally, the HCS vLab comprises a repository for heterogeneous data under a standardised metadata framework based on RDF, providing discovery services that allow researchers to create data sets that can be fed to a wide variety of research tools via a rich Application Programming Interface (API). Another major component is a workflow engine which allows data to be fed through a series of processing steps that can be stored and re-used. The HCS vLab will also orchestrate the creation of virtual environments, including virtual servers pre-loaded with a set of tools and data, as well as virtual High Performance Computing clusters.
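To make the API concrete, here is a minimal sketch, in Python with the requests library, of what querying such a repository might look like. The base URL, endpoint path, query syntax and the X-API-KEY header are assumptions for illustration, not a definitive description of the vLab's interface:

```python
import requests

# Illustrative values only; the real base URL, endpoint and key scheme
# are documented for users of the HCS vLab.
BASE_URL = "https://app.alveo.edu.au"
API_KEY = "your-private-api-key"

headers = {"X-API-KEY": API_KEY, "Accept": "application/json"}

# Ask the discovery service for items whose RDF metadata matches a query.
response = requests.get(
    f"{BASE_URL}/catalog/search",
    params={"metadata": "discourse_type:interview"},
    headers=headers,
)
response.raise_for_status()
for item_url in response.json().get("items", []):
    print(item_url)
```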
This presentation will first cover the architecture of the HCS vLab and give examples of its use across different kinds of data (text, audio, video) with a variety of tools. These include both ‘point and click’ pre-configured tools and a range of full programming environments in which data can be automatically marshalled for further processing. Examples include the Python-based Natural Language Toolkit (NLTK) for text processing and EMU-R on the R-stats platform for speech processing and analysis. We will present in more detail the variety of corpora that have been made accessible and discuss the tools that are available for analysing these data sets, emphasising the novel use of some of these. The presentation will then report on experiences with new kinds of interdisciplinary research and demonstrate some research scenarios.
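As a taste of the text-processing side, here is a small NLTK sketch of the kind of analysis a researcher might run in the Python environment; the sample string is a stand-in for text that the lab would marshal from an item list, not output from the vLab itself:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokeniser models, needed once

# Stand-in for a document retrieved from a vLab item list.
text = "The HCS vLab makes shared corpora available to shared tools."

# Tokenise and count word frequencies, ignoring punctuation.
tokens = nltk.word_tokenize(text)
freq = nltk.FreqDist(w.lower() for w in tokens if w.isalpha())
print(freq.most_common(5))
```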
We will also discuss the potential for this approach and architecture to be adopted more generally in the digital humanities world, showing how new data and tools can be imported into the Virtual Lab environment, and how the tools can be used on data anywhere.
Biographies
Dominique Estival has a PhD in Linguistics and extensive experience in academic research and commercial project management for Language Processing in the USA, Europe and Australia. Her roles have included NLP Team Leader for R&D at Syrinx Speech Systems, a Sydney speech recognition company developing automated telephone dialogue systems; Senior Research Scientist for natural language technologies, human-computer interfaces and multi-lingual processing with the DSTO (Defence Science & Technology Organisation); and Senior Manager for language processing research for US-government-funded and commercial projects at Appen P/L, a company providing speech and language databases for language applications. At UWS, Estival is the Project Manager of the Big ASC Project, establishing the audio-visual AusTalk corpus of Australian English, and of the HCS vLab. She is a founding member of the Australasian Language Technology Association (ALTA) and in 2008 established the Australian Computational and Linguistics Olympiad (OzCLO).
Steve Cassidy is a Computer Scientist whose research covers the creation, management and exploitation of language resources. He is the main author of the Emu speech database system which is widely used in the creation and analysis of spoken language data for acoustic phonetics research. He has been involved in the standardisation of tools and formats for the exchange of language resources starting with his work on Emu and more recently as an invited expert on the ISO TC 37 working groups on annotation interchange formats and query languages for linguistic data. Cassidy is the Product Owner for the HCS vLab, acting as a conduit between the development team and prospective users around Australia as well as ensuring interoperability with related international efforts.
Peter Sefton is the Manager for eResearch at UWS. Before that he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at USQ. Following a PhD in computational linguistics, he has gained extensive experience in the higher education sector in leading the development of IT and business systems to support both learning and research. At USQ, Sefton was involved in the development of institutional repository infrastructure in Australia via the federally funded RUBRIC project and was a senior advisor to the CAIRSS repository support service from 2009 to 2011. He oversaw the creation of one of the core pieces of research data management infrastructure to be funded by the Australian National Data Service, consulting widely with libraries, IT, research offices and eResearch departments at a variety of institutions. The resulting Open Source research data catalogue application, ReDBOX, is now widely deployed at Australian universities. At UWS, Peter leads a team working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, and collaborates widely with research communities on specific research challenges. His research interests include repositories, digital libraries, and the use of the Web in scholarly communication.
Denis Burnham is the inaugural Director of the MARCS Institute, UWS (1999-present) and President of the Australasian Speech Science and Technology Association (ASSTA, 2002-present). He conducts research in speech perception (auditory-visual, cross-language, infants, children, adults), special speech registers (to infants, pets, computers, foreigners), language development and literacy, human-machine interaction, and corpus management; has been continuously funded by the Australian Research Council since 1986; and has run various large projects, most recently this HCS vLab, the Big Australian Speech Corpus (Big ASC), the Human Communication Science research network, the Thinking Head Project, and the Seeds of Literacy Dyslexia project.
Jared Berghold has a research and development background in computer visualisation, interactivity and enterprise architecture. He has practised software engineering for over eight years; prior to joining Intersect he worked at iCinema, a research centre at UNSW, on interdisciplinary projects with a focus on interactive and immersive narrative systems, and has also worked at Avolution and at CiSRA, the Australian research and development lab for Canon.
Jared has a BE/BA (Hons) in Computer Systems and International Studies from the University of Technology, Sydney.
The HCS vLab is funded by NeCTAR, a body set up by the Australian Government as part of the Super Science initiative and financed by the Education Investment Fund.
All the current collections in the lab require researchers to agree to a license, usually via a web click, though some require an offline contract. This is not Open Data in the usual sense, but given the terms under which most of the collections were collected, this is the best we can do to make the data available as broadly as possible.
Item lists are stable, reusable and, in future, will be citable; they are NOT like saved searches. This contrasts with a saved search in something like the National Library’s Trove service, where the same search or API call might yield a different set of items on different days.
Saved sets of items are known as item lists, and they are the key to re-doable research workflows because they allow the same stable data set to be run through multiple processes. Item lists are available via the web interface and via the API.
The API respects the access control of data collections in the lab, via a per-user private API key.
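As an illustration, here is a hedged Python sketch of fetching an item list with a per-user key; the endpoint path and JSON field names are assumptions for the example, not the lab's confirmed schema:

```python
import requests

# Per-user key; determines which collections you may access.
API_KEY = "your-private-api-key"
headers = {"X-API-KEY": API_KEY, "Accept": "application/json"}

# Fetch a saved item list by id (path and field names are illustrative).
item_list = requests.get(
    "https://app.alveo.edu.au/item_lists/42", headers=headers
).json()

# Each entry is a stable item URL; retrieve each item's metadata in turn.
for item_url in item_list.get("items", []):
    item = requests.get(item_url, headers=headers).json()
    print(item.get("metadata", {}).get("dc:title"))
```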
The ultimate aim of the lab is to make sure that you can do everything via the API.
A similar approach will be used for audio and video data, allowing researchers to create item lists, set some parameters and run analytical processes, resulting in new data sets or graphical plots.
This workflow uses the PsySound3 software toolkit, developed by Densil Cabrera, Emery Schubert and others to analyse sound recordings using physical and psychoacoustical algorithms. Given any audio recording as input, the workflow performs an FFT (Fast Fourier Transform), a Hilbert transform and a Sound Level Meter analysis on the audio file and plots a graph for each one.
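PsySound3 itself is a separate toolkit, but the chain of analyses the workflow performs can be sketched with numpy and scipy. This is a minimal stand-in, not the PsySound3 implementation, and "speech.wav" is a placeholder for an audio item fetched from the lab:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert

# "speech.wav" is a placeholder for an audio item fetched from the lab.
rate, samples = wavfile.read("speech.wav")
samples = samples.astype(np.float64)
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # mix stereo down to mono

# FFT: magnitude spectrum of the whole recording.
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)

# Hilbert transform: the analytic signal yields the amplitude envelope.
envelope = np.abs(hilbert(samples))

# A crude sound-level reading: RMS level in dB relative to full scale.
rms = np.sqrt(np.mean(samples ** 2))
level_db = 20 * np.log10(rms / np.max(np.abs(samples)))

print(f"Peak frequency: {freqs[np.argmax(spectrum)]:.1f} Hz")
print(f"Mean envelope amplitude: {envelope.mean():.3f}")
print(f"Level: {level_db:.1f} dBFS")
```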
Hydra, a Ruby on Rails framework wrapping:
* the DuraSpace Fedora Commons repository
* Apache Solr, via the Blacklight project
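For a feel of what the Solr layer behind Blacklight answers, here is a hedged sketch of a faceted query over HTTP; the core name and field names are invented for illustration and are not the vLab's actual schema:

```python
import requests

# Core and field names below are illustrative, not the vLab's schema.
solr_url = "http://localhost:8983/solr/items/select"
params = {
    "q": "discourse_type:interview",   # find interview recordings
    "facet": "true",
    "facet.field": "collection_name",  # count matches per collection
    "rows": 10,
    "wt": "json",
}
results = requests.get(solr_url, params=params).json()
print(results["response"]["numFound"], "matching items")
```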
Workflow engine courtesy of Galaxy: “Data intensive biology for everyone.” 🙂
Photo credit: http://nla.gov.au/nla.pic-an7697018-3
But what is it doing in this presentation? (No, it’s not a virtual lab, it’s just a lab.)
At the presentation I (Peter) asked the audience what they could tell about this dog from the picture. There were two interesting answers. Firstly, it is male, which might be of interest in a biological virtual lab. Secondly, someone said it is ‘happy’, which might be of interest in a Human Communication Science lab. A set of images like this may be an appropriate addition to the lab, for studying how people react to non-human faces. Denis Burnham, who leads this project, is a psychologist, and has been exploring ways that Alveo could be used to store re-usable sets of stimuli used in experiments, which are typically collected for a particular study and not made available for re-use.
The idea of looking at dog pictures is something I made up, but in phase two of the virtual lab, starting mid-2014, one of the tasks is to set up a board to approve the addition of new data sets; they will be able to answer the question: dog pictures or no?
* (it’s Daniel de Byl’s dog Merlin, and he took the photo which is used here with permission).
We’ll be:
* Promoting its use to researchers and research communities via a variety of outreach activities
* Supporting the lab via a combination of the UWS Service Desk and the AeRO eResearch body
* Continuing development of new features
* And most importantly, we’ll be working on a sustainable model for the future. (Can it live on through grants? Subscriptions? Partnerships with other similar projects?)
“It is looking very good. Lots of possible uses and a nice interface.”
“The platform is easy to use and has the great potential to help with Linguistic research and wide applications in other areas…”
“The system seemed to be quite user-friendly. At first I was relying on the manual, however when the manual became more streamlined with less details, the system was still easy to follow.”
“This is a powerful tool and I think it is pretty good.”
“I really liked using the system and the instructions were very easy to use and the system easy to navigate. […] This platform would be very useful for my research.”
“I think it’s quite easy to use. […] Generally the platform is very clearly organised, and user-friendly.”
“The platform overall is very good.”
“Very nice platform with great user interface!”
“A very promising and impressive setup so far!”
“I’m impressed with the platform – it’s smooth and the interface is very intuitive.”