Research interests
- Machine Learning and Deep Learning
- Natural Language Processing (NLP), in particular sequence modelling
- Automatic Speech Recognition and Understanding (ASRU)
- Probabilistic models, in particular Neural Networks, Conditional Random Fields (CRF), Stochastic Finite State Machines (FSM), Support Vector Machines (SVM), probabilistic grammars
- Representation learning
Research projects
- Pantagruel: Modèles de langue multimodaux et inclusifs pour le français général et clinique (WP leader), October 2023 - April 2027
- Make-NMT Viz: Visualisation and explanation of NMT models (Collaborator), September 2022 - September 2024
- E-SSL: Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies (Collaborator), November 2022 - April 2026
ANR PRC project (CE23)
- CREMA: Coreference REsolution into MAchine translation (PI), January 2022 - December 2025
ANR JCJC (Jeunes Chercheuses Jeunes Chercheurs) project (CE23)
- Chaire MIAI (Multidisciplinary Institute in Artificial Intelligence) (Collaborator), October 2019 - December 2024
- Multi-Task Sequence Prediction for NLP (PI), January 2021 - December 2021
LIG local Emergence project
- Neural Coreference Resolution (PI), January 2019 - December 2019
LIG local Emergence project
- ANR DEMOCRAT (Collaborator), January 2016 - December 2019
DEscription et MOdélisation des Chaînes de Référence : outils pour l'Annotation de corpus (en diachronie et en langues comparées) et le Traitement automatique
- Quaero (Collaborator), June 2010 - September 2013
- TRACE (Collaborator), December 2011 - November 2012
- Live Memories (Collaborator), November 2009 - March 2010
- LUNA (Collaborator), October 2006 - October 2009
Activities
Supervising
Post docs
- Gabriela Gonzales-Saez, 10/2024 - 09/2025, funded by ANR JCJC CREMA
Subject : Context-Aware NMT models explainability
- Hang Le, 10/2023 - 12/2024, funded by Pantagruel
Subject : Multi-Modal SSL Models for Text, Speech and Image
- Gabriela Gonzales-Saez, 07/2023 - 09/2024, funded by Make-NMT Viz
Subject : NMT models visualisation and explainability
- Elisa Gugliotta, 06/2022 - 02/2023, funded by Chaire MIAI (Multidisciplinary Institute in Artificial Intelligence)
Subject : NLP for Arabish analysis
Ph.D. students
- Yuxuan Zhang, 2024 - 2027, Ph.D. student CIFRE at Eloquant
with Fabien Ringeval, Ruslan Kalitvianski
Subject : Prediction of user satisfaction
Ph.D. in progress
- Ryan Whetten, 2023 - 2026, Ph.D. student at LIA, UGA, Samsung AI Center Cambridge
with Yannick Estève, Titouan Parcollet
Subject : Efficient SSL Models for Speech
Ph.D. in progress
- Mariam Nakhlé, 2022 - 2025, Ph.D. student CIFRE at Lingua Custodia
with Emmanuelle Esperança-Rodier, Raheel Qader
Subject : Document-Level Machine Translation Evaluation
Ph.D. in progress
- Fabien Lopez, 2022 - 2025, Ph.D. student at UGA
with Didier Schwab, Emmanuelle Esperança-Rodier
Subject : Coreference Resolution and Machine Translation
Ph.D. in progress
- Lorenzo Lupo, 2019 - 2022, Ph.D. student at UGA
with Laurent Besacier
Subject : Document-Level Neural Machine Translation
Ph.D. defended in March 2023
- Elisa Gugliotta, 2019 - 2022, Ph.D. student at La Sapienza, UGA
with Giuliano Mion, Olivier Kraif
Subject : NLP for Arabish analysis
Ph.D. defended in May 2022
- Loïc Grobol, 2016 - 2020, Ph.D. student at Paris 3
with Isabelle Tellier/Frédéric Landragin, Eric De La Clergerie
Subject : Coreference Resolution
Ph.D. defended in July 2020
- Tian Tian, 2014 - 2019, Ph.D. student CIFRE at Synthesio
with Isabelle Tellier/Thierry Poibeau
Subject : NLP for User-Generated-Content analysis
Ph.D. defended in October 2019
- Yoann Dupont, 2013 - 2017, Ph.D. student CIFRE at Expert System (ex Temis)
with Isabelle Tellier
Subject : Named Entity Detection
Ph.D. defended in November 2017
Master students
- 2023 Dimitra Niaouri, Subject : Context-Aware Machine Translation Evaluation
- 2022 Romaissa Kessi, Subject : Classification of political ads
- 2021 Lyheang Ung, Subject : Multi-task sequence-to-sequence learning
- 2021 Marco Naguib, Subject : End-to-End Spoken Language Understanding
- 2021 Laura Alonzo Canul, Subject : Document-Level Neural Machine Translation
- 2019 Julien Sfeir, Subject : Neural Coreference Resolution
- 2019 Nikita Kapoor, Subject : End-to-End Spoken Language Understanding
- 2017 Evann Cordier, Subject : Entity-Aware Language Models
- 2016 Nour El Houda Belhaouane, Subject : Mention detection for coreference resolution
- 2015 Abdelwahed Zaki, Subject : Mention detection for coreference resolution
- 2015 Sina Ahmadi, Subject : Entity detection for coreference resolution
Teaching
- Natural Language Processing for master Mosig 2023 @ UGA (~5h)
- Natural Language Processing for master Mosig 2022 @ UGA (~5h)
- Analyse Syntaxique 2019 @ UGA (~40h)
- Traitement Automatique de Langues (TAL) 2015 @ Paris 6 (~40h)
- Introduction au TAL @ Paris 3 (4h)
Others
I regularly review papers for national and international journals
I regularly serve on the scientific program committee (as a reviewer) of conferences such as IJCAI, AAAI, IJCNLP, TALN, ...
- 2023, Paper evaluation committee member at EMNLP
- 2023, Project evaluation committee member at ANR
- 2022, Co-organizer of the workshop "Rumore di fondo o valore aggiunto" at Grenoble on detecting noise in annotated data
- 2022, Talk about the LeBenchmark project at the GENCI big challenges day at LPS, Orsay, France
- 2022, Chair of the session Spoken Language Modeling and Understanding at Interspeech 2022
- 2022, Program committee member at the joint GDR LIFT & NLP days
- 2022, Talk about the LeBenchmark project at the French-German workshop on AI, INRIA Rocancourt, Paris
- 2022, Co-organizer of the GDR TAL day on oral language representation learning
- 2019, Examiner for the Ph.D. defense of Edwin Simonnet
- 2018, Project evaluator for the ANR
- 2017, Area chair at the French conference TALN
- 2016, Project evaluator for the Fond de recherche Nature et Technologies Québec
- Program Committee member at the International Conference of the Association for Computational Linguistics (ACL) 2015
- Reviewer for the Journal of IEEE Signal Processing Letters 2015
- Program Committee member at the International Conference of the Association for Computational Linguistics (ACL) 2014
- Reviewer for the Journal of Natural Language Engineering (JNLE) 2013
- Program Committee member at the International Conference of the Association for Computational Linguistics (ACL) 2013
- Program Committee member at the International Joint Conference on Artificial Intelligence (IJCAI) 2013
Previous research applications
Extended Named Entity Detection
Named Entity Detection is a well-known NLP task used as a preliminary step to extract semantic information for use in more complex applications. Beyond simple named entity detection tasks like the CoNLL 2003 shared task, more complex named entity sets have been defined in recent years, e.g. the one described in (Sekine and Nobata, 2004). Despite the complexity of these entity sets, most named entity detection tasks defined in recent years can be tackled more or less as sequence labeling tasks.
During the first part of my post-doc at LIMSI-CNRS, I worked on a new set of named entities defined within the Quaero project. This set is described in (Grouin et Al., 2011); its main difference from previous entity sets is that entities have complex tree structures, where simple entity components can be combined into complex, higher-level entities.
Given such tree structures, the task cannot be tackled as plain sequence labeling, which makes it more difficult. The task is made harder still by the kind of data annotated with named entities: transcriptions of French broadcast data from different French and North-African radio channels.
To address these issues, after trying approaches from syntactic parsing without success, I used an approach that combines the robustness of Conditional Random Fields (CRF) (Lafferty et Al., 2001) in sequence labeling tasks with the ability of syntactic parsing algorithms (e.g. (Charniak, 1997)) to generate tree structures from flat sequences effectively, even on noisy data.
My approach uses CRF models to tag words with simple entity components, while a Probabilistic Context-Free Grammar (PCFG) together with a chart-parsing algorithm reconstructs the whole entity tree. The advantage of this approach is that CRFs are particularly effective for sequence labeling and robust to noisy data, so they can provide an accurate annotation even on noisy data like French broadcast news. Once the words are annotated with entity components, since entity trees are far simpler than syntactic trees, even a simple model like a PCFG is effective for parsing entity trees.
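As a rough illustration of the second stage only, the sketch below runs a Viterbi CYK chart parser over component tags that a CRF is assumed to have already produced. The grammar, its probabilities, the intermediate label `names` and the function name `parse_entity` are invented for this example; they are not the Quaero grammar nor the code used in the evaluation campaign.

```python
# Toy PCFG in Chomsky normal form. Labels and probabilities are hypothetical.
# (left child, right child) -> list of (parent label, rule probability)
BINARY_RULES = {
    ("name.first", "name.last"): [("names", 0.6), ("pers.ind", 0.4)],
    ("title", "names"): [("pers.ind", 0.7)],
}


def parse_entity(words, tags):
    """Viterbi CYK: most probable tree per label spanning the whole tagged sequence."""
    n = len(words)
    # chart[i][j] maps a label to (probability, tree) for the span words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, (word, tag) in enumerate(zip(words, tags)):
        # The component tags are taken as given, with probability 1.0
        chart[i][i + 1][tag] = (1.0, (tag, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart[i][j]
            for k in range(i + 1, j):
                for left, (p_left, t_left) in chart[i][k].items():
                    for right, (p_right, t_right) in chart[k][j].items():
                        for parent, p_rule in BINARY_RULES.get((left, right), []):
                            prob = p_rule * p_left * p_right
                            # Keep only the best derivation for each label
                            if prob > cell.get(parent, (0.0, None))[0]:
                                cell[parent] = (prob, (parent, t_left, t_right))
    return chart[0][n]


analyses = parse_entity(
    ["Mr", "Marco", "Dinarelli"],
    ["title", "name.first", "name.last"],
)
```

Here `analyses` maps each label that covers the whole sequence to its probability and tree. In a fuller version one could replace the constant 1.0 with the tag probabilities given by the CRF, so that tagging uncertainty is propagated into the tree score.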
This approach was evaluated in the 2011 Quaero named entity detection evaluation campaign, where it ranked first by a large margin.
Details of this approach are given in (Dinarelli Rosset, IJCNLP 2011). Some advances were later published in (Dinarelli Rosset, EACL 2012), where several different tree structures were used to encode some context in the PCFG. The same approach was also applied to OCR-ized data dating from 1890, after a preprocessing step described in detail in (Dinarelli Rosset, LREC 2012).
Spoken Dialog Systems
Spoken Dialog Systems (SDS) are speech applications that allow humans to engage in a dialog with a machine in order to solve a task.
During my Ph.D. I worked on the LUNA project SDS prototype; in particular, I designed the understanding module of the application. The main goal was to develop an evolution of a simple call routing application in Italian, in the domain of hardware/software problem solving. The understanding module of the application integrates state-of-the-art Spoken Language Understanding models, complemented with a sentence classifier.
Once the system identifies the problem as belonging to one of 10 possible scenarios, it redirects the user to an operator able to provide further assistance.
For more details see (Dinarelli et Al., ICASSP 2010).
Ontology-Based Spoken Language Understanding
From a computer science perspective, an ontology is a taxonomy of classes linked by relations. In a Spoken Language Understanding (SLU) context, classes are semantic classes, or concepts, and relations are semantic relations between concepts.
Beyond traditional ontology relations, e.g. "is-a" or "part-of", we defined some specific relations among concepts taken from the Italian corpus of Spoken Dialogs described in (Dinarelli et Al., EACL 2009b).
The corpus covers the domain of problem solving for hardware/software repair and has been used for the development and evaluation of Spoken Language Understanding systems (see e.g. (Dinarelli et Al., EACL 2009a)).
We used the ontology semantic relations to assess semantic interpretation hypotheses generated by a baseline SLU system based on Stochastic Finite State Transducers, like the one described in (Dinarelli et Al., EACL 2009a). We chose the hypothesis that was most consistent with respect to the Ontology Relatedness measure defined in (Quarteroni et Al., ASRU 2009).
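The selection step can be pictured with the minimal sketch below. It does not implement the Ontology Relatedness measure of (Quarteroni et Al., ASRU 2009): as a simple stand-in, it scores a hypothesis by the fraction of its concept pairs that are linked in a toy ontology. The concept names, the `RELATED` set and both function names are hypothetical.

```python
from itertools import combinations

# Hypothetical ontology: unordered pairs of concepts linked by some relation.
RELATED = {
    frozenset(("printer", "hardware-problem")),
    frozenset(("computer", "hardware-problem")),
    frozenset(("application", "software-problem")),
}


def relatedness(concepts):
    """Fraction of distinct concept pairs in a hypothesis that the ontology links."""
    pairs = list(combinations(sorted(set(concepts)), 2))
    if not pairs:
        return 0.0
    return sum(frozenset(pair) in RELATED for pair in pairs) / len(pairs)


def select_hypothesis(hypotheses):
    """Pick the most consistent hypothesis; ties keep the earlier (higher-ranked) one."""
    return max(hypotheses, key=relatedness)


n_best = [
    ["printer", "software-problem"],   # baseline 1-best, concepts not linked
    ["printer", "hardware-problem"],   # lower-ranked but consistent
]
chosen = select_hypothesis(n_best)
```

The input list is assumed to be ordered by the baseline SLU score, so that the built-in `max` falls back on the baseline ranking when two hypotheses are equally consistent.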
Although the final results did not improve on the state of the art in terms of accuracy, this idea received good feedback at the Interspeech 2009 conference and the ASRU 2009 workshop.
Ph.D. Thesis
The topic of my Ph.D. was Spoken Language Understanding (SLU) models for Spoken Dialog Systems.
The work focused on the integration of different SLU models using discriminative re-ranking algorithms (Collins, 2000).
Two models were used for hypothesis generation: Stochastic Finite State Transducers (SFST), encoding a semantic language model (Raymond et Al., 2006), and Conditional Random Fields (CRF) (Lafferty et Al., 2001). The re-ranking model was based on Support Vector Machines (Vapnik, 1998) with Kernel Methods, in particular String Kernels (Shawe-Taylor & Cristianini, 2004) and Tree Kernels (Collins & Duffy, 2001) (Moschitti, 2006).
New tree-structured features for kernels were designed to give an effective representation of SLU hypotheses in SVMs (Dinarelli et Al., EMNLP 2009).
An important contribution to reranking is the hypothesis selection criterion: a heuristic that provides a semantic inconsistency metric over hypotheses, making it possible to select the best hypotheses among those generated by the SFST or the CRF; for details see (Dinarelli et Al., SLT 2010), (Dinarelli Rosset, EMNLP 2011), and (Dinarelli et Al., IEEE 2012).
The joint models based on reranking were evaluated on 4 corpora in 4 different languages: ATIS (English), MEDIA (French), and the Italian and Polish corpora acquired within the European project LUNA (see (Dinarelli et Al., EACL 2009b) for the Italian corpus). An exhaustive comparison with several state-of-the-art models was performed, showing the effectiveness of reranking models; all details are in my Ph.D. dissertation (Dinarelli, Ph.D. Dissertation 2010).
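To give an idea of what a Tree Kernel computes, here is a small sketch of the subset-tree kernel in the style of (Collins & Duffy, 2001): it counts the tree fragments shared by two trees, with a decay factor `lam`. The tree encoding (nested tuples with words as leaf strings), the function names and the example hypothesis trees are assumptions made for this illustration; the tree-structured features actually used for reranking are those described in (Dinarelli et Al., EMNLP 2009).

```python
def internal_nodes(tree):
    """Yield every internal node of a tree written as (label, child, ...); leaves are strings."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)


def production(node):
    """Node label followed by its children's labels; words are marked to avoid clashes."""
    return (node[0],) + tuple(
        child[0] if isinstance(child, tuple) else ("word", child)
        for child in node[1:]
    )


def common_fragments(n1, n2, lam):
    """Weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):  # equal productions, so c2 is an internal node too
            score *= 1.0 + common_fragments(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the common-fragment counts over all pairs of internal nodes."""
    return sum(
        common_fragments(a, b, lam)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )


# Two hypothetical SLU hypotheses for the same utterance, as concept trees.
hyp_a = ("HYP", ("DEVICE", "printer"), ("PROBLEM", "broken"))
hyp_b = ("HYP", ("DEVICE", "printer"), ("PROBLEM", "slow"))
similarity = tree_kernel(hyp_a, hyp_b)
```

This direct version compares all node pairs and recomputes shared subproblems, which is fine for small hypothesis trees; memoising `common_fragments` or using the faster algorithms of (Moschitti, 2006) would be the natural next step for larger ones.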
Master Degree Thesis
During my Master's thesis I studied, implemented and evaluated an application for data clustering and compression.
Data compression algorithms can be thought of as functions that transform data so as to reduce local redundancy. The compression algorithm detects redundancy inside a window on the input data stream. Redundancy detection is limited to this window, which can be a serious limitation when compressing relatively large amounts of data. Common data compression algorithms, like the Lempel-Ziv family used in the zip and gzip Linux tools, or algorithms using the Burrows-Wheeler Transform (BWT) like the bzip2 Linux tool, use a fixed-size window (e.g. in bzip2 the mutually exclusive options -1,...,-9 fix the window size to 100K,...,900K).
A possible way to improve compression performance is to increase the window size. Unfortunately, this also increases the compression time, which in the worst case cannot be bounded a priori.
The solution studied in the thesis takes the opposite point of view: instead of arbitrarily increasing the window size in order to detect redundancies that lie far apart in the documents, we apply a fast data clustering algorithm that puts similar sub-parts of documents (possibly) close together, thus increasing local redundancy. After the clustering phase, data are compressed using a variable-size window algorithm. The window size bound was determined empirically through a set of experiments with increasing window sizes. The clustering phase was performed using min-wise independent linear permutations (Bohman, Cooper, Frieze 2000) to convert document sub-parts into feature vectors. These were then mapped into a one-dimensional real number space using Locality Sensitive Hashing (LSH) (Andoni, Indyk 2006). Exploiting the properties of LSH (similar vectors, and so similar document sub-parts, are hashed close together on the real line), we simply re-sort document sub-parts by hash value, thus obtaining possibly highly redundant data. The final data compression step is performed with an algorithm based on the BWT, provided by my advisor Professor Paolo Ferragina.
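The pipeline can be sketched in a few lines of Python. This is a simplified stand-in, not the thesis implementation: chunks are described by their 4-byte substrings, the min-wise signature uses random linear maps modulo a prime, the LSH step is reduced to a single random projection onto the real line, and Python's `bz2` module replaces the BWT-based compressor. All function names, the prime and the parameter values are assumptions.

```python
import bz2
import random

PRIME = 2_147_483_647  # 2**31 - 1, modulus of the linear maps (assumed value)


def shingles(chunk, k=4):
    """Integer codes of the k-byte substrings of a chunk."""
    last = max(1, len(chunk) - k + 1)
    return {int.from_bytes(chunk[i:i + k], "big") for i in range(last)}


def signature(chunk, maps):
    """Min-wise signature: minimum of (a*x + b) mod PRIME over the shingles, per map."""
    codes = shingles(chunk)
    return [min((a * x + b) % PRIME for x in codes) for a, b in maps]


def reorder(chunks, n_maps=16, seed=0):
    """Return chunk indices sorted by a one-dimensional key derived from the signatures."""
    rng = random.Random(seed)
    maps = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n_maps)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_maps)]  # random projection axis
    keys = [
        sum(w * s for w, s in zip(weights, signature(chunk, maps)))
        for chunk in chunks
    ]
    return sorted(range(len(chunks)), key=keys.__getitem__)


def compress_reordered(chunks):
    """Re-sort chunks by key, then compress; the order is kept to undo the permutation."""
    order = reorder(chunks)
    return order, bz2.compress(b"".join(chunks[i] for i in order))


chunks = [b"aaaa" * 50, b"zzzz" * 50 + b"q", b"aaaa" * 50, b"hello world " * 20]
order, blob = compress_reordered(chunks)
```

Identical chunks get identical keys and so end up next to each other after sorting; near-duplicates are only likely, not guaranteed, to land close together. The `order` list has to be stored alongside the compressed data so that the original chunk sequence can be restored after decompression.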
Bibliography
(Dinarelli et Al., IEEE 2012)
Marco Dinarelli, A. Moschitti, G. Riccardi
Discriminative Reranking for Spoken Language Understanding
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), volume 20, issue 2, pages 526 - 539, 2012.
(Dinarelli Rosset, LREC 2012)
Marco Dinarelli, S. Rosset
Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results
In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey, 2012.
(Dinarelli Rosset, EACL 2012)
Marco Dinarelli, S. Rosset
Tree Representations in Probabilistic Models for Extended Named Entity Detection
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Avignon, France, 2012.
(Dinarelli Rosset, IJCNLP 2011)
Marco Dinarelli, S. Rosset
Models Cascade for Tree-Structured Named Entity Detection
In Proceedings of International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, 2011.
(Dinarelli Rosset, EMNLP 2011)
Marco Dinarelli, S. Rosset
Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, U.K., 2011.
(Dinarelli et Al., SLT 2010)
Marco Dinarelli, A. Moschitti, G. Riccardi
Hypotheses Selection For Re-ranking Semantic Annotations
IEEE Workshop on Spoken Language Technology (SLT), Berkeley, U.S.A., 2010.
(Dinarelli, Ph.D. Dissertation 2010)
Marco Dinarelli
Spoken Language Understanding: from Spoken Utterances to Semantic Structures
Ph.D. Dissertation, University of Trento
Department of Computer Science and Information Engineering (DISI), Italy, 2010.
(Dinarelli et Al., ICASSP 2010)
Marco Dinarelli, E. Stepanov, S. Varges, G. Riccardi
The LUNA Spoken Dialog System: Beyond Utterance Classification
In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, USA, 2010.
(Dinarelli et Al., EMNLP 2009)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models Based On Small Training Data For Spoken Language Understanding
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 2009.
(Dinarelli et Al., EACL 2009a)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models for Spoken Language Understanding
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Athens, Greece, 2009.
(Dinarelli et Al., EACL 2009b)
Marco Dinarelli, S. Quarteroni, S. Tonelli, A. Moschitti, G. Riccardi
Annotating Spoken Dialogs: from Speech Segments to Dialog Acts and Frame Semantics
EACL Workshop on Semantic Representation of Spoken Language, Athens, Greece, 2009.
(Quarteroni et Al., ASRU 2009)
S. Quarteroni, Marco Dinarelli, G. Riccardi
Ontology-Based Grounding Of Spoken Language Understanding
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, 2009.