Africa Links 24 editorial staff, with João Neves
Published on 2024-03-18 14:27:00
The International Portuguese Language Institute (IILP) and Media Comunicações S.A., owner of Expresso das Ilhas (EI), have signed a protocol whose relevance and innovative character deserve to be highlighted here. Under this protocol, EI transfers to IILP the digital collection of every edition of the newspaper published to date, numbering well over a thousand, for research purposes to be carried out by IILP itself or by third parties.
These are hundreds of editorials, news stories, articles, and interviews that fulfilled their informative role at the time of publication (a role shaped, in many cases, by the news voracity of the times) and that now, whatever their date, take on another equally important purpose.
In this newspaper, we have already written about the importance of databases and corpora for automatic text processing and natural language processing, and about the need to strengthen those that feed generative artificial intelligence resources in Portuguese, support the building of language models, and enable studies of linguistic variation, among other possibilities.
That is precisely one of the possibilities this transfer opens up: the creation of a database that stores a wide variety of information, makes it accessible, and supports an equally significant set of operations across different areas of study and research. Research is, in fact, the only purpose for which the journalistic corpus of Cape Verde (as it has been designated) may be used.
This is reflected in the protocol which, without compromising the integrity of the texts (which is guaranteed), allows the corpus to be tokenized. Tokenization is the segmentation on which computational processing relies to analyze language: the text is broken into sequences (strings) according to the unit chosen for a given search, whether words, phrases, or punctuation marks, which can then be counted to obtain frequencies, among other uses. It is, essentially, the key with which the researcher, using the chosen tool from computational linguistics or another field, will “read” the texts.
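To make the idea of tokenization concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample sentence is invented (it is not taken from the Expresso das Ilhas corpus), the function names are hypothetical, and the regular expressions are a deliberately simple rule set, not the tooling that IILP or any researcher would necessarily use.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration (not from the corpus).
SAMPLE = "O jornal publicou a notícia. A notícia foi lida por muitos leitores!"

# A token is either a word (letters/digits, optionally joined by an internal
# hyphen or apostrophe) or a single punctuation mark. In Python 3, \w matches
# accented letters such as "í", which matters for Portuguese.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Naive sentence split: break on whitespace that follows ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_frequencies(text):
    """Count lowercase word tokens, ignoring punctuation."""
    words = [t.lower() for t in tokenize(text) if t[0].isalnum()]
    return Counter(words)


if __name__ == "__main__":
    print(tokenize(SAMPLE))
    print(split_sentences(SAMPLE))
    print(word_frequencies(SAMPLE).most_common(3))
```

The same text yields different "readings" depending on the unit chosen: tokens, sentences, or word counts. A real study would likely replace these simple rules with a dedicated tool that handles abbreviations, clitics, and other features of Portuguese.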
To facilitate and enhance this research work, EI also agreed that the corpus be made available in its entirety: the texts can be viewed in full, and researchers can open them on their own computers with whichever tool suits their purpose, without any restrictions.
The result is a large collection that can contribute to the development and improvement of language processing models, as well as to a better understanding and recognition of patterns of variation.
Beyond the research potential of the corpus, one point in the protocol is clearly significant. IILP, an institution whose organization and statutes give it the vocation of promoting the Portuguese language from a multilateral perspective, will act as the repository of this collection. It will be responsible for establishing agreements with third parties so that the collection is provided free of charge to institutions, universities, science consortia, centers, and other research-related entities.
Looking ahead, Portulan Clarin (Research Infrastructure for Language Science and Technology) has already signaled its interest, an interest that accompanied the genesis of this project undertaken by IILP and that is of great importance for enhancing the value of the Expresso das Ilhas corpus. A renowned platform for open science and corpus management, Portulan Clarin provides access to a comprehensive international collection of scientific resources, benefiting researchers, innovators, companies, students, language professionals, and citizens interested in the various areas of study the platform serves.
This is therefore a privileged way of achieving one of the project’s main goals: to bring the corpus, as a data set and an object of study, closer to a community of specialists (within the CPLP and internationally) in the domains identified by the platform, such as science, technology, language promotion, linguistic diversity, and other areas related to language and culture.
With this initiative, EI’s collection offers itself to a new community of readers. These texts were read by others in the past, surely with a different perspective and purpose; the new community adds value to what is certainly both a linguistic collection and a collection of memory. At the same time, the Cape Verdean journalistic corpus (which, it must be said, is an open corpus) will join and strengthen a wide range of other resources that contribute to the technological development of the Portuguese language and help it meet the challenges of the digital era.
There are, therefore, good reasons to welcome this collaboration and the protocol that supports it. It inaugurates cooperation of this kind for both entities, and it matters for the collection itself, which, through further collaboration with the aforementioned infrastructure, will be made available, alongside collections from other areas, to the various audiences identified.
It is therefore fair to call it innovative, especially since, within the community of Portuguese-speaking countries, it will be the first collection in this domain, with these characteristics and this scope, to be made available in this way. This is also the vision behind the project and the collaboration: to open a path that others will follow as similar collaborations materialize. What matters is that the path exists and, to adapt the poet António Machado, the path is made by walking.
Read the original article (Portuguese) on Expresso das Ilhas



