The Rise of Government NLP Programs – with Manuel Herranz, Pangeanic
July 28, 2021
Partner Spotlight: Pangeanic
Smart governments are hiring data scientists to further automate what governments do for their citizens. These data scientists work on creating data highways, so that the information that flows into systems is structured, and a thousand different applications can spring forth from it in the future. Meanwhile, Manuel Herranz and his company Pangeanic are right on top of this emerging trend: their NLP ECO platform delivers private machine translation with proprietary, user-centric Deep Adaptive features, anonymization, language detection, data augmentation, sentiment analysis, and more.
Learn how he is moving from machine translation to anonymization and cognitive engines in the interview. Expect some fantastic insights into the benefits and drawbacks of joining EU projects, how this contrasts with working on AI and NLP within the private sector, and the myriad solutions a platform like ECO offers.
MT, NLP, AI and Public Administrations
Increasingly, governments are hiring AI experts and data scientists to build AI systems in-house. Are you encountering this trend?
Within the MAPA Project (Multilingual Anonymisation toolkit for Public Administrations), we have a mandate to develop use cases, such as solving a public administration’s problems or testing anonymization for them. One of our partners approached the Spanish Ministry of Justice, and surprise surprise, the people they talked to were all data scientists and NLP experts. They knew as much as we did about anonymizing data! So yes, more and more we’re encountering people in government administration settings who ‘speak our language’ and understand the need for AI-based solutions to cope with massive amounts of data, and how AI can extract knowledge from big data.
Given this current wave of digitalization, what are things that governments should be doing to make their lives easier?
The challenge is always to go from unstructured to structured content. Once content is structured, you can deal with it. You can extract a lot of insights and information. So, challenge number one is to establish a process whereby your unstructured, floating content from many sources becomes structured content.
Isn’t that more of an engineering challenge? Surely governments are aiming to buy some kind of coherent outcome for the public?
Yes – but you can’t separate the goals from the intermediate journey – you need a direction in order to buy the tools. If you purchase an MT solution here and an anonymization solution there, and then a classifier as well, everything should fit and work together, because needs are multiple and interconnected. But integration is a challenge. We saw that things were this way a long time ago and developed ECO to satisfy multiple needs from a single platform.
Do you know who the largest data producer in the world is? Public administrations, not Google. Companies like Google are just data handlers, manipulators, or managers. So when it comes to AI, governments are the ones who need to be modernizing and implementing AI technologies. A lot of the AI technologies that public administrations need depend on NLP: that’s why we’ve grown out of MT and started to offer solutions that clients in the public sector actually require.
So your idea is to help government institutions prepare, and create a fertile environment where lots of new products can take root…
Public administrations, at least in Spain, are realizing that they need to incorporate NLP into their strategies. ECO has been very successful in raising awareness about NLP and AI in general, as a solution for 21st century administrations.
Is the presence of NLP personnel becoming more prevalent within government?
Well, there’s a shortage of NLP personnel and data scientists everywhere, because they’re the basis of many AI solutions. So whether you want that person to run the projects, design the implementation, or guide policy change at an organization, data scientists and NLP people are in very, very high demand.
Are you finding that you are attracting quite a variety of different people looking for different things?
We are speaking to people that we haven't spoken to before. There are more sophisticated buyers. Three years ago, we couldn't have had this conversation – people would often be interested only on a surface level. But now, because technology is so much more in your face (who hasn't spoken to an Alexa or Siri product?) a ripple effect has reached the governmental and private sectors.
ECOV2 is an ecosystem of cognitive engines and services, built to address the needs of government NLP experts. It offers anonymization, annotation, data handling, data cleaning and classification, and categorization. It can also combine sentiment analysis with MT.
Who stands to benefit from these new NLP capabilities?
Well, anybody having to deal with data. Let's not forget, we're talking about two services here, three with named entity recognition. But there's a myriad of other services such as data classification, which is increasingly important for AI training. Data can be annotated automatically and then used to train AI far more quickly than via a manual approach. Equally, you may have tons and tons of data, and this can now be classified into multiple domains like journalism, legal, or medical.
Increasingly, we talk to NLP experts and data scientists who are building their own AI for institutions. So, we don’t just sell the platform or use of the platform, we also aid companies who want to augment data, or collect it afresh.
How did you come to be here at the right time, in the right place?
Years ago I saw how naïve the approach to MT was. In 2009, we produced the first Pangea platform which offered retraining features with statistical processes. Everything was included, from data cleaning to automated retraining – it was very ahead of its time – maybe too ahead of its time! Some people weren’t convinced. We didn't fully commercialize because of the circumstances (the financial crisis of 2008-2012), but we kept it as a proprietary tool. It was something that made our translation services more efficient, but the time wasn't right for commercialization. The perceptions of well-established linguists just weren’t there yet.
We began our NLP journey in 2017, which changed everything. It made MT a lot better, much faster to train, and its results were much closer to human parity. Thanks to European projects, new fields have since opened for us. We’ve been able to think about new ways of solving people’s needs by listening to what our clients would like beyond just MT – this journey has made us who we are.
What was the journey to winning your first government project?
Our first government contract actually came from the US – in Texas – they had a MemoQ user in their department. In the list of MT plugins, there was Google, Bing, and then a small one called PangeaMT. The client needed privacy because of the nature of the translation; they wanted an engine developed specifically. The RFP followed after that.
How did you secure your projects with the EU?
We had a post-editing project assigned to us in 2007, which was the direction we wanted to go in. But it was bad – uncustomized, rule-based MT. At the time we were experimenting with pattern recognition here at the Polytechnic in Valencia. We were beginning to have enough data for statistical systems, but still not enough.
After that, we were invited as a research organization into a Marie Curie project, and we had a good number of students, mature students and PhD students – all of that created a great atmosphere. Our research in Valencia, combined with other talents from elsewhere, led us to consider becoming a tool-based company rather than a service providing company. The following step was to build something – we got a small grant from the local government and used it to build a statistical model. It felt like a huge leap from where we’d started, and it worked well.
Appendix: Some of Pangeanic’s EU Projects
The MAPA Project (Multilingual Anonymization toolkit for Public Administrations)