A Conversation with Synthesio’s Product & Data Science Team
We sit with Synthesio’s Product and Data Science team to discuss our advanced semantic analysis capabilities.
Today’s consumer changes fast and often. For brands, understanding this consumer can feel like guesswork.
Until now, tracking keywords across social media could uncover the contours of the consumer conversation – but you were still left to color in the lines from there.
That’s why we built Semantic AI. By classifying specific grammatical elements – descriptions, actions, and expressions – you can better understand the structure and meaning of the conversation, i.e., consumer language. With a clearer understanding of language comes a clearer understanding of consumer context – their motivations and experiences.
Let’s discuss more with the builders of Semantic AI – Sinan Asil, Senior Product Manager, and Ferdinand Roth, Lead Data Scientist.
Let’s start with an explanation. What is Semantic AI?
Sinan: Semantic AI is our set of advanced text analytics capabilities. It deals with the what and the how of the consumer conversation. With Entities, we surface new, emerging themes entering the conversation, such as people, organizations, products, locations, and more. With Text Analytics, we’re concerned with how people naturally discuss these themes – for example, the descriptions, expressions, nouns, verbs, and adjectives that define and drive consumer conversation. Above all, we’re excited about these developments because they allow our customers to spot unique insights (emerging conversation subjects) and specific insights into consumer language (how consumers discuss those subjects).
Why did the Product team choose to build these capabilities?
Sinan: When people (consumers) speak online, they seek not only to be heard but to be understood. When there are millions of voices, how do we avoid distilling them to mere numbers? How do we understand them intelligently and humanely? Likewise, brands – our customers – are looking to us to surface shifts in consumer behavior and deliver more specific, actionable insights.
Our fundamental belief is this – If brands (our customers) can understand the content and the user behavior at a deep, semantic level, they can deliver more relevant content and products, thereby creating richer experiences for consumers.
For the technical novices out there, can you explain how it all works?
Ferdinand: Natural Language Processing (NLP) – a subfield of Artificial Intelligence – powers our semantic analysis capabilities. NLP helps machines understand human language, which is quite complicated. NLP studies the structure and rules of language and creates intelligent systems capable of deriving meaning from text and speech.
Named entity recognition is an essential NLP task that allows us to spot the main entities in a text. It locates and classifies named entities mentioned in unstructured text into predefined categories like organizations, people, products, facilities, countries & cities, identity groups, and events.
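To make the task concrete, here is a toy sketch of named entity recognition in pure Python – a hard-coded lookup, not Synthesio’s actual model, which is learned rather than hand-written. The `ENTITY_LEXICON` entries are invented for illustration.

```python
# Toy named entity recognizer: looks up known entity phrases in a text
# and labels each with a predefined category. Purely illustrative --
# production NER uses a trained statistical model, not a fixed lookup.

ENTITY_LEXICON = {
    "sony": "ORGANIZATION",
    "playstation 5": "PRODUCT",
    "tokyo": "CITY",
}

def recognize_entities(text):
    """Return (phrase, category, start_index) for each known entity found."""
    lowered = text.lower()
    hits = []
    for phrase, category in ENTITY_LEXICON.items():
        start = lowered.find(phrase)
        if start != -1:
            hits.append((phrase, category, start))
    # report entities in the order they appear in the text
    return sorted(hits, key=lambda h: h[2])

print(recognize_entities("Sony launched the Playstation 5 in Tokyo"))
```

The real system does the same two things – locate spans, assign categories – but with a neural model that generalizes to entities it has never seen listed.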
Part-of-speech tagging is the process of marking up a word in a text as a particular grammatical element. Examples include verbs, adverbs, nouns, numbers, particles, pronouns, interjections, etc.
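A minimal sketch of the same idea: assign each token a grammatical label. This toy version uses a tiny hand-written lexicon (`POS_LEXICON` is invented for illustration); real taggers learn these labels from annotated corpora and use context to resolve ambiguity.

```python
# Toy part-of-speech tagger: label each token via a tiny lexicon.
# Illustrative only -- production taggers are trained models that
# use surrounding words to disambiguate (e.g., "play" as noun vs. verb).

POS_LEXICON = {
    "the": "DETERMINER",
    "console": "NOUN",
    "sold": "VERB",
    "quickly": "ADVERB",
    "5": "NUMBER",
}

def tag(sentence):
    """Return (token, tag) pairs; 'UNKNOWN' for words not in the lexicon."""
    return [(tok, POS_LEXICON.get(tok.lower(), "UNKNOWN"))
            for tok in sentence.split()]

print(tag("The console sold quickly"))
```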
With Synthesio being part of one of the world’s largest Market Research firms, how did this help you produce a better product?
Sinan: Because Ipsos is a global company, we have significant resources at our disposal in many countries. This global reach became particularly useful for validating the semantic analysis NLP model in Asian languages. Text processing is complicated for languages like Chinese, Thai, and Korean – Chinese and Thai, for example, do not separate words with spaces, which makes tokenization harder.
Can you explain the Normalization concept? Why does it matter?
Ferdinand: When you search Google for “PS5”, the semantic analysis engine is intelligent enough to understand that this is the same as “Playstation5”, “Play Station 5”, “Playstation 5”, etc. This creates a sense of “intuitiveness,” but it takes a lot of work to get there. Our Named Entity Recognition goes through that work – a deeper level of quality assurance that we call “normalization.”
Through NLP, we can detect the same entity written in multiple forms, on social media and beyond. For example, the new Sony Playstation 5 can be written as Playstation 5, PS5, Sony PS5, or PS 5 – but it is all treated as a single entity. In other words, this is not about building a static library but a neural approach to detecting relationships & synonyms – when the Playstation 6 comes out, no human intervention will be required.
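A rule-based sketch conveys the goal of normalization, with one important caveat: the hand-written `ALIASES` map below is exactly the kind of static library the team says they avoid – their neural approach learns that “PS5” and “Playstation 5” are the same entity without such a table.

```python
import re

# Sketch of entity normalization: collapse surface variants of one entity
# to a single canonical key. The hand-written ALIASES map is hypothetical
# and stands in for relationships a neural model would learn on its own.

ALIASES = {"ps5": "playstation5", "sonyps5": "playstation5"}

def canonical_entity(mention):
    """Lowercase, strip spaces/punctuation, then resolve known aliases."""
    key = re.sub(r"[^a-z0-9]", "", mention.lower())
    return ALIASES.get(key, key)

variants = ["PS5", "Playstation5", "Play Station 5", "Playstation 5",
            "Sony PS5", "PS 5"]
print({canonical_entity(v) for v in variants})  # all collapse to one key
```

Pure string cleanup alone handles “Play Station 5” vs. “Playstation5”, but mapping the abbreviation “PS5” to the full name requires knowledge of the relationship – which is where the neural approach earns its keep.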
Where can Users find these features?
Sinan: You’ll find them in our Data & Reports sections. Beyond just word clouds, our Semantic AI capabilities are baked in throughout the experience, enhancing the uniqueness and specificity of insights. For example, you can filter, build reports, and take advantage of pre-built visualizations and widgets.
In Data, spot new insights with the Entities & Semantic scorecards. You can slice the top results by applying any filter, hide irrelevant results, and view trend lines. Use the semantic explorer to highlight connections between repeated and emerging entities and the related grammatical elements.
In Reports, either leverage pre-built templates or drill down into your data points with the new Semantic dimensions and build any data visualization that fits your study.
What was technically challenging about building this capability, and how did you solve those challenges?
Ferdinand: We faced two core technical challenges – (1) the semantic analysis model must be multilingual and (2) it has to understand context deeply and disambiguate synonyms, like Apple, the company, and Apple the fruit.
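The second challenge – disambiguating “Apple” the company from “apple” the fruit – comes down to reading the surrounding words. This toy cue-word counter (all cue lists invented for illustration) is a crude stand-in for what a contextual model like BERT does with learned representations:

```python
# Toy word-sense disambiguation for "apple": score context cue words on
# each side. A stand-in for contextual embeddings, which weigh every
# surrounding word rather than a small hand-picked list.

COMPANY_CUES = {"iphone", "stock", "ceo", "launch", "shares"}
FRUIT_CUES = {"eat", "juice", "tree", "pie", "ripe"}

def disambiguate_apple(sentence):
    """Label the mention as ORGANIZATION or FRUIT based on context words."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    company_score = len(words & COMPANY_CUES)
    fruit_score = len(words & FRUIT_CUES)
    return "ORGANIZATION" if company_score > fruit_score else "FRUIT"

print(disambiguate_apple("Apple's CEO announced a new iPhone launch"))
print(disambiguate_apple("She baked an apple pie from ripe fruit"))
```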
That’s why we decided to build a neural network based on Google’s BERT. BERT is a large neural network model developed by Google and pre-trained on multilingual data from more than 100 languages. BERT can “process words in relation to all the other words in a sentence… therefore considers the full context of a word by looking at the words that come before and after it”. Exactly what we needed. Out of the box, however, BERT is pre-trained on professional, neutral, and semi-formal language (like Wikipedia) – not what you see on Reddit or Twitter.
Wikipedia text and social-media mentions differ drastically in syntax and conversational style. So we started with BERT as a foundation and built on top of it a neural network trained on our highly heterogeneous social corpus, specifically to detect parts of speech and named entities across social mentions.
To train this model and obtain the best results, we used public datasets covering various types of text, as well as in-house annotated datasets built from our own social data.