Automated Content Moderation in a Brazilian Marketplace
Tatiana Gama
Americanas S.A.
Rio de Janeiro, Rio de Janeiro, Brasil
tatiana.gama@americanas.io
Joao Gabriel Melo Barbirato
Americanas S.A.
Sao Carlos, Sao Paulo, Brasil
joao.barbirato@americanas.io
Abstract
Clarifying doubts can be decisive when shopping on e-commerce platforms. Considering the relevance of user-generated content, this work aimed to develop an internal hybrid system, composed of machine learning models alongside a rule-based module, to moderate customers’ questions and sellers’ answers in one of the biggest marketplaces in Brazil.
Keywords: content moderation, Portuguese, questions and answers, user generated content, e-commerce, marketplace.
1 Introduction
On e-commerce platforms, user-generated content has become increasingly relevant to the business, since it increases consumers’ confidence to complete online purchases [2]. As one of the largest marketplaces in Latin America, Americanas Marketplace bases its business on online shopping, connecting sellers and final customers through its platform. One type of content that facilitates this interaction is the question and answer (Q&A) section: a system that allows customers to publicly ask questions in natural language (in this case, mostly Brazilian Portuguese) on the product page and to publicly receive answers from a seller. Figure 1 illustrates this section on the product page.
The questions are mainly asked in the pre-purchase stage and their content must focus on doubts about product features, especially those that have not been specified in the product description or in the product technical data sheet.
Figure 1. Q&A section.
Source: https://www.americanas.com.br/produto/111957454. Last accessed on Jun. 30, 2022.
This kind of content can be useful not only to the customer asking the question, but also to other customers who want the same information about the product. However, it is not unusual for customers and sellers to raise other topics, such as delivery time, freight costs or personal issues in general. Such content is considered inappropriate for the Q&A section of the product page, since it is not relevant to all customers and the answers to it are not stable. Therefore, certain questions and answers should be blocked and not displayed on the platform.
Due to the large number of interactions that Americanas receives, around one million questions per year, manual moderation (that is, filtering what should or should not be available on the product page) becomes impracticable. Scalable solutions are therefore fundamental, since customers need their questions answered as fast as possible to continue their purchases.
Although most research focuses on question-answering (QA) communities, such as Stack Overflow [3, 5, 7], there are works related to Q&A, such as how this system affects customer reviews [1]; the impact of information quality on customers’ purchase intention [9]; and the economic impact of the Q&A section on the product page [6], among others.
However, there is a lack of research focusing on Q&A content moderation.
Whereas an online product review represents a unilateral communication channel that allows customers to share post-purchase experiences, the Q&A system permits a bilateral interaction between users who ask questions and sellers who answer them. It reduces customer uncertainty and enhances purchase intention [9].
Considering the relevance of this topic, this work aimed to automate Q&A content moderation on e-commerce product pages. To do so, a hybrid system, composed of machine learning models alongside a rule-based module, was created, tested and implemented. The main purpose of this project was to achieve better, cheaper and faster results than the prior moderation method used by Americanas Marketplace. Previously, a third-party company was responsible for this content moderation, so we also sought to internalize the operation of this system.
This work is organized as follows: section 2 describes the methodology, along with the business rules, the hybrid system details and the datasets used to build the models. Next, section 3 presents the results, the evaluation metrics and the model error analysis. Finally, section 4 presents the conclusions and future research.
2 Methodology
Given a customer’s question or a seller’s answer, the automatic moderator must decide whether or not it should be published on the product page. This work considers the following criteria, derived from business rules:
• The question should be restricted to the characteristics of the product.
• The question cannot be related to shipping, price, complaints or problems related to the purchase journey.
• The question/answer must not contain bad words.
• The answer cannot direct or influence the customer to purchase the product on a different platform.
In addition, the solution aims to achieve better Q&A moderation metrics than the third-party company did and a lower prediction time than the previous solution.
Considering these directives, this work divides the solution into two parts: a moderation specific to the questions and a second moderation only for the sellers’ answers. For each part of the solution, a hybrid model composed of two modules was developed. The first is a rule-based module composed of a list of prohibited words (Blocklist), described in section 2.1; the second is a module composed of a machine learning model for automatic classification, detailed in section 2.2.
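As an illustration only, the two-module design can be sketched in Python as follows. The order of the modules (rule-based check first, classifier second), the names `moderate`, `Decision`, `QUESTION_TERMS` and `toy_classifier`, and the sample terms are all assumptions made for this sketch; they are not taken from the system described in this work, and the toy classifier merely stands in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Decision:
    publish: bool
    reason: str  # which module made the call


def moderate(
    text: str,
    blocked_terms: Iterable[str],
    classifier: Callable[[str], bool],
) -> Decision:
    """Run the rule-based check first, then fall back to the classifier.

    Assumption: the classifier returns True when the text should be blocked.
    """
    lowered = text.lower()
    for term in blocked_terms:
        # Plain substring test; a real blocklist would use stricter matching.
        if term in lowered:
            return Decision(publish=False, reason=f"blocklist: {term}")
    if classifier(text):
        return Decision(publish=False, reason="classifier")
    return Decision(publish=True, reason="approved")


# Hypothetical stand-ins for the real blocklist and the trained model.
QUESTION_TERMS = ["frete", "prazo de entrega"]


def toy_classifier(text: str) -> bool:
    """Placeholder model: flags texts that look like complaints."""
    return "reclama" in text.lower()
```

In this sketch, questions and answers would each get their own term list and classifier, passed to the same `moderate` function, which mirrors the split into two parts described above.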
2.1 Blocklist Module
The Blocklist module consists of a list of n-grams that are very often used in contexts that should always be blocked, such as delivery, freight, price, market competitors, inappropriate content and bad words. If the text contains any n-gram in this list, the question or answer is automatically blocked. In this case, for customers, the submit button in the Q&A web interface becomes unavailable and feedback showing the Q&A section rules is displayed instantly. This list was composed using regular expressions and manual annotation, drawing on business and linguistic knowledge. A comparative analysis was performed between a randomly extracted sample and manually annotated data, and the results are shown in section 3.
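A minimal sketch of n-gram matching with regular expressions is given below, using Python's standard `re` and `unicodedata` modules. The helper names (`normalize`, `compile_blocklist`, `find_blocked`), the accent-stripping step and the three sample entries are assumptions made for illustration; the actual list and matching rules used in this work are not reproduced here.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Preço' and 'preco' match alike."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def compile_blocklist(ngrams):
    """Build one regex matching any listed n-gram on word boundaries."""
    # Longer n-grams first so that, at a given position, they are tried first.
    ordered = sorted({normalize(n) for n in ngrams}, key=len, reverse=True)
    # Allow any run of whitespace between the tokens of a multi-word n-gram.
    parts = [r"\s+".join(re.escape(tok) for tok in n.split()) for n in ordered]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b")


def find_blocked(text, pattern):
    """Return the first blocked n-gram found in the text, or None."""
    match = pattern.search(normalize(text))
    return match.group(0) if match else None


# Hypothetical entries; the real list is not reproduced here.
BLOCKLIST = compile_blocklist(["frete", "preço", "prazo de entrega"])
```

Calling `find_blocked(text, BLOCKLIST)` before submission would let the interface disable the submit button whenever a term is found. Note that the word-boundary matching in this sketch does not catch inflected forms such as plurals, which would have to be listed explicitly or covered by a broader pattern.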
...