Developing a Large Scale Attribute Extractor for E-commerce

Leveraging NLP models can achieve impressive results, but also needs careful evaluation

Illustration by Stable Diffusion

Note: The original post has been published on Medium.

GET AN INSIGHT IN HOW OUR TECH & PRODUCT DEPARTMENT WORKS: In this guest article, our Senior Data Scientist Erick talks about developing a large scale attribute extractor for the items of the Kaufland online marketplaces.

Extracting structured data from text is a well-known NLP problem. Even in the more specific domain of e-commerce, there is a significant body of research on the subject. I recently developed a machine learning model for this task at Kaufland e-commerce, a major online marketplace in Europe, and as usual in applied scenarios, knowing the algorithms was just the start: making sense of the data was the greatest challenge. Let’s have a look at the scenario and how we dealt with the difficulties.


The Customer’s Perspective

Imagine you are shopping online for something like a lawn mower. You find some offers and want to look out for their technical specifications — say, size, weight, and performance. You might run into a page like the one below.


There’s a bit of text, but the tech specifications (let’s call them attribute values) at the bottom right might catch your eye. Now the question is: where do these values come from?

Product Data Sources

Online marketplaces offer assortments of millions of products. Third-party sellers are the ones who actually sell most products through the platform, and they are the ones who provide the product data.

In general, products must have a title, description, one or more pictures, and a couple of other mandatory fields to be allowed into our online marketplace. Structured attributes are optional: sellers are encouraged to provide them, but that doesn’t always happen.

Let’s look at an example. Here is a simplified product data JSON file:
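(The original screenshot is not reproduced here; the hypothetical record below is in the same spirit, with illustrative field names rather than the actual marketplace schema.)

```json
{
  "title": "Cordless Drill 18V",
  "description": "Compact cordless drill with a 400 W motor, weighing only 1.4 kg and reaching up to 1,500 rotations per minute. Total length: 25 cm.",
  "images": ["https://example.com/drill.jpg"],
  "attributes": {
    "length": "25 cm"
  }
}
```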

This JSON file contains only one optional attribute, length. However, if we read the full textual description, we will find explicit mentions of the device’s weight, motor wattage, and drill rotation speed.

This is a very common issue, as sellers often don’t have the full structured data about the products they sell. We will then take advantage of the attributes that sellers do provide to train an NLP model, which in turn will be able to extract more attributes from the text.


Regular Expressions

Regular expressions look like a reasonably simple and efficient solution at first. However, the number of similar-looking values that different attributes can have makes them inadequate for this task. Consider, for example, the two width values below. The first one refers to the grass cutting width, and the other to the whole equipment width.
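To make the ambiguity concrete, here is a toy sketch (the text and pattern are made up for illustration). A naive pattern for „width“ matches both mentions, with no way to tell the cutting width from the equipment width:

```python
import re

text = ("Robust lawn mower with a cutting width of 40 cm. "
        "Overall dimensions: width 55 cm, height 100 cm.")

# A naive pattern for "width" values matches both mentions,
# so string patterns alone cannot disambiguate the attributes.
matches = re.findall(r"width[^\d]*(\d+\s*cm)", text, flags=re.IGNORECASE)
print(matches)  # two different "width" values
```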

You could, of course, write more specialized regular expressions to get exactly what you want. But considering all wording possibilities, this becomes a never-ending effort, multiplied by the number of attributes you are interested in. Machine learning turns out to be easier and more efficient.


Have You Already Tried ChatGPT?

With the LLM hype, the question is to be expected. For what it’s worth, yes, I have tried ChatGPT, and it is very good. At least one paper has been written on applying it to attribute extraction. However, there are considerable issues regarding costs, compliance, and performance. I summarized them in the table below.

Our resulting extractor model runs on a single T4 GPU, which is not so expensive as far as hardware accelerators go. More on that later.


Models: The Easy Part

Model building can seem daunting if you have limited machine learning experience, but believe me, it’s the easiest part. Most of the needed tools are already implemented by libraries like Hugging Face. We took advantage of XLM-Roberta, a Transformer model like BERT. It is pretrained on huge amounts of text in many languages, and can kickstart nearly any neural NLP model. Being multilingual is a huge plus: Germany is the main market for Kaufland e-commerce, but it recently expanded to Czechia and Slovakia, with plans for more.

More specifically, we used Hugging Face’s Question Answering implementation. It was originally meant for fine-tuning models that take a pair [question, context] as input and find the answer to the question in the given context. We could then train the model with questions like „What is the color?“, followed by a product description whose color we knew beforehand. To simplify things and spare a few tokens in the limited model input, we provided just the attribute name followed by a title or description.

This architecture learns to point to the exact place in the context where the answer to the question (or attribute name, in our case) is located. It produces two outputs: a start and an end position, both trained as a softmax over all the model input tokens.
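As a rough sketch of how such a span is decoded (simplified on purpose; the real Hugging Face implementation scores start/end combinations more carefully), the idea looks like this:

```python
def decode_span(start_logits, end_logits, tokens):
    """Pick the most likely answer span from start/end scores.
    Position 0 holds the [CLS] token, used as the 'no answer' target."""
    start = start_logits.index(max(start_logits))
    end = end_logits.index(max(end_logits))
    if start == 0 or end < start:  # model points at [CLS] -> no answer
        return None
    return " ".join(tokens[start:end + 1])

# Toy example: the attribute name is "weight", the context mentions "1.4 kg",
# and the scores below are made up so that positions 6..7 win.
tokens = ["[CLS]", "weight", "[SEP]", "the", "drill", "weighs", "1.4", "kg", "[SEP]"]
start_logits = [0.1, 0, 0, 0, 0, 0, 5.0, 1.0, 0]
end_logits   = [0.1, 0, 0, 0, 0, 0, 1.0, 5.0, 0]
print(decode_span(start_logits, end_logits, tokens))  # -> "1.4 kg"
```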

Learning to Say „No“

An important case to consider is when there is no answer available — which happens very frequently, as texts do not always list every possible attribute a product has. If a model is trained only with examples that contain the attribute value (let’s call them positive examples), it will always try to find something, even when there is nothing to be found. We must instead include negative examples in the training data, i.e., cases when the right answer is no answer at all. Take a look at the example below:

Encoding negative examples is pretty straightforward. The model output is a softmax over the input tokens — so it will always have to point at something. We then pick the special token [CLS], commonly used by Transformer networks, as the ground truth when no attribute value is mentioned.

The [SEP] tokens might work just as well. But since their position changes depending on the input size, the fixed position zero of [CLS] is simpler.
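In terms of training labels, the encoding boils down to something like this minimal sketch (function name is hypothetical):

```python
def encode_answer(answer_start_token, answer_end_token):
    """Return (start, end) training labels for the QA head.
    When no answer exists, both labels point at position 0,
    where the [CLS] token always sits."""
    if answer_start_token is None:
        return 0, 0
    return answer_start_token, answer_end_token

print(encode_answer(6, 7))        # positive example: real span positions
print(encode_answer(None, None))  # negative example -> (0, 0)
```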


Getting the Training Data

Data is the tricky part, and the one least talked about in research papers. Let’s examine the difficulties:

Finding the Answer Span

The model is trained to find the text span containing the value for a given attribute name. When creating the training data, these spans are determined automatically by matching the first occurrence of the value provided by the seller. Problems arise when multiple attributes have the same value, which is particularly common with dimensions like width and height. When this happens, the safest option is to ignore all of the conflicting values and not train the model to extract any of them.
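A minimal sketch of this matching heuristic (names and data are illustrative, not our production code):

```python
def find_answer_spans(text, attributes):
    """Map each attribute to the character span of its value's first
    occurrence in the text. Values shared by several attributes
    (common for dimensions) are dropped entirely, since plain string
    matching cannot tell which occurrence belongs to which attribute."""
    values = list(attributes.values())
    spans = {}
    for name, value in attributes.items():
        if values.count(value) > 1:  # conflicting value -> skip it
            continue
        pos = text.find(value)
        if pos >= 0:
            spans[name] = (pos, pos + len(value))
    return spans

text = "Table top: width 80 cm, height 80 cm, depth 50 cm."
attrs = {"width": "80 cm", "height": "80 cm", "depth": "50 cm"}
print(find_answer_spans(text, attrs))  # only "depth" survives the conflict check
```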

For binary attributes, it is even worse. It is common for some electronic products to list binary properties like in the picture below, using „yes“ or „no“ in the text.

Since we rely on simple string matching, it is impossible to tell which „true“ corresponds to which „yes“. Granted, we could write some regular expressions to catch some of them, but as already mentioned, it was not worth the effort. As a result, we did not add any binary attribute to the training dataset.

Negative Examples

It is not easy to tell when a product text doesn’t mention an attribute at all: just because a seller didn’t provide some value in a structured format doesn’t mean it cannot be found in the text. Still, we needed negative examples. To create them, we considered attributes given by the seller whose value was not mentioned in the text.

This is not foolproof, though: when a value is expressed differently, it is missed. For example, if the product text contains „120 cm“ but the structured value is „1.2 m“, it would count as non-available. Still, these cases were rare in our dataset, and would affect the model recall (i.e., it would make it „learn“ to miss more attributes), not its precision.
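A minimal sketch of this heuristic, including the unit-mismatch caveat (function and field names are hypothetical):

```python
def make_negative_examples(text, attributes):
    """Attributes the seller provided whose value cannot be found in the
    text become negative ('no answer') training examples. Caveat: a value
    written differently in the text (e.g. '1.2 m' vs '120 cm') is wrongly
    treated as absent, which slightly hurts recall, not precision."""
    return [name for name, value in attributes.items() if value not in text]

text = "Solid shelf, 120 cm wide, available in oak."
attrs = {"width": "1.2 m", "color": "oak", "weight": "8 kg"}
print(make_negative_examples(text, attrs))  # "width" is missed due to the unit mismatch
```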

Attribute Imbalance

Some attributes are a lot more common than others. In any marketplace, you will find many more items with „color“ and „width“ than with „rotations per minute“. If we let the common attributes overwhelm the rare ones in the training data, the model might end up overly favoring the former. The solution here is very simple: we downsample the most frequent ones.
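The downsampling itself can be as simple as capping the number of examples per attribute, as in this sketch (the cap value is made up):

```python
import random
from collections import defaultdict

def downsample(examples, max_per_attribute, seed=0):
    """Cap the number of training examples per attribute so that frequent
    attributes like 'color' don't drown out rare ones like 'rotations
    per minute'."""
    by_attr = defaultdict(list)
    for ex in examples:
        by_attr[ex["attribute"]].append(ex)
    rng = random.Random(seed)
    result = []
    for group in by_attr.values():
        if len(group) > max_per_attribute:
            group = rng.sample(group, max_per_attribute)
        result.extend(group)
    return result

examples = [{"attribute": "color"}] * 1000 + [{"attribute": "rpm"}] * 30
balanced = downsample(examples, max_per_attribute=100)
print(len(balanced))  # 100 color + 30 rpm = 130
```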


Evaluation Difficulties

Evaluation is also tricky. Getting precise values is nearly impossible, as our automatically generated dataset is incomplete. Let’s take a look at a training example to understand the problem:

It lists the wheel size and the color present in the title, but not the number of gears. For automatic evaluation purposes, we also couldn’t tell whether the weight, brake type, or other bike attributes could be found in the title. This is fine for training because we don’t have to train the model with every single instance of an attribute. But it’s not fine if we use it for evaluation:

  • We underestimate precision: If the model finds the number of gears in the example above, it would be considered wrong.
  • We overestimate recall: If the model doesn’t find the number of gears, we wouldn’t know it missed something.

Evaluation Workarounds

One straightforward solution is to label every attribute in a product text, making sure we don’t miss anything. That is great for evaluation, but costly. Another solution is limiting ourselves to the attributes we know about, and computing precision and recall only on this subset. This is a lot simpler but biased towards the most common attributes.

During model development, the second approach is still a decent indicator of how the model improves. However, when nearing deployment, it is necessary to have more accurate numbers on the impact on the marketplace. For that, we would need more labeled data. We focused on a few product categories that would benefit from a good attribute extractor, and had people manually label every occurrence of attribute values in their descriptions. In fact, we ran the extractor before the manual work — that made the job a lot easier, as the extractor is correct most of the time.

Some of this labeled data was used for further fine-tuning, as it contained a larger share of rare attributes (rare because sellers don’t often provide them). The rest was saved for a more precise evaluation.

Results and Caveats

On the set of manually curated products above, our model had roughly 0.75 for both precision and recall — good, but not amazing. But… can we really trust these numbers? Short answer: not much. After closely examining the model outputs, we noticed that most of the apparent mistakes were in fact correct answers. These were some common discrepancies between model output and reference answer:

  • Different units (e.g., „120cm“ vs „1.2m“).
  • Missing an implicit unit (like inches or centimeters). Especially common with bike wheels or TV screen sizes.
  • Longer value description (e.g., „H2 — middle hardness“ vs „H2“).
  • Some dimensions (height/width/length) confused with one another; this also happens to humans, and it is sometimes unclear which is which.
  • Conflicting values in the product text (e.g., „Windows 7“ and „Windows 8“ as the pre-installed OS in a laptop) with no clear answer.
  • Reference value downright wrong.
  • Reference value not given when it was present in the text.
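Some of these discrepancies can be neutralized before comparing model output to the reference. Here is a minimal sketch for length units, assuming simple „number + unit“ strings (the function and unit table are illustrative):

```python
import re

UNIT_TO_CM = {"mm": 0.1, "cm": 1.0, "m": 100.0}

def normalize_length(value):
    """Parse strings like '120cm' or '1.2 m' into centimetres, so that
    unit variants compare as equal during evaluation."""
    m = re.fullmatch(r"\s*([\d.,]+)\s*(mm|cm|m)\s*", value)
    if not m:
        return value  # not a length value, compare as-is
    number = float(m.group(1).replace(",", "."))
    return round(number * UNIT_TO_CM[m.group(2)], 3)

print(normalize_length("120cm") == normalize_length("1.2 m"))  # True
```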

It turns out that the 0.75 was just a lower bound of the model performance, and a very pessimistic one. And there’s more. Recall that the model output is a softmax over the token positions in a text that contain the attribute value. By default, the model is considered to have given an answer when the softmax probability of some span is higher than the „no answer“ token, but we don’t need to stick to the default.

The above graph compares the mean softmax probability (as a proxy for model confidence) when the model produces a correct answer versus a wrong one. Correct answers come with visibly higher confidence, which means that by setting a minimum threshold of, say, 0.6, we get even more precise results.
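Applying such a threshold is a one-liner; here is a sketch with made-up predictions:

```python
def filter_by_confidence(predictions, threshold=0.6):
    """Keep only extractions whose softmax probability exceeds the
    threshold; wrong answers tend to come with lower confidence, so
    this trades a little recall for higher precision."""
    return [p for p in predictions if p["score"] >= threshold]

preds = [
    {"attribute": "weight", "value": "1.4 kg", "score": 0.93},
    {"attribute": "width",  "value": "55 cm",  "score": 0.41},
]
print(filter_by_confidence(preds))  # only the confident "weight" extraction remains
```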

Putting it to Work

The final model was trained with over 80 different attributes. Training it, including a second round of fine-tuning, took around 8 hours on an A100 GPU — compared to what you would need to fine-tune an LLM, this costs pocket change. At inference time, the extractor is part of a pipeline and runs only after the category of a product has been determined. This way, we can determine which attributes are associated with it — if we didn’t have this step, the extractor would waste a lot of time trying to find the screen size of hammers and the number of gears of TVs.
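The category gating can be pictured as a simple lookup (the mapping below is hypothetical; the real pipeline derives it from the marketplace’s category taxonomy):

```python
# Hypothetical category-to-attribute mapping for illustration only.
CATEGORY_ATTRIBUTES = {
    "lawn_mower": ["cutting width", "weight", "wattage"],
    "tv":         ["screen size", "resolution"],
}

def attributes_to_extract(category):
    """Restrict the extractor to attributes relevant for the predicted
    category, so it never looks for the screen size of a hammer."""
    return CATEGORY_ATTRIBUTES.get(category, [])

print(attributes_to_extract("tv"))  # ['screen size', 'resolution']
```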


After the extractor went live, we observed a considerable improvement in attribute fill rates, from 32% to 39% on average. Some attributes had extraordinary boosts: mattress hardness used to be available for less than 5% of mattresses, but the extractor bumped it to over 65%.

This is a great benefit for sellers and customers in the online marketplace. Customers can more easily find things they are looking for and quickly check structured information about product properties. Sellers get the benefits of improved product data without having to worry about structured data formats.

I hope my experience helps other practitioners and researchers out there, especially regarding data, data quality awareness, and proper model evaluation. As deep learning technologies evolve and model building becomes easier, properly curating data remains a big part of our work as data scientists.