In business settings, narrowly specialized tasks come up regularly that require a special approach because they do not fit into the standard flow of data processing and model building. One such task is the classification of new products in the master data management (MDM) process.

Example 1

You work at a large company (a supplier) engaged in the production and/or sale of products, including through wholesale intermediaries (distributors). Your distributors are often obliged (to the company you work for) to regularly report on their own sales of your products – so-called sell-out reporting. Distributors are not always able to report sold products using your company's product codes; more often they use their own codes and their own product names, which differ from the names in your system. Accordingly, your database needs a table that matches distributor product codes to the product codes of your accounting system. The more distributors there are, the more variations of the name of the same product appear. With a large assortment portfolio this becomes a problem, which is solved by labor-intensive manual maintenance of such matching tables whenever new name variations of existing products arrive.

If we treat the names of such products as document texts, and the codes of your accounting system (to which these variations are tied) as classes, we get a multi-class text classification task. The matching table (which operators maintain manually) can be treated as a training sample, and a classification model built on it could reduce the effort operators spend classifying the stream of new names of existing products. However, the classic approach of working with the text “as is” will not save you here, as discussed below.
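A purely hypothetical sketch of what such a matching table might look like (the codes and spellings below are invented for illustration; this is not data from the package used later):

library(tibble)
# one supplier product code (the class) maps to many distributor spellings (the documents)
matching <- tribble(
  ~SupplierCode, ~DistributorName,
  "P001",        "Chablis Dom. Christian Moreau 0,75L",
  "P001",        "DOM.CHRISTIANmoreau0,75LChablis",
  "P002",        "Pinot Noir Baron de Poce 0.75"
)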

Example 2

Your company's database receives product sales (or price) data from external analytical (marketing) agencies or from parsing third-party sites. The same product from each data source will again contain spelling variations. In this example the task can be even harder than in Example 1, because your company's business users often need to analyze not only your own products but also the assortment of your direct competitors, so the number of classes (reference products) to which variations are tied grows sharply.

What is the specificity of such a class of tasks?

First, there are a lot of classes (in fact, as many classes as you have products), and if you have to work not only with the company's own products but also with competitors', new classes can appear every day – so it makes no sense to train a model once and then reuse it repeatedly to predict new products.

Secondly, the number of documents (different variations of the same product) per class is far from balanced: a class may contain a single document, or it may contain many more.

Why does the classic approach to multi-class text classification work poorly?

Consider the shortcomings of the classic text processing approach step by step:

In such tasks there are no stop words in the generally accepted sense of any text processing package.

In classic packages, out of the box, the division of text into words relies on punctuation or spaces. In this class of tasks (where the length of the input text field is often limited), it is common to receive product names without spaces, where words are separated not explicitly but only visually – by a change of letter case, by digits, or by a switch to another language. How will out-of-the-box tokenization in your favorite programming language cope with the wine name “Dom.CHRISTIANmoreau0,75LPeLtr.EtFilChablis”? (Unfortunately, this is not a joke.)
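As a quick check (my own illustration, not part of the original workflow), here is what a standard word tokenizer makes of this name; the exact token set depends on the tokenizer, but transitions of letter case and letter-digit junctions are not treated as word boundaries, so the glued pieces stay glued:

library(tidytext)
library(dplyr)
# out-of-the-box tokenization: only whitespace/punctuation-style boundaries are used
tibble(rawName = "Dom.CHRISTIANmoreau0,75LPeLtr.EtFilChablis") %>%
  unnest_tokens(output = word, input = rawName)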

Product names are not text in the classic sense of the task (news articles, service reviews, newspaper headlines) from which a suffix can be isolated and discarded. Product names are full of abbreviations and truncated words for which it is unclear how such a suffix could be isolated. There are also brand names from other language groups (for example, French or Italian brands mixed in) that do not yield to normal stemming.

Often, when building a “document-term” matrix, your language's package offers to reduce the sparsity of the matrix by removing words (matrix columns) whose frequency is below some minimum threshold. In classic tasks this really does help improve quality and reduce the overhead of training the model. But not in tasks like this one. As noted above, the distribution of classes is strongly unbalanced – a class may easily be represented by a single product name (for example, a rare and expensive brand that was sold for the first time and so far appears only once in the training sample). With the classic sparsity-reduction approach we degrade the quality of the classifier.
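For reference, a sketch of that classic recipe with the tm package (the toy corpus below is invented for illustration):

library(tm)
# three "documents": two spellings of a common product and one rare product seen only once
docs <- c("melot mouvedre les oliviers vin pays",
          "melot mouvedre les oliviers vin pays",
          "rare expensive brand sold once")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
# the classic recipe: drop terms absent from more than ~60% of documents
dtm.reduced <- removeSparseTerms(dtm, sparse = 0.6)
Terms(dtm.reduced) # the rare product's words are gone - and with them the only evidence for that class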

Usually some model is trained on the texts (LibSVM, a naive Bayes classifier, neural networks, or something else) and is then used repeatedly. Here, however, new classes can appear daily and a class may consist of a single document. Therefore it makes no sense to spend a long time training one large model – any algorithm with online learning is enough, for example a kNN classifier with one nearest neighbor.

Next, we will compare classification with the traditional approach against classification based on the proposed package. We will use tidytext as an auxiliary package.

Case example

#devtools::install_github(repo = 'https://github.com/edvardoss/abbrevTexts')
library(abbrevTexts)
library(tidytext) # text processing
library(dplyr) # data processing
library(stringr) # data processing
library(SnowballC) # traditional stemming approach
library(tm) # needed only for tidytext internals

The package includes two datasets with wine names: the original names from external data sources – “rawProducts” – and the unified names written according to the company's master data standards – “standardProducts”. The rawProducts table contains many spelling variations of the same product; these variations are reduced to a single product in standardProducts through a many-to-one relationship on the “StandartId” key column. P.S. The variations in the “rawProducts” table were generated programmatically, but as close as possible to how product names arrive from various external sources in my experience (although in places I may have overdone it).

data(rawProducts, package = 'abbrevTexts')
head(rawProducts)
## # A tibble: 6 x 2
##   StandartId rawName                                               
##        <dbl> <chr>                                                 
## 1          1 MELOT   mouvedr.Les olivier.vinPAY.                   
## 2          1 melotMouvedr.LesOlivier.VinPay.                       
## 3          2 Carigna. grenac.les betes   Rousses igp pay.   HERAULT
## 4          2 carign. GrenacheLesBetes / RoussesIGP Pays -  HERAU.  
## 5          2 Carignan  grenac. Bete. les rousse.igp PAY.   HERAULT 
## 6          3 Petit / chat   Rouge -Vin  pay.
data(standardProducts, package = 'abbrevTexts')
head(standardProducts)
## # A tibble: 6 x 3
##   StandartId StandardName                                                        WineColor
##        <dbl> <chr>                                                               <chr>    
## 1          1 Melot/Mouvedre, Les Oliviers, Vin de Pays d'Oc - France / Red Wines Red Wines
## 2          2 Carignan/Grenache, Les Betes Rousses, IGP Pays d'Herault - France ~ Red Wines
## 3          3 Le Petit Chat Rouge, Vin de Pays d'Oc - France / Red Wines          Red Wines
## 4          4 Pinot Noir, Baron de Poce, Pierre Chainier, Loire Valley - France ~ Red Wines
## 5          5 Gamay, Uva Non Grata, Vin de France - France / Red Wines            Red Wines
## 6          6 58 Guineas Claret, Bordeaux - France / Red Wines                    Red Wines

Train and test split

set.seed(1234)
trainSample <- sample(x = seq(nrow(rawProducts)),size = .9*nrow(rawProducts))
testSample <- setdiff(seq(nrow(rawProducts)),trainSample)
testSample
##  [1]   1   5   7   8   9  11  32  37  44  45  46  48  68  69  82 110 113 119 128 179 187
## [22] 190 191 194 202 213 223 241 256 260 268 271 272 283 288 292 309 344 351 376 395 407

Create dataframes for 'no stemming mode' and 'traditional stemming mode'

df <- rawProducts %>% mutate(prodId=row_number(), 
                             rawName=str_replace_all(rawName,pattern = '\\.','. ')) %>% 
  unnest_tokens(output = word,input = rawName) %>% count(StandartId,prodId,word)

df.noStem <- df %>% bind_tf_idf(term = word,document = prodId,n = n)

df.SnowballStem <- df %>% mutate(wordStm=SnowballC::wordStem(word)) %>% 
  bind_tf_idf(term = wordStm,document = prodId,n = n)

Create document terms matrix

dtm.noStem <- df.noStem %>% 
  cast_dtm(document = prodId,term = word,value = tf_idf) %>% data.matrix()

dtm.SnowballStem <- df.SnowballStem %>% 
  cast_dtm(document = prodId,term = wordStm,value = tf_idf) %>% data.matrix()

Create knn model for 'no stemming mode' and calculate accuracy

knn.noStem <- class::knn1(train = dtm.noStem[trainSample,],
                          test = dtm.noStem[testSample,],
                          cl = rawProducts$StandartId[trainSample])
mean(knn.noStem==rawProducts$StandartId[testSample])
## [1] 0.4761905

accuracy knn.noStem: 0.4761905 (47%)

Create knn model for 'stemming mode' and calculate accuracy

knn.SnowballStem <- class::knn1(train = dtm.SnowballStem[trainSample,],
                               test = dtm.SnowballStem[testSample,],
                               cl = rawProducts$StandartId[trainSample])
mean(knn.SnowballStem==rawProducts$StandartId[testSample])
## [1] 0.5

accuracy knn.SnowballStem: 0.5 (50%)

abbrevTexts primer

Below is an example on the same data, but using the functions from the abbrevTexts package.

Separating words by case

df <- rawProducts %>% mutate(prodId=row_number(), 
                             rawNameSplitted= makeSeparatedWords(rawName)) %>% 
        unnest_tokens(output = word,input = rawNameSplitted)
print(df)
## # A tibble: 2,376 x 4
##    StandartId rawName                             prodId word   
##         <dbl> <chr>                                <int> <chr>  
##  1          1 MELOT   mouvedr.Les olivier.vinPAY.      1 melot  
##  2          1 MELOT   mouvedr.Les olivier.vinPAY.      1 mouvedr
##  3          1 MELOT   mouvedr.Les olivier.vinPAY.      1 les    
##  4          1 MELOT   mouvedr.Les olivier.vinPAY.      1 olivier
##  5          1 MELOT   mouvedr.Les olivier.vinPAY.      1 vin    
##  6          1 MELOT   mouvedr.Les olivier.vinPAY.      1 pay    
##  7          1 melotMouvedr.LesOlivier.VinPay.          2 melot  
##  8          1 melotMouvedr.LesOlivier.VinPay.          2 mouvedr
##  9          1 melotMouvedr.LesOlivier.VinPay.          2 les    
## 10          1 melotMouvedr.LesOlivier.VinPay.          2 olivier
## # ... with 2,366 more rows

As you can see, the text was tokenized correctly: not only transitions between upper and lower case in words written together are taken into account, but also punctuation marks between words written together without spaces.

Creating a stemming dictionary based on a training sample of words

After a long search among different stemming implementations, I came to the conclusion that traditional methods based on language rules are not suitable for such specific tasks, so I had to look for my own approach. The solution I ended up with amounts to unsupervised learning and is insensitive to the language of the text or to how strongly the words in the training sample are abbreviated.

The function takes as input a vector of words, the minimum word length for the training sample, and the minimum share for considering a child word an abbreviation of a parent word, and then does the following:

  1. Discard words shorter than the set threshold
  2. Discard words consisting of digits
  3. Sort the words in descending order of their length
  4. For each word in the list:
     4.1 find shorter words from the sample that are the beginning (a prefix) of the current word;
     4.2 keep the pairs in which the child word is at least min.share of the parent word's length, obtaining a parent-child table.

Let's say that we fix min.share = 0.7. At this intermediate stage (4.2) we get a parent-child table in which examples like this can be found:

##    parent child
## 1 bodegas bodeg
## 2   bodeg  bode

Note that each row satisfies the condition that the child word is no shorter than 70% of the length of the parent word.
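A minimal sketch of what this check amounts to (my own illustration, not the internals of makeAbbrStemDict), assuming a child must be a prefix of its parent:

is_abbreviation <- function(parent, child, min.share = 0.7) {
  startsWith(parent, child) && nchar(child) >= min.share * nchar(parent)
}
is_abbreviation("bodegas", "bodeg") # TRUE:  a prefix, and 5/7 >= 0.7
is_abbreviation("bodegas", "bod")   # FALSE: a prefix, but 3/7 < 0.7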

However, some of the pairs found cannot be considered word abbreviations, because in them different parents are reduced to the same child, for example:

##    parent child
## 1 bodegas bodeg
## 2 bodegue bodeg

For such cases my function keeps only one of the pairs.

Let's go back to the example with unambiguous word abbreviations:

##    parent child
## 1 bodegas bodeg
## 2   bodeg  bode

But looking a little more closely, we see that the word 'bodeg' is shared by these two pairs, and this allows us to connect them into one chain of abbreviations without violating our initial condition on word length for one word to be considered an abbreviation of another:

bodegas->bodeg->bode

So we arrive at a table of the form:

##    parent child terminal.child
## 1 bodegas bodeg           bode
## 2   bodeg  bode           bode

Such chains can be of arbitrary length, and the found pairs can be assembled into such chains recursively. This brings us to the final stages, which determine the terminal child for every member of a constructed chain of abbreviations (a small sketch of this recursive collapse follows the list):

  5. Recursively iterate through the found pairs to determine the final (terminal) child for all members of the chains
  6. Return the abbreviation dictionary
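A small sketch of how this collapse can be done (again my own illustration, not the package code):

# follow child links until we reach a word that is no longer anyone's parent
pairs <- data.frame(parent = c("bodegas", "bodeg"),
                    child  = c("bodeg",   "bode"),
                    stringsAsFactors = FALSE)
terminal_child <- function(w, pairs) {
  repeat {
    nxt <- pairs$child[match(w, pairs$parent)]
    if (is.na(nxt)) return(w)
    w <- nxt
  }
}
sapply(pairs$parent, terminal_child, pairs = pairs) # both resolve to "bode"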

The makeAbbrStemDict function is automatically parallelized across several threads and loads all processor cores, so it is worth keeping this in mind for large volumes of text.

abrDict <- makeAbbrStemDict(term.vec = df$word,min.len = 3,min.share = .6)
abrDict <- as_tibble(abrDict)
print(abrDict) # We can see parent word, intermediate results and total result (terminal child)
## # A tibble: 408 x 3
##    parent    child    terminal.child
##    <chr>     <chr>    <chr>         
##  1 abruzz    abruz    abruz         
##  2 absoluto  absolut  absolu        
##  3 absolut   absolu   absolu        
##  4 aconcagua aconcagu aconcag       
##  5 aconcagu  aconcag  aconcag       
##  6 agricola  agricol  agrico        
##  7 agricol   agrico   agrico        
##  8 alabastro alabast  alabast       
##  9 albarino  albarin  albari        
## 10 albarin   albari   albari        
## # ... with 398 more rows

Having the stemming dictionary as a table is also convenient because individual stemming rows can be removed selectively and in a simple way, in the “dplyr” paradigm.

Let's say that we want to exclude the parent word “abruzz” and the terminal child group “absolu” from the stemming dictionary:

abrDict.reduced <- abrDict %>% filter(parent!='abruzz',terminal.child!='absolu')
print(abrDict.reduced)
## # A tibble: 405 x 3
##    parent     child     terminal.child
##    <chr>      <chr>     <chr>         
##  1 aconcagua  aconcagu  aconcag       
##  2 aconcagu   aconcag   aconcag       
##  3 agricola   agricol   agrico        
##  4 agricol    agrico    agrico        
##  5 alabastro  alabast   alabast       
##  6 albarino   albarin   albari        
##  7 albarin    albari    albari        
##  8 alentejano alentejan alentejan     
##  9 alianca    alianc    alian         
## 10 alianc     alian     alian         
## # ... with 395 more rows

Compare the simplicity and clarity of this solution with what is offered on Stack Overflow:

Text-mining with the tm-package – word stemming

Stemming with the abbreviation dictionary

df.AbbrStem <- df %>% left_join(abrDict %>% select(parent,terminal.child),by = c('word'='parent')) %>% 
    mutate(wordAbbrStem=coalesce(terminal.child,word)) %>% select(-terminal.child)
print(df.AbbrStem)
## # A tibble: 2,376 x 5
##    StandartId rawName                             prodId word    wordAbbrStem
##         <dbl> <chr>                                <int> <chr>   <chr>       
##  1          1 MELOT   mouvedr.Les olivier.vinPAY.      1 melot   melo        
##  2          1 MELOT   mouvedr.Les olivier.vinPAY.      1 mouvedr mouvedr     
##  3          1 MELOT   mouvedr.Les olivier.vinPAY.      1 les     les         
##  4          1 MELOT   mouvedr.Les olivier.vinPAY.      1 olivier olivie      
##  5          1 MELOT   mouvedr.Les olivier.vinPAY.      1 vin     vin         
##  6          1 MELOT   mouvedr.Les olivier.vinPAY.      1 pay     pay         
##  7          1 melotMouvedr.LesOlivier.VinPay.          2 melot   melo        
##  8          1 melotMouvedr.LesOlivier.VinPay.          2 mouvedr mouvedr     
##  9          1 melotMouvedr.LesOlivier.VinPay.          2 les     les         
## 10          1 melotMouvedr.LesOlivier.VinPay.          2 olivier olivie      
## # ... with 2,366 more rows


TF-IDF for stemmed words

df.AbbrStem <- df.AbbrStem %>% count(StandartId,prodId,wordAbbrStem) %>% 
  bind_tf_idf(term = wordAbbrStem,document = prodId,n = n)
print(df.AbbrStem)
## # A tibble: 2,289 x 7
##    StandartId prodId wordAbbrStem     n    tf   idf tf_idf
##         <dbl>  <int> <chr>        <int> <dbl> <dbl>  <dbl>
##  1          1      1 les              1 0.167  2.78  0.463
##  2          1      1 melo             1 0.167  4.94  0.823
##  3          1      1 mouvedr          1 0.167  5.34  0.891
##  4          1      1 olivie           1 0.167  4.25  0.708
##  5          1      1 pay              1 0.167  3.15  0.525
##  6          1      1 vin              1 0.167  2.78  0.463
##  7          1      2 les              1 0.167  2.78  0.463
##  8          1      2 melo             1 0.167  4.94  0.823
##  9          1      2 mouvedr          1 0.167  5.34  0.891
## 10          1      2 olivie           1 0.167  4.25  0.708
## # ... with 2,279 more rows

Create document terms matrix

dtm.AbbrStem <- df.AbbrStem %>% 
  cast_dtm(document = prodId,term = wordAbbrStem,value = tf_idf) %>% data.matrix()

Create knn model for 'abbrevTexts mode' and calculate accuracy

knn.AbbrStem <- class::knn1(train = dtm.AbbrStem[trainSample,],
                                test = dtm.AbbrStem[testSample,],
                                cl = rawProducts$StandartId[trainSample])
mean(knn.AbbrStem==rawProducts$StandartId[testSample]) 
## [1] 0.8333333

accuracy knn.AbbrStem: 0.8333333 (83%)

As you can see, we have achieved a significant improvement in classification quality on the test sample. tidytext is a convenient package for a small corpus of texts, but in the case of a large corpus the “abbrevTexts” package is also perfectly suitable for preprocessing and normalization, and in such specific tasks it usually gives better accuracy than the traditional approach.

Link to github: https://github.com/edvardoss/abbrevTexts