Classifying Extracts: Product Name Text

Exploring Keyword Search

Our first pass at categorizing inhalants searches product names for strings that indicate a product type. This method classifies 65.6% of the “Marijuana Extract for Inhalation” products sold in the dispensing file, which corresponds to 66.7% of inhalant product names.

This table shows how many products fit each classification and the strings used for each. (The “hash” string also matches “hashish”, which was not searched for separately.) The percentages do not sum to 100% because 9.3% of product names contain more than one string. Counts are of products appearing in the dispensing (retail) file, not of distinct product names.
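As a minimal sketch of how such a table could be computed, the R code below flags which search strings match each product name, then derives the share of names with at least one match and the share with more than one. The product names and the `keywords` vector are made up for illustration; they are not the dispensing data or the final search strings. The shares here are also unweighted, whereas the counts above are weighted by products sold.

```r
# Hypothetical product names, for illustration only
product_names <- c("Blue Dream Vape Cartridge", "Hash Oil 1g", "OG Shatter",
                   "Kief 0.5g", "Sour Diesel 1g")

# Illustrative search strings (regular expressions), one per classification
keywords <- c(cartridge = "cart|vap", oil = "oil", hash = "hash",
              kief = "kief", shatter = "shatter")

# Logical matrix: one row per product name, one column per keyword
hits <- sapply(keywords, function(pattern) {
  grepl(pattern, product_names, ignore.case = TRUE)
})

n_matches <- rowSums(hits)

share_classified <- mean(n_matches >= 1)  # names matching at least one string
share_multiple   <- mean(n_matches > 1)   # names matching more than one string
keyword_counts   <- colSums(hits)         # number of names matching each string
```

Names that count toward `share_multiple` are the ones that need an explicit rule about which keyword takes priority.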

While we could do a more thorough analysis of all the overlapping keywords, for this preliminary pass we look only at “cartridges” and “oil” to see what the overlap for these keywords looks like.

Overlap: Cartridges

A cursory look at the product names suggested that products categorized as “cartridges” often match other keywords as well. This table is restricted to cartridges and shows which additional keywords they match, along with each combination's share of all inhalants in the market and its share of cartridges in the market.

Overall, 37.02% of cartridges match other keywords, the most common being “CO2” at 17.28% of cartridges, followed by “oil” at 6.56%. We will have to decide whether to categorize these products as “cartridge” first, as oil or BHO, or as a combined category such as “BHO cartridge” as opposed to “General cartridge”. The same judgments are needed for the other subcategories of cartridges, though fewer products in the market carry those combinations of keywords.
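An overlap table of this kind can be sketched as follows. The helper `overlapWith`, the product names, and the `keywords` vector are all hypothetical and serve only to illustrate the calculation: among names matching one base keyword, what share also matches each of the other keywords.

```r
# Hypothetical product names, for illustration only
product_names <- c("CO2 Oil Cartridge", "Vape Cartridge 0.5g", "CO2 Cartridge",
                   "Hash Oil", "Shatter 1g")

# Illustrative search strings, not the final ones
keywords <- c(cartridge = "cart|vap", oil = "oil", co2 = "co2", hash = "hash")

# Logical matrix: one row per product name, one column per keyword
hits <- sapply(keywords, function(pattern) {
  grepl(pattern, product_names, ignore.case = TRUE)
})

overlapWith <- function(hits, base) {
  # Among rows matching `base`, the share that also match each other keyword
  in_base <- hits[, base]
  others  <- setdiff(colnames(hits), base)
  colMeans(hits[in_base, others, drop = FALSE])
}

cartridge_overlap <- overlapWith(hits, "cartridge")

# Share of cartridge names that match any additional keyword
others <- setdiff(colnames(hits), "cartridge")
any_overlap <- mean(rowSums(hits[hits[, "cartridge"], others, drop = FALSE]) > 0)
```

Calling `overlapWith(hits, "oil")` gives the corresponding breakdown for names containing “oil”.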

Overlap: Oil

Products categorized as “oil” also appear to match other keywords. This table is restricted to products that have “oil” in the name and shows which additional keywords they match, along with the percentage of the inhalant market and the percentage of “oil” products that also contain each keyword.

It appears that 41.78% of the oils are also listed as cartridges, which is not surprising. It might be best to categorize a product as “cartridge” first, and as “oil” only if it is not a cartridge. A further 8.44% of “oils” match other keywords, which we could explore to decide how they should be categorized. For example, 4.01% of products that contain the string “oil” also contain the string “dab”, and 2.92% of oil product names also contain the string “hash”.

Implementing Text Rules

Because we do not have absolute ground truth for these classifications, we want to err on the side of caution in implementing these rules. For the most part, if a product name matches two keywords, we move it to “Uncategorized” and then try other methods of classification. The main exception is that we categorize “Hash Oil” as oil. In addition, we generally use “dab” as a search string to pick up products that are used for dabbing, but if a name contains both “wax” and “dab”, it is classified as wax.

Visual explanation of search string hierarchy

Using this stricter method, we get the following product distribution and are able to classify 65.4% of extracts.

Functionalizing

Because the objective of this categorization is to inform other research, this sequence has been formalized into functions that can be called from other analyses. There are two functions below: one that assigns each product name to one of the 9 distinct product categories, and one that maps those categories to the generalized groups Cartridge/Oil, Wax/Shatter/Resin, Hash/Kief, and Uncategorized. (There is also another function, not included here, that breaks oils and cartridges out separately.)

The specific search strings are also visible here. They were determined iteratively by looking at product names and phrases, consulting retail websites and domain researchers, and then going back to re-examine the uncategorized products. Keeping the strings in variables outside the function also gives some flexibility, since they can be easily modified as needed.

#create search terms for each product type
cartridge.strings <- "cart|vap|vc|pen|refill|juju|joint|atomizer"
oil.strings <- "oil|rso|eso"
hash.strings <- "hash"
kief.strings <- "kief|keif"
wax.strings <- "wax|crumble|budder"
shatter.strings <- "shatter|snap"
dab.strings <- "dab"
resin.strings <- "resin|rosin"

categorizeNames <- function(productName){
  #' Takes product name and categorizes it into a 
  #' product category type
  #' @param productName  A single inhalant product name, as a string
  #' @return The product category for that name, as a string.
  
  # first check for cartridges. allow for oil to be in product name so that 
  # for example "oil cartridge" will be classified as a cartridge
  if(grepl( cartridge.strings, productName, ignore.case = T) == TRUE & 
     # allows for dab and oil strings
     grepl(paste(hash.strings, kief.strings, wax.strings,shatter.strings, 
                 resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Cartridge")
  }
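  # now check for oil products. allow for hash to also be in product name so that "hash oil" is classified as oil
  else if(grepl(oil.strings, productName, ignore.case = T) == TRUE & 
          # allows for dab strings and hash strings: hash.strings is left out of the
          # exclusion list so that "hash oil" is classified as oil. cartridge.strings is
          # excluded so that cartridge names rejected above stay uncategorized
     grepl(paste(cartridge.strings, kief.strings, wax.strings,shatter.strings, resin.strings, sep = "|"), 
           productName, ignore.case = T) == FALSE) {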
    return("Oil")
  }
  # check for hash products. Allow for no overlap with other products except "dab"
  else if(grepl( hash.strings, productName, ignore.case = T) == TRUE & 
          grepl(paste(cartridge.strings, kief.strings,oil.strings, wax.strings,shatter.strings, 
                      resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Hash")
  }
  # check for kief products. Allow for no overlap with other products except "dab"
  else if(grepl( kief.strings, productName, ignore.case = T) == TRUE & 
          grepl(paste(cartridge.strings, hash.strings,oil.strings, wax.strings,shatter.strings, 
                      resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Kief")
  }
  # check for wax products. Allow for no overlap with other products except "dab"
  else if (grepl( wax.strings, productName, ignore.case = T) == TRUE & 
           # dab.strings is not excluded, so a name with both "wax" and "dab" is classified as wax
           grepl(paste(cartridge.strings, kief.strings,oil.strings, hash.strings,shatter.strings, 
                       resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Wax")
  }
  #check for shatter products. Allow for no overlap with other products except "dab"
  else if (grepl( shatter.strings, productName, ignore.case = T) == TRUE & 
           grepl(paste(cartridge.strings, kief.strings,oil.strings, wax.strings, hash.strings, 
                       resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Shatter")
  }
  #check for resin products. Allow for no overlap with other products except "dab"
  else if (grepl( resin.strings, productName, ignore.case = T) == TRUE & 
           grepl(paste(cartridge.strings, kief.strings,oil.strings, wax.strings,shatter.strings, 
                       hash.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Resin")
  }
  #check for dab products. Allow for no overlap with other products
  else if (grepl( dab.strings, productName, ignore.case = T) == TRUE & 
           grepl(paste(cartridge.strings, kief.strings,oil.strings, wax.strings,shatter.strings, hash.strings, 
                       resin.strings, sep = "|"), productName, ignore.case = T) == FALSE) {
    return("Dab")
  }
  else return("Uncategorized")
}

groupProductTypes <- function(productType){
  #' Takes product type and categorizes it into a 
  #' product category grouping
  #' @param productType  A string of inhalant product type
  #' @return A grouped usage of the product type as a string.
  
  if(productType=="Cartridge" | productType=="Oil") {
    return("Cartridge/Oil")
  }
  else if(productType=="Hash" | productType=="Kief") {
    return("Hash/Kief")
  }
  else if(productType=="Wax" | productType=="Shatter" | productType=="Dab" | productType=="Resin") {
    return("Wax/Shatter/Resin")
  }
  else return("Uncategorized")
}
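
Both functions take a single string, so applying them to a whole column of product names needs an explicit loop or an apply-style call. The sketch below shows one way to do that with `vapply()`. To keep the example runnable on its own it uses a toy stand-in classifier, `toyCategorize`, and a made-up `products` data frame; neither is part of the analysis, and in practice `categorizeNames` (and then `groupProductTypes`) would take the stand-in's place.

```r
# Toy stand-in for a scalar classifier such as categorizeNames(); it exists
# only so that this example runs on its own
toyCategorize <- function(productName) {
  if (grepl("cart|vap", productName, ignore.case = TRUE)) {
    "Cartridge"
  } else if (grepl("wax", productName, ignore.case = TRUE)) {
    "Wax"
  } else {
    "Uncategorized"
  }
}

# Hypothetical data frame of product names
products <- data.frame(name = c("Vape Cartridge", "Honey Wax", "Mystery 1g"),
                       stringsAsFactors = FALSE)

# vapply() calls the scalar function once per name and checks that each
# result is a single character string
products$type <- vapply(products$name, toyCategorize, character(1),
                        USE.NAMES = FALSE)

# Share of names falling in each category
type_shares <- prop.table(table(products$type))
```

`vapply()` is used rather than `sapply()` because it fails loudly if the classifier ever returns something other than one string per name.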