Classifying Extracts: Machine Learning Models

Topic Modeling

LDA Output

To start, it may be that the prior methodology missed groups or topics present in the product names. To explore the text of these names in more depth, we tried topic modeling with Latent Dirichlet Allocation (LDA) on the uncategorized product names, which treats each product name as a document and finds topics that exist across these documents.
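A minimal sketch of this step using scikit-learn is below. The short list of names stands in for the full set of uncategorized products, and the topic count is an assumption we would tune in practice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the full list of uncategorized product names
product_names = [
    "Blue Hawaiian 1g - 1.00 gram",
    "GC Grenades",
    "Ghost Cheese 1g",
    "Natural Indica",
    "Narnia",
    "Nevil's Haze",
]

# Treat each product name as a tiny "document" of word counts
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(product_names)

# Fit LDA with an assumed number of topics
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the top words per topic to check for interpretable groupings
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")
```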

However, this did not surface any obvious new groupings that we had missed, nor did the topics line up with existing categories. Looking further at the remaining product names, no obvious patterns emerge from names like Blue Hawaiian 1g - 1.00 gram, GC Grenades, Ghost Cheese 1g, Natural Indica, Narnia, or Nevil's Haze.

Clustering

Moving beyond product names, the product characteristics may be indicative of their type. We hypothesized that price per gram, potency as measured in total THC and CBD, and which lab tests a product received might produce clusters of products that would match our text-derived categories. The lab tests are included because not all products are required to receive all tests. Only products created using solvents receive solvent testing, for example, and the need to test for certain types of bacteria or moisture may indicate other production methods and product outcomes.

We used k-means to produce clusters, eliminated some outlier values, and then ran the clustering again. These groupings did not match our text groups, and, in isolation, these characteristics do not appear sufficient to distinguish product types.
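A sketch of this two-pass clustering is below. The feature columns, cluster count, and the 1% outlier cutoff are assumptions for illustration; the synthetic frame stands in for the real product-level data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the real product-level data; column names are assumed
rng = np.random.default_rng(0)
n = 500
products = pd.DataFrame({
    "price_per_gram": rng.gamma(2.0, 12.0, n),
    "total_thc": rng.uniform(0, 90, n),
    "cbd": rng.uniform(0, 20, n),
    "solvent_test": rng.integers(0, 2, n),    # 1 if solvent tested
    "microbial_test": rng.integers(0, 2, n),  # 1 if tested for bacteria
    "moisture_test": rng.integers(0, 2, n),   # 1 if tested for moisture
})

# Scale features so price, potency, and binary indicators are comparable
X = StandardScaler().fit_transform(products)

# First pass of k-means; the cluster count is an assumption we would tune
km = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)

# Drop outliers: points farthest from their assigned cluster center
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
keep = dist < np.quantile(dist, 0.99)  # trim the extreme 1% (assumed cutoff)

# Second pass on the trimmed data
km2 = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X[keep])
```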

Note on machine learning models
While we lack absolute ground truth to train and test machine learning models, the conservativeness of the text classification, combined with consultation with researchers and regulators in Washington State, leads us to believe that those groupings are accurate enough for further modeling.

K Nearest Neighbor (KNN)

These product characteristics may still be indicative of product type if we combine them with our groupings from the text methodology to train and test a model. We selected K Nearest Neighbors because it handles classification into more than two categories.

We separated out the categorized and uncategorized products. The data were grouped at the product name level, using the average value of each feature for that product. We used the aggregated, categorized data to train and test the model. Because this is a multiclass problem, the results do not form a standard two-by-two table; we instead calculated a concordance rate: a measure of how well the predicted categories matched our text-derived categories. The results were fairly satisfactory, matching our text groupings over 80% of the time.
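A sketch of this step is below, assuming `categorized` and `uncategorized` are product-name-level DataFrames from the aggregation described above; the feature and label column names are hypothetical.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumed column names; `categorized` carries the text-derived label
feature_cols = ["avg_thc", "avg_cbd", "avg_price_per_gram",
                "solvent_test", "microbial_test", "moisture_test"]

X_train, X_test, y_train, y_test = train_test_split(
    categorized[feature_cols], categorized["text_category"],
    test_size=0.25, random_state=42)

# Scale features so distance is not dominated by the largest-valued column
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

# Concordance rate: share of held-out products whose predicted category
# matches the text-derived category (plain multiclass accuracy)
print(f"Concordance: {accuracy_score(y_test, knn.predict(X_test)):.1%}")

# Apply the fitted model to the previously uncategorized products
uncategorized["predicted_category"] = knn.predict(uncategorized[feature_cols])
```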

We then applied this model to the previously uncategorized extract products and were left with only 3.9% remaining uncategorized, primarily because of missing or NA product names.

Product classification results

Limitations

No ground truth
The greatest challenge with classifying extracts is that there are no pre-existing extract categories. Through consultation with subject matter experts in Washington and at RAND, as well as observation of trends in the data, we were able to identify categories that are reasonable proxies for ground truth. However, there may be categories we have not identified that should exist, and adding them could change the results of the classification.

KNN model run on aggregated data
The KNN model runs on the average THC, CBD, and price per gram for each product name, as sketched below. This was done for two reasons. First, without aggregating, the dataset passed through the model is very large and requires substantial processing power. Second, not every product receives a lab test, so running the model on each product sold would include many missing values. However, at least one test exists for most product names. Aggregating on product name reduces the missingness of the lab features and increases the accuracy of the model.
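A sketch of the aggregation in pandas, assuming a sales-level DataFrame `sales` with hypothetical column names:

```python
# Assumed sales-level frame: one row per item sold, with lab values missing
# when that particular sale's product was not tested
agg = (sales.groupby("product_name")
            .agg(avg_thc=("total_thc", "mean"),
                 avg_cbd=("cbd", "mean"),
                 avg_price_per_gram=("price_per_gram", "mean"))
            .reset_index())
# pandas `mean` skips NaN, so any product name with at least one tested
# sale gets a defined average, sharply reducing missingness
```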

Used binary lab test indicators
These models were run using an indicator of whether or not a test was performed, not the value the sample received on the test itself. This was done in part because of processing constraints and to avoid missingness in the way we structured the KNN approach. A different model might be able to incorporate both whether or not a test occurred and the actual test value.
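One way to build such indicators, assuming a `lab_results` DataFrame with one row per test performed (column names hypothetical):

```python
# Assumed columns: product_name, test_type (e.g. "solvent", "microbial")
indicators = (lab_results.assign(performed=1)
              .pivot_table(index="product_name", columns="test_type",
                           values="performed", aggfunc="max", fill_value=0)
              .add_suffix("_test")
              .reset_index())
# Each indicator is 1 if that test was ever performed for the product
# name and 0 otherwise; the recorded test values are deliberately dropped
```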