Classifying Marijuana Extracts for Inhalation

In July 2014, the State of Washington opened a legal marijuana market that allowed the legal growth and production of marijuana and marijuana-related products, as well as the sale of products for recreational use, ranging from typical flower products to edibles, medical products, oils, and cartridges for vaporizing. The system includes a database that tracks each product from the grower down the supply chain to the consumer. The database is managed by the Washington State Liquor and Cannabis Board and is openly available through their website.

When Washington State legalized cannabis for recreational use and set up a regulated market in 2014, it also instituted a tracking system that monitors product from "seed to sale". These records give regulators and researchers the opportunity to understand patterns in such a market in much greater detail than before.

Some things that matter for understanding trends in the market, however, are not captured directly in the data. One such area of interest is the category of retail products called "Marijuana Extract for Inhalation." This category includes a number of concentrate products, like wax, shatter, and oils, which are sometimes packaged as cartridges for vape pens, as well as hash and kief. These subcategories are not recorded in the data. This analysis explores a few options for disaggregating this product group into subcategories; the options could be used alone, in conjunction with each other, or as consistency checks on one another.


Why Classify Extracts?

While the recreational market is steadily growing overall, extract products appear to be the fastest-growing segment. Researchers and regulators are interested in learning more about the patterns of these products, especially given this new opportunity to study them with specific, transaction-level data.

Understanding the Data

Two and a half years of transactions have produced 25 gigabytes of data, stored in 28 tables, capturing transactions along the supply chain. This analysis focuses on the product names and product features of the subset known as "Extracts for Inhalation."


Option 1: Classify by product name

First, we used the productname variable and search strings to classify products. We chose the search strings through an iterative process: reviewing which words and phrases showed up often in unclassified products, and doing domain research to understand which strings indicate which product types and which product breakdowns make sense.
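As a rough illustration of this kind of rule-based matching, here is a minimal Python sketch. The keyword lists, the category labels, and the function names classify_by_name and coverage are assumptions made for this example; the actual search strings used in the analysis are not listed here. The rules are checked in order, so the first match wins.

```python
import re

# Hypothetical keyword rules (not the analysis's actual search strings).
# Rules are checked in order; the first category whose pattern matches wins.
RULES = [
    ("cartridge", re.compile(r"cartridge|vape", re.IGNORECASE)),
    ("wax", re.compile(r"wax", re.IGNORECASE)),
    ("shatter", re.compile(r"shatter", re.IGNORECASE)),
    ("hash", re.compile(r"hash", re.IGNORECASE)),
    ("kief", re.compile(r"kief", re.IGNORECASE)),
    ("oil", re.compile(r"\boil\b", re.IGNORECASE)),
]


def classify_by_name(productname):
    """Return the first matching category, or None if no rule matches."""
    if not productname:
        return None
    for category, pattern in RULES:
        if pattern.search(productname):
            return category
    return None


def coverage(names):
    """Share of product names that at least one rule classifies."""
    if not names:
        return 0.0
    classified = sum(1 for name in names if classify_by_name(name) is not None)
    return classified / len(names)
```

Reviewing the names for which classify_by_name returns None is one way to drive the iterative process described above: frequent words in that leftover set are candidates for new rules, and coverage gives a quick measure of progress after each round.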

Option 2: Classify by producer's type

Another option is to use the inventoryproducttype variable associated with a given retail product one step up the supply chain: either the type recorded when the processor had the product in inventory, or the type provided to the lab when a sample was tested. In some cases the type is simply listed as "marijuana extract for inhalation" or is missing, which does not help us. Still, this method should let us classify more of the uncategorized products and also serve as a check against our search-string method.
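The sketch below shows one way the upstream type could be used both as a fallback and as a consistency check. It is written in Python and rests on several assumptions: the function names normalize_type and combine and the status labels are hypothetical, upstream types are assumed to use the same labels as the name-based categories, and keeping the name-based category when the two sources disagree is an arbitrary choice made for the example.

```python
# Values that carry no subcategory information: the generic label or an empty type.
UNINFORMATIVE = {"", "marijuana extract for inhalation"}


def normalize_type(inventoryproducttype):
    """Lowercase and strip an upstream type; return None if it is uninformative."""
    if inventoryproducttype is None:
        return None
    cleaned = inventoryproducttype.strip().lower()
    return None if cleaned in UNINFORMATIVE else cleaned


def combine(name_category, inventoryproducttype):
    """Merge a name-based category with the upstream type.

    Returns (category, status), where status is one of
    'agree', 'conflict', 'name_only', 'type_only', or 'unclassified'.
    """
    upstream = normalize_type(inventoryproducttype)
    if name_category and upstream:
        status = "agree" if name_category == upstream else "conflict"
        # Assumption for this sketch: keep the name-based category on conflict.
        return name_category, status
    if name_category:
        return name_category, "name_only"
    if upstream:
        return upstream, "type_only"
    return None, "unclassified"
```

Counting the statuses across all products would show how many extra items the upstream type classifies ('type_only') and how often the two methods disagree ('conflict'), which are the cases worth reviewing by hand.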


Option 3: Machine Learning

Another method is to use other variables, such as price per gram, potency, and whether or not the sample received certain tests, to train and test a machine learning algorithm. This could be used to classify items where the product name is simply the name of a strain, or something else that does not provide enough indication of its type. A supervised learning method can be applied, using the categories produced by the strict text classification rules as ground truth to train the model.
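To make the supervised setup concrete, here is a small Python sketch that trains on rows labeled by text rules and predicts a category for an unlabeled row. It uses a nearest-centroid classifier written in plain Python purely as a stand-in; the model actually used in the analysis is not specified here. The function names, the feature scaling, and every number in the toy data are invented for illustration and do not come from the Washington dataset.

```python
from collections import defaultdict


def fit_centroids(rows, labels):
    """Average the feature vectors of each labeled category."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in vector]
        for label, vector in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Made-up training rows: (price per gram, potency, received-test flag),
# each pre-scaled to roughly 0..1 so no single feature dominates the distance.
train_rows = [
    (0.60, 0.80, 1.0),
    (0.55, 0.75, 1.0),
    (0.20, 0.40, 0.0),
    (0.25, 0.45, 0.0),
]
# Labels stand in for categories produced by the text rules ("ground truth").
train_labels = ["shatter", "shatter", "kief", "kief"]

centroids = fit_centroids(train_rows, train_labels)
unlabeled = (0.58, 0.78, 1.0)  # e.g. a product named only after a strain
print(predict(centroids, unlabeled))
```

In practice some of the rule-labeled rows would be held out to test the model before applying it to products the text rules could not classify.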

Feedback from Experts

To fine-tune this model, we discussed our methodology with our capstone team, our advisor, and our client, RAND. After the first round of text classification using product names, we explained the logic to our advisor and teammates, who provided valuable feedback on additional strings to include in our rules. Since we were able to classify 63% of the extracts for inhalation product names using this method, we also discussed the results with RAND to check that this type of classification was what they were hoping to see. RAND was impressed with how well rule-based text classification worked and was particularly interested in the subsequent analysis it informed. They also confirmed that the strings and final categories made logical sense in the context of marijuana policy discussion. This reinforced our view that these categories are accurate enough to serve as ground truth for a machine learning model.

Given this positive feedback, we continued to develop our machine learning model to classify products that could not be classified using text. Our advisor provided helpful feedback on building the model, as did Jen and Nikola.


This portion of the analysis on this dataset was conducted by Krista Kinnard and Lauren Renaud.

We are grateful for the work from the rest of the Heinz College Systems Team: Yutian Gao, Ellie Najewicz, Imane Fahli, Yilun Bao, and our advisor, Professor Jon Caulkins.
We also want to thank the RAND Team (Beau Kilmer, Steve Davenport, Rosanna Smart, and Greg Midgette), Peter Corier from the Liquor and Cannabis Board, and our wider Heinz College Advisory Board, including Jen Mankoff and Nikola Banovic, for their guidance and assistance on this project.