Transfer Learning and Pre-Trained Models

I have been reading the book Deep Learning for Coders with fastai and PyTorch, which I recommend to anyone implementing deep learning. There are many good ideas and specific implementations in the book. Early in chapter 1 the authors piqued my interest when they mentioned pre-trained models. As a practitioner who has to show results quickly, I have always been interested in pre-training and transfer learning. The authors report that “The importance of pretrained models is generally not recognized or discussed in most courses, books, or software library features, and is rarely considered in academic papers”. This seemed strange to me, so I decided to dig a little deeper into the academic journals and look at various implementations to understand the topic more thoroughly, because pre-trained models are an integral part of an insights-as-a-service solution for various industries.

Definitions

A pre-trained model is an ML model that has been trained on a dataset other than the one you are currently using: the weights and biases have been updated, and the hyper-parameters already tuned, on this other dataset before your current dataset is introduced. Transfer learning is a related but different concept: the pre-trained model was trained on tasks or domains that are different from the one you are using it for now. Goodfellow et al., in their 2016 book Deep Learning, defined transfer learning as the “situation where what has been learned in one setting is exploited to improve generalization in another setting”. Generally speaking, the pre-trained model was originally built and tuned on a very large corpus of data and will be used in a setting for which it was not originally trained.
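To make the distinction concrete, here is a minimal sketch in plain Python of the usual transfer learning recipe: keep the pre-trained weights frozen and train only a small new head on your own data. Everything in it is an assumption made for illustration: the hard-coded weights stand in for a model trained elsewhere, the six-example dataset is invented, and the names extract_features, train_head, and predict_proba are hypothetical. A real project would load weights through a framework such as PyTorch or fastai rather than type them in.

```python
import math

# Hypothetical "pre-trained" weights. In practice these would be loaded from a
# model trained on a large source dataset; here they are fixed toy numbers.
PRETRAINED_W = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]  # 3 hidden units, 2 inputs
PRETRAINED_B = [0.0, 0.0, -0.5]


def extract_features(x):
    """Frozen feature extractor: one linear layer followed by ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(PRETRAINED_W, PRETRAINED_B)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, head_w, head_b):
    """Probability of class 1 from the new head on top of frozen features."""
    features = extract_features(x)
    return sigmoid(sum(w * f for w, f in zip(head_w, features)) + head_b)


def train_head(data, epochs=200, lr=0.5):
    """Train only the logistic-regression head; pre-trained weights never change."""
    head_w = [0.0, 0.0, 0.0]
    head_b = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            features = extract_features(x)
            p = sigmoid(sum(w * f for w, f in zip(head_w, features)) + head_b)
            # Cross-entropy loss, with a small constant to avoid log(0).
            total += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
            grad = p - y  # gradient of the loss with respect to the logit
            head_w = [w - lr * grad * f for w, f in zip(head_w, features)]
            head_b -= lr * grad
        losses.append(total / len(data))
    return head_w, head_b, losses


# A deliberately tiny target dataset: label 1 when the first input is larger.
data = [
    ([1.0, 0.0], 1), ([0.0, 1.0], 0),
    ([2.0, 0.5], 1), ([0.5, 2.0], 0),
    ([1.5, 1.0], 1), ([0.2, 0.9], 0),
]
head_w, head_b, losses = train_head(data)
```

Only the four numbers in the head (three weights and a bias) are learned from the six examples, which is why so little data is needed. Frameworks offer the same choice at a larger scale: freeze the pre-trained layers, or unfreeze some of them and fine-tune with a small learning rate.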

There are numerous advantages to using pre-trained models:

  • Data Requirements - if you use vetted pre-trained models, much less data is required to train the model on your current problem. Getting access to, cleansing, and labeling large datasets represents a significant cost in time and money, so savings here pay huge dividends in the future

  • Quality Assurance - models that have already been pre-trained and tuned can save you significant hours, and you can be more confident in your end results

  • Time to Market - ultimately, pre-trained models provide a time to market advantage. Time to market and fast implementations are always a primary concern for my customers. I have to have answers to the question “how do I deliver this sooner?”.

Research Topic

I did a scan of the research and academic literature. Because engineering is the practical application of scientific knowledge, I thought it would be interesting to see what is in the implementation pipeline. I found less research in the area of pre-training models than I had expected. Most of the research I found, and the work it references, dates from 2015 to 2021. Interestingly, there was a significant amount of research in the field of transfer learning in the early 2000s. The interest in transfer learning seems to have been sparked in 1995 by a NIPS (Neural Information Processing Systems) post-conference workshop entitled "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems".

The good news is that a lot more research is now being conducted in the field of pre-trained models and transfer learning, so I would expect additional frameworks and implementations to become available in the coming years. At the end of this post I list the academic research I thought was most pertinent to pre-training and transfer learning.

Current Implementations of Pre-Trained Models

Amazon Web Services: It may be true that pre-trained models got short shrift in the academic literature, but there are a fair number of them in the marketplace for the implementer to choose from. Considering AWS’ dominance in the industry I will start there. As of this writing I have not implemented AWS Marketplace pre-trained models on SageMaker in production. We had a client who had some interest in deploying on AWS, so we did a few proofs of concept that worked well, but in the end the decision was to remain in house. While there are many models to choose from, we were not ready to commit to using ML models from potentially unreliable vendors for a mission-critical application.

PyTorch & fastai: I primarily use PyTorch, so fastai is a good candidate because it is built on top of PyTorch. fastai adds higher-level functionality (a layer of abstraction) above PyTorch that makes designing, developing, testing, and deploying your models easier. The folks at fastai are strong proponents of using pre-trained models, and much of the framework is built on the assumption that you will leverage their models in your implementations. fastai was founded as a non-profit research group by Jeremy Howard and Rachel Thomas and provides an open source solution for practitioners. Here you will find models for image classification, natural language processing (NLP), text classification, CNN learner models, and others.

I have found fastai NLP models very useful for sentiment analysis. In 2018 Howard and Ruder published a paper proposing Universal Language Model Fine-tuning (ULMFiT), a transfer learning method that can be applied to NLP problems to avoid training from scratch. Since then the effort has been extended to non-English languages with a multilingual text classification model called MultiFiT, which builds on ULMFiT. Earlier research supports the use of pre-trained models for sentiment analysis with CNNs, especially when the amount of labeled data you have access to is very small, which is a problem we have encountered with clients. From Severyn and Moschitti’s 2015 research paper Twitter Sentiment Analysis with Deep Convolutional Neural Networks: “When dealing with small amounts of labelled data, starting from pre-trained word embeddings is a large step towards successfully training an accurate deep learning system.” A quick Google search will turn up numerous implementations and examples that leverage fastai’s pre-trained models in a variety of domains, one of which will surely fit your area of interest.
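The Severyn and Moschitti point about small labeled datasets can be illustrated with a toy sketch. To be clear, this is neither their CNN nor ULMFiT: it swaps in a much simpler technique, averaging word vectors and assigning the label of the nearest class centroid, which is enough to show why starting from pre-trained embeddings helps. The two-dimensional vectors in PRETRAINED_EMBEDDINGS are invented for illustration (real embeddings have hundreds of dimensions and are learned from a large unlabeled corpus), and the function names are hypothetical.

```python
# Hypothetical 2-D "pre-trained" word vectors, invented for this sketch.
# Real vectors would be loaded from a file produced by training on a large
# unlabeled corpus.
PRETRAINED_EMBEDDINGS = {
    "great": [0.9, 0.1],
    "love": [0.8, 0.2],
    "good": [0.7, 0.3],
    "awful": [-0.9, 0.1],
    "hate": [-0.8, 0.2],
    "bad": [-0.7, 0.3],
    "movie": [0.0, 0.9],
    "service": [0.0, 0.8],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [
        PRETRAINED_EMBEDDINGS[word]
        for word in text.lower().split()
        if word in PRETRAINED_EMBEDDINGS
    ]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def fit_centroids(labelled):
    """Compute one mean sentence vector per label from a few labeled texts."""
    sums, counts = {}, {}
    for text, label in labelled:
        vector = embed(text)
        current = sums.get(label, [0.0] * len(vector))
        sums[label] = [a + b for a, b in zip(current, vector)]
        counts[label] = counts.get(label, 0) + 1
    return {label: [a / counts[label] for a in total] for label, total in sums.items()}


def predict(text, centroids):
    """Return the label whose centroid is closest to the text's vector."""
    vector = embed(text)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Only two labeled examples, one per class.
train = [("great movie", "pos"), ("awful service", "neg")]
centroids = fit_centroids(train)

print(predict("love movie", centroids))
print(predict("bad service", centroids))
```

The words “love” and “bad” never appear in the two labeled examples, yet both test sentences land next to the right centroid because the embeddings already place them near “great” and “awful”. That prior knowledge is what the labeled data would otherwise have to supply.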

BERT: BERT (Bidirectional Encoder Representations from Transformers) is an open source technique for NLP that was created by Google researchers in 2018. To give you some idea of the power of using pre-trained models, Google claims that in about 30 minutes you can have your own state-of-the-art question answering model (or other similar models) up and running on a single Cloud TPU. BERT has been pre-trained on a corpus of text from Wikipedia. BERT has primarily been used for voice and/or text searches or speech recognition. BERT has been fine-tuned for various domains: DocBERT for document classification, BioBERT for biomedical text (created in Korea), and VideoBERT for video captioning and action classification.

Vendor Implementations: In addition to the options above, there are vendor implementations and augmentations of models based on the pre-trained solutions already described, as well as implementations based on vendors’ own in-house work. I have found vendor solutions helpful because they tend to be more domain specific, which makes for faster model delivery for Fortune 500 companies that use Insights as a Service. The cost of transfer learning is then borne by the vendor, who can come to the table with better, more refined solutions. You will also be able to take advantage of ML models built into common products and tune them for your individual needs. An example of this is an interesting implementation of Splunk and TensorFlow being used to detect fraud based on an individual’s mouse movements.

As you develop your company’s machine learning practice, be mindful of the many advantages of pre-trained models.

References:

Devlin, Jacob, et al. 2018. Google AI Blog: Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

Esman, Gleb. 2017. Splunk and Tensorflow for Security: Catching the Fraudster with Behavior Biometrics.

Guo, Yuting. 2020. Benchmarking of Transformer-Based Pre-Trained Models on Social Media Text Classification Datasets

Howard, Jeremy and Gugger, Sylvain. 2020. Deep Learning for Coders with fastai & PyTorch.

Han, Xu, et al. 2021. AI Open: Pre-Trained Models: Past, Present and Future

Kumar, Varun. 2021. Data Augmentation Using Pre-trained Transformer Models

Tian, Hao, et al. 2020. SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

Vennaro, Nick. 2017. Buy (don’t build) healthcare data insights to improve data investment ROI
