Feed-Forward Assisted Transformers for Time-Efficient Fine-Tuning
I'm grateful to my amazing teammates Blake Hu, Julian Baldwin, Stephen Cheng, Marko Veljanovski, and Michelle Zhang for their dedication, creativity, and collaborative spirit throughout this project. Our advisor Prof. Zach Wood-Doughty provided invaluable feedback and guidance that helped shape our research direction and presentation. This project taught me the importance of balancing theoretical innovation with practical efficiency constraints in machine learning research (and put me through a bit of light CUDA hazing).
Abstract
Fine-tuning has become the standard approach for adapting pre-trained language models to specific downstream tasks. However, the energy and time required to fully fine-tune all parameters can become prohibitively large for many applications as model size increases. While recent advances in parameter-efficient transfer learning have reduced the number of parameters that need to be updated, the training time and energy consumption of these methods remain similar to full fine-tuning. In this paper, we propose a time-efficient fine-tuning method based on feature extraction: we treat off-the-shelf language models as fixed sources of embeddings and train small feed-forward networks on top of them for each downstream task. Averaged across the GLUE NLI benchmarks, our method trains 124 times faster than full fine-tuning and 101 times faster than parameter-efficient fine-tuning methods using distilRoBERTa, while retaining 81.9% and 85.0% of their respective performance.
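To make the feature-extraction setup concrete, here is a minimal NumPy sketch of the general idea: a frozen embedding source (a fixed random table standing in for an off-the-shelf model such as distilRoBERTa), mean-pooled into sentence vectors, with only a small one-hidden-layer feed-forward head trained on top. The task, data, sizes, and hyperparameters below are all hypothetical, chosen only to illustrate that the encoder stays untouched while the head learns.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Frozen encoder": a fixed random embedding table standing in for an
# off-the-shelf language model. It is never updated during training.
VOCAB, DIM = 50, 16
frozen_embeddings = rng.normal(size=(VOCAB, DIM))
frozen_snapshot = frozen_embeddings.copy()  # to verify it stays fixed

def encode(token_ids):
    # Mean-pool the frozen token embeddings into one sentence vector.
    return frozen_embeddings[token_ids].mean(axis=0)

# Toy downstream task (hypothetical): sentences drawn from the low half of
# the vocabulary are class 0, from the high half class 1.
def make_example():
    label = int(rng.integers(0, 2))
    lo, hi = (0, VOCAB // 2) if label == 0 else (VOCAB // 2, VOCAB)
    return encode(rng.integers(lo, hi, size=8)), label

X, y = zip(*(make_example() for _ in range(200)))
X, y = np.stack(X), np.array(y, dtype=float)

# Small feed-forward head: one ReLU hidden layer plus a sigmoid output,
# trained with plain gradient descent. Only these weights are updated.
H = 8
W1, b1 = rng.normal(scale=0.1, size=(DIM, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.1, size=H), 0.0

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)                    # hidden activations
    return h, 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # class-1 probability

lr = 0.5
for step in range(300):
    h, p = forward(X)
    g = (p - y) / len(y)                  # d(logistic loss)/d(logit)
    gW2, gb2 = h.T @ g, g.sum()
    gh = np.outer(g, W2) * (h > 0)        # backprop through the ReLU
    W1 -= lr * (X.T @ gh)
    b1 -= lr * gh.sum(axis=0)
    W2 -= lr * gW2
    b2 -= lr * gb2

accuracy = float(((forward(X)[1] > 0.5) == (y > 0.5)).mean())
encoder_unchanged = np.array_equal(frozen_embeddings, frozen_snapshot)
```

Because gradients only flow into the head's weights, training cost scales with the tiny feed-forward network rather than the full language model, which is the source of the speedups reported above; in practice the frozen embeddings can also be precomputed once per dataset.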