PyData Global 2022

Converting sentence-transformers models to a single tensorflow graph
12-02, 13:00–13:30 (UTC), Talk Track I

Getting predictions from transformer models such as BERT requires two steps: first querying the tokenizer and then feeding its outputs to the deep learning model itself. These two parts are implemented as separate classes in popular open-source libraries such as Huggingface Transformers and Sentence-Transformers. This works well within Python, but when one wants to put such a model in production or convert it to more efficient formats like ONNX that may be served from other languages (e.g., JVM-based ones), it is preferable, simpler and less risky to have a single artifact that is queried directly. This talk builds on the popular sentence-transformers library and shows how one can transform a sentence-transformers model into a single TensorFlow artifact that can be queried with strings and is ready for serving. By the end of the talk the audience will have a better understanding of the architecture of sentence-transformers and of the steps required to convert a sentence-transformers model to a single TensorFlow graph. The code is released as a set of notebooks so that the audience can replicate the results.


This is my first PyData talk proposal and your feedback on its elements is most welcome. It builds on lessons learned while preparing a BERT-type model to be shipped to production. The code of the talk is ready and tested and will be published as notebooks as part of the talk. The code currently lives on my GitHub (https://github.com/balikasg/tf-exporter), the main functionality is in https://github.com/balikasg/tf-exporter/tree/master/src/tf_packager, and it is being developed further. The problem the talk aims to discuss is a common and interesting one without a complete solution yet; see for instance [1][2][3][4].

I will showcase how one can transform a sentence-transformers model, which consists of its custom tokenizer and the DL model itself, into a single TensorFlow graph that can be queried directly with strings. This is particularly useful when putting such a model in production, as it reduces the probability of misalignment between the model's tokenizer and transformer parts and also removes the need for constructing and maintaining two APIs (one for tokenization and another for querying the DL model).

The presentation will have four main parts that will use ~25 minutes, leaving 5 minutes for questions. Throughout the talk, I will use sentence-transformers as the main framework and I will show how models residing in the sentence-transformers Huggingface space, or trained (or fine-tuned) with sentence-transformers, can be converted to a TensorFlow graph ready for serving. Here is the breakdown of the talk:

0-5 min: Introduction. In this first part of the talk I will introduce myself and state the main problem. In particular, I will show that getting predictions from a transformer model requires two distinct steps, tokenization and querying the model, and I will discuss the advantages of having a single artifact when putting such a model in production. Note that while a single graph is not strictly required for serving, when serving a model we care a lot about reducing the risk of any failure. I expect this element of the talk to make it particularly interesting for ML engineers in the audience, or anyone trying to serve such models, because simplifying the process of querying the model is of utmost importance for them.
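For illustration, here is a minimal sketch of those two steps with the Huggingface Transformers API (the model name and the mean pooling are example choices, not necessarily what the talk will use):

    # Sketch of the two-step process: tokenize, then run the deep learning model.
    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "sentence-transformers/all-MiniLM-L6-v2"   # example model
    tokenizer = AutoTokenizer.from_pretrained(name)   # step 1: tokenization
    model = AutoModel.from_pretrained(name)           # step 2: the transformer itself

    encoded = tokenizer(["how tall is the eiffel tower?"],
                        padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state
    # mean-pool over non-padding tokens to obtain a sentence embedding
    mask = encoded["attention_mask"].unsqueeze(-1).float()
    sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)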

5-12 min: I will show the actual model files and configs that the sentence-transformers framework persists, which will motivate and explain the conversion steps. Essentially, a sentence-transformers model is a sequence of models. In the most minimal scenario it consists of a tokenizer and a transformer model, whereas more elaborate architectures have other types of layers (e.g., normalization layers, dense layers, LSTM layers, ...) on top of the transformer outputs of the minimal example. I will use three models of increasing complexity as examples: nq-distilbert-base-v1 (implements the minimal example), all-MiniLM-L6-v2 (adds normalization on top of nq-distilbert) and distiluse-base-multilingual-cased-v1 (also has a dense layer on top of the transformer output). Describing this will let the audience understand some of the architectural choices within sentence-transformers and will also motivate the solution.
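To make this sequence-of-modules view concrete, the pipeline of a model can be inspected directly; a small sketch (the printed module names may vary slightly across sentence-transformers versions):

    # Sketch: inspecting the module pipeline of a sentence-transformers model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # SentenceTransformer is a torch Sequential, so we can iterate over its modules
    for idx, module in enumerate(model):
        print(idx, type(module).__name__)
    # Expected (roughly): Transformer, Pooling, Normalize
    # On disk, the same sequence is recorded in the model's modules.json file.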

12-22 min: I will describe the solution, showing how each component of the example models can be converted to a TensorFlow component and how all these components can be tied together as a sequential neural network. The highlights of this part will be how to convert the tokenizer using native tf.text operations and how to convert a PyTorch dense layer into a TensorFlow dense layer with the same parameters. This is quite interesting because it touches on cross-framework model transformations (TensorFlow vs. PyTorch), which are not trivial. Completing this description will clearly demonstrate the framework: to get a single TensorFlow graph for a given model, it suffices to convert each of the components and tie them together in a sequential model.
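To give a flavor of the cross-framework part, here is a minimal sketch of transplanting a PyTorch dense (linear) layer into an equivalent Keras layer; the tanh activation mirrors the typical sentence-transformers Dense configuration, and the key detail is the weight transpose:

    # Sketch: copying a PyTorch linear layer's parameters into a Keras Dense layer.
    # PyTorch stores the weight as (out_features, in_features), while Keras expects
    # a kernel of shape (in_features, out_features), hence the transpose.
    import numpy as np
    import tensorflow as tf
    import torch

    torch_dense = torch.nn.Linear(768, 512)    # stand-in for the model's dense module
    w = torch_dense.weight.detach().numpy()    # shape (512, 768)
    b = torch_dense.bias.detach().numpy()      # shape (512,)

    tf_dense = tf.keras.layers.Dense(512, activation="tanh")
    tf_dense.build(input_shape=(None, 768))
    tf_dense.set_weights([w.T, b])             # kernel must be (768, 512)

    # Quick agreement check on a random input
    x = np.random.randn(2, 768).astype("float32")
    out_pt = torch.tanh(torch_dense(torch.from_numpy(x))).detach().numpy()
    out_tf = tf_dense(tf.constant(x)).numpy()
    assert np.allclose(out_pt, out_tf, atol=1e-5)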

22-25 min: As a conclusion to the talk I will show a demo where each of the aforementioned models is converted to a single TensorFlow artifact. As a sanity check I will compare the predictions of the sentence-transformers models for example queries with those obtained from the complete graph. A public repository with the conversion notebooks will be shared with the audience.
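The sanity check will follow roughly this pattern (a sketch: the export path and the assumption that the exported graph accepts a string tensor directly are illustrative):

    # Sketch: compare the original sentence-transformers predictions with the
    # exported single-graph model for the same queries.
    import numpy as np
    import tensorflow as tf
    from sentence_transformers import SentenceTransformer

    queries = ["how tall is the eiffel tower?", "who wrote dune?"]

    st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    reference = st_model.encode(queries)                      # numpy array of embeddings

    exported = tf.saved_model.load("path/to/exported_graph")  # hypothetical path
    candidate = exported(tf.constant(queries)).numpy()        # assumed string-in signature

    print(np.allclose(reference, candidate, atol=1e-5))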

As a side note, I hope to announce a pip package that generalizes the code of the talk; its main functionality at that point will be converting a sentence-transformers model to a single TensorFlow graph that includes the tokenizer, the transformer and any additional layers applied on top of it.

[1] https://github.com/kpe/bert-for-tf2/issues/75
[2] https://github.com/huggingface/transformers/issues/13985
[3] https://github.com/microsoft/onnxruntime-extensions/issues/164
[4] https://stackoverflow.com/questions/71035590/how-can-i-combine-a-huggingface-tokenizer-and-a-bert-based-model-in-onnx


Prior Knowledge Expected

No previous knowledge expected

Georgios Balikas is a Lead Data Scientist at Salesforce Search. He works on building production models for machine learning applications such as named entity recognition, classification, ranking and question answering. He holds a PhD from the University of Grenoble Alpes at the intersection of machine learning and natural language processing.