Hugging Face: A Step Towards Democratizing NLP

It’s not an emoji, it’s NLP for everyone

Photo by James Lee on Unsplash

Hugging Face: no, I am not referring to one of our favorite emojis for expressing thankfulness, love, or appreciation. In the world of data science, Hugging Face is a startup in the Natural Language Processing (NLP) domain whose library of models is used by A-listers including Apple and Bing.

For those wondering why the focus of today’s blog is on a startup, let me first take you through what Hugging Face is all about and why it matters for fellow data scientists.

What is Hugging Face?

Hugging Face started out building a chat app for bored teens and now provides open-source NLP technologies. Last year, it raised $15 million to build a definitive NLP library. Since its chat app days, Hugging Face has swiftly developed language processing expertise. The company’s aim is to advance NLP and democratize it for use by everyone.

Photo by Daniela Turcanu on Unsplash

Technologies such as NLP are crucial to making it easier for humans to communicate with machines. With NLP, computers can read text, hear speech, interpret it, measure sentiment, and even determine which parts of the text or speech are important. As more companies add NLP technologies to enhance interactions, ready-made libraries and pre-trained language models become essential, because they save the time and cost of training from scratch. This is where companies like Hugging Face come into play. The BERT models it hosts are considered highly effective and are widely used.


Photo by Markus Winkler on Unsplash

Bidirectional Encoder Representations from Transformers, or BERT, is an NLP pre-training technique developed by Google. Hugging Face offers Transformer-based models for PyTorch and TensorFlow 2.0. There are thousands of pre-trained models for tasks such as text classification, information extraction, question answering, and more. Because using a pre-trained model costs far less compute than training one from scratch, the library offers a low barrier to entry for educators and practitioners. The company also offers an Inference API for using those models.

Hugging Face hosts a number of models that are popular for their effectiveness and ease of use. Now that we have a fair idea of Hugging Face and its BERT models, let me give you a brief overview of two popular language models available through it.

Model: bert-base-uncased

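One of the popular models on Hugging Face is bert-base-uncased, a model pre-trained on English text in a self-supervised fashion: it was trained on raw text only, with an automatic process generating inputs and labels from that text, so no human labeling was needed. "Uncased" means the model makes no distinction between "english" and "English". It was pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).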

In the MLM objective, 15% of the words in an input sentence are randomly masked, and the masked sentence is then run through the model, which has to predict the masked words. This allows the model to learn a bidirectional representation of the sentence.
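To make the masking step concrete, here is a minimal sketch in plain Python. It only illustrates picking roughly 15% of the positions and replacing them with `[MASK]`. The function name `mask_tokens`, the whitespace tokenization, and the fixed seed are illustrative assumptions, and the actual BERT pre-training recipe is more involved (it operates on WordPiece tokens and does not always substitute `[MASK]` at a selected position).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token at masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position so that short sentences still give a training signal.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))

    masked_tokens, labels = [], []
    for i, token in enumerate(tokens):
        if i in masked_positions:
            masked_tokens.append(MASK_TOKEN)
            labels.append(token)   # the model has to recover this token
        else:
            masked_tokens.append(token)
            labels.append(None)    # this position does not contribute to the loss
    return masked_tokens, labels


tokens = "the quick brown fox jumps over the lazy dog near the river bank".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

Because the model sees the words on both sides of each `[MASK]`, predicting the hidden token forces it to use left and right context together, which is where the "bidirectional" part comes from.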

In the NSP objective, two masked sentences are concatenated as a single input during pre-training. Sometimes the two sentences were next to each other in the original text and sometimes they were not, and the model has to predict which is the case. In this way, the model learns an inner representation of the English language that can be leveraged for downstream tasks; for example, if you have a dataset of labeled sentences, you can train a standard classifier on the features the BERT model produces.
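The pairing behind NSP can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual pre-training code: the function name `make_nsp_pairs` is hypothetical, the 50/50 split between true and random successors is an assumption of this sketch, and negatives are drawn from the same list of sentences purely to keep the example small.

```python
import random


def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list of sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        # Candidate negatives: every sentence except A itself and its true successor.
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        if candidates and rng.random() < 0.5:
            # Negative example: B is a randomly chosen sentence that does not follow A.
            pairs.append((sentences[i], rng.choice(candidates), False))
        else:
            # Positive example: B is the sentence that really follows A.
            pairs.append((sentences[i], sentences[i + 1], True))
    return pairs


sentences = [
    "I went to the store.",
    "I bought some milk.",
    "Then I walked home.",
    "It started to rain.",
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(sentences, seed=1):
    print(is_next, "|", sentence_a, "|", sentence_b)
```

Each tuple corresponds to one training input: the two sentences are concatenated, and the `is_next` flag is the label the model learns to predict.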

Using bert-base-uncased

The bert-base-uncased model is mostly intended to be fine-tuned on a downstream task, but the raw model can also be used directly for masked language modeling or next sentence prediction.

You can use this model with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```
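The pipeline returns an ordinary list of dictionaries, so its output can be post-processed with plain Python. The sketch below works on values copied from the output above (scores rounded); the helper names `top_prediction` and `filter_by_score` and the 0.05 threshold are illustrative choices, not part of the Transformers API.

```python
# Candidates copied from the fill-mask output shown above (scores rounded).
predictions = [
    {"token_str": "fashion", "score": 0.1073},
    {"token_str": "role", "score": 0.0877},
    {"token_str": "new", "score": 0.0534},
    {"token_str": "super", "score": 0.0467},
    {"token_str": "fine", "score": 0.0271},
]


def top_prediction(preds):
    """Return the candidate dictionary with the highest score."""
    return max(preds, key=lambda p: p["score"])


def filter_by_score(preds, min_score):
    """Return the candidate tokens whose score is at least min_score."""
    return [p["token_str"] for p in preds if p["score"] >= min_score]


print(top_prediction(predictions)["token_str"])  # fashion
print(filter_by_score(predictions, 0.05))        # ['fashion', 'role', 'new']
```

Note how low the scores are: even the best candidate gets only about 11% of the probability mass, which is a reminder that a masked slot like this one has many plausible fillers.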

To use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

To use this model in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

Model: xlm-roberta

Another very popular model on Hugging Face is xlm-roberta (XLM-R), developed by Facebook AI. It is a multilingual model trained on 100 different languages, including Hindi, Japanese, Welsh, and Hebrew. It does not require lang tensors to know which language is being used; it should be able to determine the correct language from the input ids alone.

Trained on 2.5TB of filtered CommonCrawl data in those languages, the xlm-roberta model obtains state-of-the-art results on many cross-lingual understanding benchmarks. In the Transformers library it is also a PyTorch torch.nn.Module subclass, so you can use it as a regular PyTorch module.

Two pre-trained models are available: XLM-R base, which uses the BERT base architecture, and XLM-R large, which uses the BERT large architecture.

To load XLM-R with fairseq (for PyTorch 1.0 or custom models):

```shell
# Download the xlmr.large model archive, then extract it
tar -xzvf xlmr.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel

# checkpoint_file must name the checkpoint file inside the extracted directory
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```

To apply sentence-piece-model (SPM) encoding to input text:

```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```
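Notice that every encoded sequence above starts with 0 and ends with 2, whatever the language. In the XLM-R vocabulary these IDs are the `<s>` (beginning-of-sentence) and `</s>` (end-of-sentence) markers that `encode` adds around the SentencePiece pieces. The sketch below strips them using plain lists, so it runs without fairseq; `strip_special_tokens` is a hypothetical helper written for this illustration, not a fairseq function.

```python
BOS_ID = 0  # <s>, beginning-of-sentence marker
EOS_ID = 2  # </s>, end-of-sentence marker


def strip_special_tokens(token_ids):
    """Drop a leading <s> ID and a trailing </s> ID, if present."""
    ids = list(token_ids)
    if ids and ids[0] == BOS_ID:
        ids = ids[1:]
    if ids and ids[-1] == EOS_ID:
        ids = ids[:-1]
    return ids


# Token IDs copied from the assertions above.
encoded = {
    "en": [0, 35378, 8999, 38, 2],
    "zh": [0, 6, 124084, 4, 3221, 2],
    "fr": [0, 84602, 95, 11146, 2],
}
for lang, ids in encoded.items():
    print(lang, strip_special_tokens(ids))
```

What remains after stripping are the IDs of the actual subword pieces, all drawn from one vocabulary shared across languages, which is why no separate language indicator is needed.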

To extract features from XLM-R:

```python
import torch

# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
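The numbers in these assertions follow directly from the large architecture, which has 24 transformer layers and a hidden size of 1024. The short sketch below reproduces the arithmetic without loading the model; the constant and function names are illustrative and do not come from fairseq.

```python
NUM_LAYERS = 24     # transformer layers in the large architecture
HIDDEN_SIZE = 1024  # size of each token vector in the large architecture


def expected_feature_shape(token_ids, batch_size=1):
    """Shape of the last-layer features: (batch, sequence length, hidden size)."""
    return (batch_size, len(token_ids), HIDDEN_SIZE)


def expected_num_hidden_states(num_layers=NUM_LAYERS):
    """One entry for the embedding layer plus one per transformer layer."""
    return num_layers + 1


zh_token_ids = [0, 6, 124084, 4, 3221, 2]    # IDs for '你好,世界' from the encoding example
print(expected_feature_shape(zh_token_ids))  # (1, 6, 1024)
print(expected_num_hidden_states())          # 25
```

So the six token IDs for the Chinese sentence each get a 1024-dimensional vector, and asking for all hidden states returns 25 tensors because the embedding output is counted as layer 0.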


With companies such as Hugging Face providing pre-trained language models, it becomes easier for businesses to extract easily digestible information about how well their product is performing, instead of deciphering graphs and reports. At the core of NLP is technology that understands the very language the human world runs on, and democratizing it only makes the process more seamless and effective.
