
An Introduction to spaCy

What is spaCy?

spaCy is an open-source library for Natural Language Processing (NLP) in Python.

Some salient features

  • Fast. Written in Cython (the "Cy" in spaCy stands for Cython), so it is quite fast.
  • Highly accurate models.
  • Application-oriented rather than academic in its approach.
  • Can easily be geared up for deep learning.

In fact, spaCy's tagline is "Industrial-Strength Natural Language Processing". Where other NLP packages are loaded with many alternative algorithms, spaCy ships with just one POS-tagging and one NER algorithm.

This post works more like a cheat sheet of what can be done with spaCy than a description of each functionality.

Check the resource list at the end of the post for deeper explanations.

Installation

pip install -U spacy
python -m spacy download en

Working

In [1]:
# Load spacy
import spacy
nlp = spacy.load('en')  # load the English model
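A note on the model name: in spaCy 2.x, 'en' is a shortcut link that the download command creates for the small English model package, so loading it by its full name should be equivalent:

nlp = spacy.load('en_core_web_sm')  # same model via its full package name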

spaCy also supports several other languages, such as German, Spanish, and French.
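For example, a minimal sketch of loading the German model, assuming it has been downloaded first (python -m spacy download de):

nlp_de = spacy.load('de')
doc_de = nlp_de('Ich habe keine Zeit.')
print([token.pos_ for token in doc_de])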

In [2]:
sample_test = 'spaCy has some cool features!'

Now pass sample_test through the spaCy pipeline:

In [3]:
doc = nlp(sample_test)

Now, if you check the type of doc

In [4]:
type(doc)
Out[4]:
spacy.tokens.doc.Doc

This is a spaCy Doc object, and there are many inbuilt functions and attributes you can apply to it.

Check them out using pdir (PrettyDir):

In [5]:
import pdir  # install with: pip install pdir2
In [6]:
pdir(doc)
Out[6]:
abstract class:
    __subclasshook__
attribute access:
    __delattr__, __dir__, __getattribute__, __setattr__
class customization:
    __init_subclass__
container:
    __getitem__, __iter__, __len__
object customization:
    __bytes__, __format__, __hash__, __init__, __new__, __repr__, __sizeof__, __str__
pickle:
    __reduce__, __reduce_ex__, __setstate__
property:
    _, __pyx_vtable__, _py_tokens, _vector, _vector_norm, cats, doc, ents, has_vector, is_parsed, is_sentenced, is_tagged, mem, noun_chunks, sentiment, sents, tensor, text, text_with_ws, user_data, user_hooks, user_span_hooks, user_token_hooks, vector, vector_norm, vocab
rich comparison:
    __eq__, __ge__, __gt__, __le__, __lt__, __ne__
special attribute:
    __class__, __doc__
function:
    __unicode__:
    _realloc:
    char_span: Create a `Span` object from the slice `doc.text[start : end]`.
    count_by: Count the frequencies of a given attribute. Produces a dict of
    extend_tensor: Concatenate a new tensor onto the doc.tensor object.
    from_array:
    from_bytes: Deserialize, i.e. import the document contents from a binary string.
    from_disk: Loads state from a directory. Modifies the object in place and
    get_extension:
    get_lca_matrix: Calculates the lowest common ancestor matrix for a given `Doc`.
    has_extension:
    merge: Retokenize the document, such that the span at
    noun_chunks_iterator: Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    print_tree: Returns the parse trees in JSON (dict) format.
    retokenize: Context manager to handle retokenization of the Doc.
    set_extension:
    similarity: Make a semantic similarity estimate. The default estimate is cosine
    to_array: Export given token attributes to a numpy `ndarray`.
    to_bytes: Serialize, i.e. export the document contents to a binary string.
    to_disk: Save the current state to a directory.
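A few of the listed behaviours in action. This is a quick sketch using the doc from above; count_by takes an attribute ID from spacy.attrs, and the strings-store lookup makes the counted keys readable:

from spacy.attrs import ORTH

print(len(doc))                  # number of tokens: 6
print(doc[0].text)               # indexing gives a Token: 'spaCy'
print(doc[2:4].text)             # slicing gives a Span: 'some cool'
print(doc.char_span(0, 5).text)  # Span from character offsets: 'spaCy'

# count token frequencies by their raw text (the ORTH attribute)
counts = doc.count_by(ORTH)
print({doc.vocab.strings[k]: v for k, v in counts.items()})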

Tokenizing

This is the process of splitting the doc object into small meaningful units called tokens.

In a rudimentary sense, tokens are similar to the words in a sentence. Many more spaCy functionalities can be applied to these tokens.

In [7]:
[token.text for token in doc]
Out[7]:
['spaCy', 'has', 'some', 'cool', 'features', '!']

You can do the above tokenization using plain Python string manipulation, but that splits the sentence naively, using whitespace as the splitting criterion, and cannot handle punctuation the way spaCy does:

In [10]:
sample_test.split()
Out[10]:
['spaCy', 'has', 'some', 'cool', 'features!']

Part-of-Speech Tagging

Using the pos_ and tag_ attributes, the part of speech of each token can be recognized:

In [11]:
doc = nlp('She sells seashells by the seashore')
In [12]:
[print(f'{token.text:<10}:{token.pos_:^6}:{token.tag_}') for token in doc];
She       : PRON :PRP
sells     : VERB :VBZ
seashells : NOUN :NNS
by        : ADP  :IN
the       : DET  :DT
seashore  : NOUN :NN

Anytime you want to decode these acronyms, you can use the spacy.explain command:

In [13]:
se = spacy.explain
In [14]:
se('ADP')
Out[14]:
'adposition'
In [15]:
se('PRP')
Out[15]:
'pronoun, personal'

Named Entity Recognition

Different types of entities in a document, such as persons, locations, and organizations, can be recognized using the doc.ents attribute.

NER is built on statistical models, so at times it may not work accurately.

In [16]:
doc = nlp('Satya Narayana Nadella is an Indian American business executive. He is the Chief Executive Officer of Microsoft, succeeding Steve Ballmer in 2014.')
In [17]:
[print(f'{ent.text:<25} :{ent.label_}') for ent in doc.ents];
Satya Narayana Nadella    :ORG
Indian American           :NORP
Microsoft                 :ORG
Steve Ballmer             :PERSON
2014                      :DATE
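If any of these entity labels is unfamiliar, the same spacy.explain trick from above works here too:

se('NORP')   # 'Nationalities or religious or political groups'
se('DATE')   # 'Absolute or relative dates or periods'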

Noun phrases

We can get noun phrases and their root words:

In [18]:
doc = nlp('The Indian independence movement was a movement from 1857 until 15 August 1947, when India got independence from the British Raj.')
In [19]:
[print(f'{phrase.text :<32} : {phrase.label_}: {phrase.root.text}') for phrase in doc.noun_chunks ];
The Indian independence movement : NP: movement
a movement                       : NP: movement
15 August                        : NP: August
India                            : NP: India
independence                     : NP: independence
the British Raj                  : NP: Raj
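Each chunk's root is an ordinary token, so you can also look at its syntactic head to see what the phrase attaches to. A small sketch extending the example above:

for phrase in doc.noun_chunks:
    print(f'{phrase.text:<32} -> head: {phrase.root.head.text}')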

Splitting the document into sentences

spaCy is intelligent enough to split the doc into sentences. A rudimentary way is to split on '.', but this fails for titles like 'Dr.'; spaCy handles such cases easily.
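To see why the naive approach fails, here is a quick sketch on a made-up sentence:

naive = "Dr. Smith lives in Chennai. He teaches physics."
print(naive.split('.'))
# ['Dr', ' Smith lives in Chennai', ' He teaches physics', '']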

In [20]:
doc = nlp("Dr. A. P. J. Abdul Kalaam (15 October 1931 – 27 July 2015) was an Indian scientist who served as the 11th President of India from 2002 to 2007. He was born and raised in Rameswaram, Tamil Nadu and studied physics and aerospace engineering. He spent the next four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation (DRDO) and Indian Space Research Organisation (ISRO) and was intimately involved in India's civilian space programme and military missile development efforts.[1] ")
In [21]:
[print(f'{index} : {sents.text}') for index,sents in enumerate(doc.sents)];
0 : Dr. A. P. J. Abdul Kalaam (15 October 1931 – 27 July 2015) was an Indian scientist who served as the 11th President of India from 2002 to 2007.
1 : He was born and raised in Rameswaram, Tamil Nadu and studied physics and aerospace engineering.
2 : He spent the next four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation (DRDO) and Indian Space Research Organisation (ISRO) and was intimately involved in India's civilian space programme and military missile development efforts.[1]

Some other useful attributes on tokens are listed below (a short demonstration follows the list):

  • text (the token as a string)
  • string (like text, but includes any trailing whitespace)
  • idx (the character offset of the token within the document)
  • lemma_ (the root form of the word)
  • is_punct (whether the token is punctuation)
  • is_space (whether the token is whitespace)
  • shape_ (the orthographic shape of the word, e.g. King --> Xxxx)
  • pos_ (coarse part-of-speech tag)
  • tag_ (fine-grained part-of-speech tag)
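A quick sketch exercising these attributes; the sentence is my own example:

doc = nlp('Dr. Strange lives in London.')
for token in doc:
    print(f'{token.text:<8} idx={token.idx:<3} lemma={token.lemma_:<8} '
          f'punct={token.is_punct!s:<6} shape={token.shape_:<6} '
          f'{token.pos_:<6} {token.tag_}')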

Visualizing

spaCy comes with a nice visualization tool called displaCy.

Named Entity Recognition

In [22]:
doc = nlp("Francisco D'Souza, CEO, Cognizant. Cognizant CEO Francisco D'Souza took home $11.95 million as annual compensation last year, making him the top paid CEO among peers such as Vishal Sikka of Infosys, N Chandrasekaran of TCS and Abidali Neemuchwala of Wipro.")
In [23]:
from spacy import displacy
In [24]:
displacy.render(doc,style='ent', jupyter=True)
Francisco D'Souza PERSON , CEO, Cognizant. Cognizant CEO Francisco D'Souza PERSON took home $11.95 million MONEY as annual DATE compensation last year DATE , making him the top paid CEO among peers such as Vishal Sikka ORG of Infosys ORG , N Chandrasekaran of TCS ORG and Abidali Neemuchwala of Wipro ORG .

We can see that some of the words are not recognized properly (Vishal Sikka, for instance, is tagged as an organization). For this we can train the model further, which we will see in later posts.

In [25]:
doc = nlp("Steven Paul Jobs was an American entrepreneur and business magnate. He was the chairman, chief executive officer, and a co-founder of Apple Inc., chairman and majority shareholder of Pixar,")
In [26]:
displacy.render(doc,style='ent', jupyter=True)
Steven Paul Jobs PERSON was an American NORP entrepreneur and business magnate. He was the chairman, chief executive officer, and a co-founder of Apple Inc. ORG , chairman and majority shareholder of Pixar ORG ,

Dependency parser

In [27]:
doc=nlp('Peter Piper picked a peck of pickled peppers.')
In [28]:
displacy.render(doc, style='dep', jupyter=True, options={'distance':100})
[displaCy renders the dependency tree over the tokens Peter PROPN, Piper PROPN, picked VERB, a DET, peck NOUN, of ADP, pickled VERB, peppers NOUN, with the arcs compound, nsubj, det, dobj, prep, amod, pobj]

You can also host this on a local server and open it in the browser:

displacy.serve(doc, style='dep')

By default this will be served at:

http://localhost:5000
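displaCy's entity visualizer also accepts an options dict, for example to restrict which labels are drawn or to recolor them. A sketch (the colour value here is arbitrary):

options = {'ents': ['PERSON', 'ORG'], 'colors': {'ORG': '#ffd966'}}
displacy.render(doc, style='ent', jupyter=True, options=options)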

Resource list

  • spaCy's official page has very good documentation - link

  • Natural Language Processing and Computational Linguistics - link. I liked this book a lot; I think it is the only book written entirely on spaCy.
