sparse_matrix

Sparse Matrix and its visualization

This is a matrix (2 dimensional array) in which very few elements are non-zero. So, most of the elements are zero. So, the name sparse (thinly dispersed or scattered)

On contrast, if most of the elements are non-zero, it is called dense matrix.

It is computationally very expensive to store all the data (lot of zeros and their locations in the matrix).
So, there are numerous methods to store only non-zero data

SciPy gives an ease way of defining and storing this matrices. Several ways of storing sparse matrix using SciPy

Sparsity is measured by

Sparsity = number of zero elements / total elements

Sparse matrix knowledge has applications in many fields.

In CFD, this sparsity comes from the stencil ( arrangement of a nodal group). We will deal with very large sparse matrix depending on the size of the grid. Technically speaking these are banded matrices (non-zero elements are confined to a diagonal band)

In Data Science, say example for a users watch list data contains zeroes for most of the movies those user has not watched.

Further reading about sparse matrix

Defining

Say there s a data matrix of 1 million elements, but only 10,000 elements are non-zero.
Here we can save memory by just saving those 10,000 elements and their location by the index (row and column numbers)
So, we will be saving 30,000 elements instead of saving 1 million elements.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import bsr_matrix

One of the way of defining sparse matrix is Block Sparse Row matrix from SciPy

In [2]:
row = np.array([0, 0, 1, 2, 2])
col = np.array([0, 2, 2, 0, 1])
data = np.array([1, 2, 3 ,4, 5])

a = bsr_matrix((data, (row, col)), shape=(3, 3)).toarray()

This is a 3 by 3 matrix, but has six elements non-zero. Their row and column index are defined.

Visualization

In [3]:
a
Out[3]:
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 0]], dtype=int64)
In [4]:
plt.spy(a)
plt.show()

Another large sparse matrix

Don't worry about the below code, just generating random data for the indices and data points.
Just a quick code, not an efficient one.

In [5]:
def generate_random():
    random_array = []
    run = True
    while run:
        random_no = np.random.randint(1000)  #max data point value
        if random_no not in random_array:
            random_array.append(random_no)
        if len(random_array)> 500: # max length of the array size
            run = False
    return random_array

def generate_matrix():
    indices_1 = np.array(generate_random())
    indices_2 = np.array(generate_random())
    if max(indices_1)>max(indices_2):
        row = indices_1
        col = indices_2
    else:
        row = indices_2
        col = indices_1

    return (row,col)

data = np.array(generate_random())
row,col = generate_matrix()


s=max(row)+1

a=bsr_matrix((data, (row, col)), shape=(s,s)).toarray()

plt.spy(a, markersize=3)
plt.show()
In [6]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))