Text Similarity : Python-sklearn on MongoDB Collection

Text Similarity : Python-sklearn on MongoDB Collection
Check out some Python code that can calculate the similarity of an indexed field between all the documents of a MongoDB collection.

Originally published by Anis Hajri at dzone.com


In this article, I set up a Python script that allows us to calculate the similarity of an indexed field between all the documents of a MongoDB collection. In the process I parallelized the executions on four threads to improve performance.

The script is detailed below, I hope it will be useful.

Python Script

import multiprocessing
import threading
import json, sys
import pymongo
import nltk, string

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
class SimilarityThread (threading.Thread):
   def __init__(self, threadID, data_array, totalSize, similarity_collection,startIndex):
   self.threadID = threadID
   self.data_array = data_array
   self.totalSize = totalSize
   self.similarity_collection = similarity_collection
   self.startIndex = startIndex
   def run(self):
      clacluateSimilarity( self.data_array, self.totalSize, self.similarity_collection,self.startIndex)

def clacluateDistance(txt1,txt2):
return euclidean_distances(txt1,txt2)[0][0]

def clacluateSimilarity( data_array, totalSize, similarity_collection, startIndex):
vectorizer = CountVectorizer()
for idx in range(startIndex,totalSize):
h = data_array[idx]
for idx1 in range((idx+1),totalSize):
h1 = data_array[idx1]
hSimilarity = {}
corpus = []
features = vectorizer.fit_transform(corpus).todense()
distance = clacluateDistance(features[0],features[1])
hSimilarity['distance'] = distance
if distance < 4:
print("Distance ====> %d " % distance)

def processTextSimilarity(totalSize, data_array,similarity_collection):

num_cores = multiprocessing.cpu_count()
print(":::num cores ==> %d " % num_cores)
threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4"]
threadID = 1;
rootIndex = round(totalSize/4)
startIndex = 0
for tName in threadList:
thread = SimilarityThread(threadID, data_array, startIndex+rootIndex, similarity_collection,startIndex)
threadID += 1

# Wait for all threads to complete
for t in threads:

def main():
print('****** Text Similarity::start ******')
connection = pymongo.MongoClient("mongodb://localhost")
db = connection.kalamokomnoor
article = db.article
article_similarity = db.article_similarity

data_array = article.find({}).sort("id",pymongo.ASCENDING)
totalSize =  article.count_documents({}) 

print('###### :: totalSize : %d ' % totalSize)


print('****** Text Similarity::Ending ******')

if __name__ == '__main__':

Originally published by Anis Hajri at dzone.com


Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Complete Python Bootcamp: Go from zero to hero in Python 3

☞ Python for Time Series Data Analysis

☞ Python Programming For Beginners From Scratch

☞ MongoDB - The Complete Developer’s Guide

☞ The Complete Developers Guide to MongoDB

☞ Python Network Programming | Network Apps & Hacking Tools

☞ Intro To SQLite Databases for Python Programming

☞ Ethical Hacking With Python, JavaScript and Kali Linux

☞ Beginner’s guide on Python: Learn python from scratch! (New)

☞ Python for Beginners: Complete Python Programming

☞ Learn MongoDB : Leading NoSQL Database from scratch

☞ MongoDB Essentials - Complete MongoDB Guide

☞ Introduction to MongoDB