From database optimisation to data science – (part 2,MongoDB exemple)

In this serie of articles I will try to explain some tuning technics and the problems that we are still facing today, but with the idea to propose a new way of thinking, approaching to data scientist’s way.

Also, I will always post the script used so can play with this yourself.

Let’s start with the Library exemple, that is selling DVDs and products.

The figures are: 10 000 articles in total

50% of books

50% of DVDs

50% of products in English langage

50% of products in  French langage

So what is the fraction of rows when the langage is English and the product is a DVD?

Let’s execute our MongoDB script:

use orders;

for(var i=0;i<4990;i++) {db.test_orders.insert({“product”:”DVD”, “language”:”english”,”ordid”:i,”text”:”DDDDDDDDDDDDDDDDDDDDDDDD”});}
for(var i=4990;i<5001;i++) { db.test_orders.insert({“product”:”DVD”, “language”:”french”,”ordid”:i,”text”:”DDDDDDDDDDDDDDDDDDDDDDDD”});}
for(var i=5001;i<9990;i++) { db.test_orders.insert({“product”:”book”, “language”:”french”,”ordid”:i,”text”:”BOOOOOOOOOOOOOOOOOOOOOOOOOK”});}
for(var i=9990;i<10000;i++) { db.test_orders.insert({“product”:”book”, “language”:”english”,”ordid”:i,”text”:”BOOOOOOOOOOOOOOOOOOOOOOOOK”});}

db.test_orders.createIndex({product:1,language:2}).
db.test_orders.find({“product”:”DVD”,”language”:”french”}).explain(“executionStats”)
db.test_orders.find({“product”:”DVD”,”language”:”english”}).explain(“executionStats”)
db.test_orders.find({“product”:”book”,”language”:”french”}).explain(“executionStats”)
db.test_orders.find({“product”:”book”,”language”:”english”}).explain(“executionStats”)

db.test_orders.find({“product”:”DVD”,”language”:”french”}).explain(“executionStats”)
db.test_orders.find({“product”:”DVD”,”language”:”english”})..hint({$natural:1}).explain(“executionStats”)
db.test_orders.find({“product”:”book”,”language”:”french”})..hint({$natural:1}).explain(“executionStats”)
db.test_orders.find({“product”:”book”,”language”:”english”}).explain(“executionStats”)

db.collection.find().hint({$natural:1})

Contrary to Oracle and SQL Server, (https://jastrebicdragutin.wordpress.com/2020/12/21/from-database-optimisation-to-data-science/) , MongoDB did a good estimation.

Hower, without human intervention (giving a hint to force a full scan that MongoDB is calling the COLLSCAN) , it was not able to use the index scan for the small fraction and the full scan for the big fraction. So, without it choose the wrong execution path, probably because MongoDB optimizer is still not enough sophisticated. Using the index whenever it exists is the old-fashioned way the classical RDBMS use to do with their rule based optimizer (without using the statistics in the desicion of the execution path).

But, again the same question:

What could be a data scientist way of thinking on this ?

We could use the popular Bayes theorem:

P(A|B)= P(B|A)* P(A)/P(B)

https://en.wikipedia.org/wiki/Bayes%27_theorem

P(product|langage)=P(language|product)* p(product)/p(language) P(DVD|french)=P(french|DVD)*P(DVD)/P(french)

P(french|DVD)=10%,

P(DVD)=50%,

P(french)=50%

P(DVD|french)=10%

The Bayes theorem gives the direct answer.

See the first article related to the topic:

From database optimisation to data science – (part 1, Oracle and SQL Server exemple)

To be continued …

Leave a comment