From database optimisation to data science – (part 1, Oracle and SQL Server example)

In this series of articles I will try to explain some tuning techniques and the problems we still face today, with the idea of proposing a new way of thinking, closer to a data scientist's way of seeing things.

Also, I will always post the scripts used, so you can play with this yourself.

Let's start with the library example: a shop selling books and DVDs.

The figures are:

10 000 articles in total

50% books

50% DVDs

If you look carefully, you will also see that 50% of the products are in English and 50% are in French.

So what is the fraction of rows where the language is English and the product is a DVD?

Let’s play with our script:

alter session set container=plug1;

alter session set current_schema=orders_owner;

drop table test_orders;
create table test_orders (product varchar2(30),language varchar2(30),id number);

begin
  for x in 1..5000 loop
    insert into test_orders values('books', null, x);
  end loop;
  commit;
end;
/
begin
  for x in 5001..10000 loop
    insert into test_orders values('DVD', null, x);
  end loop;
  commit;
end;
/
prompt Big fractions
update test_orders
set language='french'
where id between 1 and 4900;
update test_orders
set language='english'
where id between 5001 and 9900;
prompt Small fractions
update test_orders
set language='english'
where id between 4901 and 5000;
update test_orders
set language='french'
where id between 9901 and 10000;
commit;
create index ix1 on test_orders(product);
create index ix2 on test_orders(language);

execute dbms_stats.gather_table_stats(ownname=>'ORDERS_OWNER',method_opt=>'for all columns size auto',tabname=>'TEST_ORDERS');

explain plan for select * from test_orders where language='french' and product='books';

@%ORACLE_HOME%\rdbms\admin\utlxpls

explain plan for select * from test_orders where language='english' and product='DVD';

@%ORACLE_HOME%\rdbms\admin\utlxpls

execute dbms_stats.gather_table_stats(ownname=>'ORDERS_OWNER',method_opt=>'for all columns size auto',tabname=>'TEST_ORDERS');

explain plan for select * from test_orders where language='french' and product='DVD';

@%ORACLE_HOME%\rdbms\admin\utlxpls

explain plan for select * from test_orders where language='english' and product='books';

@%ORACLE_HOME%\rdbms\admin\utlxpls

What is the result? In all 4 cases, the optimizer estimates 2500 rows, because it simply multiplies the two selectivities:

P(0.5) × P(0.5) = P(0.25), i.e. 2500 of 10 000 rows

Simple! 2500 rows!

But this is WRONG for the correlated combinations. This is the well-known problem of correlated columns: the simple calculation P(A) × P(B) only holds when A and B are independent, which is not the case here.
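Outside the database, the optimizer's reasoning is easy to reproduce. The sketch below (plain Python, not part of the original scripts) rebuilds the same 10 000 rows and compares the independence estimate P(A) × P(B) with the actual count:

```python
# Rebuild the same 10,000-row data set and compare the optimizer's
# independence estimate with the actual number of french DVDs.
rows = []
for i in range(1, 10001):
    product = 'books' if i <= 5000 else 'DVD'
    if i <= 4900:
        language = 'french'      # big fraction: french books
    elif i <= 5000:
        language = 'english'     # small fraction: english books
    elif i <= 9900:
        language = 'english'     # big fraction: english DVDs
    else:
        language = 'french'      # small fraction: french DVDs
    rows.append((product, language))

total = len(rows)
p_dvd = sum(1 for p, l in rows if p == 'DVD') / total        # 0.5
p_french = sum(1 for p, l in rows if l == 'french') / total  # 0.5

# What a naive optimizer estimates under the independence assumption:
estimate = p_dvd * p_french * total   # 2500.0 rows
# What is actually in the data:
actual = sum(1 for p, l in rows if p == 'DVD' and l == 'french')  # 100 rows
print(estimate, actual)
```

The estimate is 25 times too high for the small fraction, which is exactly the plan we saw above.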

Let's see the SQL Server example:

CREATE TABLE [dbo].[test_orders](
[id] [int] IDENTITY(1,1) NOT NULL,
[product] [varchar](30) NULL,
[language] [varchar](30) NULL
) ON [PRIMARY]
GO

DECLARE @counter INT = 1;
WHILE @counter <= 5000
BEGIN
insert into test_orders(product, language) values('books', null);
SET @counter = @counter + 1;
END
GO

DECLARE @counter INT = 5001;
WHILE @counter <= 10000
BEGIN
insert into test_orders(product, language) values('DVD', null);
SET @counter = @counter + 1;
END
GO

-- Big fractions
update test_orders
set language='french'
where id between 1 and 4900;
update test_orders
set language='english'
where id between 5001 and 9900;
-- Small fractions
update test_orders
set language='english'
where id between 4901 and 5000;
update test_orders
set language='french'
where id between 9901 and 10000;

create index ix1 on test_orders(product,language);

select * from test_orders where language='english' and product='books';

select * from test_orders where language='english' and product='DVD';

select * from test_orders where product='books' and language='french';

select * from test_orders where product='DVD' and language='french';

We see the same estimation problem with SQL Server: it also estimates 2500 rows for each combination …

With Oracle, we can use extended statistics or dynamic sampling to solve this problem. In our example we used dynamic sampling, by simply deleting the statistics with dbms_stats.delete_stats(ownname=>'ORDERS_OWNER', tabname=>'TEST_ORDERS'); without statistics, the estimation is much better for the small fraction (about 100 rows).

and for the big fraction as well:

With SQL Server, one solution is to create filtered indexes, i.e. indexes with a WHERE clause:

create index ix2 on test_orders(product,language)
where product='DVD' and language='english';

create index ix3 on test_orders(product,language)
where product='books' and language='french';

create index ix4 on test_orders(product,language)
where product='DVD' and language='french';

create index ix5 on test_orders(product,language)
where product='books' and language='english';

We now see a better estimation for both the small and the big fraction: SQL Server builds a separate statistics object for each filtered index, so each combination gets its own row count.

But now the question:

What would be a data scientist's way of thinking about this?

We could use the well-known Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)

https://en.wikipedia.org/wiki/Bayes%27_theorem

P(product|language) = P(language|product) × P(product) / P(language)

P(DVD|french) = P(french|DVD) × P(DVD) / P(french)

P(french|DVD) = 100/5000 = 2% (100 french DVDs out of 5000 DVDs),

P(DVD) = 50%,

P(french) = 50%

P(DVD|french) = 2% × 50% / 50% = 2%, i.e. 2% of the 5000 french rows, about 100 rows.

So the Bayes formula gives the correct answer directly.
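Using the counts produced by the script (5000 DVDs, of which 100 are french), the Bayes calculation can be checked in a few lines of Python; this is a sketch, not part of the original scripts:

```python
# Bayes' theorem: P(DVD | french) = P(french | DVD) * P(DVD) / P(french)
p_french_given_dvd = 100 / 5000   # 100 french DVDs out of 5000 DVDs = 2%
p_dvd = 5000 / 10000              # half of the rows are DVDs
p_french = 5000 / 10000           # half of the rows are in french

p_dvd_given_french = p_french_given_dvd * p_dvd / p_french
print(p_dvd_given_french)                      # 0.02 -> 2% of the french rows
print(p_dvd_given_french * p_french * 10000)   # about 100 rows expected
```

Note that because P(DVD) and P(french) are both 50% they cancel out, so the conditional probability is entirely driven by the correlation between the two columns.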

One may ask: why are today's databases not able to resolve this kind of problem automatically, without human intervention?

While waiting for optimizers to become smarter (no doubt it will happen, sooner or later), a useful thing we can do is to adopt a more data-scientist approach and learn more about our data distribution.

The classical way of thinking when tuning is the following:

Oracle: adjust SGA, PGA, parallelism, create indexes, create materialized views …

SQL Server: adjust parameters with sp_configure, adjust parallelism, create/rebuild indexes etc.

Every database has its own parameters to tune memory, disks, I/O, CPU … Those techniques are of course still needed, but if you tune while really understanding your data, that is, understanding

a) cardinalities,

b) correlation,

c) dispersion and

even

d) causalities!

inside your data, then you will be able to tune almost every database!
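As a first data-scientist-style habit, the correlation between two categorical columns can be measured from their contingency table, for example with Cramér's V. The sketch below is our own illustration in plain Python (the cramers_v helper is hypothetical, not a library function), applied to the product/language pairs of our example:

```python
from collections import Counter
from math import sqrt

def cramers_v(pairs):
    """Cramér's V for two categorical variables given as (a, b) pairs.
    0 means independent, values near 1 mean strongly correlated."""
    n = len(pairs)
    joint = Counter(pairs)
    a_counts = Counter(a for a, _ in pairs)
    b_counts = Counter(b for _, b in pairs)
    # Chi-squared statistic over the full contingency table
    chi2 = 0.0
    for a in a_counts:
        for b in b_counts:
            expected = a_counts[a] * b_counts[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(a_counts), len(b_counts)) - 1
    return sqrt(chi2 / (n * k))

# product/language pairs with the same counts as our test_orders table
pairs = ([('books', 'french')] * 4900 + [('books', 'english')] * 100 +
         [('DVD', 'english')] * 4900 + [('DVD', 'french')] * 100)
print(cramers_v(pairs))  # close to 1 -> far from independent
```

A value this close to 1 is exactly the warning sign that the P(A) × P(B) estimate will be badly wrong, and that extended statistics, dynamic sampling or filtered indexes are worth the effort.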

SQL or NoSQL, all of them share similar principles, so once you learn them, you will be able to tune them all …

Through these examples, you will learn:

  • the basic principles of every optimizer, be it a classical RDBMS (Oracle, SQL Server, PostgreSQL, MySQL etc.) or a NoSQL system (MongoDB, Cassandra, Hadoop etc.); we will use several databases in our examples.
  • If you are a database specialist, you will become a better tuner and take a step toward becoming a data scientist.
  • If you are a data scientist, you will be able to understand more about databases.

To be continued …
