Technology Refreshed: 2014

May 7, 2014

What is an index in database? How does it work? What is its use?

Let’s begin with explaining why do we need a database index by going through a very simple example. Suppose that we have a database table called Student with three columns – Student_Name, Student_Age, and Scorecard. Assume that our Student table has thousands of rows.

Now, let’s say that we want to run a query to find all the details of any students who are named ‘John’? So, we decide to run a simple query like this:

SELECT * FROM Student 
WHERE Student_Name = 'John'

What would happen without an index on the table?

Once we run that query, what exactly goes on behind the scenes to find students who are named John? Well, the database software would literally have to look at every single row in the Student table to see if the Student_Name for that row is ‘John’. And, because we want every row with the name ‘John’ inside it, we can not just stop looking once we find just one row with the name ‘John’, because there could be other rows with the name ‘John’. So, every row up until the last row must be searched – which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name‘John’. This is what is called a full table scan.

How a database index can help improving performance

You might be thinking that doing a full table scan sounds inefficient for something so simple – shouldn’t software be smarter? It’s almost like looking through the entire table with the human eye – very slow and not at all sleek. But, as you probably guessed by the title of this article, this is where indexes can help a great deal.

The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined.

What is an index?

So, what is an index? Well, an index is a data structure (most commonly a B-tree) that stores the values for a specific column in a table. An index is created on a column of a table. So, the key points to remember are that an index consists of column values from one table, and that those values are stored in a data structure. The index is a data structure – remember that.

What kind of data structure is an index?

B-trees are the most commonly used data structures for indexes. The reason B-trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B-tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS’s, you can actually specify which data structure you want your database to use when you create the index itself.

How does a hash table index work?

Hash tables are another data structure that you may see being used as indexes – these indexes are commonly referred to as hash indexes. The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index. For instance, the query we discussed earlier (SELECT * FROM Student WHERE Student_Name = ‘John’) could benefit from a hash index created on the Student_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like “John => 0×29856″, where 0×29856 is a reference to the table row where John is stored in memory. Looking up a value like “John” in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of “John” in the Student_Name column.

The disadvantages of a hash index

Hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the students who are less than 12 years old. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality (like “WHERE Student_name = ‘John’”). What is implied in the key value mapping in a hash table is the concept that the keys of a hash table are not sorted or stored in any particular order. This is why hash indexes are usually not the default type of data structure used by database indexes – because they aren’t as flexible as B-trees when used as the index data structure.

What are some other types of indexes?

Indexes that use a R-tree data structure are commonly used to help with spatial problems. For instance, a query like “Find all of the Starbucks within 2 kilometers of me” would be the type of query that could show enhanced performance if the database table uses a R-tree index.
Another type of index is a bitmap index, which work well on columns that contain Boolean values (like true and false), but many instances of those values – basically columns with low selectivity.

How does an index improve performance?

Because an index is basically a data structure that is used to store column values, looking up those values becomes much faster. And, if an index is using the most commonly used data structure type – a B-tree – then the data structure is also sorted. Having the column values be sorted can be a major performance enhancement – read on to find out why.
Let’s say that we create a B-tree index on the Student_Name column. This means that when we search for students named “John” using the SQL we showed earlier, then the entire Student table does not have to be searched to find students named “John”. Instead, the database will use the index to find students named John, because the index will presumably be sorted alphabetically by the Student’s name. And, because it is sorted, it means searching for a name is a lot faster because all names starting with a “J” will be right next to each other in the index! It’s also important to note that the index also stores pointers to the table row so that other column values can be retrieved – read on for more details on that.

What exactly is inside a database index?

So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Student_Name column, this means that the Student_Age and Scorecard column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.

An index also stores a pointer to the table row

So, the question is if the value that we are looking for is found in an index (like ‘John’) , how does it find the other values that are in the same row (like the scorecard of John and his age)? Well, it’s quite simple – database indexes also store pointers to the corresponding rows in the table. A pointer is just a reference to a place in memory where the row data is stored on disk. So, in addition to the column value that is stored in the index, a pointer to the row in the table where that value lives is also stored in the index. This means that one of the values (or nodes) in the index for an Student_Name could be something like (“John”, 0×89856), where 0×89856 is the address on disk (the pointer) where the row data for “John” is stored. Without that pointer all you would have is a single value, which would be meaningless because you would not be able to retrieve the other values in the same row – like the scorecard and the age of a student.

How does a database know when to use an index?

When a query like “SELECT * FROM Student WHERE Student_Name = ‘John’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Student_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.

Can you force the database to use an index on a query?

Generally, you will not tell the database when to actually use an index – that decision will be made by the database itself. Although it is worth noting that in most databases (like Oracle and MySQL), you can actually specify that you want the index to be used.

How to create an index in SQL:

Here’s what the actual SQL would look like to create an index on the Student_Name column from our example earlier:

CREATE INDEX name_index
ON Student (Student_Name)

How to create a multi-column index in SQL:

We could also create an index on two of the columns in the Student table , as shown in this SQL:

CREATE INDEX name_index
ON Student (Student_Name, Student_Age)

What is a good analogy for a database index?

A very good analogy is to think of a database index as an index in a book. If you have a book about dogs and you are looking for the section on Golden Retrievers, then why would you flip through the entire book – which is the equivalent of a full table scan in database terminology – when you can just go to the index at the back of the book, which will tell you the exact pages where you can find information on Golden Retrievers. Similarly, as a book index contains a page number, a database index contains a pointer to the row containing the value that you are searching for in your SQL.

What is the cost of having a database index?

So, what are some of the disadvantages of having a database index? Well, for one thing it takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.
As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.

Jan 28, 2014

Creating Singleton class

First let us understand what is singleton pattern.

Singleton pattern is a design pattern that restricts the instantiation of a class to one object. i.e. You can create only one instance for that class and no more instance can be created. So whenever you request for an instance of that class, same instance which ever created initially is retrieved. One best example of singleton class is Logger class.

Now the question is, how to create a singleton class. Of course, there are lot of ways to create such class. Now let's see one of those methods.

1. First of all, you shouldn't allow anyone to create a new instance, i.e. the creation of instance should be restricted to any external entity. Your class should take care of creating instance for this class.

First restriction is to put on constructor level. i.e. Make your constructor as private. So any other class will not have access to this class' constructor.

class MyLogger{

   private MyLogger(){
      //Code
   }
}

2. If you cannot create an instance, then how will you get the instance of this class and how will you access other methods in this class.

So we have to look for other options to access this class. We all know that static methods of a class can be accessed without it's instance. i.e. We can access all static methods using ClassName itself.

Now create a public static method that retrieves the instance for this particular class when ever some one requests for an instance.

Use a private static variable to store the class instance as following,

private static MyLogger logger = null;

//Static method (make sure your method is public,
//so it can be accessed from other classes)

public static MyLogger getInstance(){
      if(logger == null){ //logger is null only for first time.
           logger = new MyLogger();
      }

    return logger;
}

Here we can see the static method getInstance() checks for instance & creates a new one if there is no existing instance available. If it finds an existing instance, it just returns the same.

But if we take a closer look, this method doesn't seem to be thread safe. How? Consider there are no instance created for this class. Two new threads executes this method to create instance at the same time. Say, thread1 enters into if(logger==null) condition & before it creates the instance, thread2 also checks the "if" condition. The condition is true and this thread (thread2) too enters into the "if" block. Now these 2 threads are inside the "if" block & now imagine what happens. Yes, you are right. 2 instances of MyLogger is created, which violates the singleton rule.

How can we handle this? Let's keep the instance creation part into synchronized block as given below, so multi thread access is restricted at same time.

public static MyLogger getInstance(){
      if(logger == null){
           synchronized(MyLogger.class){
                  logger = new MyLogger();
           }
      }

    return logger;
}

Does this modification resolve the multiple-instance-creation issue? Nope. Why? Think if thread1 & thread2 both are into "if" block. Now thread1 enters into synchronized block, acquires the monitor, locks the class & creates the instance. thread2 waits till thread1 releases the monitor & as it is already inside "if" block, it now enters into synchronized block and creates another instance. How can we handle this?

We have a concept called double-checked-locking, where the check happens before synchronized block & inside synchronized block as given below.

public static MyLogger getInstance(){
      if(logger == null){
           synchronized(MyLogger.class){
                 if(logger == null){
                       logger = new MyLogger();
   }
           }
      }

    return logger;
}

Does it solve the issue? Of course yes. Only the thread that enters first creates the instance & rest of the threads get the already-created instance.

We can get a question here. Why can't we synchronize the whole method itself, while the synchronized block we mentioned above also locks the class anyway?

It can be done, but if we take a closer look, we can note that the synchronized static method always locks the class (Refer this link) whenever some class requests an instance, which means, the class is locked whenever an instance is requested. But with the double-checked-locking mechanism, the class is locked only once during first instance creation & rest of the calls won't go through the synchronized block.

We can also have other way of creating singleton class as given below.

class MyLogger {
private static MyLogger logger = new MyLogger();

private MyLogger(){ }

public static MyLogger getInstance(){
return logger;
}

}

This one is called eager loading, where static blocks will be loaded during class load itself. i.e. even before the request for an instance is arrived.

But our previous example creates instance only when some class requests for it. This one is called Lazy loading. Anyhow, both ways of creating singleton class serves its purpose.

Jan 25, 2014

Synchronization on static method and instance method

Before getting into synchronization concept, lets first understand what is monitor.

A monitor is a kind of lock that each class owns. And that monitor, at any time can be held only by a single Thread. When a class is instantiated, each instance is allocated with one monitor.

Consider our class is Student.java

public class Student{

public void getStudentName(){
//Code
}

synchronized(this){
//code
}
}

When some thread tries to execute the synchronized method/block, first the thread acquires the monitor that belongs to that instance so that the monitor is not available for any other thread of same instance till this thread completes its task and releases the monitor. The monitor could also be released by putting the thread into wait() status. Note here, if a thread acquires the monitor of an instance, then any synchronized block in that instance cannot be accessed by any thread. But anyhow, non synchronized block is still available for access.

Threads of other instances have access to this synchronized block as each object has its own monitor.

Static synchronized block:

If our Student class has a static synchronized block & any thread try to access this block, the monitor belongs to the class itself is acquired by the thread that executes the static block. So any thread on any other instances of this class has to wait till this monitor is released. So be cautious while you develop a static synchronized method.

static Synchronized block is given below.

public static synchronized void getStudentDetails(){
//Code here
}

A class level lock can be made even by following way. They both have similar effect.

public static void getStudentDetails(){
synchronized(Student.class){
//Code here
}
}

Java Thread: Why do we always call start() method, not run() method in thread

We can see both public void run() and public synchronized void start() are methods of Thread class but we always use thread.start() to invoke the thread rather than using thread.run(). Why? The reason is explained below.

When we invoke a thread using start() method, a new thread stack is created and the run() method starts executing in a new thread stack. But if we call the run() method directly, the new thread stack is not created, rather the run() method executes in current thread stack. The following example clearly tells the difference between calling start() & run() methods.

MyThreadClass.java

public class MyThreadClass implements Runnable {

    @Override
    public void run() {
       System.out.println(Thread.currentThread().getName()+" is running.");
    }
}

MainClass.java

public class MainClass {

   public static void main(String[] args) {
       Thread.currentThread().setName("Main Thread");
       System.out.println("This is main thread: "+Thread.currentThread().getName());

       MyThreadClass myThread = new MyThreadClass();
       Thread t = new Thread(myThread, "My Thread");
       t.start();
    }
}

Output:

This is main thread: Main Thread //This is instance of main thread.
My Thread is running. //This is new thread instance.

If we call run() method directly, we can clearly see that the output changes as follows:

Output:

This is main thread: Main Thread //This is instance of main thread.
Main Thread is running. //run() is also invoked in main thread stack.