12/08/2018, 14:59

Understanding Elasticsearch

Data In Elasticsearch

Elasticsearch is document oriented, meaning that it stores entire objects, or documents. It uses JSON as the serialization format for documents. Each document belongs to a type, and types live inside an index, while each document has one or more fields. Here is how this compares to a relational database:

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields
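
As a rough illustration of this hierarchy, the sketch below serializes one document to JSON and records where it would live. The index name `blog`, the type name `post`, the ID, and all field names are made up for the example.

```python
import json

# Hypothetical document: every name and value here is illustrative.
document = {
    "title": "Understanding Elasticsearch",
    "tags": ["search", "json"],
    "views": 42,
}

# A document is addressed by index, then type, then its own ID,
# much like database -> table -> row in a relational database.
address = {"index": "blog", "type": "post", "id": "1"}

# JSON is the serialization format used for the document body.
body = json.dumps(document, sort_keys=True)

# The document's fields play the role of a row's columns.
fields = sorted(document)

print(address)
print(body)
print(fields)
```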

While working with Elasticsearch you will come across the term index quite a lot, and it has different meanings depending on the context, so let's clarify this before we move on.

  • index (noun): A place where documents are stored, like a database in a relational database.
  • index (verb): To store a document in an index, much like an INSERT statement in a relational database.
  • inverted index: The structure Elasticsearch builds to make searches fast. It serves a similar purpose to a column index, such as a UNIQUE index, in a relational database: speeding up data retrieval.
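
To make the noun and verb senses concrete, here is a toy in-memory model. It is only a sketch and not how Elasticsearch stores data; the function `index_document` and all names passed to it are hypothetical.

```python
from collections import defaultdict

# "index" as a noun: a named place where documents are stored.
# Modelled here as index name -> type name -> document id -> document.
indices = defaultdict(lambda: defaultdict(dict))


def index_document(index, doc_type, doc_id, document):
    """'index' as a verb: store a document under its index, type and id.

    In this toy model, storing under an existing id overwrites the old document.
    """
    indices[index][doc_type][doc_id] = document


# Hypothetical names and data, for illustration only.
index_document("blog", "post", "1", {"title": "No anime no life"})
index_document("blog", "post", "2", {"title": "Animation video"})

print(indices["blog"]["post"]["1"])
print(sorted(indices["blog"]["post"]))
```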

Interact With Elasticsearch

Java API

Elasticsearch provides a Java API that comes with two built-in clients which you can use in your code.

  1. Node client: joins the cluster as a non-data node. It doesn’t hold any data itself, but it knows what data lives on which node in the cluster, and can forward requests directly to the correct node.
  2. Transport client: the lighter-weight transport client can be used to send requests to a remote cluster. It doesn’t join the cluster itself, but simply forwards requests to a node in the cluster.

Both of these clients talk to the cluster over port 9300. A node is a running instance of Elasticsearch, and a cluster is a group of nodes with the same cluster name that work together to share data and to provide failover and scale.

RESTful API

Besides the Java API, Elasticsearch also provides a RESTful API with JSON over HTTP on port 9200. A request has the following form:

curl -XGET 'http://localhost:9200/_search?search_type=count' -d '
{
    "query": {
        "match_all": {}
    }
}
'
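
The same request can be assembled from code. The sketch below uses only Python's standard library and builds the request without sending it. It assumes a node on localhost:9200, as in the curl example; the helper name `build_search_request` is made up, and the Content-Type header is an addition that the curl command does not set.

```python
import json
import urllib.request

# Assumes a node listening on localhost:9200, as in the curl example above.
URL = "http://localhost:9200/_search?search_type=count"


def build_search_request(url=URL):
    """Build (but do not send) the same match_all search request."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="GET",  # the curl example sends the JSON body with GET
        headers={"Content-Type": "application/json"},  # not set by the curl call
    )


request = build_search_request()
print(request.get_method(), request.full_url)

# To actually run the search against a live node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```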

Document Mapping

Elasticsearch supports the following simple field types:

  • String: string
  • Whole number: byte, short, integer, long
  • Floating point: float, double
  • Boolean: boolean
  • Date: date

Besides these core types, Elasticsearch also supports custom mappings using object notation.
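
A mapping that uses these types might look roughly like the following, written as a Python dict that serializes to JSON. The type name `post` and every field name are invented. The layout (a `properties` object per type, with nested `properties` for an inner object) follows the mapping format of Elasticsearch versions that still have the `string` type, so treat it as a sketch rather than a reference.

```python
import json

# Hypothetical mapping for a made-up "post" type; field names are illustrative.
mapping = {
    "post": {
        "properties": {
            "title": {"type": "string"},
            "views": {"type": "long"},
            "rating": {"type": "double"},
            "published": {"type": "boolean"},
            "created": {"type": "date"},
            # An inner object is described with its own nested "properties".
            "author": {
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                }
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```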

Exact values & Full Text

Data in Elasticsearch can be broadly divided into two types: exact values and full text. Exact values are exactly what they sound like. For example, the exact value "Foo" is not the same as the exact value "foo". Full text, on the other hand, refers to textual data, usually written in some human language.

Exact values are easy to query. The decision is binary: a value either matches the query or it doesn’t, similar to a WHERE clause in SQL. Querying full text data is different. Instead of asking "Does this document match the query?", it asks "How well does this document match the query?".

Full text search is what sets Elasticsearch apart from traditional SQL databases. To facilitate these types of queries on full text fields, Elasticsearch first analyzes the text, then uses the results to build an inverted index.
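
The difference between the two questions can be sketched in a few lines. The scoring function below simply counts matching terms; it is a stand-in chosen for illustration, and Elasticsearch's real relevance scoring is more involved. The function names and sample documents are made up.

```python
def exact_match(value, query):
    """Exact values: a binary decision, like a WHERE clause in SQL."""
    return value == query


def naive_relevance(text, query):
    """Full text: count how many query terms occur in the text (a toy score)."""
    terms = set(text.lower().split())
    return sum(1 for term in query.lower().split() if term in terms)


print(exact_match("Foo", "foo"))  # False: exact values are case sensitive
print(exact_match("Foo", "Foo"))  # True

docs = ["The quick brown fox", "A quick start guide", "Slow cooking recipes"]
for doc in docs:
    print(naive_relevance(doc, "quick fox"), doc)
```

The first document matches both query terms, the second matches one, and the third matches none, so the answer is a ranking and not a yes or no.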

Analysis & Analyzer

Analysis is the process of:

  1. tokenizing a block of text into individual terms suitable for use in an inverted index.
  2. normalizing these terms into a standard form to improve their “searchability” or recall.

These two steps are performed by an analyzer, which is a combination of three functions:

  1. Character filters: tidy up the string before tokenization, for example by stripping out HTML or converting "&" characters to the word "and".
  2. Tokenizer: splits the string into individual terms. A simple tokenizer might split the text whenever it encounters whitespace or punctuation.
  3. Token filters: change terms (e.g. lowercasing "Quick"), remove terms (e.g. stopwords like "a", "and", "the") or add terms (e.g. synonyms like "jump" and "leap").
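
The three stages can be chained together as in the following sketch. It mimics the idea only and is not Elasticsearch's implementation; the stopword list, the synonym table, and the function names are assumptions made for the example.

```python
import re

STOPWORDS = {"a", "and", "the"}  # illustrative stopword list
SYNONYMS = {"jump": ["leap"]}    # illustrative synonym table


def char_filter(text):
    """Tidy the string before tokenization: strip HTML tags, turn & into 'and'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Split on whitespace and punctuation, keeping runs of word characters."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Lowercase terms, drop stopwords, and add synonyms."""
    result = []
    for token in tokens:
        term = token.lower()
        if term in STOPWORDS:
            continue
        result.append(term)
        result.extend(SYNONYMS.get(term, []))
    return result


def analyze(text):
    """Run the three stages in order."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("<p>The Quick fox & the dog jump</p>"))
# ['quick', 'fox', 'dog', 'jump', 'leap']
```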

Elasticsearch ships with several built-in analyzers. The four covered here are:

  • Standard analyzer (default): It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms.
  • Simple analyzer: splits the text on anything that isn’t a letter, and lowercases the terms.
  • Whitespace analyzer: splits the text on whitespace. It doesn’t lowercase.
  • Language analyzers: available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like "and" or "the" that have little impact on relevance), which it removes, and it can stem English words because it understands the rules of English grammar.
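
The behaviour described for the simple and whitespace analyzers is easy to imitate, which makes the difference between them visible. The functions below are approximations written for this post, not the built-in analyzers themselves; the standard and language analyzers are left out because Unicode word segmentation and stemming do not fit in a few lines.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    # [^\W\d_] matches word characters that are neither digits nor
    # underscores, which leaves letters only.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


text = "Set the shape to semi-transparent by calling set_trans(5)"
print(whitespace_analyzer(text))
print(simple_analyzer(text))
```

With this input, the whitespace version keeps `semi-transparent` and `set_trans(5)` as single terms, while the simple version breaks them into `semi`, `transparent`, `set` and `trans`.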

Full text queries understand how each field is defined, so they can do the right thing. When you query a full text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for. By default, an analyzer is applied to every field of type string, which turns that field into a full text field.
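
A small sketch shows why the query string has to go through the same analyzer as the indexed text. The `analyze` function here is a stand-in that only lowercases and splits on whitespace, and the `matches` helper is hypothetical.

```python
def analyze(text):
    """Stand-in analyzer for this sketch: lowercase, then split on whitespace."""
    return text.lower().split()


# Terms stored at index time for a hypothetical full text field.
indexed_terms = set(analyze("The Quick Brown Fox"))


def matches(query, analyzed=True):
    """Check whether every query term is among the indexed terms."""
    terms = analyze(query) if analyzed else query.split()
    return all(term in indexed_terms for term in terms)


print(matches("Quick FOX"))                  # True: query terms are lowercased too
print(matches("Quick FOX", analyzed=False))  # False: "Quick" is not "quick"
```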

Inverted Index

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full text searches. An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. For example, take two documents, each containing a field with the following string:

  1. "No anime no life"
  2. "The term anime is a japanese word for animation video"

This results in the inverted index shown in the following table.

| Term      | Doc 1 | Doc 2 |
| No        |   x   |       |
| anime     |   x   |   x   |
| no        |   x   |       |
| life      |   x   |       |
| The       |       |   x   |
| term      |       |   x   |
| is        |       |   x   |
| a         |       |   x   |
| japanese  |       |   x   |
| word      |       |   x   |
| for       |       |   x   |
| animation |       |   x   |
| video     |       |   x   |
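
A table like this can be built with a few lines of code. The sketch below splits on whitespace only and does no lowercasing, so "No" and "no" stay separate terms just as they do in the table. It is a minimal illustration and not how Elasticsearch builds its index internally.

```python
from collections import defaultdict

docs = {
    1: "No anime no life",
    2: "The term anime is a japanese word for animation video",
}

# Map each unique term to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Print one line per term with the documents it appears in.
for term, doc_ids in inverted_index.items():
    print(f"{term:<10} {sorted(doc_ids)}")
```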

Now if we search for anime word, we get two results (hits) because the term anime appears in both Doc 1 and Doc 2. Because the term word appears only in Doc 2, Doc 2 matches both search terms and is ranked higher (gets a higher score) in the result list.
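
This ranking can be reproduced with a naive score that counts how many query terms each document contains. The scoring rule is an assumption made for illustration, and the real relevance calculation is more elaborate; the `search` function is hypothetical.

```python
docs = {
    1: "No anime no life",
    2: "The term anime is a japanese word for animation video",
}


def search(query):
    """Return (doc_id, score) pairs, best match first.

    The score is the number of query terms found in the document.
    """
    scores = {}
    for doc_id, text in docs.items():
        doc_terms = set(text.split())
        score = sum(1 for term in query.split() if term in doc_terms)
        if score:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(search("anime word"))  # [(2, 2), (1, 1)]
```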

Conclusion

In this part we covered basic Elasticsearch concepts: how it stores and structures data, and how analysis is performed on the data to create an inverted index, which is later used to run full text queries. In the next part we will dive into searching and take a look at the Query DSL, which is a powerful feature of Elasticsearch.
