在Elasticsearch中, 需要搞清楚几个名词,如segment/doc/term/token/shard/index等, 其实segment/doc/term/token都是lucene中的概念。这样有助于更深入的了解和使用ES。
document: 索引和搜索的主要数据载体,对应写入到ES中的一个doc。
field: document中的各个字段。
term: 词项,搜索时的一个单位,代表文本中的某个词。
token: 词条,词项(term)在字段(field)中的一次出现,包括词项的文本、开始和结束的位移、类型等信息。
Lucene内部使用的是倒排索引的数据结构, 将词项(term)映射到文档(document)。
例如:某3个文档,假设某个字段的文本如下
ElasticSearch Server (文档1)
Matering ElasticSearch (文档2)
Apache solr 4 Cookbook (文档3)
term | 次数 | doc id |
4 | 1 | 3 |
Apache | 1 | 3 |
Cookbook | 1 | 3 |
ElasticSearch | 2 | 1,2 |
Matering | 1 | 1 |
Server | 1 | 1 |
solr | 1 | 3 |
index: 在ES中类似数据库中db
shard:
A "shard" is an instance of Lucene. It is a fully functional search engine in its own right. An "index" could consist of a single shard, but generally consists of several shards, to allow the index to grow and to be split over several machines.
A "primary shard" is the main home for a document. A "replica shard" is a copy of the primary shard that provides (1) failover in case the primary dies and (2) increased read throughput
segment:
Each shard contains multiple "segments", where a segment is an inverted index. A search in a shard will search each segment in turn, then combine their results into the final results for that shard.
While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety) then every second or so, writes a new small segment to disk, and "refreshes" the search.
This makes the data in the new segment visible to search (ie they are "searchable"), but the segment has not been fsync'ed to disk, so is still at risk of data loss.
Every so often, Elasticsearch will "flush", which means fsync'ing the segments, (they are now "committed") and clearing out the transaction log, which is no longer needed because we know that the new data has been written to disk.
The more segments there are, the longer each search takes. So Elasticsearch will merge a number of segments of a similar size ("tier") into a single bigger segment, through a background merge process. Once the new bigger segment is written, the old segments are dropped. This process is repeated on the bigger segments when there are too many of the same size.
Segments are immutable. When a document is updated, it actually just marks the old document as deleted, and indexes a new document. The merge process also expunges these old deleted documents.
参考:
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html
http://stackoverflow.com/questions/15426441/understanding-segments-in-elasticsearch
相关推荐
Elasticsearch-深入理解索引原理
ElasticSearch中文文档(新版)
ElasticSearch Java API 中文文档 ElasticSearch Java API 中文文档
赠送jar包:elasticsearch-6.8.3.jar; 赠送原API文档:elasticsearch-6.8.3-javadoc.jar; 赠送源代码:elasticsearch-6.8.3-sources.jar; 赠送Maven依赖信息文件:elasticsearch-6.8.3.pom; 包含翻译后的API文档...
赠送jar包:elasticsearch-6.3.0.jar; 赠送原API文档:elasticsearch-6.3.0-javadoc.jar; 赠送源代码:elasticsearch-6.3.0-sources.jar; 赠送Maven依赖信息文件:elasticsearch-6.3.0.pom; 包含翻译后的API文档...
(狂神)ElasticSearch快速入门笔记,ElasticSearch基本操作以及爬虫(Java-ES仿京东实战),包含了小狂神讲的东西,特别适合新手学习,笔记保存下来可以多看看。好记性不如烂笔头哦~,ElasticSearch,简称es,es是一个...
学习elasticsearch,决定把自己用过的整成中文,已整理一部份,虽然不尽人意,但也尽力,有query dsl与一部分API,后续整理完了会继续更新。
elasticsearch elasticsearch-6.2.2 elasticsearch-6.2.2.zip 下载
深入理解ElasticSearch PDF 深入理解ElasticSearch PDF
ElasticSearch官网文档中文版
Elasticsearch电商平台中文分词词库
一、概述 一般来说我们开发Elasticsearch会选择...2、elasticsearch-head (方便查看ES中的索引及数据) 3、Kibana(方便开发通过rest api 调试ES,有代码提示) 4、中文分词elasticsearch-analysis-ik (ik) 1、下载ela
Logstash 和 Beats 有助于收集、聚合和丰富您的数据并将其存储在 Elasticsearch 中。Kibana 使您能够以交互方式探索、可视化和分享对数据的见解,并管理和监控堆栈。 Elasticsearch 为所有类型的数据提供近乎实时的...
赠送jar包:elasticsearch-6.2.3.jar; 赠送原API文档:elasticsearch-6.2.3-javadoc.jar; 赠送源代码:elasticsearch-6.2.3-sources.jar; 赠送Maven依赖信息文件:elasticsearch-6.2.3.pom; 包含翻译后的API文档...
Elasticsearch用于云计算中,能够达到实时搜索,稳定,可靠,快速,安装使用方便。官方客户端在Java、.NET(C#)、PHP、Python、Apache Groovy、Ruby和许多其他语言中都是可用的。根据DB-Engines的排名显示,Elastic...
spring-data-elasticsearch中文使用文档,spring-data-elasticsearch、elasticsearch、ES、ElasticSearch、ES中文教程
elasticsearch-analysis-ik 是一个常用的中文分词器,在 Elasticsearch 中广泛应用于中文文本的分析和搜索。下面是 elasticsearch-analysis-ik 分词器的几个主要特点: 中文分词:elasticsearch-analysis-ik 是基于...
docker run --name elasticsearch7.16.3 -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" -v /Users/xingyue/Home/xingyue/学习/工程化/es/elasticsearch.yml:/usr/share/elastic...
消费kafka数据,然后批量导入到Elasticsearch,本例子使用的kafka版本0.10,es版本是6.4,使用bulk方式批量导入到es中,也可以一条一条的导入,不过比较慢。 <groupId>org.elasticsearch <artifactId>elastic...
ElasticSearch是一个基于Lucene的搜索服务器。它提供了一个分布式多用户能力的全文搜索引擎,基于RESTful web接口。Elasticsearch是用Java开发的,并作为Apache许可条款下的开放源码发布...架构图方便理解Elasticsearch