Summary of Elasticsearch usage at my work

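ES optimization plan

ES investigation:

1. Taking 192.168.130.21 as an example, the elastic service currently uses 35.76GB (57.1%) of the server's memory

Heap: Sizing and Swapping | Elasticsearch: The Definitive Guide | Elastic

The server has 62GB of RAM and MySQL uses 8.4GB. The official recommendation is to give ES 50% of the available memory (27GB) and leave the rest to the Lucene cache.

According to the figure, the heap currently configured for the elastic service is 32GB, which exceeds 27GB.

The smaller the heap, the better both Elasticsearch (faster GC) and Lucene (more memory for caching) perform.

The official guide gives an example: on a 64GB server, giving ES a 32GB heap runs right into the compressed object pointer threshold, so the safest choice is a 31GB heap.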
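
The heap-sizing arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the function name recommended_heap_gb and the 31GB cap constant are illustrative names built from the figures quoted in these notes, not an official formula.

```python
# Rough heap-size estimate from the figures in these notes (illustrative only).
COMPRESSED_OOPS_SAFE_CAP_GB = 31.0  # the "safest" heap size quoted above


def recommended_heap_gb(total_gb: float, other_services_gb: float) -> float:
    """Return half of the memory left after other services, capped at 31GB."""
    available = total_gb - other_services_gb
    return min(available / 2, COMPRESSED_OOPS_SAFE_CAP_GB)


# 62GB server with MySQL using 8.4GB, as on 192.168.130.21
heap = recommended_heap_gb(total_gb=62, other_services_gb=8.4)
print(f"suggested heap: {heap:.1f} GB")  # suggested heap: 26.8 GB
```

The result rounds to the 27GB mentioned above, which is below the 32GB currently configured.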

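2. There are currently 4 primary shards + 4 replica shards, each shard about 225GB

The official community recommends keeping each shard within 10-50GB

Optimization suggestion 1: split into 45 shards. With the current data volume of 900GB, each shard would hold about 20GB

Conclusion: start with 30 shards (about 30GB per shard)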
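
The shard-size numbers above follow from dividing the data volume by the number of primary shards. A small sketch, assuming the 900GB total and the 10-50GB guideline from these notes (the helper name shard_size_gb is made up for illustration):

```python
def shard_size_gb(total_data_gb: float, primary_shards: int) -> float:
    """Average data per primary shard, assuming documents spread evenly."""
    return total_data_gb / primary_shards


TOTAL_DATA_GB = 900  # current data volume from these notes

for shards in (4, 30, 45):
    size = shard_size_gb(TOTAL_DATA_GB, shards)
    within_guideline = 10 <= size <= 50  # community guideline quoted above
    print(f"{shards} shards: {size:.0f} GB each, within 10-50GB: {within_guideline}")
```

With 4 shards the size is far above the guideline, while both 30 and 45 shards fall inside it.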

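3. The home page does not offer global search and only shows a static page, similar to NCBI; search results are shown only after the user enters a query

4. Hold off on splitting databases and tables for now, since the code changes would be fairly large

5. Aggregation operations need to be optimized separately

Improving import performance

Trade data durability and search freshness for write throughput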

{
  "settings": {
    "index": {
      "refresh_interval": "-1",
      "number_of_shards": "30",
      "number_of_replicas": 0,
      "translog": {
        "sync_interval": "60s",
        "durability": "async"
      },
      "routing": {
        "allocation": {
          "total_shards_per_node": 15
        }
      },
      "mapping": {
        "total_fields": {
          "limit": 2000
        }
      }
    },
    "mapper": {
      "dynamic": false
    }
  }
}

PUT http://192.168.130.21:9200/nucbank1/_settings

{
  "index": {
    "refresh_interval": "-1",
    "number_of_shards": "30",
    "number_of_replicas": 0,
    "mapping.total_fields.limit": "2000",
    "translog": {
      "sync_interval": "60s",
      "durability": "async"
    }
  },
  "routing": {
    "allocation": {
      "total_shards_per_node": 15
    }
  },
  "mappings": {
    "dynamic": false
  }
}

In elasticsearch.yml:

bootstrap.memory_lock: true

{
  "settings": {
    "index": {
      "refresh_interval": "60s",
      "number_of_replicas": 1,
      "translog": {
        "sync_interval": "60s",
        "durability": "async"
      }
    }
  }
}

PUT http://192.168.130.21:9200/nucbank1/_settings
{
  "index.mapping.total_fields.limit": 2000
}

After the import finishes, call http://192.168.130.21:9200/nucbank1/_refresh
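
The order of operations around an import can be sketched as a list of requests. This is a hedged illustration, not the script actually used: it only builds the requests and does not send them, the function names are made up, and it touches only refresh_interval and number_of_replicas, which can be changed on an existing index. The host and index name are the ones from these notes.

```python
import json

BASE_URL = "http://192.168.130.21:9200"  # host from these notes
INDEX = "nucbank1"                        # index from these notes


def settings_request(refresh_interval: str, replicas: int):
    """Build (method, url, body) for a dynamic index-settings update."""
    body = {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}
    return ("PUT", f"{BASE_URL}/{INDEX}/_settings", json.dumps(body))


def import_workflow():
    """Return the requests to issue around a bulk import, in order."""
    return [
        settings_request("-1", 0),   # before import: no refresh, no replicas
        # ... the _bulk import itself runs here ...
        settings_request("60s", 1),  # after import: restore refresh and replicas
        ("POST", f"{BASE_URL}/{INDEX}/_refresh", ""),  # make documents searchable
    ]


for method, url, body in import_workflow():
    print(method, url, body)
```

Keeping the steps in one list makes it harder to forget the final settings restore and refresh after a long import.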

Use _bulk to import data in batches

Test environment: POST http://192.168.130.19:9200/seqbank/nucleotide/_bulk

Production environment: POST http://192.168.130.21:9200/nucbank/nucleotide/_bulk

Use:

{ "index": {}}
{ "accession": "", "definition":"" }

Do not use:

{"index": {"_index": "nucbank", "_type": "nucleotide", "_id": 1}}
{"doc": {"accession": "", "definition":""}}
{"index": {"_index": "nucbank", "_type": "nucleotide", "_id": 2}}
{"doc": {"accession": "", "definition":""}}

Note: add an extra newline at the end of the JSON file
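
The recommended bulk format can be generated rather than written by hand. A minimal sketch, assuming each document is a plain dict with the accession and definition fields shown above (the function name build_bulk_body is illustrative):

```python
import json


def build_bulk_body(docs):
    """Serialize docs as NDJSON: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # index and type come from the URL
        lines.append(json.dumps(doc))            # document fields at the top level
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


body = build_bulk_body([
    {"accession": "", "definition": ""},
    {"accession": "", "definition": ""},
])
print(body, end="")
```

The resulting string can be sent as the request body to the _bulk URLs listed above; the empty field values are placeholders, as in the notes.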

kibana

192.168.164.19

DEC-10 Lab

Elasticsearch-enhancement