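1. Background
SequoiaDB is a financial-grade distributed database that offers three storage modes: distributed NewSQL, a distributed file system and object storage, and high-performance NoSQL. They correspond to distributed online transaction processing, unstructured data and content management, and massive-data management with high-performance access, respectively.
Clusters generally use three replicas to keep data safe. If a hardware fault or a similar problem causes a node failure or a cluster anomaly, the database administrator should analyze and diagnose the system methodically so that the cluster keeps working and users are not affected. This article shares some basic SequoiaDB diagnostic methods.
2. Database cluster diagnosis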
# cat /etc/default/sequoiadb
NAME=sdbcm
SDBADMIN_USER=sdbadmin
INSTALL_DIR=/opt/sequoiadb
$ sdblist -l
$ sdblist -t all

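From left to right, the columns are SvcName (the node's service name), Role (catalog node, coordinator node, or data node), PID (process ID), GID, NID, PRY (whether the node is the primary), GroupName (group name), StartTime (start time), and DBPath (the node's database path). This information is very helpful when analyzing and locating problems.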
$ sdbstart -p 11820
11820: 69 bytes out==>
db role: data_test error
Failed resolving arguments(error=-6), exit
<==
Error: Start [/opt/sequoiadb/bin/../conf/local/11820] failed, rc: 127(Invalid Argument)
Total: 1; Succeed: 0; Failed: 1
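The output reports error -6 (invalid argument), complains about the role value (db role: data_test error), and names the configuration under ../conf/local/11820, so the next step is to check the node's configuration. Node configuration files are stored in the conf directory under the installation directory.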
$ vi /opt/sequoiadb/conf/local/11820/sdb.conf
svcname=11820
dbpath=/opt/sequoiadb/database/data/11820
logfilesz=64
weight=10
sortbuf=256
sharingbreak=180000
role=data_test
catalogaddr=sdb1:11803,sdb2:11803,sdb3:11803
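A check like the one done by eye above can be sketched in plain JavaScript (runnable under Node.js). The helper names parseConf and checkRole are hypothetical, and the list of accepted roles is an assumption that should be confirmed against the SequoiaDB documentation for the installed version. The sample text is the configuration shown above.

```javascript
// Parse the key=value lines of an sdb.conf file into a plain object.
function parseConf(text) {
  const conf = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines; treating '#' as a comment marker is an assumption.
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue; // not a key=value line
    conf[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return conf;
}

// Assumed set of node roles; verify against the documentation for your version.
const KNOWN_ROLES = ["data", "coord", "catalog", "standalone"];

// Return null when the role looks valid, otherwise a short description.
function checkRole(conf) {
  return KNOWN_ROLES.includes(conf.role)
    ? null
    : "unexpected role: " + conf.role;
}

// The configuration of node 11820 as shown in this article.
const sampleConf = [
  "svcname=11820",
  "dbpath=/opt/sequoiadb/database/data/11820",
  "logfilesz=64",
  "weight=10",
  "sortbuf=256",
  "sharingbreak=180000",
  "role=data_test",
  "catalogaddr=sdb1:11803,sdb2:11803,sdb3:11803",
].join("\n");

console.log(checkRole(parseConf(sampleConf))); // unexpected role: data_test
```

With the sample above, the check flags the same role value that the sdbstart output complains about.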
$ cd /opt/sequoiadb/bin/
$ sdb 'db = new Sdb("localhost",11810,"username","password")'
$ sdb 'db.snapshot(SDB_SNAP_DATABASE)'
{
"TotalNumConnects": 0,
"TotalDataRead": 787373,
"TotalIndexRead": 0,
……
"ErrNodes": [
{
"NodeName": "sdb1:11820",
"Flag": -129
},
{
"NodeName": "sdb2:11820",
"Flag": -129
}
]
}
Return 1 row(s).
Takes 0.27826s.
$ sdb 'db.snapshot(SDB_SNAP_DATABASE)'
{
"TotalNumConnects": 1,
……
"ErrNodes": []
}
$ sdb 'data = new Sdb("sdb2",11820)'
$ sdb 'data.snapshot(SDB_SNAP_DATABASE)'
{
"NodeName": "sdb2:11820",
"HostName": "sdb2",
"ServiceName": "11820",
"GroupName": "dg1",
"IsPrimary": false,
"ServiceStatus": false,
"Status": "FullSync",
......
2019-11-08-21.38.26.332510 Level:EVENT
PID:3151 TID:3208
Function:_onAttach Line:217
File:SequoiaDB/engine/cls/clsReplSession.cpp
Message:
Session[Type:Sync-Dest,NodeID:1008,TID:1]: The db data is abnormal, need to synchronize full data
2019-11-08-21.38.26.333890 Level:EVENT
PID:3151 TID:3208
Function:_fullSync Line:722
File:SequoiaDB/engine/cls/clsReplSession.cpp
Message:
Session[Type:Sync-Dest,NodeID:1008,TID:1]: Start the synchronization of full
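The snapshot checks shown above can be sketched as two small helpers in plain JavaScript (runnable under Node.js). The function names listErrNodes and nodeSummary are hypothetical, and the sample objects contain only the fields that appear in the snapshot output in this article. This is an illustration of how to read the records, not a SequoiaDB API.

```javascript
// List the nodes that a coordinator snapshot reports under "ErrNodes".
function listErrNodes(snapshot) {
  const errNodes = Array.isArray(snapshot.ErrNodes) ? snapshot.ErrNodes : [];
  return errNodes.map((n) => n.NodeName + " (flag " + n.Flag + ")");
}

// Summarize the state fields of a snapshot taken directly on a data node.
function nodeSummary(record) {
  return record.NodeName +
    ": primary=" + record.IsPrimary +
    ", serviceStatus=" + record.ServiceStatus +
    ", status=" + record.Status;
}

// Coordinator snapshot while two nodes are in error (fields from the article).
const faulty = {
  TotalNumConnects: 0,
  TotalDataRead: 787373,
  TotalIndexRead: 0,
  ErrNodes: [
    { NodeName: "sdb1:11820", Flag: -129 },
    { NodeName: "sdb2:11820", Flag: -129 },
  ],
};

// Coordinator snapshot with an empty ErrNodes list.
const healthy = { TotalNumConnects: 1, ErrNodes: [] };

// Direct snapshot of data node sdb2:11820 during full synchronization.
const dataNode = {
  NodeName: "sdb2:11820",
  IsPrimary: false,
  ServiceStatus: false,
  Status: "FullSync",
};

console.log(listErrNodes(faulty));
console.log(listErrNodes(healthy));
console.log(nodeSummary(dataNode));
```

An empty list from listErrNodes corresponds to the second snapshot above, where "ErrNodes" is []. The node summary makes it easy to spot a replica that is still resynchronizing.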
$ cd /opt/sequoiadb/bin/
$ sdb 'db = new Sdb("localhost",11810)'
$ sdb 'db.sample.employee.insert({"code":1,"name":"test1"})'
$ sdb 'db.sample.employee.find()'
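$ sdb 'db.sample.employee.count()'
$ sdb 'db.sample.employee.find()'
sdb.js:505 uncaught exception: -5
File Exist
Error -5 means that the file already exists. Log in to the server that hosts the coordinator node, open the coordinator's log file, and locate where the -5 error occurred. It shows the following:
$ vi /opt/sequoiadb/database/coord/11810/diaglog/sdbdiag.log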
2019-11-08-21.38.26.971524 Level:ERROR
PID:89651 TID:90037
Function:_queryOrDoOnCL Line:1076
File:SequoiaDB/engine/coord/coordQueryOperator.cpp
Message:
Query failed on node[{ GroupID:1000, NodeID:1002, ServiceID:2(SHARD) }], rc: -5
2019-11-08-21.38.26.971661 Level:ERROR
PID:89651 TID:90037
Function:execute Line:491
File:SequoiaDB/engine/coord/coordQueryOperator.cpp
Message:
Query failed, rc: -5
2019-11-08-21.38.26.971679 Level:ERROR
PID:89651 TID:90037
Function:_onQueryReqMsg Line:1850
File:SequoiaDB/engine/pmd/pmdProcessor.cpp
Message:
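Execute operator[Query] failed, rc: -5
The log entry "Query failed on node[{ GroupID:1000, NodeID:1002, ServiceID:2(SHARD) }], rc: -5" shows that the real error comes from a data node: replica group 1000, node ID 1002, ServiceID 2, error code -5. Next, run db.listReplicaGroups() on the command line to get the replica group information.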
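The node reference in that log message can also be pulled apart mechanically. Below is a minimal sketch in plain JavaScript (runnable under Node.js). The function name parseFailedNode is hypothetical, and the regular expression is written only against the message format shown in this article, so other SequoiaDB versions may need adjustments.

```javascript
// Extract the group ID, node ID, service and return code from a
// "Query failed on node[{ ... }], rc: N" diagnostic message.
function parseFailedNode(message) {
  const pattern =
    /node\[\{\s*GroupID:(\d+),\s*NodeID:(\d+),\s*ServiceID:(\d+)\((\w+)\)\s*\}\],\s*rc:\s*(-?\d+)/;
  const m = pattern.exec(message);
  if (m === null) return null; // message does not reference a node
  return {
    groupId: Number(m[1]),
    nodeId: Number(m[2]),
    serviceId: Number(m[3]),
    serviceType: m[4],
    rc: Number(m[5]),
  };
}

// The message taken from the coordinator log above.
const logMessage =
  "Query failed on node[{ GroupID:1000, NodeID:1002, ServiceID:2(SHARD) }], rc: -5";

console.log(parseFailedNode(logMessage));
```

The group ID and node ID returned here are the two values needed to find the failing node in the replica group listing.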
$ sdb 'db.listReplicaGroups()'
{
……
{
"HostName": "sdb3",
"Status": 1,
"dbpath": "/opt/sequoiadb/database/data/11820/",
"Service": [
{
"Type": 0,
"Name": "11820"
},
{
"Type": 1,
"Name": "11821"
},
{
"Type": 2,
"Name": "11822"
}
],
"NodeID": 1002
},
],
"GroupID": 1000,
"GroupName": "dg1",
"PrimaryNode": 1002,
"Role": 0,
"SecretID": 1969965962,
"Status": 1,
"Version": 7,
"_id": {
"$oid": "5d843fd23e28e361958a76bc"
}
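}
Iterating over the replica group information shows that group ID 1000, node ID 1002 corresponds to node 11820 on host sdb3, whose database path is /opt/sequoiadb/database/data/11820. Check that node's log.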
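The lookup described above can be sketched in plain JavaScript (runnable under Node.js). The function name findNodeAddress is hypothetical. Two points are assumptions: the key of the node array is elided in the output above and is taken to be "Group" here, and the service entry with Type 0 is treated as the node's main service because that matches the 11820 node in this example.

```javascript
// Find the host, service and data path for a given group ID and node ID
// in an array of replica group records.
function findNodeAddress(groups, groupId, nodeId) {
  for (const group of groups) {
    if (group.GroupID !== groupId) continue;
    // "Group" as the node array key is an assumption (elided in the article).
    for (const node of group.Group || []) {
      if (node.NodeID !== nodeId) continue;
      // Assumption: Type 0 is the node's main service, as in the example.
      const svc = (node.Service || []).find((s) => s.Type === 0);
      return {
        groupName: group.GroupName,
        host: node.HostName,
        service: svc ? svc.Name : null,
        dbpath: node.dbpath,
      };
    }
  }
  return null; // no such group/node pair
}

// Replica group record reduced to the fields shown in this article.
const groups = [
  {
    Group: [
      {
        HostName: "sdb3",
        Status: 1,
        dbpath: "/opt/sequoiadb/database/data/11820/",
        Service: [
          { Type: 0, Name: "11820" },
          { Type: 1, Name: "11821" },
          { Type: 2, Name: "11822" },
        ],
        NodeID: 1002,
      },
    ],
    GroupID: 1000,
    GroupName: "dg1",
    PrimaryNode: 1002,
  },
];

console.log(findNodeAddress(groups, 1000, 1002));
```

For group 1000 and node 1002 this yields host sdb3, service 11820 and the data path used in the next step.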
$ vi /opt/sequoiadb/database/data/11820/diaglog/sdbdiag.log
2019-11-08-21.38.26.584673 Level:ERROR
PID:4347 TID:4370
Function:open Line:66
File:SequoiaDB/engine/oss/ossMmap.cpp
Message:
Failed to open file, rc: -5
2019-11-08-21.38.26.584698 Level:ERROR
PID:4347 TID:4370
Function:openStorage Line:700
File:SequoiaDB/engine/dms/dmsStorageBase.cpp
Message:
Failed to open /opt/sequoiadb/database/data/11820/sample.1.data, rc=-5
2019-11-08-21.38.26.584721 Level:ERROR
PID:4347 TID:4370
Function:open Line:1172
File:SequoiaDB/engine/dms/dmsStorageUnit.cpp
Message:
Open storage data su failed, rc: -5
2019-11-08-21.38.26.584756 Level:ERROR
PID:4347 TID:4370
Function:rtnCreateCollectionSpaceCommand Line:1160
File:SequoiaDB/engine/rtn/rtnCommandImpl.cpp
Message:
Failed to create collection space sample at /opt/sequoiadb/database/data/11820/, rc: -5
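The log shows that the -5 error is raised because the file sample.1.data under node 11820 on sdb3 is abnormal. Going to the data node's path and checking the collection space files shows that the files of the sample collection space are damaged; both are 0 bytes in size: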
[sdbadmin@sdb3 11820]$ ll
total 1233564
drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 archivelog
drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 bakfile
drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Nov 8 06:11 diaglog
-rw-r-----. 1 sdbadmin sdbadmin_group 0 Nov 8 05:27 sample.1.data
-rw-r-----. 1 sdbadmin sdbadmin_group 0 Nov 8 05:27 sample.1.idx
……
drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 tmp
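Spotting zero-length files in a long listing is easy to automate. The following is a minimal sketch in plain JavaScript (runnable under Node.js). The function name findEmptyFiles is hypothetical, and it assumes the common `ls -l` column layout seen above (permissions, links, owner, group, size, month, day, time, name). The sample lines are the listing from this article.

```javascript
// Return the names of regular files whose size column is 0
// in a block of `ls -l` style output.
function findEmptyFiles(listing) {
  const empty = [];
  for (const line of listing.split("\n")) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 9) continue; // "total ..." or elided lines
    const perms = fields[0];
    const size = Number(fields[4]);
    if (!perms.startsWith("-")) continue; // skip directories and other types
    if (size === 0) empty.push(fields.slice(8).join(" "));
  }
  return empty;
}

// Directory listing of node 11820 on sdb3 as shown in the article.
const listing = [
  "total 1233564",
  "drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 archivelog",
  "drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 bakfile",
  "drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Nov 8 06:11 diaglog",
  "-rw-r-----. 1 sdbadmin sdbadmin_group 0 Nov 8 05:27 sample.1.data",
  "-rw-r-----. 1 sdbadmin sdbadmin_group 0 Nov 8 05:27 sample.1.idx",
  "……",
  "drwxrwxrwx. 2 sdbadmin sdbadmin_group 4096 Sep 19 19:56 tmp",
].join("\n");

console.log(findEmptyFiles(listing)); // [ 'sample.1.data', 'sample.1.idx' ]
```

A zero size is only a hint that a file is damaged; the node's diagnostic log remains the authoritative source for the actual failure.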
Copy the two files sample.1.data and sample.1.idx from another machine into the 11820 directory on sdb3:
$ scp -r sdbadmin@sdb2:/opt/sequoiadb/database/data/11820/sample.1.* .
$ sdb 'var dg = db.getRG("dg1")'
$ sdb 'dg.stop()'
$ sdb 'dg.start()'
Querying the collection again now works:
$ sdb 'db.sample.employee.find()'
{
"_id": {
"$oid": "5dc5755ec73f4486ee4efe40"
},
"a": 1
}
Return 1 row(s)
3. Summary