Install SGE On CentOS
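This project describes how to install and use SGE (Sun Grid Engine) on CentOS. A compute cluster built from SGE together with NFS can run jobs in parallel across multiple machines, for example when working with the open-source speech recognition framework Kaldi.
Installation
SGE download URL: https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz
Master Node Installation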
On the command line, change the hostname.
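A minimal sketch of this step, assuming CentOS 7 with systemd (so that `hostnamectl` is available) and the master hostname `qmaster` that appears later in this guide. The helper name `set_node_hostname` is hypothetical, and the command must be run as root.

```shell
# Hypothetical helper: set the static hostname of the current machine.
# Defaults to "qmaster", the master node name used in this guide.
set_node_hostname() {
    hostnamectl set-hostname "${1:-qmaster}"
}

# On the master node (as root):
#   set_node_hostname qmaster
```

The same helper can be reused on the compute nodes with `compute01` or `compute02` as the argument.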
On the command line, edit the hosts file and add entries for the master node and the two compute nodes.
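The entries could look like the sketch below. The hostnames `qmaster`, `compute01` and `compute02` come from this guide; the IP addresses are placeholders and must be replaced with the real addresses of your machines. The `HOSTS_FILE` variable is a hypothetical safety switch: by default the entries go to a local sample file, so nothing on the system changes until you point it at `/etc/hosts`.

```shell
# Append the cluster entries to a hosts file.
# The IP addresses are placeholders (assumption); use your own.
HOSTS_FILE="${HOSTS_FILE:-./hosts.cluster}"

cat >> "$HOSTS_FILE" <<'EOF'
192.168.1.10 qmaster
192.168.1.11 compute01
192.168.1.12 compute02
EOF
```

To apply the entries for real, run the block as root with `HOSTS_FILE=/etc/hosts`. The same three lines are needed on every node so that all machines can resolve each other by name.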
On the command line, create the shared directory.
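A minimal sketch, assuming the shared directory is `/BiO`, the parent of the install directory `/BiO/gridengine` named in the installer notes. That path is an inference rather than something this guide states for this step, and the helper name `make_share_dir` is hypothetical.

```shell
# Hypothetical helper: create the directory that will later be shared over NFS.
# /BiO is an assumed default; pass another path as the first argument if needed.
make_share_dir() {
    mkdir -p "${1:-/BiO}"
}

# As root:
#   make_share_dir /BiO
```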
On the command line, install the EPEL repository.
On the command line, install the dependency libraries.
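The two steps above might be sketched as follows. The package list is an assumption based on what a source build of SGE 8.1.9 commonly needs on CentOS 7, not the list from the original post, so adjust it if the build reports missing headers or tools. The helper name `install_sge_deps` is hypothetical.

```shell
# Hypothetical helper: enable EPEL, then install build and runtime dependencies.
# The package list is an assumption; run as root.
install_sge_deps() {
    yum -y install epel-release &&
    yum -y install gcc make csh hwloc-devel openssl-devel \
        ncurses-devel pam-devel libXmu-devel libdb-devel motif-devel \
        java-1.8.0-openjdk-devel ant ant-junit javacc jemalloc-devel
}

# As root:
#   install_sge_deps
```

EPEL is installed first because some of these packages (for example `jemalloc-devel`) are not in the base CentOS repositories.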
On the command line, add the sgeadmin group and the sgeadmin user.
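A minimal sketch of this step. The numeric IDs are assumptions (any free GID/UID will do), and the helper name `create_sgeadmin` is hypothetical. Because the SGE directory is shared over NFS, it is worth using the same numeric IDs for sgeadmin on every node so that file ownership matches across machines.

```shell
# Hypothetical helper: create the sgeadmin group and user with fixed IDs.
# 1100 is an assumed free GID/UID; run as root.
create_sgeadmin() {
    gid="${1:-1100}"
    uid="${2:-1100}"
    groupadd -g "$gid" sgeadmin &&
    useradd -u "$uid" -g sgeadmin -m -s /bin/bash sgeadmin
}

# As root, with the same IDs on every node:
#   create_sgeadmin 1100 1100
```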
On the command line, edit the sudoers file and add one configuration line.
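The exact line is not preserved in this guide. One plausible choice, shown here purely as an assumption, is to give sgeadmin passwordless sudo. The sketch writes a drop-in file under `/etc/sudoers.d` instead of editing `/etc/sudoers` directly, which is a swapped-in technique rather than the original method; the helper name `write_sgeadmin_sudoers` is hypothetical.

```shell
# Hypothetical helper: write a sudoers drop-in for sgeadmin.
# The rule itself (passwordless sudo) is an assumed policy; run as root.
write_sgeadmin_sudoers() {
    target="${1:-/etc/sudoers.d/sgeadmin}"
    printf '%s\n' 'sgeadmin ALL=(ALL) NOPASSWD: ALL' > "$target" &&
    chmod 0440 "$target"
}

# As root:
#   write_sgeadmin_sudoers
#   visudo -cf /etc/sudoers.d/sgeadmin   # check the syntax
```

If you prefer to edit `/etc/sudoers` itself, use `visudo` so that a syntax error cannot lock you out of sudo.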
On the command line, download and compile SGE. (If you cannot download the SGE package, you can find it in the /softwares directory of this project.)
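A sketch of the download and build, under stated assumptions: the download URL is the one given above and `/BiO/gridengine` is the install directory from the installer notes, while the build commands (`scripts/bootstrap.sh`, `aimk`, `scripts/distinst`) follow the usual source-build procedure for SGE 8.1.9 and are not taken from the original post. The helper name `build_sge` is hypothetical; the sgeadmin user must already exist.

```shell
SGE_URL="https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz"
SGE_ROOT="${SGE_ROOT:-/BiO/gridengine}"

# Hypothetical helper: download, build and stage SGE into $SGE_ROOT.
# Runs in a subshell so the caller's working directory is left unchanged.
build_sge() (
    set -e
    wget "$SGE_URL"
    tar -zxf sge-8.1.9.tar.gz
    cd sge-8.1.9/source
    sh scripts/bootstrap.sh          # prepare the build tree
    ./aimk                           # compile the binaries
    ./aimk -man                      # build the man pages
    mkdir -p "$SGE_ROOT"
    export SGE_ROOT
    echo Y | ./scripts/distinst -local -allall -libs -noexit
    chown -R sgeadmin:sgeadmin "$SGE_ROOT"
)

# As root:
#   build_sge
```

The build can take a while. If `aimk` stops on a missing header, install the corresponding `-devel` package and run it again.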
On the command line, install the SGE qmaster node.
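A minimal sketch, assuming SGE was staged into `/BiO/gridengine` (the install directory named in the installer notes). `install_qmaster` is the interactive installer shipped in the SGE root directory; the wrapper name `install_qmaster_node` is hypothetical.

```shell
# Hypothetical helper: start the interactive qmaster installer from SGE_ROOT.
# Runs in a subshell so the caller's working directory is left unchanged.
install_qmaster_node() (
    cd "${SGE_ROOT:-/BiO/gridengine}" &&
    ./install_qmaster
)

# As root:
#   install_qmaster_node
# Afterwards, load the SGE environment at login (cell name "default"):
#   cp /BiO/gridengine/default/common/settings.sh /etc/profile.d/SGE.sh
```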
During installation you are asked to confirm a number of settings; most of the defaults can be accepted.
Press Enter at the intro screen
Press “y” and then specify sgeadmin as the user id
Leave the install dir as /BiO/gridengine
You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file
Accept the sge_qmaster info
You will now be asked about port configuration for the execution daemon; normally you would again choose the default (2), which uses the /etc/services file
Accept the sge_execd info
Leave the cell name as “default”
Enter an appropriate cluster name when requested
Leave the spool dir as is
Press “n” for no windows hosts!
Press “y” (permissions are set correctly)
Press “y” for all hosts in one domain
If you have Java available on your Qmaster and wish to use SGE Inspect or SDM then enable the JMX MBean server and provide the requested information - probably answer “n” at this point!
Press enter to accept the directory creation notification
Enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)
Press enter to accept the next notice
Enter “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)
Accept the default spool dir or specify a different folder (for example, if you wish to use a shared or local folder outside of SGE_ROOT)
Enter an email address to which problem reports will be sent
Press “n” to refuse to change the parameters you have just configured
Press enter to accept the next notice
Press “y” to install the startup scripts
Press enter twice to confirm the following messages
Press “n” for a file with a list of hosts
Enter the names of the hosts that will be able to administer and submit jobs (press Enter on an empty line to finish adding hosts)
Skip shadow hosts for now (press “n”)
Choose “1” for normal configuration and agree with “y”
Press enter to accept the next message and “n” to refuse to see the previous screen again and then finally enter to exit the installer
On the command line, install NFS and export the master node's directory.
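A sketch of the NFS server setup on CentOS 7. The shared path `/BiO`, the client network `192.168.1.0/24` and the export options are all assumptions; restrict the export to the network your compute nodes actually live on. The helper name `setup_nfs_export` is hypothetical.

```shell
# Hypothetical helper: install the NFS server, export a directory, start services.
# Arguments: shared directory, allowed clients, exports file (all assumed defaults).
setup_nfs_export() {
    share="${1:-/BiO}"
    clients="${2:-192.168.1.0/24}"
    exports_file="${3:-/etc/exports}"
    yum -y install nfs-utils rpcbind &&
    echo "${share} ${clients}(rw,sync,no_root_squash)" >> "$exports_file" &&
    systemctl enable rpcbind nfs-server &&
    systemctl start rpcbind nfs-server &&
    exportfs -ra    # re-read the exports file
}

# As root on the master node:
#   setup_nfs_export /BiO 192.168.1.0/24
```

If a firewall is active on the master node, the NFS service also has to be allowed through it before the compute nodes can mount the export.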
On the command line, run:
The qmaster master node is now installed and configured.
Compute Node Installation (using compute01 as an example)
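On the command line, install the dependency libraries.
On the command line, change the hostname.
On the command line, edit the hosts file and add entries for the master node and the two compute nodes.
On the command line, add the sgeadmin group and the sgeadmin user.
On the command line, install NFS and start the service.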
On the command line, create the shared directory and mount the master node's exported directory on the compute node.
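A sketch of the client side, assuming the master exports `/BiO` and is reachable as `qmaster` (both inferred from elsewhere in this guide). The helper name `mount_master_share` is hypothetical.

```shell
# Hypothetical helper: install the NFS client tools and mount the master's share.
# Arguments: shared directory (assumed /BiO), master hostname (assumed qmaster).
mount_master_share() {
    share="${1:-/BiO}"
    master="${2:-qmaster}"
    yum -y install nfs-utils &&
    mkdir -p "$share" &&
    mount -t nfs "${master}:${share}" "$share"
}

# As root on compute01:
#   mount_master_share /BiO qmaster
```

To make the mount survive a reboot, an `/etc/fstab` entry such as `qmaster:/BiO /BiO nfs defaults 0 0` can be added.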
On the command line, install the compute node.
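A minimal sketch, assuming the shared SGE directory is mounted at `/BiO/gridengine` on the compute node. `install_execd` is the interactive execution-daemon installer shipped in the SGE root directory; the wrapper name `install_exec_node` is hypothetical.

```shell
# Hypothetical helper: start the interactive execd installer from SGE_ROOT.
# Runs in a subshell so the caller's working directory is left unchanged.
install_exec_node() (
    cd "${SGE_ROOT:-/BiO/gridengine}" &&
    ./install_execd
)

# As root on compute01:
#   install_exec_node
```

The installer expects the compute node to be known to the qmaster as an administrative host already. In this guide that happens when the host names are entered during the qmaster installation; otherwise `qconf -ah compute01` on the master adds it.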
Testing
Run qhost on the command line to list the hosts in the cluster. It can be run on qmaster or on either compute node (compute01, compute02).
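You can test SGE by submitting a simple job.
The job prints the Linux kernel version of the machine that executes it. On any compute node (for example compute01), use vi on the command line to create the job script.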
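The script could look like the sketch below, written with a here-document instead of vi so that it can be pasted directly. The file name `run.sh` matches the qsub example later in this guide; the `#$ -S /bin/sh` directive, which asks SGE to run the job with `/bin/sh`, is an addition of this sketch rather than part of the original script.

```shell
# Create the job script in the current directory.
cat > run.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
# Print the kernel version of the machine that runs this job.
uname -r
EOF
chmod +x run.sh
```

Running `./run.sh` locally before submitting is a quick way to confirm that the script itself works.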
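On the command line, submit the job with qsub so that it gets executed.
When the job has finished, its log output (including the error log) appears in the directory from which the job was run.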
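The two steps above can be sketched as follows, assuming the job script is `run.sh` in the current directory. By default SGE names the output files `<job_name>.o<job_id>` and `<job_name>.e<job_id>`, where the job name defaults to the script name. The helper names `submit_job` and `show_job_logs` are hypothetical.

```shell
# Hypothetical helper: submit a script from the current directory.
# -cwd makes the job run in (and write its logs to) the current directory.
submit_job() {
    qsub -cwd "./${1:-run.sh}"
}

# Hypothetical helper: print the stdout and stderr logs of a finished job.
# Arguments: script name, job ID (as reported by qsub or qstat).
show_job_logs() {
    script="$1"
    jobid="$2"
    cat "${script}.o${jobid}" "${script}.e${jobid}"
}

# Example session:
#   submit_job run.sh        # qsub reports the job ID
#   qstat                    # wait until the job has left the queue
#   show_job_logs run.sh 1   # replace 1 with the real job ID
```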
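SGE Operations
qhost: list the hosts in the cluster
qstat: show the status of jobs in the cluster (use watch -n 10 -d qstat to monitor the status continuously)
qdel [jobid]: delete a job by its job ID
qsub: submit a job, for example qsub -cwd ./run.sh. The -cwd option runs the job in the current directory; without it, the job runs in the home directory of the sgeadmin user.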