Background
To keep processes from abusing resources, Linux uses ulimit to cap what a process may consume (file descriptors, thread count, memory size, and so on). Containers need the same kind of cap on their use of system resources.
Limiting methods
ulimit: docker supports ulimit settings out of the box. Setting default-ulimits in the dockerd configuration gives every container on the host a default ulimit, and passing --ulimit at container start sets a per-container ulimit that overrides the default. k8s does not support ulimit at present.
cgroup: docker supports cgroup limits on memory, cpu, pid and so on out of the box. For threads, --pids-limit caps the total number of pids in a container; dockerd has no default pid limit setting. In k8s, the thread count can be limited by enabling the SupportPodPidsLimit feature in kubelet, which sets a pod-level pid limit.
/etc/security/limits.conf, sysctl.conf: a limit set with the ulimit command only holds for the user's current login session. Permanent settings go in limits.conf, and system-level limits go in sysctl.conf.
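A minimal sketch of the dockerd default described in the ulimit item above, assuming the stock configuration path /etc/docker/daemon.json. The helper name, the scratch file name and the 65536 values are made up for illustration, and the daemon most likely needs a restart before the default applies to new containers.

```shell
#!/bin/sh
# write_default_ulimits FILE SOFT HARD
# Hypothetical helper: writes a daemon.json fragment that gives every
# container started by this dockerd a default nofile ulimit.
write_default_ulimits() {
    file=$1 soft=$2 hard=$3
    cat > "$file" <<EOF
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": $soft,
      "Hard": $hard
    }
  }
}
EOF
}

# Write to a scratch path first; merge into /etc/docker/daemon.json after review.
write_default_ulimits "${TMPDIR:-/tmp}/daemon.json.example" 65536 65536
```

A container started with an explicit --ulimit nofile=... still overrides this default, as noted above.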
Experimental comparison
Environment
Local environment: os: Ubuntu 16.04.6 LTS 4.4.0-154-generic docker: 18.09.7 base-image: alpine:v3.9
k8s environment: kubelet: v1.10.11.1 docker: 18.09.6
ulimit
ulimit is a user-level resource limit, split into a soft limit and a hard limit.
soft: the user can change it, but not above the hard limit
hard: only root can raise it (an unprivileged user can only lower it)
How to change it: the ulimit command for a temporary change; /etc/security/limits.conf for a permanent one
How it works: under PAM (Pluggable Authentication Modules), an application loads the pam_xxxx.so modules configured under /etc/pam.d when it starts. /etc/pam.d holds the PAM configuration for programs such as login, sshd, su and sudo, so when a user logs in again, pam_limits.so is invoked and loads limits.conf.
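The soft/hard rules are easy to check in a throwaway subshell. The sketch below assumes the shell starts with a hard nofile limit of at least 128 (the usual defaults are far higher); the function name and the values 64, 128 and 256 are arbitrary choices for this example.

```shell
#!/bin/sh
# soft_hard_demo
# Runs in a subshell so the caller's own limits stay untouched.
soft_hard_demo() {
    (
        ulimit -Sn 64      # lower the soft nofile limit
        ulimit -Hn 128     # lower the hard limit; only a privileged user can raise it again
        echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
        ulimit -Sn 128     # the soft limit may be raised up to the hard limit
        echo "soft=$(ulimit -Sn)"
        # A soft limit above the hard limit is rejected for every user.
        if ulimit -Sn 256 2>/dev/null; then
            echo "unexpected: soft exceeds hard"
        else
            echo "soft above hard rejected"
        fi
    )
}

soft_hard_demo
```

Because the changes happen inside the parentheses, the calling shell keeps its original limits, which mirrors the "temporary change" behaviour of the ulimit command.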
File descriptor limit

RLIMIT_NOFILE This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.) Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have "in flight" to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).

By definition, nofile limits the maximum number of files a process can open, and its scope is the process.
Set the ulimit nofile limit to soft 100/hard 200; the container starts as root by default.

$ docker run -d --ulimit nofile=100:200 cr.d.xiaomi.net/containercloud/alpine:webtool top

Inside the container, the fd soft limit is 100:

/ # ulimit -a
-f: file size (blocks)             unlimited
-t: cpu time (seconds)             unlimited
-d: data seg size (kb)             unlimited
-s: stack size (kb)                8192
-c: core file size (blocks)        unlimited
-m: resident set size (kb)         unlimited
-l: locked memory (kb)             64
-p: processes                      unlimited
-n: file descriptors               100
-v: address space (kb)             unlimited
-w: locks                          unlimited
-e: scheduling priority            0
-r: real-time priority             0

Testing with ab at a concurrency of 90 HTTP requests creates 90 sockets and runs normally:

/ # ab -n 1000000 -c 90 http://61.135.169.125:80/ &
/ # lsof | wc -l
108
/ # lsof | grep -c ab
94

At a concurrency of 100 HTTP requests, the ulimit takes effect:

/ # ab -n 1000000 -c 100 http://61.135.169.125:80/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 61.135.169.125 (be patient)
socket: No file descriptors available (24)

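The descriptor count from the ab experiment can also be read straight from /proc instead of lsof. The helper below is only a sketch: the function name and the optional second argument (an alternative proc root, handy for trying it against a copied tree) are made up here, and it assumes a Linux /proc plus permission to read the target process's fd directory.

```shell
#!/bin/sh
# fd_usage PID [PROC_ROOT]
# Prints how many descriptors PID holds next to its soft nofile limit.
fd_usage() {
    pid=$1
    proc=${2:-/proc}
    # The fd directory has one entry per open descriptor.
    used=$(ls "$proc/$pid/fd" | wc -l)
    # Line format: "Max open files  <soft>  <hard>  files"
    soft=$(awk '/^Max open files/ { print $4 }' "$proc/$pid/limits")
    echo "pid=$pid used=$((used)) soft_limit=$soft"
}

# Inspect the current shell.
fd_usage $$
```

Pointing it at the pid of ab while the benchmark runs shows the used count approaching the soft limit of 100 shortly before socket() starts failing with error 24.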
Thread limit

RLIMIT_NPROC This is a limit on the number of extant process (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process's real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN. The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.

By definition, the nproc limit is scoped per uid, and it does not apply to root.
Container uid
All containers running on one host share a single kernel (the host's). docker isolates pid/uts/network and so on through namespaces. docker does implement the user namespace, but for various reasons it is not enabled by default; see docker user namespace.

$ docker run -d cr.d.xiaomi.net/containercloud/alpine:webtool top

On the host, the top process shows up as owned by root:

$ ps -ef |grep top
root      4096  4080  0 15:01 ?        00:00:01 top

Inside the container, id shows uid 0, which corresponds to root on the host. Both are root, but their Linux capabilities differ, so root in the container has far fewer privileges than root on the host.
In the container, switch to user operator (uid 11) and run sleep. On the host, the matching process is owned by app, whose uid is also 11.

/ # id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
/ # su operator
/ $ id
uid=11(operator) gid=0(root) groups=0(root)
/ $ sleep 100

$ ps -ef |grep 'sleep 100'
app      19302 19297  0 16:39 pts/0    00:00:00 sleep 100
$ cat /etc/passwd | grep app
app11:/home/app:

Verifying the ulimit for different users
Set the ulimit nproc limit to soft 10/hard 20; the container starts as root by default.

$ docker run -d --ulimit nproc=10:20 cr.d.xiaomi.net/containercloud/alpine:webtool top

Inside the container, the nproc soft limit is 10:

/ # ulimit -a
-f: file size (blocks)             unlimited
-t: cpu time (seconds)             unlimited
-d: data seg size (kb)             unlimited
-s: stack size (kb)                8192
-c: core file size (blocks)        unlimited
-m: resident set size (kb)         unlimited
-l: locked memory (kb)             64
-p: processes                      10
-n: file descriptors               1048576
-v: address space (kb)             unlimited
-w: locks                          unlimited
-e: scheduling priority            0
-r: real-time priority             0

Start 30 processes as root; all of them start, so the limit does not bind root:

/ # for i in `seq 30`; do sleep 100 & done
/ # ps | wc -l
36

Switch to user operator:

/ # su operator
# start several processes; the fork for the 11th process fails
/ $ for i in `seq 8`; do
> sleep 100 &
> done
/ $ sleep 100 &
/ $ sleep 100 &
sh: can't fork: Resource temporarily unavailable

Check as root:

/ # ps -ef | grep operator
   79 operator  0:00 sh
   99 operator  0:00 sleep 100
  100 operator  0:00 sleep 100
  101 operator  0:00 sleep 100
  102 operator  0:00 sleep 100
  103 operator  0:00 sleep 100
  104 operator  0:00 sleep 100
  105 operator  0:00 sleep 100
  106 operator  0:00 sleep 100
  107 operator  0:00 sleep 100
  109 root      0:00 grep operator
/ # ps -ef | grep operator| wc -l
10

Verifying the ulimit across different containers that share a uid
Set the ulimit nproc limit to soft 3/hard 3, start as user operator, and launch 4 containers. The fourth one fails to start.

$ docker run -d --ulimit nproc=3:3 --name nproc1 -u operator cr.d.xiaomi.net/containercloud/alpine:webtool top
eeb1551bf757ad4f112c61cc48d7cbe959185f65109e4b44f28085f246043e65
$ docker run -d --ulimit nproc=3:3 --name nproc2 -u operator cr.d.xiaomi.net/containercloud/alpine:webtool top
42ff29844565a9cb3af2c8dd560308b1f31306041d3dbd929011d65f1848a262
$ docker run -d --ulimit nproc=3:3 --name nproc3 -u operator cr.d.xiaomi.net/containercloud/alpine:webtool top
b7c9b469e73f969d922841dd77265467959eda28ed06301af8bf83bcf18e8c23
$ docker run -d --ulimit nproc=3:3 --name nproc4 -u operator cr.d.xiaomi.net/containercloud/alpine:webtool top
b49d8bb58757c88f69903059af2ee7e2a6cc2fa5774bc531941194c52edfd763
$ docker ps -a |grep nproc
b49d8bb58757   cr.d.xiaomi.net/containercloud/alpine:webtool   "top"   16 seconds ago   Exited (1) 15 seconds ago   nproc4
b7c9b469e73f   cr.d.xiaomi.net/containercloud/alpine:webtool   "top"   23 seconds ago   Up 22 seconds               nproc3
42ff29844565   cr.d.xiaomi.net/containercloud/alpine:webtool   "top"   31 seconds ago   Up 29 seconds               nproc2
eeb1551bf757   cr.d.xiaomi.net/containercloud/alpine:webtool   "top"   38 seconds ago   Up 36 seconds               nproc1

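Summary
The ulimit on the fd count is scoped per process and applies to every user.
The ulimit on the thread count is scoped per user (uid): it caps the total number of threads/processes under one uid, and it does not apply to the root account.
In the current production setup there is a small chance that the ulimit makes fork fail. For example, when several work containers on one host share a base image (and therefore a uid), a thread leak in one container can, through the ulimit, disrupt the others.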
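To judge how close a uid is to its nproc limit, the threads owned by that uid can be summed from /proc. The helper below is a sketch under assumptions: the function name and the optional proc-root argument are invented for this example, and it reads the real uid and the Threads field from each status file.

```shell
#!/bin/sh
# threads_for_uid UID [PROC_ROOT]
# Sums the Threads field of every process whose real uid is UID.
threads_for_uid() {
    uid=$1
    root=${2:-/proc}
    total=0
    for status in "$root"/[0-9]*/status; do
        # A process may exit between the glob and the read; count it as 0.
        n=$(awk -v uid="$uid" '
            /^Uid:/     { hit = ($2 == uid) }   # $2 is the real uid
            /^Threads:/ { threads = $2 }
            END         { print (hit ? threads + 0 : 0) }
        ' "$status" 2>/dev/null)
        total=$((total + ${n:-0}))
    done
    echo "$total"
}

# Threads currently owned by the real uid of this shell.
threads_for_uid "$(id -ru)"
```

Run it on the host rather than inside a container: a container's /proc only lists its own pid namespace, while the nproc accounting shown above spans every container that shares the uid.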
cgroup
cgroup tracks pids per group. By changing the docker/kubelet configuration, the total number of pids can be capped, which in turn caps the thread count. The thread count depends on several system settings, and the smallest of them wins; see the Stack Overflow discussion on thread count settings.
docker: set the --pids-limit parameter at container start to cap the pid total per container
kubelet: enable the SupportPodPidsLimit feature and set the --pod-max-pids parameter to cap the pid total of each pod on the node
Taking kubelet as the example, enable SupportPodPidsLimit with --feature-gates=SupportPodPidsLimit=true.
Configure kubelet so that each pod may hold at most 150 pids:

[root@node01 ~]# ps -ef |grep kubelet
root     18735     1 14 11:19 ?        00:00:28 ./kubelet --v=1 --address=0.0.0.0 --feature-gates=SupportPodPidsLimit=true --pod-max-pids=150 --allow-privileged=true --pod-infra-container-image=cr.d.xiaomi.net/kubernetes/pause-amd64:3.1 --root-dir=/home/kubelet --node-status-update-frequency=5s --kubeconfig=/home/xbox/kubelet/conf/kubelet-kubeconfig --fail-swap-on=false --max-pods=254 --runtime-cgroups=/systemd/system.slice/frigga.service --kubelet-cgroups=/systemd/system.slice/frigga.service --make-iptables-util-chains=false

Start test processes in the pod. As root, start 100 of them:

/ # for i in `seq 100`; do
> sleep 1000 &
> done
/ # ps | wc -l
106

As operator, creation is capped; the pod as a whole can hold at most 150:

/ # su operator
/ $ for i in `seq 100`; do
> sleep 1000 &
> done
sh: can't fork: Resource temporarily unavailable
/ $ ps | wc -l
150

In the cgroup, pids has reached its maximum:

[root@node01 ~]# cat /sys/fs/cgroup/pids/kubepods/besteffort/pod8b61d4de-a7ad-11e9-b5b9-246e96ad0900/pids.current
150
[root@node01 ~]# cat /sys/fs/cgroup/pids/kubepods/besteffort/pod8b61d4de-a7ad-11e9-b5b9-246e96ad0900/pids.max
150

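The two cat commands can be folded into one small helper that also reports the remaining headroom. This is only a sketch: the function name is made up, and it assumes the cgroup directory exposes pids.current and pids.max as in the experiment, with pids.max holding the literal word max when no limit is set.

```shell
#!/bin/sh
# pids_headroom CGROUP_DIR
# Prints current and maximum pids of a cgroup, plus the free slots.
pids_headroom() {
    dir=$1
    cur=$(cat "$dir/pids.current") || return 1
    max=$(cat "$dir/pids.max") || return 1
    if [ "$max" = "max" ]; then
        # "max" means the controller imposes no limit on this group.
        echo "current=$cur max=unlimited"
    else
        echo "current=$cur max=$max free=$((max - cur))"
    fi
}

# Example call; the pod directory name is host-specific:
# pids_headroom /sys/fs/cgroup/pids/kubepods/besteffort/pod<pod-uid>
```

For the pod in the experiment this would report free=0, which matches the point where fork starts failing.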
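Summary: the cgroup pid limit does achieve a cap on the thread count. At present docker only supports a per-container limit, not a global setting; kubelet only supports one global setting for all pods on the node, not a setting for an individual pod.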
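The experiment above exercises kubelet only. For the docker side, a per-container limit could be sketched as follows. The wrapper name, the DOCKER override and the alpine:3.9 image are assumptions made for this example; by default the wrapper just prints the command so it can be reviewed first.

```shell
#!/bin/sh
# run_with_pids_limit LIMIT IMAGE [CMD...]
# DOCKER defaults to "echo docker", so the command is only printed;
# export DOCKER=docker to actually start the container.
run_with_pids_limit() {
    limit=$1 image=$2
    shift 2
    ${DOCKER:-echo docker} run -d --pids-limit "$limit" "$image" "$@"
}

# alpine:3.9 is a placeholder image for this sketch.
run_with_pids_limit 150 alpine:3.9 top
```

Since dockerd has no default for this value, every container that needs the cap has to be started with the flag.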
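limits.conf/sysctl.conf
limits.conf is the concrete configuration behind ulimit; settings in the directory /etc/security/limits.d/ override limits.conf.
sysctl.conf holds machine-level resource limits that root can change. Files in the directory /etc/sysctl.d/ are applied as well, and for a key set in more than one place the file loaded last wins. Add the relevant entries to /etc/sysctl.conf (fd: fs.file-max = {}; pid: kernel.pid_max = {}).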
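Before editing sysctl.conf it helps to look at the values currently in force. The sketch below reads them from /proc/sys; the function name and the optional root argument are invented for this example, and kernel.threads-max is an extra knob not named in the text above, included because it also bounds the thread count.

```shell
#!/bin/sh
# show_kernel_ceilings [SYSCTL_ROOT]
# Prints the machine-level limits for fds, pids and threads.
show_kernel_ceilings() {
    root=${1:-/proc/sys}
    for key in fs/file-max kernel/pid_max kernel/threads-max; do
        # sysctl names use dots where /proc/sys uses slashes.
        name=$(echo "$key" | tr / .)
        echo "$name = $(cat "$root/$key")"
    done
}

show_kernel_ceilings
```

The output uses the same key = value form as sysctl.conf, so a line can be copied over and adjusted.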
Test changing sysctl.conf inside a container:

$ docker run -d --ulimit nofile=100:200 cr.d.xiaomi.net/containercloud/alpine:webtool top
cb1250c8fd217258da51c6818fa2ce2e2f6e35bf1d52648f1f432e6ce579cf0d
$ docker exec -it cb1250c sh
/ # ulimit -a
-f: file size (blocks)             unlimited
-t: cpu time (seconds)             unlimited
-d: data seg size (kb)             unlimited
-s: stack size (kb)                8192
-c: core file size (blocks)        unlimited
-m: resident set size (kb)         unlimited
-l: locked memory (kb)             64
-p: processes                      unlimited
-n: file descriptors               100
-v: address space (kb)             unlimited
-w: locks                          unlimited
-e: scheduling priority            0
-r: real-time priority             0
/ # echo 10 > /proc/sys/kernel/pid_max
sh: can't create /proc/sys/kernel/pid_max: Read-only file system
/ # echo "fs.file-max=5" >> /etc/sysctl.conf
/ # sysctl -p
sysctl: error setting key 'fs.file-max': Read-only file system

Test in privileged mode (test with care):

$ cat /proc/sys/kernel/pid_max
32768
$ docker run -d --privileged --ulimit nofile=100:200 cr.d.xiaomi.net/containercloud/alpine:webtool top
$ docker exec -it pedantic_vaughan sh
/ # cat /proc/sys/kernel/pid_max
32768
/ # echo 50000 > /proc/sys/kernel/pid_max
/ # cat /proc/sys/kernel/pid_max
50000
/ # exit
$ cat /proc/sys/kernel/pid_max
50000 # the value on the host has also become 50000

Summary: because docker's isolation is incomplete, changing sysctl inside a container overwrites the host's setting, so it cannot be used for container-level resource limits. limits.conf can be set inside a container, with the same effect as ulimit.
Conclusion
The recommended approach:
fd limit: set default-ulimits in the dockerd configuration to limit fds at the process level
thread limit: set --feature-gates=SupportPodPidsLimit=true --pod-max-pids={} in the kubelet configuration to limit pids at the cgroup level, and thereby the thread count
Other notes: tune the node's kernel.pid_max parameter; remove or raise the nproc ulimit that the image sets for non-root accounts
Reviewed and edited by: 湯梓紅