How sched_setaffinity works inside of Linux Kernel

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


YSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, unsigned long __user *, user_mask_ptr)
-- sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
--- __set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask, bool check)
---- stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
----- migration_cpu_stop(void *data)
------ __migrate_task(struct rq *rq, struct task_struct *p, int dest_cpu)
------- move_queued_task(struct rq *rq, struct task_struct *p, int new_cpu)
-------- enqueue_task(struct rq *rq, struct task_struct *p, int flags)
--------- returns the new run queue of destination CPU

[root@h12-storage03 ~]# dmesg|grep -i numa [ 0.000000] NUMA: Initialized distance table, cnt=2 [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x807fffffff] -> [mem 0x00000000-0x807fffffff] [ 0.001000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl [ 1.151966] pci_bus 0000:00: on NUMA node 0 [ 1.154241] pci_bus 0000:16: on NUMA node 0 [ 1.155523] pci_bus 0000:30: on NUMA node 0 [ 1.156810] pci_bus 0000:4a: on NUMA node 0 [ 1.160095] pci_bus 0000:64: on NUMA node 0 [ 1.164343] pci_bus 0000:7e: on NUMA node 0 [ 1.176501] pci_bus 0000:7f: on NUMA node 0 [ 1.178741] pci_bus 0000:80: on NUMA node 1 [ 1.181500] pci_bus 0000:97: on NUMA node 1 [ 1.182764] pci_bus 0000:b0: on NUMA node 1 [ 1.183768] pci_bus 0000:c9: on NUMA node 1 [ 1.184760] pci_bus 0000:e2: on NUMA node 1 [ 1.188830] pci_bus 0000:fe: on NUMA node 1 [ 1.200504] pci_bus 0000:ff: on NUMA node 1 [root@h12-storage03 ~]#

cpu

疑问1 What is NUMA?

dmesg|grep -i numa

[ 0.000000] No NUMA configuration found

疑问2:网卡软中断都集中在了一个cpu上，导致cpu负载升高，多余网络包被丢弃？

X86系统采用中断机制协同处理CPU与其他设备工作。长久以来网卡的中断默认由cpu0处理，在大量小包的网络环境下可能出现cpu0负载高，而其他cpu空闲。后来出现网卡多队列技术解决这个问题。

ls /sys/class/net/eth0/queues/

多队列网卡的好处是可以将每个队列产生的中断分布到cpu的多个核，实现负载均衡，避免了单个核被占用到100%而其他核还处于空闲的情况
ethtool -l eth0 eth0网卡是否支持多队列，最多支持多少、当前开启多少
1

service irqbalance stop

net

Google 对此做的统计显示，三次握手消耗的时间，在 HTTP 请求完成的时间占比在 10% 到 30% 之间

疑问1.

net.ipv4.tcp_syncookies = 1

Linux 下怎样开启 syncookies 功能呢？修改 tcp_syncookies 参数即可，其中值为 0 时表示关闭该功能，2 表示无条件开启功能，而 1 则表示仅当 SYN 半连接队列放不下时，再启用它。由于 syncookie 仅用于应对 SYN 泛洪攻击（攻击者恶意构造大量的 SYN 报文发送给服务器，造成 SYN 半连接队列溢出，导致正常客户端的连接无法建立），这种方式建立的连接，许多 TCP 特性都无法使用。所以，应当把 tcp_syncookies 设置为 1，仅在队列满时再启用。

tcp_max_tw_buckets 控制并发的 TIME_WAIT 的数量，默认值是 180000

file

疑问1: aio 我业务代码没写，需要内核调整吗？

/etc/sysctl.conf

## 修改内核异步 I/O 限制

fs.aio-max-nr=1048576

一手资料

Linux 性能优化实战

下半部用来延迟处理上半部未完成的工作，通常以内核线程的方式运行。

上半部直接处理硬件请求，也就是我们常说的硬中断，特点是快速执行；（dma）

而下半部则是由内核触发，也就是我们常说的软中断，特点是延迟执行。

cat /proc/softirqs

NET_RX 表示网络接收中断，而 NET_TX 表示网络发送中断。

ps aux | grep softirq - —48内核。

[ksoftirqd/47

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63


[root@spa-65-177-112 gadmin]# ps aux | grep softirq
root         6  0.0  0.0      0     0 ?        S     2021   0:20 [ksoftirqd/0]
root        14  0.0  0.0      0     0 ?        S     2021   0:21 [ksoftirqd/1]
root        20  0.0  0.0      0     0 ?        S     2021   0:16 [ksoftirqd/2]
root        25  0.0  0.0      0     0 ?        S     2021   0:19 [ksoftirqd/3]
root        30  0.0  0.0      0     0 ?        S     2021   0:19 [ksoftirqd/4]
root        35  0.0  0.0      0     0 ?        S     2021   0:30 [ksoftirqd/5]
root        40  0.0  0.0      0     0 ?        S     2021   0:20 [ksoftirqd/6]
root        45  0.0  0.0      0     0 ?        S     2021   0:33 [ksoftirqd/7]
root        50  0.0  0.0      0     0 ?        S     2021   0:20 [ksoftirqd/8]
root        55  0.0  0.0      0     0 ?        S     2021   0:17 [ksoftirqd/9]
root        60  0.0  0.0      0     0 ?        S     2021   0:31 [ksoftirqd/10]
root        65  0.0  0.0      0     0 ?        S     2021   0:23 [ksoftirqd/11]
root        70  0.0  0.0      0     0 ?        S     2021   0:13 [ksoftirqd/12]
root        76  0.0  0.0      0     0 ?        S     2021   0:22 [ksoftirqd/13]
root        81  0.0  0.0      0     0 ?        S     2021   0:27 [ksoftirqd/14]
root        86  0.0  0.0      0     0 ?        S     2021   0:32 [ksoftirqd/15]
root        91  0.0  0.0      0     0 ?        S     2021   0:34 [ksoftirqd/16]
root        96  0.0  0.0      0     0 ?        S     2021   0:18 [ksoftirqd/17]
root       101  0.0  0.0      0     0 ?        S     2021   0:18 [ksoftirqd/18]
root       106  0.0  0.0      0     0 ?        S     2021   0:16 [ksoftirqd/19]
root       111  0.0  0.0      0     0 ?        S     2021   0:20 [ksoftirqd/20]
root       116  0.0  0.0      0     0 ?        S     2021   0:23 [ksoftirqd/21]
root       121  0.0  0.0      0     0 ?        S     2021   0:18 [ksoftirqd/22]
root       126  0.0  0.0      0     0 ?        S     2021   0:14 [ksoftirqd/23]
root       131  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/24]
root       136  0.0  0.0      0     0 ?        S     2021   0:06 [ksoftirqd/25]
root       141  0.0  0.0      0     0 ?        S     2021   0:04 [ksoftirqd/26]
root       146  0.0  0.0      0     0 ?        S     2021   0:03 [ksoftirqd/27]
root       151  0.0  0.0      0     0 ?        S     2021   0:03 [ksoftirqd/28]
root       156  0.0  0.0      0     0 ?        S     2021   0:02 [ksoftirqd/29]
root       161  0.0  0.0      0     0 ?        S     2021   0:02 [ksoftirqd/30]
root       166  0.0  0.0      0     0 ?        S     2021   0:02 [ksoftirqd/31]
root       171  0.0  0.0      0     0 ?        S     2021   0:01 [ksoftirqd/32]
root       176  0.0  0.0      0     0 ?        S     2021   0:01 [ksoftirqd/33]
root       181  0.0  0.0      0     0 ?        S     2021   0:01 [ksoftirqd/34]
root       186  0.0  0.0      0     0 ?        S     2021   0:01 [ksoftirqd/35]
root       191  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/36]
root       196  0.0  0.0      0     0 ?        S     2021   0:06 [ksoftirqd/37]
root       201  0.0  0.0      0     0 ?        S     2021   0:04 [ksoftirqd/38]
root       206  0.0  0.0      0     0 ?        S     2021   0:03 [ksoftirqd/39]
root       211  0.0  0.0      0     0 ?        S     2021   0:03 [ksoftirqd/40]
root       216  0.0  0.0      0     0 ?        S     2021   0:02 [ksoftirqd/41]
root       221  0.0  0.0      0     0 ?        S     2021   0:03 [ksoftirqd/42]
root       226  0.0  0.0      0     0 ?        S     2021   0:02 [ksoftirqd/43]
root       231  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/44]
root       236  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/45]
root       241  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/46]
root       246  0.0  0.0      0     0 ?        S     2021   0:00 [ksoftirqd/47]

[ ~]# cat /proc/irq/24/smp_affinity
0001
[ ~]# cat /proc/irq/24/smp_affinity_list 
0
#上面表示0001对应cpu0，可以直接修改绑定关系
[ ~]# echo 4 > /proc/irq/24/smp_affinity
[ ~]# cat /proc/irq/24/smp_affinity_list 
2
#此时中断号24对应的处理CPU为cpu2

mpstat -P ALL 1 1
watch -d cat /proc/softirqs
cat /proc/interrupts  

watch -d cat /proc/softirqs

ET_RX（网络接收）、SCHED（内核调度）、RCU（RCU 锁）

10 | 案例篇：系统的软中断CPU使用率升高，我该怎么办？
https://linuxhint.com/understanding_numa_architecture/
https://www.jianshu.com/p/6dfde8331380

Why - the application buffer/queue to the write/send data, understand its consequences can help a lot.
How:
- Check command: sysctl net.ipv4.tcp_rmem
- Change command: sysctl -w net.ipv4.tcp_rmem="min default max"; when changing default value, remember to restart your user space app (i.e. your web server, nginx, etc)
- How to monitor: cat /proc/net/sockstat

【11】内核参数 /etc/sysctl.conf ✅

cat /proc/sys/fs/aio-max-nr

65536

fs.aio-max-nr = 1048576

sysctl -p /etc/sysctl.conf

The Linux kernel provides the Asynchronous non-blocking I/O (AIO) feature that allows a process to initiate multiple I/O operations simultaneously without having to wait for any of them to complete. This helps boost performance for applications that are able to overlap processing and I/O. ✅

https://www.jianshu.com/p/6dfde8331380

######################## cat /proc/sys/net/core/somaxconn

# 默认值：128

# 作用：已经成功建立连接的套接字将要进入队列的长度

net.core.somaxconn = 65536

作者：济夏链接：https://www.jianshu.com/p/6dfde8331380 来源：简书著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。

Socket accept 队列满，系统的 somaxconn 内核参数默认太小。

net.ipv4.tcp_max_syn_backlog = 1024

net.ipv4.tcp_fin_timeout = 60

如何提升TCP四次挥手的性能？

为什么建立连接是三次握手，而关闭连接需要四次挥手呢？

TCP 不允许连接处于半打开状态时就单向传输数据，因此合并在一起传输了。

但是当连接处于半关闭状态时，TCP 是允许单向传输数据的。

tcp_orphan_retries

net.ipv4.tcp_max_orphans = 16384

net.ipv4.tcp_fin_timeout = 60

net.ipv4.tcp_max_tw_buckets = 5000

//Linux 提供了 tcp_max_tw_buckets 参数，当 TIME_WAIT 的连接数量超过该参数时，新关闭的连接就不再经历 TIME_WAIT 而直接关闭。

Linux 并没有限制 CLOSE_WAIT 状态的持续时间。

tcp_max_orphans 定义了孤儿连接的最大数量。当进程调用 close 函数关闭连接后，无论该连接是在 FIN_WAIT1 状态，还是确实关闭了，这个连接都与该进程无关了，它变成了孤儿连接。Linux 系统为防止孤儿连接过多，导致系统资源长期被占用，就提供了 tcp_max_orphans 参数。
- 如果孤儿连接数量大于它，新增的孤儿连接将不再走四次挥手，而是直接发送 RST 复位报文强制关闭。

## 方法

1. 沟通步骤：

准备好一个ppt，在写代码之前演示最终目标和架构设计就是如何去实现的

【不要说公司部门环境不对着就是最终结果，不要试着看看，一定是可以完全上线的项目，非demo和一个知识点。自己认为真的不是闹着玩的。。】

经过领导，专家进行鸡蛋里挑骨头。【自己做好了别人路了胡扯，不会对别人产生任何影响，做事和做人一样，无论熟悉人，还是老师，领导，不相关人反对他们反馈信号，接受质疑，经过九九八十一难考验，并且你还在坚持认为对的。】
2. 最后融合别人建议，然后完善你项目。【不听老人言，吃亏在眼前，不敢接受别人批评，说明自己完全没有把握，才去否定愤怒方式】

xx发问；怎么做调度的，任务多，负载多，流量多。

我的回答

xx点评：

完善流程：

操作系统入门知识

文章目录

How sched_setaffinity works inside of Linux Kernel

cpu

疑问1 What is NUMA?

疑问2:网卡软中断都集中在了一个cpu上，导致cpu负载升高，多余网络包被丢弃？

net

file

一手资料

## 方法

1. 沟通步骤：

2. 知识点

一、这个技术出现的背景、初衷和要达到什么样的目标或是要解决什么样的问题

二、这个技术的优势和劣势分别是什么

三、这技术适用的场景。任何技术都有其适用的场景，离开了这个场景

四、技术的组成部分和关键点。

五、技术的底层原理和关键实现

六、已有的实现和它之间的对比