What is timerfd, the timer fd?

What does a timerfd look like?

What is timerfd? It is an fd type tied to time, usually called a timer fd. Let's start by looking at what a timerfd looks like. I found a process on a Linux machine that has a timerfd open:

root@ubuntu:~# ll /proc/6997/fd/
...
lrwx--- 1 root root 64 Aug 10 14:13 3 -> anon_inode:[timerfd]

root@ubuntu:~# cat /proc/6997/fdinfo/3 
pos: 0
flags: 02
mnt_id: 11
clockid: 0
ticks: 0
settime flags: 01
it_value: (0, 969820149)
it_interval: (1, 0)

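Through procfs, /proc/${pid}/fd/ shows the handles a process has open. One key piece of information stands out: anon_inode:[timerfd], which means the timerfd is bound to an anonymous inode.

/proc/${pid}/fdinfo/ shows the details of each handle (time values are printed as (seconds, nanoseconds)):

clockid: the clock type (0 is CLOCK_REALTIME);
ticks: the number of expirations that have occurred and not yet been read;
settime flags: the flags argument passed to timerfd_settime (01 is TFD_TIMER_ABSTIME);
it_value: the time remaining until the timer next expires;
it_interval: the interval between expirations;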

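What is timerfd?

Split the name apart and you get timer fd: a file-descriptor type for timers, so its readiness is driven by time. After a timerfd is created (timerfd_create), you set its expiration time (timerfd_settime). Once the timer expires the fd becomes readable, and a read returns the number of expirations.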

File handles and socket handles all support read/write/close. What can a timerfd do?

A timerfd supports read, poll, and close, as the kernel's operation table shows:

// fs/timerfd.c
static const struct file_operations timerfd_fops = { 
    .release    = timerfd_release,
    .poll       = timerfd_poll,
    .read       = timerfd_read,
    .show_fdinfo    = timerfd_show,
    // ...
};

The whole implementation of the timerfd handle lives in a single file, fs/timerfd.c. Remember the information shown by cat /proc/${pid}/fdinfo/ above? That output is produced by timerfd_show.

How is timerfd used?

Three system calls deal with timerfd. Their prototypes are:

#include <sys/timerfd.h>

// Create a timerfd handle
int timerfd_create(int clockid, int flags);
// Arm or disarm the timer behind the timerfd
int timerfd_settime(int fd, int flags, const struct itimerspec *new_value, struct itimerspec *old_value);
// Get the time remaining until the next expiration of the given timerfd
int timerfd_gettime(int fd, struct itimerspec *curr_value);

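timerfd is commonly used as a timer. Once an expiration time is set, the timerfd becomes readable after every interval. Running man timerfd_create shows the complete documentation, which includes a C example. Here is that example, lightly trimmed: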

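#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    struct itimerspec new_value;
    struct timespec now;
    uint64_t exp, tot_exp, max_exp;
    int fd;
    if (argc != 4) {
        fprintf(stderr, "usage: %s init-secs interval-secs max-exp\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    // First expiration: now + argv[1] seconds, as an absolute CLOCK_REALTIME time
    if (clock_gettime(CLOCK_REALTIME, &now) == -1) { perror("clock_gettime"); exit(EXIT_FAILURE); }
    new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
    new_value.it_value.tv_nsec = now.tv_nsec;
    // Interval between subsequent expirations
    new_value.it_interval.tv_sec = atoi(argv[2]);
    new_value.it_interval.tv_nsec = 0;
    max_exp = strtoull(argv[3], NULL, 10);
    // Create the timerfd
    fd = timerfd_create(CLOCK_REALTIME, 0);
    if (fd == -1) { perror("timerfd_create"); exit(EXIT_FAILURE); }
    // Set the first expiration and the interval; TFD_TIMER_ABSTIME makes it_value absolute
    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1) { perror("timerfd_settime"); exit(EXIT_FAILURE); }
    // Timer loop
    for (tot_exp = 0; tot_exp < max_exp;) {
        // read blocks until the timer expires, then returns the number of expirations
        if (read(fd, &exp, sizeof(uint64_t)) != sizeof(uint64_t)) { perror("read"); exit(EXIT_FAILURE); }
        // Accumulate the total number of expirations
        tot_exp += exp;
        printf("read: %llu; total=%llu\n", (unsigned long long) exp, (unsigned long long) tot_exp);
    }
    close(fd);
    return 0;
}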

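What this example does:

It obtains a handle with timerfd_create, then uses timerfd_settime to set the expiration time and start the kernel timer;
It then reads from the fd. Until the timerfd expires, read blocks, and it returns only after the kernel timer has expired. That produces the timing effect;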

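The example is effectively a periodic loop that sleeps for an interval and then prints a line. It is the simplest timerfd example in the official documentation.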

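timerfd also works with epoll: have epoll watch the timerfd for readability. When the timerfd expires it becomes readable, epoll_wait wakes up, and the application does its periodic work. This achieves the same timer behavior.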

How timerfd works

Let's take a brief look at the kernel implementation. The principle is actually very simple.

1 timerfd_create

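From the user's point of view, this function creates a timerfd, and the returned fd supports read, poll (poll, select, epoll), close, and similar operations.

From the implementation's point of view, timerfd_create corresponds to a system call: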

SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
{
    int ufd;
    struct timerfd_ctx *ctx;
    // (validation of clockid and flags omitted)

    // Allocate the core timerfd data structure
    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    if (!ctx)
        return -ENOMEM;

    // Important: initialize the ctx->wqh queue, a list head that wait entries hang off
    init_waitqueue_head(&ctx->wqh);
    ctx->clockid = clockid;

    // Initialize the timer
    if (isalarm(ctx))
        alarm_init(&ctx->t.alarm, ctx->clockid == CLOCK_REALTIME_ALARM ? ALARM_REALTIME : ALARM_BOOTTIME, timerfd_alarmproc);
    else
        hrtimer_init(&ctx->t.tmr, clockid, HRTIMER_MODE_ABS);

    // Get an anonymous fd whose file->f_op is set to timerfd_fops
    ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx, O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
    if (ufd < 0)
        kfree(ctx);
    // Return the fd (non-negative) or a negative error code
    return ufd;
}

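The function does the following key things:

It creates and initializes a timerfd_ctx structure. The list head ctx->wqh inside this ctx is important because it is where timerfd connects with epoll;
It initializes the timer, which is either an alarm timer or a high-resolution hrtimer depending on the clock type. Note that timerfd does not implement timing itself: it uses timers the kernel already provides and only wraps them as a "file";
It creates an anonymous fd bound to the timerfd_fops operation table;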

The core structure timerfd_ctx:

struct timerfd_ctx {
    // The actual kernel timer
    union {
        struct hrtimer tmr;
        struct alarm alarm;
    } t;
    // List head that wait entries hang off
    wait_queue_head_t wqh;
    // Number of expirations
    u64 ticks;
    // Clock type
    int clockid;
    // ...
};

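This ctx object is assigned to the file->private_data field. Every later operation on the fd first finds the file from the fd and takes file->private_data. It then casts that pointer to timerfd_ctx and performs the timer fd operations on it.

Note the key step: the file structure is obtained with anon_inode_getfd, a function that produces an anonymous handle.

Anonymous fds deserve a closer look. Why do they exist, and what does "anonymous" mean?

In Linux everything is a file. What is the most familiar property of a "file"? Its path. "Anonymous" means having no path, which in kernel terms means having no valid dentry.

In the Linux file model, a file handle corresponds to a file structure, which is associated with an inode. The three pieces file/dentry/inode must always be present together. Even an anonymous file (no path, no valid dentry) must have its file structure bound to an inode and a dentry, even if that inode is a fake, incomplete one.

That is why anon_inodefs exists. The kernel provides one shared inode, so every kernel module that needs one saves memory and avoids duplicating inode initialization code.

Behind anonymous fds is a kernel filesystem called anon_inodefs (in fs/anon_inodes.c). It is extremely simple: the whole filesystem has a single inode, created when the filesystem is initialized.

From then on, every handle that needs an anonymous inode is simply associated with this one inode. Handles that use the anonymous inode are called anonymous handles.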

2 timerfd_settime

This function arms or disarms a timerfd by setting its first expiration and its interval. The argument structures are:

struct timespec {
    time_t tv_sec;                /* Seconds */
    long   tv_nsec;               /* Nanoseconds */
};

struct itimerspec {
    struct timespec it_interval;  /* Interval for periodic timer */
    struct timespec it_value;     /* Initial expiration */
};

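In struct itimerspec, the it_value field specifies the first expiration and it_interval specifies the interval between subsequent expirations. An all-zero it_value disarms the timer, and an all-zero it_interval makes it fire only once.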

The main logic:

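SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags, const struct __kernel_itimerspec __user *, utmr, struct __kernel_itimerspec __user *, otmr)
{
    struct itimerspec64 new, old;
    int ret;

    // Copy the new setting in from user space
    if (get_itimerspec64(&new, utmr))
        return -EFAULT;
    ret = do_timerfd_settime(ufd, flags, &new, &old);
    if (ret)
        return ret;
    // Copy the previous setting out, if the caller asked for it
    if (otmr && put_itimerspec64(&old, otmr))
        return -EFAULT;
    return ret;
}
static int do_timerfd_settime(int ufd, int flags, const struct itimerspec64 *new, struct itimerspec64 *old)
{
    // (declarations, validation and locking omitted)
    // Look up the file structure from the fd
    ret = timerfd_fget(ufd, &f);

    // Get the timerfd_ctx from the file
    ctx = f.file->private_data;

    // If a timer is already running, it must be stopped first
    for (;;) {
        // timer cancellation logic
    }
    // Save the previous timer setting
    old->it_value = ktime_to_timespec64(timerfd_get_remaining(ctx));
    old->it_interval = ktime_to_timespec64(ctx->tintv);

    // Set up the timer again
    ret = timerfd_setup(ctx, flags, new);
    // ...
}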

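static int timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec64 *ktmr)
{
    enum hrtimer_mode htmode;
    ktime_t texp;
    int clockid = ctx->clockid;

    // TFD_TIMER_ABSTIME selects an absolute expiration, otherwise it is relative
    htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS : HRTIMER_MODE_REL;
    texp = timespec64_to_ktime(ktmr->it_value);
    ctx->ticks = 0;
    ctx->tintv = timespec64_to_ktime(ktmr->it_interval);

    // Initialize an alarm or an hrtimer; the expiration callbacks are
    // timerfd_alarmproc and timerfd_tmrproc respectively
    if (isalarm(ctx)) {
        alarm_init(&ctx->t.alarm, ctx->clockid == CLOCK_REALTIME_ALARM ? ALARM_REALTIME : ALARM_BOOTTIME, timerfd_alarmproc);
    } else {
        hrtimer_init(&ctx->t.tmr, clockid, htmode);
        hrtimer_set_expires(&ctx->t.tmr, texp);
        ctx->t.tmr.function = timerfd_tmrproc;
    }
    // A zero it_value means "disarm": the timer is started only otherwise
    if (texp != 0) {
        // Start the timer with alarm_start / alarm_start_relative or hrtimer_start
        // ...
    }
    // ...
    return 0;
}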

The steps are simple:

Get the file from the fd, then the core structure timerfd_ctx;
Then configure the timer and start it;

Key point: timerfd does not implement timing itself. It uses the timers the kernel already provides and only adds "file semantics" on top, so a timer can be used for I/O just like a file.

3 timerfd_gettime

This function gets the time remaining until the next expiration of the given timerfd.

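SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct __kernel_itimerspec __user *, otmr)
{
    struct itimerspec64 kotmr;
    int ret = do_timerfd_gettime(ufd, &kotmr);
    if (ret)
        return ret;
    // Copy the result out to user space
    return put_itimerspec64(&kotmr, otmr) ? -EFAULT : 0;
}
static int do_timerfd_gettime(int ufd, struct itimerspec64 *t)
{
    // Get the file structure from the fd
    int ret = timerfd_fget(ufd, &f);
    // Get the timerfd_ctx structure from the file
    ctx = f.file->private_data;
    // Compute the time until the next expiration
    // ...
}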

Steps:

Get the file from the fd, then the core structure timerfd_ctx;
Then compute the time until the next expiration from the information stored in timerfd_ctx;

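timerfd and epoll working together

With every fd type I cover, I bring epoll in as well, so you can go over the epoll mechanism again and again. timerfd is a great opportunity, because it is simple enough to have no complexity at all. Let's walk through it together.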

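1 What is hidden in timerfd creation?

As mentioned earlier, the core structure of a timerfd is timerfd_ctx, attached to the file->private_data field. Remember the list head inside ctx?

It is timerfd_ctx->wqh, a list head initialized when the timerfd is created. That is the hidden key.

This list head is where wait entries hang. When an event becomes ready, the kernel walks the list and invokes each wait entry's callback in turn.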

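Compare with the special fds covered earlier in this Linux fd series:

timerfd: the timerfd_ctx structure has a list head, timerfd_ctx->wqh;
eventfd: the eventfd_ctx structure has a list head, eventfd_ctx->wqh;
socketfd: the sock structure has sk->sk_wq, which holds its wait queue;

Key point: this wait list is one of the core foundations. It is where poll operations hang their wait entries.

The other core part of initialization is setting the timerfd's file->f_op to the timerfd_fops operation table.

Quick recap:

A timerfd_ctx structure is created, containing a queue of wait entries (ctx->wqh);
file->f_op is set to the timerfd_fops operation table;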

2 How does epoll_ctl fit in?

The question now is: when are entries added to ctx->wqh?

For timerfd, timerfd_poll calls poll_wait to add an entry to this list. The calls look like this:

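// fs/timerfd.c
static __poll_t timerfd_poll(struct file *file, poll_table *wait) {
    struct timerfd_ctx *ctx = file->private_data;
    __poll_t events = 0;
    unsigned long flags;

    // Add a wait entry to ctx->wqh
    poll_wait(file, &ctx->wqh, wait);

    // The fd is readable once at least one expiration has been recorded
    spin_lock_irqsave(&ctx->wqh.lock, flags);
    if (ctx->ticks)
        events |= EPOLLIN;
    spin_unlock_irqrestore(&ctx->wqh.lock, flags);
    return events;
}

// include/linux/poll.h
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) {
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}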

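The wait entry is added to the list inside poll_wait, through p->_qproc. You may ask what this callback is. Let's hold that question for a moment and keep going.

First, consider another question: who calls timerfd_poll?

epoll_ctl does, when it registers the fd. The call path is: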

epoll_ctl
    -> ep_insert
        // poll_table->_qproc is initialized to ep_ptable_queue_proc
        -> init_poll_funcptr
        // hang the wait entry on the wait list
        -> ep_item_poll
            -> vfs_poll
                -> timerfd_poll

vfs_poll simply calls file->f_op->poll. The timerfd's f_op table is timerfd_fops, whose .poll entry is timerfd_poll, and that ties it all together.

This also answers the earlier question: what is p->_qproc?

It was set to ep_ptable_queue_proc by init_poll_funcptr in ep_insert.

So let's look at how ep_ptable_queue_proc actually adds the entry:

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt) {
    // Get the corresponding epitem
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    // Allocate the wait entry wrapper (allocation-failure handling omitted)
    pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);

    // Initialize the wait entry
    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
    pwq->whead = whead;
    pwq->base = epi;

    // Add the wait entry to the list (this list is timerfd_ctx->wqh)
    if (epi->event.events & EPOLLEXCLUSIVE)
        add_wait_queue_exclusive(whead, &pwq->wait);
    else
        add_wait_queue(whead, &pwq->wait);
    // ...
}

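From this we get two key facts:

The wait entry's callback (wq_entry->func) is set to ep_poll_callback;
pwq->base is set to epi (the epoll item structure for the fd);

ep_poll_callback does two main things:

It puts the corresponding epitem on epoll's ready list;
It wakes up the process blocked in epoll_wait (before epoll_wait sleeps, it hangs a wait entry for the current process on epoll's own wait list);

Recap:

Inside epoll_ctl, timerfd_poll hangs a wait entry associated with the epitem on the timerfd's ctx->wqh queue;
The wait entry's callback is set to ep_poll_callback, and the epitem is reachable from it through pwq->base;

With these two pieces of groundwork in place, once the timerfd event is ready, the epoll instance can be notified through the wait entry.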

3 How does timerfd wake up epoll_wait?

The groundwork is done and the callback path for the wakeup is in place. Now let's look at how it gets triggered.

timerfd implements its timing directly with kernel timers, which come in two kinds depending on the clock type:

struct hrtimer tmr;
struct alarm alarm;

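hrtimer is the high-resolution timer. For simplicity, I will only discuss hrtimer below. A timer can be given a callback function, which is invoked asynchronously when the timer expires. The callback timerfd sets is timerfd_tmrproc, so this function is where the wakeup chain starts.

When the timer expires, the kernel invokes the callback: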

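timerfd_tmrproc (installed by timerfd_setup when timerfd_settime is called)
-> timerfd_triggered (increments ctx->ticks)
    -> wake_up_locked_poll (wakes every wait entry on the timerfd's ctx->wqh)
        -> ep_poll_callback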

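Remember the callback path for socket fds?

-> hardware interrupt
    -> softirq
        -> tcp_v4_rcv (the protocol-specific handler)
            -> sk->sk_data_ready
                -> ep_poll_callback

It is the same recipe and the same pattern, and with that the whole path to epoll is connected.

To summarize, the complete path: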

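When timerfd_create creates the timerfd, it prepares the wait queue ctx->wqh;
timerfd_settime installs the timer callback timerfd_tmrproc;
When epoll_ctl registers the fd, it wraps ep_poll_callback in a wait entry and hangs it on the ctx->wqh list;
When the timer expires, timerfd_tmrproc walks ctx->wqh and calls ep_poll_callback, which completes the event notification;

After all that, here is a diagram that sums it up. Does it all make sense now?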

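Summary

procfs is the interface the kernel gives users for inspecting processes, and it is very important. /proc/${pid}/fd/ lists every open handle, and /proc/${pid}/fdinfo/ shows each handle's details, produced by the .show_fdinfo callback;
The core structure of a timerfd is timerfd_ctx. Starting from the fd, you first find the file structure, and the ctx is stored in file->private_data;
timerfd directly reuses hrtimer or alarm timers, and only wraps the timer in a file interface;
The kernel provides an anonymous filesystem called anon_inodefs, which saves memory and enables code reuse. It is a boon for handle types that want a file interface without implementing a full inode, and fd types such as timerfd, eventfd, and eventpoll all benefit from it;
timerfd can perform I/O on a timer as if it were a file thanks to the "everything is a file" design philosophy. It is an excellent example for understanding that design, and also for understanding how epoll manages fd events, because it is so simple!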

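Afterword

With the minimal timerfd added to the earlier epoll analysis, you should now understand the epoll mechanism thoroughly. Anonymous fds turned out to be quite interesting: timerfd, eventfd, and eventpoll fds are all anonymous fds backed by the anonymous inode. I'll find a chance to cover them.

This article was AMP-transcoded by Readfog; copyright belongs to the original author.
Source: https://mp.weixin.qq.com/s/vmtZGiB9ylWewNnItA73ZQ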
