  1. Apr 08, 2023
    • epoll: use refcount to reduce ep_mutex contention · 58c9b016
      Paolo Abeni authored
      We are observing huge contention on the epmutex during an http
      connection/rate test:
      
       83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
      [...]
                 |--66.96%--__fput
                            |--60.04%--eventpoll_release_file
                                       |--58.41%--__mutex_lock.isra.6
                                                 |--56.56%--osq_lock
      
      The application is multi-threaded, creates a new epoll entry for
      each incoming connection, and does not delete it before the
      connection shutdown - that is, before the connection's fd close().
      
      Many different threads compete frequently for the epmutex lock,
      affecting the overall performance.
      
      To reduce the contention this patch introduces explicit reference counting
      for the eventpoll struct. Each registered event acquires a reference,
      and references are released at ep_remove() time.
      
      The eventpoll struct is released by whichever of the EP file close()
      and the monitored file close() drops the last reference.
      
      Additionally, this introduces a new 'dying' flag to prevent races between
      the EP file close() and the monitored file close().
      eventpoll_release_file() marks, under the f_lock spinlock, each epitem
      as dying before removing it, while EP file close() does not touch
      dying epitems.
      
      The above is needed as both close operations could run concurrently and
      drop the EP reference acquired via the epitem entry. Without the above
      flag, the monitored file close() could reach the EP struct via the epitem
      list while the epitem is still listed and then try to put it after its
      disposal.
      
      An alternative could be avoiding touching the references acquired via
      the epitems at EP file close() time, but that could leave the EP struct
      alive for potentially unlimited time after EP file close(), with nasty
      side effects.
      
      With all the above in place, we can drop the epmutex usage at disposal time.
      
      Overall this produces a significant performance improvement in the
      mentioned connection/rate scenario: the mutex operations disappear from
      the topmost offenders in the perf report, and the measured connections/rate
      grows by ~60%.
      
      To make the change more readable this additionally renames ep_free() to
      ep_clear_and_put(), and moves the actual memory cleanup in a separate
      ep_free() helper.
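
      For illustration, a minimal sketch of the get/put protocol described
      above, assuming hypothetical helper names ep_get()/ep_put() and a
      refcount field (the actual patch structures the helpers and the
      dying-flag handling in its own way):

        /* Sketch only: each registered epitem holds one reference. */
        static void ep_get(struct eventpoll *ep)
        {
                refcount_inc(&ep->refcount);
        }

        /*
         * Called from both EP file close() and monitored file close();
         * whichever path drops the last reference frees the struct.
         */
        static void ep_put(struct eventpoll *ep)
        {
                if (refcount_dec_and_test(&ep->refcount))
                        ep_free(ep);
        }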
      
      Link: https://lkml.kernel.org/r/4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Co-developed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Xiumei Mu <xmu@redhat.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Nov 21, 2022
    • eventpoll: add EPOLL_URING_WAKE poll wakeup flag · caf1aeaf
      Jens Axboe authored
      
      We can have dependencies between epoll and io_uring. Consider an epoll
      context, identified by the epfd file descriptor, and an io_uring file
      descriptor identified by iofd. If we add iofd to the epfd context, and
      arm a multishot poll request for epfd with iofd, then the multishot
      poll request will repeatedly trigger and generate events until terminated
      by CQ ring overflow. This isn't a desired behavior.
      
      Add EPOLL_URING_WAKE so that io_uring can pass it in as part of the
      poll wakeup key, and check for it to detect a potential recursive
      invocation.
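
      As a sketch of how a poll wakeup callback can detect the flag (the
      callback name and surrounding logic here are illustrative, not the
      actual io_uring code):

        static int my_poll_wake(struct wait_queue_entry *wait,
                                unsigned int mode, int sync, void *key)
        {
                __poll_t mask = key_to_poll(key);

                /* Wakeup came from io_uring itself: don't recurse. */
                if (mask & EPOLL_URING_WAKE)
                        return 0;

                /* ... normal wakeup handling ... */
                return 1;
        }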
      
      Cc: stable@vger.kernel.org # 6.0
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Jul 18, 2022
    • epoll: autoremove wakers even more aggressively · a16ceb13
      Benjamin Segall authored
      If a process is killed or otherwise exits while having active network
      connections and many threads waiting on epoll_wait, the threads will all
      be woken immediately, but not removed from ep->wq.  Then when network
      traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
      not remove the entries from the list.
      
      This means that the cost of the wakeup attempt is far higher than usual,
      does not decrease, and this also competes with the dying threads trying to
      actually make progress and remove themselves from the wq.
      
      Handle this by removing visited epoll wq entries unconditionally, rather
      than only when the wakeup succeeds - the structure of ep_poll means that
      the only potential loss is the timed_out->eavail heuristic, which now can
      race and result in a redundant ep_send_events attempt.  (But only when
      incoming data and a timeout actually race, not on every timeout)
      
      Shakeel added:
      
      : We are seeing this issue in production with real workloads and it has
      : caused hard lockups.  Particularly network heavy workloads with a lot
      : of threads in epoll_wait() can easily trigger this issue if they get
      : killed (oom-killed in our case).
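
      The fix boils down to an autoremoving wake function, roughly along
      the lines of the kernel's existing autoremove_wake_function() (a
      sketch; details may differ from the actual patch):

        static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                               unsigned int mode, int sync,
                                               void *key)
        {
                int ret = default_wake_function(wq_entry, mode, sync, key);

                /*
                 * Unlink unconditionally, even when the wakeup failed, so
                 * later wake_up() scans don't revisit dead waiters.
                 */
                list_del_init_careful(&wq_entry->entry);
                return ret;
        }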
      
      Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
      
      
      Signed-off-by: Ben Segall <bsegall@google.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Heiher <r@hev.cc>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Jan 22, 2022
    • eventpoll: simplify sysctl declaration with register_sysctl() · a8f5de89
      Xiaoming Ni authored
      kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
      dishes; this makes it very difficult to maintain.
      
      To help with this maintenance let's start by moving sysctls to places
      where they actually belong.  The proc sysctl maintainers do not want
      to know what sysctl knobs you wish to add for your own piece of code;
      we just care about the core logic.
      
      So move the epoll_table sysctl to fs/eventpoll.c and use
      register_sysctl().
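
      A sketch of the resulting pattern in fs/eventpoll.c (table entries
      abbreviated; the real table also carries min/max clamping values):

        static struct ctl_table epoll_table[] = {
                {
                        .procname     = "max_user_watches",
                        .data         = &max_user_watches,
                        .maxlen       = sizeof(max_user_watches),
                        .mode         = 0644,
                        .proc_handler = proc_doulongvec_minmax,
                },
                { }
        };

        static void __init epoll_sysctls_init(void)
        {
                register_sysctl("fs/epoll", epoll_table);
        }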
      
      Link: https://lkml.kernel.org/r/20211123202422.819032-9-mcgrof@kernel.org
      
      
      Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Antti Palosaari <crope@iki.fi>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Cc: David Airlie <airlied@linux.ie>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Lukas Middendorf <kernel@tuxforce.de>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Phillip Potter <phil@philpotter.co.uk>
      Cc: Qing Wang <wangqing@vivo.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Stephen Kitt <steve@sk2.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: James E.J. Bottomley <jejb@linux.ibm.com>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Aug 20, 2021
    • ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation · 249dbe74
      Arnd Bergmann authored
      
      The epoll_wait() system call wrapper is one of the remaining users of
      the set_fs() infrastructure for Arm. Changing it to not require
      set_fs() is rather complex, unfortunately.
      
      The approach I'm taking here is to allow architectures to override
      the code that copies the output to user space, and let the oabi-compat
      implementation check whether it is getting called from an EABI or OABI
      system call based on the thread_info->syscall value.
      
      The in_oabi_syscall() check here mirrors the in_compat_syscall() and
      in_x32_syscall() helpers for 32-bit compat implementations on other
      architectures.
      
      Overall, the amount of code goes down, at least with the newly added
      sys_oabi_epoll_pwait() helper getting removed again. The downside
      is added complexity in the source code for the native implementation.
      There should be no difference in runtime performance except for Arm
      kernels with CONFIG_OABI_COMPAT enabled that now have to go through
      an external function call to check which of the two variants to use.
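
      The override point can be pictured as a generic helper that an
      architecture may replace with its own definition (a sketch based on
      the description above):

        #ifndef epoll_put_uevent
        static inline struct epoll_event __user *
        epoll_put_uevent(__poll_t revents, __u64 data,
                         struct epoll_event __user *uevent)
        {
                if (__put_user(revents, &uevent->events) ||
                    __put_user(data, &uevent->data))
                        return NULL;
                return uevent + 1;
        }
        #endif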
      
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
  6. May 07, 2021
    • fs/epoll: restore waking from ep_done_scan() · 7fab29e3
      Davidlohr Bueso authored
      Commit 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested
      epoll") changed the userspace-visible behavior of exclusive waiters
      blocked on a common epoll descriptor upon a single event becoming
      ready.
      
      Previously, all tasks doing epoll_wait would be awoken; now only one
      is, potentially causing missed wakeups on applications that rely on
      this behavior, such as Apache Qpid.
      
      While the aforementioned commit aims at having only a single wakeup
      path in ep_poll_callback (with the exception of the epoll_ctl cases),
      we need to restore the wakeup in what was the old
      ep_scan_ready_list() such that the next thread can be awoken, in a
      cascading style, after the waker's corresponding ep_send_events().
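
      In sketch form, the restored wakeup amounts to something like the
      following in ep_done_scan() (field names as used elsewhere in
      fs/eventpoll.c):

        /* If events remain ready, cascade the wakeup to the next waiter. */
        if (!list_empty(&ep->rdllist) && waitqueue_active(&ep->wq))
                wake_up(&ep->wq);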
      
      Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net
      
      
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. Mar 07, 2021
    • fs: eventpoll: fix comments & kernel-doc notation · a6c67fee
      Randy Dunlap authored
      
      Use the documented kernel-doc format for function Return: descriptions.
      Begin constant values in kernel-doc comments with '%'.
      
      Remove kernel-doc "/**" from 2 functions that are not documented with
      kernel-doc notation.
      
      Fix typos, punctuation, & grammar.
      
      Also fix a few kernel-doc warnings:
      
      ../fs/eventpoll.c:1883: warning: Function parameter or member 'ep' not described in 'ep_loop_check_proc'
      ../fs/eventpoll.c:1883: warning: Excess function parameter 'priv' description in 'ep_loop_check_proc'
      ../fs/eventpoll.c:1932: warning: Function parameter or member 'ep' not described in 'ep_loop_check'
      ../fs/eventpoll.c:1932: warning: Excess function parameter 'from' description in 'ep_loop_check'
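
      For reference, the conventions being applied look like this (a
      generic illustration, not a function from the patch):

        /**
         * ep_example_fn - one-line summary of what the function does
         * @ep: the eventpoll instance to operate on
         *
         * Longer description of the behavior and its context.
         *
         * Return: %0 on success, or a negative error code on failure.
         */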
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
  8. Nov 30, 2020
    • net: Add SO_BUSY_POLL_BUDGET socket option · 7c951caf
      Björn Töpel authored
      
      This option lets a user set a per-socket NAPI budget for
      busy-polling. If the option is not set, it will use the default of 8.
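
      Hypothetical userspace usage (assuming libc headers that expose the
      SO_BUSY_POLL_BUDGET constant):

        int budget = 16;  /* per-socket NAPI busy-poll budget */

        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                       &budget, sizeof(budget)) < 0)
                perror("setsockopt(SO_BUSY_POLL_BUDGET)");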
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
    • net: Introduce preferred busy-polling · 7fd3253a
      Björn Töpel authored
      
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      opportunistic: if the NAPI context is not scheduled, it will be
      polled. If, after busy-polling, the budget is exceeded, the
      busy-polling logic will schedule the NAPI context onto the regular
      softirq handling.
      
      One implication of the behavior above is that a busy/heavily loaded
      NAPI context will never enter/allow busy-polling. Some applications
      prefer that most NAPI processing be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
      introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
      feature"), and allow a user to defer enabling interrupts and
      instead schedule the NAPI context from a watchdog timer. When a user
      enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
      and the NAPI context is being processed by a softirq, the softirq
      NAPI processing will exit early to allow busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will timeout, and
      regular softirq handling will resume.
      
      In summary: heavy-traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window, otherwise the watchdog will timeout and fall back to regular
      softirq processing.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
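
      For example (a sketch; assumes libc headers that define both
      constants):

        int usecs = 100;  /* busy-poll up to 100 us per attempt */
        int one = 1;

        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
        setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one));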
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
  9. Oct 26, 2020
    • epoll: take epitem list out of struct file · 319c1517
      Al Viro authored
      
      Move the head of epitem list out of struct file; for epoll ones it's
      moved into struct eventpoll (->refs there), for non-epoll - into
      the new object (struct epitem_head).  In place of ->f_ep_links we
      leave a pointer to the list head (->f_ep).
      
      ->f_ep is protected by ->f_lock and it's zeroed as soon as the list
      of epitems becomes empty (that can happen only in ep_remove() by
      now).
      
      The list of files for reverse path check is *not* going through
      struct file now - it's a single-linked list going through epitem_head
      instances.  It's terminated by ERR_PTR(-1) (== EP_UNACTIVE_POINTER),
      so the elements of list can be distinguished by head->next != NULL.
      
      epitem_head instances are allocated at ep_insert() time (by
      attach_epitem()) and freed either by ep_remove() (if it empties
      the set of epitems *and* epitem_head does not belong to the
      reverse path check list) or by clear_tfile_check_list() when
      the list is emptied (if the set of epitems is empty by that
      point).  Allocations are done from a separate slab - minimal kmalloc()
      size is too large on some architectures.
      
      As a result, we trim struct file _and_ get rid of the games with
      temporary file references.
      
      Locking and barriers are interesting (aren't they always); see unlist_file()
      and ep_remove() for details.  The non-obvious part is that ep_remove() needs
      to decide if it will be the one to free the damn thing *before* actually
      storing NULL to head->epitems.first - that's what smp_load_acquire is for
      in there.  unlist_file() lockless path is safe, since we hit it only if
      we observe NULL in head->epitems.first and whoever had done that store is
      guaranteed to have observed non-NULL in head->next.  IOW, their last access
      had been the store of NULL into ->epitems.first and we can safely free
      the sucker.  OTOH, we are under rcu_read_lock() and both epitem and
      epitem->file have their freeing RCU-delayed.  So if we see non-NULL
      ->epitems.first, we can grab ->f_lock (all epitems in there share the
      same struct file) and safely recheck the emptiness of ->epitems; again,
      ->next is still non-NULL, so ep_remove() couldn't have freed head yet.
      ->f_lock serializes us wrt ep_remove(); the rest is trivial.
      
      Note that once head->epitems becomes NULL, nothing can get inserted into
      it - the only remaining reference to head after that point is from the
      reverse path check list.
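
      In sketch form, the new out-of-line head described above (field
      names follow the commit message; the real definition may differ):

        struct epitems_head {
                struct hlist_head epitems;  /* all epitems watching one file */
                struct epitems_head *next;  /* reverse path check list,
                                             * terminated by EP_UNACTIVE_POINTER */
        };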
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • epoll: massage the check list insertion · d9f41e3c
      Al Viro authored
      
      In the "non-epoll target" cases, do it in ep_insert() rather than
      in do_epoll_ctl(), so that we do it only when some epitem is already
      guaranteed to exist.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • b62d2706 (Al Viro)
    • convert ->f_ep_links/->fllink to hlist · 44cdc1d9
      Al Viro authored
      
      we don't care about the order of elements there
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ep_insert(): move creation of wakeup source past the fl_ep_links insertion · d1ec50ad
      Al Viro authored
      
      That's the beginning of preparations for taking f_ep_links out of struct file.
      If insertion might fail, we will need a new failure exit.  Having wakeup
      source creation done after that point will simplify life there; ep_remove()
      can (and commonly does) live with NULL epi->ws, so it can be used for
      cleanup after ep_create_wakeup_source() failure.  It can't be used before
      the rbtree insertion, though, so if we are to unify all old failure exits,
      we need to move that thing down.  Then we would be free to do simple
      kmem_cache_free() on the failure to insert into f_ep_links - no wakeup source
      to leak on that failure exit.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • 2c0b71c1 (Al Viro)
    • take the common part of ep_eventpoll_poll() and ep_item_poll() into helper · ad9366b1
      Al Viro authored
      
      The only reason why ep_item_poll() can't simply call ep_eventpoll_poll()
      (or, better yet, call vfs_poll() in all cases) is that we need to tell
      lockdep how deep into the hierarchy of ->mtx we are.  So let's add
      a variant of ep_eventpoll_poll() that would take depth explicitly
      and turn ep_eventpoll_poll() into a wrapper for that.
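
      A sketch of the resulting shape (the shared body is elided):

        static __poll_t __ep_eventpoll_poll(struct file *file,
                                            poll_table *wait, int depth);

        static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
        {
                /* Top-level entry: depth 0 in the ->mtx hierarchy. */
                return __ep_eventpoll_poll(file, wait, 0);
        }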
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ep_insert(): we only need tep->mtx around the insertion itself · 85353e91
      Al Viro authored
      
      We do need ep->mtx (and we are holding it all along), but that's
      the lock on the epoll we are inserting into; locking of the
      epoll being inserted is not needed for most of that work -
      as a matter of fact, we only need it to provide barriers
      for the fastpath check (for now).
      
      Move taking and releasing it into ep_insert().  The caller
      (do_epoll_ctl()) doesn't need to bother with that at all.
      Moreover, that way we kill the kludge in ep_item_poll() - now
      it's always called with tep unlocked.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • lift locking/unlocking ep->mtx out of ep_{start,done}_scan() · 57804b1c
      Al Viro authored
      
      Get rid of the depth/ep_locked arguments there and document
      the kludge in ep_item_poll() that has led to ep_locked's existence
      in the first place.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>