io_uring_setup(2) — Linux manual page


io_uring_setup(2)       Linux Programmer's Manual       io_uring_setup(2)

NAME

       io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The io_uring_setup(2) system call sets up a submission queue (SQ)
       and completion queue (CQ) with at least entries entries, and
       returns a file descriptor which can be used to perform subsequent
       operations on the io_uring instance.  The submission and
       completion queues are shared between userspace and the kernel,
       which eliminates the need to copy data when initiating and
       completing I/O.

       params (the argument named p in the synopsis) is used by the
       application to pass options to the kernel, and by the kernel to
       convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The flags, sq_thread_cpu, and sq_thread_idle fields are used to
       configure the io_uring instance.  flags is a bit mask of 0 or more
       of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform busy-waiting for an I/O completion, as opposed to
              getting notifications via an asynchronous IRQ (Interrupt
              Request).  The file system (if any) and block device must
              support polling in order for this to work.  Busy-waiting
              provides lower latency, but may consume more CPU resources
              than interrupt driven I/O.  Currently, this feature is
              usable only on a file descriptor opened using the O_DIRECT
              flag.  When a read or write is submitted to a polled
              context, the application must poll for completions on the
              CQ ring by calling io_uring_enter(2).  It is illegal to mix
              and match polled and non-polled I/O on an io_uring
              instance.

              This is only applicable for storage devices for now, and
              the storage device must be configured for polling. How to
              do that depends on the device type in question. For NVMe
              devices, the nvme driver must be loaded with the
              poll_queues parameter set to the desired number of polling
              queues. The polling queues will be shared appropriately
              between the CPUs in the system, if the number is less than
              the number of online CPU threads.

       IORING_SETUP_HYBRID_IOPOLL
              This flag must be used together with the
              IORING_SETUP_IOPOLL flag.  Hybrid I/O polling is a feature
              based on iopoll. It differs from strict polling in that it
              will delay a bit before doing completion side polling, to
              avoid wasting too much CPU resources. Like IOPOLL, it
              requires that devices support polling.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to
              perform submission queue polling.  An io_uring instance
              configured in this way enables an application to issue I/O
              without ever context switching into the kernel.  By using
              the submission queue to fill in new submission queue
              entries and watching for completions on the completion
              queue, the application can submit and reap I/Os without
              doing a single system call.

              If the kernel thread is idle for more than sq_thread_idle
              milliseconds, it will set the IORING_SQ_NEED_WAKEUP bit in
              the flags field of the struct io_sq_ring.  When this
              happens, the application must call io_uring_enter(2) to
              wake the kernel thread.  If I/O is kept busy, the kernel
              thread will never sleep.  An application making use of this
              feature will need to guard the io_uring_enter(2) call with
              the following code sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
                  unsigned flags = atomic_load_acquire(sq_ring->flags);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);

              where sq_ring is a submission queue ring set up using the
              struct io_sqring_offsets described below.

              Note that, when using liburing with a ring set up with
              IORING_SETUP_SQPOLL, you normally do not call the
              io_uring_enter(2) system call directly. That is usually
              taken care of by liburing's io_uring_submit(3) function,
              which determines whether you are using polling mode and
              calls io_uring_enter(2) when your program needs it.

              Before version 5.11 of the Linux kernel, to successfully
              use this feature, the application must register a set of
              files to be used for I/O through io_uring_register(2)
              using the IORING_REGISTER_FILES opcode. Failure to do so
              results in submitted I/O failing with EBADF.  Whether this
              requirement has been lifted can be detected with the
              IORING_FEAT_SQPOLL_NONFIXED feature flag.  In version 5.11
              and later, it is no longer necessary to register files to
              use this feature. 5.11 also allows using this as non-root,
              if the user has the CAP_SYS_NICE capability. In 5.13 this
              requirement was also relaxed, and no special privileges are
              needed for SQPOLL in newer kernels. Certain stable kernels
              older than 5.13 may also support unprivileged SQPOLL.
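
              The guard shown above uses placeholder atomics helpers.
              As a rough sketch of the same check in C11, assuming the
              mapped tail and flags words can be accessed through
              _Atomic unsigned pointers (true on common Linux targets,
              but not guaranteed by the C standard), one could write
              the following. The names sq_publish_needs_wakeup, ktail,
              kflags, and DEMO_SQ_NEED_WAKEUP are hypothetical; the
              constant is assumed to mirror IORING_SQ_NEED_WAKEUP from
              <linux/io_uring.h>. The full fence is a conservative
              choice, not necessarily the minimum that is required.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror IORING_SQ_NEED_WAKEUP from <linux/io_uring.h>. */
#define DEMO_SQ_NEED_WAKEUP (1U << 0)

/*
 * Publish a new SQ tail and report whether the SQPOLL thread has to be
 * woken with io_uring_enter(2) and IORING_ENTER_SQ_WAKEUP.
 *
 * ktail and kflags stand in for the mapped ring words located at
 * sq_off.tail and sq_off.flags; both names are hypothetical.
 */
static bool sq_publish_needs_wakeup(_Atomic unsigned *ktail,
                                    _Atomic unsigned *kflags,
                                    unsigned new_tail)
{
    assert(ktail != NULL && kflags != NULL);

    /* Release store: SQE contents become visible before the new tail. */
    atomic_store_explicit(ktail, new_tail, memory_order_release);

    /* Full fence: the flags load below must not pass the tail store. */
    atomic_thread_fence(memory_order_seq_cst);

    unsigned flags = atomic_load_explicit(kflags, memory_order_acquire);
    return (flags & DEMO_SQ_NEED_WAKEUP) != 0;
}
```

              When the function returns true, the caller would issue
              io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP; otherwise
              no system call is needed for the submission.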

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be
              bound to the CPU set in the sq_thread_cpu field of the
              struct io_uring_params.  This flag is only meaningful when
              IORING_SETUP_SQPOLL is specified. When the cgroup setting
              cpuset.cpus changes (typically in a container
              environment), the CPU the thread is bound to may change as
              well.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct
              io_uring_params.cq_entries entries.  The value must be
              greater than entries, and may be rounded up to the next
              power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds
              IORING_MAX_ENTRIES, then entries will be clamped at
              IORING_MAX_ENTRIES.  If the flag IORING_SETUP_CQSIZE is
              set, and if the value of struct io_uring_params.cq_entries
              exceeds IORING_MAX_CQ_ENTRIES, then it will be clamped at
              IORING_MAX_CQ_ENTRIES.
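
              The sizing rules for IORING_SETUP_CQSIZE and
              IORING_SETUP_CLAMP can be illustrated with a small
              helper. This is only a sketch: it assumes that a request
              is rounded up to the next power of two and that an
              oversized request without the clamp flag is rejected.
              The function names are hypothetical, and limit is a
              caller-supplied stand-in for IORING_MAX_ENTRIES or
              IORING_MAX_CQ_ENTRIES, whose values are not given here.

```c
#include <assert.h>

/* Round v up to the next power of two (v must be in 1..2^31). */
static unsigned roundup_pow2(unsigned v)
{
    assert(v != 0 && v <= 0x80000000u);
    unsigned r = 1;
    while (r < v)
        r <<= 1;
    return r;
}

/*
 * Predict the ring size for a requested entry count.
 * limit stands in for IORING_MAX_ENTRIES (or IORING_MAX_CQ_ENTRIES);
 * clamp mirrors whether IORING_SETUP_CLAMP is set.
 * Returns 0 where this sketch expects the request to be rejected.
 */
static unsigned predicted_ring_entries(unsigned requested, unsigned limit,
                                       int clamp)
{
    if (requested == 0)
        return 0;
    if (requested > limit) {
        if (!clamp)
            return 0;
        requested = limit;
    }
    return roundup_pow2(requested);
}
```

              The authoritative sizes are the sq_entries and cq_entries
              values that the kernel writes back into struct
              io_uring_params; a helper like this is only useful for
              estimating them ahead of the call.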

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct
              io_uring_params.wq_fd being set to an existing io_uring
              ring file descriptor. When set, the io_uring instance being
              created will share the asynchronous worker thread backend
              of the specified io_uring ring, rather than create a new
              separate thread pool. Additionally the sq polling thread
              will be shared, if IORING_SETUP_SQPOLL is set.

       IORING_SETUP_R_DISABLED
              If this flag is specified, the io_uring ring starts in a
              disabled state.  In this state, restrictions can be
              registered, but submissions are not allowed.  See
              io_uring_register(2) for details on how to enable the ring.
              Available since 5.10.

       IORING_SETUP_SUBMIT_ALL
              Normally io_uring stops submitting a batch of requests if
              one of these requests results in an error. This can cause
              fewer requests to be submitted than expected if a request
              ends in error while being submitted. If the ring is
              created with this flag, io_uring_enter(2) will continue
              submitting requests even if it encounters an error
              submitting a request. CQEs are still posted for errored
              requests regardless of whether or not this flag is set at
              ring creation time; the only difference is whether the
              submit sequence is halted or continued when an error is
              observed. Available since 5.18.

       IORING_SETUP_COOP_TASKRUN
              By default, io_uring will interrupt a task running in
              userspace when a completion event comes in. This is to
              ensure that completions run in a timely manner. For a lot
              of use cases, this is overkill and can reduce performance
              through the inter-processor interrupt used to do this,
              the kernel/user transition, the needless interruption of
              the task's userspace activities, and reduced batching if
              completions come in at a rapid rate. Most applications
              don't need the forceful interruption, as the events are
              processed at any kernel/user transition. The exception is
              setups where the application uses multiple threads
              operating on the same ring, where the thread waiting on
              completions isn't the one that submitted them. For most
              other use cases, setting this flag will improve
              performance. Available since 5.19.

       IORING_SETUP_TASKRUN_FLAG
              Used in conjunction with IORING_SETUP_COOP_TASKRUN, this
              provides a flag, IORING_SQ_TASKRUN, which is set in the SQ
              ring flags whenever completions are pending that should be
              processed. liburing will check for this flag even when
              doing io_uring_peek_cqe(3) and enter the kernel to process
              them, and applications can do the same. This makes
              IORING_SETUP_TASKRUN_FLAG safe to use even when
              applications rely on a peek style operation on the CQ ring
              to see if anything might be pending to reap. Available
              since 5.19.

       IORING_SETUP_SQE128
              If set, io_uring will use 128-byte SQEs rather than the
              normal 64-byte sized variant. This is a requirement for
              using certain request types. As of 5.19, only the
              IORING_OP_URING_CMD command used for NVMe passthrough
              needs this. Available since 5.19.

       IORING_SETUP_CQE32
              If set, io_uring will use 32-byte CQEs rather than the
              normal 16-byte sized variant. This is a requirement for
              using certain request types. As of 5.19, only the
              IORING_OP_URING_CMD command used for NVMe passthrough
              needs this. Available since 5.19.

       IORING_SETUP_SINGLE_ISSUER
              A hint to the kernel that only a single task (or thread)
              will submit requests, which is used for internal
              optimisations. The submission task is either the task that
              created the ring, or if IORING_SETUP_R_DISABLED is
              specified then it is the task that enables the ring through
              io_uring_register(2).  The kernel enforces this rule,
              failing requests with -EEXIST if the restriction is
              violated.  Note that when IORING_SETUP_SQPOLL is set, the
              polling task is considered to be doing all submissions on
              behalf of userspace, so it always complies with the rule
              regardless of how many userspace tasks call
              io_uring_enter(2).  Available since 6.0.

       IORING_SETUP_DEFER_TASKRUN
              By default, io_uring will process all outstanding work at
              the end of any system call or thread interrupt. This can
              delay the application from making other progress.  Setting
              this flag will hint to io_uring that it should defer work
              until an io_uring_enter(2) call with the
              IORING_ENTER_GETEVENTS flag set. This allows the
              application to request work to run just before it wants to
              process completions.  This flag requires the
              IORING_SETUP_SINGLE_ISSUER flag to be set, and also
              enforces that the call to io_uring_enter(2) is called from
              the same thread that submitted requests.  Note that if this
              flag is set then it is the application's responsibility to
              periodically trigger work (for example via any of the CQE
              waiting functions) or else completions may not be
              delivered.  Available since 6.1.

       IORING_SETUP_NO_MMAP
              By default, io_uring allocates kernel memory that callers
              must subsequently mmap(2).  If this flag is set, io_uring
              instead uses caller-allocated buffers; p->cq_off.user_addr
              must point to the memory for the sq/cq rings, and
              p->sq_off.user_addr must point to the memory for the sqes.
              Each allocation must be contiguous memory.  Typically,
              callers should allocate this memory by using mmap(2) to
              allocate a huge page.  If this flag is set, a subsequent
              attempt to mmap(2) the io_uring file descriptor will fail.
              Available since 6.5.

       IORING_SETUP_REGISTERED_FD_ONLY
              If this flag is set, io_uring will register the ring file
              descriptor, and return the registered descriptor index,
              without ever allocating an unregistered file descriptor.
              The caller will need to use
              IORING_REGISTER_USE_REGISTERED_RING when calling
              io_uring_register(2).  This flag only makes sense when used
              alongside IORING_SETUP_NO_MMAP, which also needs to be
              set.  Available since 6.5.

       IORING_SETUP_NO_SQARRAY
              If this flag is set, entries in the submission queue will
              be submitted in order, wrapping around to the first entry
              after reaching the end of the queue. In other words, there
              will be no more indirection via the array of submission
              entries, and the queue will be indexed directly by the
              submission queue tail and the range of indexes represented
              by it modulo queue size. Consequently, the user should not
              map the array of submission queue entries, and the
              corresponding offset in struct io_sqring_offsets will be
              set to zero. Available since 6.6.
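
       Several of the flags above are described as requiring another
       flag. A minimal sketch of checking those pairings before making
       the system call follows. It assumes the kernel UAPI header
       <linux/io_uring.h> is installed; flag_dependency_ok,
       setup_flags_consistent, and assert_setup_flags are hypothetical
       helper names, and the check covers only the pairings stated in
       the flag descriptions, not everything the kernel validates.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/io_uring.h>

/* True unless 'dependent' is set in flags while 'prerequisite' is not. */
static bool flag_dependency_ok(unsigned flags, unsigned dependent,
                               unsigned prerequisite)
{
    return !(flags & dependent) || (flags & prerequisite) != 0;
}

/* Check the pairings that the flag descriptions call out as mandatory. */
static bool setup_flags_consistent(unsigned flags)
{
    bool ok = flag_dependency_ok(flags, IORING_SETUP_SQ_AFF,
                                 IORING_SETUP_SQPOLL);

    /* Newer flags are guarded because older headers do not define them. */
#if defined(IORING_SETUP_DEFER_TASKRUN) && defined(IORING_SETUP_SINGLE_ISSUER)
    ok = ok && flag_dependency_ok(flags, IORING_SETUP_DEFER_TASKRUN,
                                  IORING_SETUP_SINGLE_ISSUER);
#endif
#if defined(IORING_SETUP_REGISTERED_FD_ONLY) && defined(IORING_SETUP_NO_MMAP)
    ok = ok && flag_dependency_ok(flags, IORING_SETUP_REGISTERED_FD_ONLY,
                                  IORING_SETUP_NO_MMAP);
#endif
#ifdef IORING_SETUP_HYBRID_IOPOLL
    ok = ok && flag_dependency_ok(flags, IORING_SETUP_HYBRID_IOPOLL,
                                  IORING_SETUP_IOPOLL);
#endif
    return ok;
}

/* Debug-build guard: abort early on an inconsistent flag mask. */
static void assert_setup_flags(unsigned flags)
{
    assert(setup_flags_consistent(flags));
}
```

       A check like this only saves a round trip to the kernel; the
       kernel remains the final authority on which combinations it
       accepts.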

       If no flags are specified, the io_uring instance is set up for
       interrupt driven I/O.  I/O may be submitted using
       io_uring_enter(2) and can be reaped by polling the completion
       queue.

       The resv array must be initialized to zero.
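
       Putting the pieces together, a minimal sketch of creating a
       ring through the raw system call might look like the following.
       It assumes <linux/io_uring.h> and a libc that exposes
       __NR_io_uring_setup through <sys/syscall.h>; ring_setup is a
       hypothetical wrapper name. Zeroing the whole structure first is
       what satisfies the requirement on resv.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/*
 * Thin wrapper around the raw system call. Returns the ring file
 * descriptor on success, or a negative errno value on failure.
 */
static int ring_setup(unsigned entries, unsigned flags,
                      struct io_uring_params *p)
{
    assert(p != NULL);

    memset(p, 0, sizeof(*p));   /* also zeroes resv[], as required */
    p->flags = flags;

    long fd = syscall(__NR_io_uring_setup, entries, p);
    return fd < 0 ? -errno : (int)fd;
}
```

       On success the kernel has filled in sq_entries, cq_entries,
       features, sq_off, and cq_off, which the caller then uses to map
       the rings. Applications using liburing would normally call
       io_uring_queue_init_params(3) instead of a wrapper like this.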

       features is filled in by the kernel and specifies the features
       supported by the running kernel. It is a bit mask of the
       following flags:

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the two SQ and CQ rings can be mapped
              with a single mmap(2) call. The SQEs must still be
              allocated separately. This brings the necessary mmap(2)
              calls down from three to two. Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports almost never
              dropping completion events.  A dropped event can only occur
              if the kernel runs out of memory, in which case you have
              worse problems than a lost event. Your application and
              others will likely get OOM killed anyway. If a completion
              event occurs and the CQ ring is full, the kernel stores the
              event internally until such a time that the CQ ring has
              room for more entries. In earlier kernels, if this overflow
              condition is entered, attempting to submit more I/O would
              fail with the -EBUSY error value if the kernel can't flush
              the overflowed events to the CQ ring. If this happens, the
              application must reap events from the CQ ring and attempt
              the submit again. If the kernel has no free memory to store
              the event internally, this will be visible as an increase
              in the overflow value on the cqring. Additionally,
              io_uring_enter(2) will return -EBADR the next time it would
              otherwise sleep waiting for completions (since kernel
              5.19).  Available since kernel 5.5.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any
              data for async offload has been consumed when the kernel
              has consumed the SQE. Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1
              with IORING_OP_{READV,WRITEV},
              IORING_OP_{READ,WRITE}_FIXED, and IORING_OP_{READ,WRITE} to
              mean current file position, which behaves like preadv2(2)
              and pwritev2(2) with offset == -1.  It'll use (and update)
              the current file position. This obviously comes with the
              caveat that if the application has multiple reads or writes
              in flight, then the end result will not be as expected.
              This is similar to threads sharing a file descriptor and
              doing IO using the current file position. Available since
              kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both
              sync and async execution of a request assumes the
              credentials of the task that called io_uring_enter(2) to
              queue the requests. If this flag isn't set, then requests
              are issued with the credentials of the task that originally
              registered the io_uring. If only one task is using a ring,
              then this flag doesn't matter as the credentials will
              always be the same. Note that this is the default
              behavior; tasks can still register different personalities
              through io_uring_register(2) with
              IORING_REGISTER_PERSONALITY and specify the personality to
              use in the sqe. Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an
              internal poll mechanism to drive data/space readiness. This
              means that requests that cannot read or write data to a
              file no longer need to be punted to an async thread for
              handling, instead they will begin operation when the file
              is ready. This is similar to doing poll + read/write in
              userspace, but eliminates the need to do so. If this flag
              is set, requests waiting on space/data consume a lot less
              resources doing so as they are not blocking a thread.
              Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If this flag is set, the IORING_OP_POLL_ADD command accepts
              the full 32-bit range of epoll based flags. Most notably
              EPOLLEXCLUSIVE which allows exclusive (waking single
              waiters) behavior. Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If this flag is set, the IORING_SETUP_SQPOLL feature no
              longer requires the use of fixed files. Any normal file
              descriptor can be used for IO commands without needing
              registration. Available since kernel 5.11.

       IORING_FEAT_EXT_ARG
              If this flag is set, then the io_uring_enter(2) system call
              supports passing in an extended argument instead of just
              the sigset_t of earlier kernels. This extended argument
              is of type struct io_uring_getevents_arg and allows the
              caller to pass in both a sigset_t and a timeout argument
              for waiting on events. The struct layout is as follows:

                  struct io_uring_getevents_arg {
                      __u64 sigmask;
                      __u32 sigmask_sz;
                      __u32 pad;
                      __u64 ts;
                  };

              and a pointer to this struct must be passed in if
              IORING_ENTER_EXT_ARG is set in the flags for the enter
              system call. Available since kernel 5.11.
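
              As an illustration of how the extended argument might be
              populated, the sketch below fills the structure with an
              optional signal mask and timeout. It assumes kernel
              headers new enough to declare struct
              io_uring_getevents_arg and struct __kernel_timespec, and
              that the kernel's signal set size is _NSIG / 8 bytes, as
              provided by glibc. fill_getevents_arg is a hypothetical
              helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/*
 * Populate the extended argument for io_uring_enter(2).
 * mask and timeout may each be NULL, meaning "not used".
 */
static void fill_getevents_arg(struct io_uring_getevents_arg *arg,
                               const sigset_t *mask,
                               const struct __kernel_timespec *timeout)
{
    assert(arg != NULL);

    memset(arg, 0, sizeof(*arg));   /* clears the padding field too */
    arg->sigmask = (uint64_t)(uintptr_t)mask;
    arg->sigmask_sz = _NSIG / 8;    /* signal set size in bytes */
    arg->ts = (uint64_t)(uintptr_t)timeout;
}
```

              The pointers are stored as 64-bit integers, so both the
              mask and the timeout must stay valid until the
              io_uring_enter(2) call that consumes the structure has
              returned.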

       IORING_FEAT_NATIVE_WORKERS
              If this flag is set, io_uring is using native workers for
              its async helpers.  Previous kernels used kernel threads
              that assumed the identity of the original io_uring owning
              task, but later kernels will actively create what looks
              more like regular process threads instead. Available since
              kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If this flag is set, then io_uring supports a variety of
              features related to fixed files and buffers. In particular,
              it indicates that registered buffers can be updated in-
              place, whereas before the full set would have to be
              unregistered first. Available since kernel 5.13.

       IORING_FEAT_CQE_SKIP
              If this flag is set, then io_uring supports setting
              IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating
              that no CQE should be generated for this SQE if it executes
              normally. If an error happens processing the SQE, a CQE
              with the appropriate error value will still be generated.
              Available since kernel 5.17.

       IORING_FEAT_LINKED_FILE
              If this flag is set, then io_uring supports sane assignment
              of files for SQEs that have dependencies. For example, if a
              chain of SQEs is submitted with IOSQE_IO_LINK, then
              kernels without this flag will prepare the file for each
              link upfront.  If a previous link opens a file with a known
              index, e.g. if direct descriptors are used with open or
              accept, then file assignment needs to happen post execution
              of that SQE. If this flag is set, then the kernel will
              defer file assignment until execution of a given request is
              started. Available since kernel 5.17.

       IORING_FEAT_REG_REG_RING
              If this flag is set, then io_uring supports calling
              io_uring_register(2) using a registered ring fd, via
              IORING_REGISTER_USE_REGISTERED_RING.  Available since
              kernel 6.3.

       IORING_FEAT_MIN_TIMEOUT
              If this flag is set, then io_uring supports passing in a
              minimum batch wait timeout. See
              io_uring_submit_and_wait_min_timeout(3) for more details.

       IORING_FEAT_RECVSEND_BUNDLE
              If this flag is set, then io_uring supports bundled send
              and recv operations.  See io_uring_prep_send_bundle(3) for
              more information. Also implies support for provided buffers
              in send operations.

       The rest of the fields in the struct io_uring_params are filled in
       by the kernel, and provide the information necessary to memory map
       the submission queue, completion queue, and the array of
       submission queue entries.  sq_entries specifies the number of
       submission queue entries allocated.  sq_off describes the offsets
       of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv1;
               __u64 user_addr;
           };

       Taken together, sq_entries and sq_off provide all of the
       information necessary for accessing the submission queue ring
       buffer and the submission queue entry array.  The submission queue
       can be mapped with a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets structure, and ring_fd is
       the file descriptor returned from io_uring_setup(2).  The addition
       of sq_off.array to the length of the region accounts for the fact
       that the ring is located at the end of the data structure.  As an
       example, the ring buffer head pointer can be accessed by adding
       sq_off.head to the address returned from mmap(2):

           head = ptr + sq_off.head;

       The flags field is used by the kernel to communicate state
       information to the application.  Currently, it is used to inform
       the application when a call to io_uring_enter(2) is necessary.
       See the documentation for the IORING_SETUP_SQPOLL flag above.  The
       dropped member is incremented for each invalid submission queue
       entry encountered in the ring buffer.

       The head and tail track the ring buffer state.  The tail is
       incremented by the application when submitting new I/O, and the
       head is incremented by the kernel when the I/O has been
       successfully submitted.  Determining the index of the head or tail
       into the ring is accomplished by applying a mask:

           index = tail & ring_mask;
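
       The masking step works because head and tail are free-running
       counters rather than slot numbers. The sketch below assumes
       that ring_mask equals ring_entries - 1 with ring_entries a
       power of two, which is an assumption not spelled out above;
       ring_index and ring_pending are hypothetical helper names.

```c
#include <assert.h>

/* Slot in the ring for a free-running head or tail counter. */
static unsigned ring_index(unsigned counter, unsigned ring_mask)
{
    /* ring_mask + 1 must be a power of two for masking to be valid. */
    assert(((ring_mask + 1) & ring_mask) == 0);
    return counter & ring_mask;
}

/* Entries the producer has published that the consumer has not taken. */
static unsigned ring_pending(unsigned head, unsigned tail)
{
    return tail - head;   /* unsigned wraparound keeps this correct */
}
```

       With a ring of 8 entries (mask 7), a tail value of 9 refers to
       slot 1, and the pending count stays correct even after the
       counters wrap past the largest unsigned value.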

       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown
       here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv1;
               __u64 user_addr;
           };

       The completion queue is simpler, since the entries are not
       separated from the queue itself, and can be mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);
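
       The lengths passed to the mmap(2) calls above can be computed
       from the values the kernel returns. The following sketch
       assumes <linux/io_uring.h> and default-sized entries (it does
       not account for IORING_SETUP_SQE128 or IORING_SETUP_CQE32); the
       helper names are hypothetical. When IORING_FEAT_SINGLE_MMAP is
       reported, it assumes one mapping at IORING_OFF_SQ_RING sized
       for the larger of the two rings covers both, which is how
       liburing is understood to handle it.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Length of the SQ ring mapping (offset IORING_OFF_SQ_RING). */
static size_t sq_ring_bytes(const struct io_uring_params *p)
{
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Length of the CQ ring mapping (offset IORING_OFF_CQ_RING). */
static size_t cq_ring_bytes(const struct io_uring_params *p)
{
    return p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
}

/* Length of the SQE array mapping (offset IORING_OFF_SQES). */
static size_t sqes_bytes(const struct io_uring_params *p)
{
    return p->sq_entries * sizeof(struct io_uring_sqe);
}

/* Length of the shared mapping when IORING_FEAT_SINGLE_MMAP is set. */
static size_t single_mmap_bytes(const struct io_uring_params *p)
{
    assert(p->features & IORING_FEAT_SINGLE_MMAP);
    size_t sq = sq_ring_bytes(p);
    size_t cq = cq_ring_bytes(p);
    return sq > cq ? sq : cq;
}
```

       The SQE array always needs its own mapping, so the minimum is
       two mmap(2) calls with the feature flag and three without it.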

       Closing the file descriptor returned by io_uring_setup(2) will
       free all resources associated with the io_uring context. Note that
       this may happen asynchronously within the kernel, so it is not
       guaranteed that resources are freed immediately.

RETURN VALUE

       io_uring_setup(2) returns a new file descriptor on success.  The
       application may then provide the file descriptor in a subsequent
       mmap(2) call to map the submission and completion queues, or to
       the io_uring_register(2) or io_uring_enter(2) system calls.

       On error, a negative error code is returned. The caller should not
       rely on the errno variable.
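
       Since the error is carried in the return value, a caller can
       decode it by negating the value. A minimal sketch with the
       hypothetical helper names setup_error_string and
       check_setup_result:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Human-readable text for a negative return value. */
static const char *setup_error_string(int ret)
{
    assert(ret < 0);
    return strerror(-ret);
}

/* Returns 0 if ret is a usable descriptor, otherwise logs and returns -1. */
static int check_setup_result(int ret)
{
    if (ret >= 0)
        return 0;
    fprintf(stderr, "io_uring_setup: %s\n", setup_error_string(ret));
    return -1;
}
```

       A caller would pass the value returned by io_uring_setup(2)
       straight to check_setup_result() and only proceed to mmap(2)
       when it returns 0.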

ERRORS

       EFAULT params is outside your accessible address space.

       EINVAL The resv array contains non-zero data; p.flags contains an
              unsupported flag; entries is out of bounds;
              IORING_SETUP_SQ_AFF was specified, but IORING_SETUP_SQPOLL
              was not; IORING_SETUP_CQSIZE was specified, but
              io_uring_params.cq_entries was invalid; or
              IORING_SETUP_REGISTERED_FD_ONLY was specified, but
              IORING_SETUP_NO_MMAP was not.

       EMFILE The per-process limit on the number of open file
              descriptors has been reached (see the description of
              RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has
              been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user
              ID of the caller did not have sufficient privileges.

       EPERM  /proc/sys/kernel/io_uring_disabled has the value 2, or it
              has the value 1 and the calling process does not hold the
              CAP_SYS_ADMIN capability or is not a member of
              /proc/sys/kernel/io_uring_group.

       ENXIO  IORING_SETUP_ATTACH_WQ was set, but params.wq_fd did not
              refer to an io_uring instance or refers to an instance that
              is in the process of shutting down.

COLOPHON

       This page is part of the liburing (A library for io_uring)
       project.  Information about the project can be found at 
       ⟨https://github.com/axboe/liburing⟩.  If you have a bug report for
       this manual page, send it to io-uring@vger.kernel.org.  This page
       was obtained from the project's upstream Git repository
       ⟨https://github.com/axboe/liburing⟩ on 2025-02-02.  (At that time,
       the date of the most recent commit that was found in the
       repository was 2025-01-22.)  If you discover any rendering
       problems in this HTML version of the page, or you believe there is
       a better or more up-to-date source for the page, or you have
       corrections or improvements to the information in this COLOPHON
       (which is not part of the original manual page), send a mail to
       man-pages@man7.org

Linux                           2019-01-29              io_uring_setup(2)

