CSE791 IoT: Notes for Apr. 10, 2020
08 Apr 2020

Character devices
scull drivers
“scull” is short for “Simple Character Utility for Loading Localities”. It is a collection of sample character drivers distributed with the Linux Device Drivers book. All of scull’s data is stored in main memory. The following types of scull device exist:
- scull0 to scull3: samples of a global and persistent device.
- Global: the device is shared by all file descriptors that open it.
- Persistent: when the device is closed and reopened, the data is not lost.
- scullpipe0 to scullpipe3: samples of FIFO (first-in-first-out) devices that act like pipes. These involve blocking I/O, which is much more difficult to write.
- scullsingle: allows only one process at a time to use the driver.
- scullpriv: private to each virtual console.
- sculluid: can be opened multiple times, but only by one user at a time. When a different user tries to open the busy device, it returns “Device Busy”.
- scullwuid: same as sculluid, but instead of failing a conflicting open, it implements a “blocking open” (the caller waits until the device becomes available).
These scull drivers are essentially a tutorial showing how to write drivers for different types of devices.
Major & Minor Numbers
When we list the device nodes in the /dev directory, the size field shows two numbers separated by a comma. These are unique identifiers for each device: the first is called the major number, the second is called the minor number.
(base) user@machine:/dev$ ll
total 4
drwxr-xr-x 22 root root     5220 Apr  8 17:41 ./
drwxr-xr-x 26 root root     4096 Apr  8 17:41 ../
crw-rw-rw-  1 root root  10,  56 Apr  8 17:41 ashmem
crw-r--r--  1 root root  10, 235 Apr  8 17:41 autofs
crw-rw-rw-  1 root root 511,   0 Apr  8 17:41 binder
Major number
- The major number identifies the driver associated with the device.
- Modern Linux allows multiple drivers to share a major number.
- Many major numbers are reserved for specific devices.
- Linux uses 12 bits for major numbers.
Minor number
- The minor number determines exactly which device is being referred to.
- You can treat it as an index into a local array of devices.
- Linux uses 20 bits for minor numbers.
Major/minor number in kernel programming
// In Linux, dev_t is a 32-bit integer, where
// 12 bits are used for major number and 20 bits
// are used for minor number
// getting major/minor numbers from dev_t
MAJOR(dev_t dev);
MINOR(dev_t dev);
// combining major and minor numbers into dev_t
MKDEV(int major, int minor);
Allocating device numbers
Static allocation
In static allocation, you specify the major number.
int register_chrdev_region (
dev_t from,
unsigned count,
const char * name);
- from: the first in the desired range of device numbers; must include the major number.
- count: the number of devices to be allocated.
- name: the name of the device.
Dynamic allocation
In dynamic allocation, the major number will be dynamically allocated.
int alloc_chrdev_region (
dev_t * dev,
unsigned baseminor,
unsigned count,
const char * name);
- dev: returns the allocated major/minor numbers (output only!).
- baseminor: base index of the minor numbers.
- count: the number of devices to be allocated.
- name: the name of the device.
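Putting this together, a minimal module-init sketch (ours, not the actual scull source) might dynamically allocate four minor numbers and release them on unload:

```c
#include <linux/fs.h>
#include <linux/module.h>

static dev_t scull_devno;   /* receives the allocated major/minor numbers */

static int __init scull_sketch_init(void)
{
        /* minors 0..3; "scull" is the name that shows up in /proc/devices */
        int err = alloc_chrdev_region(&scull_devno, 0, 4, "scull");
        if (err < 0)
                return err;
        pr_info("scull: got major %d\n", MAJOR(scull_devno));
        return 0;
}

static void __exit scull_sketch_exit(void)
{
        unregister_chrdev_region(scull_devno, 4);
}

module_init(scull_sketch_init);
module_exit(scull_sketch_exit);
MODULE_LICENSE("GPL");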
Registering device nodes
After calling alloc_chrdev_region, the device name and major number will appear in /proc/devices.
(base) user@machine:/proc$ cat devices
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  5 ttyprintk
  6 lp
  7 vcs
 10 misc
 13 input
 21 sg
 29 fb
 81 video4linux

Block devices:
  7 loop
  8 sd
  9 md
 11 sr
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
Now, we need to register a file system node for the device. This can be done with a device loading script, which usually looks like this:
#!/bin/sh
module="scull"
device="scull"
mode="664"
# invoke insmod with all arguments we got
# and use a pathname, as newer modutils don't look in . by default
/sbin/insmod ./$module.ko $* || exit 1
# remove stale nodes
rm -f /dev/${device}[0-3]
major=$(awk "\$2==\"$module\" {print \$1}" /proc/devices)
mknod /dev/${device}0 c $major 0
mknod /dev/${device}1 c $major 1
mknod /dev/${device}2 c $major 2
mknod /dev/${device}3 c $major 3
# give appropriate group/permissions, and change the group.
# Not all distributions have staff, some have "wheel" instead.
group="staff"
grep -q '^staff:' /etc/group || group="wheel"
chgrp $group /dev/${device}[0-3]
chmod $mode /dev/${device}[0-3]
- The script uses awk to search /proc/devices for the device's major number.
- mknod is used to associate file system nodes with a device number.
- Since this script must be run with root privileges, it is important to change the permissions of the device nodes at the end of the script.
Miscellaneous device
Linux provides a template for creating simple character devices, called “miscellaneous devices”. The TUN/TAP driver is both a network interface and a misc device. All misc devices share major number 10.
static struct miscdevice tun_miscdev = {
.minor = TUN_MINOR,
.name = "tun",
.nodename = "net/tun",
.fops = &tun_fops,
};
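Registering a misc device is then a one-liner; the misc core supplies the major number and creates the device node. A hedged sketch of the init/exit path (function names here are illustrative, not copied from tun.c):

```c
static int __init tun_sketch_init(void)
{
        /* misc core assigns major 10 and creates /dev/net/tun */
        return misc_register(&tun_miscdev);
}

static void __exit tun_sketch_exit(void)
{
        misc_deregister(&tun_miscdev);
}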
Important data structures
File operations
The file_operations structure registers methods for device operations. These are some important members of this structure:

- struct module *owner
  A pointer to the module that “owns” the structure. This field is used to prevent the module from being unloaded while its operations are in use. Almost all the time, it is simply initialized to THIS_MODULE, a macro defined in <linux/module.h>.
- loff_t (*llseek) (struct file *, loff_t, int);
  The llseek method is used to change the current read/write position in a file, and the new position is returned as a (positive) return value. The loff_t parameter is a “long offset” and is at least 64 bits wide even on 32-bit platforms. Errors are signaled by a negative return value. If this function pointer is NULL, seek calls will modify the position counter in the file structure in potentially unpredictable ways.
- ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
  Used to retrieve data from the device. A null pointer in this position causes the read system call to fail with -EINVAL (“Invalid argument”). A nonnegative return value represents the number of bytes successfully read (the return value is a “signed size” type, usually the native integer type for the target platform).
- ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
  Sends data to the device. If NULL, -EINVAL is returned to the program calling the write system call. The return value, if nonnegative, represents the number of bytes successfully written.
- unsigned int (*poll) (struct file *, struct poll_table_struct *);
  The poll method is the back end of three system calls: poll, epoll, and select, all of which are used to query whether a read or write to one or more file descriptors would block. The poll method should return a bit mask indicating whether nonblocking reads or writes are possible, and, possibly, provide the kernel with information that can be used to put the calling process to sleep until I/O becomes possible. If a driver leaves its poll method NULL, the device is assumed to be both readable and writable without blocking.
- int (*open) (struct inode *, struct file *);
  Though this is always the first operation performed on the device file, the driver is not required to declare a corresponding method. If this entry is NULL, opening the device always succeeds, but your driver isn’t notified.
- int (*release) (struct inode *, struct file *);
  This operation is invoked when the file structure is being released. Like open, release can be NULL. release isn’t invoked every time a process calls close. Whenever a file structure is shared (for example, after a fork or a dup), release won’t be invoked until all copies are closed.
- int (*fsync) (struct file *, struct dentry *, int);
  This method is the back end of the fsync system call, which a user calls to flush any pending data. If this pointer is NULL, the system call returns -EINVAL.
- int (*flush) (struct file *);
  The flush operation is invoked when a process closes its copy of a file descriptor for a device; it should execute (and wait for) any outstanding operations on the device. This must not be confused with the fsync operation requested by user programs. Currently, flush is used in very few drivers; the SCSI tape driver uses it, for example, to ensure that all data written makes it to the tape before the device is closed. If flush is NULL, the kernel simply ignores the user application request.
The iov_iter interface
In kernel programming, one often needs to process buffers supplied by user space. Writing this logic by hand makes it easy to get something wrong, so the kernel provides the iov_iter interface to simplify the process.
An iov_iter is essentially an iterator over an array of iovec structures, each defined as:
struct iovec
{
void __user *iov_base;
__kernel_size_t iov_len;
};
This structure matches the user-space iovec
structure used by system calls like readv
and writev
.
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
The readv()
system call reads iovcnt
buffers from the file associated
with the file descriptor fd
into the buffers described by iov
(“scatter input”).
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
The writev()
system call writes iovcnt
buffers of data described by
iov
to the file associated with the file descriptor fd
(“gather
output”).
The definition of iov_iter
:
struct iov_iter {
int type;
size_t iov_offset;
size_t count;
const struct iovec *iov;
unsigned long nr_segs;
};
- type: properties of this iterator, stored as a bit mask (e.g. READ/WRITE).
- iov_offset: offset of the cursor within the first iovec pointed to by iov.
- count: total amount of data remaining in the iterator.
- iov: an array of iovec structures (the buffers).
- nr_segs: number of iovec segments in iov.
To transfer data between kernel space and user space, one can call
size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
Here addr is an address in kernel space, bytes is the size of the kernel buffer, and i is the iterator describing the user-space buffers.
The file_operations structure also provides read_iter and write_iter operations, which let a device exchange data with user-space buffers described by iovec arrays. This is how TUN ships packets between kernel space and user space.
TUN’s file_operations
structure:
static const struct file_operations tun_fops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
.read_iter = tun_chr_read_iter,
.write_iter = tun_chr_write_iter,
.poll = tun_chr_poll,
.unlocked_ioctl = tun_chr_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = tun_chr_compat_ioctl,
#endif
.open = tun_chr_open,
.release = tun_chr_close,
.fasync = tun_chr_fasync,
#ifdef CONFIG_PROC_FS
.show_fdinfo = tun_chr_show_fdinfo,
#endif
};
TUN’s read implementation
static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
struct iov_iter *to,
int noblock, void *ptr)
{
ssize_t ret;
int err;
if (!iov_iter_count(to)) {
tun_ptr_free(ptr);
return 0;
}
if (!ptr) {
/* Read frames from ring */
ptr = tun_ring_recv(tfile, noblock, &err);
if (!ptr)
return err;
}
if (tun_is_xdp_frame(ptr)) {
struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
ret = tun_put_user_xdp(tun, tfile, xdpf, to);
xdp_return_frame(xdpf);
} else {
struct sk_buff *skb = ptr;
ret = tun_put_user(tun, tfile, skb, to);
if (unlikely(ret < 0))
kfree_skb(skb);
else
consume_skb(skb);
}
return ret;
}
static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct tun_file *tfile = file->private_data;
struct tun_struct *tun = tun_get(tfile);
ssize_t len = iov_iter_count(to), ret;
if (!tun)
return -EBADFD;
ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
iocb->ki_pos = ret;
tun_put(tun);
return ret;
}
TUN’s write implementation
/* Get packet from user space buffer */
static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
int noblock, bool more)
{
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
size_t total_len = iov_iter_count(from);
size_t len = total_len, align = tun->align, linear;
struct virtio_net_hdr gso = { 0 };
struct tun_pcpu_stats *stats;
int good_linear;
int copylen;
bool zerocopy = false;
int err;
u32 rxhash = 0;
int skb_xdp = 1;
bool frags = tun_napi_frags_enabled(tfile);
if (!(tun->flags & IFF_NO_PI)) {
if (len < sizeof(pi))
return -EINVAL;
len -= sizeof(pi);
if (!copy_from_iter_full(&pi, sizeof(pi), from))
return -EFAULT;
}
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
if (len < vnet_hdr_sz)
return -EINVAL;
len -= vnet_hdr_sz;
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
if ((gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);
if (tun16_to_cpu(tun, gso.hdr_len) > len)
return -EINVAL;
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
if (unlikely(len < ETH_HLEN ||
(gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
return -EINVAL;
}
good_linear = SKB_MAX_HEAD(align);
if (msg_control) {
struct iov_iter i = *from;
/* There are 256 bytes to be copied in skb, so there is
* enough room for skb expand head in case it is used.
* The rest of the buffer is mapped from userspace.
*/
copylen = gso.hdr_len ? tun16_to_cpu(tun, gso.hdr_len) : GOODCOPY_LEN;
if (copylen > good_linear)
copylen = good_linear;
linear = copylen;
iov_iter_advance(&i, copylen);
if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
zerocopy = true;
}
if (!frags && tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
/* For the packet that is not easy to be processed
* (e.g gso or jumbo packet), we will do it at after
* skb was created with generic XDP routine.
*/
skb = tun_build_skb(tun, tfile, from, &gso, len, &skb_xdp);
if (IS_ERR(skb)) {
this_cpu_inc(tun->pcpu_stats->rx_dropped);
return PTR_ERR(skb);
}
if (!skb)
return total_len;
} else {
if (!zerocopy) {
copylen = len;
if (tun16_to_cpu(tun, gso.hdr_len) > good_linear)
linear = good_linear;
else
linear = tun16_to_cpu(tun, gso.hdr_len);
}
if (frags) {
mutex_lock(&tfile->napi_mutex);
skb = tun_napi_alloc_frags(tfile, copylen, from);
/* tun_napi_alloc_frags() enforces a layout for the skb.
* If zerocopy is enabled, then this layout will be
* overwritten by zerocopy_sg_from_iter().
*/
zerocopy = false;
} else {
skb = tun_alloc_skb(tfile, align, copylen, linear,
noblock);
}
if (IS_ERR(skb)) {
if (PTR_ERR(skb) != -EAGAIN)
this_cpu_inc(tun->pcpu_stats->rx_dropped);
if (frags)
mutex_unlock(&tfile->napi_mutex);
return PTR_ERR(skb);
}
if (zerocopy)
err = zerocopy_sg_from_iter(skb, from);
else
err = skb_copy_datagram_from_iter(skb, 0, from, len);
if (err) {
err = -EFAULT;
drop:
this_cpu_inc(tun->pcpu_stats->rx_dropped);
kfree_skb(skb);
if (frags) {
tfile->napi.skb = NULL;
mutex_unlock(&tfile->napi_mutex);
}
return err;
}
}
if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
kfree_skb(skb);
if (frags) {
tfile->napi.skb = NULL;
mutex_unlock(&tfile->napi_mutex);
}
return -EINVAL;
}
switch (tun->flags & TUN_TYPE_MASK) {
case IFF_TUN:
if (tun->flags & IFF_NO_PI) {
u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
switch (ip_version) {
case 4:
pi.proto = htons(ETH_P_IP);
break;
case 6:
pi.proto = htons(ETH_P_IPV6);
break;
default:
this_cpu_inc(tun->pcpu_stats->rx_dropped);
kfree_skb(skb);
return -EINVAL;
}
}
skb_reset_mac_header(skb);
skb->protocol = pi.proto;
skb->dev = tun->dev;
break;
case IFF_TAP:
if (!frags)
skb->protocol = eth_type_trans(skb, tun->dev);
break;
}
/* copy skb_ubuf_info for callback when skb has no error */
if (zerocopy) {
skb_shinfo(skb)->destructor_arg = msg_control;
skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
} else if (msg_control) {
struct ubuf_info *uarg = msg_control;
uarg->callback(uarg, false);
}
skb_reset_network_header(skb);
skb_probe_transport_header(skb);
if (skb_xdp) {
struct bpf_prog *xdp_prog;
int ret;
local_bh_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
if (xdp_prog) {
ret = do_xdp_generic(xdp_prog, skb);
if (ret != XDP_PASS) {
rcu_read_unlock();
local_bh_enable();
if (frags) {
tfile->napi.skb = NULL;
mutex_unlock(&tfile->napi_mutex);
}
return total_len;
}
}
rcu_read_unlock();
local_bh_enable();
}
/* Compute the costly rx hash only if needed for flow updates.
* We may get a very small possibility of OOO during switching, not
* worth to optimize.
*/
if (!rcu_access_pointer(tun->steering_prog) && tun->numqueues > 1 &&
!tfile->detached)
rxhash = __skb_get_hash_symmetric(skb);
rcu_read_lock();
if (unlikely(!(tun->dev->flags & IFF_UP))) {
err = -EIO;
rcu_read_unlock();
goto drop;
}
if (frags) {
/* Exercise flow dissector code path. */
u32 headlen = eth_get_headlen(tun->dev, skb->data,
skb_headlen(skb));
if (unlikely(headlen > skb_headlen(skb))) {
this_cpu_inc(tun->pcpu_stats->rx_dropped);
napi_free_frags(&tfile->napi);
rcu_read_unlock();
mutex_unlock(&tfile->napi_mutex);
WARN_ON(1);
return -ENOMEM;
}
local_bh_disable();
napi_gro_frags(&tfile->napi);
local_bh_enable();
mutex_unlock(&tfile->napi_mutex);
} else if (tfile->napi_enabled) {
struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
int queue_len;
spin_lock_bh(&queue->lock);
__skb_queue_tail(queue, skb);
queue_len = skb_queue_len(queue);
spin_unlock(&queue->lock);
if (!more || queue_len > NAPI_POLL_WEIGHT)
napi_schedule(&tfile->napi);
local_bh_enable();
} else if (!IS_ENABLED(CONFIG_4KSTACKS)) {
tun_rx_batched(tun, tfile, skb, more);
} else {
netif_rx_ni(skb);
}
rcu_read_unlock();
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
u64_stats_inc(&stats->rx_packets);
u64_stats_add(&stats->rx_bytes, len);
u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
if (rxhash)
tun_flow_update(tun, rxhash, tfile);
return total_len;
}
static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct tun_file *tfile = file->private_data;
struct tun_struct *tun = tun_get(tfile);
ssize_t result;
if (!tun)
return -EBADFD;
result = tun_get_user(tun, tfile, NULL, from,
file->f_flags & O_NONBLOCK, false);
tun_put(tun);
return result;
}
Resources
- XDP: Høiland-Jørgensen, Toke, et al. “The express data path: Fast programmable packet processing in the operating system kernel.” Proceedings of the 14th International Conference on Emerging Networking EXperiments and Technologies. 2018.