
Linux 2.6.27

The Linux 2.6.27 kernel was released on 9 October 2008.

Note: The 2008 Linux Kernel Summit was held on September 15 and 16 in Portland, Oregon, immediately prior to the Linux Plumbers Conference. LWN, as always, has excellent coverage of the event. All the papers of the conferences can be downloaded in two PDF files. LWN also has coverage of the Linux Plumbers Conference.

Summary: 2.6.27 adds a new filesystem (UBIFS) optimized for "pure" flash-based storage devices, a lockless page cache, much improved Direct I/O scalability and performance, delayed allocation for ext4, multiqueue networking, an alternative hibernation implementation based on kexec/kdump, data integrity support in the block layer for devices that support it, a simple tracer called ftrace, an mmio tracer, sysprof support, extraction of all the in-kernel firmware to /lib/firmware, Xen support for saving/restoring VMs, improved video camera support, support for the Intel wireless 5000 series and RTL8187B network cards, a new ath9k driver for the Atheros AR5008 and AR9001 families of chipsets, more new drivers, improved support for others, and many other improvements and fixes.

1. Prominent features (the cool stuff)

1.1. Lockless page cache and get_user_pages()

Recommended LWN articles: "Toward better direct I/O scalability", "The lockless page cache"

The page cache is where the kernel keeps copies of file data in RAM, improving performance by avoiding disk I/O when the data to be read is already in memory. Each "mapping", the data structure that tracks the correspondence between a file and the page cache, is SMP-safe thanks to its own lock. When processes on different CPUs access different files there is no lock contention, but when they access the same file (shared libraries or shared data files, for example) they can contend on that lock. In 2.6.27, thanks to some rules on how the page cache can be used and to the use of RCU, lookups (i.e., "reads" of the page cache) no longer need to take the mapping lock, which improves scalability. The difference is only noticeable on systems with many CPUs (a page fault speedup of 250x has been measured on a 64-way system).

Code: (commit 1, 2, 3)
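
The lookup side of this idea can be illustrated in userspace. The following is a minimal sketch, not the kernel's code: it assumes C11 atomics, the toy_* names are hypothetical, and a single slot stands in for the per-mapping tree. A reader loads the slot without taking any lock, takes a reference only if the count is still non-zero, and then rechecks that the slot still points at the same page. The part played by RCU (delaying the freeing of pages while readers may still be looking at them) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: just a reference count. */
struct toy_page {
    atomic_int refcount;   /* 0 means "being freed, do not touch" */
};

/* One slot of a toy "mapping". */
struct toy_slot {
    _Atomic(struct toy_page *) page;
};

/* Take a reference only if the count is still non-zero. */
static bool toy_get_unless_zero(struct toy_page *p)
{
    int old = atomic_load(&p->refcount);
    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken. */
static void toy_put(struct toy_page *p)
{
    int old = atomic_fetch_sub(&p->refcount, 1);
    assert(old > 0);
}

/* Lockless lookup: returns a referenced page, or NULL if the slot is empty. */
static struct toy_page *toy_lookup(struct toy_slot *slot)
{
    for (;;) {
        struct toy_page *p = atomic_load(&slot->page);
        if (p == NULL)
            return NULL;
        if (!toy_get_unless_zero(p))
            continue;        /* page is going away: look again */
        if (p == atomic_load(&slot->page))
            return p;        /* slot unchanged: the reference is valid */
        toy_put(p);          /* slot changed under us: drop and retry */
    }
}
```

With a page whose count is 1 installed in a slot, toy_lookup() returns it with the count raised to 2, and the caller later drops that reference with toy_put(). The retry loop relies on whoever removes a page clearing the slot as well, which is one of the "rules" a real implementation has to enforce.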

Lockless get_user_pages(): get_user_pages() is a function used in direct I/O operations to pin the userspace memory that is going to be transferred. It is a complex function that requires holding both the mmap_sem semaphore in the process's mm_struct and the page table lock. This is a scalability problem when several processes call get_user_pages() in the same address space (for example, databases that do Direct I/O), because they contend on those locks. In 2.6.27, a new get_user_pages_fast() function has been introduced. It does the same work as get_user_pages(), but it is simplified to speed up the most common workloads that exercise those paths within the same address space, and in those cases it can avoid taking the mmap_sem semaphore and the page table locks. Benchmarks showed a 10% speedup running an OLTP workload with an IBM DB2 database on a quad-core system.

Code: (commit 1, 2, 3, 4, 5, 6)

1.2. Ext4: Delayed Allocation

In this release, Ext4 is adding one of its most important planned features: Delayed allocation (also called "Allocate-on-flush"). It doesn't change the disk format in any way, but it improves the performance in a wide range of workloads.

When an application write()s data to the disk, the data is usually not written immediately to the disk but is instead cached in RAM for a while. Without delayed allocation, the filesystem allocates the necessary disk structures for that data immediately, even though the data itself is not written right away. Delayed allocation consists of not allocating space for that cached data: only the free space counter is updated when write() is called. On-disk blocks and structures are allocated only when the cached data is finally written to the disk, not when a process writes something. This approach (used by filesystems such as XFS, btrfs, ZFS and Reiser 4) noticeably improves the performance of many workloads. It also results in better block allocation decisions, because when allocation decisions are made at write() time, the block allocator cannot know whether any other write()s are going to follow.

Code: (commit 1, 2, 3, 4, 5)
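
The effect on block placement can be seen in a toy model. This sketch is an illustration only and does not reflect how ext4 is implemented: the "next free block" allocator and all the toy_* names are assumptions made for the example. It compares picking a disk block at write() time with merely updating the free space counter at write() time and picking blocks at flush time.

```c
#include <assert.h>

#define TOY_MAX_BLOCKS 64

/* Hypothetical toy file: which disk block backs each logical block. */
struct toy_file {
    int disk_block[TOY_MAX_BLOCKS]; /* physical block per logical block */
    int nblocks;                    /* blocks already allocated on "disk" */
    int pending;                    /* blocks cached in RAM, not yet allocated */
};

/* Toy disk: hands out the next free block number. */
struct toy_disk {
    int next_free;
    int free_count;                 /* the free space counter */
};

/* write() without delayed allocation: pick a disk block immediately. */
static void toy_write_immediate(struct toy_disk *d, struct toy_file *f)
{
    assert(f->nblocks < TOY_MAX_BLOCKS && d->free_count > 0);
    f->disk_block[f->nblocks++] = d->next_free++;
    d->free_count--;
}

/* write() with delayed allocation: only update the free space counter. */
static void toy_write_delayed(struct toy_disk *d, struct toy_file *f)
{
    assert(f->nblocks + f->pending < TOY_MAX_BLOCKS && d->free_count > 0);
    f->pending++;
    d->free_count--;
}

/* Writeback: allocate every pending block of the file in one go. */
static void toy_flush(struct toy_disk *d, struct toy_file *f)
{
    while (f->pending > 0) {
        f->disk_block[f->nblocks++] = d->next_free++;
        f->pending--;
    }
}

/* Count runs of physically contiguous blocks (fewer is better). */
static int toy_count_extents(const struct toy_file *f)
{
    int extents = 0;
    for (int i = 0; i < f->nblocks; i++)
        if (i == 0 || f->disk_block[i] != f->disk_block[i - 1] + 1)
            extents++;
    return extents;
}
```

If two files are written in alternation, four blocks each, immediate allocation leaves each file in four separate extents in this model, while delayed allocation followed by a flush of each file leaves each of them in a single extent.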

There is also a new implementation of the default data=ordered journaling mode based on inodes, not on jbd buffer heads. Code: (commit 1, 2, 3, 4)

1.3. Kexec jump: kexec/kdump based hibernation

Recommended LWN article: "Yet another approach to software suspend"

Kexec is a Linux feature that loads a kernel into memory and executes it, making it possible to switch to a new kernel without a full reboot. This infrastructure was used to implement kdump, a kernel crash dump system: a "safe kernel" is loaded into memory as soon as the system starts, and if the running kernel crashes, the oops code kexecs into the "safe kernel", which can dump the memory that it is not using to the disk or somewhere else.

This infrastructure has been enhanced in 2.6.27 so that it can be used as a hibernation implementation: instead of kexec'ing a safe kernel to dump the system memory, a system can kexec to a kernel that dumps all the memory to the disk and then shuts the system down. When the system boots, the initrd can load the dumped system and restore it.

This hibernation implementation does not replace the existing hibernation implementations; it's just an alternative. It has some advantages, like not depending on ACPI. For now it only works on x86-32.

Code: http://lwn.net/Articles/242107/ (commit). (commit)

1.4. UBIFS and OMFS

Recommended LWN articles: "UBIFS", "OMFS"

UBIFS is a new filesystem designed to work with flash devices, developed by Nokia with the help of the University of Szeged. It's important to understand that UBIFS is very different from any traditional filesystem: UBIFS does not work with block-based devices, but with pure flash-based devices, handled by the MTD subsystem in Linux. Hence, UBIFS does not work with what many people consider flash devices, like flash-based hard drives, SD cards, USB sticks, etc., because those devices use a block device emulation layer called FTL (Flash Translation Layer) that makes them look like traditional block-based storage devices to the outside world. UBIFS is instead designed to work with flash devices that do not have a block device emulation layer, that are handled by the MTD subsystem, and that present themselves to userspace as MTD devices.

UBIFS works on top of UBI volumes. UBI is an LVM-like layer, included in Linux 2.6.22, which itself works on top of MTD devices. UBIFS offers various advantages over JFFS2: faster and scalable mount times (unlike JFFS2, UBIFS does not have to scan the whole media when mounting), tolerance to unclean reboots (UBIFS is a journaling filesystem), write-back (which dramatically improves performance), and support for on-the-fly compression.

Documentation: UBIFS FAQ, more documentation

Code: (commit), (commit), (commit)

OMFS stands for "Sonicblue Optimized MPEG File System support". It is the proprietary file system used by the Rio Karma music player and ReplayTV DVR. Despite the name, this filesystem is not more efficient than a standard FS for MPEG files, in fact likely the opposite is true. Code: (commit 1, 2, 3, 5, 6, 7, 8)

1.5. Block layer data integrity support

Recommended LWN article: "Block layer: integrity checking and lots of partitions"

Modern filesystems feature checksumming of data and metadata to protect against data corruption. However, the corruption is detected at read time, which could potentially be months after the data was written. At that point the original data that the application tried to write is most likely lost (if there's no data redundancy). The solution is to ensure that the disk is actually storing what the application meant it to store. Recent additions to both the SCSI family protocols (SBC Data Integrity Field, SCC protection proposal) and SATA/T13 (External Path Protection) try to remedy this by adding support for appending integrity metadata to an I/O. The integrity metadata includes a checksum for each sector as well as an incrementing counter that ensures the individual sectors are written in the right order. Some protection schemes also ensure that the I/O is written to the right place on disk.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)
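
The per-sector check described above can be sketched in a few lines. This is a simplified model and not the kernel's block layer interface: it assumes 512-byte sectors and a tuple that holds only a checksum and a counter, and the toy_* names are hypothetical. The checksum is a plain bitwise CRC-16 with polynomial 0x8BB7, the polynomial used for the guard tag of the SCSI Data Integrity Field; the real protocols define their own tuple layouts with further fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512

/* Toy per-sector integrity tuple: a checksum plus an incrementing counter. */
struct toy_tuple {
    uint16_t guard;   /* checksum of the sector's data */
    uint32_t ref;     /* expected position of the sector in the I/O */
};

/* Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection. */
static uint16_t toy_crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Generate tuples for nsect sectors, numbering them from first_ref. */
static void toy_generate(const uint8_t *data, size_t nsect,
                         uint32_t first_ref, struct toy_tuple *out)
{
    assert(data != NULL && out != NULL);
    for (size_t i = 0; i < nsect; i++) {
        out[i].guard = toy_crc16(data + i * TOY_SECTOR_SIZE, TOY_SECTOR_SIZE);
        out[i].ref = first_ref + (uint32_t)i;
    }
}

/* Verify tuples: returns the index of the first bad sector, or -1 if all pass. */
static long toy_verify(const uint8_t *data, size_t nsect,
                       uint32_t first_ref, const struct toy_tuple *t)
{
    for (size_t i = 0; i < nsect; i++) {
        if (t[i].ref != first_ref + (uint32_t)i)
            return (long)i;   /* sector out of order or in the wrong place */
        if (t[i].guard != toy_crc16(data + i * TOY_SECTOR_SIZE, TOY_SECTOR_SIZE))
            return (long)i;   /* data was corrupted after the tuple was made */
    }
    return -1;
}
```

The point of the scheme is that toy_generate() runs where the data originates and toy_verify() runs at each later hop, so a flipped bit or a misdirected sector is caught at write time instead of months later at read time.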

1.6. Multiqueue networking

Recommended LWN article: "Multiqueue networking"

From that article: One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device [...] This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. Linux 2.6.27 adds support for those devices.

Code: (commit)
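
The limitation of a single transmit queue, and what per-queue state buys, can be shown with a small model. This is a sketch under stated assumptions rather than the networking core's actual structures: the toy_* names, the queue depth of 2, and the one-queue-per-class mapping are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* The four service classes named in the article. */
enum toy_class {
    TOY_VOICE, TOY_VIDEO, TOY_BEST_EFFORT, TOY_BACKGROUND, TOY_NCLASSES
};

#define TOY_QUEUE_DEPTH 2

struct toy_txq {
    int queued;     /* packets waiting in this hardware queue */
    bool stopped;   /* set while the queue is full */
};

struct toy_netdev {
    struct toy_txq txq[TOY_NCLASSES];   /* one transmit queue per class */
};

/* Queue a packet on the queue of its class; false means "try again later". */
static bool toy_xmit(struct toy_netdev *dev, enum toy_class cls)
{
    assert((int)cls >= 0 && (int)cls < TOY_NCLASSES);
    struct toy_txq *q = &dev->txq[cls];
    if (q->stopped)
        return false;
    q->queued++;
    if (q->queued == TOY_QUEUE_DEPTH)
        q->stopped = true;   /* only this queue stops, not the whole device */
    return true;
}

/* The hardware finished one packet on a queue: wake that queue up. */
static void toy_tx_done(struct toy_netdev *dev, enum toy_class cls)
{
    struct toy_txq *q = &dev->txq[cls];
    assert(q->queued > 0);
    q->queued--;
    q->stopped = false;
}
```

In this model a full background queue refuses further background packets, yet a voice packet submitted at the same moment is still accepted. With a single stopped flag for the whole device, the voice packet would have had to wait as well.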

1.7. ftrace, sysprof support

Ftrace is a very simple function tracer, unrelated to kprobes/SystemTap, which was born in the -rt patches. It uses a compiler feature to insert a small, 5-byte No-Operation instruction at the beginning of every kernel function; that NOP sequence is then dynamically patched into a tracer call when the administrator enables tracing. If it's disabled, the overhead of the instructions is very small and not measurable even in micro-benchmarks. Although ftrace is the function tracer, it also includes a plugin infrastructure that allows for other types of tracing. The tracers currently in ftrace include ones that trace context switches, the time it takes for a high-priority task to run after it was woken up, how long interrupts are disabled, and the time spent in preemption-off critical sections.

The interface to ftrace can be found in /debugfs/tracing, and is documented in Documentation/ftrace.txt. There's also a sysprof plugin that can be used with a development version of sysprof: "svn checkout http://svn.gnome.org/svn/sysprof/branches/ftrace-branch sysprof"

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)

1.8. Mmiotrace

Recommended LWN article: "Tracing memory-mapped I/O operations"

Mmiotrace is a tool for trapping memory mapped IO (MMIO) accesses within the kernel. Since MMIO is used by drivers, this tool can be used for debugging and especially for reverse engineering binary drivers.

Code: (commit), Documentation: (commit)

1.9. External firmware

Recommended LWN article: "Moving the firmware out"

Firmware is usually compiled into each driver. For some reasons (mainly licensing), some companies do not allow their firmware to be distributed, so some drivers have supported loading external firmware for a long time. But even when the firmware compiled and shipped with a driver is redistributable, it is not libre software, and some people think that this breaks the GPL. It also has some disadvantages for distros.

In 2.6.27, the firmware blobs have been moved from the drivers' source code to a new directory: firmware/. By default, the firmware won't be compiled in the kernel binary, or in the modules. It's installed in /lib/firmware when the user types "make modules_install", and drivers have been modified to call request_firmware() and load the firmware when they need it. There's also a configuration option that will compile the firmware files in the kernel binary image, like it was done previously.

Code: (commit 1, 2, 3, 4)

1.10. Improved video camera support with the gspca driver

Linux 2.6.26 was a big improvement for Linux webcam support thanks to a driver for devices that implement the USB video class specification, of which there are quite a lot. 2.6.27 includes the gspca driver, which adds support for another large set of devices. With this driver, most video camera devices on the market are supported by Linux.

Code: (commit), (commit)

1.11. Extended file descriptor system calls

Recommended LWN article: "Extending system calls"

When Unix was designed, some of its interfaces didn't envision functionality that would be needed in the future. Many interfaces that create a file descriptor don't take a flags parameter, for example. That makes it impossible to create, in a single step, file descriptors with new properties such as close-on-exec, non-blocking, or non-sequential descriptors. Being able to do such things today is necessary, and not just for fun: it also closes a security bug that can be exploited in multithreaded apps.

To solve this issue, Linux 2.6.27 is adding a new set of interfaces and syscalls that will be used by glibc.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
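
One of the interfaces that gains a working flags argument in this release is eventfd(); pipe2(), dup3(), epoll_create1() and inotify_init1() follow the same pattern. The sketch below contrasts the old two-step approach with creation-time flags. It assumes a glibc recent enough to expose the flags, and the helper function names are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Old way: create the descriptor first, set the flag afterwards. */
static int eventfd_cloexec_racy(void)
{
    int fd = eventfd(0, 0);
    if (fd == -1)
        return -1;
    /* Another thread may fork() and exec() right here and leak fd. */
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* New way: the properties are applied atomically at creation time. */
static int eventfd_cloexec_atomic(void)
{
    return eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
}

/* Returns 1 if the descriptor is marked close-on-exec, 0 otherwise. */
static int fd_has_cloexec(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);
    return flags != -1 && (flags & FD_CLOEXEC) != 0;
}

/* Returns 1 if the descriptor is in non-blocking mode, 0 otherwise. */
static int fd_is_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL);
    return flags != -1 && (flags & O_NONBLOCK) != 0;
}
```

Both helpers end up with a close-on-exec descriptor, but only the second one never exposes a window in which the descriptor exists without the flag. That window is the security problem in multithreaded programs: a concurrent fork() followed by exec() would hand the descriptor to an unrelated program.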

1.12. Voltage and Current Regulator

This framework is designed to provide a generic interface to voltage and current regulators. The intention is to allow systems to dynamically control regulator output in order to save power and prolong battery life. This applies to both voltage regulators (where voltage output is controllable) and current sinks (where current output is controllable). The framework is designed around SoC based devices and has also been designed against two Power Management ICs (PMICs) currently on the market, namely the Freescale MC13783 and the Wolfson WM8350. It is nevertheless quite generic and should apply to all PMICs.

Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

2. Architecture-specific changes

3. Core

  • sched
    • Add new API sched_setscheduler_nocheck: add a flag to control access checks (commit)

    • sched: revert revert of: fair-group: SMP-nice for group scheduling (commit)

  • Power Management:
  • ACPI PCI slot detection driver (commit)

  • rcu: make rcutorture more vicious: add stutter feature (commit), reinstate boot-time testing (commit), make quiescent rcutorture less power-hungry (commit), make quiescent rcutorture less power-hungry (commit), invoke RCU readers from irq handlers (timers) (commit)

  • cfq-iosched: add message logging through blktrace (commit)

  • ramfs: enable splice write (commit)

  • sysfs: add /sys/dev/{char,block} to lookup sysfs path by major:minor (commit), add /sys/firmware/memmap (commit)

  • remove CONFIG_KMOD from core kernel code (commit)

  • Add a basic debugging framework for memory initialisation (commit), add bootmem debugging framework (commit)

  • Allow to debug the X server: access_process_vm device memory infrastructure (commit), use generic_access_phys for /dev/mem mappings (commit)

  • tmpfs: support aio (commit)

  • hugetlbfs: per mount huge page sizes (commit), new sysfs interface (commit), modular state for hugetlb page size (commit), multiple hstates for multiple page sizes (commit), support boot allocate different sizes (commit), override default huge page size (commit)

  • vmallocinfo: add NUMA information (commit)

  • memory-hotplug: add sysfs removable attribute for hotplug memory remove (commit)

  • UBI: implement multiple volumes rename (commit), remove pre-sqnum images support (commit), allow UBI root device name (commit)

  • kprobes: improve kretprobe scalability with hashed locking (commit)

  • per-task-delay-accounting: update taskstats for memory reclaim delay (commit)

  • task IO accounting: provide distinct tgid/tid I/O statistics (commit)

  • per-task-delay-accounting: add memory reclaim delay (commit)

  • per-task-delay-accounting: update document and getdelays.c for memory reclaim (commit)

  • fuse: nfs export special lookups (commit), lockd support (commit), add export operations (commit)

  • relay: add buffer-only channels; useful for early logging (commit)

  • lguest: Support assigning a MAC address (commit), virtio-rng support (commit)

  • KVM
  • Support adding a spare to a live md array with external metadata. (commit)

  • Support changing rdev size on running arrays. (commit)

  • CPUFREQ: S3C24XX NAND driver frequency scaling support. (commit)
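
The new /sys/dev/{char,block} directories listed above make it possible to go from a device number to its sysfs entry by building a path of the form /sys/dev/block/MAJOR:MINOR. A minimal sketch follows; the helper names are hypothetical, and major(), minor() and stat() are the usual glibc interfaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Build "/sys/dev/<kind>/<major>:<minor>", where kind is "block" or "char".
 * Returns 0 on success, -1 on an unknown kind or a too-small buffer.
 */
static int sys_dev_path(const char *kind, dev_t dev, char *buf, size_t len)
{
    assert(kind != NULL && buf != NULL);
    if (strcmp(kind, "block") != 0 && strcmp(kind, "char") != 0)
        return -1;
    int n = snprintf(buf, len, "/sys/dev/%s/%u:%u",
                     kind, major(dev), minor(dev));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Same, starting from a device node such as /dev/null. */
static int sys_dev_path_for_node(const char *node, char *buf, size_t len)
{
    struct stat st;
    if (stat(node, &st) == -1)
        return -1;
    if (S_ISBLK(st.st_mode))
        return sys_dev_path("block", st.st_rdev, buf, len);
    if (S_ISCHR(st.st_mode))
        return sys_dev_path("char", st.st_rdev, buf, len);
    return -1;   /* not a device node */
}
```

For block device 8:1 the first helper produces /sys/dev/block/8:1. On a kernel with this feature that path is expected to be a symbolic link to the device's directory in sysfs, so a readlink() on it yields the real sysfs location.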

4. Crypto

5. Security

  • Protect legacy applications from executing with insufficient privilege (commit)

  • Filesystem capabilities refactor kernel code (commit)

  • LSM: show LSM mount options in /proc/mounts (commit)

  • Selinux:
    • Support deferred mapping of contexts (commit)

    • Enable processes with mac_admin to get the raw inode contexts (commit)

6. Networking

  • WEXT: Add support for passing PMK and capability flags to WEXT (commit)

  • Add layer1 over IP support (commit)

  • Add STP demux layer (commit)

  • bridge: Use STP demux (commit)

  • Add GARP applicant-only participant (commit)

  • virtio net: Add ethtool ops for SG/GSO (commit)

  • loopback: Enable TSO (commit)

  • netfilter
    • ebtables: add IPv6 support (commit)

    • ctnetlink: add full support for SCTP to ctnetlink (commit)

    • ip_tables: add iptables security table for mandatory access control rules (commit)

    • ip6_tables: add ip6tables security table (commit)

    • accounting rework: ct_extend + 64bit counters (commit)

  • mac80211: add spectrum capabilities (commit)

  • build algorithms into the mac80211 module (commit)

  • hostap: add radiotap support in monitor mode (commit)

  • iwlwifi: add rfkill subsystem for 3945 (commit)

  • netdev: Add support for rx flow hash configuration, using ethtool. (commit)

  • iwlwifi: enable packet injection for iwl3945 (commit)

  • tun: Interface to query tun/tap features. (commit), TUNSETFEATURES to set gso features. (commit)

  • vlan: Add GVRP support (commit)

  • vlan: Add ethtool support (commit)

  • netdev: Create netdev_queue abstraction. (commit)

  • mac80211: power management wext hooks (commit)

7. Filesystems

  • fatfs: add UTC timestamp option (commit)

  • XFS: ASCII case-insensitive support (commit)

8. Drivers

8.1. Graphics

  • fbdev: add the carmine FB driver (commit), SuperH Mobile LCDC Driver (commit), LCD backlight driver using Atmel PWM driver (commit), add new Cobalt LCD framebuffer driver (commit), add support for the ILI9320 video display controller (commit), SH7760/SH7763 LCDC framebuffer driver (commit)

  • tridentfb: add TGUI 9440 support (commit), add imageblit acceleration for Blade3D family (commit), add acceleration for TGUI families (commit)

  • Add platform_lcd driver (commit)

  • Remove old broken Cobalt LCD driver (commit)

8.2. IDE/SATA

  • SATA
  • ide
    • Remove obsoleted "hdx=" kernel parameters (commit), remove obsoleted "idebus=" kernel parameter (commit), remove obsoleted "ide=" kernel parameters (commit)

    • Remove mpc8xx-ide driver (commit)

    • palm_bk3710: add UltraDMA/100 support (commit)

    • BAST: Remove old IDE driver (commit)

8.3. Network

  • Add ath9k: Atheros IEEE 802.11n driver for AR5