Linux 2.6.27 kernel released 9 October 2008.
Note: The 2008 Linux Kernel Summit was held on September 15 and 16 in Portland, Oregon, immediately prior to the Linux Plumbers Conference. LWN, as always, has excellent coverage of the event. All the papers from the conferences can be downloaded in two PDF files. LWN also has coverage of the Linux Plumbers Conference.
Summary: 2.6.27 adds a new filesystem (UBIFS) optimized for "pure" flash-based storage devices, a lockless page cache, much improved Direct I/O scalability and performance, delayed allocation for ext4, multiqueue networking, an alternative hibernation implementation based on kexec/kdump, data integrity support in the block layer for devices that support it, a simple tracer called ftrace, an mmio tracer, sysprof support, extraction of all the in-kernel firmware to /lib/firmware, Xen support for saving/restoring VMs, improved video camera support, support for the Intel wireless 5000 series and RTL8187B network cards, a new ath9k driver for the Atheros AR5008 and AR9001 family of chipsets, more new drivers, improved support for others and many other improvements and fixes.
Contents
- Prominent features (the cool stuff)
- Lockless page cache and get_user_pages()
- Ext4: Delayed Allocation
- Kexec jump: kexec/kdump based hibernation
- UBIFS and OMFS
- Block layer data integrity support
- Multiqueue networking
- ftrace, sysprof support
- Mmiotrace
- External firmware
- Improved video camera support with the gspca driver
- Extended file descriptor system calls
- Voltage and Current Regulator
- Architecture-specific changes
- Core
- Crypto
- Security
- Networking
- Filesystems
- Drivers
- The Linux Kernel in the news
1. Prominent features (the cool stuff)
1.1. Lockless page cache and get_user_pages()
Recommended LWN articles: "Toward better direct I/O scalability", "The lockless page cache"
The page cache is where the kernel keeps copies of file data in RAM, improving performance by avoiding disk I/O when the data to be read is already in memory. Each "mapping", the data structure that tracks the correspondence between a file and its pages in the page cache, is SMP-safe thanks to its own lock. When processes on different CPUs access different files there is no lock contention, but when they access the same file (shared libraries or shared data files, for example) they can contend on that lock. In 2.6.27, thanks to some rules on how the page cache can be used and to the use of RCU, the page cache can do lookups (i.e., "read" the page cache) without taking the mapping lock, which improves scalability. The improvement is only noticeable on systems with many CPUs (a page fault speedup of 250x has been measured on a 64-way system).
Lockless get_user_pages(): get_user_pages() is a function used in direct I/O operations to pin the userspace memory that is going to be transferred. It is a complex function that requires holding the mmap_sem semaphore in the process's mm_struct, as well as the page table lock. This is a scalability problem when several processes or threads call get_user_pages() in the same address space (for example, databases that do Direct I/O), because there will be lock contention. 2.6.27 introduces a new get_user_pages_fast() function, which does the same work as get_user_pages() but is simplified to speed up the most common workloads that exercise those paths within the same address space. In those cases the new function can avoid taking the mmap_sem semaphore and the page table locks. Benchmarks showed a 10% speedup running an OLTP workload with an IBM DB2 database on a quad-core system.
Code: (commit 1, 2, 3, 4, 5, 6)
1.2. Ext4: Delayed Allocation
In this release, Ext4 is adding one of its most important planned features: Delayed allocation (also called "Allocate-on-flush"). It doesn't change the disk format in any way, but it improves the performance in a wide range of workloads.
When an application write()s data to the disk, the data is usually not written immediately to the disk but instead is cached in RAM for a while. Without delayed allocation, despite the data not being written immediately to the disk the filesystem allocates the necessary disk structures for it immediately. Delayed allocation consists of not allocating space for that cached data - instead only the free space counter is updated when write() is called. The procedure is changed so on-disk blocks and structures are now only allocated when the cached data is finally written to the disk - not when a process writes something. This approach (used by filesystems such as XFS, btrfs, ZFS and Reiser 4) noticeably improves the performance of many workloads. It also results in better block allocation decisions because when allocation decisions are done at write()-time, the block allocator cannot know if any other write()s are going to be done.
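The effect on block placement can be illustrated with a toy model. The sketch below is not ext4's allocator: ToyDisk, immediate_allocation(), delayed_allocation() and extents() are hypothetical names, and the "disk" simply hands out the next free block numbers. It only shows why deferring allocation until flush time lets each file get one contiguous run when two files are written in interleaved order.

```python
class ToyDisk:
    """Toy allocator (hypothetical): hands out the next free block numbers."""

    def __init__(self):
        self.next_free = 0

    def allocate(self, count):
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def immediate_allocation(disk, writes):
    """Allocate blocks as each write() arrives, without knowing what follows."""
    layout = {}
    for name, nblocks in writes:
        layout.setdefault(name, []).extend(disk.allocate(nblocks))
    return layout


def delayed_allocation(disk, writes):
    """At write() time only count blocks; allocate once per file at flush time."""
    pending = {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks  # reservation only
    # Flush: the allocator now knows the total size of each file.
    return {name: disk.allocate(total) for name, total in pending.items()}


def extents(blocks):
    """Number of contiguous runs in a non-empty list of block numbers."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


# Two files written one block at a time, interleaved.
writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1)]
eager = immediate_allocation(ToyDisk(), writes)
lazy = delayed_allocation(ToyDisk(), writes)
print("extents for file a:", extents(eager["a"]), "vs", extents(lazy["a"]))
```

In this model the eager allocator scatters file "a" over blocks 0, 2 and 4, while the delayed one places it in blocks 0 to 2 as a single extent.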
There is also a new implementation of the default data=ordered journaling mode based on inodes, not on jbd buffer heads. Code: (commit 1, 2, 3, 4)
1.3. Kexec jump: kexec/kdump based hibernation
Recommended LWN article: "Yet another approach to software suspend"
Kexec is a Linux feature that allows loading a kernel into memory and executing it, making it possible to switch to a new kernel without a full reboot. This infrastructure was used to implement kdump, a kernel crash dump system: a "safe kernel" is loaded into memory as soon as the system starts, and if the running kernel crashes, the oops code kexecs into the "safe kernel", which can dump the memory that it is not using to disk or somewhere else.
This infrastructure has been enhanced in 2.6.27 so that it can be used as a hibernation implementation: instead of kexec'ing a safe kernel to dump the system memory, a system can kexec into a kernel that dumps all the memory to disk and then shuts down the system. When the system boots, the initrd can load the dumped system and restore it.
This hibernation implementation does not replace the existing hibernation implementations; it is just an alternative. It has some advantages, like not depending on ACPI. For now it only works on x86-32.
Code: http://lwn.net/Articles/242107/ (commit). (commit)
1.4. UBIFS and OMFS
Recommended LWN articles: "UBIFS", "OMFS"
UBIFS is a new filesystem designed to work with flash devices, developed by Nokia with the help of the University of Szeged. It's important to understand that UBIFS is very different from any traditional filesystem: UBIFS does not work with block-based devices, but with pure flash-based devices, handled by the MTD subsystem in Linux. Hence, UBIFS does not work with what many people consider flash devices, such as flash-based hard drives, SD cards or USB sticks, because those devices use a block device emulation layer called FTL (Flash Translation Layer) that makes them look like traditional block-based storage devices to the outside world. UBIFS is instead designed to work with flash devices that do not have a block device emulation layer, are handled by the MTD subsystem and present themselves to userspace as MTD devices.
UBIFS works on top of UBI volumes. UBI is an LVM-like layer which was included in Linux 2.6.22 and which itself works on top of MTD devices. UBIFS offers various advantages over JFFS2: faster and scalable mount times (unlike JFFS2, UBIFS does not have to scan the whole media when mounting), tolerance to unclean reboots (UBIFS is a journaling filesystem), write-back (which dramatically improves performance), and support for on-the-fly compression.
Documentation: UBIFS FAQ, more documentation
Code: (commit), (commit), (commit)
OMFS stands for "Sonicblue Optimized MPEG File System support". It is the proprietary file system used by the Rio Karma music player and ReplayTV DVR. Despite the name, this filesystem is not more efficient than a standard FS for MPEG files, in fact likely the opposite is true. Code: (commit 1, 2, 3, 5, 6, 7, 8)
1.5. Block layer data integrity support
Recommended LWN article: "Block layer: integrity checking and lots of partitions"
Modern filesystems feature checksumming of data and metadata to protect against data corruption. However, the detection of the corruption is done at read time, which could potentially be months after the data was written. At that point the original data that the application tried to write is most likely lost (if there is no data redundancy). The solution is to ensure that the disk is actually storing what the application meant it to. Recent additions to both the SCSI family of protocols (SBC Data Integrity Field, SCC protection proposal) and SATA/T13 (External Path Protection) try to remedy this by adding support for appending integrity metadata to an I/O. The integrity metadata includes a checksum for each sector as well as an incrementing counter that ensures the individual sectors are written in the right order and, for some protection schemes, that the I/O is written to the right place on disk.
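A minimal sketch of the idea, in userspace Python rather than kernel code, is shown below. It attaches to each 512-byte sector a small tuple holding a checksum and an incrementing counter, then verifies both. The 8-byte layout (2-byte guard checksum, 2-byte application tag, 4-byte reference counter) and the CRC polynomial 0x8BB7 are assumptions modelled on how the SCSI Data Integrity Field is commonly described; the function names protect() and verify() are hypothetical and do not correspond to the block layer's interfaces.

```python
import struct

SECTOR_SIZE = 512


def crc16_guard(data):
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0 (assumed guard CRC)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def protect(data, first_sector):
    """Split data into sectors and pair each with an 8-byte integrity tuple."""
    protected = []
    for offset in range(0, len(data), SECTOR_SIZE):
        sector = data[offset:offset + SECTOR_SIZE]
        ref = (first_sector + offset // SECTOR_SIZE) & 0xFFFFFFFF  # counter
        tag = struct.pack(">HHI", crc16_guard(sector), 0, ref)
        protected.append((sector, tag))
    return protected


def verify(protected, first_sector):
    """Return the index of the first bad sector, or None if all are intact."""
    for n, (sector, tag) in enumerate(protected):
        guard, _app, ref = struct.unpack(">HHI", tag)
        expected_ref = (first_sector + n) & 0xFFFFFFFF
        if guard != crc16_guard(sector) or ref != expected_ref:
            return n
    return None


payload = bytes(range(256)) * 4                 # two 512-byte sectors
written = protect(payload, first_sector=100)

# Simulate corruption in flight: flip one bit in the second sector.
damaged = list(written)
sector, tag = damaged[1]
damaged[1] = (bytes([sector[0] ^ 0x01]) + sector[1:], tag)

print(verify(written, 100), verify(damaged, 100), verify(written, 101))
```

The last call models a misdirected write: the data is intact, but the counter no longer matches the location being checked, so the first sector is reported as bad.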
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)
1.6. Multiqueue networking
Recommended LWN article: "Multiqueue networking"
From that article: One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device [...] This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. Linux 2.6.27 adds support for those devices.
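As a rough illustration of a device with several transmit queues, the toy model below keeps one queue per class of service and always transmits from the highest-priority non-empty queue. It is only a sketch: the MultiqueueDevice class, the strict-priority policy and the ordering of the four classes are assumptions made for the example, not a description of the kernel's data structures or of any particular hardware.

```python
from collections import deque

# Assumed class order for this example, highest priority first.
CLASSES = ("voice", "video", "best-effort", "background")


class MultiqueueDevice:
    """Toy device (hypothetical) with one transmit queue per service class."""

    def __init__(self):
        self.queues = {service_class: deque() for service_class in CLASSES}

    def enqueue(self, packet, service_class):
        self.queues[service_class].append(packet)

    def transmit_next(self):
        """Pop from the highest-priority queue that has packets waiting."""
        for service_class in CLASSES:
            if self.queues[service_class]:
                return self.queues[service_class].popleft()
        return None


dev = MultiqueueDevice()
dev.enqueue("backup chunk", "background")
dev.enqueue("web page", "best-effort")
dev.enqueue("phone call frame", "voice")

sent = [dev.transmit_next() for _ in range(3)]
print(sent)
```

With a single transmit queue the packets would leave in arrival order; with per-class queues the voice frame goes out first even though it was queued last.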
Code: (commit)
1.7. ftrace, sysprof support
Ftrace is a very simple function tracer, unrelated to kprobes/SystemTap, which was born in the -rt patches. It uses a compiler feature to insert a small, 5-byte No-Operation instruction at the beginning of every kernel function; this NOP sequence is dynamically patched into a tracer call when the administrator enables tracing. While tracing is disabled, the overhead of these instructions is very small and not measurable even in micro-benchmarks. Although ftrace is the function tracer, it also includes a plugin infrastructure that allows other types of tracing. The tracers currently in ftrace include one that traces context switches, and others that measure the time it takes for a high-priority task to run after it was woken up, how long interrupts are disabled, and the time spent in preemption-off critical sections.
The interface to ftrace can be found in /debugfs/tracing and is documented in Documentation/ftrace.txt. There's also a sysprof plugin that can be used with a development version of sysprof: "svn checkout http://svn.gnome.org/svn/sysprof/branches/ftrace-branch sysprof".
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
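Because the interface is a directory of plain files, it can be driven from any language. The sketch below shows one possible way to pick a tracer and read its output from Python. The file names available_tracers, current_tracer and trace, and the tracer name sched_switch, are taken from memory of the ftrace documentation and should be treated as assumptions to check against Documentation/ftrace.txt; the helper functions themselves are hypothetical.

```python
import os


def available_tracers(tracing_dir):
    """Return the tracer names listed in the (assumed) available_tracers file."""
    with open(os.path.join(tracing_dir, "available_tracers")) as f:
        return f.read().split()


def select_tracer(tracing_dir, name):
    """Activate a tracer by writing its name to the (assumed) current_tracer file."""
    if name not in available_tracers(tracing_dir):
        raise ValueError("tracer not available: " + name)
    with open(os.path.join(tracing_dir, "current_tracer"), "w") as f:
        f.write(name)


def read_trace(tracing_dir, max_lines=20):
    """Return up to max_lines lines from the (assumed) trace file."""
    with open(os.path.join(tracing_dir, "trace")) as f:
        return [line.rstrip("\n") for _, line in zip(range(max_lines), f)]
```

On a kernel with ftrace enabled and debugfs mounted on /debugfs, something like select_tracer("/debugfs/tracing", "sched_switch") followed by read_trace("/debugfs/tracing") would be run as root.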
1.8. Mmiotrace
Recommended LWN article: "Tracing memory-mapped I/O operations"
Mmiotrace is a tool for trapping memory mapped IO (MMIO) accesses within the kernel. Since MMIO is used by drivers, this tool can be used for debugging and especially for reverse engineering binary drivers.
Code: (commit), Documentation: (commit)
1.9. External firmware
Recommended LWN article: "Moving the firmware out"
Firmware is usually compiled into each driver. For various reasons (mainly licensing), some companies do not allow their firmware to be distributed, and some drivers have long supported loading external firmware. But even when the firmware compiled and shipped with a driver is redistributable, it is not libre software, and some people think that this breaks the GPL. It also has some disadvantages for distros.
In 2.6.27, the firmware blobs have been moved from the drivers' source code to a new directory: firmware/. By default, the firmware won't be compiled in the kernel binary, or in the modules. It's installed in /lib/firmware when the user types "make modules_install", and drivers have been modified to call request_firmware() and load the firmware when they need it. There's also a configuration option that will compile the firmware files in the kernel binary image, like it was done previously.
1.10. Improved video camera support with the gspca driver
Linux 2.6.26 was a big improvement for Linux webcam support thanks to a driver for devices that implement the USB video class specification, of which there are quite a lot. 2.6.27 includes the gspca driver, which implements support for another large set of devices. With this driver, most video camera devices on the market are supported by Linux.
1.11. Extended file descriptor system calls
Recommended LWN article: "Extending system calls"
When Unix was designed, some of its interfaces did not envision functionality that would be needed in the future. Many interfaces that create a file descriptor don't take a flags parameter, for example. That makes it impossible to create file descriptors with new properties such as close-on-exec, non-blocking, or non-sequential descriptors. Being able to do such things today is necessary, and not just for fun: it also closes a security bug that can be exploited in multithreaded apps.
To solve this issue, Linux 2.6.27 is adding a new set of interfaces and syscalls that will be used by glibc.
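One of these calls is pipe2(), which behaves like pipe() but takes a flags argument. Without it, a program has to create the pipe and then set close-on-exec with a separate fcntl() call, and in a multithreaded program another thread can fork and exec in between, leaking the descriptor to the new program. The sketch below uses Python's os.pipe2() wrapper on Linux to create a pipe that is close-on-exec and non-blocking from the start, then reads the flags back to confirm it. It is a usage illustration only and assumes a Linux system whose C library exposes pipe2().

```python
import fcntl
import os

# One call creates the pipe with both properties already set, so there is
# no window in which the descriptors exist without close-on-exec.
read_fd, write_fd = os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)

fd_flags = fcntl.fcntl(read_fd, fcntl.F_GETFD)      # per-descriptor flags
status_flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)  # file status flags

print("close-on-exec:", bool(fd_flags & fcntl.FD_CLOEXEC))
print("non-blocking:", bool(status_flags & os.O_NONBLOCK))

os.close(read_fd)
os.close(write_fd)
```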
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
1.12. Voltage and Current Regulator
This framework is designed to provide a generic interface to voltage and current regulators. The intention is to allow systems to dynamically control regulator output in order to save power and prolong battery life. This applies to both voltage regulators (where voltage output is controllable) and current sinks (where current output is controllable). The framework is designed around SoC-based devices and has been designed against two Power Management ICs (PMICs) currently on the market, namely the Freescale MC13783 and the Wolfson WM8350. However, it is quite generic and should apply to all PMICs.
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
2. Architecture-specific changes
- x86
Make generic arch support NUMAQ (commit)
Make generic arch support VisWS (Visual Workstation): turn into generic arch (commit), (commit)
CPA: add statistics about state of direct mapping (commit)
Add a debugfs interface to dump PAT memtype (commit)
Add "debugpat" boot option (commit)
Allow up to 4096 cpus: NR_CPUS to 4096 and MAX_NUMNODES to 512 (commit), (commit)
Config option to disable info from decompression of the kernel (commit)
clockevents: add C1E aware idle function (commit)
SGI UV: TLB shootdown using broadcast assist unit (commit)
Enable memory tester support on 32-bit (commit)
Add performance variants of cpumask operators (commit)
Add a list for custom page fault handlers. (commit)
mtrr cleanup for converting continuous to discrete layout (commit), (commit)
RDC321x: add to mach-default (commit)
- ARM
kgdb ARCH=arm support (commit)
Common code for the Motorola EZX GSM phones (commit)
Orion: add QNAP TS-409 support (commit), add 88F5181L (Orion-VoIP) support (commit), add Linksys WRT350N v2 support (commit), add HP Media Vault mv2120 support (commit), add Technologic Systems TS-78xx support (commit), add Maxtor Shared Storage II support (commit), add Netgear WNR854T support (commit), add RD88F5181L-FXO support (commit), add RD88F5181L-GE support (commit)
Add e350 support (commit)
E-series UDC support (commit)
AT91: UDPHS driver (commit), (commit), (commit), Calao Systems (commit)
Initial machine support for Logitech Jive (commit)
pcm990: Add framebuffer and backlight support (commit)
pxa: add pxa3xx NAND device and clock sources (commit), add pxa3xx NAND support for zylonite (commit), add pxa3xx NAND support for littleton (commit), add base support for PXA930 (aka Tavor-P) (commit), add base support for PXA930 Evaluation Board (aka TavorEVB) (commit), add base support for PXA930 Handheld Platform (aka SAAR) (commit), add generic PWM backlight driver (commit), (commit)
Better MX2 platform (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit)
Add basic pcm037 board support (commit)
Latencytop support (commit)
Add Marvell Loki (88RC8480) SoC support (commit), Marvell Kirkwood (88F6000) SoC support (commit), Marvell 78xx0 ARM SoC support (commit)
Support Toshiba TC6393XB Mobile I/O Controller. (commit)
Core MFD support (commit)
tc6393xb: tmio-nand support (commit)
Tosa: support TC6393XB device (commit), tmio-nand data (commit), support built-in bluetooth power-up (commit)
S3c2440: Add AT2440EVB board support (commit)
AT2440EVB: Add DM9000A network controller support (commit)
Acer n30: Add support for n35 and related devices (commit)
ixp4xx: Add support for the Freecom FSG-3 board (commit)
Remove ARCH_CO285 (commit)
Support for the at91sam9g20 (commit)
Add support for PalmTX handheld computer (commit)
pxafb: Support for RGB666, RGBT666, RGB888 and RGBT888 (commit)
Support for LCD on e740 e750 e400 and e800 e-series PDAs (commit)
- SH
Initial ELF FDPIC support. (commit)
Support variable page sizes on nommu. (commit)
Add support for 16kB PAGE_SIZE. (commit)
Add support Renesas Solutions AP-325RXA board (commit)
Add SCIF2 support for SH7763. (commit)
RSK+ 7203 board support. (commit)
Renesas Solutions SH7763RDP board support (commit)
Solution Engine 7710/7712 SH-Ether support (commit)
Renesas R0P7785LC0011RL board support (commit)
Add SuperH Mobile LCDC platform data for Migo-R (commit), add SuperH Mobile CEU platform data for Migo-R (commit)
- IA64
Add support for the SGI UV GRU. The GRU is a hardware resource located in the system chipset. The GRU contains memory that is mmaped into the user address space. This memory is used to communicate with the GRU to perform functions such as load/store, scatter/gather, bcopy, AMOs, etc (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
Allow ia64 to CONFIG_NR_CPUS up to 4096 (commit)
Paravirt_ops support for IA64 (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit)
- Xen
- Blackfin
Use the generic platform nand driver to support nand flash on bf53x board which do not have on-chip nand flash controller (commit)
Apply Bluetechnix CM-BF527 board support patch (commit)
Add support for the Blackstamp board (commit)
RTC driver: add support for power management framework (commit)
Add support for board tcm-bf537 (commit)
- S390
- MIPS
- POWERPC
Remove arch/ppc architecture. arch/powerpc supports everything now (commit)
powerpc kgdb support (commit)
Enable tracehook for the architecture (commit)
Support multiple hugepage sizes (commit), define support for 16G hugepages (commit)
Vector Scalar extensions (Power 7 processors) (commit), (commit), (commit), (commit), (commit)
Add Strong Access Ordering support (commit), (commit), (commit)
mpc5121: Add clock driver (commit), Update device tree for MPC5121ADS evaluation board (commit), add generic board support for MPC5121 platforms (commit), add support for CPLD on MPC5121ADS board (commit)
85xx: add board support for the TQM8548 modules (commit), add DOZE/NAP support for e500 core (commit), enable MSI support for 85xxds board (commit), add support for MPC8536DS (commit)
83xx: new board support: MPC8360E-RDK (commit), add support for Analogue & Micro ASP837E board (commit), Power Management support (commit)
86xx: Enable MSI support for MPC8610HPCD board (commit)
virtex: add Xilinx 440 cpu to the cputable (commit), add Xilinx Virtex 5 ppc440 platform support (commit)
4xx: Sam440ep support (commit)
C2K board driver (commit)
ibmveth: enable driver for CMO (commit)
ibmvscsi: driver enablement for CMO (commit)
ibmvfc: Add support for collaborative memory overcommit (commit)
Implement FSL GTM support (commit)
powerpc/QE: add support for QE USB clocks routing (commit)
booke: Add kprobes support for booke style processors (commit), BookE hardware watchpoint support (commit), add support for new e500mc core (commit)
fsl: PCIe MSI support for 83xx/85xx/86xx processors. (commit)
pseries: Add collaborative memory manager (commit), add CMO paging statistics (commit), iommu enablement for CMO (commit), vio bus support for CMO (commit)
Add driver for Barrier Synchronization Register (commit)
Support for latencytop (commit)
cell: Add spu aware cpufreq governor (commit), add support for power button of future IBM cell blades (commit)
Delete unused fec_8xx net driver (commit)
- AVR32
- SPARC
- v850
Remove v850 port (commit)
3. Core
- sched
- Power Management:
Recommended LWN article: "A new suspend/hibernate infrastructure"
New suspend/hibernate infrastructure (commit), (commit), (commit)
Boot time suspend selftest (commit)
ACPI PCI slot detection driver (commit)
rcu: make rcutorture more vicious: add stutter feature (commit), reinstate boot-time testing (commit), make quiescent rcutorture less power-hungry (commit), make quiescent rcutorture less power-hungry (commit), invoke RCU readers from irq handlers (timers) (commit)
cfq-iosched: add message logging through blktrace (commit)
ramfs: enable splice write (commit)
sysfs: add /sys/dev/{char,block} to lookup sysfs path by major:minor (commit), add /sys/firmware/memmap (commit)
remove CONFIG_KMOD from core kernel code (commit)
Add a basic debugging framework for memory initialisation (commit), add bootmem debugging framework (commit)
Allow to debug the X server: access_process_vm device memory infrastructure (commit), use generic_access_phys for /dev/mem mappings (commit)
tmpfs: support aio (commit)
hugetlbfs: per mount huge page sizes (commit), new sysfs interface (commit), modular state for hugetlb page size (commit), multiple hstates for multiple page sizes (commit), support boot allocate different sizes (commit), override default huge page size (commit)
vmallocinfo: add NUMA information (commit)
memory-hotplug: add sysfs removable attribute for hotplug memory remove (commit)
UBI: implement multiple volumes rename (commit), remove pre-sqnum images support (commit), allow UBI root device name (commit)
kprobes: improve kretprobe scalability with hashed locking (commit)
per-task-delay-accounting: update taskstats for memory reclaim delay (commit)
task IO accounting: provide distinct tgid/tid I/O statistics (commit)
per-task-delay-accounting: add memory reclaim delay (commit)
per-task-delay-accounting: update document and getdelays.c for memory reclaim (commit)
fuse: nfs export special lookups (commit), lockd support (commit), add export operations (commit)
relay: add buffer-only channels; useful for early logging (commit)
lguest: Support assigning a MAC address (commit), virtio-rng support (commit)
- KVM
Support adding a spare to a live md array with external metadata. (commit)
Support changing rdev size on running arrays. (commit)
CPUFREQ: S3C24XX NAND driver frequency scaling support. (commit)
4. Crypto
Add support for RIPEMD hash algorithms: RIPEMD-128,256 and 320 (commit), (commit), (commit), (commit), (commit)
hash: Add asynchronous hash support (commit), (commit), (commit)
ixp4xx - Hardware crypto support for IXP4xx CPUs (commit)
crc32c - Add ahash implementation (commit)
talitos: Freescale integrated security engine (SEC) driver (commit), (commit), (commit)
5. Security
Protect legacy applications from executing with insufficient privilege (commit)
Filesystem capabilities refactor kernel code (commit)
LSM: show LSM mount options in /proc/mounts (commit)
- Selinux:
6. Networking
WEXT: Add support for passing PMK and capability flags to WEXT (commit)
Add layer1 over IP support (commit)
Add STP demux layer (commit)
bridge: Use STP demux (commit)
Add GARP applicant-only participant (commit)
virtio net: Add ethtool ops for SG/GSO (commit)
loopback: Enable TSO (commit)
- netfilter
mac80211: add spectrum capabilities (commit)
build algorithms into the mac80211 module (commit)
hostap: add radiotap support in monitor mode (commit)
iwlwifi: add rfkill subsystem support for 3945 (commit)
netdev: Add support for rx flow hash configuration, using ethtool. (commit)
iwlwifi: enable packet injection for iwl3945 (commit)
tun: Interface to query tun/tap features. (commit), TUNSETFEATURES to set gso features. (commit)
vlan: Add GVRP support (commit)
vlan: Add ethtool support (commit)
netdev: Create netdev_queue abstraction. (commit)
mac80211: power management wext hooks (commit)
7. Filesystems
8. Drivers
8.1. Graphics
fbdev: add the carmine FB driver (commit), SuperH Mobile LCDC Driver (commit), LCD backlight driver using Atmel PWM driver (commit), add new Cobalt LCD framebuffer driver (commit), add support for the ILI9320 video display controller (commit), SH7760/SH7763 LCDC framebuffer driver (commit)
tridentfb: add TGUI 9440 support (commit), add imageblit acceleration for Blade3D family (commit), add acceleration for TGUI families (commit)
Add platform_lcd driver (commit)
Remove old broken Cobalt LCD driver (commit)
8.2. IDE/SATA
- SATA
- ide
8.3. Network
Add ath9k: Atheros IEEE 802.11n driver for AR5