When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.
This version names cgroups which are automatically created using
cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
cloned process. (Thanks Pavel for the idea) This is safe because if the
process unshares again, it will create
/cgroups/(...)/node_<pid>/node_<pid>
The only possibilities (AFAICT) for a -EEXIST on unshare are
1. pid wraparound
2. a process fails an unshare, then tries again.
Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.
Changelog:
Name cloned cgroups as "node_<pid>".
[clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the cgroup framework.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.
Portions contributed by Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem
The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Get rid of sparse related warnings from places that use integer as NULL
pointer.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kconfig.preempt is not included on some archs (for example, m68k). On those
archs, the Kconfig machinery complains that KVM selects an undefined symbol
PREEMPT_NOTIFIERS (which lives in Kconfig.preempt).
So move the offending symbol into a Kconfig file which is included by
everyone.
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
kbuild: introduce ccflags-y, asflags-y and ldflags-y
kbuild: enable 'make CPPFLAGS=...' to add additional options to CPP
kbuild: enable use of AFLAGS and CFLAGS on commandline
kbuild: enable 'make AFLAGS=...' to add additional options to AS
kbuild: fix AFLAGS use in h8300 and m68knommu
kbuild: check for wrong use of CFLAGS
kbuild: enable 'make CFLAGS=...' to add additional options to CC
kbuild: fix up CFLAGS usage
kbuild: make modpost detect unterminated device id lists
kbuild: call export_report from the Makefile
kbuild: move Kai Germaschewski to CREDITS
kconfig/menuconfig: distinguish between selected-by-another options and comments
kconfig: tristate choices with mixed tristate and boolean values
include/linux/Kbuild: remove duplicate entries
kbuild: kill backward compatibility checks
kbuild: kill EXTRA_ARFLAGS
kbuild: fix documentation in makefiles.txt
kbuild: call make once for all targets when O=.. is used
kbuild: pass -g to assembler under CONFIG_DEBUG_INFO
kbuild: update _shipped files for kconfig syntax cleanup
...
Fix up conflicts in arch/um/sys-{x86_64,i386}/Makefile manually.
Grouping pages by mobility can be disabled at compile-time. This was
considered undesirable by a number of people. However, in the current stack of
patches, it is not a simple case of just dropping the configurable patch as it
would cause merge conflicts. This patch backs out the configuration option.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The grouping mechanism has some memory overhead and a more complex allocation
path. This patch allows the strategy to be disabled for small memory systems
or if it is known the workload is suffering because of the strategy. It also
acts to show where the page groupings strategy interacts with the standard
buddy allocator.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Optionally add a boot delay after each kernel printk() call, crudely
measured in milliseconds, with a maximum delay of 10 seconds per printk.
Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.):
"lpj=loops_per_jiffy boot_delay=100"
to the kernel command line.
It has been useful in cases like "during boot, my machine just reboots or the
screen goes black" by slowing down printk, (and adding initcall_debug), we can
usually see the last thing that happened before the lights went out which is
usually a valuable clue.
[akpm@linux-foundation.org: not all architectures implement CONFIG_HZ]
[akpm@linux-foundation.org: fix lots of stuff]
[bunk@stusta.de: kernel/printk.c: make 2 variables static]
[heiko.carstens@de.ibm.com: fix slow down printk on boot compile error]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.
A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.
A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.
Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Add interface to control cpu bandwidth allocation to task-groups.
(not yet configurable, due to missing CONFIG_CONTAINERS)
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
The variable CFLAGS is a wellknown variable and the usage by
kbuild may result in unexpected behaviour.
On top of that several people over time has asked for a way to
pass in additional flags to gcc.
This patch replace use of CFLAGS with KBUILD_CFLAGS all over the
tree and enabling one to use:
make CFLAGS=...
to specify additional gcc commandline options.
One usecase is when trying to find gcc bugs but other
use cases has been requested too.
Patch was tested on following architectures:
alpha, arm, i386, x86_64, mips, sparc, sparc64, ia64, m68k
Test was simple to do a defconfig build, apply the patch and check
that nothing got rebuild.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
There is still some confusion and disagreement over what this interface should
actually do. So it is best that we disable it in 2.6.23 until we get that
fully sorted out.
(sys_timerfd() was present in 2.6.22 but it was apparently broken, so here we
assume that nobody is using it yet).
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Davide Libenzi <davidel@xmailserver.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8314418629 (Freezer: make kernel
threads nonfreezable by default) breaks freezing when attempting to resume
from an initrd, because the init (which is freezeable) spins while waiting
for another thread to run /linuxrc, but doesn't check whether it has been
told to enter the refrigerator. The original patch replaced a call to
try_to_freeze() with a call to yield(). I believe a simple reversion is
wrong because if !CONFIG_PM_SLEEP, try_to_freeze() is a noop. It should
still yield.
Signed-off-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan reports that maxcpus=1 is still broken in 2.6.23-rc4:
if CONFIG_HOTPLUG_CPU is not set, x86_64 bootup oopses in show_stat() -
for_each_possible_cpu accesses a per-cpu area which was never set up.
Alexey identified commit 61ec7567db
(ACPI: boot correctly with "nosmp" or "maxcpus=0") as the origin;
but it's not really to blame, just exposes a bug in 2.6.23-rc1's commit
8b3b295502 (Especially when !CONFIG_HOTPLUG_CPU,
avoid needlessy allocating resources for CPUs that can never become available).
rc1's test for max_cpus < 2 in start_kernel() wasn't working because
max_cpus was still NR_CPUS at that point: until rc4 moved the maxcpus
parsing earlier. Now it sets cpu_possible_map to 1 before allocating
all possible per-cpu areas; then smp_init() expands cpu_possible_map
to cpu_present_map (0xf in my case) later on.
rc1's commit has good intentions, but expects cpu_present_map to be
limited by maxcpus, which is only the case on i386. cpus_and(possible,
possible,present) might be good, but needs an audit of cpu_present_map
uses - there may well be assumptions that any cpu present is possible.
So stay safe for now and just revert those #ifndef CONFIG_HOTPLUG_CPU
optimizations in rc1's commit.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 61ec7567db ('ACPI: boot correctly
with "nosmp" or "maxcpus=0"') broke 'maxcpus=' handling on x86[-64].
maxcpus=N is now having no effect on x86_64, and freezing bootup on i386
(because of inconsistency with the separate maxcpus parsing down in
arch/i386, I guess). That's because early_param parsing is a little
different from __setup parsing, and needs the "=" omitted: then it seems
to work as the original commit intended (no mention of IO-APIC in
/proc/interrupts when maxcpus=0).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In MPS mode, "nosmp" and "maxcpus=0" boot a UP kernel with IOAPIC disabled.
However, in ACPI mode, these parameters didn't completely disable
the IO APIC initialization code and boot failed.
init/main.c:
Disable the IO_APIC if "nosmp" or "maxcpus=0"
undefine disable_ioapic_setup() when it doesn't apply.
i386:
delete ioapic_setup(), it was a duplicate of parse_noapic()
delete undefinition of disable_ioapic_setup()
x86_64:
rename disable_ioapic_setup() to parse_noapic() to match i386
define disable_ioapic_setup() in header to match i386
http://bugzilla.kernel.org/show_bug.cgi?id=1641
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Remove the top level menu "Code maturity level options", and moves its
options into menu "General setup".
This makes Kconfig less cluttered and easier to setup.
Signed-off-by: Al Boldi <a1426z@gawab.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There doesn't seem to be a good reason for ANON_INODES being
an user visible option.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6.23:
sh: Fix fs.h removal from mm.h regressions.
sh: fix get_wchan() for SH kernels without framepointers
sh: arch/sh/boot - fix shell usage
rtc: rtc-sh: Correct sh_rtc_set_time() for some SH-3 parts.
sh: remove support for sh7300 and solution engine 7300
sh: Add sh to the CC_OPTIMIZE_FOR_SIZE dependencies.
sh: Kill off virt_to_bus()/bus_to_virt().
sh: sh-sci - fix SH7708 support
sh: Restrict DSP support to specific CPUs.
sh: Silence sq compile warning on sh4 nommu.
sh: Kill the rest of the SE73180 cruft.
sh: remove support for sh73180 and solution engine 73180
sh: remove old broken pint code
sh: Reclaim beginning of P3 space for vmalloc area.
sh: Fix Dreamcast DMA issues.
sh: Add kmap_coherent()/kunmap_coherent() interface for SH-4.
Presently we only use this with CONFIG_EXPERIMENTAL, but it is
something that can be supported commonly.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.
It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.
The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.
[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are some reports that 2.6.22 has SLUB as the default. Not
true!
This will make SLUB the default for 2.6.23.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Especially when !CONFIG_HOTPLUG_CPU, avoid needlessy allocating resources for
CPUs that can never become available.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's useful sometimes to disable the softlockup checker at boottime.
Especially if it triggers during a distro install.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Basically, it will allow a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.
A new root user (uid == 0) is added to the user namespace upon creation.
Such root users have full privileges and it seems that theses privileges
should be controlled through some means (process capabilities ?)
The unshare is not included in this patch.
Changes since [try #4]:
- Updated get_user_ns and put_user_ns to accept NULL, and
get_user_ns to return the namespace.
Changes since [try #3]:
- moved struct user_namespace to files user_namespace.{c,h}
Changes since [try #2]:
- removed struct user_namespace* argument from find_user()
Changes since [try #1]:
- removed struct user_namespace* argument from find_user()
- added a root_user per user namespace
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Andrew Morgan <agm@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only
deactivate the unshare of the uts and ipc namespaces and do not improve
performance.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some buses (e.g. USB and MMC) do their scanning of devices in the
background, causing a race between them and prepare_namespace(). In order
to be able to use these buses without an initrd, we now wait for the device
specified in root= to actually show up.
If the device never shows up than we will hang in an infinite loop. In
order to not mess with setups that reboot on panic, the feature must be
turned on via the command line option "rootwait".
[bunk@stusta.de: root_wait can become static]
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change menuconfig objects from "menu, config" into "menuconfig" so that the
user can disable the whole feature without entering its menu first.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently slob is disabled if we're using sparsemem, due to an earlier
patch from Goto-san. Slob and static sparsemem work without any trouble as
it is, and the only hiccup is a missing slab_is_available() in the case of
sparsemem extreme. With this, we're rid of the last set of restrictions
for slob usage.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h. the serial8250_ports need to be probed late in
serial initializing stage. the console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup will return -ENDEV, and console
ttyS0 can not be enabled at that time. need to wait till uart_add_one_port in
drivers/serial/serial_core.c to call register_console to get console ttyS0.
that is too late.
Make early_uart to use early_param, so uart console can be used earlier. Make
it to be bootconsole with CON_BOOT flag, so can use console handover feature.
and it will switch to corresponding normal serial console automatically.
new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
it will print in very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
later for console it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"menu, endmenu" that did not get cleaned up in the block patch
[ http://lkml.org/lkml/2007/4/10/251 ]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
add the init_idle_bootup_task() callback to the bootup thread,
unused at the moment. (CFS will use it to switch the scheduling
class of the boot thread to the idle class)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No arch sets ARCH_USES_SLAB_PAGE_STRUCT anymore.
Remove the experimental dependency as well since we want to have it as
a real alternative to SLAB.
It all comes down to killing a single line from init/Kconfig.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SLOB allocator should implement SLAB_DESTROY_BY_RCU correctly, because
even on UP, RCU freeing semantics are not equivalent to simply freeing
immediately. This also allows SLOB to be used on SMP.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only). It can be used instead of pipe(2) in all cases where those
would simply be used to signal events. Their kernel overhead is much lower
than pipes, and they do not consume two fds. When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it. The API is:
int eventfd(unsigned int count);
The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).
But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.
The read(2) function reads the __u64 counter value, and reset the internal
value to zero. If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that has never
been retired - unlickely, but possible).
The write(2) call writes an __u64 count value, and adds it to the current
counter. The eventfd fd supports O_NONBLOCK also.
On the kernel side, we have:
struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);
The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post an event
to userspace. The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:
http://www.xmailserver.org/eventfd-bench.c
This is the eventfd-based version of pipetest-4 (pipe(2) based):
http://www.xmailserver.org/pipetest-4.c
Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.
[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces a new system call for timers events delivered though
file descriptors. This allows timer event to be used with standard POSIX
poll(2), select(2) and read(2). As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.
The system call is defined as:
int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);
The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
s new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.
If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.
If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
tv_nsec == 0), this is the period at which the following ticks should be
generated.
The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.
The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
returned.
The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.
A quick test program, shows timerfd working correctly on my amd64 box:
http://www.xmailserver.org/timerfd-test.c
[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch add an anonymous inode source, to be used for files that need
and inode only in order to create a file*. We do not care of having an
inode for each file, and we do not even care of having different names in
the associated dentries (dentry names will be same for classes of file*).
This allow code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Explicitly set pgid and sid of init process to 1.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Otherwise people get asked about SLUB_DEBUG even if they have another
slab allocator enabled.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
Currently there is a circular reference between work queue initialization
and kthread initialization. This prevents the kthread infrastructure from
initializing until after work queues have been initialized.
We want the properties of tasks created with kthread_create to be as close
as possible to the init_task and to not be contaminated by user processes.
The later we start our kthreadd that creates these tasks the harder it is
to avoid contamination from user processes and the more of a mess we have
to clean up because the defaults have changed on us.
So this patch modifies the kthread support to not use work queues but to
instead use a simple list of structures, and to have kthreadd start from
init_task immediately after our kernel thread that execs /sbin/init.
By being a true child of init_task we only have to change those process
settings that we want to have different from init_task, such as our process
name, the cpus that are allowed, blocking all signals and setting SIGCHLD
to SIG_IGN so that all of our children are reaped automatically.
By being a true child of init_task we also naturally get our ppid set to 0
and do not wind up as a child of PID == 1. Ensuring that tasks generated
by kthread_create will not slow down the functioning of the wait family of
functions.
[akpm@linux-foundation.org: use interruptible sleeps]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Display all possible partitions when the root filesystem is not mounted.
This helps to track spell'o's and missing drivers.
Updated to work with newer kernels.
Example output:
VFS: Cannot open root device "foobar" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
0800 8388608 sda driver: sd
0801 192748 sda1
0802 8193150 sda2
0810 4194304 sdb driver: sd
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[akpm@linux-foundation.org: cleanups, fix printk warnings]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: Dave Gilbert <linux@treblig.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix some of the spelling issues. Fix sentences. Discourage SLOB use
since SLUB can pack objects denser.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components
of SLUB. Thus SLUB will be able to replace SLOB. SLUB can arrange objects in
a denser way than SLOB and the code size should be minimal without debugging
and sysfs support.
Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG.
CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB. SLUB enables
debugging via a boot parameter. SLUB debug code should always be present.
CONFIG_SLUB_DEBUG can be modified in the embedded config section.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the reference to an external mqueue library since that was
merged into glibc in 2004.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Several people have observed that perhaps LOG_BUF_SHIFT should be in a more
obvious place than under DEBUG_KERNEL. Under some circumstances (such as the
PARISC architecture), DEBUG_KERNEL can increase kernel size, which is an
undesirable trade off for something as trivial as increasing the kernel log
buffer size.
Instead, move LOG_BUF_SHIFT into "General Setup", so that people are more
likely to be able to change it such a circumstance that the default buffer
size is insufficient.
Signed-off-by: Alistair John Strachan <s0348365@sms.ed.ac.uk>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
enhance the initcall_debug boot option:
- measure the time the initcall took to execute and report
it in units of milliseconds.
- show the return code of initcalls (useful to see failures and
to make sure that an initcall hung)
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a proper protype for prepare_namespace() in include/linux/init.h.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make handle_initrd() call try_to_freeze() in a suitable place instead of setting
PF_NOFREEZE for the current task.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for the Analog Devices Blackfin processor architecture, and
currently supports the BF533, BF532, BF531, BF537, BF536, BF534, and BF561
(Dual Core) devices, with a variety of development platforms including those
avaliable from Analog Devices (BF533-EZKit, BF533-STAMP, BF537-STAMP,
BF561-EZKIT), and Bluetechnix! Tinyboards.
The Blackfin architecture was jointly developed by Intel and Analog Devices
Inc. (ADI) as the Micro Signal Architecture (MSA) core and introduced it in
December of 2000. Since then ADI has put this core into its Blackfin
processor family of devices. The Blackfin core has the advantages of a clean,
orthogonal,RISC-like microprocessor instruction set. It combines a dual-MAC
(Multiply/Accumulate), state-of-the-art signal processing engine and
single-instruction, multiple-data (SIMD) multimedia capabilities into a single
instruction-set architecture.
The Blackfin architecture, including the instruction set, is described by the
ADSP-BF53x/BF56x Blackfin Processor Programming Reference
http://blackfin.uclinux.org/gf/download/frsrelease/29/2549/Blackfin_PRM.pdf
The Blackfin processor is already supported by major releases of gcc, and
there are binary and source rpms/tarballs for many architectures at:
http://blackfin.uclinux.org/gf/project/toolchain/frs There is complete
documentation, including "getting started" guides available at:
http://docs.blackfin.uclinux.org/ which provides links to the sources and
patches you will need in order to set up a cross-compiling environment for
bfin-linux-uclibc
This patch, as well as the other patches (toolchain, distribution,
uClibc) are actively supported by Analog Devices Inc, at:
http://blackfin.uclinux.org/
We have tested this on LTP, and our test plan (including pass/fails) can
be found at:
http://docs.blackfin.uclinux.org/doku.php?id=testing_the_linux_kernel
[m.kozlowski@tuxland.pl: balance parenthesis in blackfin header files]
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Aubrey Li <aubrey.li@analog.com>
Signed-off-by: Jie Zhang <jie.zhang@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.
A. Management of object queues
A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
each allocating CPU and use objects from a slab directly instead of
queueing them up.
B. Storage overhead of object queues
SLAB Object queues exist per node, per CPU. The alien cache queue even
has a queue array that contain a queue for each processor on each
node. For very large systems the number of queues and the number of
objects that may be caught in those queues grows exponentially. On our
systems with 1k nodes / processors we have several gigabytes just tied up
for storing references to objects for those queues This does not include
the objects that could be on those queues. One fears that the whole
memory of the machine could one day be consumed by those queues.
C. SLAB meta data overhead
SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block. SLUB keeps
all meta data in the corresponding page_struct. Objects can be naturally
aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
boundaries and can fit tightly into a 4k page with no bytes left over.
SLAB cannot do this.
D. SLAB has a complex cache reaper
SLUB does not need a cache reaper for UP systems. On SMP systems
the per CPU slab may be pushed back into partial list but that
operation is simple and does not require an iteration over a list
of objects. SLAB expires per CPU, shared and alien object queues
during cache reaping which may cause strange hold offs.
E. SLAB has complex NUMA policy layer support
SLUB pushes NUMA policy handling into the page allocator. This means that
allocation is coarser (SLUB does interleave on a page level) but that
situation was also present before 2.6.13. SLABs application of
policies to individual slab objects allocated in SLAB is
certainly a performance concern due to the frequent references to
memory policies which may lead a sequence of objects to come from
one node after another. SLUB will get a slab full of objects
from one node and then will switch to the next.
F. Reduction of the size of partial slab lists
SLAB has per node partial lists. This means that over time a large
number of partial slabs may accumulate on those lists. These can
only be reused if allocator occur on specific nodes. SLUB has a global
pool of partial slabs and will consume slabs from that pool to
decrease fragmentation.
G. Tunables
SLAB has sophisticated tuning abilities for each slab cache. One can
manipulate the queue sizes in detail. However, filling the queues still
requires the uses of the spin lock to check out slabs. SLUB has a global
parameter (min_slab_order) for tuning. Increasing the minimum slab
order can decrease the locking overhead. The bigger the slab order the
less motions of pages between per CPU and partial lists occur and the
better SLUB will be scaling.
G. Slab merging
We often have slab caches with similar parameters. SLUB detects those
on boot up and merges them into the corresponding general caches. This
leads to more effective memory use. About 50% of all caches can
be eliminated through slab merging. This will also decrease
slab fragmentation because partial allocated slabs can be filled
up again. Slab merging can be switched off by specifying
slub_nomerge on boot up.
Note that merging can expose heretofore unknown bugs in the kernel
because corrupted objects may now be placed differently and corrupt
differing neighboring objects. Enable sanity checks to find those.
H. Diagnostics
The current slab diagnostics are difficult to use and require a
recompilation of the kernel. SLUB contains debugging code that
is always available (but is kept out of the hot code paths).
SLUB diagnostics can be enabled via the "slab_debug" option.
Parameters can be specified to select a single or a group of
slab caches for diagnostics. This means that the system is running
with the usual performance and it is much more likely that
race conditions can be reproduced.
I. Resiliency
If basic sanity checks are on then SLUB is capable of detecting
common error conditions and recover as best as possible to allow the
system to continue.
J. Tracing
Tracing can be enabled via the slab_debug=T,<slabcache> option
during boot. SLUB will then protocol all actions on that slabcache
and dump the object contents on free.
K. On demand DMA cache creation.
Generally DMA caches are not needed. If a kmalloc is used with
__GFP_DMA then just create this single slabcache that is needed.
For systems that have no ZONE_DMA requirement the support is
completely eliminated.
L. Performance increase
Some benchmarks have shown speed improvements on kernbench in the
range of 5-10%. The locking overhead of slub is based on the
underlying base allocation size. If we can reliably allocate
larger order pages then it is possible to increase slub
performance much further. The anti-fragmentation patches may
enable further performance increases.
Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
SLUB Boot options
slub_nomerge Disable merging of slabs
slub_min_order=x Require a minimum order for slab caches. This
increases the managed chunk size and therefore
reduces meta data and locking overhead.
slub_min_objects=x Mininum objects per slab. Default is 8.
slub_max_order=x Avoid generating slabs larger than order specified.
slub_debug Enable all diagnostics for all caches
slub_debug=<options> Enable selective options for all caches
slub_debug=<o>,<cache> Enable selective options for a certain set of
caches
Available Debug options
F Double Free checking, sanity and resiliency
R Red zoning
P Object / padding poisoning
U Track last free / alloc
T Trace all allocs / frees (only use for individual slabs).
To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.
[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nr_cpu_ids value is currently only calculated in smp_init. However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init. So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.
Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take. If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated. If it is set to the maximum possible then we only waste
some memory for early boot users.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (38 commits)
kconfig: fix mconf segmentation fault
kbuild: enable use of code from a different dir
kconfig: error out if recursive dependencies are found
kbuild: scripts/basic/fixdep segfault on pathological string-o-death
kconfig: correct minor typo in Kconfig warning message.
kconfig: fix path to modules.txt in Kconfig help
usr/Kconfig: fix typo
kernel-doc: alphabetically-sorted entries in index.html of 'htmldocs'
kbuild: be more explicit on missing .config file
kbuild: clarify the creation of the LOCALVERSION_AUTO string.
kbuild: propagate errors from find in scripts/gen_initramfs_list.sh
kconfig: refer to qt3 if we cannot find qt libraries
kbuild: handle compressed cpio initramfs-es
kbuild: ignore section mismatch warning for references from .paravirtprobe to .init.text
kbuild: remove stale comment in modpost.c
kbuild/mkuboot.sh: allow spaces in CROSS_COMPILE
kbuild: fix make mrproper for Documentation/DocBook/man
kbuild: remove kconfig binaries during make mrproper
kconfig/menuconfig: do not hardcode '.config'
kbuild: override build timestamp & version
...
Clarify the creation of the LOCALVERSION_AUTO string during kernel
configuration, and fix a couple typoes while we're there.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
In init/main.c we have a reference from rest_init() to .init.text
which is intentional.
Rename the function 'init' to 'kernel_init' to make it a
kernel wide unique symbol and whitelist the reference.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Let's allow page-alignment in general for per-cpu data (wanted by Xen, and
Ingo suggested KVM as well).
Because larger alignments can use more room, we increase the max per-cpu
memory to 64k rather than 32k: it's getting a little tight.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
initramfs ended up depending on BLOCK:
INITRAMFS_SOURCE <-- BLK_DEV_INITRD <-- BLOCK
This inhibits use of customized-initramfs-over-ramfs without block layer
(ramfs would still be enabled), useful in embedded applications.
Move BLK_DEV_INITRD out of 'drivers/block/Kconfig' and into 'init/Kconfig',
make it unconditional.
Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We frequently need the maximum number of possible processors in order to
allocate arrays for all processors. So far this was done using
highest_possible_processor_id(). However, we do need the number of
processors not the highest id. Moreover the number was so far dynamically
calculated on each invokation. The number of possible processors does not
change when the system is running. We can therefore calculate that number
once.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc gets:
init/main.c: In function `do_basic_setup':
init/main.c:714: warning: implicit declaration of function `init_irq_proc'
but we cannot include linux/irq.h in generic code.
Fix it by moving the declaration into linux/interrupt.h instead.
And make sure all code that defines init_irq_proc() is including
linux/interrupt.h.
And nuke an ifdef-in-C
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With Ingo Molnar <mingo@elte.hu>
The tick-management code is the first user of the clockevents layer. It takes
clock event devices from the clock events core and uses them to provide the
periodic tick.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this change the sysctl inodes can be cached and nothing needs to be done
when removing a sysctl table.
For a cost of 2K code we will save about 4K of static tables (when we remove
de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
about half that on a 32bit arch.
The speed feels about the same, even though we can now cache the sysctl
dentries :(
We get the core advantage that we don't need to have a 1 to 1 mapping between
ctl table entries and proc files. Making it possible to have /proc/sys vary
depending on the namespace you are in. The currently merged namespaces don't
have an issue here but the network namespace under /proc/sys/net needs to have
different directories depending on which network adapters are visible. By
simply being a cache different directories being visible depending on who you
are is trivial to implement.
[akpm@osdl.org: fix uninitialised var]
[akpm@osdl.org: fix ARM build]
[bunk@stusta.de: make things static]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded
with special cases, and by keeping all of the ipc logic to together it makes
the code a little more readable.
[gcoady.lk@gmail.com: build fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Grant Coady <gcoady.lk@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.
To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.
Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o init() is a non __init function in .text section but it calls many
functions which are in .init.text section. Hence MODPOST generates lots
of cross reference warnings on i386 if compiled with CONFIG_RELOCATABLE=y
WARNING: vmlinux - Section mismatch: reference to .init.text:smp_prepare_cpus from .text between 'init' (at offset 0xc0101049) and 'rest_init'
WARNING: vmlinux - Section mismatch: reference to .init.text:migration_init from .text between 'init' (at offset 0xc010104e) and 'rest_init'
WARNING: vmlinux - Section mismatch: reference to .init.text:spawn_ksoftirqd from .text between 'init' (at offset 0xc0101053) and 'rest_init'
o This patch breaks down init() in two parts. One part which can go
in .init.text section and can be freed and other part which has to
be non __init(init_post()). Now init() calls init_post() and init_post()
does not call any functions present in .init sections. Hence getting
rid of warnings.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Current implementation stores a static command-line buffer allocated to
COMMAND_LINE_SIZE size. Most architectures stores two copies of this buffer,
one for future reference and one for parameter parsing.
Current kernel command-line size for most architecture is much too small for
module parameters, video settings, initramfs paramters and much more. The
problem is that setting COMMAND_LINE_SIZE to a grater value, allocates static
buffers.
In order to allow a greater command-line size, these buffers should be
dynamically allocated or marked as init disposable buffers, so unused memory
can be released.
This patch renames the static saved_command_line variable into
boot_command_line adding __initdata attribute, so that it can be disposed
after initialization. This rename is required so applications that use
saved_command_line will not be affected by this change.
It reintroduces saved_command_line as dynamically allocated buffer to match
the data in boot_command_line.
It also mark secondary command-line buffer as __initdata, and copies it to
dynamically allocated static_command_line buffer components may hold reference
to it after initialization.
This patch is for linux-2.6.20-rc4-mm1 and is divided to target each
architecture. I could not check this in any architecture so please forgive me
if I got it wrong.
The per-architecture modification is very simple, use boot_command_line in
place of saved_command_line. The common code is the change into dynamic
command-line.
This patch:
1. Rename saved_command_line into boot_command_line, mark as init
disposable.
2. Add dynamic allocated saved_command_line.
3. Add dynamic allocated static_command_line.
4. During startup copy: boot_command_line into saved_command_line. arch
command_line into static_command_line.
5. Parse static_command_line and not arch command_line, so arch
command_line may be freed.
Signed-off-by: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since they depends on TASKSTATS, it would be nice to move them closer to
another options depending on TASKSTATS.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the last (and commented out) invocation of the obsolete
smp_commence() call.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file init/initramfs.c is always compiled and linked in the kernel
vmlinux even when BLK_DEV_RAM and BLK_DEV_INITRD are disabled and the
system isn't using any form of an initramfs or initrd. In this situation
the code is only used to unpack a (static) default initial rootfilesystem.
The current init/initramfs.c code. usr/initramfs_data.o compiles to a size
of ~15 kbytes. Disabling BLK_DEV_RAM and BLK_DEV_INTRD shrinks the kernel
code size with ~60 Kbytes.
This patch avoids compiling in the code and data for initramfs support if
CONFIG_BLK_DEV_INITRD is not defined. Instead of the initramfs code and
data it uses a small routine in init/noinitramfs.c to setup an initial
static default environment for mounting a rootfilesystem later on in the
kernel initialisation process. The new code is: 164 bytes of size.
The patch is separated in two parts:
1) doesn't compile initramfs code when CONFIG_BLK_DEV_INITRD is not set
2) changing all plaforms vmlinux.lds.S files to not reserve an area of
PAGE_SIZE when CONFIG_BLK_DEV_INITRD is not set.
[deweerdt@free.fr: warning fix]
Signed-off-by: Jean-Paul Saman <jean-paul.saman@nxp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add retain_initrd option to control freeing of initrd memory after
extraction. By default, free memory as previously.
The first boot will need to hold a copy of the in memory fs for the second
boot. This image can be large (much larger than the kernel), hence we can
save time when the memory loader is slow. Also, it reduces the memory
footprint while extracting the first boot since you don't need another copy
of the fs.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It might save a few bytes after bootup, but it causes the string to be
linked in at the end of the final vmlinux image, which defeats the whole
point of doing all this, namely allowing some broken user-space binaries
to search for the kernel version string in the kernel binary.
So just remove the __init specifier.
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Andrew Morton <akpm@osdl.org>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Some functions which should have been in init sections as they are called
only once. Put them in init sections. Otherwise MODPOST generates warning
as these functions are placed in .text and they end up accessing something
in init sections.
WARNING: vmlinux - Section mismatch: reference to .init.text:migration_init
from .text between 'do_pre_smp_initcalls' (at offset 0xc01000d1) and
'run_init_process'
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Revert previous attempts at messing with the linux banner string and
simply use a separate format string for proc.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The calls made by parse_parms to other initialization code might enable
interrupts again way too early.
Having interrupts on this early can make systems PANIC when they initialize
the IRQ controllers (which happens later in the code). This patch detects
that irq's are enabled again, barfs about it and disables them again as a
safety net.
[akpm@osdl.org: cleanups]
Signed-off-by: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
compile.h is created super-late in the build. But proc_misc.c want to include
it, and it's generally not sane to have a header file in include/linux be
created at the end of the build: it's either not present or, worse, wrong for
most of the build.
So the patch arranges for compile.h to be built at the start of the build
process. It also consolidates the compile.h rules with those for version.h
and utsname.h, so they all get built together.
I hope. My chances of having got this right are about 2%.
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is to disallow to make SLOB with SMP or SPARSEMEM. This avoids latent
troubles of SLOB with SLAB_DESTROY_BY_RCU. And fix compile error.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The VM event counters, enabled by CONFIG_VM_EVENT_COUNTERS, which provides
VM event counters in /proc/vmstat, has become more essential to
non-EMBEDDED kernel configurations than they were in the past. Comments in
the code and the Kconfig configuration explanation were stale, downplaying
their role excessively.
Refresh those comments to correctly reflect the current role of VM event
counters.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a prototype for driver_init() in include/linux/device.h.
Also remove a static function of the same name in drivers/acpi/ibm_acpi.c to
ibm_acpi_driver_init() to fix the namespace collision.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We should not initialize rootfs before all the core initializers have
run. So do it as a separate stage just before starting the regular
driver initializers.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As reported by Andy Whitcroft, at least the SLES9 initrd build process
depends on getting the kernel version from the kernel binary. It does
that by simply trawling the binary and looking for the signature of the
"linux_banner" string (the string "Linux version " to be exact. Which
is really broken in itself, but whatever..)
That got broken when the string was changed to allow /proc/version to
change the UTS release information dynamically, and "get_kernel_version"
thus returned "%s" (see commit a2ee8649ba:
"[PATCH] Fix linux banner utsname information").
This just restores "linux_banner" as a static string, which should fix
the version finding. And /proc/version simply uses a different string.
To avoid wasting even that miniscule amount of memory, the early boot
string should really be marked __initdata, but that just causes the same
bug in SLES9 to re-appear, since it will then find other occurrences of
"Linux version " first.
Cc: Andy Whitcroft <apw@shadowen.org>
Acked-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Steve Fox <drfickle@us.ibm.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The present per-task IO accounting isn't very useful. It simply counts the
number of bytes passed into read() and write(). So if a process reads 1MB
from an already-cached file, it is accused of having performed 1MB of I/O,
which is wrong.
(David Wright had some comments on the applicability of the present logical IO accounting:
For billing purposes it is useless but for workload analysis it is very
useful
read_bytes/read_calls average read request size
write_bytes/write_calls average write request size
read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
write_bytes/write_blocks ie logical/physical guess since pdflush writes can
be missed
I often look for logical larger than physical to see filesystem cache
problems. And the bytes/cpusec can help find applications that are
dominating the cache and causing slow interactive response from page cache
contention.
I want to find the IO intensive applications and make sure they are doing
efficient IO. Thus the acctcms(sysV) or csacms command would give the high
IO commands).
This patchset adds new accounting which tries to be more accurate. We account
for three things:
reads:
attempt to count the number of bytes which this process really did cause
to be fetched from the storage layer. Done at the submit_bio() level, so it
is accurate for block-backed filesystems. I also attempt to wire up NFS and
CIFS.
writes:
attempt to count the number of bytes which this process caused to be sent
to the storage layer. This is done at page-dirtying time.
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it will
have been accounted as having caused 1MB of write.
So...
cancelled_writes:
account the number of bytes which this process caused to not happen, by
truncating pagecache.
We _could_ just subtract this from the process's `write' accounting. But
that means that some processes would be reported to have done negative
amounts of write IO, which is silly.
So we just report the raw number and punt this decision up to userspace.
Now, we _could_ account for writes at the physical I/O level. But
- This would require that we track memory-dirtying tasks at the per-page
level (would require a new pointer in struct page).
- It would mean that IO statistics for a process are usually only available
long after that process has exitted. Which means that we probably cannot
communicate this info via taskstats.
This patch:
Wire up the kernel-private data structures and the accessor functions to
manipulate them.
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a per pid_namespace child-reaper. This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space. Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.
This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
utsname information is shown in the linux banner, which also is used for
/proc/version (which can have different utsname values inside a uts
namespaces). this patch makes the varying data arguments and changes the
string to a format string, using those arguments.
Signed-off-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Keith says
Compiling 2.6.19-rc6 with gcc version 4.1.0 (SUSE Linux), wait_hpet_tick is
optimized away to a never ending loop and the kernel hangs on boot in timer
setup.
0000001a <wait_hpet_tick>:
1a: 55 push %ebp
1b: 89 e5 mov %esp,%ebp
1d: eb fe jmp 1d <wait_hpet_tick+0x3>
This is not a problem with gcc 3.3.5. Adding barrier() calls to
wait_hpet_tick does not help, making the variables volatile does.
And the consensus is that gcc-4.1.0 is busted. Suse went and shipped
gcc-4.1.0 so we cannot ban it. Add a warning.
Cc: Keith Owens <kaos@ocs.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It turns out that the "-c" option of cpio is highly unportable even between
distros let alone unix variants, and may actually make the wrong type of
cpio archive. I just wasted quite some time on this, and the kernel can
detect this and warn about it (it's __init memory so it gets thrown away
and thus there is no runtime overhead)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Both lhype and Xen want to call the core of the x86 cpu detect code before
calling start_kernel.
(extracted from larger patch)
AK: folded in start_kernel header patch
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Provide a way to support older versions of udev that are shipped in
older distros. If this option is disabled, it will also turn off the
compatible symlinks in sysfs that older programs might rely on.
When in doubt, or if running a distro older than 2006, say Yes here.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The basic issue is that despite have been deprecated and warned about as a
very bad thing in the man pages since its inception there are a few real
users of sys_sysctl. It was my assumption that because sysctl had been
deprecated for all of 2.6 there would be no user space users by this point,
so I initially gave sys_sysctl a very short deprecation period.
Now that I know there are a few real users the only sane way to proceed
with deprecation is to push the time limit out to a year or two work and
work with distributions that have big testing pools like fedora core to
find these last remaining users.
Which means that the sys_sysctl interface needs to be maintained in the
meantime.
Since I have provided a technical measure that allows us to add new sysctl
entries without reserving more binary numbers I believe that is enough to
fix the sys_sysctl binary interface maintenance problems, because there is
no longer a need to change the binary interface at all.
Since the sys_sysctl implementation needs to stay around for a while and
the worst of the maintenance issues that caused us to occasionally break
the ABI have been addressed I don't see any advantage in continuing with
the removal of sys_sysctl.
So instead of merely increasing the deprecation period this patch removes
the deprecation of sys_sysctl and modifies the kernel to compile the code
in by default.
With committing to maintain sys_sysctl we get all of the advantages of a
fast interface for anything that needs it. Currently sys_sysctl is about
5x faster than /proc/sys, for the same string data.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This changes the dwarf2 unwinder to do a binary search for CIEs
instead of a linear work. The linker is unfortunately not
able to build a proper lookup table at link time, instead it creates
one at runtime as soon as the bootmem allocator is usable (so you'll continue
using the linear lookup for the first [hopefully] few calls).
The code should be ready to utilize a build-time created table once
a fixed linker becomes available.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This should make sure that, for UML, host's configuration files are not
considered, which avoids various pains to the user. Our dependency are such
that the obtained Kconfig will be valid and will lead to successful
compilation - however they cannot prevent an user from disabling any boot
device, and if an option is not set in the read .config (say
/boot/config-XXX), with make menuconfig ARCH=um, it is not set. This always
disables UBD and all console I/O channels, which leads to non-working UML
kernels, so this bothers users - especially now, since it will happen on
almost every machine (/boot/config-`uname -r` exists almost on every machine).
It can be workarounded with make defconfig ARCH=um, but it is non-obvious and
can be avoided, so please _do_ merge this patch.
Given the existence of options, it could be interesting to implement
(additionally) "option required" - with it, Kconfig will refuse reading a
.config file (from wherever it comes) if the given option is not set. With
this, one could mark with it the option characteristic of the given
architecture (it was an old proposal of Roman Zippel, when I pointed out our
problem):
config UML
option required
default y
However this should be further discussed:
*) for x86, it must support constructs like:
==arch/i386/Kconfig==
config 64BIT
option required
default n
where Kconfig must require that CONFIG_64BIT is disabled or not present in the
read .config.
*) do we want to do such checks only for the starting defconfig or also for
.config? Which leads to:
*) I may want to port a x86_64 .config to x86 and viceversa, or even among more
different archs. Should that be allowed, and in which measure (the user may
force skipping the check for a .config or it is only given a warning by
default)?
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: <kbuild-devel@lists.sourceforge.net>
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Once upon a time we needed to fixed limit to the number of md devices,
probably because we preallocated some array. This need no longer exists, but
we still have an arbitrary limit.
So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
part of a device number.
Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
remove MD_THREAD_NAME_MAX which hasn't been used for a while.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a few places in the kernel where the init task is signaled. The
ctrl+alt+del sequence is one them. It kills a task, usually init, using a
cached pid (cad_pid).
This patch replaces the pid_t by a struct pid to avoid pid wrap around
problem. The struct pid is initialized at boot time in init() and can be
modified through systctl with
/proc/sys/kernel/cad_pid
[ I haven't found any distro using it ? ]
It also introduces a small helper routine kill_cad_pid() which is used
where it seemed ok to use cad_pid instead of pid 1.
[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The use of execve() in the kernel is dubious, since it relies on the
__KERNEL_SYSCALLS__ mechanism that stores the result in a global errno
variable. As a first step of getting rid of this, change all users to a
global kernel_execve function that returns a proper error code.
This function is a terrible hack, and a later patch removes it again after the
kernel syscalls are gone.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch set allows to unshare IPCs and have a private set of IPC objects
(sem, shm, msg) inside namespace. Basically, it is another building block of
containers functionality.
This patch implements core IPC namespace changes:
- ipc_namespace structure
- new config option CONFIG_IPC_NS
- adds CLONE_NEWIPC flag
- unshare support
[clg@fr.ibm.com: small fix for unshare of ipc namespace]
[akpm@osdl.org: build fix]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch defines the uts namespace and some manipulators.
Adds the uts namespace to task_struct, and initializes a
system-wide init namespace.
It leaves a #define for system_utsname so sysctl will compile.
This define will be removed in a separate patch.
[akpm@osdl.org: build fix, cleanup]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add extended system accounting handling over taskstats interface. A
CONFIG_TASK_XACCT flag is created to enable the extended accounting code.
Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SYSCTL should still depend on EMBEDDED. This unbreaks the EMBEDDED menu
(from the recent SYSCTL_SYCALL menu option patch).
Fix typos in new SYSCTL_SYSCALL menu.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The driver for /proc/config.gz consumes rather a lot of memory and it is in
fact possible to build it as a module.
In some ways this is a bit risky, because the .config which is used for
compiling kernel/configs.c isn't necessarily the same as the .config which was
used to build vmlinux.
But OTOH the potential memory savings are decent, and it'd be fairly dumb to
build your configs.o with a different .config.
Signed-off-by: Andrew Morton <akpm@google.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since sys_sysctl is deprecated start allow it to be compiled out. This
should catch any remaining user space code that cares, and paves the way
for further sysctl cleanups.
[akpm@osdl.org: If sys_sysctl() is not compiled-in, emit a warning]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Resetting the devices during driver initialization can be a costly
operation in terms of time (especially scsi devices). This option can be
used by drivers to know that user forcibly wants the devices to be reset
during initialization.
This option can be useful while kernel is booting in unreliable
environment. For ex. during kdump boot where devices are in unknown
random state and BIOS execution has been skipped.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
[PATCH] Don't set calgary iommu as default y
[PATCH] i386/x86-64: New Intel feature flags
[PATCH] x86: Add a cumulative thermal throttle event counter.
[PATCH] i386: Make the jiffies compares use the 64bit safe macros.
[PATCH] x86: Refactor thermal throttle processing
[PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
[PATCH] Fix unwinder warning in traps.c
[PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
[PATCH] x86: Move direct PCI scanning functions out of line
[PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
[PATCH] Don't leak NT bit into next task
[PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
[PATCH] Fix some broken white space in ia32_signal.c
[PATCH] Initialize argument registers for 32bit signal handlers.
[PATCH] Remove all traces of signal number conversion
[PATCH] Don't synchronize time reading on single core AMD systems
[PATCH] Remove outdated comment in x86-64 mmconfig code
[PATCH] Use string instructions for Core2 copy/clear
[PATCH] x86: - restore i8259A eoi status on resume
[PATCH] i386: Split multi-line printk in oops output.
...
We currently assume that boot parameters which are handled by
early_param() will not overlap boot parameters handled by __setup: if
they do, behaviour is dependent on link order, usually meaning __setup
will not get called.
ACPI wants to use early_param("pci"), and pci uses __setup("pci="), so
we modify the core to let them coexist: "pci=noacpi" will now get
passed to both.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
This adds the infrastructure for drivers to do a threaded probe, and
waits at init time for all currently outstanding probes to complete.
A new kernel thread will be created when the probe() function for the
driver is called, if the multithread_probe bit is set in the driver
saying it can support this kind of operation.
I have tested this with USB and PCI, and it works, and shaves off a lot
of time in the boot process, but there are issues with finding root boot
disks, and some USB drivers assume that this can never happen, so it is
currently not enabled for any bus type. Individual drivers can enable
this right now if they wish, and bus authors can selectivly turn it on
as well, once they determine that their subsystem will work properly
with it.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix two problems with the CONFIG_EMBEDDED submenu:
(1) The menu was split in two by the rt_mutex patch, which moved
half the items into the "General setup" menu.
(2) CONFIG_SYSCTL and CONFIG_UID16 were added to the main menu
instead of the submenu.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create a "taskstats" interface based on generic netlink (NETLINK_GENERIC
family), for getting statistics of tasks and thread groups during their
lifetime and when they exit. The interface is intended for use by multiple
accounting packages though it is being created in the context of delay
accounting.
This patch creates the interface without populating the fields of the data
that is sent to the user in response to a command or upon the exit of a task.
Each accounting package interested in using taskstats has to provide an
additional patch to add its stats to the common structure.
[akpm@osdl.org: cleanups, Kconfig fix]
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initialization code related to collection of per-task "delay" statistics which
measure how long it had to wait for cpu, sync block io, swapping etc. The
collection of statistics and the interface are in other patches. This patch
sets up the data structures and allows the statistics collection to be
disabled through a kernel boot parameter.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (recursive) locking code to the lock validator. Has no effect
on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.
Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.
What does the lock validator do? It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules. If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal. If the
new rule could create a deadlock scenario then this condition is printed out.
When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios. In a typical system this means millions of separate
scenarios. This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem). [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]
Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically. In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!). So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself! In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.
To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class". For example, all struct inode objects
in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class. The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules. The
set of rules persist during the lifetime of the kernel.
To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:
lock-classes: 694 [max: 2048]
direct dependencies: 1598 [max: 8192]
indirect dependencies: 17896
all direct dependencies: 16206
dependency chains: 1910 [max: 8192]
in-hardirq chains: 17
in-softirq chains: 105
in-process chains: 1065
stack-trace entries: 38761 [max: 131072]
combined max dependencies: 2033928
hardirq-safe locks: 24
hardirq-unsafe locks: 176
softirq-safe locks: 53
softirq-unsafe locks: 137
irq-safe locks: 59
irq-unsafe locks: 176
The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.
More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:
http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Generic lock debugging:
- generalized lock debugging framework. For example, a bug in one lock
subsystem turns off debugging in all lock subsystems.
- got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
the mutex/rtmutex debugging code: it caused way too much prototype
hackery, and lockdep will give the same information anyway.
- ability to do silent tests
- check lock freeing in vfree too.
- more finegrained debugging options, to allow distributions to
turn off more expensive debugging features.
There's no separate 'held mutexes' list anymore - but there's a 'held locks'
stack within lockdep, which unifies deadlock detection across all lock
classes. (this is independent of the lockdep validation stuff - lockdep first
checks whether we are holding a lock already)
Here are the current debugging options:
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
which do:
config DEBUG_MUTEXES
bool "Mutex debugging, basic checks"
config DEBUG_LOCK_ALLOC
bool "Detect incorrect freeing of live mutexes"
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
s390's console_init must enable interrupts, but early_boot_irqs_on() gets
called later. To avoid problems move console_init() after local_irq_enable().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We're not reay to take a timer interrupt until timekeeping_init() has run.
But time_init() will start the time interrupt and if it is called with
local interrupts enabled we'll immediately take an interrupt and die.
Fix that by running timekeeping_init() prior to time_init().
We don't know _why_ local interrupts got enabled on Jesse Brandeburg's
machine. That's a separate as-yet-unsolved problem. THe patch adds a little
bit of debugging to detect that.
This whole requirement that local interrupts be held off during early boot
keeps on biting us.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
include/linux/version.h contained both actual KERNEL version
and UTS_RELEASE that contains a subset from git SHA1 for when
kernel was compiled as part of a git repository.
This had the unfortunate side-effect that all files including version.h
would be recompiled when some git changes was made due to changes SHA1.
Split it out so we keep independent parts in separate files.
Also update checkversion.pl script to no longer check for UTS_RELEASE.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Fix the INIT_ENV_ARG_LIMIT dependencies to what seems to have been
intended.
Spotted by Jean-Luc Leger.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, smp_processor_id() isn't necessarily set up until setup_arch().
But it's used in boot_cpu_init() and printk() and perhaps in other places,
prior to setup_arch() being called.
So provide a new smp_setup_processor_id() which is called before anything
else, wire it up for Voyager (which boots on a CPU other than #0, and broke).
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat. They have no
essential function for the VM.
We use a simple increment of per cpu variables. In order to avoid the most
severe races we disable preempt. Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter. However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.
In the non preempt case this results in a simple increment for each
counter. For many architectures this will be reduced by the compiler to a
single instruction. This single instruction is atomic for i386 and x86_64.
And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.
The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.
The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).
Benefits:
- VM event counter operations usually reduce to a single inline instruction
on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.
References:
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/devfs-2.6: (22 commits)
[PATCH] devfs: Remove it from the feature_removal.txt file
[PATCH] devfs: Last little devfs cleanups throughout the kernel tree.
[PATCH] devfs: Rename TTY_DRIVER_NO_DEVFS to TTY_DRIVER_DYNAMIC_DEV
[PATCH] devfs: Remove the tty_driver devfs_name field as it's no longer needed
[PATCH] devfs: Remove the line_driver devfs_name field as it's no longer needed
[PATCH] devfs: Remove the videodevice devfs_name field as it's no longer needed
[PATCH] devfs: Remove the gendisk devfs_name field as it's no longer needed
[PATCH] devfs: Remove the miscdevice devfs_name field as it's no longer needed
[PATCH] devfs: Remove the devfs_fs_kernel.h file from the tree
[PATCH] devfs: Remove devfs_remove() function from the kernel tree
[PATCH] devfs: Remove devfs_mk_cdev() function from the kernel tree
[PATCH] devfs: Remove devfs_mk_bdev() function from the kernel tree
[PATCH] devfs: Remove devfs_mk_symlink() function from the kernel tree
[PATCH] devfs: Remove devfs_mk_dir() function from the kernel tree
[PATCH] devfs: Remove devfs_*_tape() functions from the kernel tree
[PATCH] devfs: Remove devfs support from the sound subsystem
[PATCH] devfs: Remove devfs support from the ide subsystem.
[PATCH] devfs: Remove devfs support from the serial subsystem
[PATCH] devfs: Remove devfs from the init code
[PATCH] devfs: Remove devfs from the partition code
...
Core functions for the rt-mutex subsystem.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- add a proper prototype for the following global function:
- buffer_init()
- make the following needlessly global function static:
- end_buffer_async_write()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
kbuild: trivial fixes in Makefile
kbuild: adding symbols in Kconfig and defconfig to TAGS
kbuild: replace abort() with exit(1)
kbuild: support for %.symtypes files
kbuild: fix silentoldconfig recursion
kbuild: add option for stripping modules while installing them
kbuild: kill some false positives from modpost
kbuild: export-symbol usage report generator
kbuild: fix make -rR breakage
kbuild: append -dirty for updated but uncommited changes
kbuild: append git revision for all untagged commits
kbuild: fix module.symvers parsing in modpost
kbuild: ignore make's built-in rules & variables
kbuild: bugfix with initramfs
kbuild: modpost build fix
kbuild: check license compatibility when building modules
kbuild: export-type enhancement to modpost.c
kbuild: add dependency on kernel.release to the package targets
kbuild: `make kernelrelease' speedup
kconfig: KCONFIG_OVERWRITECONFIG
...
Architecture specific configs like this have no business at all
in init/Kconfig. This prevents it from being set on x86-64
Pointed out by H.Peter Anvin
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These are the generic bits needed to enable reliable stack traces based
on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will
enable x86-64 and i386 to make use of this.
Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities
for improvement.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch ensures that initramfs overwrites work correctly, even when dealing
with device nodes of different types. Furthermore, when replacing a file
which already exists, we must make very certain that we truncate the existing
file.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify the update_wall_time function so it increments time using the
clocksource abstraction instead of jiffies. Since the only clocksource driver
currently provided is the jiffies clocksource, this should result in no
functional change. Additionally, a timekeeping_init and timekeeping_resume
function has been added to initialize and maintain some of the new timekeping
state.
[hirofumi@mail.parknet.co.jp: fixlet]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make makes sysctl non-optional unless EMBEDDED is set. There are a number
of interfaces exposed via sysctl, enough that it has to be considered core
kernel functionality at this point.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>