kernel-aes67/Documentation/filesystems/ext4/blockgroup.rst

.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock will start at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if for some reason the block size =
1024, then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is continuous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex_bg. For example,
if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be continuous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.

Meta Block Groups
-----------------

Without the option META_BG, for safety concerns, all block group
descriptors copies are kept in the first block group. Given the default
128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.

The solution to this problem is to use the metablock group feature
(META_BG), which is already in ext3 for all 2.6 releases. With the
META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
includes 64 block groups, or 8 GiB of disk space. The metablock group
feature moves the location of the group descriptors from the congested
first block group of the whole filesystem into the first group of each
metablock group itself. The backups are in the second and last group of
each metablock group. This increases the 2^21 maximum block groups limit
to the hard limit 2^32, allowing support for a 512PiB filesystem.

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block is placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 64
bytes, a meta-block group contains 16 block groups for filesystems with
a 1KB block size, and 64 block groups for filesystems with a 4KB
blocksize. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
the field s_first_meta_bg in the superblock will indicate the first
block group using this new layout.

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 are three block group descriptor flags that
enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`.. SPDX-License-Identifier: GPL-2.0`

			`Layout`
			`------`

			`The layout of a standard block group is approximately as follows (each`
			`of these fields is discussed in a separate section below):`

			`.. list-table::`
			`:widths: 1 1 1 1 1 1 1 1`
			`:header-rows: 1`

			`* - Group 0 Padding`
			`- ext4 Super Block`
			`- Group Descriptors`
			`- Reserved GDT Blocks`
			`- Data Block Bitmap`
			`- inode Bitmap`
			`- inode Table`
			`- Data Blocks`
			`* - 1024 bytes`
			`- 1 block`
			`- many blocks`
			`- many blocks`
			`- 1 block`
			`- 1 block`
			`- many blocks`
			`- many more blocks`

			`For the special case of block group 0, the first 1024 bytes are unused,`
			`to allow for the installation of x86 boot sectors and other oddities.`
			`The superblock will start at offset 1024 bytes, whichever block that`
			`happens to be (usually 0). However, if for some reason the block size =`
			`1024, then block 0 is marked in use and the superblock goes in block 1.`
			`For all other block groups, there is no padding.`

			`The ext4 driver primarily works with the superblock and the group`
			`descriptors that are found in block group 0. Redundant copies of the`
			`superblock and group descriptors are written to some of the block groups`
			`across the disk in case the beginning of the disk gets trashed, though`
			`not all block groups necessarily host a redundant copy (see following`
			`paragraph for more details). If the group does not have a redundant`
			`copy, the block group begins with the data block bitmap. Note also that`
			`when the filesystem is freshly formatted, mkfs will allocate “reserve`
			`GDT block” space after the block group descriptors and before the start`
			`of the block bitmaps to allow for future expansion of the filesystem. By`
			`default, a filesystem is allowed to increase in size by a factor of`
			`1024x over the original filesystem size.`

			The location of the inode table is given by ``grp.bg_inode_table_*``. It
			`is continuous range of blocks large enough to contain`
			``sb.s_inodes_per_group * sb.s_inode_size`` bytes.

			`As for the ordering of items in a block group, it is generally`
			`established that the super block and the group descriptor table, if`
			`present, will be at the beginning of the block group. The bitmaps and`
			`the inode table can be anywhere, and it is quite possible for the`
			`bitmaps to come after the inode table, or for both to be in different`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`groups (flex_bg). Leftover space is used for file data blocks, indirect`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`block maps, extent tree blocks, and extended attributes.`

			`Flexible Block Groups`
			`---------------------`

			`Starting in ext4, there is a new feature called flexible block groups`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`(flex_bg). In a flex_bg, several block groups are tied together as one`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`logical block group; the bitmap spaces and the inode table space in the`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`first block group of the flex_bg are expanded to include the bitmaps`
			`and inode tables of all other block groups in the flex_bg. For example,`
			`if the flex_bg size is 4, then group 0 will contain (in order) the`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`superblock, group descriptors, data block bitmaps for groups 0-3, inode`
			`bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining`
			`space in group 0 is for file data. The effect of this is to group the`
ext4: documentation fixes This commit aims to fix the following issues in ext4 documentation: - Flexible block group docs said that the aim was to group block metadata together instead of block group metadata. - The documentation consistly uses "location" instead of "block number". It is easy to confuse location to be an absolute offset on disk. Added a line to clarify all location values are in terms of block numbers. - Dirent2 docs said that the rec_len field is shortened instead of the name_len field. - Typo in bg_checksum description. - Inode size is 160 bytes now, and hence i_extra_isize is now 32. - Cluster size formula was incorrect, it did not include the +10 to s_log_cluster_size value. - Typo: there were two s_wtime_hi in the superblock struct. - Superblock struct was outdated, added the new fields which were part of s_reserved earlier. - Multiple mount protection seems to be implemented in fs/ext4/mmp.c. Signed-off-by: Ayush Ranjan <ayushr2@illinois.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> 2019-08-22 23:18:33 -04:00			`block group metadata close together for faster loading, and to enable`
			`large files to be continuous on disk. Backup copies of the superblock`
			`and group descriptors are always at the beginning of block groups, even`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`if flex_bg is enabled. The number of block groups that make up a`
			flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00
			`Meta Block Groups`
			`-----------------`

ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`Without the option META_BG, for safety concerns, all block group`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`descriptors copies are kept in the first block group. Given the default`
			`128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4`
			`can have at most 2^27/64 = 2^21 block groups. This limits the entire`
docs: filesystems: ext4: blockgroup.rst: replace some characters The conversion tools used during DocBook/LaTeX/html/Markdown->ReST conversion and some cut-and-pasted text contain some characters that aren't easily reachable on standard keyboards and/or could cause troubles when parsed by the documentation build system. Replace the occurences of the following characters: - U+2217 ('∗'): ASTERISK OPERATOR use ASCII asterisk instead of the ASTERISK OPERATOR Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/c5c3c384c48779ca7c9dcd90183cefe20ac82928.1623826294.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net> 2021-06-16 02:55:12 -04:00			`filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00
			`The solution to this problem is to use the metablock group feature`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`(META_BG), which is already in ext3 for all 2.6 releases. With the`
			`META_BG feature, ext4 filesystems are partitioned into many metablock`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`groups. Each metablock group is a cluster of block groups whose group`
			`descriptor structures can be stored in a single disk block. For ext4`
			`filesystems with 4 KB block size, a single metablock group partition`
			`includes 64 block groups, or 8 GiB of disk space. The metablock group`
			`feature moves the location of the group descriptors from the congested`
			`first block group of the whole filesystem into the first group of each`
			`metablock group itself. The backups are in the second and last group of`
			`each metablock group. This increases the 2^21 maximum block groups limit`
			`to the hard limit 2^32, allowing support for a 512PiB filesystem.`

			`The change in the filesystem format replaces the current scheme where`
			`the superblock is followed by a variable-length set of block group`
			`descriptors. Instead, the superblock and a single block group descriptor`
			`block is placed at the beginning of the first, second, and last block`
			`groups in a meta-block group. A meta-block group is a collection of`
			`block groups which can be described by a single block group descriptor`
docs: ext4: modify the group desc size to 64 Since the default ext4 group desc size is 64 now (assuming that the 64-bit feature is enbled). And the size mentioned in this doc is 64 too. Change it to 64. Signed-off-by: Wu Bo <bo.wu@vivo.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230222013525.14748-1-bo.wu@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2023-02-21 20:35:24 -05:00			`block. Since the size of the block group descriptor structure is 64`
			`bytes, a meta-block group contains 16 block groups for filesystems with`
			`a 1KB block size, and 64 block groups for filesystems with a 4KB`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`blocksize. Filesystems can either be created using this new block group`
			`descriptor layout, or existing filesystems can be resized on-line, and`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`the field s_first_meta_bg in the superblock will indicate the first`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`block group using this new layout.`

			Please see an important note about ``BLOCK_UNINIT`` in the section about
			`block and inode bitmaps.`

			`Lazy Block Group Initialization`
			`-------------------------------`

			`A new feature for ext4 are three block group descriptor flags that`
			`enable mkfs to skip initializing other parts of the block group`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`that the inode and block bitmaps for that group can be calculated and`
			`therefore the on-disk bitmap blocks are not initialized. This is`
			`generally the case for an empty block group or a block group containing`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`only fixed-location block group metadata. The INODE_ZEROED flag means`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`that the inode table has been initialized; mkfs will unset this flag and`
			`rely on the kernel to initialize the inode tables in the background.`

			`By not writing zeroes to the bitmaps and inode table, mkfs time is`
ext4, doc: remove unnecessary escaping Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2022-05-19 22:22:55 -04:00			`reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,`
			`but the dumpe2fs output prints this as “uninit_bg”. They are the same`
ext4: import high level design chapter from wiki page Import the chapter about high level design from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> 2018-07-29 15:38:00 -04:00			`thing.`