Order matters (provisional)
When a table is designed the columns have an order that follows the human logic that doesn’t necessary gives coincides with the way PostgreSQL stores the data ad may cause unnecessary bloat.
Let’s find out why the column order is important and how we can use it to improve our storage and consequently get better performance.
Fixed length and varlena
In PostgreSQL we have two kind of data types the fixed length and varlena.
The fixed length data types have a certain size in bytes regardless of the data stored. This kind of data type impose boundaries for the data to be stored.
For example an integer consumes 4 bytes but accepts only values from -2,147,483,648 to +2,147,483,647.
If we attempt to store a value outside the permitted range we get an error. In this particular case if want to store a number that doesn’t fit within integer’s limits we’ll have to use bigint instead.
As the name suggests bigint accepts values sitting between -9,223,372,036,854,775,808 and +9,223,372,036,854,775,807 at the cost of consuming twice the storage of an integer. Bigint consumes always 8 bytes.
The other kind of data type is the varlena. As the name suggests varlena stores VARiable LENgth data. This kind of data type normally doesn’t have fixed boundaries except for the maximum allowed storage that is 1 GB and it’s set by the design of the TOAST data type.
Depending from the tuple length then the varlena column could be a data representation as-is or compressed data or an external pointer to the corresponding TOAST table’s data chunks. TOAST data chunks can also be compressed or not compressed.
The first 4 bytes of the datum are used to store the varlena length unless the datum is smaller than 127 bytes. In that case a different space-efficient storage method uses only one byte for the datum length.
Except for that special case all the data types are aligned to 4 at least bytes by design.
This alignment have an impact on the storage, in particular if we deal with billion of rows.
Padding and alignment
When designing a table the human logic sets the columns in an order based on the meaning of the data, usually expressed by the column name.
For example a table containing the user data has the DDL written this way.
CREATE TABLE user_data
(
id serial,
user_name varchar(20),
password varchar(100),
user_city integer NOT NULL default 0,
user_super boolean NOT NULL default 'f',
date_signup timestamp with time zone NOT NULL default now(),
user_active boolean NOT NULL default 'f',
date_activation timestamp with time zone NOT NULL default '1970-01-01 00:00:00+00'::timestamptz,
constraint pk_user primary key (id)
);
An integer field is followed by two varchar with an max length set to 20 and 100. Then two booleans are followed by another integer and two timestamp with time zone.
The length of the integers is 4 bytes. The length of the timestamp with time zone data types is 8 bytes
INSERT INTO user_data
(
user_name,
password
)
SELECT
format('user_%s',id) AS user_name,
md5(format('user_%s',id)) AS password
FROM generate_series(1,20000) id;
;