Soft Partitioning
Soft partitioning is a logical partitioning strategy implemented at the ingestion layer rather than relying on native database partitioning features. Instead of storing all rows for a given entity in a single large table, we split data into time-based “partition” tables and expose them through a unified view.Currently Supported Destinations
- Redshift
- Snowflake
Why Soft Partitioning?
While some databases support native partitioning, others have limitations that make them ineffective for ingestion workloads:Partitioning Capabilities by Database
Partitioning Capabilities by Database
PostgreSQL
- Supports native partitioning with good performance
- Well-suited for both ingestion and query workloads
BigQuery
- Supports native partitioning and clustering, but partition pruning is most effective for query workloads rather than high-throughput ingestion.
- Partitioning in BigQuery is typically based on a timestamp or date column, but frequent ingestion of small batches can lead to many partitions and higher costs.
- Soft partitioning can help optimize ingestion performance and cost by controlling partition granularity and data routing at the ingestion layer.
- For instructions on native partitioning, see our BigQuery Partitioning guide.
Snowflake
- Has cluster keys that physically co-locate related rows in micro-partitions
- Re-clustering is async and can be expensive (consumes credits outside of a warehouse)
- Query pruning is less predictable
- More optimized for query performance than ingestion throughput
Redshift
- Has distribution keys (to control data placement across nodes) and sort keys (to allow pruning)
- More useful for joins and analytical queries
- Not well-suited for rapid ingestion and pruning for high throughput
- Predictable ingestion performance
- Explicit partition control
- Simpler data management
- Freedom from underlying database partitioning constraints
Usage
How It Works
How It Works
Consider a table
user_events
with the following columns:1. Dynamic Table Creation
The partition scheme can be by month or day, and is keyed off thecreated_at
column. Artie will dynamically create tables for each partition period. For example, with monthly partitions:2. Automatic Data Routing
As new data arrives, Artie writes each row to the appropriate partition table based on itscreated_at
timestamp. Future partitions (e.g., next 2-3 months) are pre-created and added to the unified view.3. Unified View
Artie automatically maintains a view foruser_events
that unions all partitions and updates the definition as new partitions are created:4. Compaction
As the number of partitions grows, Artie compacts old partitions by merging their data back into the_default
table.
You have the flexibility to specify how many previous partitions are retained before compaction is triggered.Enabling Soft Partitioning in Artie
Enabling Soft Partitioning in Artie
Note: if the table already exists in the destination, just contact us and we’ll help you migrate it to be soft partitioned.
-
Navigate to the
Tables
tab in the pipeline. Make sure that the source and destination is set and the desired table is set toReplicating
. Click on the gear icon. -
The
Enable soft partitioning
option lists the names of the partition tables that will be created if the destination is set. -
Check the checkbox and choose the partition column (in this case
created_at
). You may choose the partition frequency and the number of old partitions to be maintained as well. -
Click
Done
and save the pipeline. -
Deploy the pipeline. Validation will fail if a table exists in the destination with the same name as the view. (In this case,
inventory_user_events
.) You may choose to rename the existing table or contact us for a one time migration. - After enabling soft partitioning, a recurring job runs in the background to manage the view and the partitioned tables. You will receive a notification about actions performed as part of this job. (For e.g. creation of new partition tables)
It is not recommended to create objects that depend on the unified view or partition tables as that will interfere with the view management actions that are run periodically.These actions include:
- Creation of new partition tables. (Unified view is recreated to include the new tables)
- Compaction of old partition tables. (Unified view is recreated to exclude the old tables)
- Adding new columns to all managed tables and views. (Unified view is recreated to include the new column)
- Redshift
- Consider using late-binding views instead of standard views.
Redshift query to check for dependent objects
- Snowflake
Target Workloads
While soft partitioning can work for any datasets with a timestamp column, it’s particularly effective for:Time-Series, Append-Only Data
- Event streaming
- IoT sensor data
- Transaction logs
- Application metrics
High-Volume Ingestion Pipelines
- Where the cost of updating large tables is high
- Merge workloads that often target recent slices of data
- Scenarios requiring predictable ingestion performance
Benefits
Benefits of Soft Partitioning
Benefits of Soft Partitioning
- Faster ingestion and better performance: Smaller, time-based partitions enable quicker writes and allow queries to efficiently target specific time ranges, improving overall performance.
- Reduced lock contention: Operations on different partitions occur independently, minimizing blocking and improving concurrency.
- Simplified data management: Data is clearly separated by time periods, making it easier to manage, maintain, and apply flexible retention policies—old partitions can be dropped, compressed, or archived as needed.
- Lower storage and compute costs: Only relevant partitions are scanned during queries, reducing compute usage, and older data can be archived or compressed to save on storage.
- Predictable scaling and costs: Ingestion and query performance remain consistent as data volume grows, ensuring costs and resource usage are easier to forecast.
- Database agnostic: Soft partitioning works consistently across different database systems, providing operational flexibility.