Achieve Faster Search in ParadeDB with These Tips

Search performance in databases plays a crucial role in ensuring efficient data retrieval. ParadeDB, a PostgreSQL database optimized for search, offers advanced capabilities to enhance search speed and accuracy. This blog provides practical tips to achieve faster search results using ParadeDB.

Why ParadeDB

Today, Postgres users who need a search and analytics engine face two main options. They can adopt an external service like Elasticsearch, which is capable but challenging to manage, optimize, and keep synchronized with Postgres. Or they can use Postgres' native search and aggregation features, which often lack critical functionality and perform poorly on large datasets.

ParadeDB is designed to bridge this gap by offering the best of both worlds: the familiarity and reliability of Postgres combined with the high performance of a dedicated search and analytical database.

ParadeDB is particularly well-suited for:

Developers who prefer to rely on Postgres and wish to avoid the complexities of operating a non-Postgres database or query engine.
Applications requiring full-text, similarity, or hybrid searches across extensive volumes of operational data.
Systems experiencing performance bottlenecks due to Postgres’ limited analytical capabilities when processing hundreds of millions or billions of rows.
Latency-sensitive analytical queries over external object stores such as S3 and table formats like Delta Lake.

Understanding ParadeDB Search Mechanisms

Indexing Strategies

Effective indexing strategies significantly enhance search performance in ParadeDB. Indexes allow the database to locate and retrieve data quickly, reducing query response times.

Types of Indexes

ParadeDB utilizes three distinct index types to enhance its search capabilities: the BM25 index for full-text search, the HNSW index for dense vector search, and the sparse HNSW index for sparse vector search.

BM25 Index

The BM25 index is employed for full-text search within ParadeDB. Before searching, this index must be created on the relevant table columns. It is strongly consistent, meaning that new data becomes immediately searchable across all connections. Once established, the index automatically stays in sync with the underlying table as data changes.

To create a BM25 index, call the paradedb.create_bm25 procedure. Here is an example:

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description'),
  numeric_fields => paradedb.field('rating')
);
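
Once the index exists, a search might look like the sketch below. It assumes an older ParadeDB release (roughly the 0.8 to 0.10 series) in which create_bm25 exposes a search function under a schema named after the index. Newer releases use a different query syntax, so check the documentation for your version. The query string and the row limit are illustrative.

```sql
-- Assumes the 'search_idx' index created above and the index-schema
-- search function found in older ParadeDB releases.
SELECT id, description, rating
FROM search_idx.search(
  'description:keyboard AND rating:>2',  -- illustrative query string
  limit_rows => 5                        -- return at most five rows
);
```

Filtering on the indexed numeric field inside the query string keeps the filter within the index instead of applying it after the search.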

BM25 indexes support various field types:

Text Fields: Columns of type VARCHAR, TEXT, UUID, VARCHAR[], and TEXT[] can be indexed as text fields.
Numeric Fields: Columns of type SMALLINT, INTEGER, BIGINT, OID, REAL, DOUBLE PRECISION, and NUMERIC can be indexed as numeric fields.
Boolean Fields: Columns of type BOOLEAN can be indexed as boolean fields.
JSON Fields: Columns of type JSON and JSONB can be indexed as json_fields.
Datetime Fields: Columns of type DATE, TIMESTAMP, TIMESTAMPTZ, TIME, and TIMETZ can be indexed as datetime_fields.

In recent updates, ParadeDB has introduced new features and improvements to BM25 indexing:

Version 0.8.5 introduced new, refactored functions for highlighting and BM25 scoring with several robustness improvements.
Support for multi-language stemming was added.
Partial BM25 indexing was introduced.

By utilizing these features and configuration options, you can create powerful and efficient BM25 indexes tailored to your specific search requirements in ParadeDB.

HNSW Index

The HNSW (Hierarchical Navigable Small World) index is designed for dense vector search. It can be created on any column of the vector type, supporting vectors with up to 2,000 dimensions.

To create an HNSW index, use the following SQL command:

CREATE INDEX ON <schema_name>.<table_name>
USING hnsw (<column_name> <distance_metric>);
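
As a concrete illustration of the template above, here is a minimal sketch using pgvector. The items table, its three-dimensional embeddings, and the choice of L2 distance are assumptions made for the example, not part of ParadeDB itself.

```sql
-- Hypothetical table; requires the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)  -- three dimensions only to keep the example small
);

INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]'), ('[1,1,1]');

-- vector_l2_ops selects Euclidean (L2) distance for the index.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- <-> is the L2 distance operator. It has to match the operator class
-- of the index, otherwise the planner cannot use the index.
SELECT id, embedding
FROM items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 2;
```

For cosine distance, the corresponding pair would be vector_cosine_ops with the <=> operator, and for inner product, vector_ip_ops with <#>.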

Sparse HNSW Index

The sparse HNSW index facilitates search over sparse vectors using the HNSW algorithm, powered by the pgvector Postgres extension.

To create a sparse HNSW index, use the same SQL command as for the dense HNSW index, but on a column of type sparsevec and with a sparsevec distance metric:

CREATE INDEX ON <schema_name>.<table_name>
USING hnsw (<column_name> <distance_metric>);
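
A filled-in version might look like the following sketch. It assumes pgvector 0.7.0 or later, which provides the sparsevec type. The sparse_items table and its five-dimensional vectors are hypothetical.

```sql
-- Hypothetical table; sparsevec requires pgvector 0.7.0 or later.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE sparse_items (
  id bigserial PRIMARY KEY,
  embedding sparsevec(5)  -- five dimensions in total
);

-- Text format: {index:value,...}/dimensions, with indices starting at 1.
INSERT INTO sparse_items (embedding)
VALUES ('{1:1,3:2,5:3}/5'), ('{2:4,4:1}/5');

-- sparsevec_l2_ops selects L2 distance for sparse vectors.
CREATE INDEX ON sparse_items USING hnsw (embedding sparsevec_l2_ops);

-- Nearest neighbour by L2 distance.
SELECT id
FROM sparse_items
ORDER BY embedding <-> '{1:3,3:1,5:2}/5'
LIMIT 1;
```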

Each of these index types serves a specific function within ParadeDB's search capabilities, enabling efficient full-text, dense vector, and sparse vector searches.

Best Practices for Indexing

Implementing best practices ensures optimal performance:

Selective Indexing: Create indexes only on columns frequently used in search queries. Avoid indexing every column to prevent unnecessary overhead.
Composite Indexes: Use composite indexes for queries involving multiple columns. Composite indexes improve performance by reducing the number of index scans.
Regular Maintenance: Perform regular maintenance tasks like REINDEX and VACUUM to keep indexes efficient. These tasks help reclaim space and optimize index performance.
Monitoring Usage: Monitor index usage with statistics views like pg_stat_user_indexes. This helps identify unused or inefficient indexes, allowing for timely adjustments.
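
The practices above can be sketched with standard Postgres commands. The example reuses the mock_items table from earlier and assumes it also has a category column. The index name is hypothetical.

```sql
-- Composite index for queries that filter on category and rating together.
-- The category column is an assumption for this example.
CREATE INDEX mock_items_category_rating_idx
  ON mock_items (category, rating);

-- List indexes by how rarely they are scanned. Large indexes with
-- idx_scan = 0 are candidates for removal.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;

-- Routine maintenance: reclaim dead rows and refresh planner statistics,
-- then rebuild a bloated index without blocking writes (Postgres 12+).
VACUUM (ANALYZE) mock_items;
REINDEX INDEX CONCURRENTLY mock_items_category_rating_idx;
```

The idx_scan counters accumulate since the last statistics reset, so judge an index only after it has seen a representative workload.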

Query Optimization

Optimizing queries is crucial for achieving faster search results in ParadeDB. Efficient queries minimize resource usage and improve response times.

Analyzing Query Performance

Analyzing query performance helps identify bottlenecks and areas for improvement:

EXPLAIN Command: Use the EXPLAIN command to understand the execution plan of a query. This reveals how ParadeDB processes the query and identifies potential inefficiencies.
pg_stat_statements: Enable the pg_stat_statements extension to track query performance statistics. This provides insights into the most resource-intensive queries.
Query Profiling Tools: Use a log analyzer like pgBadger for detailed query profiling. It parses the PostgreSQL logs and generates reports that highlight slow and frequent queries, which helps you decide where to optimize.
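
A minimal sketch of the first two techniques follows. The query against mock_items is illustrative. The column names total_exec_time and mean_exec_time assume Postgres 13 or later, and pg_stat_statements only collects data once it is listed in shared_preload_libraries and the server has been restarted.

```sql
-- Show the actual execution plan, timings, and buffer usage.
-- Note that ANALYZE really executes the statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, description
FROM mock_items
WHERE rating >= 4;

-- Requires pg_stat_statements in shared_preload_libraries.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Ten statements with the highest cumulative execution time.
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```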

Writing Efficient Queries

Writing efficient queries ensures that ParadeDB performs searches quickly and accurately:

Use Index-Friendly Conditions: Structure queries to leverage existing indexes. Avoid functions or operations on indexed columns that prevent index usage.
Limit Data Retrieval: Retrieve only the necessary data by specifying columns in the SELECT statement. Avoid using SELECT * to reduce the amount of data processed.
Filter Early: Apply filters early in the query to reduce the dataset size. This minimizes the amount of data ParadeDB needs to process.
Optimize Joins: Use appropriate join types and conditions to optimize join performance. Ensure that join columns are indexed to speed up the join process.
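
To illustrate these points, here is a sketch against a hypothetical orders table with columns id, customer_id, created_at, and total. The table, the index name, and the date range are assumptions made for the example.

```sql
-- Plain B-tree index on the filter column (hypothetical table and name).
CREATE INDEX orders_created_at_idx ON orders (created_at);

-- Index-unfriendly: wrapping the column in a function prevents the
-- planner from using the plain index on created_at.
-- SELECT * FROM orders
-- WHERE date_trunc('month', created_at) = '2024-01-01';

-- Index-friendly: compare the bare column against a range, select only
-- the needed columns, and cap the result size.
SELECT id, customer_id, total
FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-02-01'
ORDER BY created_at
LIMIT 100;
```

Both forms return rows from the same month, but only the second leaves the indexed column untouched so that an index scan is possible.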

Hardware and Configuration Tips

Optimizing Server Resources

Memory Allocation

Proper memory allocation can significantly improve ParadeDB's search performance. Increasing the shared_buffers parameter lets Postgres cache more data in memory, which reduces disk I/O. Raising work_mem helps complex queries and sorts, which otherwise spill to temporary files on disk. Note that work_mem applies to each sort or hash operation rather than to each connection, so a high value combined with many concurrent queries can exhaust memory.
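
These parameters can be changed from SQL, as in the sketch below. The values are placeholders rather than recommendations. Suitable numbers depend on the machine's RAM and the workload.

```sql
-- Placeholder values; size them for your own hardware.
ALTER SYSTEM SET shared_buffers = '4GB';  -- takes effect after a restart
ALTER SYSTEM SET work_mem = '64MB';       -- takes effect after a reload
SELECT pg_reload_conf();

-- work_mem can also be raised for a single session that runs a heavy
-- query, leaving the server-wide default conservative.
SET work_mem = '256MB';

-- Confirm the values currently in effect.
SHOW shared_buffers;
SHOW work_mem;
```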

CPU Utilization

Optimizing CPU utilization involves configuring ParadeDB to make the best use of available processor resources. Enabling parallel query execution can distribute the workload across multiple CPU cores, speeding up query processing. Adjusting the max_parallel_workers and max_parallel_workers_per_gather parameters helps control the number of parallel workers, balancing performance and resource usage. Monitoring CPU usage with tools like top or htop can identify bottlenecks and guide further optimizations.
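
A sketch of adjusting the parallelism settings and checking whether a query actually runs in parallel is shown below. The worker counts are placeholders, and the query against mock_items is illustrative.

```sql
-- Placeholder values; max_parallel_workers is itself capped by
-- max_worker_processes.
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();

-- A parallel plan contains a Gather (or Gather Merge) node with a
-- "Workers Planned" line. Small tables usually stay serial.
EXPLAIN
SELECT count(*)
FROM mock_items
WHERE rating >= 4;
```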

Database Configuration Settings

Cache Settings

Effective cache settings play a crucial role in enhancing ParadeDB's performance. The effective_cache_size parameter informs the query planner about the amount of memory available for caching, helping it make better decisions. Increasing the maintenance_work_mem parameter speeds up maintenance tasks such as VACUUM and CREATE INDEX. Utilizing the pg_prewarm extension can preload frequently accessed data into the cache, reducing initial query latency.
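
The settings in this paragraph can be applied as in the following sketch. The sizes are placeholders, and mock_items stands in for whichever table is queried most often.

```sql
-- Placeholder values; effective_cache_size is only a planner hint and
-- does not allocate any memory.
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();

-- Load a table into the buffer cache ahead of the first queries.
-- The function returns the number of blocks it prewarmed.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('mock_items');
```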

Connection Pooling

Connection pooling manages database connections efficiently, improving overall performance. A connection pooler like PgBouncer reduces the overhead of establishing new connections. On the Postgres side, set max_connections high enough for the pooler's server connections but low enough to prevent resource exhaustion. In PgBouncer, set pool_mode to transaction or session depending on the application's requirements: transaction mode reuses connections more aggressively, while session mode keeps session-level features intact. Monitoring connection pool statistics helps identify issues and fine-tune these settings.
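
On the database side, connection pressure can be inspected with the queries below. This is a minimal sketch, and the backend_type column assumes Postgres 10 or later.

```sql
-- Configured connection limit.
SHOW max_connections;

-- Client connections grouped by state. Many 'idle' connections suggest
-- that a pooler, or a smaller pool, would help.
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY connections DESC;

-- When connected to the PgBouncer admin console rather than to Postgres,
-- pool statistics are available with:
-- SHOW POOLS;
```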

Advanced Techniques for Faster Search

Utilizing Caching Mechanisms

Caching mechanisms store frequently accessed data in memory. This reduces the need to fetch data from disk, speeding up search operations.

Types of Caches

Different types of caches serve various purposes in ParadeDB:

In-Memory Cache: Stores data in RAM for quick access. This type of cache is ideal for frequently accessed data.
Disk Cache: Uses disk storage to cache data. While slower than in-memory caches, disk caches can store larger datasets.
Distributed Cache: Spreads cached data across multiple servers. This approach adds capacity and fault tolerance beyond what a single machine's memory can offer.

Implementing Caching in ParadeDB

Implementing caching in ParadeDB involves configuring settings and using extensions:

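Configuration Settings: Adjust shared_buffers to size Postgres' own buffer cache, and set effective_cache_size to tell the query planner how much memory, including the operating system's file cache, is likely available for caching. The latter allocates nothing; it only influences plan choices.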
pg_prewarm Extension: Use the pg_prewarm extension to preload frequently accessed data into the cache. This reduces initial query latency and improves performance.
Cache Invalidation: Implement strategies to invalidate stale cache entries. This ensures that users receive up-to-date data while maintaining cache efficiency.
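
To judge whether the cache is sized well, one option is to look at the buffer cache hit ratio in the standard statistics views. The sketch below is an assumption about how to measure this, not a ParadeDB-specific feature. A ratio that stays well below 1 on a read-heavy workload suggests the working set does not fit in shared_buffers.

```sql
-- Share of block reads served from shared_buffers for this database.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4)
         AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- The same idea per table, to find the tables that miss the cache most.
SELECT relname,
       heap_blks_hit,
       heap_blks_read
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

Reads counted in blks_read may still be served from the operating system's file cache, so this ratio understates total caching.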

By leveraging caching mechanisms, ParadeDB can achieve faster search performance. This technique speeds up data retrieval by keeping frequently accessed data in memory, enhancing the overall efficiency of the database.

Optimizing search in ParadeDB ensures efficient data retrieval and enhances overall database performance. Key tips include effective indexing strategies, query optimization, and hardware configuration adjustments. Advanced techniques like caching further boost search speed. Implementing these practices will lead to significant improvements in search performance. ParadeDB users should apply these tips to achieve faster and more accurate search results.