GitLab Elasticsearch Indexer

This project indexes Git repositories into Elasticsearch for GitLab. The indexed data lets GitLab search code, wikis, and commits in GitLab repositories through Elasticsearch.

The indexer is designed with a modular architecture that supports different indexing modes to optimize for various deployment scenarios. It uses structured logging to help with troubleshooting and debugging.

Dependencies

This project relies on the following dependencies:

  • ICU for text encoding
  • Go 1.20 or later for building from source
  • Gitaly for accessing Git repositories
  • Elasticsearch v7.x or compatible OpenSearch instance

Ensure the development packages for your platform are installed before running make:

Debian / Ubuntu

# apt install libicu-dev

macOS

$ brew install icu4c
$ export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"

Modes Architecture

The GitLab Elasticsearch Indexer supports multiple operating modes that can be configured using the GITLAB_INDEXER_MODE environment variable. Each mode is optimized for different use cases:

Advanced Mode (Default)

The Advanced Mode is the default mode for the indexer. It provides full-featured indexing with support for:

  • Indexing code (blobs), commits, and wikis
  • Project permission handling
  • Namespace traversal IDs
  • Schema versioning

This mode is recommended for most standard GitLab deployments.

export GITLAB_INDEXER_MODE=advanced # default if not specified

Chunk Mode

The Chunk Mode is an alternative indexing approach designed for large repositories or specialized deployment scenarios. This mode is currently under development and will provide enhanced features for handling very large codebases more efficiently.

To select Chunk Mode, set the GITLAB_INDEXER_MODE environment variable:

export GITLAB_INDEXER_MODE=chunk
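
The mode selection described above can be checked in a wrapper script before the indexer starts. This is a minimal sketch, not part of the project: resolve_indexer_mode is a hypothetical helper, and treating an empty value the same as an unset one is an assumption.

```shell
# Hypothetical helper: print the effective mode, or fail on an unknown value.
# Assumption: an empty GITLAB_INDEXER_MODE is treated like an unset one.
resolve_indexer_mode() {
  case "${GITLAB_INDEXER_MODE:-advanced}" in
    advanced|chunk)
      printf '%s\n' "${GITLAB_INDEXER_MODE:-advanced}"
      ;;
    *)
      printf 'unsupported GITLAB_INDEXER_MODE: %s\n' "$GITLAB_INDEXER_MODE" >&2
      return 1
      ;;
  esac
}

# Prints "chunk"; the assignment only applies to this one call.
GITLAB_INDEXER_MODE=chunk resolve_indexer_mode
```

Failing early in the wrapper surfaces a typo in the variable before any indexing work begins. How the indexer itself reacts to an unknown value is not covered here.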

Usage

Chunk mode uses command-line flags to specify the adapter and connection details, with operation-specific options passed as JSON:

gitlab-elasticsearch-indexer \
  --adapter elasticsearch \
  --connection '{"url": ["http://localhost:9200"]}' \
  --options '{
    "project_id": 123,
    "operation": "index",
    "partition_name": "gitlab-code-search",
    "partition_number": 0,
    "timeout": "5m",
    "chunk_size": 1024,
    "gitaly_config": {...}
  }'

Supported Adapters: elasticsearch, postgresql (planned), opensearch (planned)

Operations:

  • index (default): Index project files as chunks
  • delete: Remove all chunks for a project

Common Options:

  • project_id (required): Project ID
  • operation (defaults to index): Operation type (index|delete)
  • partition_name (required): Index partition name
  • partition_number (required): Index partition number
  • timeout (required): Operation timeout (e.g., 5m, 1h)

Index Operation Options:

  • from_sha, to_sha: Git commit range
  • chunk_size: Maximum chunk size in bytes
  • chunk_overlap: Overlap between chunks in bytes
  • chunk_strategy: Chunking strategy (see below)
  • gitaly_config: Gitaly connection configuration
  • gitaly_batch_size: Batch size for Gitaly operations
  • elastic_bulk_size: Bulk operation size for Elasticsearch
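
When the indexer is called from a script, the options JSON can be assembled from shell variables. The sketch below builds the options for a delete operation using only the fields listed as required above. chunk_delete_options is a hypothetical helper, and it assumes the partition name and timeout contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print the --options JSON for a delete operation.
# Arguments: project ID, partition name, partition number, timeout.
# Assumption: no argument contains quotes or backslashes (no JSON escaping).
chunk_delete_options() {
  printf '{"project_id": %d, "operation": "delete", "partition_name": "%s", "partition_number": %d, "timeout": "%s"}\n' \
    "$1" "$2" "$3" "$4"
}

# Prints the JSON that could be passed as: --options "$(chunk_delete_options ...)"
chunk_delete_options 123 gitlab-code-search 0 2m
```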

Chunk Strategies

Chunk mode supports different chunking strategies that determine how files are split into chunks:

  • code_bytes (Default): Uses byte-based chunking optimized for performance. This is the recommended strategy for production use, providing fast and reliable indexing.

  • code_pre_bert (Experimental): Uses token-based chunking with pre-BERT token size limits.

    ⚠️ WARNING: This strategy is EXPERIMENTAL and NOT RECOMMENDED for production use. Performance benchmarks show it is approximately 18x slower than code_bytes:

    • code_bytes: ~97 seconds to index the GitLab repository
    • code_pre_bert: 30+ minutes, often resulting in timeout errors

    This strategy should only be used for research and development purposes.

The chunking strategy is configured via the chunk_strategy option in the JSON options passed to chunk mode.

Building & Installing

Local Build

To build and install the indexer locally:

make
sudo make install

gitlab-elasticsearch-indexer will be installed to /usr/local/bin

You can change the installation path with the PREFIX environment variable. Please remember to pass the -E flag to sudo if you do so.

Example:

PREFIX=/usr sudo -E make install

Development Helpers

The project includes several helpful Makefile targets to assist with development:

# View all available Makefile targets with descriptions
make help

# Run tests in watch mode (automatically re-run on file changes)
make watch-test

Using Docker

You can also build and use the indexer as a Docker image:

docker build . -t gitlab-elasticsearch-indexer

You can edit your shell profile (like ~/.zshrc) to use the image as a binary:

gitlab-elasticsearch-indexer() {
  docker run --rm -it gitlab-elasticsearch-indexer "$@"
}

Lefthook Static Analysis

Lefthook is a Git hooks manager that allows custom logic to be executed prior to Git committing or pushing. gitlab-elasticsearch-indexer comes with Lefthook configuration (lefthook.yml), which helps ensure code quality by running linters and static analysis tools automatically.

The configuration file is checked in but ignored until Lefthook is installed.

Install Lefthook

  1. Install lefthook

  2. Install Lefthook Git hooks:

    lefthook install
    
  3. Test Lefthook is working by running the Lefthook pre-push Git hook:

    lefthook run pre-push
    

Lefthook will now automatically run configured checks before commits and pushes.

Troubleshooting

If you encounter errors like fatal error: 'parser-c-bindings.h' file not found when running lefthook hooks, run make build to populate tmp/libparser:

make build

This error typically occurs when running lefthook for the first time after cloning, after cleaning the tmp/ directory, or after pulling dependency updates.

Testing

The project includes a comprehensive test suite and developer-friendly testing features to help ensure code quality.

Test Requirements

The test suite expects Gitaly and Elasticsearch to be running on the following ports:

  • Gitaly: 8075
  • Elasticsearch v7.14.2: 9201
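
Before running the tests, it can help to confirm that both ports accept connections. The following is a rough sketch that relies on bash's /dev/tcp redirection, so it needs bash rather than plain sh. port_open is a hypothetical helper, and localhost is assumed as the host.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp redirection; other shells report "not reachable".
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Check the two ports the test suite expects.
for service in gitaly:8075 elasticsearch:9201; do
  name="${service%%:*}"
  port="${service##*:}"
  if port_open localhost "$port"; then
    echo "$name is reachable on port $port"
  else
    echo "$name is not reachable on port $port"
  fi
done
```

An open port only shows that something is listening. It does not confirm that the service behind it is healthy.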

Make sure you have docker and docker-compose installed. On macOS, you can use colima to run Docker since Docker Desktop cannot be used due to licensing.

brew install docker docker-compose colima
colima start

Quick Tests

# Start the test infrastructure (only needed once)
make test-infra

# Source the default connection settings
source .env.test

# Run the test suite
make test

# Run tests in watch mode (auto-rerun on file changes)
make watch-test

# Run a specific test
go test -v gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -run TestIndexingGitlabTest

If you want to re-create the test infrastructure, you can run make test-infra again.

Custom Test Configuration

For testing with custom configurations:

  1. Start only the services you need:

    # Start Gitaly
    docker-compose up -d gitaly
    
    # Start Elasticsearch
    docker-compose up -d elasticsearch
    
  2. Configure the test environment:

    # These are the defaults from .env.test
    export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
    export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9201"], "index_name":"gitlab-test", "index_name_commits":"gitlab-test-commits"}'
    

    Note: When using a Unix socket, use the format unix://FULL_PATH_WITH_LEADING_SLASH

    Example with custom Gitaly connection:

    # Source default connections
    source .env.test
    
    # Override Gitaly connection for GDK
    export GITALY_CONNECTION_INFO='{"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}'
    
    # Run tests
    make test
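
The Unix socket format from the note above can be generated rather than typed by hand. This sketch uses a hypothetical helper, gitaly_unix_connection_info, which rejects relative paths and assumes the path and storage name need no JSON escaping.

```shell
# Hypothetical helper: print a GITALY_CONNECTION_INFO value for a Unix socket.
# Arguments: absolute socket path, optional storage name (default: "default").
gitaly_unix_connection_info() {
  case "$1" in
    /*) ;;  # absolute path, as the unix:// format requires
    *)
      echo "socket path must start with a slash: $1" >&2
      return 1
      ;;
  esac
  printf '{"address": "unix://%s", "storage": "%s"}\n' "$1" "${2:-default}"
}

# Prints {"address": "unix:///gitlab/gdk/gitaly.socket", "storage": "default"}
gitaly_unix_connection_info /gitlab/gdk/gitaly.socket
```

The output can be exported directly, for example with export GITALY_CONNECTION_INFO="$(gitaly_unix_connection_info /gitlab/gdk/gitaly.socket)".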
    

Testing in GDK

You can test changes to the indexer in the GitLab Development Kit (GDK) in multiple ways.

Using the GITLAB_ELASTICSEARCH_INDEXER_VERSION File

Warning: Do not create tags to test code. Tags are created for released versions only.

The GITLAB_ELASTICSEARCH_INDEXER_VERSION file accepts commit SHAs and branch names. This method works for both local development and spec execution.

To test a branch or specific commit:

  1. Update the GITLAB_ELASTICSEARCH_INDEXER_VERSION file with your branch name or commit SHA
  2. Run gdk reconfigure to apply the changes

Building a Binary for GDK

To test changes to the indexer in your GDK, build and install it with the PREFIX environment variable set to your GDK directory. This installs the indexer directly in the GDK, making it available for immediate testing:

# Build and install directly to GDK
PREFIX=<gdk_install_directory>/gitlab-elasticsearch-indexer make install

Note: Running gdk update will reset the indexer back to the version specified in the GITLAB_ELASTICSEARCH_INDEXER_VERSION file. The specs use this file to build the indexer to <gdk_install_directory>/gitlab/tmp/tests/gitlab-elasticsearch-indexer.

Debugging Elasticsearch calls

Set the ELASTIC_DEBUG environment variable to print all calls to Elasticsearch.

Example:

ELASTIC_DEBUG=1 go test -v -run TestMixedOperationsBulkSizeTracking ./internal/mode/advanced/elastic

Debugging with Delve

Delve is a powerful Go debugger that can help troubleshoot issues.

Start a debugging session with:

dlv test <path-to-package> -- -test.run <regex-matching-test-name>

Example:

dlv test gitlab.com/gitlab-org/gitlab-elasticsearch-indexer -- -test.run ^TestIndexingWikiBlobs$

Common debugging commands:

  • Set a breakpoint: break <path-to-file>:<line-number>
  • Continue execution until next breakpoint: continue
  • Print variable value: print <variable-name>
  • Step to next source line: next
  • Exit debugger: exit

For more details, see the Delve documentation.

Obtaining a package or Docker image for testing an MR

GitLab team members can use the build-package-and-qa job in their MR pipeline to trigger a pipeline in the omnibus-gitlab-mirror project. This pipeline produces:

  • An omnibus-gitlab package for Ubuntu (as an artifact of the Trigger:package job)
  • A Docker image (in the Trigger:gitlab-docker job)

These artifacts include the changes from the MR and can be used to deploy a GitLab instance locally for testing.

The job is automatically started if the MR includes changes to any of the dependencies of the project, which could potentially break builds in any of the operating systems GitLab provides packages for. For other types of MRs, this is available as a manual job for developers to run when needed.

Configuration Options

The GitLab Elasticsearch Indexer operates in two modes (Advanced and Chunk), each with its own configuration approach. Advanced mode uses traditional command-line flags, while Chunk mode uses JSON-based configuration for flexibility.

Global Configuration

These settings apply to both modes:

| Variable | Type | Default | Description |
|---|---|---|---|
| GITLAB_INDEXER_MODE | string | advanced | Operating mode: advanced or chunk |
| GITLAB_INDEXER_DEBUG_LOGGING | bool | false | Enable debug logging for the global slog logger (used by both modes). Accepts true or 1. Advanced mode also supports the DEBUG environment variable for its logkit logger. |

Advanced Mode Configuration

Advanced mode is the default operating mode with full-featured indexing support.

Command-Line Flags

| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| --version | bool | | false | Print version information and exit |
| --skip-commits | bool | | false | Skip indexing commits for the repository |
| --skip-blobs | bool | | false | Skip indexing blobs for the repository |
| --blob-type | string | | blob | Type of blobs to index: blob or wiki_blob |
| --visibility-level | int | | -1 | Project/Group visibility level: 0 (private), 10 (internal), 20 (public) |
| --repository-access-level | int | | -1 | Project repository access level: 0 (disabled), 10 (private), 20 (enabled) |
| --wiki-access-level | int | | -1 | Wiki repository access level: 0 (disabled), 10 (private), 20 (enabled) |
| --full-path | string | | "" | Full path of the project or group (e.g., gitlab-org/gitlab) |
| --project-id | int64 | | -1 | Project ID |
| --group-id | int64 | | -1 | Group ID |
| --timeout | string | | "" | Process timeout (e.g., 5m, 1h, 30s). Empty means no timeout |
| --traversal-ids | string | Yes | | Namespace traversal IDs (e.g., 5-1-6-). Required. |
| --hashed-root-namespace-id | int | | -1 | Hashed root namespace ID |
| --from-sha | string | | "" | Starting commit SHA for incremental indexing |
| --to-sha | string | | "" | Ending commit SHA for incremental indexing |
| --archived | string | | "" | Whether the project is archived (true or false) |
| --schema-version-blob | int | [1] | 0 | Schema version for blob documents (YYVV format, e.g., 2601). |
| --schema-version-commit | int | [2] | 0 | Schema version for commit documents (YYVV format, e.g., 2622). |
| --schema-version-wiki | int | [3] | 0 | Schema version for wiki documents (YYVV format, e.g., 2603). |

[1] Required when --blob-type=blob
[2] Required when --skip-commits is not set
[3] Required when --blob-type=wiki_blob
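
The numeric visibility levels are easy to mix up on the command line. A small lookup helper keeps calling scripts readable. This is only a sketch: visibility_level is a hypothetical function, and the mapping is taken from the --visibility-level description above.

```shell
# Hypothetical helper: translate a visibility name into the numeric level
# documented for --visibility-level (0 private, 10 internal, 20 public).
visibility_level() {
  case "$1" in
    private)  echo 0 ;;
    internal) echo 10 ;;
    public)   echo 20 ;;
    *)
      echo "unknown visibility: $1" >&2
      return 1
      ;;
  esac
}

# Prints 10; usable as: --visibility-level "$(visibility_level internal)"
visibility_level internal
```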

Environment Variables

| Variable | Description | Example |
|---|---|---|
| GITALY_CONNECTION_INFO | Gitaly connection details (JSON) | {"address": "tcp://localhost:8075", "storage": "default"} |
| ELASTIC_CONNECTION_INFO | Elasticsearch connection details (JSON) | {"url":["http://localhost:9200"], "index_name":"gitlab-production"} |
| CORRELATION_ID | Request correlation ID for tracking | abc123-def456 |
| DEBUG | Enable debug logging for Advanced mode's logkit logger (in addition to the global slog logger controlled by GITLAB_INDEXER_DEBUG_LOGGING). | true or 1 |

Gitaly Connection Configuration

The GITALY_CONNECTION_INFO environment variable accepts a JSON object with these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| address | string | | Gitaly server address (e.g., tcp://localhost:8075 or unix:///path/to/socket) |
| storage | string | | Storage name |
| token | string | | Authentication token |
| relative_path | string | | Repository relative path |
| project_path | string | | Project path |
| token_version | int | | Token version: 0 or 2 |
| limit_file_size | int64 | | Maximum file size in bytes to index (default: 1 MiB) |

Elasticsearch Connection Configuration

The ELASTIC_CONNECTION_INFO environment variable accepts a JSON object with these fields:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | []string | | | Elasticsearch URLs (e.g., ["http://localhost:9200"]) |
| index_name | string | | gitlab | Index name for blobs. Defaults to gitlab (or gitlab-{RAILS_ENV} if RAILS_ENV is set) if omitted. |
| index_name_commits | string | | gitlab-commits | Index name for commits |
| index_name_wikis | string | | "" | Index name for wikis |
| aws | boolean | | false | Enable connection for AWS |
| aws_region | string | | "" | AWS region |
| aws_access_key | string | | "" | AWS access key |
| aws_secret_access_key | string | | "" | AWS secret access key |
| max_bulk_size_bytes | int | | 10485760 | Maximum bulk request size (10 MiB) |
| max_bulk_concurrency | int | | 10 | Number of concurrent bulk workers |
| client_request_timeout | int | | 0 | Client request timeout in seconds (0 = no timeout) |
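
Because max_bulk_size_bytes is given in bytes, a MiB value has to be converted before it goes into the JSON. The arithmetic is shown below with a hypothetical helper, mib_to_bytes.

```shell
# Hypothetical helper: convert MiB to bytes (1 MiB = 1024 * 1024 bytes).
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 10   # prints 10485760, the documented default
mib_to_bytes 5    # prints 5242880
```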

Advanced Mode Usage Examples

Example 1: Basic Project Indexing

export GITALY_CONNECTION_INFO='{"address": "tcp://localhost:8075", "storage": "default"}'
export ELASTIC_CONNECTION_INFO='{"url":["http://localhost:9200"]}'

gitlab-elasticsearch-indexer \
  --project-id 42 \
  --full-path "gitlab-org/gitlab" \
  /path/to/repo

Example 2: Incremental Indexing with Commit Range

gitlab-elasticsearch-indexer \
  --project-id 42 \
  --from-sha abc123def456... \
  --to-sha 789xyz012345... \
  /path/to/repo

Example 3: Index Wikis Only

gitlab-elasticsearch-indexer \
  --project-id 42 \
  --blob-type wiki_blob \
  --skip-commits \
  /path/to/wiki/repo

Example 4: Index with Permissions and Namespace Information

gitlab-elasticsearch-indexer \
  --project-id 42 \
  --visibility-level 10 \
  --repository-access-level 20 \
  --traversal-ids "5-1-6-" \
  --hashed-root-namespace-id 5 \
  /path/to/repo

Example 5: Index Commits Only with Schema Version

gitlab-elasticsearch-indexer \
  --project-id 42 \
  --skip-blobs \
  --schema-version-commit 2305 \
  --timeout 10m \
  /path/to/repo

Example 6: AWS Elasticsearch with Custom Configuration

export ELASTIC_CONNECTION_INFO='{
  "url": ["https://search-domain.us-east-1.es.amazonaws.com"],
  "aws": true,
  "aws_region": "us-east-1",
  "aws_access_key": "<your_aws_access_key>",
  "aws_secret_access_key": "<your_aws_secret_access_key>",
  "max_bulk_size_bytes": 5242880,
  "max_bulk_concurrency": 5
}'

gitlab-elasticsearch-indexer --project-id 42 /path/to/repo

Chunk Mode Configuration

Chunk mode uses JSON-based configuration passed via command-line flags. See the Chunk Mode section for details on chunking strategies and operations.

Command-Line Flags

| Flag | Type | Required | Description |
|---|---|---|---|
| --adapter | string | | Storage adapter: elasticsearch, postgresql, or opensearch |
| --connection | string | | Connection configuration as JSON (varies by adapter) |
| --options | string | | Indexing options as JSON (see below) |

Options JSON Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | uint64 | Yes | | Project ID |
| operation | string | | index | Operation type: index or delete |
| from_sha | string | | | Starting commit SHA for incremental indexing. Optional; defaults to the null tree SHA (beginning of repository history) if omitted. Not required for initial indexing. |
| to_sha | string | | | Ending commit SHA for incremental indexing. Optional; defaults to HEAD (latest commit) if omitted. Not required for initial indexing. |
| correlation_id | string | | | Request correlation ID for tracking (optional, for logging purposes only) |
| force_reindex | bool | | false | Force reindexing of all content (index operation only) |
| chunk_size | uint16 | | 1000 | Maximum chunk size in bytes. Optional; defaults to 1000 if omitted or set to 0. (index operation only) |
| chunk_overlap | uint16 | | 0 | Overlap between chunks in bytes. Optional; defaults to 0 if omitted. Must be less than chunk_size. Used in index operation only. |
| chunk_strategy | string | | code_bytes | Chunking strategy: code_bytes or code_pre_bert (see Chunk Strategies). Used in index operation only. |
| gitaly_config | object | [1] | | Gitaly configuration (same fields as Advanced mode) |
| timeout | string | Yes | | Operation timeout (e.g., 5m, 1h). Must be a valid duration string; empty string will cause an error. Unlike Advanced mode, there is no "no timeout" option. |
| gitaly_batch_size | uint16 | | 128 | Batch size for Gitaly operations (required for index operation only) |
| partition_name | string | Yes | | Index/table partition name. For Elasticsearch/OpenSearch, this is combined with partition_number to form the index name (e.g., gitlab-code_0). For PostgreSQL, this is used as the table name directly. |
| partition_number | uint16 | Yes | | Partition number. For Elasticsearch/OpenSearch, this is appended to partition_name to form the index name. For PostgreSQL, this is stored as metadata but the table name comes from partition_name. |
| schema_version | uint16 | | | Schema version (YYVV format, e.g., 2601). Optional; not currently used by the indexer but included for future compatibility. |
| elastic_bulk_size | uint16 | | 1000 | Bulk size for Elasticsearch operations. Defaults to 1000 if omitted or set to 0. (index operation only) |

[1] Required for index operation only
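
The constraints on chunk_size and chunk_overlap can be checked before the options JSON is built. The sketch below uses two hypothetical helpers. It relies on the fact that a uint16 holds values from 0 to 65535, and it assumes the overlap is compared against the effective chunk size, so a chunk_size of 0 counts as the default 1000.

```shell
# Hypothetical helper: succeed if the argument is an integer in 0..65535,
# the range of a uint16 field.
is_uint16() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -le 65535 ]
}

# Hypothetical pre-flight check. Arguments: chunk_size, chunk_overlap.
# Assumption: a chunk_size of 0 is replaced by the default 1000 before
# the overlap is compared against it.
valid_chunking() {
  is_uint16 "$1" && is_uint16 "$2" || return 1
  size="$1"
  if [ "$size" -eq 0 ]; then
    size=1000
  fi
  [ "$2" -lt "$size" ]
}

if valid_chunking 1024 128; then
  echo "chunk settings look valid"
fi
```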

Connection Configuration by Adapter

Elasticsearch Adapter

| Field | Type | Required | Description |
|---|---|---|---|
| url | []string | | Elasticsearch URLs |
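
For the Elasticsearch adapter, the index that receives the chunks is derived from partition_name and partition_number. The sketch below assumes the two are joined with an underscore, which is inferred only from the documented example gitlab-code_0. chunk_index_name is a hypothetical helper.

```shell
# Hypothetical helper: derive the index name from the partition settings.
# Assumption: the format is "<partition_name>_<partition_number>", inferred
# from the documented example gitlab-code_0.
chunk_index_name() {
  printf '%s_%s\n' "$1" "$2"
}

chunk_index_name gitlab-code 0   # prints gitlab-code_0
```

The resulting name can then be used when inspecting the cluster by hand, for example in a request to Elasticsearch's _count API for that index.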

PostgreSQL Adapter

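| Field | Type | Required | Description |
|---|---|---|---|
| host | string | | PostgreSQL host |
| port | uint16 | | PostgreSQL port |
| user | string | | Database user |
| password | string | | Database password |
| database | string | | Database name |
| table | string | | Table name |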

OpenSearch Adapter

| Field | Type | Required | Description |
|---|---|---|---|
| url | []string | | OpenSearch URLs |
| aws | bool | | Enable AWS signing |
| aws_region | string | | AWS region |
| aws_access_key | string | | AWS access key |
| aws_secret_access_key | string | | AWS secret access key |
| aws_role_arn | string | | AWS role ARN |
| client_request_timeout | int | | Request timeout in seconds |

Chunk Mode Usage Examples

Example 1: Basic Chunk Indexing with Elasticsearch

gitlab-elasticsearch-indexer \
  --adapter elasticsearch \
  --connection '{"url": ["http://localhost:9200"]}' \
  --options '{
    "project_id": 123,
    "operation": "index",
    "from_sha": "abc123def456",
    "to_sha": "789xyz012345",
    "correlation_id": "req-123-456",
    "schema_version": 2305,
    "chunk_size": 1024,
    "chunk_overlap": 128,
    "chunk_strategy": "code_bytes",
    "partition_name": "gitlab-code",
    "partition_number": 0,
    "timeout": "5m",
    "gitaly_config": {
      "address": "tcp://localhost:8075",
      "storage": "default"
    },
    "gitaly_batch_size": 128,
    "elastic_bulk_size": 1000
  }'

Example 2: Delete Chunks for a Project

gitlab-elasticsearch-indexer \
  --adapter elasticsearch \
  --connection '{"url": ["http://localhost:9200"]}' \
  --options '{
    "project_id": 123,
    "operation": "delete",
    "partition_name": "gitlab-code",
    "partition_number": 0,
    "timeout": "2m"
  }'

Example 3: Index with PostgreSQL Adapter

gitlab-elasticsearch-indexer \
  --adapter postgresql \
  --connection '{
    "host": "localhost",
    "port": 5432,
    "user": "gitlab",
    "password": "<your_password>",
    "database": "gitlab_production",
    "table": "code_chunks"
  }' \
  --options '{
    "project_id": 123,
    "from_sha": "abc123",
    "to_sha": "def456",
    "correlation_id": "req-pg-123",
    "schema_version": 2305,
    "chunk_size": 2048,
    "chunk_overlap": 256,
    "partition_name": "chunks",
    "partition_number": 0,
    "timeout": "10m",
    "gitaly_config": {
      "address": "unix:///var/opt/gitlab/gitaly/gitaly.socket",
      "storage": "default"
    },
    "gitaly_batch_size": 128
  }'

Example 4: OpenSearch with AWS Authentication

gitlab-elasticsearch-indexer \
  --adapter opensearch \
  --connection '{
    "url": ["https://search-domain.us-west-2.es.amazonaws.com"],
    "aws": true,
    "aws_region": "us-west-2",
    "aws_access_key": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY"
  }' \
  --options '{
    "project_id": 456,
    "from_sha": "start123",
    "to_sha": "end456",
    "correlation_id": "req-opensearch-456",
    "schema_version": 2305,
    "chunk_size": 1024,
    "chunk_overlap": 128,
    "partition_name": "gitlab-code-search",
    "partition_number": 1,
    "timeout": "15m",
    "gitaly_config": {
      "address": "tcp://gitaly.example.com:8075",
      "storage": "default",
      "token": "<your_access_token>"
    },
    "gitaly_batch_size": 128
  }'

Logging

The GitLab Elasticsearch Indexer uses structured JSON logging with the Go standard library's log/slog package. This provides:

  • Consistent log format with key-value pairs
  • Configurable log levels
  • Easy integration with log management systems

Debug Logging

Debug logging can be enabled using the following environment variables:

Global slog logger (used by both modes):

Set the GITLAB_INDEXER_DEBUG_LOGGING environment variable:

# Enable debug logging for the global slog logger
export GITLAB_INDEXER_DEBUG_LOGGING=true
# or
export GITLAB_INDEXER_DEBUG_LOGGING=1

# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options]

Advanced mode logkit logger (Advanced mode only):

Additionally, Advanced mode supports the DEBUG environment variable for its logkit logger:

# Enable debug logging for Advanced mode's logkit logger
export DEBUG=1

# Run the indexer with debug logging enabled
gitlab-elasticsearch-indexer [options] /path/to/repo

When debug logging is enabled, you'll see additional information about:

  • Mode selection and initialization
  • Elasticsearch queries and responses
  • Git operations
  • Performance metrics

Debug logs are automatically formatted as structured JSON for easy filtering and analysis.
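
A wrapper script can report whether debug logging will be active before it launches the indexer. This sketch uses a hypothetical helper, debug_logging_enabled, which matches only the two documented values, true and 1. Whether the indexer accepts other spellings is not covered here.

```shell
# Hypothetical helper: succeed if GITLAB_INDEXER_DEBUG_LOGGING holds one of
# the two documented values, "true" or "1".
debug_logging_enabled() {
  case "${GITLAB_INDEXER_DEBUG_LOGGING:-}" in
    true|1) return 0 ;;
    *) return 1 ;;
  esac
}

if debug_logging_enabled; then
  echo "debug logging: on"
else
  echo "debug logging: off"
fi
```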

CI/CD Configuration

Automatic Tag Creation

The project contains a CI job that automatically creates version tags based on the content of the VERSION file. When changes are merged to the main branch, the system checks if a tag for the current version exists and creates one if needed.

TAG_CREATOR_TOKEN Requirements

To enable automatic tag creation, you need to set up a GitLab CI/CD variable:

  • Variable Name: TAG_CREATOR_TOKEN
  • Type: Masked and Protected variable
  • Requirements:
    • Must be a project access token with Developer role
    • Scope required: api
    • The bot user created with the token must have permission to create protected tags

To set up this token:

  1. Create a project access token with Developer role and api scope
  2. Add the token as a Masked and Protected CI/CD variable in your project settings
  3. Go to your project's Settings > Repository > Protected Tags
  4. Add the project bot user (appears as "Project bot: [project-name]") to the list of users allowed to create protected tags

Contributing

Please see the following documentation for contributing to this project:
