Module ProcessContent


§ProcessContent

§File: Indexing/Process/ProcessContent.rs

§Role in Air Architecture

Provides content processing functionality for the File Indexer service, handling encoding detection, MIME type detection, and content tokenization.

§Primary Responsibility

Process file content for indexing by detecting its encoding and MIME type and by tokenizing text for search operations.

§Secondary Responsibilities

  • File encoding detection (UTF-8, UTF-16, ASCII)
  • MIME type detection from extensions and content
  • Content tokenization for search indexing
  • Language detection for code analysis
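As an illustration of the encoding detection listed above, the following is a minimal sketch using only the standard library. The names EncodingSketch and detect_encoding_sketch are hypothetical and are not the module's actual API; the fallback order (BOM first, then ASCII, then UTF-8 validation) is an assumption, not the documented behavior of DetectEncoding.

```rust
/// Hypothetical encoding labels for illustration; the real module may use different names.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EncodingSketch {
    Utf8,
    Utf16Le,
    Utf16Be,
    Ascii,
    Unknown,
}

/// Guess an encoding from a byte-order mark, then from the bytes themselves.
pub fn detect_encoding_sketch(bytes: &[u8]) -> EncodingSketch {
    // `starts_with` never indexes out of bounds, so short or empty input is safe.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return EncodingSketch::Utf8;
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return EncodingSketch::Utf16Le;
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return EncodingSketch::Utf16Be;
    }
    // No BOM: pure 7-bit input is reported as ASCII; otherwise accept valid UTF-8.
    if bytes.is_ascii() {
        EncodingSketch::Ascii
    } else if std::str::from_utf8(bytes).is_ok() {
        EncodingSketch::Utf8
    } else {
        EncodingSketch::Unknown
    }
}
```

Returning an Unknown variant instead of an error keeps the caller free to fall back to a default encoding.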

§Dependencies

External Crates:

  • None (uses std library)

Internal Modules:

  • crate::Result - Error handling type

§Dependents

  • Indexing::Scan::ScanFile - Content processing during file scan
  • Indexing::Store::StoreEntry - Index storage operations

§VSCode Pattern Reference

Inspired by VSCode’s content processing in src/vs/base/node/encoding/

§Security Considerations

  • Safe BOM marker detection
  • Null byte filtering
  • Length limits on processed content
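The null byte filtering and length limits above could look roughly like the sketch below. The function names sanitize_sketch and truncate_sketch are hypothetical, and the choice to keep newline, carriage return, and tab is an assumption; SanitizeContent and TruncateContent may behave differently.

```rust
/// Drop NUL bytes and other control characters, keeping newline,
/// carriage return, and tab (an assumed whitelist).
pub fn sanitize_sketch(content: &str) -> String {
    content
        .chars()
        .filter(|&c| !c.is_control() || matches!(c, '\n' | '\r' | '\t'))
        .collect()
}

/// Keep at most `max_chars` characters. The cut is made at a character
/// boundary, so a multi-byte UTF-8 sequence is never split.
pub fn truncate_sketch(content: &str, max_chars: usize) -> &str {
    match content.char_indices().nth(max_chars) {
        // Byte offset of the first character past the limit.
        Some((byte_index, _)) => &content[..byte_index],
        // Content is already within the limit.
        None => content,
    }
}
```

Counting characters instead of bytes avoids a panic from slicing in the middle of a multi-byte character.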

§Performance Considerations

  • Efficient tokenization with minimal allocations
  • Early termination for binary files
  • Lazy content evaluation
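Early termination for binary files can be sketched as follows. The name is_probably_binary_sketch, the 8192-byte sample window, and the 30% threshold are all assumptions chosen for illustration; they are not taken from IsBinaryContent.

```rust
/// Hypothetical sample window; the real module's limit is not documented here.
const SAMPLE_LIMIT: usize = 8192;

/// Heuristic binary check that inspects only a bounded prefix of the input.
pub fn is_probably_binary_sketch(bytes: &[u8]) -> bool {
    let sample = &bytes[..bytes.len().min(SAMPLE_LIMIT)];
    let mut suspicious = 0usize;
    for &byte in sample {
        if byte == 0 {
            // A NUL byte is treated as decisive, so scanning stops here.
            return true;
        }
        // Count control bytes other than tab, line feed, and carriage return.
        if byte < 0x20 && !matches!(byte, b'\t' | b'\n' | b'\r') {
            suspicious += 1;
        }
    }
    // Assumed threshold: more than 30% suspicious bytes in the sample.
    suspicious * 10 > sample.len() * 3
}
```

UTF-16 text contains NUL bytes, so a caller using this sketch would need to check for a BOM before applying it.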

§Error Handling Strategy

Content processing functions do not return errors when detection fails. They return Option or a safe default instead, so that one undetectable file does not stop indexing.
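The two failure styles described here might be sketched like this. Both function names and the extension tables are hypothetical, and the choice of application/octet-stream as the default is an assumption; the actual mappings in DetectLanguage and DetectMimeType are not shown in this page.

```rust
use std::path::Path;

/// Returns `None` when the extension is missing or unrecognized.
pub fn language_from_extension_sketch(path: &Path) -> Option<&'static str> {
    let extension = path.extension()?.to_str()?;
    match extension.to_ascii_lowercase().as_str() {
        "rs" => Some("rust"),
        "ts" => Some("typescript"),
        "py" => Some("python"),
        _ => None,
    }
}

/// Always returns a MIME type, falling back to a generic default
/// (assumed here to be `application/octet-stream`).
pub fn mime_or_default_sketch(path: &Path) -> &'static str {
    let extension = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match extension.as_deref() {
        Some("json") => "application/json",
        Some("html") => "text/html",
        Some("md") => "text/markdown",
        _ => "application/octet-stream",
    }
}
```

Neither function can fail, so a scan loop can call them on every file without a branch for error recovery.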

§Thread Safety

Content processing functions are pure and safe to call from parallel indexing tasks.
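Because a pure function only reads its input, it can be shared across threads without locks. The sketch below uses std::thread::scope to illustrate this; count_lines_sketch and count_lines_parallel_sketch are hypothetical names, and spawning one thread per document is for illustration only, not a description of how the indexer schedules work.

```rust
use std::thread;

/// Pure helper: depends only on its argument and touches no shared state.
pub fn count_lines_sketch(content: &str) -> usize {
    content.lines().count()
}

/// Run the pure helper on each document in its own scoped thread.
/// Results are returned in the same order as the input.
pub fn count_lines_parallel_sketch(documents: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = documents
            .iter()
            .map(|document| scope.spawn(move || count_lines_sketch(document)))
            .collect();
        // Join in input order; a panic in a worker is propagated.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```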

Functions§

ContentToString
Convert content to UTF-8 string with error handling
DetectEncoding
Detect file encoding (simplified detection)
DetectLanguage
Detect programming language from file extension and shebang
DetectMimeType
Detect MIME type with comprehensive file type detection
GetCharCount
Get char count from content
GetLineCount
Get line count from content
IsBinaryContent
Check if content is likely binary (contains null bytes or a high ratio of non-text bytes)
SanitizeContent
Remove null bytes and control characters from content
TokenizeContent
Tokenize content for indexing with improved word boundary handling
TruncateContent
Truncate content to specified maximum size in characters
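To give a rough idea of what tokenization for indexing involves, here is a minimal sketch. The name tokenize_sketch is hypothetical, and the rules it applies (split on anything that is not alphanumeric or an underscore, then lowercase) are assumptions; the word boundary handling in TokenizeContent is not described in this page and may differ.

```rust
/// Split content into lowercase tokens. Underscores are kept inside tokens
/// so identifiers such as `parse_file` stay whole (an assumed rule).
pub fn tokenize_sketch(content: &str) -> Vec<String> {
    content
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        // Consecutive separators produce empty pieces; skip them.
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

For example, under these assumed rules the input "fn parse_file(Path: &str)" yields the tokens fn, parse_file, path, and str.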