Module ProcessContent


ProcessContent

Provides content processing for the File Indexer service. It prepares file content for indexing by detecting the encoding and MIME type and by tokenizing text for search operations.

External Crates:

Internal Modules:

Inspired by VSCode’s content processing in src/vs/base/node/encoding/

When detection fails, content processing functions return an Option or a safe default rather than an error, so that indexing can continue with the remaining files.
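
The fallback behavior can be illustrated with a minimal sketch. The function names, signatures, and the BOM-based detection below are assumptions made for illustration; they are hypothetical stand-ins for DetectEncoding and ContentToString, not the module's actual code.

```rust
/// Hypothetical stand-in for DetectEncoding (signature assumed).
/// Returns None, not an error, when the encoding cannot be determined.
/// Simplified: only BOMs and a UTF-8 validity check are considered.
fn detect_encoding_sketch(content: &[u8]) -> Option<&'static str> {
    if content.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some("utf-8")
    } else if content.starts_with(&[0xFF, 0xFE]) {
        Some("utf-16le")
    } else if content.starts_with(&[0xFE, 0xFF]) {
        Some("utf-16be")
    } else if std::str::from_utf8(content).is_ok() {
        Some("utf-8")
    } else {
        None
    }
}

/// Hypothetical stand-in for ContentToString (signature assumed).
/// Invalid byte sequences become U+FFFD instead of aborting the conversion.
fn content_to_string_sketch(content: &[u8]) -> String {
    String::from_utf8_lossy(content).into_owned()
}

/// A caller that prefers a safe default over handling None itself.
/// The choice of "utf-8" as the default is an assumption.
fn encoding_or_default(content: &[u8]) -> &'static str {
    detect_encoding_sketch(content).unwrap_or("utf-8")
}
```

With this shape, a file whose bytes cannot be classified still yields a usable string and an encoding label, so the indexer never has to abort a batch because of one unreadable file.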

Content processing functions are pure and safe to call from parallel indexing tasks.
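
Because a pure function reads only its arguments, it can be called from several threads without locks. The sketch below shows this with scoped threads from the standard library; line_count_sketch is a hypothetical stand-in for GetLineCount, and count_lines_in_parallel is an illustrative helper that is not part of the module.

```rust
/// Hypothetical stand-in for GetLineCount (signature assumed).
/// Pure: the result depends only on the input, and nothing is mutated.
fn line_count_sketch(content: &str) -> usize {
    content.lines().count()
}

/// Illustrative helper: counts lines for each input on its own scoped thread.
/// Results are returned in the same order as the inputs.
fn count_lines_in_parallel(files: &[&str]) -> Vec<usize> {
    std::thread::scope(|scope| {
        // Spawn one worker per input; each worker only borrows its own &str.
        let handles: Vec<_> = files
            .iter()
            .map(|&content| scope.spawn(move || line_count_sketch(content)))
            .collect();
        // Join in spawn order so the output lines up with the input slice.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per file is used here only to keep the sketch short; a real indexer would more likely hand the same pure functions to a bounded worker pool.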

ContentToString: Convert content to UTF-8 string with error handling
DetectEncoding: Detect file encoding (simplified detection)
DetectLanguage: Detect programming language from file extension and shebang
DetectMimeType: Detect MIME type with comprehensive file type detection
GetCharCount: Count the characters in content
GetLineCount: Count the lines in content
IsBinaryContent: Check if content is likely binary (contains null bytes or high ratio of non-text)
SanitizeContent: Remove null bytes and control characters from content
TokenizeContent: Tokenize content for indexing with improved word boundary handling
TruncateContent: Truncate content to specified maximum size in characters
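
To make a few of the summaries above concrete, here is a rough sketch of how binary detection, sanitizing, truncation, and tokenization could be written. All names and signatures are hypothetical, and the 30% non-text threshold, the decision to keep tab, newline, and carriage return, and the lowercasing of tokens are assumptions, not documented behavior of this module.

```rust
/// Sketch of IsBinaryContent: a null byte, or a high ratio of control bytes.
/// The 30% threshold is an assumed value.
fn is_binary_sketch(content: &[u8]) -> bool {
    if content.is_empty() {
        return false;
    }
    if content.contains(&0) {
        return true;
    }
    let non_text = content
        .iter()
        .filter(|&&b| b < 0x20 && !matches!(b, b'\t' | b'\n' | b'\r'))
        .count();
    // Integer form of non_text / len > 0.3.
    non_text * 10 > content.len() * 3
}

/// Sketch of SanitizeContent: drops null bytes and other control characters.
/// Keeping tab, newline, and carriage return is an assumption.
fn sanitize_sketch(content: &str) -> String {
    content
        .chars()
        .filter(|&c| !c.is_control() || matches!(c, '\t' | '\n' | '\r'))
        .collect()
}

/// Sketch of TruncateContent: limits by characters, not bytes, so the cut
/// never lands inside a multi-byte UTF-8 sequence.
fn truncate_sketch(content: &str, max_chars: usize) -> String {
    content.chars().take(max_chars).collect()
}

/// Sketch of TokenizeContent: splits on anything that is not alphanumeric
/// or an underscore, drops empty pieces, and lowercases each token.
fn tokenize_sketch(content: &str) -> Vec<String> {
    content
        .split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

Treating the underscore as a word character keeps identifiers such as max_size searchable as a single token, which is usually what a source code index wants.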