S3: Avoid in-memory map concurrent writes in SSE-S3 key manager (#7358)

* Fix concurrent map writes in SSE-S3 key manager

This commit fixes issue #7352, where parallel uploads to SSE-S3-enabled
buckets were crashing with 'fatal error: concurrent map writes'.

The SSES3KeyManager struct had an unsynchronized map that was being
accessed from multiple goroutines during concurrent PUT operations.

Changes:
- Added sync.RWMutex to SSES3KeyManager struct
- Protected StoreKey() with write lock
- Protected GetKey() with read lock
- Updated GetOrCreateKey() with a proper read/write locking pattern,
  including a double-check to prevent race conditions (sketched below)
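
A minimal sketch of the locking pattern, assuming a struct roughly like the one described; the field names and the generateSSES3Key helper are illustrative, not the exact SeaweedFS code:

```go
package s3api

import (
	"crypto/rand"
	"sync"
)

// SSES3Key stands in for the per-object data key type.
type SSES3Key struct {
	Key []byte
}

type SSES3KeyManager struct {
	mu   sync.RWMutex
	keys map[string]*SSES3Key
}

func NewSSES3KeyManager() *SSES3KeyManager {
	return &SSES3KeyManager{keys: make(map[string]*SSES3Key)}
}

// GetOrCreateKey serves the common case under the read lock and re-checks
// the map under the write lock so two goroutines cannot both insert a key.
func (m *SSES3KeyManager) GetOrCreateKey(keyID string) (*SSES3Key, error) {
	m.mu.RLock()
	key, ok := m.keys[keyID]
	m.mu.RUnlock()
	if ok {
		return key, nil
	}

	m.mu.Lock()
	defer m.mu.Unlock()
	// Double-check: another goroutine may have created the key while we
	// were switching from the read lock to the write lock.
	if key, ok := m.keys[keyID]; ok {
		return key, nil
	}
	key, err := generateSSES3Key()
	if err != nil {
		return nil, err
	}
	m.keys[keyID] = key
	return key, nil
}

// generateSSES3Key creates a fresh random 256-bit data key.
func generateSSES3Key() (*SSES3Key, error) {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		return nil, err
	}
	return &SSES3Key{Key: k}, nil
}
```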

All existing SSE tests pass successfully.

Fixes #7352

* Improve SSE-S3 key manager with envelope encryption

Replace in-memory key storage with envelope encryption using a super key (KEK).

Instead of storing DEKs in a map, the key manager now:
- Uses a randomly generated 256-bit super key (KEK)
- Encrypts each DEK with the super key using AES-GCM
- Stores the encrypted DEK in object metadata
- Decrypts the DEK on-demand when reading objects

Benefits:
- Eliminates unbounded memory growth from caching DEKs
- Provides better security with authenticated encryption (AES-GCM)
- Follows envelope encryption best practices (similar to AWS KMS)
- No need for mutex-protected map lookups on reads
- Each object's encrypted DEK is self-contained in its metadata

This approach matches the design pattern used in the local KMS provider
and is more suitable for production use.
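
A rough sketch of the DEK wrap/unwrap step using only the standard library; the function names and the decision to prepend the nonce are assumptions, not the exact SeaweedFS metadata layout:

```go
package sse

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptDEK wraps a per-object data key (DEK) with the super key (KEK)
// using AES-256-GCM; the nonce is prepended to the ciphertext so the
// result is self-contained and can be stored in object metadata.
func encryptDEK(kek, dek []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, dek, nil), nil
}

// decryptDEK reverses encryptDEK when an object is read back.
func decryptDEK(kek, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("sealed DEK too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}
```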

* Persist SSE-S3 KEK in filer for multi-server support

Store the SSE-S3 super key (KEK) in the filer at /.seaweedfs/s3/kek
instead of generating it per-server. This ensures:

1. **Multi-server consistency**: All S3 API servers use the same KEK
2. **Persistence across restarts**: KEK survives server restarts
3. **Centralized management**: KEK stored in filer, accessible to all servers
4. **Automatic initialization**: KEK is created on first startup if it doesn't exist

The KEK is:
- Stored as hex-encoded bytes in filer
- Protected with file mode 0600 (read/write for owner only)
- Located in /.seaweedfs/s3/ directory (mode 0700)
- Loaded on S3 API server startup
- Reused across all S3 API server instances

This matches the architecture of centralized configuration in SeaweedFS
and enables proper SSE-S3 support in multi-server deployments.
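
A rough sketch of the bootstrap step, with the filer write abstracted behind a function parameter since the actual filer client call is not shown here:

```go
package s3api

import (
	"crypto/rand"
	"encoding/hex"
)

// saveFunc abstracts the filer write; in SeaweedFS this would go through the
// filer client (hypothetical stand-in here).
type saveFunc func(path string, data []byte, mode uint32) error

// createAndPersistKEK generates the 256-bit super key, hex-encodes it, and
// hands it to save so every S3 API server (and every restart) reuses it.
func createAndPersistKEK(save saveFunc) ([]byte, error) {
	kek := make([]byte, 32) // 256-bit KEK
	if _, err := rand.Read(kek); err != nil {
		return nil, err
	}
	// Stored hex-encoded at /.seaweedfs/s3/kek with mode 0600; the parent
	// directory /.seaweedfs/s3 is created with mode 0700.
	if err := save("/.seaweedfs/s3/kek", []byte(hex.EncodeToString(kek)), 0600); err != nil {
		return nil, err
	}
	return kek, nil
}
```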

* Change KEK storage location to /etc/s3/kek

Move SSE-S3 KEK from /.seaweedfs/s3/kek to /etc/s3/kek for better
organization and consistency with other SeaweedFS configuration files.

The /etc directory is the standard location for configuration files
in SeaweedFS.

* Use global SSE-S3 key manager when copying

* Update volume_growth_reservation_test.go

* Rename KEK file to sse_kek for clarity

Changed /etc/s3/kek to /etc/s3/sse_kek to make it clear this key
is specifically for SSE-S3 encryption, not for other KMS purposes.

This improves clarity and avoids potential confusion with the
separate KMS provider system used for SSE-KMS.

* Use constants for SSE-S3 KEK directory and file name

Refactored to use named constants instead of string literals:
- SSES3KEKDirectory = "/etc/s3"
- SSES3KEKParentDir = "/etc"
- SSES3KEKDirName = "s3"
- SSES3KEKFileName = "sse_kek"

This improves maintainability and makes it easier to change
the storage location if needed in the future.
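
As Go constants, the values from this commit read roughly as:

```go
const (
	SSES3KEKDirectory = "/etc/s3"  // directory that holds the KEK file
	SSES3KEKParentDir = "/etc"     // parent under which the directory is created
	SSES3KEKDirName   = "s3"       // directory name under the parent
	SSES3KEKFileName  = "sse_kek"  // KEK file name
)
```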

* Address PR review: Improve error handling and robustness

Addresses review comments from https://github.com/seaweedfs/seaweedfs/pull/7358#pullrequestreview-3367476264

Critical fixes:
1. Distinguish between 'not found' and other errors when loading KEK
   - Only generate new KEK if ErrNotFound
   - Fail fast on connectivity/permission errors to prevent data loss
   - Prevents creating new KEK that would make existing data undecryptable

2. Make SSE-S3 initialization failure fatal
   - Return error instead of warning when initialization fails
   - Prevents server from running in broken state

3. Improve directory creation error handling
   - Only ignore 'file exists' errors
   - Fail on permission/connectivity errors

These changes ensure the SSE-S3 key manager is robust against
transient errors and prevents accidental data loss.
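
A sketch of the load-or-create decision; the load/create callbacks and the errNotFound sentinel are stand-ins (in SeaweedFS the sentinel would be the filer client's not-found error, e.g. filer_pb.ErrNotFound):

```go
package s3api

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// errNotFound stands in for the filer's "entry not found" sentinel.
var errNotFound = errors.New("filer: no entry is found in filer store")

// loadOrCreateKEK only generates a fresh KEK when the entry is genuinely
// missing; any other error (connectivity, permissions) aborts startup so a
// new KEK can never silently replace the one existing objects were sealed with.
func loadOrCreateKEK(load func(path string) ([]byte, error),
	create func(path string) ([]byte, error), path string) ([]byte, error) {

	data, err := load(path)
	switch {
	case err == nil:
		return hex.DecodeString(string(data))
	case errors.Is(err, errNotFound):
		// First startup: no KEK exists yet, so it is safe to create one.
		return create(path)
	default:
		// Fail fast instead of running in a broken state.
		return nil, fmt.Errorf("load SSE-S3 KEK from %s: %w", path, err)
	}
}
```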

* Fix KEK path conflict with /etc/s3 file

Changed KEK storage from /etc/s3/sse_kek to /etc/seaweedfs/s3_sse_kek
to avoid conflict with the circuit breaker config at /etc/s3.

The /etc/s3 path is used by CircuitBreakerConfigDir and may exist as
a file (circuit_breaker.json), causing the error:
  'CreateEntry /etc/s3/sse_kek: /etc/s3 should be a directory'

New KEK location: /etc/seaweedfs/s3_sse_kek
This uses the seaweedfs subdirectory which is more appropriate
for internal SeaweedFS configuration files.

Fixes startup failure when /etc/s3 exists as a file.

* Revert KEK path back to /etc/s3/sse_kek

Changed back from /etc/seaweedfs/s3_sse_kek to /etc/s3/sse_kek
as requested. The /etc/s3 directory will be created properly
when it doesn't exist.

* Fix directory creation with proper ModeDir flag

Set FileMode to uint32(0755 | os.ModeDir) when creating /etc/s3 directory
to ensure it's created as a directory, not a file.

Without the os.ModeDir flag, the entry was being created as a file,
which caused the error 'CreateEntry: /etc/s3 is a file' when trying
to create the KEK file inside it.

Uses 0755 permissions (rwxr-xr-x) for the directory and adds the os import
for the os.ModeDir constant.
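
A sketch of how the directory entry might be built; the protobuf field names follow the filer API, but treat the exact shape and the surrounding CreateEntry call as assumptions:

```go
package s3api

import (
	"os"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
)

// newS3ConfigDirEntry builds the /etc/s3 directory entry.
func newS3ConfigDirEntry() *filer_pb.Entry {
	now := time.Now().Unix()
	return &filer_pb.Entry{
		Name:        "s3",
		IsDirectory: true,
		Attributes: &filer_pb.FuseAttributes{
			// Without os.ModeDir the filer records a plain file, and later
			// CreateEntry calls under /etc/s3 fail with "/etc/s3 is a file".
			FileMode: uint32(0755 | os.ModeDir),
			Crtime:   now,
			Mtime:    now,
		},
	}
}
```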

Author: Chris Lu
Date: 2025-10-22 14:12:31 -07:00 (committed by GitHub)
Parent: aed91baa2e
Commit: f7bd75ef3b
4 changed files with 249 additions and 42 deletions

volume_growth_reservation_test.go

```diff
@@ -81,11 +81,14 @@ func TestVolumeGrowth_ReservationBasedAllocation(t *testing.T) {
 	}
 	// Simulate successful volume creation
+	// Must acquire lock before accessing children map to prevent race condition
+	dn.Lock()
 	disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
 	deltaDiskUsage := &DiskUsageCounts{
 		volumeCount: 1,
 	}
 	disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
+	dn.Unlock()
 	// Release reservation after successful creation
 	reservation.releaseAllReservations()
@@ -153,11 +156,14 @@ func TestVolumeGrowth_ConcurrentAllocationPreventsRaceCondition(t *testing.T) {
 	// Simulate completion: increment volume count BEFORE releasing reservation
 	if reservation != nil {
 		// First, increment the volume count to reflect the created volume
+		// Must acquire lock before accessing children map to prevent race condition
+		dn.Lock()
 		disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
 		deltaDiskUsage := &DiskUsageCounts{
 			volumeCount: 1,
 		}
 		disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
+		dn.Unlock()
 		// Then release the reservation
 		reservation.releaseAllReservations()
```