PyArrow Parquet S3 Compatibility Tests

This directory contains tests for PyArrow Parquet compatibility with SeaweedFS S3 API, including the implicit directory detection fix.

Overview

Status: All PyArrow methods work correctly with SeaweedFS

SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using write_dataset(), it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
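
The behavior is easy to verify directly with boto3. This is a minimal sketch, assuming a local SeaweedFS S3 endpoint on localhost:8333, placeholder credentials, and a dataset that already has children (e.g. bucket/dataset/part-0.parquet):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# 'dataset' has children, so a HEAD request without a trailing slash
# returns 404 instead of a confusing 0-byte 200 response.
try:
    s3.head_object(Bucket='bucket', Key='dataset')
except ClientError as e:
    print(e.response['Error']['Code'])  # '404'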

Quick Start

Running the Example Script

# Start SeaweedFS server
make start-seaweedfs-ci

# Run the example script
python3 example_pyarrow_native.py

# Or with uv (if available)
uv run example_pyarrow_native.py

# Stop the server when done
make stop-seaweedfs-safe

Running Tests

# Setup Python environment
make setup-python

# Run all tests with server (small and large files)
make test-with-server

# Run quick tests with small files only (faster for development)
make test-quick

# Run implicit directory fix tests
make test-implicit-dir-with-server

# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server

# Run SSE-S3 encryption tests
make test-sse-s3-compat

# Clean up
make clean

Using PyArrow with SeaweedFS

Option 1: Using s3fs (s3fs + PyArrow)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs

# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)

# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs)  # ✅
table = pq.read_table('bucket/dataset', filesystem=fs)   # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs)  # ✅

Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)

# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)

# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3)  # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3)  # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3)  # ✅

Test Files

Main Test Suite

  • s3_parquet_test.py - Comprehensive PyArrow test suite
    • Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
    • Uses s3fs library for S3 operations
    • All tests pass with the implicit directory fix

PyArrow Native S3 Tests

  • test_pyarrow_native_s3.py - PyArrow's native S3 filesystem tests

    • Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
    • Pure PyArrow solution without s3fs dependency
    • Tests 3 read methods × 2 dataset sizes = 6 scenarios
    • All tests pass
  • test_sse_s3_compatibility.py - SSE-S3 encryption compatibility tests

    • Tests PyArrow native S3 with SSE-S3 server-side encryption
    • Tests 5 different file sizes (10 to 500,000 rows)
    • Verifies multipart upload encryption works correctly (see the boto3 sketch after this list)
    • All tests pass
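
For reference, SSE-S3 can also be exercised outside the test suite with boto3. This is a hedged sketch, assuming the server is running with SSE-S3 enabled (e.g. via the ENABLE_SSE_S3 environment variable these tests use) and the same placeholder credentials as above:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# Request SSE-S3 (AES256) on upload, then confirm it on the stored object.
s3.put_object(
    Bucket='bucket',
    Key='encrypted.parquet',
    Body=b'example payload',
    ServerSideEncryption='AES256',
)
head = s3.head_object(Bucket='bucket', Key='encrypted.parquet')
print(head.get('ServerSideEncryption'))  # 'AES256' when SSE-S3 is active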

Implicit Directory Tests

  • test_implicit_directory_fix.py - Specific tests for the implicit directory fix
    • Tests HEAD request behavior
    • Tests s3fs directory detection
    • Tests PyArrow dataset reading
    • All 6 tests pass

Examples

  • example_pyarrow_native.py - Simple standalone example
    • Demonstrates PyArrow's native S3 filesystem usage
    • Can be run with uv run or regular Python
    • Minimal dependencies (pyarrow, boto3)

Configuration

  • Makefile - Build and test automation
  • requirements.txt - Python dependencies (pyarrow, s3fs, boto3)
  • .gitignore - Ignore patterns for test artifacts

Documentation

Technical Documentation

  • TEST_COVERAGE.md - Comprehensive test coverage documentation

    • Unit tests (Go): 17 test cases
    • Integration tests (Python): 6 test cases
    • End-to-end tests (Python): 20 test cases
  • FINAL_ROOT_CAUSE_ANALYSIS.md - Deep technical analysis

    • Root cause of the s3fs compatibility issue
    • How the implicit directory fix works
    • Performance considerations
  • MINIO_DIRECTORY_HANDLING.md - Comparison with MinIO

    • How MinIO handles directory markers
    • Differences in implementation approaches

The Implicit Directory Fix

Problem

When PyArrow writes datasets with write_dataset(), it may create 0-byte directory markers. s3fs's info() method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".

Solution

SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
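
The effect is visible from s3fs directly. A minimal check, assuming the same endpoint and credentials as the examples above:

import s3fs

fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# HEAD on 'bucket/dataset' now returns 404, so s3fs falls back to a LIST,
# finds the children, and correctly reports a directory rather than a
# 0-byte file.
info = fs.info('bucket/dataset')
print(info['type'])  # 'directory'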

Implementation

The fix is implemented in weed/s3api/s3api_object_handlers.go:

  • HeadObjectHandler - Returns 404 for implicit directories
  • hasChildren - Helper function to check if a path has children

See the source code for detailed inline documentation.
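
For illustration only, the decision can be modeled in Python. This is a sketch of the behavior described above and in the edge-case list below, not a transcription of the Go code:

import boto3

def has_children(s3, bucket, key):
    # Analog of the hasChildren helper: a single LIST capped at one key,
    # which is the ~1-5ms cost noted under Performance below.
    prefix = key.rstrip('/') + '/'
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get('KeyCount', 0) > 0

def head_returns_404(s3, bucket, key, size, is_directory, versioned):
    # Model of the implicit-directory check in HeadObjectHandler.
    if versioned:
        return False          # versioned buckets keep normal semantics
    if key.endswith('/'):
        return False          # explicit directory request
    if not (is_directory or size == 0):
        return False          # regular file: normal 200 behavior
    # A 0-byte object or directory is only "implicit" if it has children;
    # otherwise it is a legitimate empty file or empty directory.
    return has_children(s3, bucket, key)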

Test Coverage

  • Unit tests (Go): weed/s3api/s3api_implicit_directory_test.go

    • Run: cd weed/s3api && go test -v -run TestImplicitDirectory
  • Integration tests (Python): test_implicit_directory_fix.py

    • Run: cd test/s3/parquet && make test-implicit-dir-with-server
  • End-to-end tests (Python): s3_parquet_test.py

    • Run: cd test/s3/parquet && make test-with-server

Makefile Targets

# Setup
make setup-python          # Create Python virtual environment and install dependencies
make build-weed           # Build SeaweedFS binary

# Testing
make test                 # Run full tests (assumes server is already running)
make test-with-server     # Run full PyArrow test suite with server (small + large files)
make test-quick           # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server  # Run implicit directory tests with server
make test-native-s3       # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server  # Run PyArrow native S3 tests with server management
make test-sse-s3-compat   # Run comprehensive SSE-S3 encryption compatibility tests

# Server Management
make start-seaweedfs-ci   # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe  # Stop SeaweedFS gracefully
make clean                # Clean up all test artifacts

# Development
make help                 # Show all available targets

Continuous Integration

The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:

Workflow: .github/workflows/s3-parquet-tests.yml

Test Matrix:

  • Python versions: 3.9, 3.11, 3.12
  • PyArrow integration tests (s3fs): 20 test combinations
  • PyArrow native S3 tests: 6 test scenarios NEW
  • SSE-S3 encryption tests: 5 file sizes NEW
  • Implicit directory fix tests: 6 test scenarios
  • Go unit tests: 17 test cases

Test Steps (run for each Python version):

  1. Build SeaweedFS
  2. Run PyArrow Parquet integration tests (make test-with-server)
  3. Run implicit directory fix tests (make test-implicit-dir-with-server)
  4. Run PyArrow native S3 filesystem tests (make test-native-s3-with-server) NEW
  5. Run SSE-S3 encryption compatibility tests (make test-sse-s3-compat) NEW
  6. Run Go unit tests for implicit directory handling

Triggers:

  • Push/PR to master (when weed/s3api/** or weed/filer/** changes)
  • Manual trigger via GitHub UI (workflow_dispatch)

Requirements

  • Python 3.8+
  • PyArrow 22.0.0+
  • s3fs 2024.12.0+
  • boto3 1.40.0+
  • SeaweedFS (latest)
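
An illustrative requirements.txt pinned to these minimums (the repository's actual file may pin differently):

pyarrow>=22.0.0
s3fs>=2024.12.0
boto3>=1.40.0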

AWS S3 Compatibility

The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:

  • AWS S3 typically doesn't create directory markers for implicit directories
  • HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
  • SeaweedFS now matches this behavior for implicit directories with children

Edge Cases Handled

  • Implicit directories with children → 404 (forces LIST-based discovery)
  • Empty files (0-byte, no children) → 200 (legitimate empty file)
  • Empty directories (no children) → 200 (legitimate empty directory)
  • Explicit directory requests (trailing slash) → 200 (normal directory behavior)
  • Versioned buckets → implicit directory check skipped (versioned semantics)
  • Regular files → 200 (normal file behavior)

Performance

The implicit directory check adds minimal overhead:

  • Only triggered for 0-byte objects or directories requested without a trailing slash
  • Cost: One LIST operation with Limit=1 (~1-5ms)
  • No impact on regular file operations

Contributing

When adding new tests:

  1. Add test cases to the appropriate test file
  2. Update TEST_COVERAGE.md
  3. Run the full test suite to ensure no regressions
  4. Update this README if adding new functionality

Last Updated: November 19, 2025
Status: All tests passing