PyArrow Parquet S3 Compatibility Tests
This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.
Overview
Status: ✅ All PyArrow methods work correctly with SeaweedFS
SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using write_dataset(), it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
Quick Start
Running the Example Script
# Start SeaweedFS server
make start-seaweedfs-ci
# Run the example script
python3 example_pyarrow_native.py
# Or with uv (if available)
uv run example_pyarrow_native.py
# Stop the server when done
make stop-seaweedfs-safe
Running Tests
# Setup Python environment
make setup-python
# Run all tests with server (small and large files)
make test-with-server
# Run quick tests with small files only (faster for development)
make test-quick
# Run implicit directory fix tests
make test-implicit-dir-with-server
# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server
# Run SSE-S3 encryption tests
make test-sse-s3-compat
# Clean up
make clean
Using PyArrow with SeaweedFS
Option 1: Using s3fs (recommended for compatibility)
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs
# Configure s3fs
fs = s3fs.S3FileSystem(
key='your_access_key',
secret='your_secret_key',
endpoint_url='http://localhost:8333',
use_ssl=False
)
# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)
# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs) # ✅
table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs
# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
access_key='your_access_key',
secret_key='your_secret_key',
endpoint_override='localhost:8333',
scheme='http',
allow_bucket_creation=True,
allow_bucket_deletion=True
)
# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
Test Files
Main Test Suite
- s3_parquet_test.py - Comprehensive PyArrow test suite
  - Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
  - Uses the s3fs library for S3 operations
  - All tests pass with the implicit directory fix ✅
PyArrow Native S3 Tests
- test_pyarrow_native_s3.py - PyArrow native S3 filesystem tests
  - Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
  - Pure PyArrow solution without the s3fs dependency
  - Tests 3 read methods × 2 dataset sizes = 6 scenarios
  - All tests pass ✅
- test_sse_s3_compatibility.py - SSE-S3 encryption compatibility tests
  - Tests PyArrow native S3 with SSE-S3 server-side encryption
  - Tests 5 different file sizes (10 to 500,000 rows)
  - Verifies multipart upload encryption works correctly
  - All tests pass ✅
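Outside the test suite, SSE-S3 behavior can also be spot-checked from plain boto3. The sketch below is a minimal illustration, assuming a local SeaweedFS S3 gateway on localhost:8333 that accepts and reports the standard ServerSideEncryption field the way AWS does; the bucket name, key, and credentials are placeholders.
import boto3
# Placeholder endpoint and credentials for a local SeaweedFS S3 gateway.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
)
bucket = "sse-check"  # hypothetical bucket name
s3.create_bucket(Bucket=bucket)  # assumes the bucket does not already exist
# Upload a small object, explicitly requesting SSE-S3 (AES256).
s3.put_object(
    Bucket=bucket,
    Key="hello.txt",
    Body=b"hello world",
    ServerSideEncryption="AES256",
)
# If the server applies and reports SSE-S3 as AWS does, the object metadata
# includes ServerSideEncryption == "AES256".
head = s3.head_object(Bucket=bucket, Key="hello.txt")
print("ServerSideEncryption:", head.get("ServerSideEncryption"))
# Round-trip check: the body must decrypt transparently on read.
body = s3.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
assert body == b"hello world"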
Implicit Directory Tests
- test_implicit_directory_fix.py - Specific tests for the implicit directory fix
  - Tests HEAD request behavior
  - Tests s3fs directory detection
  - Tests PyArrow dataset reading
  - All 6 tests pass ✅
Examples
- example_pyarrow_native.py - Simple standalone example
  - Demonstrates PyArrow's native S3 filesystem usage
  - Can be run with uv run or regular Python
  - Minimal dependencies (pyarrow, boto3)
Configuration
- Makefile - Build and test automation
- requirements.txt - Python dependencies (pyarrow, s3fs, boto3)
- .gitignore - Ignore patterns for test artifacts
Documentation
Technical Documentation
- TEST_COVERAGE.md - Comprehensive test coverage documentation
  - Unit tests (Go): 17 test cases
  - Integration tests (Python): 6 test cases
  - End-to-end tests (Python): 20 test cases
- FINAL_ROOT_CAUSE_ANALYSIS.md - Deep technical analysis
  - Root cause of the s3fs compatibility issue
  - How the implicit directory fix works
  - Performance considerations
- MINIO_DIRECTORY_HANDLING.md - Comparison with MinIO
  - How MinIO handles directory markers
  - Differences in implementation approaches
The Implicit Directory Fix
Problem
When PyArrow writes datasets with write_dataset(), it may create 0-byte directory markers. s3fs's info() method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".
Solution
SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
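As an illustration of the effect (this is not one of the test files), the minimal s3fs sketch below shows what the fix changes from the client's point of view; the bucket name and credentials are placeholders for a local SeaweedFS S3 gateway.
import pyarrow as pa
import pyarrow.dataset as pads
import s3fs
# Placeholder credentials for a local SeaweedFS S3 gateway.
fs = s3fs.S3FileSystem(
    key="your_access_key",
    secret="your_secret_key",
    endpoint_url="http://localhost:8333",
    use_ssl=False,
)
# Write a tiny dataset; write_dataset() may leave a 0-byte directory marker.
table = pa.table({"id": [1, 2, 3]})
pads.write_dataset(table, "bucket/dataset", filesystem=fs)
# Before the fix: HEAD on "bucket/dataset" returned 200 with size=0, so s3fs
# reported a 0-byte file and PyArrow failed with "Parquet file size is 0 bytes".
# With the fix: HEAD returns 404, so s3fs falls back to LIST and sees children.
fs.invalidate_cache()  # drop any cached metadata first
info = fs.info("bucket/dataset")
print(info["type"])  # expected: "directory"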
Implementation
The fix is implemented in weed/s3api/s3api_object_handlers.go:
- HeadObjectHandler - Returns 404 for implicit directories
- hasChildren - Helper function to check if a path has children
See the source code for detailed inline documentation.
Test Coverage
- Unit tests (Go): weed/s3api/s3api_implicit_directory_test.go
  - Run: cd weed/s3api && go test -v -run TestImplicitDirectory
- Integration tests (Python): test_implicit_directory_fix.py
  - Run: cd test/s3/parquet && make test-implicit-dir-with-server
- End-to-end tests (Python): s3_parquet_test.py
  - Run: cd test/s3/parquet && make test-with-server
Makefile Targets
# Setup
make setup-python # Create Python virtual environment and install dependencies
make build-weed # Build SeaweedFS binary
# Testing
make test # Run full tests (assumes server is already running)
make test-with-server # Run full PyArrow test suite with server (small + large files)
make test-quick # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server # Run implicit directory tests with server
make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server # Run PyArrow native S3 tests with server management
make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
# Server Management
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe # Stop SeaweedFS gracefully
make clean # Clean up all test artifacts
# Development
make help # Show all available targets
Continuous Integration
The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:
Workflow: .github/workflows/s3-parquet-tests.yml
Test Matrix:
- Python versions: 3.9, 3.11, 3.12
- PyArrow integration tests (s3fs): 20 test combinations
- PyArrow native S3 tests: 6 test scenarios ✅ NEW
- SSE-S3 encryption tests: 5 file sizes ✅ NEW
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases
Test Steps (run for each Python version):
- Build SeaweedFS
- Run PyArrow Parquet integration tests (make test-with-server)
- Run implicit directory fix tests (make test-implicit-dir-with-server)
- Run PyArrow native S3 filesystem tests (make test-native-s3-with-server) ✅ NEW
- Run SSE-S3 encryption compatibility tests (make test-sse-s3-compat) ✅ NEW
- Run Go unit tests for implicit directory handling
Triggers:
- Push/PR to master (when weed/s3api/** or weed/filer/** changes)
- Manual trigger via GitHub UI (workflow_dispatch)
Requirements
- Python 3.8+
- PyArrow 22.0.0+
- s3fs 2024.12.0+
- boto3 1.40.0+
- SeaweedFS (latest)
AWS S3 Compatibility
The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:
- AWS S3 typically doesn't create directory markers for implicit directories
- HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
- SeaweedFS now matches this behavior for implicit directories with children
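The same parity can be spot-checked from Python; the boto3 sketch below is illustrative only, with placeholder bucket, key, and credentials against a local gateway.
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",  # placeholder local endpoint
    aws_access_key_id="your_access_key",
    aws_secret_access_key="your_secret_key",
)
bucket = "bucket"  # hypothetical bucket name; assumed to exist
s3.put_object(Bucket=bucket, Key="dataset/file.txt", Body=b"data")
# HEAD on the implicit directory "dataset" (no trailing slash) -> 404,
# matching AWS S3, where no object exists at that exact key.
try:
    s3.head_object(Bucket=bucket, Key="dataset")
except ClientError as e:
    print("HEAD dataset:", e.response["ResponseMetadata"]["HTTPStatusCode"])  # 404
# LIST-based discovery still finds the children under the prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="dataset/", MaxKeys=1)
print("has children:", resp.get("KeyCount", 0) > 0)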
Edge Cases Handled
✅ Implicit directories with children → 404 (forces LIST-based discovery)
✅ Empty files (0-byte, no children) → 200 (legitimate empty file)
✅ Empty directories (no children) → 200 (legitimate empty directory)
✅ Explicit directory requests (trailing slash) → 200 (normal directory behavior)
✅ Versioned buckets → Skip implicit directory check (versioned semantics)
✅ Regular files → 200 (normal file behavior)
Performance
The implicit directory check adds minimal overhead:
- Only triggered for 0-byte objects or directories requested without a trailing slash
- Cost: One LIST operation with Limit=1 (~1-5ms)
- No impact on regular file operations
Contributing
When adding new tests:
- Add test cases to the appropriate test file
- Update TEST_COVERAGE.md
- Run the full test suite to ensure no regressions
- Update this README if adding new functionality
Last Updated: November 19, 2025
Status: All tests passing ✅